Hi, welcome back. As I mentioned in the last lecture, if you're a designer you need to know databases. If you're not already thoroughly acquainted with databases, the overview I'm giving in these talks will hopefully encourage you to look more into database design. Last time we talked about data tables, indexes, primary keys and junction tables. You can see from the student and course example that there's some sort of underlying principle we're following in our design. We've only got student information recorded in one place, the student table. We've got course information recorded in only one place, the course table. The relationship between the students and courses are stored in a junction table, which carries only the Student ID and the Course ID. The fact that student and course data are only in one place is not to save memory, we've got plenty of memory. It's to adhere to what is called third normal form. What this says is that the data's only started in one place, and if a reference is needed to that data it happens by way of an idea of some sort. A primary key or what's called a foreign key reference. The reasons this is done is so that if your address changes, for example, we changed it in one place, and all other records in the database that refer to your address will automatically get the change. This is equivalent to the software concept of modularity and it has the same advantages and disadvantages. In the example, a list of cash register transactions from the dining hall record one student number, and with that, the student's address and phone number can be directly obtained. Similarly, if the card swipe system, which unlocks doors around campus, records its activity, all that needs to be recorded about the student is the student ID. Which can be obtained from the swiped card. And all the other information about the student can be directly obtained through reference. Although third normal form is ruthlessly efficient, it is sometimes cumbersome. On a software project that involves database design there should be documented policies for when third normal form is to be avoided, and instances or locations where it was avoided should be documented. Here's an example of a situation where third normal form doesn't help us much. Consider a purchasing system like Amazon. You have a ship to address, and occasionally if you move, you update that address, so that what you order arrives at where you currently live. Let's say I updated my ship to address last January. If I noticed that something I ordered a while ago was never delivered and I asked what address did you ship that to? Third normal form would say that they tell me what my current address is. Clearly, what we need to do is save the old address. Now, this would violate the third normal rule of data being only in one place. This is called a denormalized data or sometimes, second and a half normal form. In this scheme, we might store the entire ship-to label and relevant customer information, for example, a phone number, in a data file. If the customer has had a lot of shipments, there can be quite a bit of duplication of information. But two factors work in our favor. First, storage is pretty cheap. You can store a lot of denormalized ship-to-labels on a ten terabyte magnetic disk. Second, historical data isn't accessed as often as current data, so speed of access isn't as important. Speaking of speed, sometimes speed is a reason to denormalize or copy data. If we have customer data stored in multiple third normal files, we have to put all that data back together again to make it useful. This is done in structured query language with join statements. Join statements except relationship criteria which specify how rows in one table are connected with rows in another table. Join statements, although extensive optimization can be made, can be time consuming. Having to join a half a dozen tables to produce a result and having to do this rapidly, over and over, for different data can produce respond time slow enough that for performance reasons, strict normalization is bypassed for something more speed oriented. In the example an order table contains an order ID, a customer ID, a date and the price. Now, a parts table might also contain the part's price, but it's reproduced in this table to save a join with a very large parts table. And it's also possible that the price of the part might change. What would be recorded in the order table was the price when the order was actually made. Extended Price is simply Unit Price times Quantity. This could be calculated, but if it's calculated many, many times that might take longer than just calculating it once and storing that result. Thus, there're arguments for second and a half normal form, and again, if tables are denormalized the reasons should be recorded. There are, in fact, five normal forms, of these third, or almost third is the most popular. For purely academic reasons here's a run down of normal forms. In first normal form, all rows in a data table must have the same number of columns, although some of the values can be null. No ragged arrays here. In second normal form, one column is the key and all other columns must provide information about only that key. In other words, columns are attributes of that primary key. In third normal form, there can be no correlation between non-key fields in a row. For example, City-State-Zip violates third normal form because a zip code identifies a city-state combination. And, of course, third normal form is the most common. In fourth normal form, a row in a table contains only one attribute of the key field. This is rarely used, so basically you have a key and an attribute, and in another table, that same key and a different attribute. Finally, in fifth normal form, we use fourth normal form tables to contain relations among attributes of a key. This could be a pretty complex subject and you can do some further reading on the web if you wish to about this, but you're almost guaranteed to not run into this. From a security standpoint, databases are extremely important because databases are where the crown jewels, the most valuable content of an organization are stored. There're some options on how to configure a database for security. The most comprehensive is to encrypt the entire database. This causes performance problems because to be used, all the data has to be decrypted. Another solution is to encrypt only sensitive non-keyed data in a table. We talked earlier about how tables are joined together to create complex but useful data results. Encrypting the keys or the data by which the tables are joined is a serious performance problem. However, if we leave the keys in plain text and only encrypt the sensitive data like names, addresses and phone numbers. When the keys, which are used to reassemble normal data, are encrypted, they also need to contain no identifiable information. The above method could defend against a database that made it into the hands of people who are not supposed to have it. But this is only one way of defending against a threat, and it's only one threat. It's mentioned because of our discussion of keys and joins. In the next lesson, we'll address database security in a broader sense, thanks.