Hi. My name is Sam Meiselman, and I'd like to welcome you to this section of the course. In this section, we'll talk about databases and the data types that contribute to your understanding of health science informatics and the uses of data. My personal background is that of a data engineer and database administrator here at Johns Hopkins, and I've also been teaching database techniques to medical students at the School of Medicine for the past eight years. I hope you enjoy this section of the course.

In this introduction to databases, we'll explore a context that you might not have thought of. Many of us have encountered a database in one way or another. A database is a regular collection of data. Data often doesn't abide by the rules we'd like to set out for it; names, for example, change when people get married, and we'll explore some of these cases in a minute. But the data has to be collected in a regular, consistent way, because if the data is not consistent, the conclusions you draw from it are fraught with error.

A database is a computerized record-keeping system, as opposed to a manual one. Manual systems, some of which got extremely complex, are sometimes even more efficient at the intake of information. However, they fail miserably when it comes to any kind of analytics or drawing conclusions.

A database is a collection of stored operational data and relationships used by the application systems of an enterprise. What does that mean? All the data that is collected has to be in one place, a platform, if you will, where people can get at it. The application systems of an enterprise, of a hospital, of an insurance company, of all these organizations that might use clinical data, need to be able to access that central store of data even if they were not the systems that collected it. The relationships are stored in a consistent manner too. In our case, for example, with patients and visits, a patient identifier must be consistent across all the visits in order to properly link those visits to that patient; I'll show a small sketch of this at the end of this segment. That collection of data and relationships must be consistent and stored in a central location.

Next, the collection of data we're talking about should be persistent: it should be there day to day, not a hit-or-miss chance of locating and interpreting your data. It should be logically coherent, in that there shouldn't be hidden or implicit rules you can only learn through tradition; rather, it should be clear that a patient is a patient, a visit is a visit, a lab result is a lab result, and so on. It should be inherently meaningful as well: it should not be difficult to interpret what the entities in the database are. I can think of one example, a database tracking student activity, where the entity wasn't a student but a "student instance" of that particular student, which is a very abstract concept, and it's very hard to work with those kinds of entities. Ultimately, in database design, which is beyond the scope of this discussion, inherently meaningful data and relationships are the goal. Last, a database should be relevant to some aspect of the real world. Tracking and storing trivial information, or information that may or may not be relevant, has decreasing value.
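To make that patient–visit linkage concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is illustrative only: the table and column names (patients, visits, patient_id) are assumptions made for the example, not the schema of any particular clinical system.

```python
import sqlite3

# An in-memory database for illustration; a real clinical system would use
# a persistent, centrally managed database server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the relationship

# One row per patient, identified by a consistent patient_id.
conn.execute("""
    CREATE TABLE patients (
        patient_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    )
""")

# Each visit carries the patient_id of the patient it belongs to. The
# foreign-key reference keeps that identifier consistent: a visit cannot
# point at a patient who does not exist.
conn.execute("""
    CREATE TABLE visits (
        visit_id   INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patients(patient_id),
        visit_date TEXT NOT NULL
    )
""")

conn.execute("INSERT INTO patients VALUES (1, 'Jane Doe')")
conn.execute("INSERT INTO visits VALUES (100, 1, '2024-01-15')")
conn.execute("INSERT INTO visits VALUES (101, 1, '2024-03-02')")

# Because the identifier is consistent, any application can link a patient
# to all of her visits, even if that application didn't collect the data.
for name, visit_date in conn.execute("""
        SELECT p.name, v.visit_date
        FROM patients p
        JOIN visits v ON v.patient_id = p.patient_id
        ORDER BY v.visit_date
"""):
    print(name, visit_date)
```

The point of the sketch is the constraint, not the syntax: with foreign keys enabled, an attempt to insert a visit whose patient_id matches no patient is rejected, which is exactly the kind of consistency of data and relationships the definition above demands.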
Returning to relevance: of course, as storage and databases get bigger and bigger, much more trivial information is tracked, because someday it might be useful. But what relevance really means is that the people consuming the data should understand how to draw conclusions from it.

Why databases? What was the historical need, the business and enterprise need, that led to their creation? Well, as we mentioned in the informatics discussion, the regular collection of data in computerized record keeping evolved because paper just can't do it; paper systems ultimately fail. In this picture from the 1940 US census, you see a number of clerks hard at work filling out file cards. These file cards were tabulated by various machines that could do some very simple counting and aggregation, but essentially that was it. You had to hire armies of people to tabulate the data, and it was actually quite a difficult task. The demand to improve this analytic capability was so great that people paid astronomical sums, in today's dollars, for even simple computations, driving the evolution of the first computing machines.

At the beginning, these computing machines, evolving really with IBM's (International Business Machines) counting machines through the 1920s, '30s, and into the '40s, were thought of more as computing devices than storage devices; the technology had not yet evolved to the point where data could be stored on the device. Punch cards evolved instead, where keypunch machines would encode various types of data on a card like this one. When I was a kid in the 1970s, I got a tour of some mainframe computers that were still using large numbers of these cards, and I remember stacks and stacks and stacks of them going into card readers. The reason was that these computing devices had very little storage: the entire computer program, very basic instructions, had to be loaded first from the cards, and then the dataset would be loaded next. In today's terminology, think of a laptop where the entire operating system, the entire software package, and all of the data to be computed had to be installed, and then the whole thing was wiped out, every single time you used the computer. This is highly inefficient by today's standards, but it was the best the technology could do at the time.

From the 1940s through the mid '60s, data was really subordinate to the computation, to the application. For example, there was an application during World War II that calculated complex artillery trajectories for the war effort; it had nothing to do with another program that might calculate economic data. They shared no data and no context, and the entire software had to be loaded every single time. There was no way to relate data from one application to another; everything was completely siloed. In the '50s and early '60s, the need quickly emerged to separate storage from the user application, the actual computation. We take this for granted now, but allowing common storage for multiple applications was a tremendous leap forward in technology.
By the mid '60s and '70s, the technology had evolved, as illustrated by the IBM 360 mainframe, to separate data storage from the application, though in a very limited way. The data storage on that massive machine was a whopping seven megabytes, which we in our comfortable modern world giggle at, because seven megabytes isn't even a song download today. But at that time, it was a tremendous advance.