The purpose of this lecture is to give a brief introduction to CSV or TSV and JSON file formats, which are two of the most common file formats that we're going to be seeing throughout this class and are ubiquitous for reading data from the web. So the two concepts we'll cover today are introduction to the file formats themselves and also a comparison between the advantages and disadvantages of each. So like I said, these are two of the most common format that we would use when ingesting data from the web and they're very easy to read and manipulate in Python. So for most of this course, we're going to be looking through two very popular and large data sets we've collected from the web. They are both publicly available. One is a data set of Amazon reviews, which is in CSV or more precisely in TSV tab-separated variable format, which you can download from this URL. The other is a data set from Yelp which is in JSON format and both of these are publicly available. So first, let's start looking at the Amazon dataset, which is in tab-separated variable format. So go ahead and collects the dataset from this URL on your own. We'll look at the "Gift Card" category from this dataset which is a subcategory of products for Amazon. Well choose this category for the moment simply because it's a reasonably small category to work with, and we can easily read and manipulate the files even on a laptop. So just to briefly introduce the concept of CSV files, it's a simple format that allows us to store data but as essentially tabular. So in this picture we have a little look at some of the first few lines of that file from Amazon. We can see here that it's what's called a human-readable format. In other words, we would easily be able to load it and even modify it using a text editor or a spreadsheet application, we could take this file and go ahead and open it in Microsoft Excel or something like that. Okay. So looking at this TSV file from Amazon in a little bit more detail, here are the first few lines and I'll run through a little bit about what each of these columns and rows actually means. So the first thing we see is there's a separator between each column. In this case, the separator is a tab. Each column is separated by a tab, whereas in a CSV files or comma-separated variable file, each field would be separated by a comma. Secondly, looking at the first row, there's a special row which is called the header. Essentially what it does, is indicate what each field in this file represents. So in this case we see that, the first column is going to correspond to the market price, the second will correspond to the customer ID, the third the review ID, etc. Then each other all over the file, in this case, corresponds to a single product review from Amazon. So it contains all of the fields necessary to specify a single review. Okay. So just to make it a little bit easy to read, all I've done here is try and lay out this file a little bit more cleanly by adding extra white-space to align the columns vertically. So this makes it very clear that this is really tabular data. It looks much like the data we would find in an Excel spreadsheet would look. So looking a little bit more into the file, there are many, many columns in this file because we have a lot of different types of data that are used to specify a review on Amazon. You can see all kinds of different fields, shop later. So next we have the product category, the star rating, the number of helpful votes they received. Out of the total number of votes they received was in a verified purchase, and then following that the actual content of the review itself. So that's essentially what CSV or TSV file formats look like. They're fairly simple and effective format but somehow they're limited to storing data the way we would call tabular. So JSON is another file format that's going to allow us to store different types of data. For example, imagine trying using a CSV file to store data like the following; The opening hours of a business which we might observe Yelp data, I don't know how you'd represent that. It could be for instance, a list of hourly ranges on different days, or the set of categories of a product. For example, what is the type of a business on Yelp? Is that a restaurant? Is a casual? Is it fine dining? There might be a set of arbitrary length that is used to specify what categories of product belongs to, so it's hard to come up with a fixed list of columns that would describe that category or list of playlist entries in this million playlist dataset which is quite different from the types of datasets we're looking at today. This is a dataset of playlist people have created on Spotify. Each of those could have variable length, would be again difficult to store in a format, where we had a fixed number of columns. So that's where JSON comes in. If we have data that we need to represent using structures like lists and sets, that's going to be inconvenient to do using the CSV/ TSV format. JSON is going to try and address this by allowing for more general forms of structured data. So let's look at an example from this Yelp dataset, which again you can go to this website and download once you agree to the terms of use for this dataset. Again, we'll look at the first line. It looks something like this. Again, it's a mess but I'll break down each of its individual parts. This is just a large somewhat human readable string. If I draw it a little differently against by adding white space, we can see its structure and a little bit more detail. So essentially this entry is made up of key value pairs. W e see the business ID followed by an ID, we see attribute followed by a list of attributes, we see longitude and latitude followed by numbers, state followed by AZ for Arizona, the address followed by a string, etc. So adding this additional white space doesn't have any effect on the meaning of the file nor it's changing the ordering of the entries, unlike in a TSV or CSV the file, there is no header in a JSON file. Rather, each entry in this file specifies what the different keys are like the business, the attributes, the longitude and the latitude. So each value in this file can take a variety of formats. For instance, we see string valued entries like the address, we see Boolean valued entries or true and false entries, like whether they accept credit cards, we see numbers like the longitude and latitude. We also see list such as the set of categories this business belongs to, it's a dentist, it's a general dentistry, health medical, etc. Finally, we actually see entries that are themselves, JSON objects. So the attribute list here is itself another JSON object which contains a set of key value pairs. All right. So briefly to summarize what are the advantages of CSV/TSV files versus JSON files, advantage of CSV and TSV is that it's simple human-readable format. We can easily open it up in a text editor or a spreadsheet application. But the disadvantage of this format is that it's somewhat limited in the types of data it can represent. It can really only represent tabular data, whereas JSON is going to allow for manipulation of complex and semi-structured data, but has the disadvantage that it's somewhat more difficult to explore and manipulate using GUIs, but in later lectures, we'll see how we can manipulate this data in Python. So briefly to summarize what we covered in this lecture, you should now understand the basic format of JSON and CSV files as well as the merits of both formats. So now you should be able to go ahead and download those two datasets from Amazon and Yelp and have a try on your own of opening at least some of the smaller files in a text editor and exploring their structure.