So now, in this last chapter, we're going to talk a little bit about visualizing data. But what we're really doing is summing everything up, because we're going to retrieve data from the network, process it, store it in a database, and then write it out and visualize it. So it's all coming together, and it turns out that this notion of gathering data from the network is a pretty common thing. It might take a cleaning or processing step. Part of the problem is that when you're pulling data off the net, you want to be able to restart this process, because it'll run and run, and then your computer will crash or go to sleep or something. You don't want to start from the beginning, because it might be quite a bit of data and it takes a while to retrieve, or, as we've seen, you might be talking to an API that's got some rate limit that says you have to stop at 14 of these things, or stop at 200, or whatever. So this is often a restartable process, and it's usually a simple process, usually a relatively small amount of code. You have a queue of things you want to retrieve; you go to the next one and store it in the database, then the next one and store it in the database. When you start the process up, you start filling this database with stuff, and then if it blows up and you restart it, the first thing it does is read the database: "Oh, I don't need any of those," and then it starts to get the next one, and the next one, and the next one. That is how you make this restartable. Databases are really good at this, so your program that's writing to the database can blow up and you don't corrupt your data. You don't have partial data; it's either written or it's not written, and so these things can blow up. Sometimes you just blow them up because you want to blow them up, and when you start them up again, they scan down and say, "Where was I? Oh, I'll start here, here, here, here." So this is often a slow and restartable process. It also might be rate-limited for some reason. So this runs for a while; in the third application we'll do in this chapter, it might actually run for days.

Then you have your data, and you start doing stuff inside your computer where you don't really care so much about the network. This might be raw data that came in off the APIs, and you want to put the data into some new format, so you might go from one database to another database, or from a database to a file, and produce data that's really ready for visualization. This might be a little complex, or there might be flaws in it, so you might write scanners that go, "Oh, wait a sec, this is inconsistent; sometimes it looks like this and sometimes it looks like that, so I'll clean that stuff up." Then, once it's cleaned up, you do some visualization, or write some Python programs that loop through the data and do some summing or adding or who knows what, but analyzing or visualizing. What we're going to use for our visualization is things like Google Maps, a lot of JavaScript, and a thing called D3.js, which is a JavaScript library. Now, in this class, we're not teaching JavaScript, and we're not going to teach you Google Maps. I've provided all these things, so that when you run these programs, that stuff is all there.
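Before we get to the first application, here's a minimal sketch of that restartable retrieve-and-cache loop, just to make the pattern concrete. The file names, table name, and columns here are made up for illustration, not taken from the sample code; the actual applications in this chapter differ in the details.

```python
# Minimal sketch of a restartable retrieval loop.
# Hypothetical names: 'todo.txt' (the queue of URLs), 'cache.sqlite',
# and the 'Cache' table are illustrative only.
import sqlite3
import urllib.request

conn = sqlite3.connect('cache.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Cache (url TEXT UNIQUE, data TEXT)')

with open('todo.txt') as fh:
    for line in fh:
        url = line.strip()
        if len(url) < 1:
            continue

        # Restartability: skip anything we already retrieved
        cur.execute('SELECT data FROM Cache WHERE url = ?', (url,))
        if cur.fetchone() is not None:
            continue

        # Retrieve, store, and commit so a crash loses at most one item
        print('Retrieving', url)
        data = urllib.request.urlopen(url).read().decode()
        cur.execute('INSERT INTO Cache (url, data) VALUES (?, ?)', (url, data))
        conn.commit()

conn.close()
```

The key design point is the check-before-retrieve step plus a commit after every insert: if the program dies, everything already committed survives, and the next run simply skips it.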
But if you want to learn and see some examples of how to make a simple JavaScript visualization with a line or a word cloud or a map, we've got it, and you can take a look at those things. Now, this is one form of data mining, and it's really data mining for an individual, where you're pulling this data, getting it locally, and then working with it. There are other, much more sophisticated data mining technologies that you might find yourself using. But often you'll find that Python is part of these, or Python helps you manage them, or you write a Python program to scan through these things, or to prepare them, or to do something. So there are lots of different data mining technologies; this is just one oversimplified, very Python-oriented data mining technology. I'd call this personal data mining. You should take classes if you really want to become a data mining expert; this is just taking some of the skills that we've learned in this class and applying them to some data mining problems.

So, the first application that we're going to data mine is an extension of an application we played with back in the JSON chapter. The idea is that it has a queue of locations. These are not pretty locations, meaning they're locations as typed in by users. They're actually from data from many years ago, anonymized data from the students who took one of my very first MOOCs, the MOOC on Internet History, reduced and anonymized just to play with. It's not accurate, and we don't have GPS coordinates. But if we use the Google GeoData API, with JSON, we can get them; we just need to avoid rate limiting, so we're going to cache the results in a database, meaning we're only going to retrieve each piece of data once. Then we're going to use the Google Maps API to visualize this in a browser. The sample code is right there, and that sample code, geodata.zip, has a README that tells you exactly what to do to run this, and it shouldn't be very hard for you to run it and produce a nice visual result.

Here's a basic process diagram of what's going to happen. There is a list of the things to retrieve called where.data; it's just a list of the locations, but these are not correct, they don't have GPS, they're just as typed into a text field by a user. Geoload is going to start reading this, and it checks to see if each one is already in the database. This is a restartable process, as I mentioned. Then it looks for the first unretrieved piece of data, goes out and does a web service request, parses that, puts it into the database, then goes to the next one, parses that, puts it in the database, and this runs for a while. Then maybe it blows up, then you fix whatever, or you start your computer back up, and it runs for a while longer. So this is a restartable process that, in effect, is adding stuff to this database. It's an SQLite database, and you can use the SQLite Browser to look at it if you like, stuff we did in the database chapter. So you can run that, see what you got, run it some more, see what you got, and debug it using the SQLite Browser. Then at some point you've got all of your data, and we've got this application called geodump.py that reads through all of this data and prints some information out, nice summary information. It's really common to want to do this, to get some summary information just for sanity checking, so you don't have to use the SQLite Browser. But geodump.py also writes out a little JavaScript file called where.js, which is then combined with where.html and the Google APIs.
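Here's a hedged sketch of what that geodump step might look like: scan the cached JSON in SQLite, pull out latitude and longitude, print a quick summary, and write a small JavaScript data file. The table and column names ('Locations', 'address', 'geodata') and the exact JSON paths are assumptions for illustration; the real geodump.py in geodata.zip may differ.

```python
# Sketch of a geodump-style pass: read cached geocoding results from
# SQLite, extract lat/lng, and emit a JavaScript data file (where.js).
import sqlite3
import json

conn = sqlite3.connect('geodata.sqlite')   # assumed database name
cur = conn.cursor()
cur.execute('SELECT address, geodata FROM Locations')  # assumed schema

rows = []
for address, geodata in cur:
    try:
        js = json.loads(geodata)
        loc = js['results'][0]['geometry']['location']  # assumed JSON shape
        rows.append((loc['lat'], loc['lng'], address))
    except (ValueError, KeyError, IndexError):
        continue  # skip entries that never geocoded cleanly

print('Summary:', len(rows), 'locations written to where.js')

# where.js just defines a JavaScript array that where.html can loop over
with open('where.js', 'w', encoding='utf-8') as fh:
    fh.write('myData = [\n')
    fh.write(',\n'.join(
        "[{:.6f},{:.6f},'{}']".format(lat, lng, addr.replace("'", ''))
        for lat, lng, addr in rows))  # crude quote stripping keeps the JS valid
    fh.write('\n];\n')

conn.close()
```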
The where.html page uses JavaScript to put all these little pins on a map based on whatever data is in the database. So that's our first end-to-end spider, process, and visualize application. First thing. So, up next, we're going to show how we can use this to build a very simple search engine and then run the PageRank algorithm.