Auto-scaling data processing pipelines. In this chapter, we'll look at Cloud Dataflow, which is a way to execute Apache Beam data pipelines on Google Cloud Platform. We'll look at how to write such data pipelines, how to carry out MapReduce kinds of programs in Dataflow, and also some concepts such as how to deal with side inputs and how to carry out streaming.

So what is Dataflow? Dataflow is a way to execute data processing pipelines on the cloud. In this case, for example, we are using Dataflow to carry out a pipeline that reads from BigQuery and writes into Cloud Storage. It does so in a series of steps, and the key thing about Dataflow is that these steps, called transforms, can be elastically scaled. For example, if it turns out that one of these steps needs to be executed in parallel on 15 machines, it can be autoscaled to those 15 machines, and the results can then move on to the next transform, which itself may get scaled to only five machines. So you essentially have something that's elastic, each of whose steps is automatically scaled by the execution service. The code that we write uses an open-source API called Apache Beam, and Dataflow is not the only place where you can execute Apache Beam pipelines; you can also execute them on Flink, Spark, and so on. But we will look at Cloud Dataflow as the execution service for when we have a data pipeline that we would like to execute on the cloud.

The way this works is that you first create your pipeline, and then you do a series of applies. So in this pipeline, I'm reading with TextIO from Google Cloud Storage (that's why the path starts with gs://), then doing some kind of filtering, some kind of grouping, another filtering, some kind of transformation, and finally writing the results out to another file on Cloud Storage. Now, each of these steps, Filter1, Group1, Filter2, and Transform1, is user-defined code. This is code that we write; because this is Java, each of these is a Java class. Something that's kind of neat is that we're not just applying Filter1 directly; we are applying Filter1 within the context of a ParDo, a parallel do. What this means is that Filter1 is going to be autoscaled and run scaled-out on a whole number of machines, and then the results of Filter1 are going to stream into Group1. The results of Group1 are then going to get applied to Filter2, but Filter2 is again going to be completely parallelized and run in parallel on many machines; the results of Filter2 are going to stream into Transform1, and the results of Transform1 are going to go into the output file (see the code sketch below).

Now, my use of the word "stream" there was deliberate, because Dataflow lets you write code that processes both historical batch data, that is, data that is complete and comes from what we'll call bounded sources and sinks, as well as data that is unbounded in nature. So we may have a data pipeline where we want to apply Filter1, Group1, Filter2, and Transform1. In the previous case, when we applied Filter1, Group1, Filter2, and Transform1, we were reading from text and writing to text. In this case, we are reading from Pub/Sub, which is a messaging system that works in real time. So you're reading from Pub/Sub, and then you're writing back to Pub/Sub. But then the question becomes: when I do a group, how can I group something that is unbounded in nature? Well, you can't, because you don't know whether you're going to get some new member of the group.
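Before looking at how windowing resolves that, here is a minimal, purely illustrative sketch of what the batch pipeline described above could look like with the Beam Java SDK. The bucket paths and the class name BatchPipeline are placeholders, the "error" check is just example logic standing in for a user-defined Filter1, and a per-element count stands in for the Group1 and later steps; this is not the exact code from the course.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class BatchPipeline {

  // Hypothetical user-defined step: keep only lines that contain "error".
  static class Filter1 extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      if (c.element().contains("error")) {
        c.output(c.element());
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("Read", TextIO.read().from("gs://my-bucket/input*.txt"))       // bounded source on GCS
     .apply("Filter1", ParDo.of(new Filter1()))                            // user-defined code, scaled out
     .apply("Group1", Count.<String>perElement())                          // a grouping step
     .apply("Transform1", ParDo.of(new DoFn<KV<String, Long>, String>() {  // format the grouped results
        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(c.element().getKey() + "," + c.element().getValue());
        }
      }))
     .apply("Write", TextIO.write().to("gs://my-bucket/output"));          // bounded sink on GCS

    p.run();
  }
}
```

Each ParDo.of(...) step wraps exactly the kind of user-defined code that the execution service can scale out across machines independently of the other steps.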
Coming back to the unbounded case: anything that you calculate, let's imagine that you're calculating a mean, is going to change over time as new data come in. So when you're reading streaming data, typically you also apply a window to it. In this case, we are applying a sliding window of 60 minutes. Then each of these groups, each of these transforms, if you're computing a mean, is a mean within those 60 minutes. In other words, it's a moving average. So if you want to do things like moving averages on real-time data, on streaming data, this is basically what you do: you change the input and output to read from something that's unbounded, for example Pub/Sub, and you apply a window.

So Dataflow lets you ingest data both from batch and from stream, and use the same code, the same filtering, transforming, and grouping code, to carry out operations on both batch data and streaming data. Of course, when you compute a mean on batch data, you get a mean over the entire dataset, whereas if you are computing means on streaming data, you have to use a window: a window of 60 minutes, for example, although you could also define windows based on other things, such as the number of records. You can apply the same windowing transforms to batch data as well. So that's the whole idea: the same code that processes streaming data is also the code that processes batch data.
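To make the streaming case concrete, here is a similarly hedged sketch. The Pub/Sub topic names are placeholders, the messages are assumed to carry numbers encoded as text, and the 5-minute slide period is an illustrative choice, not something from the lecture; a 60-minute window sliding every 5 minutes turns a global mean into a moving average over the most recent hour.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Mean;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class StreamingMeanPipeline {
  public static void main(String[] args) {
    // Reading from Pub/Sub is unbounded, so this would run on a streaming-capable runner.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("Read", PubsubIO.readStrings()
                            .fromTopic("projects/my-project/topics/input"))   // unbounded source
     // 60-minute sliding window: every aggregate below becomes a moving average.
     .apply("Window", Window.<String>into(
         SlidingWindows.of(Duration.standardMinutes(60))
                       .every(Duration.standardMinutes(5))))
     .apply("Parse", ParDo.of(new DoFn<String, Double>() {                    // user-defined code
        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(Double.parseDouble(c.element()));
        }
      }))
     .apply("Mean", Mean.<Double>globally().withoutDefaults())                // mean per window
     .apply("Format", ParDo.of(new DoFn<Double, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(Double.toString(c.element()));
        }
      }))
     .apply("Write", PubsubIO.writeStrings()
                             .to("projects/my-project/topics/output"));       // unbounded sink

    p.run();
  }
}
```

Compared with the batch sketch, the only structural changes are the Pub/Sub source and sink and the Window.into(...) step; the filtering, grouping, and transforming code itself could stay the same, which is exactly the point made above.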