Hi. My name is Vince Gonzales and I'm a Data Engineer with Google Cloud. In this module, we'll introduce frameworks and features available to streamline your CI/CD workflow for Dataflow pipelines. We'll cover an overview of testing and CI/CD, then discuss unit testing your Beam pipelines, integration tests, artifact building, and considerations around deploying your pipelines.

Let's get into the overview. Software engineers are no strangers to application life cycle management. After all, that's how we keep applications fresh and up-to-date. Dataflow pipelines are no different. Dataflow pipelines are authored according to well-understood best practices within software engineering. First, Dataflow pipelines need a comprehensive testing strategy. We should be implementing unit tests, integration tests, and end-to-end tests to ensure that our pipeline behaves as we expect. The approach to development should also be well-structured: a haphazard rollout can result in corrupted data being written to the sink or disruptions to your downstream applications. Finally, data engineers should strive to validate changes made to pipeline logic and have a rollback plan in case of a bad release.

While all these considerations are similar to general application development, there are some key differences to point out. Data pipelines often aggregate data, and this makes them stateful in that they must accumulate the result of some aggregation over time. This means that if you need to update your pipeline, you need to consider any state that may exist in the pipeline you're updating. We'll discuss this in more detail later, but when you change your pipeline, you'll need to account for existing state as well as any changes to the pipeline logic and topology. Changes you make need to be compatible with the pipeline you're updating. If they're not, you might have to devise alternate migration strategies that might require reprocessing data.
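One concrete mechanism for compatible updates is Dataflow's in-place update feature: relaunching a streaming job with the `--update` pipeline option asks Dataflow to replace the running job while carrying its state over, provided the new graph is compatible. Here is a config-only sketch with the Beam Python SDK; the project and job names are hypothetical placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Config-only sketch (not a complete pipeline). With update=True, Dataflow
# replaces the named running job and migrates its state, provided the new
# pipeline is compatible with the old one. The project and job_name values
# below are hypothetical.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    job_name="my-streaming-job",  # must match the job being replaced
    streaming=True,
    update=True,
)
```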
If you do roll out a bad configuration, you could be dealing with more than just an unpleasant experience for end users. If your pipeline makes non-idempotent side effects on external systems, you'll have to account for those effects after a rollback. This raises the stakes for ensuring safe releases.

Now that we understand some of the challenges that come with testing and deploying data processing applications, let's take a look at what testing and CI/CD look like with Beam and Dataflow. Testing in Beam is well summed up by this diagram. You can read it from the center out, starting with the Beam pipeline itself and some handcrafted test inputs, then moving to other PTransforms and DoFn subclasses, before considering integration testing, which involves real data sources and sinks.

Let's talk about unit tests. All pipelines revolve around transforms, and the lowest level we typically deal with in Beam is the DoFn. Since these are essentially functions, we validate their behavior with unit tests that operate on input data sets. They produce output data sets that we validate with assertions. Similarly, we can provide test inputs to the entire pipeline, which might include our DoFns as well as other PTransforms, and assert that the results of the entire pipeline are what we expect. For system integration tests, we incorporate a small amount of test data using the actual IOs. This should be a small amount of data, since our goal is to ensure the interaction with the IOs produces the expected results. Finally, end-to-end tests use a full testing data set, which is more representative of the data our pipeline will see in production.

Whatever tool you're using in your CI/CD testing environment, you'll make use of the direct runner, which runs on your local machine, and your production runners, which run on the cloud service of your choice, like Cloud Dataflow.
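Since a DoFn is essentially a function, the lowest-level unit test can exercise its logic directly on a handcrafted input and assert on the output. The sketch below is plain Python, with a hypothetical `SplitWords` class standing in for a `beam.DoFn` subclass; Beam's own testing utilities (`TestPipeline` and `assert_that` in the Python SDK, `PAssert` in Java) apply the same input-and-assert pattern at the pipeline level.

```python
# Hypothetical SplitWords stands in for a beam.DoFn subclass; in Beam you
# would subclass beam.DoFn and yield output elements from process().
class SplitWords:
    def process(self, element):
        # One input element in, zero or more output elements out.
        for word in element.split():
            yield word

# Handcrafted test input, then an assertion on the output data set.
dofn = SplitWords()
output = list(dofn.process("to be or not to be"))
assert output == ["to", "be", "or", "not", "to", "be"]
```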
The direct runner will be used for local development, unit tests, and small integration tests with your data sources. You'll use your production runner when it's time to do larger integration tests, when you want to test performance, and when you want to test pipeline deployment and rollback.

More broadly, the CI/CD life cycle looks something like this. It's iterative, moving through a cycle of development, artifact building, and testing, followed by deployment. In the development part of the life cycle, we write our code, executing unit tests locally using the direct runner and executing integration tests using the Dataflow runner. As we develop and test, we're committing to source repositories along the way. These commits and pushes trigger the continuous integration system to compile and test our code in an automated manner using Cloud Build or a similar CI system. Once the build completes successfully, artifacts are deployed first to a pre-production environment, where end-to-end tests are run. If these succeed, we deploy to our production environment.
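The runner split described above can be sketched as a small helper: `DirectRunner` and `DataflowRunner` are Beam's real runner names, but the stage-to-runner mapping itself is a hypothetical illustration of one way a CI/CD script might choose, not part of any SDK.

```python
# Lifecycle stages that run on the local machine with the direct runner:
# development, unit tests, and small integration tests. Everything else
# (larger integration, performance, deployment/rollback tests) goes to
# the Cloud Dataflow service. Stage names here are hypothetical.
LOCAL_STAGES = {"dev", "unit-test", "small-integration"}

def runner_for_stage(stage: str) -> str:
    return "DirectRunner" if stage in LOCAL_STAGES else "DataflowRunner"

# A CI script could then pass the chosen runner to the pipeline, e.g.:
#   python my_pipeline.py --runner <result of runner_for_stage(stage)>
```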