Welcome, we will now look at a very interesting problem, which is how to use a collection of computing nodes, a cluster, as a single parallel computer. This is referred to as distributed parallelism. The idea is that you have nodes, say in a data center, and a single node might consist of a processor, a local memory, and a network interface card. That's one node, say NODE 0, and there are many more nodes; a data center can have hundreds of thousands of such nodes, each with a processor, local memory, and network interface. The way we connect them all together is with some kind of interconnect, which varies depending on the technology, but what it accomplishes is a way for these nodes to communicate with each other. Now what we'd like to do is use this collection of nodes, referred to as a cluster, as a single parallel computer. One of the biggest challenges that arises when you try to do something like this is data distribution. Each of these nodes has a local memory, but if you want to use multiple nodes as one computer, we need a logical view of a data structure that spans multiple nodes. This logical view, also referred to as a global view, may be of an array, for example, of four elements A, B, C, D, indexed by 0, 1, 2, and 3. If you are spreading this global-view array, call it XG, across the nodes in a cluster, you have to take pieces of it and put them in local memory. This is something the programmer has to do, because the computing system only has local memory in each node. The local view would be one per node: NODE 0 would have an array XL holding the two elements A and B in indices 0 and 1, and NODE 1 would also have a local array, also called XL, holding C and D, also indexed by 0 and 1.
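The block distribution just described can be sketched as follows. This is a standalone Java sketch, not MPI code: the names XG and XL come from the lecture, while the class and variable layout are my own illustration of how a global index maps to a (node, local index) pair.

```java
// Sketch of the block distribution described above (names XG/XL follow the
// lecture; the rest is a hypothetical illustration, not real MPI code).
public class BlockDistribution {
    public static void main(String[] args) {
        String[] xg = {"A", "B", "C", "D"};   // global view
        int numNodes = 2;
        int chunk = xg.length / numNodes;     // 2 elements per node

        // Build each node's local view XL from the global view XG.
        String[][] xl = new String[numNodes][chunk];
        for (int g = 0; g < xg.length; g++) {
            int node  = g / chunk;            // which node owns global index g
            int local = g % chunk;            // its index within that node's XL
            xl[node][local] = xg[g];
        }

        for (int node = 0; node < numNodes; node++) {
            System.out.println("NODE " + node + ": XL = "
                               + java.util.Arrays.toString(xl[node]));
        }
    }
}
```

Running this prints `NODE 0: XL = [A, B]` and `NODE 1: XL = [C, D]`, matching the distribution in the lecture.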
So now we see that the programmer has to think about distributing the logical global view across these local arrays, which have the same name but are distinguished per node. This abstraction is referred to as Single Program, Multiple Data (SPMD), because what it looks like is that we're loading the same local program, with just an array XL of two elements, on each node and starting it. Globally, that single program is executing on different nodes, and that same program can be operating on different elements of the data, because it's working on different local memories. One interface that is commonly available for this is the Message Passing Interface, MPI. A program written in this interface would have a main program, some kind of initialization, and then it can initialize the local arrays: say we have a loop over I, and initialize XL[I] = I. Now, suppose we wanted the element XL[I] to hold its global index. Then, instead of I, we could set it to XL.length times RANK plus I. What is this doing? The RANK is basically equivalent to the node ID, so NODE 0 has rank 0 and NODE 1 has rank 1. For NODE 0, with rank 0 and XL.length equal to 2, XL[0] = 2 * 0 + 0 = 0, and XL[1] = 2 * 0 + 1 = 1. For NODE 1, and this is interesting because its rank is 1, XL[0] = 2 * 1 + 0 = 2, and XL[1] = 2 * 1 + 1 = 3. So now you have different values in the same XL location but on different nodes, and that is why we have the multiple data in the single program, multiple data model. In what we are looking at right now, for simplicity, we are thinking of each processor as being single-threaded, and of course, this course is all about multicore programming.
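A minimal sketch of this rank-based initialization is below. In real MPI, each rank is a separate process and obtains its rank from the library (via a call such as MPI_Comm_rank after MPI_Init); here, as an assumption for illustration, the two ranks are simulated in a loop so the arithmetic can be seen on one machine.

```java
// Sketch of the SPMD initialization XL[i] = XL.length * RANK + i.
// In real MPI, each rank is a separate process and RANK comes from the
// library; here we simulate two ranks in one program for illustration.
public class SpmdSketch {
    static int[] initLocal(int rank, int length) {
        int[] xl = new int[length];
        for (int i = 0; i < length; i++) {
            xl[i] = length * rank + i;   // global index of local element i
        }
        return xl;
    }

    public static void main(String[] args) {
        for (int rank = 0; rank < 2; rank++) {
            System.out.println("rank " + rank + ": XL = "
                               + java.util.Arrays.toString(initLocal(rank, 2)));
        }
    }
}
```

This prints `rank 0: XL = [0, 1]` and `rank 1: XL = [2, 3]`: the same statement, different data per rank.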
So, one thing you can do is create many of these MPI processes within a node, one per core, or you can find ways of combining the multicore parallelism we've learned earlier with this distributed parallelism. In a nutshell, with distributed parallelism, we have the opportunity to use a collection of compute nodes as one parallel computer, all running the same program but having different effects, because each statement could be doing something different based on the rank of the node or the MPI process. This is very different from client-server programming, and it is the foundation for a lot of high performance computing in scientific applications, as well as emerging data analysis applications in the cloud.
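The hybrid combination mentioned above can be sketched like this. In practice each rank would be a separate MPI process; here, as an assumption for illustration, two ranks are simulated in a loop, and each one uses Java threads (the multicore layer) to initialize its own local array. The structure is illustrative, not a real MPI-plus-threads program.

```java
import java.util.Arrays;

// Sketch of hybrid parallelism: distributed parallelism across ranks
// (simulated here as a loop) plus multicore parallelism within a rank
// (two Java threads, each initializing half of the local array XL).
public class HybridSketch {
    public static void main(String[] args) throws InterruptedException {
        int length = 4;                       // local elements per rank
        for (int rank = 0; rank < 2; rank++) {
            int[] xl = new int[length];
            int r = rank;
            Thread lo = new Thread(() -> {
                for (int i = 0; i < length / 2; i++) xl[i] = length * r + i;
            });
            Thread hi = new Thread(() -> {
                for (int i = length / 2; i < length; i++) xl[i] = length * r + i;
            });
            lo.start(); hi.start();
            lo.join();  hi.join();
            System.out.println("rank " + rank + ": XL = " + Arrays.toString(xl));
        }
    }
}
```

Each rank's two threads fill disjoint halves of XL, so no synchronization beyond join() is needed; this prints `rank 0: XL = [0, 1, 2, 3]` and `rank 1: XL = [4, 5, 6, 7]`.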