In an earlier post I mentioned that one goal of the new introductory curriculum at Carnegie Mellon is to teach parallelism as the general case of computing, rather than an esoteric, specialized subject for advanced students. Many people are incredulous when I tell them this, because it immediately conjures in their mind the myriad complexities of concurrency: locks, monitors, deadlock, dining philosophers, …. And if you look at the recent research literature, you will see countless papers of the form (a) parallelism is important, so (b) let’s talk about concurrency. The situation has gotten so bad that I think that most computer scientists cannot distinguish the two concepts. (Or maybe they willfully confuse the two so as to justify funding for work on concurrency by appealing to the importance of parallelism.)
Given this state of affairs, I cannot explain what we are doing at Carnegie Mellon (to anyone who doesn’t already know) without first rebooting you. Trust me, I’m just going to hit Ctrl-Alt-Delete, and then click Restart. OK, we’re back.
The first thing to understand is parallelism has nothing to do with concurrency. Concurrency is concerned with nondeterministic composition of programs (or their components). Parallelism is concerned with asymptotic efficiency of programs with deterministic behavior. Concurrency is all about managing the unmanageable: events arrive for reasons beyond our control, and we must respond to them. A user clicks a mouse, the window manager must respond, even though the display is demanding attention. Such situations are inherently nondeterministic, but we also employ pro forma nondeterminism in a deterministic setting by pretending that components signal events in an arbitrary order, and that we must respond to them as they arise. Nondeterministic composition is a powerful program structuring idea. Parallelism, on the other hand, is all about dependencies among the subcomputations of a deterministic computation. The result is not in doubt, but there are many means of achieving it, some more efficient than others. We wish to exploit those opportunities to our advantage.
Now I can hear you object, but isn’t concurrency required to implement parallelism? Well, yes, it is, but concurrency is also required to implement sequentiality too! The timing signal on your processor chip is essentially a synchronization mechanism with which to coordinate the otherwise independent activity of the components of the processor. (Yes, I know that there are asynchronous processors, but this only makes my point stronger, you still need to control the concurrency.) The point is that concurrency is not relevant to parallelism, even if the engineers who build our parallel computing platforms must deal with concurrency. Another way to say the same thing is that parallelism is a useful abstraction, and abstractions should never be confused with their implementations. (Numbers are not certain sets, but they can be implemented in terms of sets. And so on.)
All well and good, parallelism is not concurrency, but what is it? What is an appropriate model for parallel computation? As in the sequential case, there are two answers, a machine-based approach and a language-based approach. I am going to criticize the machine-based formulation, and argue for the language-based formulation on two grounds: (a) it is inherently more elegant and natural, and (b) it can be taught directly to students without any fuss or bother. Much of what I have to say is derived from the seminal and beautiful work of Guy Blelloch, which I have absorbed over years of collaboration. To the extent I’ve gotten it right, give Guy the credit; to the extent that I’ve gotten it wrong, assign me the blame.
The canonical machine-based model is the PRAM, or parallel RAM, which generalizes the classical RAM model to allow p>0 processors that share a common memory. There are many variations on the PRAM, according to what sorts of interactions are allowed and how conflicts are resolved among the processors. This is not important for present purposes, so I will not attempt to categorize the variations here. The key idea is that the number, p, of processors is fixed in advance—results are stated in terms of how efficiently an algorithm executes on a p-processor PRAM as a function of the size, n, of the input. This is well and good as a cost model for parallel execution, but it totally sucks as a programming model! To write code for a PRAM, you fix the number of processors, and write the code for that number of processors (often the various cases are “analogous”, but in reality you must fix p and write code for that fixed p). Moreover, you must (officially, anyway) write low-level imperative RAM code, which introduces spurious dependencies among sub-computations by explicitly scheduling the steps of the computation. This is horrendous! Of course typically one writes pidgin C, and sketches how it is compiled, but the fact is you’re still writing sequential, imperative code to implement a parallel algorithm.
As I’ve argued previously, what is needed is a language-based model of computation in which we assign costs to the steps of the program we actually write, not the one it (allegedly) compiles into. Moreover, in the parallel setting we wish to think in terms of dependencies among computations, rather than the exact order in which they are to be executed. This allows us to factor out the properties of the target platform, such as the number, p, of processing units available, and instead write the program in an intrinsically parallel manner, and let the compiler and run-time system (that is, the semantics of the language) sort out how to schedule it onto a parallel fabric. Here’s a useful analogy. Just as abstract languages allow us to think in terms of data structures such as trees or lists as forms of value and not bother about how to “schedule” the data structure into a sequence of words in memory, so a parallel language should allow us to think in terms of the dependencies among the phases of a large computation and not bother about how to schedule the workload onto processors. Storage management is to abstract values as scheduling is to deterministic parallelism.
The paradigmatic example is provided by Quicksort, a naturally parallel algorithm. Many people think that the “big idea” of Quicksort is the in-place swapping technique that Hoare invented in the original formulation. I argue that this is a detail; the “big idea” is that it is parallelizable. To see this, we must work at a reasonable level of abstraction: a sort function permutes a non-empty sequence of (comparable) elements to obtain another sequence in sorted order.
fun qs xs = if |xs|=1 then xs else let val (xsl, xsg) = split xs in concat (qs xsl, qs xsg)
That is, we split the sequence into two parts (by some means) such that everything in the left part is less than everything in the right part, then sort the parts independently (in parallel!), and merge the results together. The exact complexity of Quicksort depends on the representation of sequences, and the complexity of the operations on them. For now, let us just gloss over this to focus on the overall structure of the computation. If we can ensure that the sequence is split roughly in half (by any of a number of means, including randomization), then the recursion is of logarithmic depth. At each stage we do linear work, but the recursive calls are independent of one another. The algorithm has a natural fork-join structure in which we (a) split the sequence into two parts, (b) in parallel sort the two parts, and (c) join the results together. (The splitting itself can be done in parallel as well, but we’ll not bother about that here.)
Analysis of Quicksort splits into two parts, the work and the depth (or span). The work is the overall number of steps required to perform the algorithm. On a sequence of length n, we perform O(n log n) steps to sort the sequence, as might have been expected. This is the sequential complexity, the amount of time it would take to run on a single processor. To account for parallelism, there is a second measure, called the depth, that measures the critical path length of the computation, the length of the longest chain of dependencies among subcomputation. No matter how much parallelism we may have available, we can never run faster than the depth; it represents the idealized parallel complexity of the algorithm, the amount of time it would take given any number of processors. Depending on the choice of data structures, the depth of Quicksort is O(log^2 n), or thereabouts, which makes it quite parallelizable. (Note that, as in our earlier discussion, a step of computation is a transition step in the operational semantics of the language we are using to express the algorithm.)
This is very nice and elegant, but now comes the really beautiful part, the concept of a provable implementation of the language. (The word “provable” is only needed here because of the shabby state of the field; having a proof that your implementation works, and how well it works, remains an oddity to be emphasized, rather than an expectation to be fulfilled by any self-respecting computer scientist.) The idea is that the language-based model can be implemented on a PRAM (with stated interconnect properties that I will suppress here) with a provable bound on the cost that takes account of the overhead of scheduling onto a p-processor fabric. Thus, p need not be specified in advance, but is rather determined by the run-time system and compiler, which are provided by the proof of the theorem.
Theorem A computation with work w and depth d can be implemented in a p-processor PRAM in time O(max(w/p, d)).
(Sometimes the second factor is d log p, depending on the assumptions we are making about the target PRAM.)
This is called Brent’s Principle, and it states a fairly obvious fact. First, we can never execute a depth d program in fewer than d steps. The depth is the critical path length; this is the amount of time we must wait to execute the algorithm, regardless of any available parallelism. Second, to the extent that parallelism arises from the dependency (rather, indepedency) structure of the computation, we can at best do p units of work at time, keeping all processors warm, until we are done. And that’s it!
The effectiveness of the language-based model of parallelism lies entirely in its ability to expose the dependency structure of the computation by not introducing any dependencies that are not forced on us by the nature of the computation itself. And the key to this is functional programming, which manifests itself here in the transformational approach to computation: sorting is conceived of as a mathematical function that transforms a given sequence into another sequence. It does not destroy the given sequence any more than adding two numbers destroys those numbers! Since Quicksort is a mathematical function, we need not worry that execution of qs xsl interferes with (depends on) qs xsg; we can readily run them in parallel without fear of untoward consequences. The payoff is that there are many fewer dependencies among the subcomputations, and hence many more opportunities for parallelism that can be exploited, in accord with Brent’s Principle, when scheduling the work onto a parallel fabric.
The upshot of all this is that functional programming is of paramount importance for parallelism. That’s (one reason) why we are teaching functional programming to freshmen at Carnegie Mellon. And it’s the main reason why we are able to. There’s nothing to it, just program like you do math, and the parallelism will take care of itself!
Update: Simon Marlowe made similar points to mine some time ago.
[...] adding lamda expressions to the language, Robert Harper from Carnegie Mellon University argues that functional programming can hide concurrency complexities from the programmer and allow the runtime environment to take care of them. This seems like the [...]
Nope. My first objections is “but isn’t parallelism required to implement concurrency?”
Concurrency, on the other hand (at least as you have defined it in this post), is really nothing more than the dynamic resolution at run time of scheduling which of the all the possibly non-determinsitic computations that could possibly be executing at that moment in time in such a way that they can execute in parallel so that the result is deterministic.
In fact, I don’t even agree with your definitions of the terms “concurrency” and “parallel”. It is clear that a “dictionary” definition of the terms, at least in the context in which they are being used, both mean “at the same time.”
You have attempted to add additional qualifiers to their usage to further specialize their meaning. IMHO, and this is certainly a point that reasonable people can disagree on, the additional qualifiers you have applied to the terms are not in common usage in the community that would use these terms for the purposes being discussed. Personally, I have never see the words used in a CS context in the way that you have described, nor do I see how your specialization of the terms adds to their descriptive power.
Assuming that the two terms mean something different, fundamentally both “concurrent” and “parallel” computations must produce results that are deterministic. Often when dealing with concurrent/parallel programming on typical hardware, the amount of time it takes to perform the computation can be non-deterministic, but it is always a deterministic result.
In practice, on the vast majority of commercially available computers, concurrent or parallel programming must eventually be reduced to some atomic primitives provided by the underlying architecture. For a LL/SC based architecture, the amount of time it takes to perform an atomic compare and swap is non-deterministic. This means that in practice, on most LL/SC based architectures, all parallel operations will most likely reduce to a primitive that takes a non-deterministic amount of time to complete.
It is my opinion that “concurrent” and “parallel” mean exactly the same thing, and that the distinction you assert exists is not one that it shared by the larger community.
I also find your definition somewhat vague and confusing- it uses the terms “deterministic” and “non-deterministic” to differentiate concurrency and parallelism such that
What does this mean? It would seem to imply that concurrent computations are “non-deterministic”, but non-determinstic how? The result they compute? Or the amount of time? Clearly we are only interested in computations that produce deterministic results, which leaves only the amount of time it takes to produce those deterministic results. The only conclusion is that “Concurrency takes a non-deterministic amount of time, and parallelism takes a deterministic amount of time.” But on most commercially available computers, truly deterministic, bound time parallel computations are simply not possible because the architecture does not provide atomic primitives that are guaranteed to complete in a bounded number of steps. Clearly this reasoning about your definition of the terms is in error because it implies that from a practical point of view, parallelism doesn’t exist, only concurrency.
John,
Concurrency is about externally visible behavior. Parallelism adds no expressive power to a language. The time a parallel operation takes to execute is whatever time it starts until it produces output. In mechanized solutions, model of parallelism is really a property derived from the language semantics itself – for example, programs defined in terms of absolutely non-circular attribute grammars can then have a model of parallelism that shreds the program graph by leveraging the non-circularity property. The results can also be memoized and in turn used for incremental rather than strictly parallel computation. A given model tells you how much performance you can squeeze out from it.
Also, there exists a form of concurrent programming, deterministic concurrency, that has no race conditions, is as easy to program as sequential programs, and can implicitly exploit parallel processors: dataflow programming. It is the only solution to shared memory concurrency that scales, and is arguably the most appropriate default solution for taking advantage of multi-core processors.
But there is more to concurrency research than multi-core processors. The Internet’s infrastructure has grown massively, and it is approaching its ultimate bottleneck – the speed of light! For most applications, the speed of light is embarassingly adequate, but roundtrip communication around the globe faces real communication issues, such as propagation delay, and therefore communication protocol design becomes the dominating factor in performance. Translation: The better we do this, the cooler the Cloud apps we can build that end-users get really excited about.
Finally, there is no reason why nondeterministic composition means composition in the sense of the Kahn Principle. True, real world concurrency (as in, what is implementd in programs today) might have meager or ambiguous agents, resulting in the input-output relation not matching the history relation of the program. We might be nondetereministically composing problems, and doing it in dumb ways. The point behind teaching concurrency is to teach how to do it effectively for real world problems, and avoid traps proven to exist, thanks to good theory – like the Brock-Ackerman Anomaly. You really, really don’t want to build systems that pervasively exhibit the Brock-Ackerman Anomaly, if at all. That’s because determinism is one of the most important properties you can shoot for in designing systems; you only give it up when you have to, not simply “cuz”.
As I understand it, the author views parallelism as an algorithmic or complexity-oriented concept. You can take any algorithm and analyze in terms of its efficiency in terms of parallelism. To do so requires no knowledge of the machine on which the algorithm runs. This process is “deterministic” in the sense that the orders of steps are clearly defined. “First the list is split in half, then the two halves are sorted in parallel.” “Given k processors working in parallel, the running time of this algorithm is O(k log k / n).”
On the other hand, concurrency is taken to mean the methods and steps required to actually implement parallel algorithms. These include mutexes and semaphores and all of the above. These are obviously machine-dependent and “non-deterministic” in the sense that the order and timing of steps are not clearly defined or known ahead of time. “Thread 2 waits until the lock is released by some other thread, then acquires the lock and waits on a signal….”
[...] for first-year students at Carnegie Mellon. Apart from parallelism, wbich I’ve already discussed, we are placing a strong emphasis on verification and proof as tools for the practicing programmer, [...]
I wrote a brief (some would say cryptic) note of some experimentation I did in this area – particularly a functional language based approach to achieve parallel execution of large computations without the need to address concurrency in any way in the expression of the problem.
http://schmerg.com/concurrent-programming-is-hard-so-dont-do-it
My particular bugbear concerned the composability of such systems – assembling computations from libraries of code and the like to run across large computation grids of multicore machines.
In particular the language assembles the program so that it can executed in the minimum depth (as you describe) when given sufficient resources, but it can also balance any arbitrary problem across fewer resources and determine an optimal minimum depth for execution, and all this is done without changing the source code that expresses the problem.
Do you count a memorization as a functional programming primitive? Without it, some algorithms will be exponentially slower.
I don’t think you can implement memorization without concurrency.
If state is the same as concurrency, you’re right, but ordinarily people don’t think of it that way. You can have memoization or not according to your selection of types. There is no reason to force it on you, or deprive you of it.
[...] Parallelism is not concurrency In an earlier post I mentioned that one goal of the new introductory curriculum at Carnegie Mellon is to teach [...] [...]
[...] Parallelism is not concurrency « Existential Type Useful read RT: @tunixman Parallelism is not concurrency « Existential Type http://bit.ly/f5HhMF (tags: via:packrati.us) [...]
There are many models for deterministic concurrency: temporal logic programming, functional reactive programming, future passing with single assignment variables (promoted heavily by Peter Van Roy), Kahn process networks, concurrent constraint programming. I do not agree that concurrency is inherently coupled to non-determinism.
I do distinguish concurrency and parallelism: concurrency is a semantic property that says many behaviors are happening at the same time, and parallelism is an implementation property that says many calculations are happening at the same time. Concurrency can be implemented without parallelism, and parallelism without concurrency.
Concurrency is sometimes an good opportunity for parallelism. And high-performance parallelism is often easier to manage if you control it semantically (i.e. with concurrency) rather than as a mere implementation/optimization detail.
I believe you are conflating nondeterminism with ‘observable non-determinism’. Since you mentioned PvR, check out CTM section 4.1 (page 238).
Bearing this distinction in mind, dataflow concurrency and concurrent constraint programming both involve nondeterminism.
The relevant issue is whether nondeterminism exists in the semantics. PvR was working with Oz semantics, so he was stuck in the rather precarious position of explaining ‘yes, Oz threads can be indeterministic, but this here model is deterministic’. If you extracted just the model in question into its own language, though, there would be no question that the semantics are deterministic.
Concurrent constraint programming is often extended with nondeterminism of a very observable form: choice, backtracking.
I think I understand the argument you’re making, but I would argue it’s not as cut-and-dried as you suggest.
It’s true that concurrency is primarily about non-determinism, and there exist useful deterministic approaches to parallelism for which concurrency is not terribly relevant. But efficient parallel systems are almost always non-deterministic, and I think this is no accident. Building efficient deterministic parallelism is hard, and I believe it’s not actually possible for the general case, because it requires too much of an understanding in advance of the costs of the different parts of a computation than is reasonable to expect.
So, while I think distinguishing concurrency and parallelism is important, and that a deterministic approach to parallelism is a great pedagogical place to start, I remain wholly unconvinced that concurrency is not relevant to parallelism.
Let me counter your argument with the following: You surely know Google’s Sawzall language for map-reduce queries. Where in map-reduce do you deploy concurrency to solve your problem? Where is the non-deterministic part? I think the answer to those questions are “nowhere”.
Another, more machine-level observation is the distinction between a GPU (mostly SIMD executing) and a SMP CPU (mostly MIMD executing). The SIMD approach is massively parallel, yet it stands no chance of ever doing concurrent work since all cores are step-locked in choreography to execute the same instruction.
Note that MIMD is much more applicable to a heavily concurrent setting. If we have 30000 individual concurrent processes, we can as well map them to as many physical cores we have and gamble that spinning the cores will yield at least some performance benefit. This doesn’t work in a SIMD setting since the c. processes might be executing wildly different tasks.
Sawzall, and its derivatives, well illustrate my point.
I am a “language-based model of computation” fellow traveler, and I agree that functional programming is a great basis for parallelism.
However, I have to disagree with your characterization that parallelism and concurrency having nothing to do with each other.
While your rationale is much stronger than some of the other merely pedantic presentations of the same idea I’ve seen elsewhere, I think that parallelism and concurrency are at the very least dual to each other, perhaps even the same phenomenon looked at from two different ends.
Parallelism happens when you start out thinking you have a single “atomic” computation, but then realize that you really want to break it up into independent sub-computations. You solve this problem, as you point out, by understanding where your real dependencies are, and where the perceived dependencies are merely artifacts of fuzzy thinking and sloppy organization.
As you also point out, functional programming is a great tool for helping to clarify these issues.
Concurrency, on the other hand, happens when you start out thinking you have two completely independent computations, but it turns out that they share dependencies, intentional and necessary or not. You solve this problem by understanding where your real dependencies are, and where they are merely artifacts of fuzzy thinking and sloppy organization, such as the gratuitous use of shared state.
Here too, functional programming is a great tool to clarify which is which.
I guess ultimately my objection is that, by separating the two phenomena, you are being insufficiently bold in your claims for FP! ;-)