Parallelism and Concurrency, Revisited


To my delight, I still get compliments on and criticisms of my post from three years ago (can it possibly be that long?) on parallelism and concurrency.  In that post I offered a “top down” argument to the effect that these are different abstractions with different goals: parallelism is about exploiting computational resources to maximize efficiency, concurrency is about non-deterministic composition of components in a system.  Parallelism never introduces bugs (the semantics is identical to the sequential execution), but concurrency could be said to be the mother lode of all bugs (the semantics of a component changes drastically, without careful provision, when composed concurrently with other components).  The two concepts just aren’t comparable, yet somehow the confusion between them persists.  (Not everyone agrees with me on this distinction, but neither have I seen a rigorous analysis that shows them to be the same concept.  Most complaints seem to be about my use of the words “parallelism” and “concurrency”, which is an unavoidable problem, or about my temerity in trying to define two somewhat ill-defined concepts, a criticism that I’ll just have to accept.)

I’ve recently gotten an inkling of why it might be that many people equate the two concepts (or see no point in distinguishing them).  This post is an attempt to clear up what I perceive to be a common misunderstanding that seems to explain it.  It’s hard for me to say whether it really is all that common of a misunderstanding, but it’s the impression I’ve gotten, so forgive me if I’m over-stressing an obvious point.  In any case I’m going to try for a “bottom up” explanation that might make more sense to some people.

The issue is scheduling.

The naive view of parallelism is that it’s just talk for concurrency, because all you do when you’re programming in parallel is fork off some threads, and then do something with their results when they’re done.  I’ve previously argued that this is the wrong way to think about parallelism (it’s really about cost), but let’s just let that pass.  It’s unarguably true that a parallel computation does consist of a bunch of, well, parallel computations.  So, the argument goes, it’s nothing but concurrency.  I’ve previously argued that that’s not a good way to think about concurrency either, but we’ll let that pass too.  So, the story goes, concurrency and parallelism are synonymous, and bullshitters like me are just trying to confuse people and make trouble.

Being the troublemaker that I am, my response is, predictably, no.  Just no.  Sure, it’s kinda sorta right, as I’ve already acknowledged, but not really, and here’s why: scheduling as you learned about it in OS class (for example) is an altogether different thing than scheduling for parallelism.  And this is the heart of the matter, from a “bottom-up” perspective.

There are two aspects of OS-like scheduling that I think are relevant here.  First, it is non-deterministic, and second, it is competitive.  Non-deterministic, because you have little or no control over what runs when or for how long.  A beast like the Linux scheduler is controlled by a zillion “voodoo parameters” (a turn of phrase borrowed from my queueing theory colleague, Mor Harchol-Balter), and who the hell knows what is going to happen to your poor threads once they’re in its clutches.  Second, and more importantly, an OS-like scheduler is allocating resources competitively.  You’ve got your threads, I’ve got my threads, and we both want ours to get run as soon as possible.  We’ll even pay for the privilege (priorities) if necessary.  The scheduler, together with the queueing theory behind it (he says optimistically), is designed to optimize resource usage on a competitive basis, taking account of quality of service guarantees purchased by the participants.  It does not matter whether there is one processor or one thousand processors: the schedule is unpredictable.  That’s what makes concurrent programming hard: you have to program against all possible schedules.  And that’s why you can’t prove much about the time or space complexity of your program when it’s implemented concurrently.
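
To see what “programming against all possible schedules” feels like, here is a deliberately broken little sketch (Python threads and a shared counter, nothing more): the read and the write are separate steps, so the final answer depends on where the scheduler happens to interleave the two threads.

```python
# A deliberately broken sketch: two threads race on a shared counter.
# Because the read and the write-back are separate steps, an unlucky
# context switch between them silently discards updates.  The final value
# depends on the schedule, which is exactly what makes reasoning about
# correctness (let alone time or space bounds) so hard.
import threading

counter = 0

def bump(n):
    global counter
    for _ in range(n):
        tmp = counter        # read ...
        counter = tmp + 1    # ... write back; another thread may have run in between

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # typically well below 200000, and usually different on each run
```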

Parallel scheduling is a whole ‘nother ball of wax.  It is (usually, but not necessarily) deterministic, so that you can prove bounds on its efficiency (Brent-type theorems, as I discussed in my previous post and in PFPL).  And, more importantly, it is cooperative in the sense that all threads are working together for the same computation towards the same ends.  The threads are scheduled so as to get the job (there’s only one) done as quickly and as efficiently as possible.  Deterministic schedulers for parallelism are the most common, because they are the easiest to analyze with respect to their time and space bounds.  Greedy schedulers, which guarantee to maximize use of available processors, never leaving any idle when there is work to be done, form an important class for which the simple form of Brent’s Theorem is obvious.
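
For reference, the simple form of the theorem can be stated as follows: writing W for the total work of the computation, S for its span (the length of the critical path in the dependency graph), and p for the number of processors, any greedy schedule of unit-cost tasks completes in a number of steps T_p satisfying

```latex
\max\!\left(\frac{W}{p},\; S\right) \;\le\; T_p \;\le\; \frac{W}{p} + S .
```

(More refined forms account for non-unit task costs and scheduler overheads.)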

Many deterministic greedy scheduling algorithms are known, of which I will mention p-DFS and p-BFS, which do p-at-a-time depth- and breadth-first search of the dependency graph, and various forms of work-stealing schedulers, pioneered by Charles Leiserson at MIT.  (Incidentally, if you don’t already know what p-DFS or p-BFS are, I’ll warn you that they are a little trickier than they sound.  In particular p-DFS uses a data structure that is sort of like a stack but is not a stack.)  These differ significantly in their time bounds (for example, work stealing usually involves expectation over a random variable, whereas the depth- and breadth-first traversals do not), and differ dramatically in their space complexity.  For example, p-BFS is absolutely dreadful in its space complexity.  For a full discussion of these issues in parallel scheduling, I recommend Dan Spoonhower’s PhD Dissertation.  (His semantic profiling diagrams are amazingly beautiful and informative!)
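
To get a feel for greedy p-at-a-time scheduling, here is a toy simulation, a rough sketch only: unit-cost tasks, a made-up four-node dependency graph, and a ready pool treated naively as a queue or a stack.  (The real p-DFS and p-BFS, and certainly work stealing, are subtler than this, as just noted.)

```python
# A toy simulation of greedy scheduling of a unit-cost dependency DAG.
# "bfs" treats the ready pool as a queue and "dfs" as a stack; this is only
# a caricature of p-BFS and p-DFS (real p-DFS does not use a plain stack).
from collections import deque

def greedy_schedule(deps, p, order="bfs"):
    """deps maps each task to the set of tasks it depends on.
    Returns the number of steps taken by a greedy p-processor schedule."""
    remaining = {t: set(d) for t, d in deps.items()}
    dependents = {t: set() for t in deps}
    for t, d in deps.items():
        for u in d:
            dependents[u].add(t)
    ready = deque(t for t, d in remaining.items() if not d)
    steps = 0
    while ready:
        steps += 1
        # Greedy: run up to p ready tasks this step, never idling a processor
        # while work is available.
        batch = [ready.popleft() if order == "bfs" else ready.pop()
                 for _ in range(min(p, len(ready)))]
        for t in batch:
            for u in dependents[t]:
                remaining[u].discard(t)
                if not remaining[u]:
                    ready.append(u)
    return steps

# A four-node diamond: a feeds b and c, which both feed d.
# Unit-cost tasks, so work W = 4 and span S = 3 (the path a -> b -> d).
dag = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
W, S, p = 4, 3, 2
T = greedy_schedule(dag, p)
assert T <= W / p + S       # the simple Brent-type bound above
print(T)                    # 3 steps: {a}, then {b, c}, then {d}
```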

So here’s the thing: when you’re programming in parallel, you don’t just throw some threads at some non-deterministic competitive scheduler.  Rather, you generate an implicit dependency graph that a cooperative scheduler uses to maximize efficiency, end-to-end.  At the high level you do an asymptotic cost analysis without considering platform parameters such as the number of processors or the nature of the interconnect.  At the low level the implementation has to validate that cost analysis by using clever techniques to ensure that, once the platform parameters are known, maximum use is made of the computational resources to get your job done for you as fast as possible.  Not only are there no bugs introduced by the mere fact of being scheduled in parallel, but even better, you can prove a theorem that tells you how fast your program is going to run on a real platform.  Now how cool is that?
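
Here is a minimal illustration of that last point, a rough sketch using Python’s standard process pool and nothing clever: the chunking determines an implicit dependency graph (independent partial sums, then one combining step), and the answer is the same whether the platform provides one worker or eight.

```python
# A minimal sketch of a deterministic parallel reduction: the meaning is
# independent of the number of workers; only the cost varies.
from concurrent.futures import ProcessPoolExecutor

def chunked(xs, k):
    """Split xs into k contiguous chunks: k independent partial sums,
    followed by a single combining step."""
    n = len(xs)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [xs[bounds[i]:bounds[i + 1]] for i in range(k)]

def parallel_sum(xs, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunked(xs, workers)))
    return sum(partials)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Integer addition is associative, so chunking cannot change the answer.
    assert parallel_sum(data, 1) == parallel_sum(data, 8) == sum(data)
```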

[Update: word-smithing.]


23 Responses to Parallelism and Concurrency, Revisited

  1. Sean Santos says:

    I’ve felt that parallelism and concurrency are conflated mostly because of two things:

    1) At some level, parallelism is generally implemented via multiple physical systems (e.g. cores, nodes in a cluster) running concurrently, performing operations that are made visible to each other at non-deterministic times.

    I think that the non-determinism is the important part here; it doesn’t seem to matter if it’s due to competitive scheduling, or to it being physically infeasible to sync up two systems in such a way as to guarantee deterministic interactions.

    2) People who have to think deeply about parallelism tend, disproportionately, to be those not shielded from that underlying non-determinism.

    That is, people who are working with a deterministic parallel toolkit, which successfully hides most or all of the dangers of the underlying concurrency while preserving efficiency, can often reason about the correctness of their code almost as if it were serial. (For those lucky people who have problems that scale linearly, they can reason about performance that way too.)

    On the other hand, people who don’t use such tools (including people who actually write libraries for parallel computation in the first place, or even lower-level bits like parts of operating systems) are often still working in an environment where they are exposed to race conditions and other symptoms of concurrency.

    (Of course, some people don’t have to work in such an environment, but do so anyway out of ignorance or cultural inertia. I work in computational physics, and there’s a well-known, tragic stereotype of the late-career scientist who hasn’t learned anything about parallel programming since skimming an MPI 1.0 manual two decades ago, who now mostly “develops” by throwing ancient, inscrutable Fortran code at stressed grad students who have almost no programming experience.)

    Tangents aside, I think that people draw a connection between concurrency and parallelism that’s hard to get rid of for social reasons. It’s not enough to announce the way of the future, where you do your dependency analysis and use a scheduler to churn out the details of a well-behaved parallel implementation, and thus all your parallel programs are guaranteed to be deterministic. You also have to get to a point where parallel programmers mostly live in that future, and as few as possible are spending time trying to re-invent special locking and communication systems from scratch, or some other tedious (for many of us) and error-prone (for all of us) endeavor.

    • Sean Santos says:

      Actually, to be a little more precise, I’m not sure that non-determinism is the problem with concurrency either, so much as “non-local” non-deterministic behaviors where the possible results can’t be easily enumerated. If you have many different threads each searching for one of many possible answers, it’s not necessarily problematic or counter-intuitive if the program finds a different valid answer each time you run it. A program that occasionally and non-deterministically selects one of several known, valid, well-defined execution paths is not difficult in the same way as a program where seemingly every thread is at risk of causing an unforeseen race condition any time it does anything with shared or communicated data used by any other thread at all.

    • By defining an appropriate equivalence relation on inputs and outputs (i.e., by using types), your first example can be seen as deterministic: equal arguments give equal results. It’s all in how you define equality.
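
      For instance, a crude sketch of such an equivalence for the search example above (the predicate is made up for illustration):

      ```python
      # Two results of "find an index of target in xs" count as equal exactly
      # when both satisfy the specification.
      def equivalent(xs, target, i, j):
          return xs[i] == target and xs[j] == target

      xs = (3, 7, 7, 1)
      # Two runs that happen to find different occurrences of 7 give "equal" answers:
      assert equivalent(xs, 7, 1, 2)
      ```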

    • Agreed. That’s why I write these posts!

  2. Just one little nitpick. When you say that parallelism is “co-operative” that leads me to think of co-operative threading, as in when you explicitly yield control to another thread (and store the current continuation, etc…). You might want to come up with a better way to phrase it, maybe “homogeneous” or something? Or perhaps there is a better word, I don’t know. Overall, great post, I agree with you on pretty much everything.

  3. minglai0930 says:

    You wrote above: “Parallelism never introduces bugs (the semantics is identical to the sequential execution)”. And Sec 40.5 of your PFPL 2/e book says: “Parallelism is about efficiency, not semantics; the meaning of a program is independent of whether it is executed in parallel or not.” Consider High Performance Fortran’s FORALL statement, say A=B+C, where A, B, and C are 1D arrays. The semantics of the FORALL statement is called copy-in-copy-out by some: B(1)+C(1), B(2)+C(2), etc. are performed, and only then are the results assigned to A(1), A(2), etc. Is this parallelism? If so, what sequential execution is its semantics identical to? Certainly not A(1)=B(1)+C(1), A(2)=B(2)+C(2), right? Doesn’t this parallel construct have its own (parallel) semantics? Isn’t this semantics a deterministic and cooperative scheduling of components — and what do you call it?
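
    For concreteness, the copy-in-copy-out rule can be modeled by a small sketch like this (a hypothetical forall_assign helper in Python, not HPF itself):

    ```python
    # A sketch of the copy-in-copy-out rule described above: evaluate every
    # right-hand side first, then perform all the assignments.
    def forall_assign(A, rhs):
        tmp = [rhs(i) for i in range(len(A))]   # "copy in": all reads happen first
        for i in range(len(A)):                 # "copy out": then all writes
            A[i] = tmp[i]

    # A = B + C, with A, B, and C distinct 1D arrays:
    B, C = [1, 2, 3], [10, 20, 30]
    A = [0, 0, 0]
    forall_assign(A, lambda i: B[i] + C[i])
    assert A == [11, 22, 33]

    # A self-referential case, where the order of reads and writes matters:
    A = [1, 2, 3, 4]
    forall_assign(A, lambda i: A[i - 1] if i > 0 else A[0])
    assert A == [1, 1, 2, 3]   # a naive in-place left-to-right loop gives [1, 1, 1, 1]
    ```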

    • minglai0930 says:

      My last sentence above should better be written as “Isn’t this semantics a deterministic composition of components — and what do you call it?” (You call non-deterministic composition of components “concurrency.”) And what’s the difference between “composition” and “scheduling”? (You also said – if I understand correctly – concurrency is non-deterministic and competitive scheduling.)

  4. minglai0930 says:

    Some people think that concurrency is logical parallelism, and parallelism is physical concurrency. What you have written in this blog obviously disagrees with that. Fine. But “parallelism is about exploiting computational resources to maximize efficiency, concurrency is about non-deterministic composition of components in a system”? One may set the priorities of threads in a competitive system to improve efficiency, may one not? In that case, it is about non-deterministic composition of components, right? So, it is concurrency, but it is also about efficiency? And, if concurrency is about non-deterministic composition of components, then what, if not parallelism, is about deterministic composition of components? Is it possible that parallelism and concurrency is about deterministic and non-deterministic of components, respectively, and both seek to improve efficiency?

    • minglai0930 says:

      The last sentence should read “Is it possible that parallelism and concurrency is about deterministic and non-deterministic composition of components, respectively, …”.

  5. Mark Hoemmen says:

    I’m walking into this conversation uninvited and without context… but it seems odd to me to think that the parallel “semantics is identical to the sequential execution,” since parallel algorithms so often differ from their sequential ones. Parallel game tree search is a discrete example — where one accepts that the answer might depend on the number of parallel execution units, and might be nondeterministic. Algebraic multigrid is an example from numerical algorithms — abstractly, it’s the same algorithm at scale, but in practice, running at scale calls for much more internal algorithmic tech (e.g., load balancing) than on a small number of processors. Floating-point arithmetic isn’t associative, so it takes special effort (about which few people care) to make sums deterministic, and even more effort to make them independent of the number of parallel execution units. I’m bringing up these examples to ask why it’s necessary even to force parallel semantics to be the same as sequential semantics. Parallel algorithm developers either don’t expect this, or enforce it in the algorithm itself. So, why should the programming model have to work so hard for something that algorithm developers don’t need?

    • In part the goal is to reform practice in parallel algorithms by bringing a PL perspective to it. The PRAM model, which bakes parameters such as the number of processors, or properties of the interconnect, into the algorithm, is not a good foundation on which to build. The whole business of (non)-associativity is a non-issue, despite conventional wisdom. If arithmetic is not associative, as is surely the case for floating point, then don’t pretend that it is. No one suggests ignoring semantics; rather just the opposite: the semantics should be independent of the exploitation of parallelism (you should not care whether p=1 or p=millions when designing and analyzing the algorithm). We now have decades of practical and theoretical developments (my most recent POPL paper with Blelloch, for example) showing that the separation I am advocating is workable, in fact superior, in both theory and practice.

    • Mark Hoemmen says:

      I get the feeling we are talking about very different things. I’m hearing from you that algorithm semantics should be completely independent of the number of processors P. What I’m saying, for example, is that algebraic multigrid implementations _intentionally_ behave differently for big P (say P >= 16k) than for small P, and that they need to do so in order to have _reasonable_ performance (forget linear scaling). This means that algorithm developers really don’t need semantics independent of P, even though their semantics takes P as a parameter (we don’t fix, say, P=16). As a result, I don’t think it’s necessary even to worry about the semantics changing with P, as long as there is some semantics that takes P as a parameter.

      I’ve actually read Prof. Blelloch’s book on data-parallel abstractions, though it’s a good 20+ years old at this point — I’ve heard of some more recent work of his but haven’t had a chance to look at it.

      I’m curious why you think people still care about the PRAM model. We numerical algorithms folks left the PRAM model behind in the ’80s. It says nothing about locality, and locality is everything for us.

      Right now we need help migrating our codes to use hybrid (MPI+X, where X = threads on CPUs or GPUs or …) parallelism. An area in which we need a LOT of help is ensuring that our implementations of programming models and basic thread-safe, thread-scalable data structures are correct. It seems like there is a huge gap between theory and practice in that regard, or even a gap between practice and available tools.

    • Guy and I just wrote a paper about locality showing that machine-based models are unnecessary even when considering memory hierarchy effects. I am not so much talking about the PRAM per se, but about machine models more generally, and a programming style that emphasizes low-level manipulation of the scheduling of both memory and processors. Yes, you can tune your program for 6 processors with thus-and-such an interconnect and thus-and-such cache behavior, for sure. But then next year what are you going to do?

      There’s a direct analogy with memory allocation. Used to be people absolutely insisted on manual memory allocation, but relatively few do any more. It’s much more efficient, both in run-time measurement and in programmer effort, to use automatic storage allocation, in almost every situation. (The above-mentioned paper shows that it remains true even when cache effects are considered.) It’s the same with multiprocessors. Used to be that people thought it was absolutely essential to allocate processors yourself, lay out memory yourself, control data flow yourself. But it’s not; it’s better to leave that to a scheduler, and to make a clean separation between the conceptual parallelism at the level of a language model and its implementation on a hardware platform.

      Garbage collection = scheduling of storage, parallelism = scheduling of processors. That’s the whole idea.

    • Regarding correctness, our approach exactly addresses your concern. One view is that one ought not be mucking about with MPI if what you want to do is implement a parallel algorithm. Leave that to the run-time. Then your algorithm can be assessed for correctness and efficiency, both in principle and as-coded, without any concern for how it is scheduled onto processors or how memory is managed. If you want an ordered pair, use a damn ordered pair, and stop thinking in terms of malloc’ing some data structure and initializing it. If you want to run two things in parallel, do so, and stop mucking about with concurrency. I realize these are heretical suggestions in some quarters, but so was GC twenty-five (or even less) years ago. The main idea is simply to push for better abstractions.

    • Mark Hoemmen says:

      I appreciate your taking the time to reply. We certainly care quite a bit about abstractions; we’ve spent a good two years working on a programming model for shared-memory data-parallel computation that abstracts away details of data layout (e.g., “struct of arrays” vs. “array of structs”) and parallel dispatch. (It’s called Kokkos; I can share papers and code if you’re interested.) It performs well on several kinds of shared-memory processors, including conventional CPUs, Nvidia GPUs, and the Intel Xeon Phi. We’ve worked hard on making special cases of automatic memory management fast, and providing useful data structures, like multidimensional arrays (with layout chosen invisibly to match the hardware) and thread-scalable hash tables.

      The reason I mention this is to emphasize that we also favor abstracting away things like the number of threads and the cache size. On the other hand, different kinds of algorithms perform differently at different levels of parallelism. Sparse matrix factorizations do well at small scale, where latencies are low. Domain decomposition does well at larger scales. The cross-over point depends on the number of processors. We still design algorithms for general processor counts P (we don’t hard-code “P=6” anywhere), but we do choose algorithms as a function of P — just like sort implementations switch from an N log N sort to an N^2 sort when N is small enough. “Cache-oblivious” algorithms also have switch-over points where it pays to stop recursing, though ideally, they aren’t sensitive to the switch-over point. I’m sure that you understand this; we’re almost certainly just talking past each other by using slightly different terms.
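
      For instance, a toy cross-over sort, with an arbitrarily chosen threshold, looks something like this:

      ```python
      # A toy hybrid sort (threshold chosen arbitrarily): merge sort for large
      # inputs, switching to insertion sort once the subproblem is small enough,
      # much as we switch algorithms as a function of the processor count P.
      CUTOFF = 32

      def insertion_sort(xs):
          for i in range(1, len(xs)):
              x, j = xs[i], i - 1
              while j >= 0 and xs[j] > x:
                  xs[j + 1] = xs[j]
                  j -= 1
              xs[j + 1] = x
          return xs

      def hybrid_sort(xs):
          if len(xs) <= CUTOFF:
              return insertion_sort(xs)
          mid = len(xs) // 2
          left, right = hybrid_sort(xs[:mid]), hybrid_sort(xs[mid:])
          merged, i, j = [], 0, 0
          while i < len(left) and j < len(right):
              if left[i] <= right[j]:
                  merged.append(left[i]); i += 1
              else:
                  merged.append(right[j]); j += 1
          return merged + left[i:] + right[j:]

      assert hybrid_sort(list(range(100, 0, -1))) == list(range(1, 101))
      ```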

    • Likewise, I appreciate your thoughtfulness, and I defer to your experience building parallel codes. It is far, far harder to do well than one would naively think. May I suggest posting a link to your Kokkos paper here?

      The main point of this and my previous post is to argue against the reduction of parallelism to concurrency, and to propose a point of view that distinguishes them. The point of view I’ve been arguing is more programmatic, and will surely require refinement to get it right. But I think we can agree that parallelism is not about locks and channels, it’s about efficiency.

    • Mark Hoemmen says:

      I’ll be happy to post a link to our latest public presentation (Nov 2013):

      http://trilinos.sandia.gov/events/trilinos_user_group_2013/presentations/2013-11-TUG-Kokkos-Tutorial.pdf

      Our colleagues have a great paper (on which I’m not a coauthor, alas!) which is currently under revision. Please contact me if you would like a draft. There are earlier papers, but that one gives a good sense for how to port one’s code.

      Kokkos aims for practicality — we have a huge investment in existing MPI-only codes, so we need a programming model that gives us a gradual porting path. The result might not always be pretty, but it works and is highly portable — all we need is a C++ compiler (1998 standard suffices), and we don’t rely on source-to-source translation tools that might not build on every platform. As a result, though, we could be making decisions that sacrifice generality, ease of use, or simplicity of proving correctness, because we’ve chosen a library solution in an unsafe, low-level language.

      I definitely agree that “parallelism is not about locks and channels, it’s about efficiency.” Very, very few people writing parallel code should ever think about things like locks, channels, or memory models. (I’ve watched our immensely talented developer Dan Sunderland develop a lock-free, wait-free, thread-scalable hash table — it’s incredibly hard!) One of the things that made MPI codes easy to write (for numerical simulations at least) was that few developers had to write parallel code. Usually an expert wrote the communication phase. Most developers would implement sequential physics kernels or preconditioners. We’re trying to build on this by giving them some parallel operations and data structures, and then showing them how to write “sequential” code (comparable to a loop body) that Kokkos will then run in parallel. They struggle a lot with the concepts, but do well once they “get it.” I prefer that they think really hard and then write working code, rather than that they get something incorrect up and running fast ;-)

      Thanks so much for patiently explaining your point of view to me! I’m glad to join the discussion, as late as I was :-)

  6. I agree with the idea of treating concurrency and parallelism as separate concepts. For those who can’t see the distinction, I offer node.js as a real world example. Node is clearly for concurrent programming but offers nothing for parallelism. It shares a single processor between concurrent tasks (ignoring a handful of threads in the runtime).

  7. I agree that the meaning of these words is important, and that they are used wildly inconsistently by different people in different settings.

    I think the most useful meaning for “parallel” is physical simultaneity. This includes the deterministic stuff that Bob talks about as well as plenty of other situations. Parallelism can happen at lots of different scales, from instructions in processor pipelines (basically invisible to programmers most of the time) to shared-memory multi-processors to GPUs/FPGAs to distributed clusters.

    I use “concurrent” as a more general term than “parallel” to mean multiple logical tasks overlapping in time. If you’re interested in concurrency but not parallelism, I tend to call that “interactivity”. Tools for making interactive software include event loops, cooperative threads, coroutines, etc. Interactivity can be done deterministically, too (as long as you include the arrival time of your real-world inputs as part of the “input”).

    Part of the reason that parallelism and interactivity get (wrongly, IMO) mushed together is that they both became much more important to Regular Joe programmers over the last decade or so. Parallelism, because of the rise of multi-cores. Interactivity, because of the rise of “internet-native” applications and the rich UIs in mobile devices.

    Getting back to more directly responding to Bob’s post, saying that “parallel” means “deterministic parallel” by definition feels wrong to me. I’m not 100% sure that’s what he’s asserting, though.

    And going off on my own hobby tangent… I think that memory isolation by default (i.e. using processes, not threads) is the key to making parallel programming palatable to Regular Joe programmers, not determinism.

    Ben

  8. The question for many application areas is a little different though: When is concurrent programming a helpful tool for parallel programming? For exploiting additional performance from multi-core systems?

    Most recently I’ve been looking at the use of Go for creating web servers. It seems pretty clear that concurrent programming is a good tool for building web servers on multi-core machines.

    How about GUI apps and frameworks for workstations, laptops and tablets? Not proven yet, but in my mind concurrent programming could help a lot. Especially in keeping programs simple and GUIs responsive.

    Of course, parallel programming of the type where a compute-bound calculation, like simulating hydrodynamics, benefits from tightly coupled “do-together” coding is different from building a system that is trying to be responsive for On-Line Transaction Processing (OLTP).
