Yet Another Reason Not To Be Lazy Or Imperative


In an earlier post I argued that, contrary to much of the literature in the area, parallelism is all about efficiency, and has little or nothing to do with concurrency.  Concurrency is concerned with controlling non-determinism, which can arise in all sorts of situations having nothing to do with parallelism.  Process calculi, for example, are best viewed as expressing substructural composition of programs, and have very little to do with parallel computing.  (See my PFPL and Rob Simmons’ forthcoming Ph.D. dissertation for more on this perspective.)  Parallelism, on the other hand, is captured by analyzing the cost of a computation whose meaning is independent of its parallel execution.  A cost semantics specifies the abstract cost of a program, and is validated by a provable implementation that transfers the abstract cost to a precise concrete cost on a particular platform.  The cost of parallel execution is neatly captured by a cost graph recording the dynamic data dependencies among subcomputations.  Details such as the number of processors or the nature of the interconnect are factored into the provable implementation, which predicts the asymptotic behavior of a program on a hardware platform based on its cost graph.  One advantage of cost semantics for parallelism is that it is easy to teach freshmen how to write parallel programs; we’ve been doing this successfully for two years now, with little fuss or bother.
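
To give a flavor of what the freshmen see, here is a minimal sketch in Standard ML, written for this post rather than taken from the course or from any paper, of a divide-and-conquer function whose cost graph is a balanced fork-join tree; the work and span recurrences in the comments fall directly out of a cost semantics for such a program.

    (* Illustration only: summing a binary tree in parallel.  The cost graph of
       sum is a balanced fork-join tree, so for a balanced tree with n leaves
       the work satisfies W(n) = 2 W(n/2) + O(1) = O(n) and the span satisfies
       S(n) = S(n/2) + O(1) = O(log n).  A provable implementation (say, via a
       greedy scheduler, in the manner of Brent) then bounds the running time
       on p processors by O(n/p + log n). *)

    datatype tree = Leaf of int | Node of tree * tree

    (* "par" stands in for a fork-join primitive that evaluates its two
       suspensions in parallel; here it is simulated sequentially so that the
       sketch runs in any SML implementation. *)
    fun par (f, g) = (f (), g ())

    fun sum (Leaf x) = x
      | sum (Node (l, r)) =
          let val (sl, sr) = par (fn () => sum l, fn () => sum r)
          in sl + sr end

    (* Example: sum (Node (Leaf 1, Node (Leaf 2, Leaf 3))) evaluates to 6. *)

The point is that nothing in sum mentions processors or scheduling; those details live entirely in the provable implementation.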

This summer Guy Blelloch and I began thinking about other characterizations of the complexity of programs besides the familiar abstractions of execution time and space requirements of a computation.  One important measure, introduced by Jeff Vitter, is called I/O Complexity.  It measures the efficiency of algorithms with respect to memory traffic, a very significant determinant of the performance of programs.  The model is sufficiently abstract as to encompass several different interpretations of I/O complexity.  Basically, the model assumes an unbounded main memory in which all data is ultimately stored, and considers a cache of size M = c × B, blocked into chunks of size B, that provides quick access to main memory.  The complexity of algorithms is analyzed in terms of these parameters, under the assumption that in-cache accesses are cost-free, so that the only significant costs are those incurred by loading and flushing the cache.  You may interpret the abstract concepts of main memory and cache in the standard way as a two-level hierarchy representing, say, on- and off-chip memory access, or instead as representing a disk (or other storage medium) loaded into memory for processing.  The point is that the relative cost of processing cached versus uncached data is huge, and worth considering as a measure of the efficiency of an algorithm.
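
As a concrete illustration of the model (mine, not Vitter's, though the bound is the standard one): scanning N consecutively stored items touches only about N/B blocks, whereas chasing N pointers scattered arbitrarily through memory may well miss the cache on every access.  Both are linear-time computations, but the scattered traversal can incur Θ(N) transfers against

    \mathrm{scan}(N) \;=\; \Theta\!\left(\left\lceil N/B \right\rceil\right)

for the contiguous one, a factor of B difference in I/O complexity with no difference at all in the usual time complexity.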

As usual in the algorithms world, Vitter makes use of a low-level machine model in which to express and evaluate algorithms.  Using this model Vitter obtains a lower bound for sorting in the I/O model, and a matching upper bound using a k-way merge sort, where k is chosen as a function of M and B (that is, it is not cache-oblivious in the sense of Leiserson et al.).  Although such models provide a concrete, and well-understood, basis for analyzing algorithms, we all know full well that programming at such a low level is at best a tedious exercise.  Worse, machine models provide no foundation for composition of programs, the single most important characteristic of higher-level language models.  (Indeed, the purpose of types is to mediate composition of components; without types, you’re toast.)
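
For reference, the sorting bound in question, stated here from memory rather than quoted from Vitter's paper, is

    \mathrm{sort}(N) \;=\; \Theta\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)

and the matching algorithm is a merge sort with fan-in on the order of M/B, so that one block from each run being merged just fits in the cache.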

The point of Guy’s and my work this summer is to adapt the I/O model to functional programs, avoiding the mess, bother, and futility of trying to work at the machine level.  You might think that it would be impossible to reason about the cache complexity of a functional program (especially if you’re under the impression that functional programming necessarily has something to do with Haskell, which it does not, though you may perhaps say that Haskell has something to do with functional programming).  Traditional algorithms work, particularly as concerns cache complexity, is extremely finicky about memory management in order to ensure that reasonable bounds are met, and you might reasonably suspect that it will ever be thus.  The point of our paper, however, is to show that the same asymptotic bounds obtained by Vitter in the I/O model may be met using purely functional programming, provided that the functional language is (a) non-lazy (of course), and (b) implemented properly (as we describe).

Specifically, we give a cost semantics for functional programs (in the paper, a fragment of ML) that takes account of the memory traffic engendered by evaluation, and a provable implementation that validates the cost semantics by describing how to implement it on a machine-like model.  The crux of the matter is to account for the cache effects that arise from maintaining a control stack during evaluation, even though the abstract semantics has no concept of a stack (it’s part of the implementation, and cannot be avoided).  The cost semantics makes explicit the reading and allocation of values in the store (using Felleisen, Morrisett, and Harper’s “Abstract Models of Memory Management”), and imposes enough structure on the store to capture the critical concept of locality that is required to ensure good cache (or I/O) behavior.  The abstract model is parameterized by the quantities M and B described above, but here interpreted as the number of objects in the cache and the neighborhood of an object in memory (the objects that are allocated near it, and that are therefore fetched along with it whenever the cache is loaded).
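
To convey the shape of the thing, here is a sketch in Standard ML of the sort of interface the abstract store might present; the names, types, and cost bookkeeping are my own invention for this post, and are not the formulation used in the paper.

    (* Illustrative interface only; not the paper's formulation.  The cache
       holds at most M objects, and a miss on an object also brings in its
       neighborhood of up to B nearby objects, mirroring the parameters M and B
       of the abstract model.  Each operation reports the number of cache
       misses, zero or one, that it incurs. *)
    signature ABSTRACT_STORE =
    sig
      type loc
      type 'a store

      (* empty (M, B): a store whose cache holds M objects,
         with neighborhoods of size B *)
      val empty : int * int -> 'a store

      (* alloc (v, s): allocate v adjacent to the most recent allocations *)
      val alloc : 'a * 'a store -> loc * 'a store * int

      (* read (l, s): fetch the object at location l, updating the cache state *)
      val read  : loc * 'a store -> 'a * 'a store * int
    end

The essential point is that allocation order determines locality: values allocated together are fetched together, which is exactly the structure the cost semantics is designed to expose.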

The provable implementation is given in two steps.  First, we show how to transfer the abstract cost assigned to a computation into the amount of memory traffic incurred on an abstract machine with an explicit control stack.  The key idea here is an amortization argument that allows us to obtain tight bounds on the overhead required to maintain the stack.  Second, we show how to implement the crucial read and allocate operations that underpin the abstract semantics and the abstract machine.  Here we rely on a competitive analysis of the ideal cache model, due to Sleator et al., and on an amortization of the cost of garbage collection in the style of Appel.  We also make use of an original (as far as I know) technique for implementing the control stack so as to avoid unnecessary interference with the data objects in cache.  The net result is that the cost semantics provides an accurate asymptotic analysis of the I/O complexity of a functional algorithm, provided that it is implemented in the manner we describe in the paper (which, in fact, is not far from standard implementation techniques, the only trick being how to manage the control stack properly).  We then use the model to derive bounds for several algorithms that are comparable to those obtained by Vitter using a low-level machine model.
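
The competitive result being leaned on is, if I recall its statement correctly, the classic Sleator-Tarjan paging bound: on any access sequence an LRU cache of k blocks incurs at most k/(k-h+1) times the misses of an optimal offline cache of h ≤ k blocks, up to an additive term independent of the sequence,

    F_{\mathrm{LRU}}(k) \;\le\; \frac{k}{k-h+1}\, F_{\mathrm{OPT}}(h) + O(1), \qquad h \le k,

so in particular an LRU cache of size 2M is within a factor of two of an ideal cache of size M, which is what lets an analysis carried out against the ideal cache transfer to a realizable replacement policy.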

The upshot of all of this is that we can reason about the I/O or cache complexity of functional algorithms, much as we can reason about the parallel complexity of functional algorithms, namely by using a cost semantics.  There is no need to drop down to a low-level machine model to get a handle on this important performance metric for your programs, provided, of course, that you’re not stuck with a lazy language (for those poor souls, there is no hope).


8 Responses to Yet Another Reason Not To Be Lazy Or Imperative

  1. Hey Bob! (nifty paper)
    Nifty! I actually had a soft pen-and-paper proof a few months ago that that sort of well-defined GC traversal order enables good memory locality. It’s a really *easy* (super scare quotes) property to prove informally :), and at the time I was really shocked that there’s not been any work on that insight. Got a bit sidetracked, so I never finished formalizing it :). I’m glad someone’s up and done this.

    Either way, it’s a grand first step toward making it tractable to reason about how to get optimal memory locality while writing high-level code.

    I’m presently having a bit of fun building up intuition in this space as I incrementally write some high-performance, functional-programming-style tools for numerical computation. It’s an area in desperate need of good PL/FP love, and I’m trying to do my part.

    -Carter

  2. chrisdone says:

    s/one might reasonable/one might reasonably

  3. andrejbauer says:

    This is a sociological remark. You PL types always say things like “provable implementation”. Nobody in logic would ever say that. I suspect you mean “implementation which was proved correct by cool people”. I can live with the contraction “proved implementation”, but not with “provable” in place of “proved”. So what do you mean when you say “provable X” with “X” a noun?

    • Oh, don’t get me started on this. I’ve reluctantly adhered to the pre-existing terminology, despite reservations. If you or anyone can think of a better name, I’m eager to learn it. Admittedly “proved implementation” is more accurate than “provable implementation”, but it’s not much of an improvement. I find it ridiculous that we have to emphasize that we can actually prove the validity of our claims, in contrast to standard practice of, at most, showing a few well-chosen histograms and declaring victory.

    • Mike Shulman says:

      I suppose “provably correct implementation” is too many syllables?

    • It doesn’t exactly slide off the tongue, but it’s definitely better grammar.

    • One could say “provably efficient implementation”, which would be appropriately descriptive, but it’s a bit of a mouthful too (and I just hate having to qualify with “provably” to basically say “we’re serious about this, not some jokers with histograms”).

  4. Is “Improvement in a lazy context: An operational theory for call-by-need”, http://dl.acm.org/citation.cfm?id=292547 relevant here? We were using cost semantics to give a theory to call-by-need (and so we tracked lookup/update in the heap instead of I/O reads/writes), but many of the same contextual recursion principles should hold (a context lemma being key to our approach).
