Wednesday, July 29, 2009

Update from the Wednesday Keynote

Today's keynote by Bill Dally was, in many ways, distinctly different from yesterday's in that it got down and dirty with technical specifics from the very first slide and stayed that way through to the end. I'm not sure if everyone in the audience liked it, but I, for one, had a great time.

Bill Dally's argument basically came down to the following - current general-purpose processor architects are in a state of denial about two facts: (1) in the absence of frequency scaling, performance can only be extracted by explicitly parallel architectures; and (2) power efficiency can only be obtained by locality-aware distributed on-chip memories, as opposed to flat cache hierarchies.

Both arguments were well motivated. From a parallelism perspective, Prof. Dally had a nice plot that showed that improvements in processor throughput have traditionally been driven by three factors - improvement in gate delay, higher clock frequency through deeper pipelining, and architectural innovations that make use of the additional transistors available in every technology generation. Until 2001, this led to a 52% performance improvement every generation.

Here's the problem though - increasing clock frequency is no longer possible, and extracting parallelism using single-core superscalar out-of-order processors is running out of steam. In effect, all we're left with is the improvement in gate delay from technology scaling, which gives us only a 20% improvement in performance every technology generation!! Explicit parallelism is the only way to get us back to a 52% performance gain per generation.
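To get a feel for how quickly the 52% and 20% figures diverge, here is a minimal sketch that simply compounds each rate over several generations. The function name and the choice of generation counts are my own for illustration; only the two percentages come from the talk as reported above.

```python
def cumulative_speedup(rate_per_generation: float, generations: int) -> float:
    """Compound a fixed per-generation performance gain.

    rate_per_generation is a fraction (0.52 means 52%).
    Returns the total speedup relative to generation 0.
    """
    return (1.0 + rate_per_generation) ** generations


if __name__ == "__main__":
    # 0.52 and 0.20 are the per-generation gains quoted in the post;
    # the generation counts below are arbitrary illustration points.
    for n in (1, 3, 5, 10):
        fast = cumulative_speedup(0.52, n)
        slow = cumulative_speedup(0.20, n)
        print(f"{n:2d} generations: 52% -> {fast:6.1f}x, "
              f"20% -> {slow:5.1f}x, gap {fast / slow:4.1f}x")
```

After five generations the 52% curve is at roughly 8.1x while the 20% curve is at roughly 2.5x, so the shortfall compounds to more than a factor of three.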

From a power efficiency standpoint, moving data even 1 mm across a chip apparently requires many times more energy than the floating point operation that produced the data. Spatial locality is therefore the key to energy efficiency, making it imperative to store data close to where it is produced. This provides the motivation for distributed caches across a chip, as opposed to a single big one.
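The locality argument can be written down as a back-of-the-envelope model: wire energy grows with distance and with the number of bits moved, while the energy of the arithmetic is fixed. The sketch below is only that, a sketch. The function, its parameters, and especially the numbers passed to it are hypothetical placeholders, not figures from the talk.

```python
def movement_to_compute_ratio(distance_mm: float,
                              bits: int,
                              wire_pj_per_bit_mm: float,
                              flop_pj: float) -> float:
    """Ratio of data-movement energy to compute energy for one result.

    Assumes wire energy scales linearly with distance and bit count,
    which is a simplification chosen for illustration.
    """
    movement_pj = distance_mm * bits * wire_pj_per_bit_mm
    return movement_pj / flop_pj


if __name__ == "__main__":
    # HYPOTHETICAL placeholder energies, not measurements and not from the talk.
    WIRE_PJ_PER_BIT_MM = 0.5
    FLOP_PJ = 4.0
    for distance in (0.1, 1.0, 10.0):
        ratio = movement_to_compute_ratio(distance, 64, WIRE_PJ_PER_BIT_MM, FLOP_PJ)
        print(f"moving a 64-bit result {distance:4.1f} mm costs "
              f"{ratio:5.1f}x the operation that produced it")
```

Whatever the real constants are, the shape is the point: the ratio is proportional to distance, so keeping operands in a memory next to the unit that uses them shrinks the dominant term.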

All of this points the way to the kind of massively parallel GPU systems that NVIDIA produces as the platform of choice for the future - I would have been surprised if the VP of NVIDIA came up with any other conclusion (not that I disagree entirely with the conclusion)! Now I have a few quibbles with the argument that Prof. Dally presented, chiefly that he really only compared large unicore systems (from way back when) with massively parallel GPUs, but not really with the 4 or 8 core systems that Intel or AMD sell. Also, exposing parallelism to the programmer, as opposed to having the h/w extract it, brings to the fore questions about how it will affect programmer efficiency and the time required to debug and verify parallel programs. I'm not a s/w guy, but I imagine parallel programs are significantly more complicated to debug/verify than single-threaded ones.

What was really nice is that the talk also focused on the synergistic relationship between EDA and the types of massively parallel platforms that were described, i.e., how EDA benefits from such architectures and what EDA tools/vendors can do to aid the design of such systems.

A number of companies exhibiting at DAC have already embraced the former by offering EDA tools that have been architected for parallel execution. With regard to what EDA tools can do for NVIDIA GPUs, Prof. Dally pointed towards the need for tools that provide accurate power estimates at early stages in the design process and, in general, the need for more sophisticated low power design methodologies.

All in all, it was an hour well spent!

1 comment:

  1. I think Siddarth has done a great job of summarizing Bill's talk.

    However, a major point that Bill was trying to make, from a programmer’s perspective, may not have been emphasized enough.

    I think Bill was also emphasizing that the old machine abstraction presented to the programmer, that of a flat memory and a single execution thread, has to change. Bill argued that maintenance of the abstraction has become too power inefficient. He points to, for example, the relatively large amount of power consumed by a chip’s wires, in moving data intra-chip, compared with the power consumed by the destination points for such wires (e.g., an ALU). Bill also points out that the cache mechanism, a major way to support the illusion of flat memory, is highly power inefficient.

    Bill’s conclusion seems to be that programmers must embrace a new machine abstraction that has hierarchical memory (because that is what allows memory to be local to where it is used) and many threads. Bill feels that CUDA provides this kind of model in a programmer-friendly way.