I went to the keynote today by nVidia’s (and Stanford’s) William Dally. The topic was the end of what he called denial architecture and the rise of throughput computing. Denial architecture was so called because it was in denial about two things: it pretended that the world was sequential and that memory was flat. Throughput computing turned out to mean, surprise, surprise, the type of engines produced by nVidia.
As everyone knows, the performance of a single processor is now increasing only slowly due to power considerations. Instead we have to take our increased computing power in the form of additional processors. Throughput architectures like nVidia’s chips should continue to increase in performance at about 70% per year for the foreseeable future. That is what I like to call “core’s law”: the number of cores on a chip is increasing exponentially. It’s just not all that obvious yet since we are still on the shallow early part of the exponential curve.
Dally had some interesting analysis of the energy required to do a computation (such as a floating point multiply) versus the energy required to move the data a short distance, across the chip or off-chip. The bottom line is that computation is very cheap in both area and energy provided the data required is local, already close to the computational unit. When a lot of data flows through a pipelined computation, where the output from one stage is immediately consumed by the next, a cached memory is a particularly bad architecture, something I’d never realized before. Storing the intermediate value causes the cache line to be fetched from memory, then the value is read back exactly once, and finally the value, which will never be used again, is written back to main memory.
To take advantage of all this compute power, the programmer has to worry about managing the concurrency and about which memories are used to store which data. Programmers like to deal in abstractions, which is why sequential programming and flat memory work so well. There are only three numbers in computer science: 0, 1 and infinity. Numbers like 50 processors, each with 2K of memory, are not something the programmer wants to have to worry about.
But it seems there is no choice. The CUDA programming architecture gives a framework for writing these kinds of programs, and certainly some of the results on computationally expensive algorithms are impressive. Done right, it is a one-time cost to get back onto the performance curve as process generations unfold into the future. But in some ways it seems more like assembly language programming, since so many of the details of the hardware have to be taken into account. Chips like nVidia’s (and IBM’s Cell architecture used in the PlayStation 3) are notoriously hard to deal with because of this mismatch between the computational resources and the programmer’s mental model of what has to be done.
This stuff is now being taught in universities, so it will be interesting to see if a new generation of programmers who think this way find it any easier. It still seems really hard to take a lot of small computers and put them together so that they behave as one really huge one. But the payoff when it can be done is enormous. However, getting the software right continues to be the biggest problem.