Earlier I discussed power in integrated circuits. As most people know, power is the main reason that PC processors have had to move away from single-core chips with increasingly fast clock rates and towards multi-core chips. Embedded chips are starting to go in the same direction too; modern cell-phones often contain three or more processors, even without counting special-purpose ones dedicated to tasks like MP3 decode. ARM's Cortex line now includes multicore parts.
Of course this moves the problem from the IC companies (how to design increasingly fast processors) to the software side (how to write code for multi-core chips). The IC companies have completely underestimated the difficulty of this.
The IC side of the house has assumed that this is a problem that just requires some effort for the software people to write the appropriate compilers or libraries. But in fact this has been a research problem for over forty years: how do you build a really big powerful computer out of lots of small cheap ones? It is unlikely to be solved immediately, although clearly a lot more research is going on in this area now.
There are some problems, traditionally known as “embarrassingly parallel,” which are fairly easy to handle. Far from being embarrassing, the parallelism is so simple that it is easy to make use of large numbers of processors, at least in principle. Problems like ray-tracing, where each pixel is calculated independently, are the archetypal example. In fact nVidia and ATI graphics processors are essentially multi-core processors for calculating how a scene should be rendered (although they don’t use ray-tracing, they use cheaper polygon-based algorithms). In the EDA world, design rule checking and RET decoration are algorithms that are (fairly) easy to parallelize: divide the chip up into lots of areas, run the algorithm on each area in parallel, and take a lot of care stitching the bits back together again at the end.
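The structure of an embarrassingly parallel job can be sketched in a few lines of Python: every work item (a pixel here, or a tile of the chip in the DRC case) is computed independently, so the whole job is just a map over a pool of workers. The `shade` function is a made-up stand-in for whatever per-pixel computation a real renderer performs, and a thread pool is used only to keep the sketch self-contained; real CPU-bound Python work would use a process pool to get true parallelism past the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

def shade(pixel):
    # Stand-in for a real per-pixel computation (e.g. tracing one ray).
    x, y = pixel
    return (x * 31 + y * 17) % 256

def render(width, height, workers=4):
    # Every pixel is independent, so the image is just a map over a
    # worker pool; no worker ever needs to talk to another one.
    pixels = [(x, y) for y in range(height) for x in range(width)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(shade, pixels))

image = render(64, 64)
```

The crucial property is the absence of communication: the only synchronization is collecting the results at the end, which is why such problems scale to large processor counts.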
But most problems are more like Verilog simulation. It is hard to get away from having a global timebase, and then the processors have to run in lock-step, and the communication overhead is a killer. Yes, in limited cases processors can run ahead somewhat without violating causality (such as simulating fast processors connected by slow Ethernet) and so reduce the amount of required synchronization, but the typical chip is not like that.
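The “run ahead without violating causality” idea can be made concrete with the core rule of conservative parallel discrete-event simulation: a partition may safely advance its local clock to the earliest time at which any neighbour could still send it an event. The `lookahead` parameter models a known minimum propagation delay (the slow Ethernet in the example above); the function name is my own, not from any particular simulator.

```python
def safe_horizon(neighbor_times, lookahead):
    # A partition may simulate up to the minimum neighbour clock plus
    # the minimum link delay (lookahead) without any risk of later
    # receiving an event stamped in its past.
    return min(neighbor_times) + lookahead

# Fast processors joined by a slow link: a large lookahead lets each
# side run far ahead before it must synchronize again.
print(safe_horizon([1000, 1200], lookahead=5000))  # prints 6000
# The same partitions joined by an on-chip wire: almost no headroom,
# so the simulators are effectively back in lock-step.
print(safe_horizon([1000, 1200], lookahead=1))     # prints 1001
```

On a typical chip the delays between blocks are tiny relative to the event rate, so the horizon is always close, and the synchronization cost dominates.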
Years ago Gene Amdahl noted that the amount of speedup that you can get by building a parallel computer of some sort is limited not by what can be made parallel but by what cannot. If, say, 10% of the code cannot be parallelized, then even in the limiting case where the parallel code finishes instantaneously, the maximum speedup is just 10 times. This has come to be known as Amdahl’s law. So that is the first limitation on how to use multi-core. To use hundreds of cores effectively, the fraction of code that cannot be parallelized must be tiny.
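Amdahl’s law is easy to state as a formula: if a fraction s of the run-time is inherently serial, then on n processors the speedup is 1 / (s + (1 − s)/n), which approaches 1/s no matter how large n gets. A few lines of Python make the ceiling concrete:

```python
def amdahl_speedup(serial_fraction, n_procs):
    # The serial part runs at full cost; only the parallel part is
    # divided across the n processors.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

# With 10% serial code, piling on processors quickly stops paying off:
print(amdahl_speedup(0.10, 2))      # ~1.82x
print(amdahl_speedup(0.10, 100))    # ~9.17x
print(amdahl_speedup(0.10, 10000))  # ~9.99x, never reaches 10x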
The next problem is that it is not possible to divide up the problem at compile time and capture that decision in the binary. If you have a loop that you are going to go around 1000 times to calculate something for 1000 elements, then one way is to unroll the loop, spawn the calculation simultaneously on 1000 threads on 1000 processors and accumulate the results. If the amount of calculation is very large compared to the overhead of spawning and accumulating, this might be good. But if you only have two processors, then the first two threads will go ahead and the other 998 will block waiting for a processor to become available. All the overhead of spawning, accumulating and blocking is just that: overhead that slows down the overall computation. The mapping of computation to processors must therefore be done, at least partially, at run-time, when the available resources are known.
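A minimal sketch of that run-time decision, using nothing beyond the Python standard library: instead of unrolling the loop into 1000 threads, the work is split into as many chunks as there are cores, discovered at run-time with `os.cpu_count()`. The `compute` function is a placeholder for the real per-element calculation, and threads stand in for whatever worker mechanism a real system would use.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def compute(element):
    # Stand-in for the per-element calculation inside the loop.
    return element * element

def parallel_map(elements):
    # Decide the degree of parallelism at run-time, not compile time:
    # one chunk per available core, so no worker is ever spawned only
    # to block waiting for a processor.
    workers = os.cpu_count() or 1
    size = (len(elements) + workers - 1) // workers
    chunks = [elements[i:i + size] for i in range(0, len(elements), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda chunk: [compute(e) for e in chunk], chunks)
    return [result for part in parts for result in part]

squares = parallel_map(list(range(1000)))
```

On a 2-core machine this spawns two workers; on a 64-core machine, sixty-four, with no change to the binary. That is exactly the decision that cannot be baked in at compile time.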
The other big problem is that most code already exists in libraries and in legacy applications. Even if a new programming paradigm is invented, it will take a long time to be universally used. Adding a little multi-threading is a lot simpler than completely re-writing Design Compiler in a new unfamiliar language, which is probably at least a hundred man-years of effort even given that the test suites already exist.
There are some hardware issues too. Even if it is possible to use hundreds of cores, the memory architecture needs to support enough bandwidth of the right type. Otherwise most of the cores will simply sit waiting for relatively slow access to the main memory of the server, as shown in more detail in this IEEE Spectrum article. Of course it is possible to give each processor local memory, but for that to be effective those local memories cannot be kept coherent. And programming parallel algorithms in that kind of non-coherent environment is known to be something only gods should attempt.
I’ve completely ignored the fact that writing parallel code correctly is itself known to be a hard problem, and it is even harder when multiple processors really are involved, not just the pseudo-parallelism of threads or processes time-sliced on one core. As it happens, despite spending my career in EDA, I’ve got a PhD in operating system design, so I speak from experience here. Threads and locks, monitors, message passing, wait and signal: all that stuff we use in operating systems is not the answer.
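As a reminder of why this is hard, even the simplest shared counter needs a lock: the increment below is a read-modify-write, and without the lock two threads can read the same old value and silently lose an update. This is a generic illustration of the threads-and-locks style, not anything specific to EDA code.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        # 'counter += 1' is a read-modify-write; without the lock, two
        # threads can interleave it and increments get lost.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # always 40000 with the lock; unpredictable without it
```

And this is the trivial case. Scale it up to thousands of locks protecting interrelated data structures across physically separate processors and the correctness problem becomes the dominant cost.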
Even if the programming problem is solved with clever programming languages, better education and improved parallel algorithms, the fundamental problems remain: Amdahl’s law limiting speedup, the bottleneck moving from the processor to the memory subsystem, and the need to handle parallelism dynamically without introducing significant overhead. These are all hard problems to overcome. Meanwhile, although the numbers are small now, the number of cores per die is increasing exponentially; the curve just hasn’t got steep yet.
Our brains manage to be highly parallel though, and without our heads melting, so there is some sort of existence proof of what is possible. But, on the downside, we are really slow at calculating most things and also pretty error-prone.