I caught up with Dave Stewart and Skip Hovsmith of CriticalBlue (from Edinburgh, yay, one of my alma maters). They originally developed technology to extract algorithms from software and implement them in hardware gates, which met with some limited success. But now they have refocused their technology on the problem of taking legacy code and helping make it multicore-ready with their Prism tool.
They do this by running the code and storing a trace of what goes on for later analysis. Previously they did this only through simulation, but now they can also run the code on hardware boards. They don’t need a multicore CPU, just one with the same instruction set.
Having captured the trace, they can do “what if there were 4 cores, or 32?” analysis without running the code again. On typical code that wasn’t written with concurrency in mind, the answer is usually “not much would change” because there are too many dependencies. The example in the demo is JPEG compression. Most of the time is spent in the DCT (discrete cosine transform), but the code doesn’t parallelize due to data dependencies. It turns out the code was written in a way that makes sense in a single-processor world: allocate a single workspace, run the loop 32 times using that workspace, then dispose of it. Obviously, if you try to parallelize this, all iterations of the loop except one must block and wait for the workspace. If you move the workspace allocation into the loop (so that you allocate 32 of them), all the iterations can run in parallel.
They don’t actually change the code. It turns out users wisely don’t want a tool mucking around with their code without their understanding what is going on. Instead, Prism gives users the information to either move code onto multicore processors now or simply get it ready. When doing maintenance, a programmer using Prism can also remove unnecessary dependencies, so that the code will be able to take advantage of multicore processors (or of larger numbers of cores) when they become available.
I’ve said before that microprocessor vendors, especially Intel, completely underestimated the difficulty of programming multicore processors when power constraints forced them to deliver computing power in the form of more cores rather than faster clock frequencies. Everyone realizes now that it is one of the major challenges in software going forward.
At DAC the nVidia keynote claimed that Amdahl’s law wasn’t really a limitation any more. I’m not a believer in that position. The part of any program that cannot be parallelized sets a firm bound on how much speedup can be obtained. Even if only 5% of the code cannot be parallelized, which is optimistic, that sets a limit of 20X speedup no matter how many cores are available. CriticalBlue have an interesting tool for teasing out whatever parallelism is possible, often with relatively simple changes to the code, as in the example I described above.