<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>edagraffiti &#187; methodology</title>
	<atom:link href="http://edagraffiti.com/?cat=20&#038;feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://edagraffiti.com</link>
	<description>EDA, technology, semiconductor</description>
	<lastBuildDate>Mon, 14 Nov 2011 02:32:56 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6</generator>
		<item>
		<title>3D chips: design tools</title>
		<link>http://edagraffiti.com/?p=1016</link>
		<comments>http://edagraffiti.com/?p=1016#comments</comments>
		<pubDate>Tue, 18 Jan 2011 23:20:44 +0000</pubDate>
		<dc:creator>paulmcl</dc:creator>
				<category><![CDATA[methodology]]></category>
		<category><![CDATA[semiconductor]]></category>

		<guid isPermaLink="false">http://edagraffiti.com/?p=1016</guid>
		<description><![CDATA[One of the open areas for 3D chip design is what the design methodology needs to be and what design tools will be required. A more fundamental issue is going to be the business model to pay for tool development. &#8230; <a href="http://edagraffiti.com/?p=1016">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://edagraffiti.com/wp-content/uploads/2011/01/snps3d.jpg"><img class="alignleft size-full wp-image-1022" title="snps3d" src="http://edagraffiti.com/wp-content/uploads/2011/01/snps3d.jpg" alt="" width="250" height="121" /></a>One of the open areas for 3D chip design is what the design methodology needs to be and what design tools will be required. A more fundamental issue is going to be the business model to pay for tool development. At least in the short term, only a few 3D designs are going to be done and so a conventional EDA &#8220;build the tools and wait for everyone to do 3D designs&#8221; is not going to work. In fact Antun Domic of  Synopsys, presenting at the 3D conference, explicitly pointed this out: EDA works economically when a large number of people use the same methodology so that the methodology can be wrapped up in the tools and sold in volume. Wally Rhines at the EDAC CEO forecast meeting said the same thing: that if semiconductor vendors expected to get 3D tools without paying incrementally for them then it was unlikely to happen.</p>
<p>IBM didn&#8217;t really talk about the design tools needed to design their 3D server chip with the processor on top and the memory underneath. But clearly designing a huge DRAM with holes for TSVs punched through it all over the place and the interconnect full of decoupling capacitors wasn&#8217;t done by hand.</p>
<p>One talk was by Vassilios Gerousis from Cadence and Damien Riquet from ST (calling in from France at 2 in the morning his time) about a 2.5D chip they had designed. They were using conventional Cadence tools somewhat unconventionally to get the job done, since it appeared there was no real explicit 3D support.</p>
<p>The first challenge in 3D design is to be able to analyze different approaches for efficiency: routing congestion, TSV density, microbump density, impact on power supply (IR drop etc). Unfortunately this is not straightforward, since the tools that would do this directly, multi-floor floorplanners, do not yet exist. The next best thing is to build the design and then analyze it. At this point in the technology there is not a lot of flexibility about what goes on which die, since usually the die are in different processes (DRAM, analog, RF, CCD etc). But eventually, when designs stack multiple die and blocks of IP have several layers where they might reside, automation will presumably be required here, just as floorplanning has become essential for regular 2D design.</p>
<p><a href="http://edagraffiti.com/wp-content/uploads/2011/01/cdns3d.jpg"><img class="alignright size-thumbnail wp-image-1021" title="cdns3d" src="http://edagraffiti.com/wp-content/uploads/2011/01/cdns3d-150x88.jpg" alt="" width="150" height="88" /></a>Cadence/ST created a tool to place the TSVs and the microbumps in regular arrays. Experience has shown that this tends to work better than putting them down randomly since you have some flexibility to design blocks with holes that have more than one place they can be located. They seemed to use a mixture of IC routing and custom design tools to design the interposer. They could then use conventional analysis tools and look at the system as whole from the point of view of power-supply analysis, static timing, thermal effects and so on. OpenAccess served as the link between the various tools, in particular allowing both digital tools (P&amp;R) and custom tools (layout) to work on the same data.</p>
<p>I think the most interesting thing about this is that, at least for a relatively simple design with a non-active silicon interposer, it was possible to get the design done without requiring a complete portfolio of new tools.</p>
<p>The biggest areas of opportunity (and biggest may be a relative term since it is not clear how big the market is for any of these) are floorplanning and general validation of the design (do all microbumps line up, are the voltage levels between die OK and so on). All the big EDA companies have some sort of place &amp; route and floorplanning and might decide to play here. Then there is thermal analysis where a company like Gradient probably has an opportunity to extend their technology into another dimension. General analysis of the electrical aspects of the design could be an opportunity for Apache. Right now the biggest risk for an EDA company is likely to be over-investment rather than missing the boat. 3D ICs are coming but there is not going to be an instantaneous switch with thousands of 3D design starts any time soon, if ever.</p>
]]></content:encoded>
			<wfw:commentRss>http://edagraffiti.com/?feed=rss2&#038;p=1016</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>3D chips: IBM server</title>
		<link>http://edagraffiti.com/?p=1007</link>
		<comments>http://edagraffiti.com/?p=1007#comments</comments>
		<pubDate>Sat, 15 Jan 2011 22:26:40 +0000</pubDate>
		<dc:creator>paulmcl</dc:creator>
				<category><![CDATA[methodology]]></category>
		<category><![CDATA[semiconductor]]></category>

		<guid isPermaLink="false">http://edagraffiti.com/?p=1007</guid>
		<description><![CDATA[The opening keynote of the 3D conference that I went to was by Subramanian Iyer of IBM. He described work they were doing on fully 3D chips for servers. The approaches I&#8217;ve already talked about don&#8217;t really work for the &#8230; <a href="http://edagraffiti.com/?p=1007">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://edagraffiti.com/wp-content/uploads/2011/01/ibmdie.jpg"><img class="alignleft size-full wp-image-1010" title="ibmdie" src="http://edagraffiti.com/wp-content/uploads/2011/01/ibmdie.jpg" alt="" width="250" height="211" /></a>The opening keynote of the 3D conference that I went to was by Subramanian Iyer of IBM. He described work they were doing on fully 3D chips for servers. The approaches I&#8217;ve already talked about don&#8217;t really work for the highest performance end of the spectrum.</p>
<p>Dramatic performance gains from architecture or pushing up clock rate are increasingly unlikely. Moving to a new process node brings performance gains but, of course, is enormously expensive. One of the remaining areas for improving system performance is increasing the size of the cache, and increasing the bandwidth of the cache/processor interface.</p>
<p>Flipping a memory over onto the processor is fine if the processor doesn&#8217;t need a heatsink. The highest performance server chips are now dissipating 150-200W, so they need to be on top. In addition, it is almost impossible to distribute the power across a chip like that. At 1V, 200W means 200A, which you can&#8217;t even get into the chip through conventional wire-bonding. And further, the dynamic fluctuations of processor load cause enormous voltage dips which eat up a lot of the potential performance if they are not eliminated.</p>
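<p>The arithmetic is worth spelling out; here is a back-of-the-envelope sketch in Python (the 5% droop budget is my assumption, not a number from the talk):</p>
<pre><code># I = P / V, and the total supply-path resistance allowed for a
# given IR-drop budget. The 5% droop budget is an assumption.
P, V = 200.0, 1.0        # watts, volts
I = P / V                # 200 A into the die
R_max = (0.05 * V) / I   # resistance budget for a 5% droop
print(f"I = {I:.0f} A, R budget = {R_max * 1e6:.0f} micro-ohms")
# -> I = 200 A, R budget = 250 micro-ohms. Bond wires carry on the
#    order of an amp each, so 200 A would also need hundreds of
#    wires dedicated to power alone.
</code></pre>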
<p>Using a silicon interposer to connect memory to the processor doesn&#8217;t work either since there is just too much long interconnect. The only solution is to build a true 3D chip with the processor on top, the memory underneath and with TSVs going through the memory die to carry all the power and I/O to the processor.</p>
<p>Although SRAM is faster than DRAM, DRAM is so much smaller and dissipates so much less power that in this setup it is preferable, not least because you can have a bigger cache (the memory and processor dice need to be about the same size). The small size wins back enough performance that it is a wash with SRAM.</p>
<p><a href="http://edagraffiti.com/wp-content/uploads/2011/01/ibm3d.jpg"><img class="alignright size-thumbnail wp-image-1008" title="ibm3d" src="http://edagraffiti.com/wp-content/uploads/2011/01/ibm3d-150x78.jpg" alt="" width="150" height="78" /></a>The picture to the right shows the basic architecture. On the top is the heatsinked processor die. There are no TSVs through the processor die. It is microbumped to attach to the memory die beneath it and to TSVs that go all the way through to carry the 200A of current that it requires direct from the package substrate. The processor/memory microbumps are at at a 50um pitch. The memory die to package bumps are 186um. The memory die has to be thinned in order to get the TSVs all the way through.</p>
<p><a href="http://edagraffiti.com/wp-content/uploads/2011/01/ibmcaps.jpg"><img class="alignright size-full wp-image-1009" title="ibmcaps" src="http://edagraffiti.com/wp-content/uploads/2011/01/ibmcaps.jpg" alt="" width="326" height="243" /></a>In addition, the memory die is choc-a-bloc with decoupling capacitors to reduce transient voltage droops. This allows for increased processor performance without having to give up area on the processor chip for the capacitors since they are in the metal on the memory die. At the keynote, there was a video showing the dramatic difference to the voltage across the chip with and without this approach to supplying power.</p>
]]></content:encoded>
			<wfw:commentRss>http://edagraffiti.com/?feed=rss2&#038;p=1007</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>2½D: interposers</title>
		<link>http://edagraffiti.com/?p=999</link>
		<comments>http://edagraffiti.com/?p=999#comments</comments>
		<pubDate>Thu, 13 Jan 2011 21:29:08 +0000</pubDate>
		<dc:creator>paulmcl</dc:creator>
				<category><![CDATA[methodology]]></category>
		<category><![CDATA[semiconductor]]></category>

		<guid isPermaLink="false">http://edagraffiti.com/?p=999</guid>
		<description><![CDATA[There are two classes of true 3D chips which are being developed today. The first is known as 2½D where a so-called silicon interposer is created. The interposer does not contain any active transistors, only interconnect (and perhaps decoupling capacitors), &#8230; <a href="http://edagraffiti.com/?p=999">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://edagraffiti.com/wp-content/uploads/2011/01/25dchip.jpg"><img class="alignleft size-thumbnail wp-image-1000" title="25dchip" src="http://edagraffiti.com/wp-content/uploads/2011/01/25dchip-150x78.jpg" alt="" width="150" height="78" /></a>There are two classes of true 3D chips which are being developed today. The first is known as 2½D, where a so-called silicon interposer is created. The interposer does not contain any active transistors, only interconnect (and perhaps decoupling capacitors), thus avoiding the issue of threshold shift mentioned in my previous post. The chips are attached to the interposer by flipping them, so that the active chips do not require any TSVs to be created. True 3D chips have TSVs going through active chips and, in the future, have the potential to be stacked several die high (first for low-power memories, where the heat and power distribution issues are less critical).</p>
<p>The active die themselves do not have any TSVs; only the interposer does. This means that the active die can be manufactured without worrying about TSV exclusion zones or threshold shifts. They need to be microbumped of course, since they are not going to be conventionally wire-bonded out. The picture at the head of this post shows (not to scale, of course) the architecture; click on the thumbnail for a larger image.</p>
<p>The image shows two die bonded to a silicon interposer using microbumps. There are metal layers of interconnect on the interposer, and TSVs to get through the interposer substrate to be able to bond with flip-chip bumps to the package substrate. Flip-chip bumps are similar to microbumps but are larger and more widely spaced.</p>
<p>So is anyone using this in production yet? It turns out that Xilinx is using this for their Virtex-7 FPGAs. They call the technology &#8220;stacked silicon interconnect&#8221; and claim that it gives them twice the FPGA capacity at each process node. This is because very large FPGAs only become viable late after process introduction when a lot of yield learning has taken place.<a href="http://edagraffiti.com/wp-content/uploads/2011/01/xilinx3d.jpg"><img class="alignright size-thumbnail wp-image-1001" title="xilinx3d" src="http://edagraffiti.com/wp-content/uploads/2011/01/xilinx3d-150x97.jpg" alt="" width="150" height="97" /></a> Earlier in the lifetime of the process, Xilinx have calculated, it makes more sense to create smaller die and then put several of them on a silicon interposer instead. It ends up cheaper despite the additional cost of the interposer because such a huge die would not yield economic volumes.</p>
<p>The Xilinx interposer consists of 4 layers of metal in a 65nm process on a silicon substrate. TSVs through the interposer allow this metal to be connected to the package substrate. Microbumps allow 4 FPGA die to be flipped and connected to the interposer. See the picture to the right. An additional advantage of the interposer is that it makes power distribution across the whole die simpler.</p>
<p>This seems to be the only design in high volume production; at least, at the conference this was the example that every speaker seemed to use.</p>
]]></content:encoded>
			<wfw:commentRss>http://edagraffiti.com/?feed=rss2&#038;p=999</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Going up: 3D ICs and TSVs</title>
		<link>http://edagraffiti.com/?p=994</link>
		<comments>http://edagraffiti.com/?p=994#comments</comments>
		<pubDate>Tue, 11 Jan 2011 21:04:04 +0000</pubDate>
		<dc:creator>paulmcl</dc:creator>
				<category><![CDATA[methodology]]></category>
		<category><![CDATA[semiconductor]]></category>

		<guid isPermaLink="false">http://edagraffiti.com/?p=994</guid>
		<description><![CDATA[This is the first of several posts about 3D ICs. I attended the 3D architectures for semiconductor integration and packaging conference just before Christmas. I learned a lot but I should preface any remarks with the disclaimer that I&#8217;m not &#8230; <a href="http://edagraffiti.com/?p=994">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://edagraffiti.com/wp-content/uploads/2011/01/3dphy.jpg"><img class="alignleft size-thumbnail wp-image-997" title="3dphy" src="http://edagraffiti.com/wp-content/uploads/2011/01/3dphy-150x70.jpg" alt="" width="150" height="70" /></a>This is the first of several posts about 3D ICs. I attended the <em>3D architectures for semiconductor integration and packaging</em> conference just before Christmas. I learned a lot, but I should preface any remarks with the disclaimer that I&#8217;m not an expert on the subject; I now know enough to be dangerous. But most people are not experts on this subject, so I think it is worth a high-level overview of what is happening.</p>
<p>The first thing is that 3D chips do seem to be happening. There are designs in production, there are lots of pilot projects and the ecosystem (in particular, who does what) seems to be starting to fall into place.</p>
<p>The first approach to talk about is flipping one chip and attaching it to the top of another. This is done by creating bonding areas on each chip, growing (usually copper) microbumps to create die-die interconnect at a pitch of approximately 50um. The big user of this technology is digital camera chips. The CCD image sensor is actually thinned to the point that it is transparent to light and then attached to the image processing chip. The light from the camera lens passes through the silicon to the CCD unobstructed by interconnect etc, which is all on the other side of the sensor.</p>
<p><a href="http://edagraffiti.com/wp-content/uploads/2011/01/logmem.jpg"><img class="alignright size-full wp-image-995" title="logmem" src="http://edagraffiti.com/wp-content/uploads/2011/01/logmem.jpg" alt="" width="300" height="109" /></a>This approach is also used for putting a flipped memory chip onto a logic chip (see picture). It is not well-known, but the Apple A4 chip is built like this, with memory on top of the processor/logic chip. There are now standardization committees working on the pattern of microbumps to use for DRAMs (analogous to a standard pinout for DRAMs) so that DRAM from different manufacturers should be interchangeable. Unlike in the picture, the bumps are all towards the center of the die so that the pattern is unaffected by the actual die size, which may differ between manufacturers and between different generations of design.</p>
<p>Although this technology is formally 3D, since there are two chips, it doesn&#8217;t require any connections through any chips and is a sort of degenerate case.</p>
<p>You probably have heard that the key technology for real 3D chips is the through-silicon-via (TSV). This is a via that goes from the front side of the wafer (typically connecting to one of the lower metal layers) through the wafer and out the back. The TSV is typically about 5-10um across and goes about 8-10 times its width in depth, so 50-100um. A hole is formed into the wafer, lined with an insulator and then filled with copper. Finally the wafer is thinned to expose the backside. Note that this means that the wafer itself ends up 50-100um thick. Silicon is brittle so one of the challenges is handling wafers this thin both in the fab and when they have to be shipped to an assembly house. They need to be glued to some more robust substrate (glass or silicon) and eventually separated again during assembly. The wafer is thinned using CMP (chemical mechanical polishing, similar to how planarization is done between metal layers in a normal semiconductor process) until the TSVs are almost exposed. More silicon is then etched away to reveal the TSVs themselves.</p>
<p><a href="http://edagraffiti.com/wp-content/uploads/2011/01/samsungtsv.jpg"><img class="alignright size-thumbnail wp-image-1004" title="samsungtsv" src="http://edagraffiti.com/wp-content/uploads/2011/01/samsungtsv-150x150.jpg" alt="" width="150" height="150" /></a>The picture to the right (click for a bigger image) shows Samsung&#8217;s approach. FEOL (which, for you designers, means front-end of line which means transistors and is nothing to do with front-end design) is done first. So the transistors are all created. Then the TSVs are formed. Then BEOL (which means back-end of line which means interconnect and is nothing to do with back-end design). After the interconnect is done then the microbumps are created. The wafer is glued to a glass carrier. The back is then ground down, a passivation layer is applied, this is etched to expose the TSVs and then micropads are created. This approach is known as TSVmiddle since the TSVs are formed between transistors and interconnect. There is also TSVfirst (build them before the transistors) and TSVlast (do them last and drill them through all the interconnect as well as the substrate).</p>
<p>There are two design issues with TSVs. First is the exclusion area around them. The via comes up through the active area and usually through some of the metal layers. Due to the details of manufacturing, quite a large area must be left around the TSV so that it can be manufactured without damaging the layers already deposited. The second problem is that the manufacturing process stresses the silicon substrate in a way that can alter the threshold values of transistors anywhere nearby, thus altering the performance of the chip in somewhat unpredictable ways.</p>
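<p>The exclusion-zone problem at least reduces to a simple geometric check. Here is a toy sketch; the 10um keep-out radius is invented for illustration, since real values are process-specific:</p>
<pre><code>import math

def tsv_violations(cells, tsvs, keepout=10.0):
    """Flag cells placed inside a TSV keep-out halo. cells and tsvs
    are (name, x, y) in um; the 10um radius is illustrative only."""
    bad = []
    for cname, cx, cy in cells:
        for tname, tx, ty in tsvs:
            if math.hypot(cx - tx, cy - ty) &lt; keepout:
                bad.append((cname, tname))
    return bad

print(tsv_violations([("inv1", 12.0, 5.0)], [("tsv_a", 10.0, 10.0)]))
# -> [('inv1', 'tsv_a')]: inv1 sits about 5.4um from tsv_a
</code></pre>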
]]></content:encoded>
			<wfw:commentRss>http://edagraffiti.com/?feed=rss2&#038;p=994</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Variation-aware Design</title>
		<link>http://edagraffiti.com/?p=991</link>
		<comments>http://edagraffiti.com/?p=991#comments</comments>
		<pubDate>Wed, 05 Jan 2011 23:50:10 +0000</pubDate>
		<dc:creator>paulmcl</dc:creator>
				<category><![CDATA[methodology]]></category>
		<category><![CDATA[semiconductor]]></category>

		<guid isPermaLink="false">http://edagraffiti.com/?p=991</guid>
		<description><![CDATA[Solido has run an interesting survey on variation-aware design. The data is generic and not specific to Solido&#8217;s products although you won&#8217;t be surprised to know that they have tools in this area. What is variation-aware design? Semiconductor manufacturing is &#8230; <a href="http://edagraffiti.com/?p=991">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://edagraffiti.com/wp-content/uploads/2011/01/vad.jpg"><img class="alignleft size-thumbnail wp-image-992" title="vad" src="http://edagraffiti.com/wp-content/uploads/2011/01/vad-150x112.jpg" alt="" width="150" height="112" /></a><a href="http://www.solidodesign.com" target="_blank">Solido</a> has run an interesting <a href="http://www.solidodesign.com/files/variation-aware-custom-design-survey-2011.pdf" target="_blank">survey on variation-aware design</a>. The data is generic and not specific to Solido&#8217;s products although you won&#8217;t be surprised to know that they have tools in this area.</p>
<p>What is variation-aware design? Semiconductor manufacturing is a statistical process and there are two ways to handle this in the design world. One is to abstract away from the statistical detail into a pass/fail environment with concepts like minimum spacing rules and worst-case transistor timing. Meet the rules and the chip will yield. This is largely what we do in the digital world, although with the complexity of modern design rules, and the number of process corners that we now need to consider, a lot of the complexity of the process is bleeding through anyway. But there is an underlying assumption in this approach that within-die variation is minimal. In fact the very idea of a process corner depends on this: all the n-transistors are at this corner and all the p-transistors are at that corner.</p>
<p>But for analog this approach is no longer good enough; instead, the design needs to be analyzed in the context of process variation, for which the foundry needs to provide variation models. This requires statistical techniques in the tools to take the statistical data from the process and estimate its effect on yield, timing and power. It remains unclear to what extent these approaches will become necessary in the digital world as we move down the process nodes.</p>
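<p>To illustrate the difference in mindset, here is a toy Monte Carlo sketch of the statistical approach. Everything in it (the threshold-voltage distribution, the delay model, the timing spec) is made up; the point is that yield comes out as a probability rather than a corner pass/fail:</p>
<pre><code>import random

random.seed(42)
N = 100_000
fails = 0
for _ in range(N):
    vth = random.gauss(0.45, 0.03)      # V; assumed mean and sigma
    delay = 1.0 / max(0.9 - vth, 1e-9)  # crude monotonic delay model
    if delay > 2.4:                     # ns; assumed timing spec
        fails += 1
print(f"estimated yield loss: {100 * fails / N:.2f}%")
</code></pre>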
<p>Solido had an agency survey several thousand IC designers, of whom nearly 500 responded, so this is quite a large sample. They are a mixture of management and custom designers (so not digital designers).</p>
<p>The #1 problem where they felt that advances were needed in tools was variation-aware design (66%), followed by parasitic extraction (48%). Coming up at the rear, I don&#8217;t think anyone will be surprised that there isn&#8217;t a burning desire for major improvements in schematic capture (7%).</p>
<p>Of course the main reason people want variation-aware technology is to improve yield (74%) and avoid respins (64%), which is really just an extreme case of yield improvement! They also wanted to avoid project delays, since over half of the groups had missed deadlines or had respins due to variation issues, typically causing a two-month slip.</p>
<p>When asked at which process node they thought variation-aware design becomes important, surprisingly about 10% said that it was already important at 0.18µm, but that number rises to 60% by 65nm and 100% by 22nm.</p>
<p>So this is definitely something the analog guys need to worry about now, and the digital guys need to be aware of. Indeed, Solido is part of the TSMC AMS reference flow (and other companies such as Springsoft and Synopsys have some variation-aware capabilities).</p>
]]></content:encoded>
			<wfw:commentRss>http://edagraffiti.com/?feed=rss2&#038;p=991</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Carbon</title>
		<link>http://edagraffiti.com/?p=973</link>
		<comments>http://edagraffiti.com/?p=973#comments</comments>
		<pubDate>Thu, 23 Dec 2010 11:35:10 +0000</pubDate>
		<dc:creator>paulmcl</dc:creator>
				<category><![CDATA[eda industry]]></category>
		<category><![CDATA[methodology]]></category>

		<guid isPermaLink="false">http://edagraffiti.com/?p=973</guid>
		<description><![CDATA[In the latest piece that Jim Hogan and I put together about re-aggregation of value back at the system companies I talked a little bit about Carbon. I got two things wrong, that I’d like to correct here. The first &#8230; <a href="http://edagraffiti.com/?p=973">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://edagraffiti.com/wp-content/uploads/2010/12/carbonlogo.jpg"><img class="alignleft size-full wp-image-974" title="carbonlogo" src="http://edagraffiti.com/wp-content/uploads/2010/12/carbonlogo.jpg" alt="" width="194" height="173" /></a>In the latest piece that Jim Hogan and I put together about re-aggregation of value back at the system companies I talked a little bit about Carbon.</p>
<p>I got two things wrong that I’d like to correct here. The first goes back a long way, to the mergers of Virtutech, VaST and CoWare, when I listed the other virtual platform companies that are still independent. I omitted Carbon since I didn’t actually realize they had acquired the virtual platform technology SOCdesigner from ARM when they did the deal to take responsibility for creating and selling ARM’s cycle-accurate models.</p>
<p>SOCdesigner was originally a product from a company called Axys, based in southern California. I believe that they had technology pretty similar to VaST at the time, but it was hard to know since they were very secretive. Despite rules against doing so, they would throw us out of their presentations at DAC and ESC (so we sent over our finance person to see what she could find out…but they even spotted her). ARM acquired Axys, which I never understood the reason for. Even ARM-based designs typically involve lots of models not from ARM, so it never seemed likely that ARM would be able to make SOCdesigner a successful standalone business; it seemed like a business for someone independent of the processor companies. After all, you can’t imagine MIPS putting much effort in to make their models run cleanly in SOCdesigner. At VaST we considered it less of a threat post-acquisition than before.</p>
<p>Anyway, Carbon got SOCdesigner (still called SOCdesigner) and used their own technology for turning RTL into fast C-based cycle-accurate models to solve another problem ARM had, namely the cost of creating, maintaining and distributing cycle-accurate models. ARM had always had fast models of their processors and many peripherals, since that is what software developers required, and these are relatively cheap to produce (they only need to be functionally accurate, so there are many corners that can be cut; for instance, it is not usually necessary to model the cache or branch prediction since the only difference is the number of cycles used).</p>
<p>The second error was that I didn’t really realize that in the Carbon world there are now three speeds of models: RTL, cycle-accurate models, and fast models.</p>
<p>RTL models aren’t really in the Carbon world, actually. But cycle-accurate models are automatically generated from the RTL which means that they are correct by construction. These models are not fast enough for software development, and in fact it is impossible to create models that are fast enough for software development and simultaneously accurate enough for SoC development. However, given their RTL provenance they tie the software and the SoC design together accurately, which is really important because increasingly it is only possible to validate the software against the hardware and vice versa.</p>
<p>Fast models usually either come from the vendor of the processor or IP, or are created by the end-user. Processor models are not actually models in the usual sense of the word; they are just-in-time (JIT) compilers under the hood, converting instruction sequences from ARM, MIPS or whatever into x86 instructions that run at full native speed. Fast peripheral models again are created by cutting lots of corners, but this is not something that can be done automatically since it is not clear (and often depends on the use to which the model will be put) which corners can be cut.</p>
<p>The remaining piece of the puzzle is the capability for the virtual platform to switch from fast models to cycle-accurate models. Boot up the system until it gets interesting (or perhaps just before, to give the cycle-accurate models a bit of runway), then suck out all the state information from the fast models and inject it into the cycle-accurate models. This gives the best of both worlds: fast models when you don’t care about the details of what is going on in the hardware, and complete accuracy when you do, either because you are responsible for verifying the hardware or debugging low-level software that interacts intimately with the hardware.</p>
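<p>Schematically, the handoff looks something like the sketch below. The classes and the state dictionary are hypothetical, not Carbon’s API; the point is simply “run fast, checkpoint the architectural state, resume accurately”:</p>
<pre><code>class FastCPU:
    """Stand-in for a JIT-based fast model (hypothetical API)."""
    def __init__(self):
        self.pc, self.regs, self.mem = 0, [0] * 16, {}
    def run_until(self, trigger_pc):
        # JIT execution elided; pretend we booted up to the trigger.
        self.pc = trigger_pc
    def extract_state(self):
        return {"pc": self.pc, "regs": list(self.regs),
                "mem": dict(self.mem)}

class CycleAccurateCPU:
    """Stand-in for an RTL-derived cycle-accurate model."""
    def inject_state(self, state):
        self.pc, self.regs, self.mem = (state["pc"], state["regs"],
                                        state["mem"])
    def step(self, cycles):
        print(f"simulating {cycles} cycles from pc={self.pc:#x}")

fast, accurate = FastCPU(), CycleAccurateCPU()
fast.run_until(0x8000)    # boot quickly, with no timing detail
accurate.inject_state(fast.extract_state())
accurate.step(10_000)     # from here on, every cycle counts
</code></pre>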
]]></content:encoded>
			<wfw:commentRss>http://edagraffiti.com/?feed=rss2&#038;p=973</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Evolution of design methodology, part II</title>
		<link>http://edagraffiti.com/?p=979</link>
		<comments>http://edagraffiti.com/?p=979#comments</comments>
		<pubDate>Wed, 22 Dec 2010 19:34:51 +0000</pubDate>
		<dc:creator>paulmcl</dc:creator>
				<category><![CDATA[eda industry]]></category>
		<category><![CDATA[methodology]]></category>

		<guid isPermaLink="false">http://edagraffiti.com/?p=979</guid>
		<description><![CDATA[The second half of the article that Jim Hogan and I wrote on re-aggregation of design at the system companies is now up at EEtimes. The second part of the article looks at the implications for the EDA and IP &#8230; <a href="http://edagraffiti.com/?p=979">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://edagraffiti.com/wp-content/uploads/2010/11/evolution.png"><img class="alignleft size-full wp-image-938" title="evolution" src="http://edagraffiti.com/wp-content/uploads/2010/11/evolution.png" alt="" width="292" height="143" /></a>The second half of the article that Jim Hogan and I wrote on re-aggregation of design at the system companies is now up at <a href="http://www.eetimes.com/electronics-news/4211615/Evolution-of-design-methodology--Part-II">EEtimes</a>.</p>
<p>The second part of the article looks at the implications for the EDA and IP industries of the changes that we outlined in the first part of the article.</p>
]]></content:encoded>
			<wfw:commentRss>http://edagraffiti.com/?feed=rss2&#038;p=979</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>System re-aggregation</title>
		<link>http://edagraffiti.com/?p=937</link>
		<comments>http://edagraffiti.com/?p=937#comments</comments>
		<pubDate>Wed, 24 Nov 2010 18:53:39 +0000</pubDate>
		<dc:creator>paulmcl</dc:creator>
				<category><![CDATA[eda industry]]></category>
		<category><![CDATA[methodology]]></category>

		<guid isPermaLink="false">http://edagraffiti.com/?p=937</guid>
		<description><![CDATA[For some time now Jim Hogan and I have been debating whether we are finally on the cusp of one of those design transitions that comes along once every decade or so: the move to gate-level from transistor, the move &#8230; <a href="http://edagraffiti.com/?p=937">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://edagraffiti.com/wp-content/uploads/2010/11/evolution.png"><img class="alignleft size-full wp-image-938" title="evolution" src="http://edagraffiti.com/wp-content/uploads/2010/11/evolution.png" alt="" width="292" height="143" /></a>For some time now Jim Hogan and I have been debating whether we are finally on the cusp of one of those design transitions that comes along once every decade or so: the move to gate-level from transistor, the move to synthesis, and so on.</p>
<p>The classic design methodology was built on an assumption that design is roughly: write the RTL, automatically reduce it to layout, then write a little software for the control microprocessor. But now, for most SoCs, this is completely backwards: 80% or 90% of the design is pre-existing IP. The software load can be enormous but most of it isn&#8217;t being written for this specific SoC; it is inherited from earlier designs. This changes the whole nature of design and potentially causes one of those re-jiggings of the supply chain, and of who realizes the most value.</p>
<p>One implication of all of this is that system companies like Apple can design their own systems without having to share so much of the margin with others, as they have done with the iPad A4 chip.</p>
<p>So Jim and I wrote a piece and it is running in EEtimes. I&#8217;m not sure how heavily it is going to end up being edited. I&#8217;ll put stuff up here that got cut for space reasons. The first part is <a href="http://www.eetimes.com/electronics-news/4211010/The-evolution-of-design-methodology" target="_blank">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://edagraffiti.com/?feed=rss2&#038;p=937</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Polyteda</title>
		<link>http://edagraffiti.com/?p=278</link>
		<comments>http://edagraffiti.com/?p=278#comments</comments>
		<pubDate>Tue, 22 Jun 2010 13:48:20 +0000</pubDate>
		<dc:creator>paulmcl</dc:creator>
				<category><![CDATA[eda industry]]></category>
		<category><![CDATA[methodology]]></category>

		<guid isPermaLink="false">http://blogs.cancom.com/elogic_920000692/2010/06/22/polyteda/</guid>
		<description><![CDATA[One startup I did run across that looks interesting is Polyteda. Let me first point out that this is all based simply on talking to them and I’ve not run their tools or done any other diligence. They have a &#8230; <a href="http://edagraffiti.com/?p=278">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://edagraffiti.com/wp-content/uploads/2010/06/polyteda.png"><img class="alignleft size-full wp-image-579" title="polyteda" src="http://edagraffiti.com/wp-content/uploads/2010/06/polyteda.png" alt="" width="250" height="86" /></a>One startup I did run across that looks interesting is Polyteda. Let me first point out that this is all based simply on talking to them and I’ve not run their tools or done any other diligence.</p>
<p>They have a next-generation DRC tool facing off against Calibre (they also have LVS). They are based out of Ukraine. When I was at Compass I had a development group in Moscow, and one of the things I noticed was that Russians thought differently (and yes, I know Ukrainians are not Russians). They had grown up in an era where the best computer you could expect was an old 286-based PC; you weren’t getting a state-of-the-art Sun workstation. Also, they grew up in an era where the Rouble wasn’t just not externally convertible, it wasn’t really internally convertible: you couldn’t buy stuff with it since it just wasn’t available. So they had a pride in doing a lot with a little, especially clever mathematics that just required a pen and paper, or processing large chips on inadequately powered computers. In addition, the Soviet universities were not seeded with American-educated professors, as was largely the case in India and China, so they were even educated differently from the rest of the world.</p>
<p>Polyteda is now qualified with UMC at 65nm, apparently the only DRC tool other than Calibre to be so, since the others haven’t passed yet. While some are moving faster than others, it is safe to assume all of the foundries and many of the IDMs are at least investigating their technology if not actively working with Polyteda.</p>
<p>Polyteda takes a different approach to DRC: instead of being layer-based, largely processing one layer at a time, it divides the chip up into areas and processes all rules on each area, fitting the whole thing in memory. They don’t overlap the areas (since that would mean double-checking some things), so obviously this scales nicely to huge numbers of processors.</p>
<p>They have a much more programmatic way of expressing the design rules, including procedure calls, making for very compact rule-decks. Of course they can also read a Calibre deck but that is not a very efficient way of using the tools. Their language is more powerful, meaning they can check complex rules that other DRCs cannot, or can only do incredibly inefficiently: antenna rules, bizarre reflection rules, some stress rules and so on.</p>
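<p>The area-based idea is easy to sketch in Python, with the caveat that Polyteda’s actual rule language and internals are not public; plain functions stand in for the rule deck here, and the tiling and parallelism are the point:</p>
<pre><code>from multiprocessing import Pool

def min_width(shapes, limit=0.1):
    """Toy rule: flag shapes narrower than the limit (um)."""
    return [s for s in shapes if s["w"] &lt; limit]

RULES = [min_width]   # a real deck would have thousands of rules

def check_tile(tile_shapes):
    errors = []
    for rule in RULES:   # every rule runs on each in-memory tile
        errors.extend(rule(tile_shapes))
    return errors

def run_drc(tiles):
    # Non-overlapping tiles mean nothing is checked twice and the
    # work is an embarrassingly parallel map over worker processes.
    with Pool() as pool:
        return [e for errs in pool.map(check_tile, tiles) for e in errs]

if __name__ == "__main__":
    tiles = [[{"w": 0.08}], [{"w": 0.12}]]   # two tiles, toy shapes
    print(run_drc(tiles))                    # -> [{'w': 0.08}]
</code></pre>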
<p>All other DRCs largely use scan-line algorithms, processing trapezoids as a line runs across the layers being analyzed. My guess is that Polyteda does not, and that it instead does something closer to an approach I was shown back in those Compass Moscow days: following a polygon around from edge to edge and handling all the interactions. But that is pure speculation.</p>
<p>In this market, as Calibre showed at 0.35um, change can come fast. And since it is about a $400M market there is a lot of money up for grabs.</p>
]]></content:encoded>
			<wfw:commentRss>http://edagraffiti.com/?feed=rss2&#038;p=278</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Is it time to start using high-level synthesis?</title>
		<link>http://edagraffiti.com/?p=270</link>
		<comments>http://edagraffiti.com/?p=270#comments</comments>
		<pubDate>Fri, 30 Apr 2010 06:29:07 +0000</pubDate>
		<dc:creator>paulmcl</dc:creator>
				<category><![CDATA[methodology]]></category>

		<guid isPermaLink="false">http://blogs.cancom.com/elogic_920000692/2010/04/30/is-it-time-to-start-using-high-level-synthesis/</guid>
		<description><![CDATA[One big question people have about high-level synthesis (HLS) is whether or not it is ready for mainstream use. In other words, does it really work (yet)? HLS has had a long history starting with products like Synopsys&#8217;s Behavioral Compiler &#8230; <a href="http://edagraffiti.com/?p=270">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><img src="http://www.edagraffiti.com/images/autoesl.jpg" title="autoesl.jpg" alt="autoesl.jpg" align="left" hspace="3" />One big question people have about high-level synthesis (HLS) is whether or not it is ready for mainstream use. In other words, does it really work (yet)? HLS has had a long history, starting with products like Synopsys&#8217;s Behavioral Compiler and Cadence&#8217;s Visual Architect, which never achieved any serious adoption. Then there was a next generation with companies like Synfora, Forte and Mentor&#8217;s Catapult. More recently still, there are AutoESL and Cadence&#8217;s CtoSilicon.</p>
<p>I met Atul, CEO of AutoESL, last week and he gave me a copy of an interesting report that they had commissioned from Berkeley Design Technology (BDT), who set out to answer the question &#8220;does it work?&#8221; at least for the AutoESL product, AutoPilot. Since HLS is a competitive market, and the companies in the space are constantly benchmarking at customers and all are making some sales, I think it is reasonable to take this report as a proxy for all the products in the space. Yes, I&#8217;m sure each product has its own strengths and weaknesses, and different products have different input languages (for instance, Forte only accepts SystemC, Synfora only accepts C, and AutoESL accepts C, C++ and SystemC).</p>
<p>BDT ran two benchmarks. One was a video motion analysis algorithm and the other was a DQPSK (think wireless router) receiver. Both were synthesized using AutoPilot and then Xilinx&#8217;s tool-chain to create a functional FPGA implementation.</p>
<p>The video algorithm was examined in two ways: first, with a fixed workload at 60 frames per second, how &#8220;small&#8221; a design could be achieved; second, given the limitations of the FPGA, how high a frame rate could be achieved. The wireless receiver had a spec of 18.75 megasamples/second, and was synthesized to see what the minimal resources required were to meet the required throughput.</p>
<p>For comparison, they implemented the video algorithms using Texas Instruments TMS320 DSP processors. This is a chip that costs roughly the same as the FPGA they were using, Xilinx&#8217;s XC3SD3400A, in the mid $20s.</p>
<p>The video algorithm used 39% of the FPGA, but to achieve the same result using the DSPs required at least 12 of them working in parallel, obviously a much more costly and power-hungry solution. When they looked at how high a frame rate could be achieved, the AutoPilot/FPGA solution could achieve 183 frames per second, versus 5 frames per second for the DSP. The implementation effort for the two solutions was roughly the same. This is quite a big design, using ¾ of the FPGA. AutoPilot read 1600 lines of C and turned it into 38,000 lines of Verilog in 30 seconds.</p>
<p>For the wireless receiver they also had a hand-written RTL implementation for comparison. AutoPilot managed to get the design into 5.6% of the FPGA, and the hand-written implementation achieved 5.9%. I don&#8217;t think the difference is significant, and I think it is fair to say that AutoPilot is on a par with hand-coded RTL (at least for this example, ymmv). Using HLS also reduces the development effort by at least 50%.</p>
<p>BDT&#8217;s conclusion is that they &#8220;were impressed with the quality of results that AutoPilot was able to produce given that this has been a historic weakness for HLS tools in general.&#8221; The only real negative is that the tool chain is more expensive (since AutoESL doesn&#8217;t come bundled with your FPGA or your DSP).</p>
<p>It would, of course, be interesting to see the same reference designs put through other HLS tools, to know whether these results generalize. But it does look as if HLS is able to achieve results comparable with hand-written RTL, at least for this sort of DSP algorithm. But, to be fair to hand-coders, these sorts of DSP algorithms, where throughput is more important than latency, are a sweet spot for HLS.</p>
<p>If you want to read the whole report, it&#8217;s <a href="http://autoesl.com/images/stories/BDTI/bdti_autopilot_final.pdf">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://edagraffiti.com/?feed=rss2&#038;p=270</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
