I talked recently about customer support and how to handle it. One critical aspect of this is the internal process by which bugs get submitted. The reality is that if an ill-defined bug comes in, nobody wants to take the time to isolate it. The AEs want to be out selling and feel that if they just throw it over the wall to engineering, then it will be engineering's job to sort it out. Engineering feels that any bug that can't easily be reproduced is not their problem to fix. If this gets out of hand then the bug languishes, the customer suffers and, eventually, the company too. As the slogan correctly points out, "Quality is everyone's job."
The best rule for this that I’ve ever come across was created by Paul Gill when we were at Ambit. To report a bug, an application engineer must provide a self-checking test case, or else engineering won’t consider it. No exceptions. And he was then obstinate enough to enforce the “no exceptions” rule.
This provides a clear separation between the AE's job and the development engineer's job. The AE must provide a test case that illustrates the issue. Engineering must fix the code so that the test passes. Plus, when all that activity is over, there is a test case to go into the regression suite.
Today, most tools are scripted with Tcl, Python or Perl. A self-checking test case is a script that runs on some test data and reports pass or fail depending on whether the bug is present. Obviously, when the bug is submitted the test case will fail (or it wouldn't be a bug). Once engineering has fixed it, it will pass. The test case can then be added to the regression suite, and the bug should stay fixed. If the test fails again, then the bug has been re-introduced (or another bug with similar symptoms has been created).
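To make this concrete, here is a minimal sketch of what such a self-checking test case might look like in Python. The tool step, the netlist format, and the bug (constant buffers not being removed) are all hypothetical; in a real test case, `run_synthesis` would invoke the actual tool on the customer-derived data, and the stub below merely stands in so the script runs on its own.

```python
# Sketch of a self-checking test case (hypothetical tool API and bug).
import sys

def run_synthesis(netlist):
    """Stub standing in for the tool step under test.

    A real test case would call the tool's scripting API here. The
    (hypothetical) bug being reported: buffers on constant nets are
    not removed during optimization.
    """
    return [cell for cell in netlist if cell != "const_buf"]

def main():
    # Small, simple test data that exhibits the problem.
    netlist = ["and2", "const_buf", "inv", "const_buf"]
    result = run_synthesis(netlist)
    # Self-check: the bug exists if any constant buffer survives.
    if "const_buf" in result:
        print("FAIL: constant buffers not removed")
        return 1
    print("PASS")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The key property is the exit status: the script decides for itself whether it passed, so no human has to inspect the output when it runs in a regression suite.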
There are a few areas where this approach won't really work. The most obvious is graphics problems: the screen doesn't refresh correctly, for example. It is hard to build a self-checking test case because it is too hard to determine programmatically whether what is on the screen is correct. There are also issues on the borderline between bugs and quality-of-results problems: this example got a lot worse in the last release, say. It is easy to build the test case, but what should the limit be? EDA tools are not algorithmically perfect, so it is not clear how much degradation is acceptable if an algorithmic tweak makes most designs better. But it turns out that for an EDA tool, most bugs are in the major algorithms under the control of the scripting infrastructure, and there it is straightforward to build a self-checking test case.
So when a customer reports a bug, the AE needs to take some of the customer’s test data (and often they are not allowed to ship out the whole design for confidentiality reasons) and create a test case, preferably small and simple, that exhibits the problem. Engineering can then fix it. No test case, no fix.
If a customer cannot provide data to exhibit the problem (the NSA is particularly bad at this!) then the problem remains between the AE and the customer. Engineering can’t fix a problem that they can’t identify.
With good test infrastructure, all the test cases can be run regularly, and since each one reports whether it passes or fails, it is easy to build a list of all the failing test cases. Once a bug has been fixed, its test case can be added to the suite, and it will automatically run each time the regression suite does.
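A runner for such a suite can be very simple, because each test case already knows whether it passed. The sketch below assumes a hypothetical layout where every test case is a script that exits 0 on pass and nonzero on fail; the directory name is illustrative.

```python
# Sketch of a tiny regression runner. Assumes each test case is a
# Python script that exits 0 on pass, nonzero on fail (hypothetical
# convention; real suites often mix Tcl, Perl, etc.).
import subprocess
import sys
from pathlib import Path

def run_suite(test_dir):
    """Run every *.py test case in test_dir; return names of failures."""
    failures = []
    for script in sorted(Path(test_dir).glob("*.py")):
        result = subprocess.run([sys.executable, str(script)],
                                capture_output=True)
        if result.returncode != 0:
            failures.append(script.name)
    return failures

if __name__ == "__main__":
    # "regressions" is a placeholder directory name.
    failing = run_suite("regressions")
    print(f"{len(failing)} failing test case(s): {failing}")
```

Because every test case is self-checking, the failing list falls out for free; no one has to eyeball logs to decide which runs went wrong.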
That brings up another aspect of test infrastructure. There must be enough hardware available to run the regression suite in a reasonable time. A large regression suite with no way to run it frequently is of little use. We were lucky at Ambit that we persuaded the company to invest in 40 Sun servers and 20 HP servers just for running the test suites.
A lot of this is fairly standard these days in open-source and other large software projects. But somehow it still isn't standard in EDA, which tends to provide productivity tools for designers without using state-of-the-art productivity tools itself.
On a related point, the engineering organization needs to have at least one very large machine too. Otherwise, customers will inevitably run into problems with very large designs and there will be no hardware internally to even attempt to reproduce the problem. This is less of an issue today, when hardware is cheap, than it was when a large machine was costly. It is easy to forget that ten years ago it cost a lot of money to have a server with 8 gigabytes of memory; few hard disks were even that big back then.
And with perfect timing, here’s yesterday’s XKCD on test-cases: