
I remember having a conversation with a missile engineer some time ago about the North Korean Nodong missile; he said “no one in their right mind would field a missile that has only been successfully tested once!” At the time, that made a lot of sense to me. But exactly how many tests do you need? And more importantly, how do you decide how many tests you need? The answers to these questions should follow from the reason tests are performed in the first place.
I think I understand bullet testing. When developing a new bullet, you test it millions and millions of times to make sure it works right in all imaginable situations and to gain a high degree of confidence that it will work. But, of course, bullets only cost a dollar or two each, so there is little problem with running a standard quality control test program that gives you real confidence they are going to work. National missile defense tests cost about $100 million each, so we are never going to have the “95% confidence that the system works 95% of the time” that some critics of missile defense have been advocating. (I’m not against that in principle; I’m just saying it’s never going to happen. Any missile defense development program has to be adjusted to that reality.)
I was reminded of my conversation with the missile engineer when NASA announced it was awarding SpaceX part of a $3.5 billion contract to deliver supplies to the ISS based on a single successful test flight. SpaceX is quickly becoming my favorite sociological experiment in missile development. Touted as a more cost-effective way of getting into space, SpaceX has hired former NASA engineers and uses government facilities without, I’m sure, contributing to paying off their development costs; just some sort of use fee. But now, it seems, the real way they are going to save money is by not having the sort of expensive testing program we might expect from a government development program. This isn’t going to be a rant against SpaceX, which, as I say, is one of my favorite sociological experiments (which also doesn’t imply that I think they are doing the right things!). The problem is, I’m not sure what the government would use that testing program for, anyway. If we are not using flight tests to determine statistical reliability, perhaps one successful test is really all that is needed. If so, what does that tell us about countries just starting the development of their missiles?
Tests Associated with Various Development Programs
| Program | No. of Tests | No. of Successful Tests |
| --- | --- | --- |
| Falcon-1 (SpaceX) | 4 | 1 |
| W-76 | 4 | 2* |
| RS-24 | 3 | 3 |
| Al Samoud I (Iraq) | 37 | 33 |
| Al Samoud II (Iraq) | 24 | 22 |
| Nodong (DPRK only) | 2 | 1 |
| Taepo’dong I (DPRK) | 1 | 0** |
| Taepo’dong II (DPRK) | 1 | 0 |
*I have arbitrarily dropped the two tests with anomalous results from the successful column.
**First 2 stages successful.
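To see how little these test counts say statistically, here is a hedged sketch that computes a one-sided 95% Clopper-Pearson lower bound on the success probability for each program in the table. It treats every flight as an independent trial of an identical system, which is an assumption of mine and certainly wrong for development programs where the design changes between flights. The function names are hypothetical; the counts are copied from the table.

```python
from math import comb

def upper_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def lower_bound(successes: int, tests: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson lower bound on the success probability."""
    if successes == 0:
        return 0.0
    alpha = 1.0 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection works because upper_tail rises with p
        mid = (lo + hi) / 2
        if upper_tail(tests, successes, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo

# (tests, successes) as listed in the table above
programs = {
    "Falcon-1": (4, 1),
    "W-76": (4, 2),
    "RS-24": (3, 3),
    "Al Samoud I": (37, 33),
    "Al Samoud II": (24, 22),
    "Nodong": (2, 1),
    "Taepo'dong I": (1, 0),
    "Taepo'dong II": (1, 0),
}

for name, (n, s) in programs.items():
    print(f"{name:14s} {s:2d}/{n:<2d} observed {s / n:.2f}  "
          f"95% lower bound {lower_bound(s, n):.2f}")
```

Even a perfect record of three successes in three flights only supports a reliability of roughly 37% at 95% confidence under these assumptions, which is one way of seeing that such short test series are not really about statistics.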
Integration Tests
One reason I like the Falcon-1 test series so much is illustrated by the reason the third flight test failed. Developed by engineers and scientists who have had plenty of experience developing other missiles, this missile failed (I believe) because they were concentrating so much on economic factors, namely the reuse of the first stage engine. If you want to reuse an engine, you don’t want to go firing pyrotechnics that blow holes in the nozzle to quickly drain the fuel. On the other hand, if you don’t quickly and reliably shut the engine down, the remaining fuel might cause the first stage to continue to produce a little bit of thrust and hence risk bumping into the second stage engine and breaking it as they separate. That is exactly what happened. Could SpaceX have caught this error if it had run more ground checks? If so, were they cut to reduce design costs? I hope you see why I like it so much.
The RS-24 is another interesting case that seems to be devoted to testing an integrated system. Pavel Podvig has made a very convincing argument that the RS-24 is a Topol-M missile with more than one warhead uploaded onto its bus. In that case, perhaps it shouldn’t need very many flight tests to get it up to speed. In fact, one might think that only the post-boost bus needs testing. But perhaps even that doesn’t need much testing since some claim that the Topol-M’s bus was tested for more than one warhead without loading any more on it by simply maneuvering as if it did have the warheads. (Some Russians claim exactly the same thing for some US post-boost buses. The US responds to those charges by claiming that additional maneuvers were needed for range safety reasons. And so it goes.) On the other hand, as the Falcon-1 test series shows, integrating different components does introduce new modes of failure. Were three tests enough? Apparently so, since Russia has said they will now introduce the RS-24 into their arsenal.
Statistical Uncertainty
The US philosophy of testing nuclear weapons is perhaps the hardest to understand; not least because so much is buried in secrecy. One could have imagined that, since the US performed over 1054 nuclear explosion tests (it appears that some tests had more than one explosive device tested at a time) and “developed” a total of 112 nuclear weapons, they could have used these tests to establish a reasonable statistical reliability for each weapon. After all, this corresponds to nine tests per bomb design with a significant number left over for testing one-point-safety, which would be reassuring. Except that the US testing philosophy was never to test to this level.
Instead, our nuclear tests were supposed to develop weapon designers’ expertise; an expertise from which they could judge the reliability of a nuclear design without further testing. This must rely on two assumptions that are probably true most of the time: 1) that the non-nuclear components are tested individually and as a whole enough times to establish a statistical reliability for the non-nuclear functioning of the design and 2) the nuclear process involves so many “particles” that statistical fluctuations cannot have a significant effect on the design’s function.
Some doubt that the latter is true for the W-76, a mainstay of the submarine leg of our nuclear triad. Critics have suggested the possibility that a macroscopic instability exists that would violate the second assumption. The W-76 is also one of the few warheads for which the US has released information on its testing. It had a total of four tests during its development and apparently two of them had “anomalies.” They could have had anomalously high yields, or anomalously low yields, or anomalies that didn’t affect the yield; the open literature doesn’t say. However, we know that one anomaly resulted in a retest and the other in a change in a component (but no retest). Fortunately, there have probably been enough tests of the W-76, counting the few stockpile surveillance tests done in the later years of testing, to establish a reasonable statistical reliability, especially when more than one warhead is devoted to each target.
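The point about devoting more than one warhead to each target is simple arithmetic, sketched below. The reliabilities are purely hypothetical values chosen for illustration, not estimates for the W-76, and the calculation assumes the warheads fail independently of one another; a shared design flaw of the kind the critics worry about would break that assumption.

```python
def at_least_one_works(reliability: float, warheads: int) -> float:
    """Probability that at least one of several independently failing
    warheads functions, given a per-warhead reliability."""
    return 1.0 - (1.0 - reliability) ** warheads

# Hypothetical per-warhead reliabilities (illustration only, not W-76 data)
for r in (0.7, 0.8, 0.9):
    row = [round(at_least_one_works(r, k), 3) for k in (1, 2, 3)]
    print(f"reliability {r}: 1, 2, 3 warheads -> {row}")
```

With a hypothetical per-warhead reliability of 0.8, for example, two warheads on one target give a 0.96 chance that at least one works.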
Other Countries
Given these examples of developed countries’ R&D programs, Iraq’s development of the Al Samoud I and II is very reassuring. Not only did they use flight tests to iron out the bugs, they went on to what we would call an extensive operational test and evaluation series. The last 11 Al Samoud II flight tests were for verification of the “firing table,” determining the range under various conditions such as changes to the pitch program etc. (One of these failed, so the operational failure rate of the Al Samoud II was probably around 10%.) Still, I cannot help suspecting that somebody in a powerful position might have made a lot of money for each test flight flown, hence the large numbers. Still, if other countries followed this sort of a testing program, we would never miss their development of an ICBM.
North Korea, on the other hand, doesn’t seem to need nearly as many flight tests. Apparently only one successful test was needed for the DPRK to start selling its Nodong missile abroad. Various analysts have come up with ingenious reasons for this and they could very well be right. But, on the other hand, do we really understand why and how we test complex systems well enough to claim to understand North Korea’s testing? I am full of doubt.
Note added: Just to be clear, when I say I think there have probably been enough stockpile surveillance tests of the W-76 to give a reasonable statistical confidence to the W-76’s reliability, that was not the intention of the surveillance tests. In fact, this statement is based only on my estimates of the numbers tested that I derived from a correlation analysis and published in Jane’s Intelligence Review in July 2005. As I hope I made clear, the reliability of nuclear weapons is officially based on the judgment of the designers and not on tests. Perhaps not surprisingly, that is probably the case with all the other tests considered here.
