Letter to the Transport Subcommittee on NERC/NSC, Monday 17 November, 1997

Professor Peter B. Ladkin

Special Report RVS-S-97-01

17 November 1997

To
The Transport Subcommittee
The House of Commons
London

The Software Development for the NERC

Honourable Members of the Transport Subcommittee,

I wish to provide you with information and commentary concerning the software development for the National En-Route Center commissioned by NATS.

The punch-line first. There are two fundamental system issues to which I believe the Subcommittee should give close attention:

the failure to `scale-up' operating software from 30 to 100+ workstations, indicating fundamental system-operation problems below the level of ATC functions;
the process by which the ever-changing ATC system requirements are managed, engineered, and incorporated into the system

I wish to comment on the six planning options briefed by NATS in September 1997 in the light of these two features. In particular, I fail to see how five of the six options briefed by NATS conform with fundamental software project-planning maxims, and wish to explain my thoughts to the Subcommittee. Further, I cannot find NATS briefing material on one important option, that of cancelling the software development and starting anew, so I wish to provide some of my own.

About myself: I am a British national, a computer scientist specialising in software-based system specification and verification, and failure (the flip side of verification). I was educated in Britain and California, performed research and consulted on software development methods in `Silicon Valley', before returning to Europe, where I have performed research in Germany, Switzerland, Scotland and France. I am currently Professor of Computer Networks and Distributed Systems at the University of Bielefeld in Germany. I am also a Visiting Professor at Middlesex University in the UK and the Laboratoire INRS-Télécommunications in Montréal until 1999. I am a pilot, holding a US Private Pilot licence with an Instrument Rating, with 750 hours total time in small airplanes, permitted to fly and use ATC services in any conditions which I consider suitable. Besides my specialist research, I comment regularly on computer-related incidents and accidents in aviation in RISKS, the Forum on Risks to the Public in Computers and Related Systems, an electronic journal supported by the ACM, the US-based international association of computer professionals, and also known as the internet newsgroup comp.risks. I also maintain a World-Wide-Web-based compendium of computer-related incidents with commercial aircraft, accessible via the URL above.

Why I am commenting on the NERC software project: I consider the success of the UK ATC modernisation project to be important for Britain, and further, of international importance, and I believe I can offer disinterested commentary on the project derived from my particular expertise in the engineering of distributed, concurrent, real-time systems (which NERC is); in the history of ambitious large software projects and their failures; and in my knowledge of air traffic control derived from my practical experience in the US air traffic control system.

My other sources of independent advice: I have been briefed on ATC software evaluation by my colleague Bob Ratner, of Ratner and Associates in Palo Alto, California, who is the software auditor for TAAATS, the Australian ATC modernisation project, and who has evaluated software system development and planning for a number of ATC systems. I have been briefed on experience with CAATS, the Canadian full ATC modernisation project, by Bob Fletcher, responsible for Project Risk and Software Safety Analysis for CAATS projects at NAV CANADA. NAV CANADA is the CAATS client, a private company equivalent to NATS in the UK. I have been provided with information concerning the TAAATS development by Bob Peake, who is TAAATS Project Manager for Airservices Australia.

Failure is possible: Fletcher, Peake and Ratner all emphasised to me that the new-generation ATC systems are complex projects at the cutting edge of hardware-software systems engineering and that such major developments all have their share of management and engineering problems. It is, shall we say, to be expected. But nevertheless these systems are essential. Opting out of development is not a real option for any country that wishes to maintain a presence in commercial aviation. The big question is how to manage this process.

Keeping this in mind, let me say right out that the current NERC software development could fail, and that significant warning signs are there. It caused me concern that this possibility was apparently not taken into account in the briefing material provided by NATS, in particular the Consultation Paper of September 1997. Peter Neumann of SRI International in California, editor of RISKS, has provided me with examples of ambitious software projects which failed, of which the London Stock Exchange Taurus system is by no means the worst example. Let me briefly draw to your attention some other complex aviation systems that almost failed, to indicate the costs that can be involved:

"The [US] B-1 bomber required an additional $1 billion to improve its ineffective air-defence software, but software problems prevented it from achieving its goals."
"The software for the modernisation of the Satellite Tracking Control Facility was reportedly about 7 years behind schedule, was about $300 million over budget, and provided less capability than required."
" The Airborne Self Protection Jammer (ASPJ), an electronic air-defense system installed in over 2000 [US] Navy [aircraft] was $1 billion over budget, 4 years behind schedule, and only marginally operationally effective and marginally operationally suitable."

The list goes on.

Features of the NERC system that make it particularly vulnerable are that it is distributed (parts of the system run on many different computers and must communicate reliably with each other to function correctly); concurrent (it performs many different tasks simultaneously); and real-time (complex operations must be performed in step with unfolding events in the world outside). There is no real hub to such a system - it's more reasonable to think of it as a lot of mutually-communicating tasks going on at once in different physical locations. It must be based on a fundamental structure of sound communication protocols and message-traffic management. By way of close analogy, one can think of the organisation as similar to that of a stock exchange, in which one can think of the traders as similar to the ATCO workstations. Such a system can go wrong in some similar ways to those in which, say, a market can crash (and more).

When the stock market crashes, there would be little point in taking brain scans of the traders to determine where the faulty neurons are. Yet this is the kind of measurement suggested by NATS as indicative of how the faulty software is being rectified - John Barrett of NATS was reported in a professional journal as talking of so-many `bugs' per so-many `lines of code'. But problems with distributed, concurrent, real-time systems mostly cannot be expressed as `bugs per so-many lines of code'.

For example, the New York phone system suffered a severe 7-hour `slowdown' (it came to a grinding halt) on January 15, 1990, by software that performed exactly as intended. It detected malfunctioning system components and restarted them. The restarted components broadcast `restart' messages to other components, which became overwhelmed with `restart' messages and restarted themselves, compounding the problem. In short, a chain reaction. And a significant safety problem (neither emergency 999-type services nor air traffic control could function normally). This feature of the system, its capability to instigate a `chain reaction' of this type, could not and cannot be expressed in terms of `bugs' per so-many `lines of code' - no individual piece of software was in itself faulty. Everything hangs on the overall design, in concert with the `right' (or `wrong') environmental triggers. These are `latent system errors'.

For a second example, a power failure to the phone systems again in New York on 17 September 1991, and the inability of the backup systems to handle the increased traffic, caused a halt to operations at all three New York airports for several hours. This was an ATC system failure triggered by an `external' failure which it had been supposedly designed to accomodate. Yet there was no `bug' in any `line of code' on which to lay the blame. The system design, along with a specific environmental trigger, led to this `common mode failure'.

Discovering and inhibiting such system behaviour is a matter of delicate and complex software engineering, and is not a routine matter. Many (such as myself) feel it is preferable to use mathematical methods to ensure they cannot arise in the first place, but others do not feel these methods are yet economically feasible. But there is almost universal agreement that effective methods of dealing with such problems involves more than removing `bugs' in `lines of code'.

A fundamental problem? The significant piece of evidence that there are fundamental system protocol design issues in NERC software lies in the fact that the developed system ran fine on 30 workstations but not on 100. This suggests that there are message-traffic and timing problems at the fundamental system-operation level. There is no reliable method for estimating how or if such problems can be safely engineered out of the system. I advise the Subcommittee not to be misled in thinking that any such estimates, from anybody, can be relied upon. The project has used 60 of a (re)planned 64 months for software development, if I have my figures right. And at this point, representing 93.7% of the total engineering effort put into this system, neither contractor nor client seems able to guarantee that, in the 6.3% of time remaining to installation, the basic system software will be reengineered to function as planned. I believe one should ask why, at this late stage, such guarantees have not been given.

If the present NERC system is to be used in the safety-critical ATC environment, the basic system problems must be carefully identified, and the system must be reengineered to avoid them. `Quick fixes' (which I would suspect is all that could be accomplished in the remaining time) won't suffice. The ultimate system risk is a passenger airplane crash in UK airspace, and we must keep in mind the possibility that ATC system problems can contribute to if not cause such an event. Computer system design and implementation issues have been implicated in a number of newer-generation aircraft accidents. However, it is apparently not yet clear that the NERC system can, in fact, be reengineered to avoid the fundamental problems. Since neither contractor nor client have offered reliable guidance on this, I recommend considering the possibility that the system will never work effectively.

Two sorts of `risk': There are at least two possible meanings to the word `risk' when used in the context of NERC software. The first is, as mentioned above, the risk of the system not performing its safety-critical function of ensuring safe air travel. The second is the risk that the system cannot be completed as designed to fulfil the requirements. I suspect this second meaning is what is meant by NATS when speaking in their September 1997 briefing material of `managing the risk' on the software -- since if the software is guaranteed to work, there is no risk of this second sort. `Managing' this risk consists in evaluating options should the worst happen. That means considering what to do if the system cannot be made to work.

The six options: NATS offers six options for project planning. The list raises significant questions. One is of omission: one further option not identified by NATS which should be considered is that of abandoning the current system development and starting afresh. One wonders why this was apparently not considered. In the case that the system will never work effectively, this is clearly the best option. Fletcher says of experience with CAATS that a careful evaluation of this option `helps clear the air'. He identifies three general options to consider: throwing the system away; extending the schedule on the current development without cash; and extending with cash. CAATS costs millions of dollars per month, so when such a project runs over schedule, both of these latter two options must be equally seriously considered.

Before considering NATS's six planning options, I should like to introduce a classic software project planning observation due to Fred Brooks. Brooks pointed out that if a project was late, it couldn't be speeded up very much, either by throwing bodies at it or by other means. Trying to speed it up means taking resources away from it to train new people, and then reintegrating these trainers along with the new people. This slows a late project down even more by diverting resources. Expecting it to catch up again, let alone surpass the previous pace of development, is often a pipe dream. It's like expecting a 200-meter runner who is behind the pace to stop to catch his breath, in order to catch up and then race ahead of the other runners to the finish line. Such an outcome is implausible, and it's no different in large, complex software projects. Let's call this observation Brooks' Law.

A second way to divert project resources is to reduce the functionality of the system to be delivered. Changing the functionality of a system in mid-development requires careful and considerable re-engineering. It may not matter whether the functionality is being decreased or increased - resources are required to reengineer the system and these resources are diverted from the original task. CAATS was reengineered to reduced functionality starting in Spring 1995, giving up the conflict-resolution and flow-management features as well as the automatic management of military centralised altitude-reservation and other requirements. Such a functionality reduction consists in putting major functions back onto the human controller. Fletcher estimates that the process with this `downward' redesign has taken about as long as it would have if they had continued with the original development plan - you gain no resources but you lose system functionality. He suggested the analogy of starting with an aircraft carrier and deciding you want a frigate: it's not just a simple matter of throwing pieces away. Let's call this the Second Law: changing functionality, even downwards, does not necessarily free up resources.

Yet a third way to divert project resources is to add tasks not in the original project plan. For example, installing MEDIATOR on the new NERC hardware, and then deinstalling it in order to install the new NERC software. It should be obvious that hindering development work on the basic NERC system will only increase any delay in deployment. Let's call this the Third Law.

Brooks' Law and the Third Law should in some sense be intuitively obvious. The Second Law comes from experience with engineering complex systems. What is common to all three is that, barring a miracle, they're pretty much inviolable. Let's see how NATS's options measure up.

NATS's Option 3 blatantly violates Brooks' Law, while Options 1 and 2 violate the Third Law. Option 4 violates the Third Law in a more subtle way by requiring an interface to be developed between present LATCC operations and NERC operations during the `transition to operations' stage. I understand from the briefing material that this interface between MEDIATOR operations and the NERC system has not been designed, let alone planned. Option 5 also violates the third law by incorporating the Clacton resectorisation twice, once at LATCC and then at Swanwick in the NERC system, requiring `significant system changes at NERC', according to NATS. The only option which does not violate one of the laws is Option 6. Let's consider NATS's assessment of this option.

NATS claims Option 6 is `inherently low risk' (September 1997). Consider anew that nearly 94% of the planned development time has run through, and neither contractor nor client seems prepared yet to guarantee a successful conclusion to the software development. At a briefing in April 1996, Barrett described this option as `not low risk, but achievable'. That's when he was considering `Full System Integration' by March 1997. Apparently it wasn't achievable by March 1997 and it seems to me worthwhile to inquire if it's achievable now. I'm not sure what caused NATS to revise its completion-risk estimate downwards between April 1996 and September 1997, but I'm fairly sure it wasn't the fact that their contractor missed another deadline by another year. The Subcommittee might like to inquire generally after the reasoning used in these achievability estimates.

Managing continuously changing requirements: I would like to pass now to an objective difficulty with air traffic control system development. A factor which complicates such development and distinguishes it from other software development is the continuously-changing nature of the system requirements. New technology and procedures such as MLS and GPS approaches, parallel instrument approaches to close-in runways, free flight, in-trail climbs over Oceanic airspace, and FANS, as well as the substantial growth of air traffic overall, mean that an ATC system requirements definition is chasing a moving target. This procedure must be managed particularly well. The process of determining the requirements and checking them for feasibility and consistency is known as requirements engineering and is widely regarded as a specific discipline in computer and software engineering. Both Ratner and Fletcher emphasised to me the overriding importance of requirements engineering management in ATC system development. Any change in requirements means devoting significant resources downstream in the software development to reengineer the system to accomodate the new requirements. Hence the importance of the Second Law. To obtain complete, consistent requirements definition, the relationship between contractor and client is crucial (so, in my experience, is the use of formal mathematical methods, although this is disputed by some). Fletcher advised me that during this process the contractor must mature into the ATC context, and tells me this has been attained in the CAATS project. Ratner's comments suggested that the quality of the requirements definition, its management, and the relation between contractor and client are useful project assessment tools. I was surprised to find little or no comment on these issues in the NATS briefing materials.

ATCO acceptance: Fletcher also emphasised to me the importance of end-user acceptance of the system. The end-users are the ATCOs. CAATS involved ATCOs very early on, and continuously, in system assessment, and NAV CANADA and their contractors have used throughout the project a particular simulation tool to enable this. NAV CANADA in fact operates a large ATC simulation facility which they advertise on their World-Wide-Web site. Fletcher feels ATCO involvement is particularly important given the changing nature of the requirements definition, and I would like to elaborate on this theme. If ATCOs are forced to use a system which they believe inconvenient, and whose software they do not trust, then their work will be much more inefficient than if they like the system, find it handy, and trust it. There is a lot at stake for both ATCOs and air traffic management. First, if the system is untrustworthy, it is possible that an ATCO could be disciplined or even prosecuted for his or her role in an incident to which the system design contributed. It is only relatively recently in aviation that the role of so-called `latent system errors' - consequences of overall system design, functioning, or management - has been identified as a major contributor to incidents and failures (by, amongst others, James Reason of the University of Manchester). It has been so much easier to blame the pilot or the ATCO. ATCOs are aware of this. In the present climate, they will be asked to use a complex system with a history of major problems, finished in a hurry to meet a planning deadline. How is NATS to ensure that the ATCOs trust such a system? Second, the difference in productivity between contented professional users and disgruntled professional users, no matter what the reason, can be huge - I would guess of the order of 20-30% or even more, although I have not attempted to measure it. I know from my own experience and that of my professional pilot colleagues that one simply turns off automation that one doesn't trust. ATCOs will find work-arounds for features they don't trust, and thus the system will be used as if those features weren't there. That doesn't imply it would make sense to remove them - see the Second Law - but it does imply that such a system will not be used to maximal effect. The question of trust is crucial for a safety-critical system.

After I had written the above paragraph, it came to my attention that the news journal Aviation Week and Space Technology for November 3, 1997, carries a brief article on p21 about `acrimony' between the U.S.~Federal Aviation Administration (FAA) and the U.S.~National Air Traffic Controllers' Association, about `human factors' related to the new workstations. The controllers don't like the setup with the new STARS terminals, and complain that the new system `does not reflect their inputs'. The controllers' association is apparently really serious about this - it `even recalled for lawmakers the 1981 PATCO strike'. The U.S.~Congress has called in the Mitre Corp. to `referee'. The U.S.~Department of Transportation Inspector General told the House transportation panel (roughly a US equivalent of the Transport Subcommittee) that the FAA has not conducted a formal human factors engineering evaluation, and `urged it to do so', according to the article. One would surely want to avoid a similar situation with the NERC `transition-to-operations' phase.

A further overrun? I notice that the planning in the briefing material from NATS has kept the ATCO training and acceptance phase roughly constant at one year, even during the initial development overruns. It is usual in software engineering that many problems are discovered by the users. If a 36-month development project has overrun by a further 75% because of problem resolution, one may expect at least a similar phase of problem discovery and resolution during the `training' phase, which could be considered to be the system `shakedown' phase. If it has taken the contractor almost twice as long as planned to get the first version of the system running, what are the reasons for not budgeting (at least) twice as long for the contractor to resolve the issues that arise during shakedown? What are the assumptions that were made when planning this phase? Not allowing for expanded shakedown seems to contradict general experience with complex systems that have experienced difficulties during development.

Evaluating termination: Finally, some words concerning the option of terminating development and starting afresh. Fletcher advised me of some essential issues, which I would like to list briefly to put some more flesh on the bones of this option.

Upon termination, who would retain the legal rights to the project data and documentation?
What has been/can be delivered of the software that's been developed?
Is the delivered part extensible, or otherwise workable by a third party? (NAV CANADA was prepared to take over some final development themselves if necessary. This helped during contractor negotiations.)
Crucial for reworkability are the methods used to develop the software: for example object-oriented? in Ada? Unix-based?; and also the documentation and progress measurement: were formal mathematical methods used? has the software been assessed using mathematical complexity measures (McCabe, `Software Science', Function Points, what have you - the UK has some of the world's experts in measurement of software)? what were the coding standards and how were they enforced? (I should emphasise here that this list does not at this point constitute any sort of recommendation on my part; rather, it constitutes examples of the sort of information one hopes could be gathered.)
What are the implications on the existing system of a further multi-year delay in system deployment? Can LATCC functionality and MEDIATOR be somehow extended to cover the new development period?
What is the social or `political' environment amongst the ATCOs? Could they work with another long delay? Do they have grounds for believing that the second new system would be more usable and trustworthy than the first version?

There are other broader political implications (impact on employment, etc) on which I do not wish to comment, but which the Subcommittee will undoubtedly wish to consider also.

To conclude: I have prepared notes on the NATS September 1997 briefing material, and have more briefing material on CAATS, as well as my own expertise, which I am able to share with the Subcommittee if required.

I am able and willing to give evidence before the Subcommittee should you consider it helpful. Thank you for the opportunity to contribute to this importance and consequential process.

Sincerely,
Peter B. Ladkin