I see that the new Terminal 5 at London’s Heathrow airport is in the midst of another weekend’s disruption. Problems on the terminal’s opening weekend resulted in the cancellation of over 200 flights and a backlog of 28,000 bags. The chaos has already cost British Airways, the sole user of the terminal, £16m, and some estimates put the eventual cost at around £50m.
The initial problems reported included passengers and staff failing to find the car parks, slow security clearance for staff, the consequent late opening of check-in desks, and multiple unspecified failures of the baggage handling systems. Once the initial failures occurred, a cascade of problems followed as passengers began to clog up the people-processing mechanisms of the terminal.
This weekend’s disruption has been blamed on “a new glitch” in the baggage handling system. I suspect that means that when they solved one set of problems they unmasked another. A spokeswoman assures us that they’re merely planning how to put an identified solution in place. Her statement doesn’t include any reference to the fact that these problems often nest, like Russian dolls, and that the new solution may uncover—or introduce—new problems.
Of course, my reaction was, “Did they test the terminal before opening it?” The errors reported include both functional failures (people can’t find the car park) and non-functional ones (the baggage system failed under load). No system is implemented bug-free, but the breadth of error types got me wondering.
Fortunately, the Beeb covered some of the testing performed before the terminal opened. Apparently, operation of the terminal was tested over a six-month period, using 15,000 people. The testing started with small groups of 30–100 people walking through specific parts of the passenger experience. Later, larger groups simulated more complex situations. The largest test group used was 2,250 people. BAA said these people would “try out the facilities as if they were operating live.”
Do 2,250 people count as a live test? Are they numerous enough to cause the sorts of problems you’re looking for in a volume test?
I plucked a few numbers off the web and passed them through a spreadsheet. T5 was designed to handle 30 million passengers per year, which works out to an average of 82,000 per day, or 5,000-odd per hour across the 16-hour operating day (Heathrow has nighttime flight restrictions). These averages understate the real load, because airports have to handle substantial peaks and troughs. Say that on the busiest day you get 150% of the flat average, or roughly 7,500 people per hour. Assume 75% of those are either arriving from or heading toward London and spend about an hour in the building, while the rest are stopping over for an average of 2 hours: that gives about 9,375 passengers in the terminal at any given time.
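For anyone who wants to check the arithmetic, here is the same back-of-the-spreadsheet calculation as a short Python sketch. Every input is one of the assumptions above (the 150% peak factor, the 75/25 passenger split, the dwell times), not measured data; the occupancy figure is just Little’s law (people in the building = arrival rate × average time each person spends inside).

```python
# A re-run of the spreadsheet arithmetic above. Every input is an
# assumption from the text, not measured data.

ANNUAL_PASSENGERS = 30_000_000   # T5 design capacity
OPERATING_HOURS = 16             # nighttime flight restrictions at Heathrow

per_day = ANNUAL_PASSENGERS / 365       # ~82,000 per day
per_hour = per_day / OPERATING_HOURS    # ~5,100 per hour, averaged flat
peak_per_hour = 7_500                   # rounded 150% peak, as in the text

# Occupancy by Little's law: people in the building =
# arrival rate x average time each person spends inside.
od_passengers = 0.75 * peak_per_hour * 1.0   # to/from London, ~1 hour dwell
transfers     = 0.25 * peak_per_hour * 2.0   # stopovers, ~2 hour dwell

print(f"{per_day:,.0f} per day, {per_hour:,.0f} per hour on average")
print(f"~{od_passengers + transfers:,.0f} in the terminal at peak")
```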
9,375 is more than 2,250. You can, however, magnify a small sample to simulate a large one (for instance, by shutting off two-thirds of the terminal to compact the testers into a smaller space). It’s not just a numbers game, but a question of how you use your resources.
Most of the testing documentation will of course be confidential. But I found an account of one of the big tests. I would expect that any such report was authorised by BAA, and would therefore be unrealistically rosy; they want passengers to look forward to using the new terminal. But still, the summary shocked me.
“In fact the whole experience is probably a bit like the heyday of glamorous air travel – no queues, no borders and no hassle.”
Any tester can translate that one. It means:
“We didn’t test the queuing mechanisms, border controls, or the way the systems deal with hassled passengers.”
In software terms, there is something known as the happy path, which is what happens when all goes well. The happy path is nice to code, nice to test, nice to show to management. It is, however, not the only path through the system, and all the wretched, miserable and thorn-strewn paths must also be checked. This is particularly important in any scenario where problems are prone to snowballing. (Airport problems, of course, snowball beautifully.)
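To make that concrete, here is a minimal sketch, with entirely hypothetical names, of the gap between a happy-path test and the unhappy-path tests that also need writing:

```python
# A minimal sketch (hypothetical names throughout) of happy-path
# versus unhappy-path tests for a single check-in step.

def check_in(passenger, bag_system_up=True, desk_open=True):
    """Hypothetical check-in step: returns a boarding pass or raises."""
    if not desk_open:
        raise RuntimeError("check-in desk not yet open")
    if not bag_system_up:
        raise RuntimeError("baggage system rejected the bag")
    return f"boarding pass for {passenger}"

# Happy path: everything works. Nice to write, nice to demo.
assert check_in("Smith") == "boarding pass for Smith"

# Unhappy paths: the thorn-strewn routes that also need checking.
for fault in ({"desk_open": False}, {"bag_system_up": False}):
    try:
        check_in("Smith", **fault)
    except RuntimeError as err:
        print(f"handled failure: {err}")  # system must degrade gracefully
```

The happy path is the single assert; everything after it is the wretched and miserable part, and in any system where failures cascade, that is where most of the test effort belongs.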
Based on the account I read, these testers were set up to walk the happy path. They were not paid for their labours, but were instead fed and rewarded with gifts. I’m sure food and goodie bags were cheaper than actual pay, but they dilute the honesty of the exchange. We’re animals at heart, and we don’t bite the hand that feeds us. We like people who give us presents. Getting those people—mostly British people—to act like awkward customers, simulate jet lag or disorientation, or even report problems must have been like getting water to flow uphill.
Furthermore, look at the profile of testers mentioned: an ordinary reporter and a bunch of scouts and guides. I wish I believed that the disabled, the families with cranky children, and the non-English speakers were just at another table at breakfast. But I don’t. I suspect the test population was either self-selecting, or chosen to be easy to deal with. In either case, it didn’t sound very realistic.
It’s possible that there was another test day for people who walked the unhappy path, and that it wasn’t reported. It’s possible that they did clever things, like salt the crowd with paid actors to clog up the works and make trouble, and that our reporter simply missed those incidents.
But I’ve worked on big projects for big companies, and that’s not what I’m betting. I suspect there were very good test plans, but that for reasons of cost and timing they were deemed impractical. So compromises were sought in large meetings with mediocre biscuits. Gantt charts were redrawn late at night using vague estimates that were then taken as hard facts. Tempers were lost, pecking orders maintained. People assured each other that it would be all right on the night.
It wasn’t.
I wish I believed that the next time someone does something like this, they’ll learn the lessons from the T5 disaster. But that’s happy path thinking, and I’m a tester. I know better.