Category Archives: Testing

Entomosemantics, or, how to talk about bugs

One of the skills they pay me the big bucks, or rather the medium-sized Euro, for at work is assessing the risks of changes going into production. To do it, I’ve become pretty good at evaluating the system that is being changed.

I could snow you with talk of checklists, metrics, and charts, but really, my most valuable analytical tools are my pattern-matching wetware and my experience. With those two things, I can usually describe the current state of the system and estimate its chances of going horribly wrong in the near future, just based on gut feel.

Below are my private terms for the various states of computer system health. I use different ones in official reporting. Usually.

  • clean: The system runs smoothly, with no visible bugs. I read the logs to calm down after stressful meetings.
  • stable: There are the occasional interface bugs, but the thing runs reliably. It feels like a melon you tap in the supermarket and decide to buy.
  • scruffy: Most users hit some kind of bug or another, but they can make it work most of the time. Regular users have workarounds the way commuters have rat-runs that avoid traffic blackspots.
  • buggy: This is when users begin to see the bugs they encounter as a pattern rather than individual occurrences. They start to wonder if the pattern of bugs indicates a deeper unreliability. They’re right to.
  • brittle: Bugs aside, it pretty much works…right up to the point where it shatters into little tiny pieces.
  • fragile: It falls over a lot. Ops can pretty much always get it back up again in a reasonable time. We spend a lot of time apologizing.
  • fucked: It’s broken. Again. Fortunately, we have backups, and we’re fairly sure they’ll work.
  • comprehensively fucked: The backups didn’t work. Shark time.

Entropy tells us that, barring intervention, systems tend to move down this sequence. But it’s not a linear progression. For instance, brittle and fragile are parallel routes to fuckedness. They’re basically two different failure modes: the Big Bad Bang and Death by a Thousand Cuts.
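
One way to picture that non-linear progression is as a small directed graph. This is only a sketch: the edges, and in particular the guess that buggy is where the road forks, are an assumption made for illustration rather than anything the list spells out.

```python
# Hypothetical model of the health states as a directed graph.
# The edges are assumed for illustration; only the "two parallel
# routes through brittle and fragile" part comes from the text.
DECAY = {
    "clean": ["stable"],
    "stable": ["scruffy"],
    "scruffy": ["buggy"],
    "buggy": ["brittle", "fragile"],  # assumed fork
    "brittle": ["fucked"],
    "fragile": ["fucked"],
    "fucked": ["comprehensively fucked"],
    "comprehensively fucked": [],
}


def routes(start, end, graph=DECAY):
    """Return every downhill path from start to end, depth-first."""
    if start == end:
        return [[start]]
    found = []
    for nxt in graph[start]:
        for tail in routes(nxt, end, graph):
            found.append([start] + tail)
    return found


if __name__ == "__main__":
    for path in routes("clean", "fucked"):
        print(" -> ".join(path))
```

Under those assumed edges there are exactly two routes from clean to fucked, and none back up again without intervention.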

The applicability of these categories to other matters is left as an exercise for the reader.

Cross-posted on Making Light, where any comments will live.

Ink, turpentine, paper, water

For at least 1500 years, Japanese artists have practiced suminagashi, the art of marbling paper with ink floating on water. The marbler uses brushes to place alternating drops of black calligraphy ink and turpentine on the surface of a full basin, then lays a sheet of paper down to capture the resulting patterns. They look like clouds, or smoke, or the grain of twisted trees. Each pattern is unique, unlike in Western marbling, where the creator can reproduce essentially the same design many times.

Ink, turpentine, water, paper. It seems so simple.

And it is very simple, but only after you accept one thing: you are not in control of the outcome. The ink goes where it wills, and the marbler can only follow. There are tricks to give the pattern an overall direction, such as controlling the amount of ink and turpentine or gently blowing over the surface of the water. But the heart of suminagashi is trusting what you can’t predict or control.

I recently read George Oates’s essay about the ways that Flickr created its community, “Community: From Little Things, Big Things Grow,” on A List Apart. Two particular paragraphs really jumped out at me:

Embrace the idea that people will warp and stretch your site in ways you can’t predict—they’ll surprise you with their creativity and make something wonderful with what you provide.

There’s no way to design all things for all people. When you’re dealing with The Masses, it’s best to try to facilitate behavior, rather than to predict it. Design, in this context, becomes more about showing what’s *possible* than showing what’s *there*.

Flickr’s history has proven her right. There are any number of wildly varying communities on the site, many of them either accidentally or deliberately experimental. Flickr groups are even cited as a case study in Here Comes Everybody, Clay Shirky’s recent book on online community dynamics.

And now it’s our turn.

Last year, my company (MediaLab, which makes a library search software package called Aqua Browser Library) released our new social library software: My Discoveries.

The essence of My Discoveries is this: allow users to add information to the library catalog. Let them tag things, make lists of related items, fill in ratings, write reviews. Then let others see what they’ve done. Turn the patron’s interaction with the library’s catalog into a conversation with the catalog, and with each other.

I’ve been involved in both the design and testing. One of the core principles we’ve kept in mind throughout the process is that we cannot predict what people will do with it [1]. Designing and testing in the light of that kind of uncertainty is very different, and much more interesting, than working to a known, restricted usage profile. It affects everything we do, from what characters are allowed in list names to which statistics we want to gather. How does one design metrics to detect the unpredictable?

Tags, lists, ratings, reviews. It seems so simple.

  1. Of course, we are not so naive as to think that all the new ideas that people come up with for My Discoveries will be good ones. I moderate a web community in my spare time, so I know how bad things can get. As a result, I have put a lot of attention into the administrative interface, and I expect to do more on it in the future. If we give users room to innovate, we have to give librarians the wherewithal to detect and clean up misbehavior.

The very unhappy path to Terminal 5

I see that the new terminal at London’s Heathrow airport is in the midst of another weekend’s disruption. Problems on the terminal’s opening weekend resulted in over 200 flights cancelled and a backlog of 28,000 bags. The chaos has already cost British Airways, the sole user of the terminal, £16m, and some estimates put the eventual cost around £50m.

Initial problems reported included the failure of either passengers or staff to find the car parks, slow security clearance for staff, consequent delayed opening of check-in desks, and multiple unspecified failures of the baggage handling systems. Once the initial failures occurred, a cascade of problems followed as passengers began to clog up the people-processing mechanisms of the terminal.

This weekend’s disruption has been blamed on “a new glitch” in the baggage handling system. I suspect that means that when they solved one set of problems they unmasked another. A spokeswoman assures us that they’re merely planning how to put an identified solution in place. Her statement doesn’t include any reference to the fact that these problems often nest, like Russian dolls, and that the new solution may uncover—or introduce—new problems.

Of course, my reaction was, “Did they test the terminal before opening it?” The errors shown include both functional errors (people can’t find the car park) and non-functional ones (the baggage system failed under load). No system is implemented bug-free, but the breadth of error type got me wondering.

Fortunately, the Beeb covered some of the testing performed before the terminal opened. Apparently, operation of the terminal was tested over a six-month period, using 15,000 people. The testing started with small groups of 30 to 100 people walking through specific parts of the passenger experience. Later, larger groups simulated more complex situations. The maximum test group used was 2,250. BAA said these people would “try out the facilities as if they were operating live.”

Do 2,250 people count as a live test? Are they numerous enough to cause the sorts of problems you’re looking for in a volume test?

I plucked a few numbers off the web and passed them through a spreadsheet. T5 was designed to handle 30 million passengers per year, which comes out to an average of 82,000 per day, or 5,000-odd per hour in the 16-hour operating day (Heathrow has nighttime flight restrictions). These are wildly low numbers, because airports have to handle substantial peaks and troughs. Say that on the busiest day you get 150% of flat average, or 7,500 people per hour. Assuming 75% of them are either arriving from or heading toward London and spend about an hour in the terminal, and the rest are stopping over for an average of 2 hours, that’s about 9,375 passengers in the terminal at a given time.

9,375 is more than 2,250. You can, however, magnify a small sample to simulate a large one (for instance, by shutting off 2/3 of the terminal to compact them into a smaller space). It’s not just a numbers game, but a question of how you use your resources.
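
For anyone who wants to check the sums, the spreadsheet boils down to a few lines. This is a rough sketch using the same round numbers; the one-hour stay for London-bound passengers is an assumption, chosen because it is what makes the total come out at 9,375.

```python
# Back-of-envelope T5 occupancy estimate, using the rough figures
# quoted in the text. LOCAL_DWELL_HOURS is an assumption: it is the
# value implied by the 9,375 total, not a published number.
ANNUAL_PASSENGERS = 30_000_000
OPERATING_HOURS = 16
PEAK_FACTOR = 1.5
LOCAL_SHARE = 0.75          # arriving from or heading toward London
LOCAL_DWELL_HOURS = 1       # assumed
TRANSFER_DWELL_HOURS = 2

per_day = ANNUAL_PASSENGERS / 365        # about 82,000
per_hour = per_day / OPERATING_HOURS     # about 5,100, rounded down to 5,000
peak_per_hour = 5_000 * PEAK_FACTOR      # 7,500


def occupancy(arrivals_per_hour):
    """People in the terminal at once: arrival rate times time spent."""
    local = arrivals_per_hour * LOCAL_SHARE * LOCAL_DWELL_HOURS
    transfer = arrivals_per_hour * (1 - LOCAL_SHARE) * TRANSFER_DWELL_HOURS
    return local + transfer


if __name__ == "__main__":
    print(f"per day:   {per_day:,.0f}")
    print(f"per hour:  {per_hour:,.0f}")
    print(f"occupancy: {occupancy(peak_per_hour):,.0f}")
```

Change the dwell times or the peak factor and the answer moves a lot, which is rather the point: the estimate is only as good as the guesses going in.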

Most of the testing documentation will of course be confidential. But I found an account of one of the big tests. I would expect that any such report was authorised by BAA, and would therefore be unrealistically rosy; they want passengers to look forward to using the new terminal. But still, the summary shocked me.

In fact the whole experience is probably a bit like the heyday of glamorous air travel – no queues, no borders and no hassle.

Any tester can translate that one. It means:

We didn’t test the queuing mechanisms, border controls, or the way the systems deal with hassled passengers.

In software terms, there is something known as the happy path, which is what happens when all goes well. The happy path is nice to code, nice to test, nice to show to management. It is, however, not the only path through the system, and all the wretched, miserable and thorn-strewn paths must also be checked. This is particularly important in any scenario where problems are prone to snowballing. (Airport problems, of course, snowball beautifully.)
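
To make the distinction concrete, here is a minimal sketch. The route_bag function, its tag format, and the flight table are all invented for illustration and have nothing to do with the real T5 systems. Note the ratio: one happy-path check, and a growing pile of unhappy ones.

```python
# Hypothetical baggage-routing function, invented purely to show the
# difference between happy-path and unhappy-path checks.
class RoutingError(Exception):
    """Raised when a bag cannot be routed to a flight."""


def route_bag(tag, flights):
    """Return the gate for the flight named on a bag tag.

    tag looks like 'FLIGHT/BAGNUMBER'; flights maps a flight code to
    a gate, with None meaning the flight has been cancelled.
    """
    if not tag or "/" not in tag:
        raise RoutingError("unreadable tag")
    flight, _, bag_number = tag.partition("/")
    if not bag_number.isdigit():
        raise RoutingError("bad bag number")
    if flight not in flights:
        raise RoutingError("unknown flight")
    gate = flights[flight]
    if gate is None:
        raise RoutingError("flight cancelled")
    return gate


def fails(tag, flights):
    """True if routing the tag raises RoutingError."""
    try:
        route_bag(tag, flights)
    except RoutingError:
        return True
    return False


FLIGHTS = {"BA123": "A10", "BA456": None}

# The happy path: everything is as the designer imagined.
assert route_bag("BA123/0042", FLIGHTS) == "A10"

# The thorn-strewn paths: each one is a different way to go wrong.
assert fails("", FLIGHTS)             # blank tag
assert fails("BA123", FLIGHTS)        # no bag number at all
assert fails("BA123/abc", FLIGHTS)    # garbled bag number
assert fails("BA999/0042", FLIGHTS)   # flight nobody has heard of
assert fails("BA456/0042", FLIGHTS)   # flight cancelled
```

Even in a toy like this, the happy path is the smallest part of the test suite.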

Based on the account I read, these testers were set up to walk the happy path. They were not paid for their labours, but were instead fed and rewarded with gifts. I’m sure food and goodie bags were cheaper than actual pay, but they dilute the honesty of the exchange. We’re animals at heart, and we don’t bite the hand that feeds us. We like people who give us presents. Getting those people—mostly British people—to act like awkward customers, simulate jet lag or disorientation, or even report problems must have been like getting water to flow uphill.

Furthermore, look at the profile of testers mentioned: an ordinary reporter and a bunch of scouts and guides. I wish I believed that the disabled, the families with cranky children, and the non-English speakers were just at another table at breakfast. But I don’t. I suspect the test population was either self-selecting, or chosen to be easy to deal with. In either case, it didn’t sound very realistic.

It’s possible that there was another test day for people who walked the unhappy path, and that it wasn’t reported. It’s possible that they did clever things, like salt the crowd with paid actors to clog up the works and make trouble, and that our reporter simply missed those incidents.

But I’ve worked on big projects for big companies, and that’s not what I’m betting. I suspect there were very good test plans, but that for reasons of cost and timing they were deemed impractical. So compromises were sought in large meetings with mediocre biscuits. Gantt charts were redrawn late at night using vague estimates that were then taken as hard facts. Tempers were lost, pecking orders maintained. People assured each other that it would be all right on the night.

It wasn’t.

I wish I believed that the next time someone does something like this, they’ll learn the lessons from the T5 disaster. But that’s happy path thinking, and I’m a tester. I know better.

OAT Completion Report: Secret Sonnet

A lesson to be learned from OAT
is that the planning which assumes a test
will only run just once requires the best
environmental outcome, that there’ll be
no faults to find, and that the personnel
will be available to run as planned.
This doesn’t happen – often tests are canned,
the system breaks, or scripts aren’t running well.
Each test should be assumed to run at least
two times, with some days left aside to do
investigations, and to test the new
code fixes some before they are released.
It’s no good planning that we’ll hit a date
if known retesting means that we’ll be late.

Originally written for work, in a continuous paragraph rather than broken into lines.

Akron and the Abi Field

When the going gets tough at work (as it is now), I often wonder why I do what I do. This is one of the little stories that remind me why I am a software tester.

Martin works for SkyScanner, a flight pricing site. He was testing out some code one evening, a couple of months ago, and ran into the sort of frozen-brain feeling you get after too long at the keyboard. So he pushed his wheely chair back from his desk, into my line of sight.

“Bun,” he said, “Name me two destinations. Just any cities.”

“Düsseldorf,” I replied, “and Akron, Ohio.”

“Thanks,” he said, and wheeled back to his desk to fiddle with the new test data. taptaptap. “[insert curse word].” taptaptap. “[insert worse curse word].” taptaptap.

I looked up as he rolled back into my line of sight, looking exasperated. “How do you do that?”

Turns out that Akron, Ohio, USA, is served by two airports, Akron and Akron Canton. And some clever soul, somewhere in the ancestry of the data they were working with, had remapped Akron Canton to Guangzhou, in China. That was giving him some…funny results.

So they had to go clean up their data. And I remembered why I’m a software tester.
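
A sanity check for that sort of thing might look like the sketch below. The records, field names, and country codes are made up, and I am assuming the bad remap would show up as one city's airports disagreeing about which country they are in; I have no idea what the real data looked like.

```python
# Hypothetical airport records, invented for illustration; the field
# names and codes do not come from any real data feed.
from collections import defaultdict

AIRPORTS = [
    {"name": "Akron", "city": "Akron", "country": "US"},
    {"name": "Akron Canton", "city": "Akron", "country": "CN"},  # the bad remap
    {"name": "Düsseldorf", "city": "Düsseldorf", "country": "DE"},
]


def cities_spanning_countries(airports):
    """Return {city: sorted country codes} where a city's airports disagree."""
    seen = defaultdict(set)
    for airport in airports:
        seen[airport["city"]].add(airport["country"])
    return {
        city: sorted(countries)
        for city, countries in seen.items()
        if len(countries) > 1
    }


if __name__ == "__main__":
    for city, countries in cities_spanning_countries(AIRPORTS).items():
        print(f"{city}: airports in {', '.join(countries)}")
```

A check like this only produces candidates for a human to look at, since a few cities really are served by an airport across a border. It also would not have occurred to anyone to write it until somebody said "Akron".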


A couple of months ago, I took a somewhat less than fun exam on software testing.

So last week I got the results.

84%. A pass, with distinction.

So, dear people, what do you think my reaction was?

A. Yay! I passed!
B. Meh. It’s just an exam.
C. Where did I lose 16 whole marks?
D. All of the above, in turn.

Answers on a postcard, please.

How To Break Things Real Good

Martin has been absent because he’s been redesigning his side of the site. (Go check it out. It’s cool.) I’ve been absent for much less interesting* reasons.

Basically, I’ve been studying for a test. About testing. The Information Systems Examination Board (ISEB) Practitioner Certificate in Software Testing, or, as I think of it, How To Break Things Real Good.

After eight days of classroom instruction spread over two weeks, I had less than a month to cram the syllabus in between my ears (Only click on the link if you have persistent insomnia. Not suitable for reading whilst operating heavy machinery*). I did it – I can now go on at great length about the relative strengths of boundary value analysis and state transition testing in the design of functional tests, name 18 types of automated test tool, and describe three software development lifecycle models and how they relate to testing.

I wasn’t a very good classmate, I’m afraid. I got massively insecure early on in the instruction section, when I came in on the second week to find that someone extra had turned up and taken my seat and my course materials. The instructor was mortified, but I felt deeply unwelcome, and turned to the same obnoxious behaviour I used to get through high school. When I feel out of place, I become the most annoyingly, articulately intelligent pain in the posterior ever…trying to prove that separate does not equal inferior, I guess.

I did this throughout the second week of classes, and only got worse in the revision session. I even straightened the instructor out on his understanding of one area of the syllabus. Yes, I was right and he was wrong. But that doesn’t make it less obnoxious**. I hope I made up for it a little with some of the tutoring I did on the side.

The exam was a pig, but I knew it would be. I think I did OK, on balance, though I won’t know for a couple of months. The pass mark is 60%, and if I get over 80% I get a distinction. (Which is, in a small community, considered rather cool.) I’ll be content to pass.***

I promise, now that I’m done with that, I’ll post to the blog again. I’ll even go back and pick out the best photos I took over that time, tell you about the time Fionaberry did a face plant at full speed running downhill, and even update my cinnamon roll recipe. Promise.

* I don’t think it’s boring. But I know everyone else does.

** Peter, if you’re reading this, I am sorry.

*** This is a lie. I would be marginally content to hear that I got 100%. I’ll gnash my teeth over every missed point. I know I missed at least 7 marks, and it’s driving me nuts.