What could be more innocuous than a search engine interface? They’re simple, clean, uncluttered, with that empty text box almost welcoming your words. Type anything, anything at all; your fears, your wants, your worries – we’ll find you something, and no one else will ever know…
Don’t be fooled.
On August 4, 2006, US internet service provider America Online (AOL) released a collection of 21 million search queries from 657,426 customers. The searches had been made over a three-month period. The data, with identifying details replaced by numbers, contained the search strings entered, the date and time, and the websites subsequently visited – if any.
So far, so dull. The whole 2.1GB, 30 million-line text file was placed on the web for search researchers to hone their skills, test data extraction tools and investigate new search methods. But the content proved to be dynamite. Instead of quiet kudos from academia, AOL’s release was greeted with uproar, outrage, legal threats and demands to toughen US privacy laws. Two of the researchers, along with the company’s Chief Technology Officer, were out of their jobs within weeks, and the data – hastily withdrawn just three days after release – has been copied and mirrored around the internet ever since.
But surely anonymised data can’t do any harm? Wrong. Within days of its release the New York Times had tracked down a 62-year-old Georgia resident based solely on her searches. Wired News identified a 14-year-old boy, while another woman was identified by an outside party and warned that the logs revealed sensitive financial information about her.
What surprised outsiders was how little data it takes to sift out individuals from the faceless millions. Take user 19069577 who, on April 3, 2006, searched for “oregon lottery” and the following day on “56k internet connection aol”. There’s a 10-day gap before the next query: “hi from new zealand – introductions and greets” and that’s followed by “pig hunting starts in kinloch forest may 7”. A fortnight later, holiday presumably over and back in civilisation, they were looking for “aol broadband” and “www.workinginoregon”.
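To see why so little data is enough, it helps to picture what "sifting" actually involves. The sketch below is a minimal illustration, not a description of how any journalist or researcher worked. It assumes a tab-separated log with the hypothetical column names `AnonID`, `Query` and `QueryTime`. The query strings for user 19069577 are the ones quoted above; the times of day, the April 14 date (inferred from the 10-day gap) and the second user (ID 42) are invented for the example.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only. Column names are an assumption, the times of day
# and the April 14 date are placeholders, and user 42 is made up to show
# that other people's queries are interleaved in the same file.
SAMPLE_LOG = (
    "AnonID\tQuery\tQueryTime\n"
    "19069577\toregon lottery\t2006-04-03 00:00:00\n"
    "42\tweather forecast\t2006-04-03 00:00:00\n"
    "19069577\t56k internet connection aol\t2006-04-04 00:00:00\n"
    "19069577\tpig hunting starts in kinloch forest may 7\t2006-04-14 00:00:00\n"
)


def queries_by_user(log_text):
    """Group every query under its pseudonymous ID, oldest first."""
    reader = csv.DictReader(
        io.StringIO(log_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    profiles = defaultdict(list)
    for row in reader:
        profiles[row["AnonID"]].append((row["QueryTime"], row["Query"]))
    for history in profiles.values():
        history.sort()  # timestamps in this format sort chronologically as text
    return dict(profiles)


profiles = queries_by_user(SAMPLE_LOG)
for when, query in profiles["19069577"]:
    print(when[:10], query)
```

The point of the sketch is that replacing a name with a number does nothing to stop this grouping step: the number ties every query from one person together, and the queries themselves supply the location, the habits and the holiday plans.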
In just a handful of entries we learn that user 14994857 is keen on basketball, snorkelling, “ultimate fighting” and photography, has a 2002 Acura RSX in need of a new rear strut and sought information on “obese children with sleep apnoea”.
Some users typed in their names to see if they appear on the web. Others, worried they might be subject to identity theft, searched on their own credit card or Social Security Numbers, unaware that in doing so they’d made themselves almost inevitable targets.
There’s also the seedy side. User 6120607 intermingles queries about “church pulpits”, “youth group bible lessons” and “bible facts” with hunts for “whiteslavegirls” and “anal creampies”, while 22381665 showed an interest in suicide and “very young incest”. User 17556639 asked “how to kill your wife”.
Then there’s the infamous User 927:
The queries start harmlessly enough. Sure, user 927 has some medical problems (“heal time for broken legs,” “human mold,” “mold on humans,” “skin mold”) but who has the time these days to keep themselves entirely fungi-free?
But things quickly take a turn for the worse with the sudden appearance of “dog sex” at 9:28 PM one evening. Half an hour later, the queries are about flowers (“anemona,” “arbutus,” “aster,” “pink camellia”), which lasts until 2 AM the next morning, and all appears well again. The following day, “forced rape porn” makes an appearance. “Testicle festivals” follows soon after. “Hentai pedofilia,” “bdsm electricity,” and “tormented elmo” (?) are entered. Things go downhill from there, getting downright unprintable (let’s just say that incest, torture, and urine are involved), until we run across the not-amusing-at-all “cut into your trachea.”
Seriously, who is this person?
Many users were (and still are) surprised to learn that search queries are even recorded. Google, for one, keeps this data indefinitely. Why? Because people confide more in that innocuous-looking search box than they do in close friends or relatives. It gives whoever holds the data an insight into who you really are – and they can sell that insight to advertisers.
Although AOL pulled the data a few days after its release, the file was copied and mirrored around the internet and is still available in a variety of convenient download formats. The Internet Archive’s copy is here.