AOL has released a data set of search queries on their Labs site. The immediate reaction in the blogosphere was a massive outcry of "Privacy violation!" Here's famed blogger Mike Arrington on TechCrunch:
AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.
I think this is a massive overreaction. Looking at the dataset, it's not at all clear that it can be tied to an individual's IP address, much less their name, address, etc.
I'm surprised at the knee-jerk reaction. I think AOL's research team is doing something useful: sharing data so that search can be studied and improved.
Here's a sample of the data:
220 telephone directory ridgeville south carolina 2006-04-16 12:29:29
220 florida atlantic university 2006-04-16 15:57:58
220 florida international university 2006-04-20 06:18:32 5 http://hospitality.fiu.edu
220 house plans 2006-04-21 21:37:37
220 house plans 2006-04-22 04:48:43
220 house plans 2006-04-22 04:50:16
220 house plans 2006-04-22 08:58:27
220 windstorm insurance 2006-04-22 15:33:35 3 http://www.windnetwork.com
220 windstorm insurance 2006-04-22 15:33:35 9
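To make the layout of that sample concrete, here is a minimal Python sketch that splits one record into its parts. It assumes one record per line, laid out as a numeric ID, the query text, a timestamp, and then an optional result rank and clicked URL. The field names (user_id, query, time, rank, url) and the parse_record helper are my own labels for illustration, not anything taken from AOL's documentation.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: <id> <query text> <YYYY-MM-DD HH:MM:SS> [<rank> [<url>]]
RECORD = re.compile(
    r"^(?P<user_id>\d+)\s+"
    r"(?P<query>.*?)\s+"
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r"(?:\s+(?P<rank>\d+))?"
    r"(?:\s+(?P<url>\S+))?\s*$"
)


class Record(NamedTuple):
    user_id: str          # the random ID that replaced the AOL username
    query: str            # what was typed into the search box
    time: str             # timestamp of the query, kept as text
    rank: Optional[int]   # position of the clicked result, if any
    url: Optional[str]    # clicked URL, if any


def parse_record(line: str) -> Optional[Record]:
    """Parse one record; return None if the line does not fit the assumed layout."""
    m = RECORD.match(line.strip())
    if m is None:
        return None
    rank = m.group("rank")
    return Record(
        m.group("user_id"),
        m.group("query"),
        m.group("time"),
        int(rank) if rank is not None else None,
        m.group("url"),
    )
```

Calling parse_record on each line of the sample gives one Record per search, with rank and url left as None when no result was clicked.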
I don't see how Arrington's claims make any sense. His notion that the data can be easily analyzed and will "often lead people to easily determine who the user is" is strange. It's certainly not easy to tie a query sequence to an individual, and it's nothing that any web site owner couldn't already attempt with their own web logs.
Could the data have a social security number in it? Of course it could, but what does that mean? Is that a violation of privacy for someone? How many people type their own social security numbers into web search engines? I'll try to look for them in the dataset and report back.
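One rough way such a check could be run is to scan the query text for anything shaped like a social security number. The Python sketch below assumes the queries are available as plain strings; find_ssn_like is a hypothetical helper name. The pattern only matches the 3-2-4 digit shape (with hyphens, with spaces, or as nine bare digits), so it can flag unrelated nine-digit numbers, and a hit says nothing about whose number it is or whether it is a real one.

```python
import re

# Three digits, two digits, four digits, with a consistent separator
# (hyphen, space, or nothing) and no extra digits on either side.
SSN_LIKE = re.compile(r"(?<!\d)\d{3}([- ]?)\d{2}\1\d{4}(?!\d)")


def find_ssn_like(queries):
    """Return the queries that contain an SSN-shaped string."""
    return [q for q in queries if SSN_LIKE.search(q)]


# Made-up example queries; 123-45-6789 is a dummy value, not real data.
sample = [
    "house plans",
    "123-45-6789 lookup",
    "call 800-555-1234",
    "windstorm insurance",
]
matches = find_ssn_like(sample)
```

Counting the matches against the total number of queries would give a first, crude answer to how often this actually happens.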
In the end, I think the people decrying this type of data release have somehow over-defined privacy to mean something rather imaginative and nonsensical.