ShoeMoney recently asked "What if you could get a list of every [key]word your competitor has been bidding on?"
ShoeMoney (whose blog I like) demonstrated this in his comments, showing lists of keywords that send traffic to a wide variety of sites. A poster on the SEM2 mailing list wondered how this could be done.
I can think of 4 scenarios:
I seriously doubt that anyone has hacked into AdWords yet, but it is a lucrative target, and it will happen someday. It's not a business model, though, nor the type of thing someone would promote.
Scraping Google probably isn't that sustainable on a large scale, but it would work reasonably well for a narrow set of sites (i.e. a small number of competitors). GoogSpy apparently works by scraping.
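To make the scraping approach concrete, here's a minimal sketch (hypothetical, not GoogSpy's actual method): for each keyword you care about, fetch the results page and do a crude check for a competitor's domain. The keyword list, domain, and polling interval are all assumptions, and Google actively blocks automated queries and changes its markup, so treat this as illustrative only:

```python
import time
import urllib.parse
import urllib.request

# Hypothetical inputs -- a competitor domain and candidate keywords to probe.
COMPETITOR = "example.com"
KEYWORDS = ["running shoes", "discount sneakers", "marathon training"]

def serp_html(query):
    """Fetch the raw results page for a query. (Google blocks obvious bots;
    real scrapers rotate IPs and user agents -- this sketch doesn't.)"""
    url = "https://www.google.com/search?q=" + urllib.parse.quote_plus(query)
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

for kw in KEYWORDS:
    html = serp_html(kw)
    # Crude presence check -- a real tool would parse the ad blocks apart
    # from organic results, but ad markup changes constantly.
    if COMPETITOR in html:
        print(f"{COMPETITOR} appears on the results page for: {kw!r}")
    time.sleep(5)  # be polite; hammering the SERP gets you blocked fast
```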
Toolbars - like the Yahoo toolbar, the Google toolbar, and hundreds of others - sit in your browser and send data to a server about every page you visit. They can provide a huge amount of valuable data about browsing habits, and in aggregate that data could be used to do very sophisticated targeting. Getting the keywords people typed in, or the keywords used to get from an ad to a site, would be a simple task with access to enough toolbar clients.
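For example, a toolbar backend that receives referrer URLs could pull the typed keywords out with a few lines of parsing. Google and Yahoo did carry the query in the q and p parameters respectively, but the per-engine table and the sample URL here are just a sketch:

```python
from urllib.parse import urlparse, parse_qs

# Query-string parameter that carries the search terms, per engine
# (q for Google, p for Yahoo; anything beyond that is an assumption).
QUERY_PARAMS = {"google.com": "q", "yahoo.com": "p"}

def keywords_from_referrer(referrer):
    """If the referrer is a search-engine URL, return the typed keywords."""
    parsed = urlparse(referrer)
    for engine, param in QUERY_PARAMS.items():
        if engine in parsed.netloc:
            values = parse_qs(parsed.query).get(param)
            return values[0] if values else None
    return None

# A toolbar client would report a URL like this for every page view:
print(keywords_from_referrer(
    "http://www.google.com/search?q=cheap+running+shoes"))
# -> 'cheap running shoes'
```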
Finally, ISP proxy logs - which I think are the most likely source of ShoeMoney's data - can be used to capture clickstreams. Hitwise uses ISP logs (along with some toolbar / panel data); its logs represent the browsing activity of 10 million users, and it charges about $25k / year for access to its tools.
So it's most likely that ShoeMoney has struck a deal with an ISP. Or at least he's getting web proxy logs somehow.
ISPs use proxies to reduce bandwidth costs. They can cache a large percentage of web page data, and serve the data from the proxy. In any case, they get to record quite a bit of clickstream data from the people accessing the web through their servers.
One of the things they record for each HTTP request is the referrer (the Referer header) - a string that, for a click from a search engine, often carries the full search URL, query string included.
By parsing those logs and correlating clicks on ads / SERPs with destination pages, you can get a good idea of the keywords advertisers are buying to drive traffic to their sites, and the search strings people type to reach websites.
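Here's a rough sketch of that correlation, assuming a combined-log-style proxy log; real proxy formats vary, so the regex and field layout are assumptions, and note that ad clicks often route through redirect URLs (e.g. googleadservices.com), so separating paid from organic clicks takes extra handling:

```python
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Combined-log-style line: ... "GET http://dest/ HTTP/1.1" 200 1234 "referer" "agent"
# Treat this regex as an assumption about the log format.
LINE_RE = re.compile(r'"(?:GET|POST) (?P<url>\S+) HTTP/[\d.]+" \d+ \S+ "(?P<ref>[^"]*)"')

def keyword_site_pairs(log_lines):
    """Yield (keyword, destination_domain) for clicks coming from a Google search."""
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        ref = urlparse(m.group("ref"))
        if "google." not in ref.netloc:
            continue
        q = parse_qs(ref.query).get("q")
        if q:
            yield q[0], urlparse(m.group("url")).netloc

# Aggregate over the whole log to see which keywords drive traffic where.
sample = ['1.2.3.4 - - [01/Jan/2007:00:00:00] '
          '"GET http://www.example.com/shoes HTTP/1.1" 200 512 '
          '"http://www.google.com/search?q=cheap+running+shoes" "Mozilla/5.0"']
print(Counter(keyword_site_pairs(sample)))
# -> Counter({('cheap running shoes', 'www.example.com'): 1})
```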
In a similar vein, the AOL dataset - the one that caused a privacy kerfuffle when AOL Research released it - could be used to derive this type of keyword / site mapping.
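The released AOL files were tab-separated with columns AnonID, Query, QueryTime, ItemRank, and ClickURL, so the same mapping falls out of a few lines (column layout as published; the filename below is a placeholder):

```python
import csv
from collections import Counter
from urllib.parse import urlparse

# The public AOL files were tab-delimited: AnonID, Query, QueryTime, ItemRank, ClickURL.
pairs = Counter()
with open("user-ct-test-collection-01.txt") as f:   # placeholder filename
    for row in csv.DictReader(f, delimiter="\t"):
        if row.get("ClickURL"):                      # only rows where the user clicked
            pairs[(row["Query"], urlparse(row["ClickURL"]).netloc)] += 1

# Top keyword -> site pairs, i.e. which queries send traffic to which domains.
for (query, site), n in pairs.most_common(20):
    print(f"{n:6d}  {query!r} -> {site}")
```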
Of course, it's pretty hard to ensure that the results from processing millions of log lines are accurate. There are lots of variations and limitations in log processing (stripped referrers, redirects, odd URL encodings), so you shouldn't expect to recover every keyword with high accuracy. Furthermore, some keywords an advertiser buys will never be clicked at all by the users in the clickstream you're processing, so those words won't show up in the final results.
Google, Yahoo and the other big sites have a TON of data like this. And they could do it incredibly accurately - not just for their own customers, but for customers of other search engines. In other words, they "see" a huge number of referrers.
I think it'd be interesting, and a good thing, if Google / Yahoo somehow provided this data more transparently.