Wednesday, January 25, 2006

Private Information Retrieval

Google, MSN and Yahoo were recently served with subpoenas for search data from their databases. The background, from an article by Columbia professor Tim Wu in Slate:
Back in the 1990s, Congress passed a succession of laws designed to keep porn off the Internet. Those laws didn't work—largely because the courts kept striking them down as violations of the First Amendment. Sick of losing and finding themselves back in the district court defending a reconfigured version of the anti-porn bill (now named the "Child Online Protection Act"), lawyers in the Justice Department decided to hire Berkeley professor Philip Stark. Stark's assignment: Use statistics to show what everyone already knows—that there's an awful lot of porn on the Internet.

Stark's plan is to demand that Google and the other major search engines supply him, and thus the government, with a random selection of a million domain names available for search and a million sample user queries. He will then demonstrate how truly nasty the Internet is, and in particular, just how hard it is to block out stuff that's "Harmful to Minors." Faced with this third-party subpoena, Microsoft and Yahoo! agreed to supply Stark with this information, but Google refused, calling the subpoena request "overbroad, unduly burdensome, vague, and intended to harass."
He goes on to make the point that the problem is not that Google et al were slapped with subpeonas, but that they had the data lying around in the first place. His suggestion: get rid of tell-tale information:
That's why the public's demand must be of Google—not the state. It should be that Google please stop keeping quite so much information attached to our IP addresses; please modify logging practices so that all identifying information is stripped. And please run history's greatest "search and delete," right now, and take out the IP addresses from every file that contains everyone's last five years of searches.
Coincidentally, I was chatting with Piotr Indyk the other day at Penn, and he made the very valid point that in a sense, theoretical computer science is already on the case, and has been studying this problem for a very long time. The field of private information retrieval is precisely about this:

Can you make queries from a database without the database knowing what you want ?
It seems impossible (if you don't already know the answer). But here's a silly solution:
Ask for all the data from the database ! There's no way the database will know what you really wanted.
The solution is indeed silly, but why exactly is it ? Well, because you'd need to extract all the n bits of information from the database [1] to get the answer bit [2] for one query So if we think of "stuff the database returns" as a bounded resource, can we do better ?

Well, as you might have guessed, if you want to be deterministic, then not really. But if you can toss coins when asking your queries, while still guaranteeing that the answer is always correct,
and if you assume that you are really querying multiple non-communicating copies of a database (not so unrealistic), then you can do a lot better. For example, if you wanted to know the value of a single bit (the "query") of a string of n bits (the "database"), you could pick a random subset of the string, and ask the first database for the XOR of the bits. Then XOR the index you want with the set, and ask the second database for a similar XOR. You can XOR the two answer to get your bit, and the databases are none the wiser.

Now this really doesn't save you much, since you'd send n bits in total (on average) to the servers, but the scheme can be generalized to k servers, in which case you'll end up with a total transmission of something like k log k n^(1/log k), which is significantly sublinear.

There are many variations: how smart can the servers be at trying to decipher your answers ? (in the above example, even an all powerful server can't do a thing) ? How much can the servers talk to each other ? what kinds of queries can you ask ? and so on.

One might argue that assuming the servers can't communicate with each other is an unreasonable restriction. Not so really, especially if you consider that it might be in Google's interests to facilitate this kind of querying [3]. Another interesting question that I don't think has been studied would be: can you construct effective link graph structures and search results without needing to retain any information about the pages themselves ?

Piotr made the interesting point that there's a lot of scope here for a company that wants to innovate in the search space: "Search with our engine! We have no clue what you're looking for, and we'll find it anyway!". In fact, there's a web page on practical PIR maintained at the Institut für Informatik at the Humboldt-Universität zu Berlin by Prof. Johann-Christoph Freytag and Dmitri Asonov.

Since the business of theoryCS these days is PR, it's worth pointing out that privacy is rapidly becoming a major area of concern, and work springing out of theoretical computer science is ideally placed to have an impact here. Just sayin'....

I drew my sources from the wonderful survey on PIR written by Bill Gasarch (who's current guest blogging at the Complexity Blog) for Lance Fortnow's Complexity Column. He also has a companion web page, and the two are a valuable resource of information on this topic.

(HT: BoingBoing)

[1] Theoretical computer scientist love referring to all sizes as 'n'. If there are more, we will often use 'k', 'm', and the often scorned 'U".

[2] The one thing we love more than calling things 'n' is designing questions with YES/NO answers.

[3] There is another post waiting to be written on why it's NOT in Google's best interests to help make private information retrieval a reailty. Such a post would have a heavy dose of phrases like "Web 2.0", "AdSense", "AdWords", "revenue streams" and "targeted search" and would bore everyone to tears.

Categories:

1. There are several reasons for Google to keep around search history and click-through information for individual IP addresses. But I believe this information can be summarized into a few hundred profiles (totally arbitrary estimate). Each user would fit into a profile (or a prob. distribution over profiles) and these profiles should give a reasonable approximation to a large enough fraction of users. Google can train these profiles over past data and then just throw the old data out.

Let's also say that Google doesn't keep information about which user belong to which group. This information belongs with the user. When I log in to personalized search, my computer sends my group ID to Google and its servers respond accordingly. If I wish to search for something secretly, I turn off personalized search. (Of course Google has to be trusted as far as this procedure is concerned, but I don't see why they wouldn't comply.)

As far as privacy is concerned, because Google is only keeping information that is broadly applicable (to large groups of people and not individuals), this should be acceptable (for example, w.r.t. concepts like k-anonymity, isolation, etc. that have appeared in privacy-related theoretical work recently).

Is there any reason this approach wouldn't work? I would suppose a good fraction of people would still be interested in using personalized search. Google seems to have enough data to train reasonably good models of profiles. Is there reason to believe a few profiles do not capture a large fraction of users?

Posted by Shuchi

2. Continuing on my last post, this is really an obvious approach to dealing with the issue. So either there is an obvious reason for failure, or Google (and others) are too lazy to implement it. Can anyone clarify what the problem is?

Posted by Shuchi

3. So my sense is that for these things to be used, someone has to demonstrate that they work "at scale" and don't destroy the fabled 0.13 second response time for google searches (which is now up to 2.43 seconds for "Private Information Retrieval" (measured on Fasterfox)). I am willing to believe that most theoretical solutions don't yet scale, and maybe what one needs is a couple of grad students with internshps at Google doing the needful :)

Posted by Suresh

4. But privatising the search terms doesn't quite ensure the privacy of what you search for. The feds, if they wanted to know, would ask to see your ISP's logs.

Posted by Amit C

5. Hi,

In lieu of a commentary by Google PR representative ;), here are a few thoughts on this issue.

Basically: Google advertising model is characterized by an extreme specificity - you can bid on any of a few hundred thousand (million?) of different words. One could argue that this is one of the reasons why this model has been so popular - instead of blowing a few million \$ on a Superbowl Ad, you can potentially spend much less if you discover the "right" keywords to advertise on. Google is happy as well - small revenue per each user gets multiplied by millions if everyone and their dog starts to advertise.

For this to work however, you have to keep and analyze all information you can get. Once you switch to user profiles, you are in the same league as all other ad agencies, with "soccer moms", "Nascar dads" etc etc.

So I guess the challenge is: how to maintain all this information, and yet not compromise users' privacy ? As Suresh points out, PIR is one way to do that. But this is tricky, since (by definition) any PIR system must spend linear time in n (n=few billion) to answer any query. Time for privacy/complexity tradeoffs ?

Posted by Piotr

6. But privatising the search terms doesn't quite ensure the privacy of what you search for

Well, they have to locate you in the first place. Google is a one-stop-shop, while there are many ISP's out there to demand the records from.

7. Piotr is too kind. Using PIR for search was his proposal :). But Piotr, why do you say that a PIR system must spend time linear in n ? isn't this by definition what you want to beat in a query ?

A privacy/complexity tradeoff is quite reasonable: you could even imagine a business model based on this, where you get more privacy, the more you are willing to pay.

Oooh. I just said "business model": I need to go take a shower.

Posted by Suresh

8. But Piotr, why do you say that a PIR system must spend time linear in n

Oops, sorry. That holds only if no preprocessing is allowed. I take it back.

Posted by piotr

9. If google is forced to provide the information, they should print the whole mess in an 8 point calligraphic font with no line terminators, then ship the 2 ton print out to Congress via US mail (postage due)

Otherwise they might be able to move the Googleplex out of US territory and tell Congress to pound salt. Someplace more 'fitting' to the mission statement... Like the moon. :)

Posted byMike Dougherty