Personal Search

I regularly find myself thinking, “I know I read a web page about [XYZ] last month, but where the hell was it?” I may be able to remember certain key phrases, and these sometimes help me find it again by using Google or some other search engine. Sometimes I can also find the page by doing a full-text search on my browser cache. (I use the “Find in Files” functionality of TextPad, because Windows’ own search is too slow.) But that doesn’t help if I was looking at the page more than a week or so ago, because it will have dropped out of the cache. (I have my cache set to 1GB.)

What I would really like is “Personal Search.” This would take the form of an extra option on a search engine that would allow me to restrict my searches to only the pages I have visited.

I don’t think it would be too difficult, technically. First of all, you would have to have some mechanism of reporting to the Search Engine Company (SEC) whenever you visit a page on the web. I think the Google Toolbar might already do this. Likewise, it shouldn’t be too hard to build something for Mozilla that would perform this task.

The Search Engine Company would then have to record this page view in a database, and associate it with your personal browsing history. It wouldn’t have to store the whole page itself, because chances are good that the page has already been spidered and is present in its main index. If the page is new to the index, it will have to be added. (No big deal, and this even adds value to the main index as a whole.) Because the SEC only needs to store a list of URLs (and probably timestamps, too) against a user ID, this wouldn’t even take up impossible amounts of disk space.
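To make the storage side a bit more concrete, here is a rough sketch in Python of the kind of record the SEC might keep. Everything in it is made up for illustration (the `BrowsingHistory` class, its method names, the in-memory dictionary standing in for a real database); the only point is that a visit boils down to a user ID, a URL and a timestamp, with no page content stored.

```python
from collections import defaultdict
from datetime import datetime, timezone


class BrowsingHistory:
    """Hypothetical per-user visit log: URLs and timestamps only, no page content."""

    def __init__(self):
        # user_id -> url -> list of visit times (a stand-in for a real database table)
        self._visits = defaultdict(lambda: defaultdict(list))

    def record_visit(self, user_id, url, when=None):
        """Store one page view against the user's history."""
        if when is None:
            when = datetime.now(timezone.utc)
        self._visits[user_id][url].append(when)

    def visited_urls(self, user_id):
        """Return the set of URLs this user has visited (empty if none)."""
        return set(self._visits.get(user_id, {}))


history = BrowsingHistory()
history.record_visit("martin", "http://example.com/xyz")
history.record_visit("martin", "http://example.com/xyz")  # a repeat visit adds a timestamp, not a new URL
history.record_visit("martin", "http://example.com/other")
print(sorted(history.visited_urls("martin")))
```

Repeat visits to the same page only add another timestamp, so the per-user data grows with the number of distinct pages rather than with their size.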

Next, the SEC has to implement the search filter: whenever I do a search with the “only show results for pages I’ve visited” checkbox ticked, this should limit the search results appropriately, based on my browsing history. And voilà! My own Personal Search results.
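In its simplest form, the filter could be little more than a set-membership test applied to the normal results. The Python below is only a sketch under that assumption: `personal_search`, the `only_visited` flag and the shape of the result dictionaries are hypothetical, and a real search engine would presumably apply the restriction inside its index rather than after the fact.

```python
def personal_search(results, visited_urls, only_visited=False):
    """Return ranked results, optionally restricted to pages the user has visited.

    results      -- list of dicts with at least a "url" key, in ranked order
    visited_urls -- set of URLs taken from the user's browsing history
    only_visited -- the "only show results for pages I've visited" checkbox
    """
    if not only_visited:
        return list(results)
    # Keep the engine's original ranking; just drop pages the user never saw.
    return [r for r in results if r["url"] in visited_urls]


ranked = [
    {"url": "http://example.com/popular", "title": "Popular page"},
    {"url": "http://example.com/xyz", "title": "The page I read last month"},
    {"url": "http://example.com/unseen", "title": "Never visited"},
]
my_history = {"http://example.com/xyz"}

for hit in personal_search(ranked, my_history, only_visited=True):
    print(hit["title"])
```

With the checkbox off, the function returns the ordinary results unchanged, so Personal Search stays a pure add-on to the existing engine.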

There are a couple of downsides to this idea, though. For one, it requires the SEC to keep a complete record of my browsing activity. Depending on the legal jurisdiction, this history could be used in ways I’m not entirely happy with. The scheme would have to have some way of turning off indexing, either completely or for the duration of a browser session.

Secondly, not all web pages can be indexed by the SEC, and not all pages should be indexed by them, either. (For example, newspaper or magazine archives that require subscriptions.) There isn’t just the preference of the end user (me) to take into account, but also the preference of the web site owner. As a result, I may find that there are still gaps in my Personal Search. However, I think these gaps would still be less annoying than not being able to get back to web page XYZ that I remember from last month.

Finally, there’s a question of cost. To a certain extent, search engines perform a public service for the population of the Internet. “Personal Search” would be a service that I imagine people might be willing to pay for. After all, it means you don’t have to manage an enormous search index on your own computer. I could keep all the pages I’ve ever visited in a cache somewhere, but I really don’t want to spend a couple of hundred pounds on disk space every year.

It all sounds too easy. Can someone tell me now why this wouldn’t work? Or alternatively, can you tell me if there are any search engines out there that do this already?