The Internet Archive and the Search for Integrity
Internet Archive to the Rescue
Wanting to emphasize the importance of retaining knowledge of history, George Santayana wrote the words made famous by the film Rise and Fall of the Third Reich: "Those who cannot remember the past are condemned to repeat it." Of course, at the time neither the Internet Archive nor the Information Age existed. If they had, perhaps he would have edited his philosophy to state, "Those who cannot discover the past are condemned to repeat it."
Certainly, in times when new information amounts to five exabytes a year, or the equivalent of "information contained in half a million new libraries the size of the Library of Congress print collections" (How Much Information? 2003), it is perhaps fortunate that librarians possess a knack for discovering information. It is also in our favor that Brewster Kahle and Alexa Internet foresaw a need for an archive of Web sites.
Internet Archive and the Wayback Machine
Founded in 1996, the Internet Archive contains about 30 billion archived Web pages. While always open to researchers, the collection did not become readily accessible until the introduction of the Wayback Machine in 2001. The Wayback Machine enables finding archived pages by their Web address. Enter a URL to retrieve a dated listing of archived versions. You can then display the archived document as well as any archived pages linked from it.
The Internet Archive helped me successfully respond to the concerns the lawyers had about the prospective client. It contained evidence of a business relationship with a company clearly in the suspect industry. Broadening the investigation to include the newly discovered company led to information about an active criminal investigation. Suddenly, the pieces of the puzzle came together and spelled L-I-A-R.
Using the Internet Archive should be a consideration for any research project that involves due diligence, or the careful investigation of someone or something to satisfy an obligation. In addition to people and company investigations, it can assist in patent research for evidence of prior art, or in copyright or trademark research for evidence of infringement. It can also come in handy when researching events in history, looking for copies of older documents like superseded statutes or regulations, or seeking the ideals of a former political administration. (Note (25 October 2004): A special keyword search engine, called Recall Search, facilitated some of these queries. Unfortunately, it was removed from the site in mid-September. Messages posted in the Internet Archive forum indicate they plan to bring it back.)
Recall Search at the Internet Archive
But while the Internet Archive contains information useful in investigative research, finding what you want within the massive collection presents a challenge. If you know the exact URL of the document, or if you want to examine the contents of a specific Web site--as was the case in the scenario involving the prospective client--then the Wayback Machine will suffice. But searching the Internet Archive by keyword was not an option until recently. (Note: See the note in the previous paragraph.)
In September 2003, the project introduced Recall Search, a beta version of a keyword search feature. Recall makes about one-third of the archived collection, or 11 billion Web pages, accessible by keyword. While it further facilitates finding information in the Internet Archive, it does not replace the Wayback Machine. Because of the limited size of the keyword-indexed collection and the problems inherent in keyword searching, due diligence researchers should use both finding tools.
Recall does not support Boolean operators. Instead, enter one or more keywords (fewer is probably better) and, if desired, limit the results by date.
Results appear with a graph that illustrates the frequency of the search terms over time. The results page also provides clues about their context. For example, a search for my name limited to Web pages collected between January 2002 and May 2003 finds ties to the concepts "school of law," "government resources," "research site," "research librarian," "legal professionals" and "legal research." The resulting graph further shows peaks at the beginning of 2002 and in the spring of 2003.
Applying content-based relevancy ranking, Recall also generates topics and categories. Little information exists about how this feature works, and I have experienced mixed results. But the idea is to limit results by selecting a topic or category relevant to the issue.
Suppose you enter the keyword, Microsoft. The right side of the search results page suggests concepts for narrowing the query. For example, it asks if instead you mean Microsoft Windows, Microsoft Internet Explorer, Microsoft Word, and so on. Likewise, a search for turkey suggests wild turkey, the country of Turkey, turkey hunting, roast turkey and other interpretations.
While content-based relevancy ranking can be a useful algorithm, it is far from perfect. Some topics and categories generated might not seem to make sense. If the queries you run do not produce satisfactory results, consider another approach.
Pinpoint the specific sites you want to investigate by first conducting the research on the Web. In the prospective client example, an old issue of the newsletter of the company under criminal investigation (Company A) mentioned the prospective client's company (Company B). This clue led us to Company A's Web site where we found no further mention of Company B. However, with the Web site address in hand, we reviewed almost every archived page at the Internet Archive and found solid evidence of a past relationship. Additional research, during which we tracked down court records and spoke to one of the investigators, provided the verification we needed to confront the prospective client.
Advanced Search Techniques
You can display all versions of a specific page or Web site during a certain time period by modifying the URL. Greg Notess first illustrated this strategy in his On The Net column (See "The Wayback Machine: The Web's Archive," Online, March/April 2002).
A request for all archived versions of a page looks like this:
The asterisk is a wildcard that you can modify. For example, to find all versions from the year 2002, you would enter:
Or to find all versions from September 2002, you would enter:
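As a minimal sketch of the URL pattern described above, the following Python builds the three kinds of requests: all versions, all versions from one year, and all versions from one month. It assumes the Wayback Machine's listing address has the form http://web.archive.org/web/ followed by a date prefix, an asterisk, and the page address. The function name wayback_listing_url and the placeholder site www.domain.com are illustrative only.

```python
from typing import Optional

# Assumed base address of the Wayback Machine's listing pages.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_listing_url(page_url: str,
                        year: Optional[int] = None,
                        month: Optional[int] = None) -> str:
    """Build a request for the archived versions of page_url.

    With no date, the asterisk matches every archived version.
    A year (and optionally a month) narrows the wildcard to that period.
    """
    if month is not None and year is None:
        raise ValueError("a month can only be given together with a year")
    if month is not None and not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")

    # The date prefix is YYYY or YYYYMM; the asterisk covers the rest.
    stamp = ""
    if year is not None:
        stamp = f"{year:04d}"
        if month is not None:
            stamp += f"{month:02d}"
    return f"{WAYBACK_PREFIX}/{stamp}*/{page_url}"


if __name__ == "__main__":
    site = "http://www.domain.com"  # placeholder address
    print(wayback_listing_url(site))            # all versions
    print(wayback_listing_url(site, 2002))      # all versions from 2002
    print(wayback_listing_url(site, 2002, 9))   # all versions from September 2002
```

Paste any of the printed addresses into a browser's address line to see the dated listing for that period.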
Sometimes you encounter problems when you browse pages in the archive. For example, I often receive a "failed connection" error message. This may be the result of busy Web servers or a problem with the page. It may also occur if the live Web site prohibits crawlers.
To find out whether the latter is the problem, check the site's robot exclusion file. The file follows a standard honored by most search engines and resides in the site's root-level directory. To find it, enter the main URL in your browser's address line followed by /robots.txt, like this: http://www.domain.com/robots.txt
If the site blocks the Internet Archive's crawler, it will contain two lines of text similar to the following:
If it forbids all crawlers, the commands should look like this:
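The check can also be sketched in Python with the standard library's robots.txt parser. The two sample files below are illustrations, not copies of any particular site's file. They assume the Internet Archive's crawler identifies itself as ia_archiver, the user-agent name associated with Alexa Internet's crawler. The helper names robots_txt_url and is_blocked are hypothetical.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Illustrative file that blocks only the archive's crawler
# (assumed user-agent name: ia_archiver).
BLOCK_ARCHIVE = """User-agent: ia_archiver
Disallow: /
"""

# Illustrative file that forbids every crawler.
BLOCK_ALL = """User-agent: *
Disallow: /
"""


def robots_txt_url(site_url: str) -> str:
    """Return the address of the robot exclusion file in the site's root."""
    parts = urlsplit(site_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "http://www.domain.com/") -> bool:
    """Report whether robots_txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(robots_txt_url("http://www.domain.com/about/index.html"))
    print(is_blocked(BLOCK_ARCHIVE, "ia_archiver"))  # archive crawler only
    print(is_blocked(BLOCK_ARCHIVE, "Googlebot"))    # other crawlers unaffected
    print(is_blocked(BLOCK_ALL, "ia_archiver"))      # everyone blocked
```

To test a live site, download the file at the address robots_txt_url returns and pass its text to is_blocked.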
It's common for Web sites to block crawlers, including the Internet Archive, from indexing their copyrighted images and other non-text files. If the Internet Archive blots out images with gray boxes, then the Web site probably prevents it from making the graphics available.
If the site does not appear to block the Internet Archive, don't give up when you encounter a "failed connection" message. Return to the Wayback Machine and enter the Web page address. This strategy generates a list of archived versions of the page whereas Recall presents specific matches to a query. One of the other dated copies of the page may load without problems.
While the Internet Archive does not contain a complete archive of the Web, it offers a significant collection that due diligence researchers should not overlook. Tools like the Wayback Machine and Recall Search provide points of access. However, these utilities only handle simple queries. You can search by Web page address or keyword. You cannot conduct Boolean searching or limit a query by key information. Moreover, Recall Search limits keyword access to one-third of the collection. Consequently, conduct what research you can elsewhere first using public Web search engines and commercial sources. Then use the information you discover to scour relevant sites in the Internet Archive.