The Limitations of Internet-Wide Search Engines

MARK CARMODY et al

In his 1998 report, Automated Indexes and Internet-Wide Search Engines, Graham Greenleaf discussed some of the limitations of automated indexes and Internet-wide search engines. Professor Greenleaf was commenting specifically on automated indexes and search engines used in (Australian) legal research. He identified four general limitations at that time:

1. Robot indexes are not comprehensive and do not index every word on the web. Indeed, the web itself does not contain every published page on every topic.
2. Robot indexes contain too much noise. In Professor Greenleaf’s words, “unless a search is for very unusual terms used only in a legal context, the search results will include a large number of acontextual items.”
3. Robot indexes are difficult to use when searching for material from particular countries. Professor Greenleaf reminds us that one of the inherent flaws of a technology-based tool such as the internet is that more technology-savvy countries, such as the USA, have greater access to it and control of it. As a consequence, Professor Greenleaf says, the pool of information available tends to be “flooded with material from North America and other ‘content rich’ parts of the Internet.”
4. Users find searching difficult. Whilst recent law graduates may be up to speed with their ‘Boolean syntax’, many not-so-recent graduates may still find the old law reports easier to work with.

Eight years on from this report, at least one of these limitations strikes me as still being a major cause for concern for those of us who use the internet for any form of research: the proliferation of North American content on the web. Clive James, in his typically eloquent and humorous manner, once described American television as a mechanism for ensuring subservience to American cultural imperialism.1) Ironically, the Howard Government is preoccupied with protecting this country’s geographical borders at a time when the real threat to our country might well come via our own broadband internet connections, through which American cultural imperialists launch attacks upon us every day.

Edit (Graeme Grovum): Help is at hand! You can now build your own legal research search engine to focus solely on those sites which do not bow down to American cultural imperialism.

Disadvantages of Internet-Based Legal Research

There is no reason why one should not use online legal research tools such as AustLII or LexisNexis. Internet technology means that recent and new decisions are available online as soon as they are handed down, and that they are all searchable. Editorials and compilations of laws are available on almost every legal topic, such as “Halsbury’s Laws of Australia”, available through LexisNexis. All current Australian legislation and Bills are available in internet databases. So is that sufficient, in the current age, to allow a lawyer to find all answers online?

Unfortunately, the answer is no.

There is a wealth of cases which to this day can only be found on paper. Some very well known cases, such as Carlill v Carbolic Smoke Ball Company [1893] 1 QB 256; Court of Appeal, 1892 Dec. 6,7, one of the most important cases in contract law, are available in full from various sources on the internet. It is, however, quite hard to find most of the older cases at a single legal research site. Even when searching the web at large, it may prove hard to locate many cases for which the citation is known, and almost impossible to find cases which may be relevant to the research a lawyer is conducting, since there is no way to limit a search by jurisdiction, area of law and so on while using Google or any other web search engine.

Currently many journals are addressing the problem by scanning their back issues and making the scanned copies available online. Google has also joined projects in which books and journals are scanned and made available (at least in part) on the internet. Even though only a fraction of the material which exists in paper form is now available on the internet, there has been good progress in the digitisation of this material recently.

However, there are currently several problems with the searchability of these materials. Many journal articles exist only as images converted into PDF format and are not searchable, with the exception of their titles, which were entered manually by the company conducting the digitisation. Obviously, if legal cases existed in such a form, they would not be much more useful than their paper counterparts. Paper copies would probably still have the upper hand over such digital copies: you can quickly flip through any given case to see whether or not it is relevant, and there are printed indices, with searchable catchwords, which can point a researcher in the right direction.

Another option, employed by some journals, is to run the scanned pages through OCR (Optical Character Recognition) software and encode the resulting text in PDF format. However, some problems exist here too: OCR software is currently far from perfect, and the resulting output may contain many misrecognised words, which lowers the efficiency of searching such a document. Moreover, it may not be possible to verify the OCR version of the document against the original, unless you go to a library where the original document can be found.
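The effect of OCR errors on searching can be illustrated with a short sketch. It is not taken from any particular digitisation project: the garbled text, the function names and the 0.8 similarity cutoff are all assumptions made for illustration. It shows an exact keyword search missing a misread word, and an approximate match (using Python's standard `difflib` module) still finding it.

```python
import difflib


def exact_hits(words, term):
    """Return positions of words that equal the term (case-insensitive)."""
    return [i for i, w in enumerate(words) if w.lower() == term.lower()]


def fuzzy_hits(words, term, cutoff=0.8):
    """Return positions of words whose similarity to the term is >= cutoff.

    The cutoff of 0.8 is an arbitrary illustrative choice, not a recommendation.
    """
    term = term.lower()
    return [
        i
        for i, w in enumerate(words)
        if difflib.SequenceMatcher(None, w.lower(), term).ratio() >= cutoff
    ]


# Hypothetical OCR output in which "Carbolic" was misread as "Carbo1ic"
# and "Smoke" as "Srnoke".
ocr_text = "Carlill v Carbo1ic Srnoke Ball Company"
words = ocr_text.split()

print(exact_hits(words, "carbolic"))  # [] : the exact search misses the case name
print(fuzzy_hits(words, "carbolic"))  # [2] : the approximate search finds it
```

Approximate matching of this kind comes at a price: the looser the cutoff, the more unrelated words are returned, which adds to the noise problem described earlier.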

A third possibility is to encode documents in ‘searchable PDF’ format. This format is designed for scanned documents and contains two layers. The first, visible layer contains the image of the scanned document. The second is a ‘text layer’ which contains the text of the document as recognised by OCR software. Such a document is fully searchable, like a normal PDF (though it shares some of the problems of a normal PDF produced from OCR), but it also contains the original document, so that the accuracy of the searchable text can be verified. This format is not without problems either, as it has the largest file size per document of the three options. Nevertheless, as internet capacity grows and file size becomes less of a concern, this file format may be the best for legal research.
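The two-layer idea can be sketched as a simple data structure. This is a conceptual model only, not the PDF file format itself: the class name `ScannedPage`, its fields and the sample text are hypothetical. The search runs over the text layer, and the page numbers it returns tell the researcher which page images to check against the recognised text.

```python
from dataclasses import dataclass


@dataclass
class ScannedPage:
    number: int
    image: bytes   # visible layer: the scanned page image (placeholder bytes here)
    ocr_text: str  # text layer: what the OCR software recognised on the page


def find_pages(pages, term):
    """Search the text layer; return numbers of pages whose text contains the term."""
    term = term.lower()
    return [p.number for p in pages if term in p.ocr_text.lower()]


# Hypothetical two-page document with made-up OCR text.
pages = [
    ScannedPage(1, b"<image data>", "An advertisement offered a reward to the public"),
    ScannedPage(2, b"<image data>", "The plaintiff used the smoke ball as directed"),
]

print(find_pages(pages, "smoke ball"))  # [2]
```

Keeping the image alongside the text is what makes verification possible: a hit on page 2 can be confirmed by reading the image of page 2, without a trip to the library.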

It may take a long time for a sufficient amount of case law to become available in digital form. For now, it is off to the law library!

1) Clive James, The Meaning of Recognition – New Essays 2001-2005, (2005), Pan Macmillan, London.
 
the_limitations_of_internet-wide_search_engines.txt · Last modified: 2006/10/29 23:50 by pavel
 