Search engines have dramatically altered the information landscape over the last two decades. They have provided information ecosystems for many categories of information users – ecosystems that did not previously exist, and which now empower those users and give them access and range that were previously only theoretical.
WiFi access and the use of handheld devices to access the web “anywhere, anytime” have made the web a ubiquitous information resource for the layperson, at the same time that increasingly capable web spiders and search engines have put both precision and power in the user's hands.
However, many of the information resources on the Internet are invisible to the surface web and are not spidered by the commonly used search engines, which creates a divide between what is available to the general public and what is reachable by the academic researcher.
We live in an era of both unprecedented ubiquity of man-made information sources and an immediacy that has not existed before. In his book “Cosmos”, Carl Sagan puts the size of the collection held at the library of Alexandria as running as large as a million scrolls; in comparison, the website “worldwidewebsize.com” gives the number of pages indexed on the Internet as 19.86 billion (Saturday, 13 March, 2010)[i].
Clearly we have reached a degree of information availability that beggars previous collections.
However, this has come with challenges: the technology itself prioritizes technical ability over plain literacy, and in a very real sense the Internet can be seen as occupied by a “special club” with a small membership of “geeks” (Morville 2005) who have privileged access to information by virtue of special knowledge, devices, and information techniques.
Access to Internet resources has an entry bar set by technology in terms of computer hardware and software, but also by special heuristic techniques (Effken, Brewer et al. 2003).
The economic force unlocked by linking advertising to the provision of free-to-use web-based search engines led to the so-called “search-engine wars”, in which vendors apply a range of different tactics to woo the public user whilst competing for subscribers. This drives not only the functionality offered by vendors, but also the range of searchable categories of informational artifacts. It additionally leads to some vendor specialization, such as concept-search engines like Kartoo[ii], meta-search engines like Dogpile, and engines dealing with specific media, such as YouTube.
“Apart from standard web search, search engines offer other search services such as image search, news search, mp3 music search and product price search. The current search engine wars will mean that there will be fierce competition between search engines to lure users to use their services, which is good news for the consumer of search, at least in the short term.” (Levene 2006)
This is not to say that the results of search engines cannot be manipulated or “gamed”, both by the people or organizations acting as information sources and by third parties who may wish to influence the behavior of search engines. The term “Google-bombing” reflects an aspect of this practice.
The user thus needs to be aware that some participants may “game” the system and manipulate search engines to artificially raise the search ranking of a specific site or page (Poremsky 2004).
To combat this practice, and to stay as competitive as possible, the vendors constantly tune their engines. The user should bear in mind that the algorithms and techniques used by search-engine vendors are trade secrets and subject to change, and that specific sites may be systematically or even deliberately selected or de-selected based on somewhat inscrutable rules. Web sites may also trigger anti-gaming algorithms designed to detect attempts to manipulate the search engines and be removed from the result set entirely (the so-called “Google death penalty”), a removal that would be entirely unknown to the user (Levene 2006).
This thrust and parry relationship between the information suppliers and the search-engine vendors has given rise to an industry of supplying various tricks and techniques to safely influence visibility and palatability of information to search-engines (Kent 2004), as well as spawning guide-books for webmasters (Reynolds 2004) and every imaginable aspect of “Findability” (Morville and Rosenfeld 2006).
Information abounds on topics ranging from “search zones” to the need for, and creation of, thesauri to catch misspellings or alternative and preferred terms (Poremsky 2004).
These techniques have been so successful that they have created a further challenge for the researcher or user, aptly termed “infoglut” (Herhold 2004): a query result set so large that, instead of a manageable handful of appropriate texts, the user retrieves several thousand or even millions of texts, which is simply too much to handle.
As Herhold (2004) puts it:
“The implication for the design of retrieval languages is that disambiguation is a serious and very large problem. It is the homonym problem writ large, writ in the extended sense of including polysemy and contextual meaning, that is the chief cause of precision failures – i.e., infoglut – in retrieval.”
Various stratagems and approaches to infoglut from the information provider’s point of view have been suggested, ranging from clever use of information-mapping (Kim, Suh et al. 2003), to the creation of portals (Firestone and McElroy 2003) in which relevancy is driven by proximity to the user, measured in mouse-clicks[iii].
On the user end of the equation there are also guides for users and researchers, including the use of subscription databases and intelligent agents (Foo and Hepworth 2000).
A very large result set obviously challenges the information-processing capacity of the user, but it also calls into question the heuristic technique used, bringing to light two distinct elements that bear attention, namely the precision of a query result and its recall (Herhold 2004, Pao 1989).
Precision: The proportion of retrieved documents that are also relevant. A low precision implies that most of the documents retrieved were not relevant, and are thus info-junk.
Recall: The proportion of all relevant documents that were found and retrieved. A low recall means the query was ineffective at finding the universe of all relevant documents, and it also speaks to the phenomenon of the “invisible web” that is not targeted by search engines (Smith 2001).
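The two definitions above can be made concrete with a small calculation. The following is only a minimal sketch in Python; the function name `precision_recall` and the document identifiers are hypothetical and chosen purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the documents the search returned
    relevant:  ids of all documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 8 documents retrieved, only 2 of them relevant,
# while the whole collection holds 10 relevant documents.
retrieved = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
relevant = ["d1", "d2"] + [f"r{i}" for i in range(8)]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up case three quarters of what came back is info-junk (low precision), and most of the relevant material was never retrieved at all (low recall).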
This invisible or “deep” web is hidden from view primarily because its information sources are not amenable to discovery by the typical search engines that trawl the “surface web” (Henninger 2003). It is often estimated to be orders of magnitude bigger than the total available for search, with the deep web being 400-550 times bigger than the surface web (Bergman 2007).
Smith (2001) explains this in terms of linking and permanence:
“Traditional search engines create their indices by spidering or crawling surface Web pages. To be discovered, the page must be static and linked to other pages. Traditional search engines can not “see” or retrieve content in the deep Web — those pages do not exist until they are created dynamically as the result of a specific search. Because traditional search engine crawlers can not probe beneath the surface, the deep Web has heretofore been hidden.” (Smith 2001)
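The linking requirement Smith describes can be illustrated with a toy crawler. This is a simplified sketch, not how any real search engine works: the page names and the `LINKS` table are invented, and the "web" is just an in-memory dictionary.

```python
from collections import deque

# Hypothetical toy web: each page maps to the pages it links to.
LINKS = {
    "home": ["about", "catalogue"],
    "about": ["home"],
    "catalogue": ["search-form"],
    "search-form": [],         # results are generated on demand, so no static links
    "result?q=scrolls": [],    # exists only after a query; nothing links to it
    "subscriber-article": [],  # behind a login; nothing links to it
}


def crawl(seed, links):
    """Breadth-first crawl: return the set of pages reachable by following links."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


surface = crawl("home", LINKS)
deep = set(LINKS) - surface  # pages that exist but were never discovered
print(sorted(surface))
print(sorted(deep))
```

The crawler only ever finds pages that some other page links to, so the dynamically generated result page and the subscriber-only article stay out of its index however long it runs.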
Part of dealing with these different aspects of information retrieval is to deliberately adopt a technique or heuristic for searching.
The dilemma of needing terms and knowledge to find information, but needing access to usable information in order to know which terms to use, is approached in a discursive browse-search-browse pattern reminiscent of how people search for food. Heuristics is the partially formalized approach to the employment of various information stratagems.
According to Spink & Cole, it is likely that human information-seeking behavior is an evolutionary correlate of other, older foraging patterns (Herhold 2004), and thus not just an individualistic behavior, but a deeply social one.
Examples of how a user (or provider) of information can approximate these patterns include social-bookmarking (Hammond, Hannay et al. 2005), and tagging.
These drive a social taxonomy that makes searching and finding on the web a more ergonomically human activity, both through the social aspect of observing what other people tag and through the ability to create information paths using a folk taxonomy or “folksonomy” (Mathes 2004, Porter 2005).
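How freely chosen tags add up to a shared folksonomy can be sketched in a few lines. The users, URLs, and tags below are invented, and the sketch assumes each user bookmarks a given URL only once; real social-bookmarking services are of course far richer.

```python
from collections import defaultdict

# Hypothetical bookmarks: (user, url, tags chosen freely by that user).
BOOKMARKS = [
    ("ann", "example.org/deep-web", {"search", "deepweb"}),
    ("bob", "example.org/deep-web", {"invisible-web", "search"}),
    ("bob", "example.org/tagging", {"folksonomy", "search"}),
    ("cat", "example.org/tagging", {"folksonomy"}),
]


def build_folksonomy(bookmarks):
    """Map each tag to the URLs it was applied to, counting the users per URL."""
    index = defaultdict(lambda: defaultdict(int))
    for _user, url, tags in bookmarks:
        for tag in tags:
            index[tag][url] += 1
    return index


index = build_folksonomy(BOOKMARKS)
print(dict(index["search"]))      # pages that people filed under "search"
print(dict(index["folksonomy"]))  # pages that people filed under "folksonomy"
```

Nobody designed the vocabulary in advance; following a tag simply leads to whatever other people chose to file under it, which is the information path the folksonomy provides.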
A similar approach is being adopted by many retailers on the web, where finding an item often results in a list of other items that users who bought the item under view “also bought”. E-stores such as Amazon or Barnes & Noble are thus able to guide purchases with collaborative filtering, using the patterns of other users (Anderson 2004). This has ramifications for the business user who might wish to know what their peers are looking at.
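The “also bought” idea can be approximated by simply counting which items turn up in the same purchase histories. This is a deliberately naive sketch of collaborative filtering with invented data; it is not a description of how Amazon or any other retailer actually computes recommendations.

```python
from collections import Counter

# Hypothetical purchase histories: one set of item names per customer.
BASKETS = [
    {"Cosmos", "Ambient Findability", "The Long Tail"},
    {"Cosmos", "The Long Tail"},
    {"Cosmos", "Information Architecture"},
    {"Ambient Findability", "Information Architecture"},
]


def also_bought(item, baskets, top_n=3):
    """Rank other items by how many customers bought them together with `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})
    # Sort by count (descending), then by name so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]


suggestions = also_bought("Cosmos", BASKETS)
print(suggestions)
```

The ranking comes entirely from the behavior of other buyers, with no knowledge of what the items are about.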
Information tools accessible on the web that cater for the social aspects of information-seeking have been made available both by entrepreneurial groups such as Yahoo, with their del.icio.us tagging tool, and by scientifically orthodox publications such as the journal Nature, with their freeware tool and site Connotea.
Folksonomy is therefore an applicable tool for the business researcher as well as the general public.
The ability to identify information quality is a further dimension, since the quality of information involves inter alia “the properties of accuracy, precision, credibility, currency, pertinence, precision, relevance, reliability, simplicity and validity” (Evernden and Evernden 2003).
Information quality tends to deteriorate over time (Evernden and Evernden 2003), which is problematic in any collection where the architecture does not require the dating of items. It is important for the seeker to treat the presence or absence of dates as a guide to the trustworthiness of a collection.
A further available heuristic tactic is to use humans as search catalysts in a more direct and old-fashioned manner: many library services provide library research assistants who are skilled and studied in taxonomies and search techniques, and are able to provide suggestions for search strings and databases.[iv]
For the seeker, parts of this invisible web are exposed via academic and research search tools operating on organizational or subscription collections, some of which are accessible through citation-manager software such as EndNote, which has search and connection tools.
Crystal balls have proven notoriously inaccurate in seeing into the future with regard to the Internet, and probably the best I can manage is to say that things will get bigger but more user-friendly, and that the social-bookmarking trends will continue. The drive towards “Web 2.0” social networking, “Web 3.0” semantic-web technologies, and contextual search tools will doubtless shape user-interface design, make more things (and more kinds of things) available, and continue to make available texts and artifacts previously only available in hardcopy media.
Information architecture is likely to become increasingly important as collections increase in diversity and size (Morville and Rosenfeld 2006, Batley 2007).
Privacy is also likely to become increasingly important as Internet tools make it easier to identify users purely from the search queries they use. This was made clear when an AOL user was identified purely through her use of search terms (Barbaro and Zeller 2006). The user assumption that web activity is anonymous is unwarranted, and this has implications for researchers whose subject matter might be politically or socially controversial or might disclose their business intent. There are thus serious privacy concerns with regard to search engines (Cohen 2005).
- Anderson, C. (2004). “The Long Tail.” Wired Magazine 12(10).
- Barbaro, M. and T. Zeller (2006). A Face Is Exposed for AOL Searcher No. 4417749. New York Times. New York.
- Batley, S. (2007). Information architecture for information professionals. Oxford, Chandos.
- Bergman, M. (2007). “The Deep Web: Surfacing Hidden Value.” Journal of Electronic Publishing.
- Cohen, A. (2005). What Google Should Roll Out Next: A Privacy Upgrade. New York Times. New York.
- du Preez, M. (2002). “Indexing on the Internet.” MOUSAION 20(1): 109-122.
- Effken, J. A., B. B. Brewer, et al. (2003). “Using computational modeling to transform nursing data into actionable information.” Journal of Biomedical Informatics 36(4-5): 351-361.
- Evernden, R. and E. Evernden (2003). Information First: Integrating Knowledge and Information Architecture for Business Advantage. Oxford, Butterworth-Heinemann: 1-27.
- Firestone, J. M. and M. W. McElroy (2003). Key issues in the new knowledge management. Burlington MA, Elsevier Science.
- Foo, S. and M. Hepworth (2000). The implementation of an electronic survey tool to help determine the information needs of a knowledge-based organization.
- Hammond, T., T. Hannay, et al. (2005). “Social bookmarking tools (I): A general review.” D-Lib Magazine 11(4).
- Henninger, M. (2003). Searching Digital Sources. The Hidden Web: Finding quality information on the net. Sydney, Australia, UNSW Press.
- Herhold, K. (2004). “The Philosophy of Information.” Library Trends 52(3): 373-665.
- Kent, P. (2004). Surveying the Search Engine Landscape. Search Engine Optimisation for Dummies, Wiley.
- Kim, S., E. Suh, et al. (2003). “Building the knowledge map: an industrial case study.” Journal of Knowledge Management 7(2): 34-45.
- Levene, M. (2006). Navigating the Web. An Introduction to Search Engines and Web Navigation. London, Addison Wesley: 174-184.
- Loxton, M. H. (2003). “Patient Education: The Nurse as Source of Actionable Information.” Topics in Advanced Practice Nursing eJournal 3.
- Mathes, A. (2004). Folksonomies: Cooperative Classification and Communication through Shared Metadata.
- Morville, P. (2005). The Sociosemantic Web. In Ambient Findability. CA, O’Reilly.
- Morville, P. and L. Rosenfeld (2006). Information Architecture for the World Wide Web. California, O’Reilly Media.
- Morville, P. and L. Rosenfeld (2006). Push and Pull. Information Architecture for the World Wide Web. S. St.Laurent. California, O’Reilly Media.
- O’Reilly, T. (2005). “What Is Web 2.0 : Design Patterns and Business Models for the Next Generation of Software.” Retrieved 28 August, 2007, from http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html.
- Pao, M. (1989). Information Retrieval.
- Poremsky, D. (2004). Search Engines and How they Work. In Google and Other Search Engines. Berkeley, CA, Peachpit Press: 3-18.
- Porter, J. (2005). “Folksonomies: A User-Driven Approach to Organising Content.” User Interface Engineering. Retrieved September 6, 2007, from http://www.uie.com/events/uiconf/2006/articles/folksonomies.
- Reynolds, J. (2004). Search Engines and Directories. The Complete E-Commerce Book, CMPBooks: 233-247.
- Smith, B. (2001). Getting to know the Invisible Web. Library Journal.Com.
EndNote is provided by the Thomson-Reuters group. See www.endnote.com
The release of AOL search strings allowed a researcher to quickly identify a Mrs. Thelma Arnold, even though she was identified only as “searcher #4417749”.
[i] Which is really curious, because on 21 September 2008 it said there were 27.61 billion pages. Did the web shrink, or is the tool a bit buggy?
[ii] Sadly defunct now
[iii] Using “mouse-click” distance as a measure is a very effective way to put information at hand
[iv] Many libraries staff a 24×7 online helpdesk to guide patrons in finding materials. Most of these are staff pooled across many institutions and locations.
Matthew Loxton is the director of Knowledge Management & Change Management at Mincom, and blogs on Knowledge Management. Matthew’s LinkedIn profile is on the web, and he has an aggregation website at www.matthewloxton.com
Opinions are the author’s and not necessarily shared by Mincom, but they should be.