[Published in the OSALL newsletter of March 2009]
A single incident some years back epitomises a mindset that I still find totally incomprehensible. I had found a way to re-arrange my workflow that streamlined a number of processes, worked well for me and improved my productivity. At a weekly meeting the department head asked about the change and, once I’d explained, I was told “you will go back to the old procedure because we’ve always done it that way”. I went back to the old procedure and within a short while moved to a different place of employment where having an active brain wasn’t seen as a threat but was actually encouraged.
There may still be some individuals who resist change and, if that bobs their boat, perfect – but the adventure in discovering and testing new concepts and products, accepting/incorporating them into suitable facets of life or rejecting them after a good look, bobs mine and it’s good to be part of a community of fellow bobbees who embrace development and progress.
What would our teenagers do all evening if Herman Heunis had made a lifelong commitment to landlines and faxes? 23-year-old Mark Zuckerberg would be one of millions sitting at home SMSing individual friends instead of riding the crest of the social networking phenomenon if he hadn’t stepped out of his comfort zone.
Finding what we know, or can reasonably anticipate, is satisfying, but how can we measure the success of our Internet searches against what we don’t know? In the light of a recent project, I have resolved to find relevant ways of searching for exactly what I want: specialised information that isn’t just going to come over and shake me by the Google.
It’s likely that many of our searches target specific documents: either we find them or we don’t. When dealing with a citation, for example, we are not going to be satisfied with a ‘similar page’ result. It has to be the original or nothing. However, this is a good time to avail ourselves of opportunities provided by the current surge of activity that aims to open up less accessible information stored on the Internet. We’ve heard it referred to collectively in the past as, variously, the “Invisible Web”, “Hidden Web” or “Deep Web”. Until recently I hadn’t spent much time analysing the implications of the different terms, accepting that they all referred to information that wasn’t readily going to jump up and grab one in the search box.
I’ve recently had occasion to read up on the theory behind current ‘deep web’ search developments and it has definitely given me reasons to widen the ambit of my search strategies.
An observation on the terminology: ‘invisible’ and ‘hidden’ are not strictly accurate. These terms gave definition to the concept as early as 1994, but by 2001 the finer distinctions had been recognised and the term ‘deep web’ was coined. Obviously some pages have been deliberately tagged by their owners to repel search bots, but there is a far wider range of reasons why some resources fail to appear in general search results.
As we’ve heard before, bots like pages with hyperlinks, the more the better, as links facilitate their passage around the Web. It therefore stands to reason that any unlinked ‘stand-alone’ pages are going to escape the attention of these tireless little auto-indexers. Some search engines also limit the number of pages that can be indexed per site, partly in an effort to avoid duplication (you’ve seen the “similar pages have not been displayed” reports at the end of ranked results).
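The ‘deliberate tagging’ mentioned above is usually done through the robots exclusion standard: a small robots.txt file at the root of a site tells well-behaved bots which paths to leave alone. As a minimal sketch (the site and paths here are invented for illustration), Python’s standard library shows how a crawler decides what it may fetch:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt barring all bots from the /private/ area
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# A compliant crawler checks each URL against the rules before fetching it
print(parser.can_fetch("*", "http://example.com/index.html"))            # allowed
print(parser.can_fetch("*", "http://example.com/private/report.html"))   # blocked
```

Pages excluded this way never reach the index, however many links point at them.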
Databases that require users to log in are also not bot-friendly and are therefore unlikely to be automatically indexed.
And then, maybe most significantly, how much information is generated on the fly when users submit queries and dynamic results are returned? Plenty on every page, but these results are not stored and so cannot be indexed by search engines. They are simply generated afresh on demand and are entirely disposable. They also don’t provide a permanent URL to give the bots a reference point.
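A minimal sketch of what ‘generated on the fly’ means in practice (the database and citations below are invented for illustration): the result ‘page’ is built fresh from a database for each query and never saved anywhere a bot could find it.

```python
import sqlite3

# An in-memory database standing in for the kind that sits behind a search form
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (citation TEXT, year INTEGER)")
conn.executemany("INSERT INTO cases VALUES (?, ?)",
                 [("2007 (1) SA 1", 2007), ("2008 (2) SA 5", 2008)])

def results_page(year):
    """Build a result 'page' afresh for each query; nothing is stored on disk,
    so there is no static URL for a crawler to index."""
    rows = conn.execute("SELECT citation FROM cases WHERE year = ?", (year,))
    return "\n".join(citation for (citation,) in rows)

print(results_page(2008))
```

The moment the response is delivered, the ‘page’ is gone; only the underlying database persists, out of the bots’ reach.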
The good news is that, despite the continued growth in the amount of information stored on the Web, there are significantly fewer types of inhibitors to discovery than in the past. I’m sure most of us recall that only a few years ago PDFs were generally not considered searchable and consequently formed part of the ‘hidden’ material; so too PowerPoint and other file types that no longer present an obstacle.
So, how to trace this information? One type of facility is the ‘federated search engine’. These portals are interfaces between users and multiple pre-selected sets of databases relating to specific fields. They do not rely on pre-created indexes: the targeted databases/resources are interrogated directly, in ‘real time’, thus returning up-to-the-second data, a definite advantage over standard search engines. Wikipedia uses http://www.science.gov/ as an example of this type of service.
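The federated model can be sketched in a few lines. The ‘sources’ below are hypothetical stand-ins for live databases; the point is that the portal keeps no index of its own, but passes each query straight to every pre-selected source and merges the answers as they come back.

```python
# Hypothetical pre-selected sources, each standing in for a live database
SOURCES = {
    "statutes": {"water": ["Statute result 1", "Statute result 2"]},
    "case law": {"water": ["Case result 1"]},
}

def federated_search(term):
    """Interrogate every source afresh for each query and merge the results.
    No pre-built index is consulted, so the data is always current."""
    merged = []
    for name, db in SOURCES.items():
        for hit in db.get(term, []):
            merged.append(f"[{name}] {hit}")
    return merged

print(federated_search("water"))
```

A real portal would send the query to each source over the network in parallel, but the merge-on-demand shape is the same.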
Web harvesting services cater for particular interest groups and focus on specific URLs. This approach optimises the results as it’s not trying to be all things to everyone. Wikipedia’s example in this case is Indonesian Scientific Index (http://www.isi.lipi.go.id/).
Many of the services that access information in the deep web rely directly on human input from their owners to identify the URLs to target, the depth to which searches are to be carried out, and the classification of resources and results. And so the wheel has turned: directory searches are once again coming into their own. A name that needs no introduction in this field is Gary Price. A number of resources refer to one of his services, DirectSearch, as a prime example of a deep web directory. Last seen, this service was being revamped and I am unable to access it at all at present. However, take a look at http://www.answers.com/topic/gary-price and another of Gary’s sites that you are likely to have used in the past: http://www.resourceshelf.com/.
As an incentive to recognising the significance of these hidden resources, one source estimates that, while standard search engines are currently indexing about 20 billion pages, the deep web consists of 1 trillion pages; or, put another way, the deep web contains about 7 500 terabytes of data compared to 167 terabytes recognised by most searches.
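Taken at face value, a quick bit of arithmetic shows the two estimates tell roughly the same story:

```python
# The figures quoted above, as reported by the source
surface_pages, deep_pages = 20e9, 1e12   # 20 billion vs 1 trillion pages
surface_tb, deep_tb = 167, 7500          # terabytes of data

page_ratio = deep_pages / surface_pages  # by page count
data_ratio = deep_tb / surface_tb        # by data volume

print(f"By page count the deep web is {page_ratio:.0f} times larger; "
      f"by data volume, about {data_ratio:.0f} times.")
```

Either way, the indexed Web would be only a small fraction of what is actually out there.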
A lot of the data that is mined by these facilitators comes from relatively ‘dry’ directories and government resources. It really is a matter of identifying what type of result you require and which service caters best to this type of request. If you are looking for someone’s phone number, maybe Google would throw up the result, but there are more direct ways of searching for this information.
This particular example also gave me food for thought. I’m not particularly vociferous about personal privacy on the Internet because, as long as it’s information one puts up about oneself, it shouldn’t contain anything that shouldn’t be accessed by others. Problems do arise when people start publishing information about others. In this case, what surprised me was the ease with which I was able to find personal information that the ‘owner’ might not want available to all and sundry.
Pipl.com describes itself as the “most comprehensive people search on the web”. It also contains a concise explanation of why much of this type of information is found on the deep web : “since most personal profiles, public records and other people-related documents are stored in databases and not on static web pages, most of the higher-quality information about people is simply “invisible” to a regular search engine” (http://www.pipl.com/help/deep-web/).
My brother has been a resident in the US for some six years and, although we keep in contact by email and SMS, I have never phoned him at home. So I decided to use him as my test case. Considering the amount of data that had to be searched, the few seconds it took to return the results was quite scary. About a dozen people with the same first name and surname were supplied, but only one with the same middle initial. Not only was I given his home telephone number and the fact that he had moved house, I was offered a fee-based profile which would check his criminal and sex offender status, bankruptcy, small claims and judgments, address history, relatives, and so on – all without his knowledge that the search was being run. In fact, I’m not even sure if he is aware that the service exists. While the search was running, the information on the screen told me “Intelius is searching billions of current utility records, court records, county records, change of address records, property records, business records, and other public and publicly available information to find what you’re looking for”.
All I wanted was a phone number. But would we want even our home phone number easily accessible to anyone with Internet access? It would enable anyone with work-related queries to contact us 24/7, and there could be any number of other reasons why we would choose not to make this information available.
In all, this has been an extremely interesting foray into a developing world and one that I’d recommend to anyone who has reason to run Web searches on a regular basis.
99 Resources to Research & Mine the Invisible Web – http://www.collegedegree.com/library/college-life/99-resources-to
Opinions expressed in this column are my own and not necessarily those of my employer.