
 

ANALYSIS OF ALLTHEWEB AND IXQUICK

USING A COMPLEX SEARCH

 

 

SEARCH QUESTION

In order to test the capabilities and results provided by a search engine and a metasearch engine, I chose to research a question of personal interest.  My aunt was diagnosed with breast cancer in the fall of 2002 and chose to undergo both chemotherapy and radiation as her doctors recommended.  In the area of lifestyle changes, she is less compliant.  She refuses to give up smoking.  Because I am already aware that breast cancer is quite prone to lung metastases, the goal of my search was to find out whether there was information on the World Wide Web to support the position that her survival prognosis would improve if she gave up smoking.

 

ENGINES SEARCHED

The Sunsite Berkeley web search tutorial suggests that www.google.com is the best general search engine.  I chose to search the one it mentions as a good backup, www.alltheweb.com, since I am less familiar with it.  After examining the metasearch engines the tutorial suggests, I chose to use www.ixquick.com.

 

SEARCH TERMS

After analyzing the subject and the capabilities of the engines I chose, and performing a few trial searches, I chose the following search string:

 

metastasis AND smoking AND “breast cancer” AND lungs

 

     Though one can get the same results without the Boolean AND on AllTheWeb, where it is the default in the non-Boolean search box, the searcher must include it on Ixquick.  On that metasearch engine a simple string of terms is searched as “any of the terms” rather than “all of the terms”; in other words, the default is OR.  Because I wanted a search string that would be equally compatible with AllTheWeb and Ixquick, I included the AND and used the Boolean search box on AllTheWeb.  The AND could just as well have been a +; in fact, at Ixquick the query is so translated.  Since AllTheWeb does not support truncation, I did not use smok*, metasta*, and lung* as I would have on other engines.

     My goal was to maximize the relevant results in the top layer while minimizing the irrelevant ones.  The particular order of terms was chosen after trying several alternatives.  Giving “breast cancer” or [lungs] prominence in the string decreased relevance to the particular question.  At Ixquick, placing [metastasis] before [smoking] produced the best results, and since the order of these terms did not change the relevance ratings of the top 20 results at AllTheWeb, that is the order I used for the search string.

     The result set for this particular set of terms is still inordinately large at AllTheWeb, although only 100 results are listed, but alterations or additions that decreased the result set also reduced the relevance of the top results.  Even though Ixquick pulls only the top ten results from ten different engines, it also provides a figure for the larger result set from which those are pulled.  A preference choice at AllTheWeb is to pull only one result from each domain.  Since I selected that choice, all results from that engine should be unique websites.  However, I marked the figure as questionable since some of the other preferences seem not to work correctly.
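The difference between a default of AND and a default of OR, described above, can be sketched in a few lines of Python. This is only a toy illustration: the three sample documents and the helper names are invented, and the exact form of the + notation is an assumption, since neither engine's actual matching or query-translation code is known to me.

```python
# Toy illustration of default-AND versus default-OR matching.
# The documents and helper names are hypothetical, not taken from either engine.

TERMS = ["metastasis", "smoking", "breast cancer", "lungs"]

DOCS = {
    "doc_a": "smoking and breast cancer metastasis to the lungs",
    "doc_b": "general breast cancer information",
    "doc_c": "fire safety and smoking hazards",
}


def matches_all(text, terms):
    """AND semantics: every term must appear in the text."""
    return all(term in text for term in terms)


def matches_any(text, terms):
    """OR semantics: at least one term must appear in the text."""
    return any(term in text for term in terms)


def to_plus_form(terms):
    """Rewrite an AND query in + notation, quoting multi-word phrases.

    The exact output format is an assumption made for illustration.
    """
    return " ".join(
        '+"{}"'.format(t) if " " in t else "+" + t for t in terms
    )


and_hits = sorted(name for name, text in DOCS.items() if matches_all(text, TERMS))
or_hits = sorted(name for name, text in DOCS.items() if matches_any(text, TERMS))

print("AND (all of the terms):", and_hits)
print("OR  (any of the terms):", or_hits)
print("Plus notation:", to_plus_form(TERMS))
```

Under AND only the document containing all four terms is returned, while under OR the general breast cancer page and the fire safety page also qualify, which is why the explicit AND mattered at Ixquick.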

 

                         TOTAL RESULTS

                        ALLTHEWEB     IXQUICK

# TOTAL RESULTS         14,194        13,466
# RESULTS LISTED        100           39
# UNIQUE WEBSITES       0?            35
RESULTS ANALYZED        top 20        top 20

 

 

METHODOLOGY

After performing the search at each engine, I analyzed the top twenty results retrieved by each for relevancy.  With few exceptions, the results could be technically considered relevant because the text indeed included all terms in the search string.  However, some results did not actually address the original query because the terms did not intersect as desired.  Sometimes this was because the resource was a particularly large one.  Other times it was just the nature of the result, in one case a glossary of terms, in another a bibliography.  Finally, in the subject area of breast cancer, even a short article can actually contain the three other terms without the desired intersection.

At AllTheWeb it was actually difficult to verify the inclusion of all terms because highlighting only appeared to work on the results page itself.  Only one result was not even in the area of cancer.  Inexplicably, at Ixquick a result directing the searcher to a fire safety company appeared for multiple permutations of the term string.  It is doubtful that the terms “breast cancer” or [metastasis] appeared anywhere on that site.  The following table illustrates the actual spread of results.

 

              NATURE OF TOP TWENTY RESULTS

                                        ALLTHEWEB     IXQUICK

RELEVANT (all 4 terms present)
Query Relevant                          2             11
Query Relevant But Link Problem         -             1
Query Relevant w/Site Registration      -             1
Smoking and Breast Cancer               1             1
General Breast Cancer Info              14            5
Focus on Other Cancer Type              2             -

IRRELEVANT (terms missing)
Treatment and Marijuana Smoking         1             -
Fire Safety Company                     -             1
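As a rough way to compare the two columns, the counts in the table above can be turned into a proportion of the top twenty results that addressed the actual query. The sketch below is my own arithmetic on those counts; calling the figure "precision" and the option to count hard-to-reach results (broken link or site registration) are assumptions added for illustration.

```python
# Counts copied from the table of top twenty results.
# The dictionary layout and function name are hypothetical.
TOP_N = 20

counts = {
    "AllTheWeb": {"query_relevant": 2, "link_problem": 0, "registration": 0},
    "Ixquick": {"query_relevant": 11, "link_problem": 1, "registration": 1},
}


def precision(engine, include_hard_to_reach=False):
    """Share of the top twenty results that addressed the query."""
    row = counts[engine]
    hits = row["query_relevant"]
    if include_hard_to_reach:
        # Optionally count results that were relevant but hard to open.
        hits += row["link_problem"] + row["registration"]
    return hits / TOP_N


for engine in counts:
    strict = precision(engine)
    lenient = precision(engine, include_hard_to_reach=True)
    print("{}: strict {:.0%}, lenient {:.0%}".format(engine, strict, lenient))
```

On the strict reading, 2 of 20 AllTheWeb results and 11 of 20 Ixquick results were query relevant; the lenient reading raises the Ixquick count to 13 of 20.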

 

 

ENGINE CAPABILITIES AND FEATURES

 

Features I Liked at AllTheWeb:

·       size of database – over 2 billion

·       clean lines and pleasing colors – one can actually change the colors to suit as well, although it requires storing settings on your own computer, and of the alternate choices only the gray palette is pleasing to me

·       capability of denoting search type on the homepage - web, news, pictures, video, audio, or FTP files

·       pull down menu to apply “all of the words” (default), “any of the words,” “exact phrase,” or “Boolean expression” to the basic search box either on the homepage or the advanced search page

·       prominence of customization choices – right next to the main search box there is a link to a page where the user can designate preferences on multiple parameters – less intuitive are the links to language and advanced search preferences from that page – also, users who first click to advanced search will miss all of it

·       expandable Boolean search box so that full query can always be read

·       inclusion of some “similar searches” at the top of the results – although the results of direct clicking are mixed, they can certainly give the user ideas for expanding the search.

·       lack of banner ads which they evidently recently removed

·       searching in multiple languages with the default changing with country of searcher

·       multiple limiting factors on the advanced search page: domain, IP address or range, file type or size, publication date, and the inclusion or exclusion of particular types of files embedded in results

·       ability to include all or most of the above limitations right in the Boolean search box although the instructions for this are buried

·       ability to designate a field to search for each term in a search string although adding one for author designated keywords would help

·       availability of searching within a result set with a form to add terms

·       availability of investigating websites by merely entering URL in the query box, although this capability is only noted on the new features page

 

     Although I do not use AllTheWeb very often, I am familiar enough with the search engine to know that within the past year a number of features that made it unique have been removed.  The results page used to include whole sentences under each result rather than telegraphic indications of the first appearance of search terms.  It also used to provide unobtrusive subcategory folders that could be helpful in ranking results or narrowing a topic.  AllTheWeb used to have much more Boolean capability.  Terms like NEAR that were supported before will now fail to produce results without any indication why.  Worse, nesting was supported before, and now, instead of just ignoring parentheses, the engine unexpectedly reads them as indicating an OR relationship between enclosed terms.  It would appear that those in charge wish to make the engine more like Google and in so doing increase use.  My opinion is that searchers who made this engine their first choice, or at least used it heavily as a second engine, now have much less reason to do so.

 

Features I Didn’t Like at AllTheWeb:

·       Removal of former results management features noted in the above discussion

·       Removal of the “exact phrase” choice from the menu box – although it is possible to include it by customization, this is not made clear by the phrasing on the preferences page, but only within the discussion of new features (which will probably soon disappear)

·       Removal of formerly supported Boolean capabilities noted in the above discussion

·       unusual reading of parentheses in Boolean string as discussed above

·       lack of truncation

·       although highlighting of results is available as a preference, this feature appears not to work or to apply only to the words that appear on the results page

·       apparent lack of left to right ranking of terms which in the case of this search made the top results less relevant to the actual query

·       appearance of results which do not actually contain all search terms

·       lack of cache

 

Features I Liked at Ixquick:

·       nature of metasearching – capability of searching the top ten results at multiple search engines at the same time

·       capability of denoting search type on the homepage: web, mp3, news, pictures

·       capability of denoting language on the homepage although this is limited to 14

·       apparent left to right ranking of the search terms by both Ixquick and the engines it searches which in this search made the top results much more relevant to the search question

·       possibility of searching using natural language although of course only engines that support it will appear in the results

·       full support of all Boolean language and conventions although both the information that this is so, and the fact that using certain ones will limit which engines can be polled is somewhat buried

·       searcher can choose to view the webpage results with the search terms highlighted

·       results page includes a few similar searches – in this case “cancer treatments” and “National Institute of Health guides” were referenced, and although the effectiveness of direct clicking on them is mixed, they can certainly provide good ideas for additional searches

·       each result includes both the number of engines reporting the result, denoted with stars, and the names of those engines with their numerical ranking of the particular result

·       each polled engine represented in the results is linked so that the searcher can move to examine the results there without repeating the query stage
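The star display described in the list above can be pictured with a small Python sketch of merging several engines' top results. Everything here is hypothetical: the engine names, the example.org URLs, and the ordering rule (most-reported first, then best average rank) are assumptions for illustration, not Ixquick's documented method.

```python
from collections import defaultdict

# Hypothetical top results from three hypothetical engines.
# Position in each list stands for that engine's rank (1 = first).
engine_results = {
    "engine_one": ["http://example.org/a", "http://example.org/b"],
    "engine_two": ["http://example.org/b", "http://example.org/c"],
    "engine_three": ["http://example.org/b", "http://example.org/a"],
}


def merge(results_by_engine):
    """Group results by URL, recording each reporting engine and its rank."""
    merged = defaultdict(dict)
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            merged[url][engine] = rank
    # Assumed ordering: more reporting engines first, then better average
    # rank, then URL so that the order is deterministic.
    return sorted(
        merged.items(),
        key=lambda item: (
            -len(item[1]),
            sum(item[1].values()) / len(item[1]),
            item[0],
        ),
    )


for url, ranks in merge(engine_results):
    stars = "*" * len(ranks)  # one star per engine reporting the result
    print(stars, url, ranks)
```

A result reported by all three engines receives three stars and is listed first, with each engine's own ranking shown beside it, which mirrors the kind of per-result information the Ixquick results page provides.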

 

Features I Didn’t Like at Ixquick:

·       small search box, so that one cannot read the entire query if it is a longer one

·       most information that would help the searcher discover how the engine works, how to form queries, or how to denote preferences is buried – the choice to fill the homepage with language choices instead of links to this type of information is a bad one in my opinion

·       lack of advanced search forms which make complex searches fairly inaccessible to the average searcher

·       inability to place date limitations on results

·       Boolean terms and conventions that are supposedly supported sometimes fail to produce results when other results clearly indicate they should have – for example using wildcards and/or NEAR should clearly have produced results in this case and they did not

·       searchers who don’t bother to click on the “Power Search Techniques” link, available only at the bottom of a results page, will miss much information about the nature of the search and therefore their results, including:

              1) that the default Boolean term is OR, which has actually become unusual

              2) that even if the searcher uses AND or + the engine will allow results without all terms if their concurrence proves difficult

              3) that though one can use either natural language or most Boolean terms and conventions, the metasearch function will poll only engines that also support those capabilities

 

 

CONCLUSIONS

In the case of this sample search, it was clearly the metasearch engine that provided the results most relevant to the exact query.  Even though they were all different URLs, many of the Ixquick results focused on a single study done by the Medical Center at the University of California at Davis.  This study did indicate that smokers with breast cancer were more prone to develop pulmonary metastases.  It would have been nice to be able to check for additional good results hidden by this proliferation by using date limitations, but they are unavailable at Ixquick. Since searches at AllTheWeb and other engines did not produce evidence of additional good results for this particular query, I would not consider this a problem of duplication.  For one thing each URL provided a different summary of the subject area and different interpretations of the particular study.  In researching a matter of personal health and lifestyle choice, I would consider the varying study interpretations extremely helpful. 

     Because Ixquick is unlike other engines in its default settings, reads parentheses in an unusual way, and buries this and other important information, I would not recommend this metasearch engine for naïve searchers.  For searchers who take the time to fully understand it, Ixquick can serve as a single source of information, as a way to locate a search engine or engines that provide the best results for a particular query, or as a backup for a subject directory browse as suggested by the South Carolina tutorial referenced in the class lecture for this Unit.  This is only true, however, for inquiries that are not hampered by the effective limitations on Boolean terms and conventions, which appear more restrictive than the “Power Search Techniques” descriptions would indicate.

     I consider it highly probable that the lack of precision in the top twenty AllTheWeb results can be attributed to the engine’s apparent lack of left to right ranking of the search terms.  In this health query it is not enough to merely list pages that contain all four parts of the Boolean string.  There are too many of them.  The use of the Boolean term NEAR might prove helpful, but it was unavailable at AllTheWeb and did not seem to actually work at Ixquick even though it was offered.  So without that capability, it is imperative that the terms be ranked left to right.  The searcher makes intuitive placement of the terms based on their relative importance to the actual reference query and tests the placement if necessary.  If the engine then ignores this input, complex queries with long results lists will not be appropriately ranked.  This may not be important to the user who only wants one or two relevant responses, but those who require more depth will be disappointed.  The ranking problem, the limited Boolean capabilities, and the many other changes that have stripped this engine of the complexity and unique qualities available in earlier versions make me reluctant to consider using this engine again for any reason.  The engine now looks and acts too much like Google, without being as effective, to provide a viable alternative or second choice.
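     To make the idea of left to right ranking concrete, the following Python sketch scores two invented pages with and without position weights. The weighting scheme (first term weighs most, last term least) and the sample texts are purely hypothetical; I do not know how either engine actually ranks, so this only illustrates why term order could change which page comes out on top.

```python
# Hypothetical scoring functions; neither engine's real ranking is known.

def plain_score(text, terms):
    """Count term occurrences with every term weighted equally."""
    return sum(text.count(term) for term in terms)


def left_to_right_score(text, terms):
    """Weight earlier query terms more heavily than later ones."""
    n = len(terms)
    score = 0
    for position, term in enumerate(terms):
        weight = n - position  # first term weighs n, last term weighs 1
        score += weight * text.count(term)
    return score


terms = ["metastasis", "smoking", "breast cancer", "lungs"]

# Invented page texts for illustration only.
docs = {
    "study": (
        "metastasis to the lungs is more common with smoking; "
        "smoking and metastasis in breast cancer"
    ),
    "overview": (
        "breast cancer basics, breast cancer staging, breast cancer care, "
        "breast cancer support; smoking, metastasis, lungs"
    ),
}

for name, text in docs.items():
    print(name, plain_score(text, terms), left_to_right_score(text, terms))
```

With equal weights the general overview page scores higher because it repeats “breast cancer” so often; with left to right weights the page about metastasis and smoking moves ahead, which is the behavior the query order was meant to produce.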

     Finally, the question that prompted these searches actually required an overview of the available information in a particular subject area, although it was a fairly narrow area.  According to the tutorials provided in this Unit’s lecture, a more appropriate strategy might be to first check subject directories or health databases which are more focused and which are better at delivering results from the Invisible Web.  An additional reason for Ixquick’s superior results is that some of the engines it polls, like AltaVista, actually use search strategies that are hybrids of the two approaches.

