ANALYSIS OF ALLTHEWEB AND IXQUICK USING A COMPLEX SEARCH
SEARCH QUESTION
In order to test the capabilities and results provided by a search engine and a metasearch engine, I chose to research a question of personal interest. My aunt was diagnosed with breast cancer in the fall of 2002 and chose to undergo both chemotherapy and radiation, as her doctors recommended. In the area of lifestyle changes she is less compliant: she refuses to give up smoking. Because I am already aware that breast cancer is quite prone to lung metastases, the goal of my search was to find out whether there was information on the World Wide Web to support the position that her survival prognosis would improve if she gave up smoking.
ENGINES SEARCHED
The Sunsite Berkeley web search tutorial suggests that www.google.com is the best general search engine. I chose to search the one it mentions as a good backup, www.alltheweb.com, since I am less familiar with it. After examining the metasearch engines the tutorial suggests, I chose to use www.ixquick.com.
SEARCH TERMS
After analyzing the subject and the capabilities of the engines I chose, and performing a few trial searches, I chose the following search string:

metastasis AND smoking AND “breast cancer” AND lungs

Though one can get the same results without the Boolean AND on AllTheWeb, where it is the default in the non-Boolean search box, the searcher must include it on Ixquick. On that metasearch engine a simple string of terms is searched as “any of the terms” rather than “all of the terms”; in other words, the default is OR. Because I wanted to use a search string that would be equally compatible with AllTheWeb and Ixquick, I included the AND and used the Boolean search box on AllTheWeb. The AND could just as well have been a +; in fact, at Ixquick the query is so translated. Since AllTheWeb does not support truncation, I did not use smok*, metasta*, and lung* as I would have on other engines.

My goal was to maximize the relevant results in the top layer while minimizing the irrelevant ones. The particular order of terms was chosen after trying several alternatives. Giving “breast cancer” or [lungs] prominence in the string decreased relevance to the particular question. At Ixquick, placing [metastasis] before [smoking] produced the best results, and since at AllTheWeb the order of these terms did not change the relevance ratings of the top 20 results, that is the order I used for the search string. The result set for this particular set of terms is still inordinately large at AllTheWeb, although only 100 of the results are listed, but alterations or additions that decreased the result set also reduced the relevance of the top results. Even though Ixquick only pulls the top ten results from ten different engines, it also provides a figure for the larger result set from which those are pulled.

A preference choice at AllTheWeb is to pull only one result from each domain. Since I indicated that choice, all results from that engine should be unique websites. However, I marked the figure as questionable since some of the other preferences seem not to work correctly.
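The difference in default operators described above can be made concrete with a small sketch. It only builds query strings; it does not contact either engine. The function names are invented for this illustration, the code uses straight quotation marks for the phrase, and the exact + syntax that Ixquick produces internally is an assumption.

```python
def build_query(terms, explicit_and=True):
    """Join search terms into one query string.

    Multi-word terms are wrapped in double quotes so they are read as phrases.
    With explicit_and=False the terms are simply separated by spaces, which an
    engine interprets according to its own default (AND or OR).
    """
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    joiner = " AND " if explicit_and else " "
    return joiner.join(quoted)


def to_plus_form(terms):
    """Write the same all-terms query with a leading + on each term."""
    return " ".join(f'+"{t}"' if " " in t else f"+{t}" for t in terms)


terms = ["metastasis", "smoking", "breast cancer", "lungs"]
print(build_query(terms))
# metastasis AND smoking AND "breast cancer" AND lungs
print(to_plus_form(terms))
# +metastasis +smoking +"breast cancer" +lungs
```

Writing the AND explicitly is what makes one string behave the same way on an engine whose default is AND and on one whose default is OR.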
TOTAL RESULTS
|                   | ALLTHEWEB | IXQUICK |
| # TOTAL RESULTS   | 14,194    | 13,466  |
| # RESULTS LISTED  | 100       | 39      |
| # UNIQUE WEBSITES | 0?        | 35      |
| RESULTS ANALYZED  | top 20    | top 20  |
METHODOLOGY
After performing the search at each engine, I analyzed the top twenty results retrieved by each for relevancy. With few exceptions, the results could technically be considered relevant because the text did include all terms in the search string. However, some results did not actually address the original query because the terms did not intersect as desired. Sometimes this was because the resource was a particularly large one. Other times it was just the nature of the result: in one case a glossary of terms, in another a bibliography. Finally, in the subject area of breast cancer, even a short article can contain the three other terms without the desired intersection.

At AllTheWeb it was difficult to verify the inclusion of all terms because highlighting only appeared to work on the results page itself. Only one result was not even in the area of cancer. Inexplicably, at Ixquick a result directing the searcher to a fire safety company appeared for multiple permutations of the term string. It is doubtful that the terms “breast cancer” or [metastasis] appeared anywhere on that site. The following table illustrates the actual spread of results.
NATURE OF TOP TWENTY RESULTS
|                                    | ALLTHEWEB | IXQUICK |
| RELEVANT (all 4 terms present)     |           |         |
| Query Relevant                     | 2         | 11      |
| Query Relevant But Link Problem    |           | 1       |
| Query Relevant w/Site Registration |           | 1       |
| Smoking and Breast Cancer          | 1         | 1       |
| General Breast Cancer Info         | 14        | 5       |
| Focus on Other Cancer Type         | 2         |         |
| IRRELEVANT (terms missing)         |           |         |
| Treatment and Marijuana Smoking    | 1         |         |
| Fire Safety Company                |           | 1       |
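The counts in the table above can be turned into a rough precision figure for the top twenty results at each engine. The sketch below is only arithmetic on those counts. The "strict" and "lenient" labels are my own for this illustration: lenient also counts the two Ixquick results that were query relevant but had a link problem or required site registration.

```python
# Counts copied from the table above; the dictionary keys are labels chosen
# for this sketch, not terms used by either engine.
counts = {
    "AllTheWeb": {"query_relevant": 2, "relevant_but_hard_to_reach": 0},
    "Ixquick": {"query_relevant": 11, "relevant_but_hard_to_reach": 2},
}
ANALYZED = 20  # top twenty results examined at each engine


def precision(hits, analyzed=ANALYZED):
    """Fraction of the analyzed results that addressed the actual query."""
    return hits / analyzed


for engine, c in counts.items():
    strict = precision(c["query_relevant"])
    lenient = precision(c["query_relevant"] + c["relevant_but_hard_to_reach"])
    print(f"{engine}: strict {strict:.0%}, lenient {lenient:.0%}")
# AllTheWeb: strict 10%, lenient 10%
# Ixquick: strict 55%, lenient 65%
```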
ENGINE CAPABILITIES AND FEATURES
Features I Liked at AllTheWeb:
· size of database – over 2 billion
· clean lines and pleasing colors – one can actually change the colors to suit, although it requires storing information on your own computer, and of the alternate choices only the gray palette is pleasing to me
· capability of denoting search type on the homepage: web, news, pictures, video, audio, or FTP files
· pull-down menu to apply “all of the words” (default), “any of the words,” “exact phrase,” or “Boolean expression” to the basic search box either on the homepage or the advanced search page
· prominence of customization choices – right next to the main search box there is a link to a page where the user can designate preferences on multiple parameters – less intuitive are the links to language and advanced search preferences from that page – also, users who first click to advanced search will miss all of it
· expandable Boolean search box so that the full query can always be read
· inclusion of some “similar searches” at the top of the results – although the results of direct clicking are mixed, they can certainly give the user ideas for expanding the search
· lack of banner ads, which were evidently removed recently
· searching in multiple languages, with the default changing with the country of the searcher
· multiple limiting factors on the advanced search page: domain, IP address or range, file type or size, publication date, and the inclusion or exclusion of particular types of file embedded in results
· ability to include all or most of the above limitations right in the Boolean search box, although the instructions for this are buried
· ability to designate a field to search for each term in a search string, although adding one for author-designated keywords would help
· availability of searching within a result set with a form to add terms
· availability of investigating websites by merely entering a URL in the query box, although this capability is only noted on the new features page
Although I do not use AllTheWeb very often, I am familiar enough with the search engine to know that within the past year a number of features that made it unique have been removed. The results page used to include whole sentences under each result rather than telegraphic indications of the first appearance of search terms. It also used to provide unobtrusive subcategory folders that could be helpful in ranking results or narrowing a topic. AllTheWeb used to have much more Boolean capability. Terms like NEAR that were supported before will now fail to produce results without any indication why. Worse, nesting was supported before, and now, instead of just ignoring parentheses, the engine unexpectedly reads them as indicating an OR relationship between enclosed terms. It would appear that those in charge wish to make the engine more like Google and, in so doing, increase use. My opinion is that searchers who made this engine their first choice, or at least used it heavily as a second engine, now have much less reason to do so.
Features I Didn’t Like at AllTheWeb:
· removal of former results management features noted in the above discussion
· removal of the “exact phrase” choice from the menu box – although it is possible to include it by customization, this is not made clear by the phrasing on the preferences page, but only within the discussion of new features (which will probably soon disappear)
· removal of formerly supported Boolean capabilities noted in the above discussion
· unusual reading of parentheses in a Boolean string, as discussed above
· lack of truncation
· although highlighting of results is available as a preference, this feature appears not to work, or to apply only to the words that appear on the results page
· apparent lack of left-to-right ranking of terms, which in the case of this search made the top results less relevant to the actual query
· appearance of results that do not actually contain all search terms
· lack of cache
Features I Liked at Ixquick:
· nature of metasearching – capability of searching the top ten results at multiple search engines at the same time
· capability of denoting search type on the homepage: web, mp3, news, pictures
· capability of denoting language on the homepage, although this is limited to 14 languages
· apparent left-to-right ranking of the search terms by both Ixquick and the engines it searches, which in this search made the top results much more relevant to the search question
· possibility of searching using natural language, although of course only engines that support it will appear in the results
· full support of all Boolean language and conventions, although both the information that this is so and the fact that using certain ones will limit which engines can be polled are somewhat buried
· searcher can choose to view the webpage results with the search terms highlighted
· results page includes a few similar searches – in this case “cancer treatments” and “National Institute of Health guides” were referenced, and although the effectiveness of direct clicking on them is mixed, they can certainly provide good ideas for additional searches
· each result includes both the number of engines reporting the result, denoted with stars, and the names of those engines with their numerical ranking of the particular result
· each polled engine represented in the results is linked so that the searcher can move to examine the results there without repeating the query stage
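The star display described in the list above, where each result shows how many engines reported it and the rank each engine gave it, can be sketched as a simple merge of ranked lists. This is an illustration only: Ixquick's actual merging and ordering method is not documented here, so the tie-breaking rule, the function name, and the sample engine names and URLs are all assumptions.

```python
from collections import defaultdict


def merge_results(engine_results, top_n=10):
    """Combine ranked URL lists from several engines into one list.

    engine_results maps an engine name to its ranked list of URLs.
    Returns tuples of (url, stars, ranks), where stars is the number of
    engines that reported the URL and ranks maps engine name to rank.
    """
    seen = defaultdict(dict)
    for engine, urls in engine_results.items():
        # Only the top_n results of each engine are considered.
        for rank, url in enumerate(urls[:top_n], start=1):
            seen[url].setdefault(engine, rank)
    merged = [(url, len(ranks), ranks) for url, ranks in seen.items()]
    # Assumed ordering: more reporting engines first, then best rank, then URL.
    merged.sort(key=lambda item: (-item[1], min(item[2].values()), item[0]))
    return merged


# Made-up sample data for two hypothetical engines.
sample = {
    "engine_a": ["example.org/study", "example.org/overview"],
    "engine_b": ["example.org/glossary", "example.org/study"],
}
for url, stars, ranks in merge_results(sample):
    print("*" * stars, url, ranks)
```

In this sample the page reported by both engines is listed first with two stars, followed by the pages that only one engine returned.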
Features I Didn’t Like at Ixquick:
· small search box, so that one cannot read the entire query if it is a longer one
· most information that would help the searcher discover how the engine works, how to form queries, or how to denote preferences is buried – the choice to fill the homepage with language choices instead of links to this type of information is a bad one in my opinion
· lack of advanced search forms, which makes complex searches fairly inaccessible to the average searcher
· inability to place date limitations on results
· Boolean terms and conventions that are supposedly supported sometimes fail to produce results when other results clearly indicate they should have – for example, using wildcards and/or NEAR should clearly have produced results in this case and did not
· searchers who don’t bother to click on the “Power Search Techniques” link, available only at the bottom of a results page, will be missing much information about the nature of the search and therefore their results, including
1) that the default Boolean term is OR, which has actually become unusual
2) that even if the searcher uses AND or +, the engine will allow results without all terms if their concurrence proves difficult
3) that though one can use either natural language or most Boolean terms and conventions, the metasearch function will poll only engines that also support those capabilities
CONCLUSIONS
In the case of this sample search, it was clearly the metasearch engine that provided the results most relevant to the exact query. Even though they were all different URLs, many of the Ixquick results focused on a single study done by the Medical Center at the University of California at Davis. This study did indicate that smokers with breast cancer were more prone to develop pulmonary metastases. It would have been nice to be able to check for additional good results hidden by this proliferation by using date limitations, but they are unavailable at Ixquick. Since searches at AllTheWeb and other engines did not produce evidence of additional good results for this particular query, I would not consider this a problem of duplication. For one thing, each URL provided a different summary of the subject area and different interpretations of the particular study. In researching a matter of personal health and lifestyle choice, I would consider the varying study interpretations extremely helpful.
Because Ixquick is unlike other engines in its default settings, reads parentheses in an unusual way, and buries this and other important information, I would not recommend this metasearch engine for naïve searchers. For searchers who take the time to fully understand it, Ixquick can serve as a single source of information, as a way to locate a search engine or engines that provide the best results for a particular query, or as a backup for a subject directory browse, as suggested by the South Carolina tutorial referenced in the class lecture for this Unit. This is only true, however, for inquiries that are not hampered by the effective limitations on Boolean terms and conventions, which appear more restrictive than the Power Search descriptions would indicate.
I consider it highly probable that the lack of precision in the top twenty AllTheWeb results can be attributed to the engine’s apparent lack of left-to-right ranking of the search terms. In this health query it is not enough to merely list pages that contain all four parts of the Boolean string; there are too many of them. The Boolean term NEAR might have proved helpful, but it was unavailable at AllTheWeb and did not seem to actually work at Ixquick even though it was offered. Without that capability, it is imperative that the terms be ranked left to right. The searcher makes intuitive placement of the terms based on their relative importance to the actual reference query and tests the placement if necessary. If the engine then ignores this input, complex queries with long results lists will not be appropriately ranked. This may not be important to the user who only wants one or two relevant responses, but those who require more depth will be disappointed. The ranking problem, the limited Boolean capabilities, and the many other changes that have stripped AllTheWeb of the complexity and unique qualities available in earlier versions make me reluctant to consider using this engine again for any reason. The engine now looks and acts too much like Google, without being as effective, to provide a viable alternative or second choice.
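To show what left-to-right ranking of terms could mean in practice, here is a toy scoring sketch. It is not the algorithm of AllTheWeb, Ixquick, or any engine Ixquick polls; the weighting scheme, the function names, and the two sample page texts are all invented for illustration. Earlier terms in the query simply receive a larger weight than later ones.

```python
def plain_count(text, terms):
    """Score a page by how often the terms occur, ignoring their order in the query."""
    lowered = text.lower()
    return sum(lowered.count(term.lower()) for term in terms)


def position_weighted_score(text, terms):
    """Score a page so that earlier query terms count for more.

    The first term gets weight len(terms) and the last gets weight 1.
    Each weight is multiplied by how often the term occurs in the text.
    """
    lowered = text.lower()
    n = len(terms)
    return sum((n - i) * lowered.count(term.lower()) for i, term in enumerate(terms))


terms = ["metastasis", "smoking", "breast cancer", "lungs"]

# Made-up page texts: one general breast cancer page, one focused on the query.
pages = {
    "general": "Breast cancer overview. Breast cancer treatment. Breast cancer "
               "support. Breast cancer statistics. Smoking, lungs, metastasis.",
    "focused": "Smoking and metastasis: does smoking affect metastasis to the "
               "lungs in breast cancer?",
}

for name, text in pages.items():
    print(name, plain_count(text, terms), position_weighted_score(text, terms))
# general 7 16
# focused 6 17
```

With plain counting the general page scores higher because it repeats “breast cancer” so often. Once the leading terms are weighted more heavily, the focused page moves ahead, which is the behavior the search string was ordered to encourage.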
Finally, the question that prompted these searches actually required an overview of the available information in a particular subject area, although it was a fairly narrow area. According to the tutorials provided in this Unit’s lecture, a more appropriate strategy might be to first check subject directories or health databases, which are more focused and which are better at delivering results from the Invisible Web. An additional reason for Ixquick’s superior results is that some of the engines it polls, like AltaVista, actually use search strategies that are hybrids of the two approaches.