How to Evaluate a Clustering Search Engine
Many enterprise search vendors have announced that clustering of search results is now part of their product and user experience. The most recent case is Google (press center, blog post, blogosphere reaction). Microsoft researchers have also experimented with clustering, although these experiments have not yet found their way into Microsoft’s products.
By definition, a clustering engine analyzes the top (say 200-500) search results from a query and displays the main themes, typically as folders that may consist of subfolders.
The spread of clustering engines is gratifying, since Vivisimo was founded on a breakthrough clustering algorithm, has been refining the approach and educating and selling into the search market since 2001, and has evolved into a complete enterprise search provider.
Just as with search results, or as with any other designed product, judging the quality of a clustering engine requires some skill. Before judging quality, let’s first explore clustering’s end user value, that is, how it enhances knowledge worker productivity.
Clustering enhances end user productivity in at least three ways:
At a glance, users gain an easy overview of the main distinct themes that are present in the top search results.
By clicking on clusters that satisfy their needs or interests, users can quickly arrive at search results that are valuable but low ranked, say, #73 or even #429 in the results list, and so would never be noticed otherwise. The user’s visibility into the content is greatly enhanced.
Within a cluster or sub-cluster, related results are placed together (“clustered”) rather than scattered throughout the ranked list. This makes it faster to find related content or the best content.
In short, clustering lets users overview, find, discover, and compare information more productively. How does clustering quality enhance or detract from this productivity gain?
To grasp an overview of the main themes, the cluster labels should be concise and natural-looking. Also, the clusters shouldn’t overlap too much in their contents, otherwise the user will be overloaded with too many clusters expressing overly related themes. If the clusters don’t overlap too much, so that on average a search result appears in only 1.2 to 1.5 clusters, then the main distinct themes will be shown, rather than similar/duplicate themes on overlapping content. Also, the cluster labels shouldn’t be artificially limited to labels that contain the query word, or labels that have two or more words in them, or some other artifact of an inferior clustering approach. Finally, the underlying search engine snippets (aka excerpts or dynamic summaries) should be full enough so that clustering has enough input text to work with.
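The overlap guideline above can be checked with a little arithmetic. The following is a minimal sketch, assuming the clustering output is available as a mapping from cluster label to a list of result IDs; the function name, the labels, and the sample data are hypothetical and do not come from any particular product.

```python
from collections import Counter


def average_memberships(clusters):
    """Mean number of clusters that each clustered result appears in."""
    counts = Counter()
    for result_ids in clusters.values():
        # Use a set so a result listed twice in one cluster counts once.
        counts.update(set(result_ids))
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


# Hypothetical output: ten results, two of which (4 and 6) sit in two clusters.
clusters = {
    "Press releases": [1, 2, 3, 4],
    "Product reviews": [4, 5, 6],
    "Pricing": [6, 7, 8, 9, 10],
}
overlap = average_memberships(clusters)
```

A value between roughly 1.2 and 1.5 would fall inside the range suggested above; a much higher value hints that several clusters cover the same content.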
To arrive at low-ranked but valuable results, the clustering engine should be fast enough so that 200-500 results or more can be clustered within an acceptable response time. If user authentication is an issue (note discussion here), then the response time should include the time for the search engine to verify that the user can view these documents. Also, the cluster labels should accurately express their contents, otherwise the user wastes time on wild goose chases.
In order for similar results to be placed together in the same clusters, the clustering software should possess the linguistic knowledge needed to correctly handle cases like the following:
sort out the meanings in middle ages, middle aged, and medieval, and in news release, new release, and press release.
realize that king and kingship are closely related, unlike gun and gunship.
realize that unfearful and fearless are synonymous, but not unhelpful and helpless.
plus many thousands of other linguistic relationships that take time and background knowledge to learn, whether by humans in school or by computers.
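To see why such cases need real linguistic knowledge, consider what a purely mechanical rule does with them. The sketch below is an illustration only, not how any clustering product works: `naive_stem`, `related`, and the exception list are hypothetical names introduced here.

```python
def naive_stem(word, suffix="ship"):
    """Strip one fixed suffix, with no knowledge of meaning."""
    if word.endswith(suffix) and len(word) > len(suffix):
        return word[: -len(suffix)]
    return word


def related(a, b, unrelated_pairs=frozenset()):
    """Treat two words as related if they share a naive stem,
    unless the pair is listed as a known exception."""
    if frozenset((a, b)) in unrelated_pairs:
        return False
    return naive_stem(a) == naive_stem(b)


# The mechanical rule groups both pairs, although only the first is right.
king_pair = related("king", "kingship")
gun_pair = related("gun", "gunship")

# A hand-made exception list fixes this one case, but each exception
# has to be learned or written down separately.
exceptions = frozenset({frozenset(("gun", "gunship"))})
gun_pair_fixed = related("gun", "gunship", exceptions)
```

The suffix rule alone cannot tell the two pairs apart, which is the kind of background knowledge the list above refers to.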
There are endless other subtleties, but enough: what’s the bottom line? Here are some questions to ask about the quality of a clustering search engine:
Are the cluster folders determined by analyzing the top search results? If not, then no overview of the major themes is being given. Instead, the “folders” are probably based on query logs.
Does clicking on a cluster cause a new search to be done? If so, it’s not clustering but something else, likely query refinement, which leads to a discontinuous user experience.
Is the clustering engine able to handle 200-500 results or even more?
Are the cluster names concise and natural-looking, and do they correctly handle numbers, punctuation, diacritics, foreign words, etc.?
Is there evidence that the clustering engine possesses considerable linguistic knowledge? And in other languages besides English, if needed? For example, are many (30-40% or more) of the search results left unclustered into an Other category? This suggests a deficiency in detecting related meanings.
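The last question can be turned into a quick measurement. This is a rough sketch, assuming the engine's output can be exported as a mapping from cluster label to result IDs and that leftover results are collected under a label such as "Other"; the function name, the label, and the sample numbers are assumptions.

```python
def unclustered_fraction(clusters, total_results, other_label="Other"):
    """Share of the analyzed results that ended up in the catch-all cluster."""
    if total_results <= 0:
        raise ValueError("total_results must be positive")
    leftover = set(clusters.get(other_label, []))
    return len(leftover) / total_results


# Hypothetical run: 200 results analyzed, 70 of them left in "Other".
clusters = {
    "Clustering algorithms": list(range(0, 80)),
    "Enterprise search": list(range(80, 130)),
    "Other": list(range(130, 200)),
}
fraction = unclustered_fraction(clusters, total_results=200)
```

Here the fraction is 0.35, which falls in the 30-40% range that the question above treats as a warning sign.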
http://searchdoneright.com/2007/03/how-to-evaluate-a-clustering-search-engine/
The post above explains how to evaluate clustering search engines, which is another way of deciding which search sites are appropriate.
Tuesday, March 30, 2010
How to Search the Internet Effectively
The link below explains how to search the Internet effectively.
http://www.media-awareness.ca/english/resources/special_initiatives/wa_resources/wa_teachers/tipsheets/search_internet_effectively.cfm
How to Search Effectively Online
http://video.about.com/google/Search-Effectively-Online.htm
These search engine tips will help you narrow down your online searches, so you find exactly the information you want.
Monday, March 29, 2010
Basic Search Tips
To use search engines effectively, it is essential to apply techniques that narrow results and push the most relevant pages to the top of the results list. Below are a number of strategies for boosting search engine performance.
- IDENTIFY KEYWORDS
- BOOLEAN AND
- BOOLEAN OR
- BOOLEAN AND NOT
- IMPLIED BOOLEAN: PLUS & MINUS
- PHRASE SEARCHING
- PLURAL FORMS, CAPITAL LETTERS, AND ALTERNATE SPELLINGS
- TITLE SEARCH
- DOMAIN SEARCH
- HOST SEARCH
- URL SEARCH
- LINK SEARCH
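The three Boolean operators in the list above behave like set operations on the pages that contain each keyword. Here is a small sketch of that idea, assuming a toy collection of three documents; the helper names and sample texts are made up for illustration, and real search engines differ in the exact operator syntax they accept.

```python
def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def lookup(index, word):
    """Return the ids of documents containing the word (empty set if none)."""
    return index.get(word.lower(), set())


# Hypothetical mini collection.
docs = {
    1: "jaguar car dealer",
    2: "jaguar cat habitat",
    3: "car repair manual",
}
index = build_index(docs)

both = lookup(index, "jaguar") & lookup(index, "car")         # jaguar AND car
either = lookup(index, "jaguar") | lookup(index, "car")       # jaguar OR car
only_jaguar = lookup(index, "jaguar") - lookup(index, "car")  # jaguar AND NOT car
```

AND narrows the results to pages with both words, OR widens them to pages with either word, and AND NOT removes pages that mention the unwanted word.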
For more details, click the following link!