Focused Crawling: a New Approach to Topic-Specific Web Resource Discovery
Soumen Chakrabarti (IIT Mumbai), Martin van den Berg (Fuji-Xerox Research) and Byron Dom
Proceedings of the Eighth International World Wide Web Conference (WWW8)
Paper Description
This paper got the best paper award at the Eighth International Worldwide Web Conference. It presents a novel idea for Web crawling: combining classification and crawling to create a focused crawler. Businesses, organizations and hobbyists often want to create a collection of links to useful web pages on a given topic. Creating a good collection using standard search engines is very hard, since users need to know exactly the right set of keywords for each type of resource page. The focused crawler solves this problem by starting from a set of example documents, and selectively seeking out pages that are relevant to the topic. Further, general search engines only cover between 10% and 30% of web pages. Hence this technology, by identifying the right subset to search, can potentially get better coverage of a topic, especially with respect to the newest pages. Focused crawling is achieved by the use of two hypertext mining programs that were designed and developed by the authors; a classifer that evaluates the relevance of a hypertext document with respect to the focus topic, and a distiller that identifies hypertext links that are likely to lead to to relevant pages. Using empirical evaluation, the authors show that focused crawling discovers valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius. Focused crawling is also robust against large perturbations in the starting set of URLs. This paper has already spawned several other papers.
Ripple Joins for Online Aggregation
Peter Haas and Joe Hellerstein (UC Berkeley)
ACM SIGMOD Conference
Paper Description
This paper was runner-up for the best paper award in the ACM SIGMOD 1999 Conference - the leading conference in Data Management. Current relational database management systems can't handle ad hoc decision-support queries efficiently and it usually takes a long time for the users to get the final query results. On the other hand, this type of query is typically used to get a "big picture" of the underlying data sets and handling them in an online fashion (so that progressively refined running estimates of the final results are continuously displayed) becomes a very attractive approach. Traditional offline join algorithms are designed to minimize the time to completion of the query. In contrast, this paper presents a new family of join algorithms, called ripple joins, that are designed to minimize the time until an acceptably precise estimate of the query result is available, as measured by the length of a confidence interval. The paper shows how ripple joins can be implemented in an existing relational database system. In experiments, ripple joins compute reasonably precise online estimates in 1/100th of the time required for the best offline algorithms to produce exact answers.
This paper has opened up an interesting area of systems research and finds applications in areas ranging from data visualization to Internet query processing. It has already been cited in several recent database papers.
