July 30, Wednesday
12:00 – 13:30

Faceted Searching and Browsing Over Large Collections of Textual and Text-Annotated Objects
Students seminar
Lecturer : Dr. Wisam Dakka
Affiliation : Google-NYC
Location : 201/37
Host : Students seminar
Advisers: Luis Gravano and Panagiotis (Panos) Ipeirotis

Abstract: The vast majority of Internet users rely on search to navigate the text and text-annotated collections of a variety of web sites. Users of sites such as the New York Times archive, YouTube, and others often face long lists of results for their queries because the collections are so large. Processing numerous items is also a hurdle for "exploratory" users who have no specific query in mind, such as a new shopper in an online store or a researcher accessing a news archive. In this work, we attempt to address this problem. We investigate faceted searching and browsing to provide users with access methods that are useful for discovering the content and the structure of long search results or large collections. Hierarchies that organize items by topic are commonly used for browsing a large set of items. For example, Yahoo! uses a topic-based hierarchy to guide users to their web pages of interest. Google News and Newsblaster enable news readers to quickly navigate the daily news based on a hierarchy of topics and related events. We first present summarization-aware topic faceted searching and browsing, which integrates clustering and summarization so that users can browse a list of summarized clusters in the query results instead of individual documents. We have built a fully functional summarization-aware system for daily news. In addition to the topic facet, time can be used as an alternative facet for browsing search results. We explore time as an important dimension and suggest a general framework for time-based language models that account for time in the retrieval task. Many facets other than topic and time can also be useful for faceted searching and browsing. We therefore propose supervised and unsupervised methods to identify and extract multiple relevant facets from collections. Yet incorporating such facets in searching or browsing is not an easy task.
A typical approach to using facets in searching and browsing is to build an individual hierarchy for each facet. Unfortunately, these hierarchies are currently constructed and populated manually or semi-manually. The cost of manually annotating each item prevents deploying such hierarchies for large collections. To solve this problem, we propose a system that automates the construction of hierarchies for the extracted facets, and we present corresponding human studies that verify the effectiveness of our methods. We apply the faceted hierarchies to a range of large data sets, including collections of annotated images, television programming schedules, and web pages.