Thursday, October 7, 2010

Le Zhao and Jonathan Elsas -- Wednesday October 13th

Please join us for two upcoming IR series talks on Wednesday, Oct 13, 2010.

Lunch will be provided by Yahoo!.

Date/Time: Wednesday, Oct 13, 2010, noon
Place: GHC 4405

First Speaker: Le Zhao
Title: Term Necessity Prediction

The probability that a term appears in relevant documents (P(t|R)) is a fundamental quantity in several probabilistic retrieval models; however, it is difficult to estimate without relevance judgments or a relevance model. We call this value term necessity because it measures the percentage of relevant documents retrieved by the term – how necessary a term’s occurrence is to document relevance. Prior research has typically either set this probability to a constant or estimated it from the term's inverse document frequency, neither of which was very effective.

This paper identifies several factors that affect term necessity, for example, a term’s topic centrality, synonymy and abstractness. It develops term- and query-dependent features for each factor that enable supervised learning of a predictive model of term necessity from training data. Experiments with two popular retrieval models and 6 standard datasets demonstrate that using predicted term necessity estimates as user term weights for the original query terms leads to significant improvements in retrieval accuracy.
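
As a rough illustration of the quantity being predicted, the sketch below computes an empirical P(t|R) when relevance judgments are available: the fraction of relevant documents that contain the term. The function name and the toy documents are hypothetical, and this is not the supervised prediction method described in the abstract, only the target value such a model would try to approximate.

```python
def term_necessity(term, relevant_docs):
    """Empirical P(t|R): fraction of relevant documents containing `term`.

    `relevant_docs` is a list of sets of terms (hypothetical representation).
    """
    if not relevant_docs:
        raise ValueError("need at least one relevant document")
    hits = sum(1 for doc in relevant_docs if term in doc)
    return hits / len(relevant_docs)


# Hypothetical toy data: four judged-relevant documents for one query.
relevant_docs = [
    {"prognosis", "cancer", "treatment"},
    {"cancer", "outlook", "therapy"},
    {"cancer", "prognosis", "survival"},
    {"tumor", "survival", "rate"},
]

necessity = {t: term_necessity(t, relevant_docs) for t in ("cancer", "prognosis")}
print(necessity)
```

In this toy case "cancer" appears in three of the four relevant documents and "prognosis" in two, so a query that requires "prognosis" would miss half of the relevant documents.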

The paper will appear in CIKM 2010.


Second Speaker: Jon Elsas
Title: Rank Learning for Factoid Question Answering with Linguistic and Semantic Constraints

This work presents a general rank-learning framework for leveraging deep linguistic and semantic features for passage ranking within Question Answering (QA) systems. The passage ranking framework enables query-time checking of these complex and long-distance constraints among question features such as keywords and named entities. These constraints can include keyword ordering, annotation type-checking, verb-argument attachment and arbitrary long-distance paths through an annotation graph. We show that a trained ranking model using this rich feature set achieves greater than a 20% improvement in Mean Average Precision (MAP) over baseline keyword retrieval models. We also show that for questions expressing the most complex linguistic and semantic constraints, further gains in MAP are realized, yielding a 40% improvement over the baseline.
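
For readers unfamiliar with the metric, here is a minimal sketch of how Mean Average Precision and a relative improvement over a baseline can be computed. The rankings are made-up toy data, the function names are hypothetical, and the sketch assumes that every relevant passage appears somewhere in each ranked list and that the reported percentages are relative gains; the abstract does not state either.

```python
def average_precision(ranked_relevance):
    """Average precision for one question.

    `ranked_relevance` is a list of booleans in rank order (True = relevant).
    Assumes all relevant passages are present in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs):
    """Mean of per-question average precision."""
    return sum(average_precision(r) for r in runs) / len(runs)


# Hypothetical rankings for two questions, before and after re-ranking.
baseline = [[True, False, True, False], [False, True, False, False]]
reranked = [[True, True, False, False], [True, False, False, False]]

map_base = mean_average_precision(baseline)
map_new = mean_average_precision(reranked)
relative_gain = (map_new - map_base) / map_base
print(round(map_base, 4), round(map_new, 4), round(relative_gain, 4))
```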

The paper will appear in CIKM 2010.

Tuesday, July 13, 2010

Jaime Arguello -- Friday July 16th

Time: Noon
Location: TBA

Speaker: Jaime Arguello, Language Technologies Institute, School of Computer Science, CMU

Title: Vertical Selection in the Presence of Unlabeled Verticals.


Vertical aggregation is the task of incorporating results from specialized search engines or verticals (e.g., images, video, news) into Web search results. Vertical selection is the subtask of deciding, given a query, which verticals, if any, are relevant. State-of-the-art approaches use machine-learned models to predict which verticals are relevant to a query. When trained using a large set of labeled data, a machine-learned vertical selection model outperforms baselines that require no training data. Unfortunately, whenever a new vertical is introduced, a costly new set of editorial data must be gathered. In this paper, we propose methods for reusing training data from a set of existing (source) verticals to learn a predictive model for a new (target) vertical. We study methods for learning robust, portable, and adaptive cross-vertical models. Experiments show the need to focus on different types of features when maximizing portability (the ability for a single model to make accurate predictions across multiple verticals) than when maximizing adaptability (the ability for a single model to make accurate predictions for a specific vertical). We demonstrate the efficacy of our methods through extensive experimentation for 11 verticals.

This is joint work with Fernando Diaz and Jean-Francois Paiement from Yahoo! Labs and will be presented at SIGIR 2010.

Tuesday, July 6, 2010

Grace Yang -- Friday July 9th

Time: Noon
Location: GHC 4405

Speaker: Grace Hui Yang, Language Technologies Institute, School of Computer Science, CMU

Title: Collecting High Quality Overlapping Labels at Low Cost.

This paper studies the quality of human labels used to train search engines’ rankers. Our specific focus is performance improvements obtained by using overlapping relevance labels, that is, by collecting multiple human judgments for each training sample. The paper explores whether, when, and for which samples one should obtain overlapping training labels, as well as how many labels per sample are needed. The proposed selective labeling scheme collects additional labels only for a subset of training samples, specifically for those that are labeled relevant by a judge. Our experiments show that this labeling scheme improves the NDCG of two Web search rankers on several real-world test sets, with a low labeling overhead of around 1.4 labels per sample. This labeling scheme also outperforms several methods of using overlapping labels, such as simple k-overlap, majority vote, the highest labels, etc. Finally, the paper presents a study of how many overlapping labels are needed to get the best improvement in retrieval accuracy.
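
To see why selective labeling stays cheap, the sketch below counts the average number of labels per sample when every sample gets one label and only samples first judged relevant get extra ones. The function name, the share of relevant samples, and the number of extra labels are all hypothetical values chosen for illustration; they are not taken from the paper, and the abstract does not say how its figure of around 1.4 was obtained.

```python
def labels_per_sample(first_labels, extra_labels_if_relevant):
    """Average labels collected per sample under a selective scheme.

    Every sample receives one initial label; samples whose initial label is
    relevant (True) receive `extra_labels_if_relevant` additional labels.
    """
    total = 0
    for is_relevant in first_labels:
        total += 1  # the initial judgment
        if is_relevant:
            total += extra_labels_if_relevant
    return total / len(first_labels)


# Hypothetical data: 2 of 10 samples are judged relevant on the first pass.
first_labels = [True, False, False, False, False, True, False, False, False, False]
avg = labels_per_sample(first_labels, extra_labels_if_relevant=2)
print(avg)
```

By comparison, a plain k-overlap scheme with three labels for every sample would cost three labels per sample regardless of how many samples are relevant.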

This paper is published in Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), Geneva, Switzerland, July 19-23, 2010.

Friday, May 28, 2010

Lee Jensen -- Thursday, June 10

Speaker: Lee Jensen

Date/Time: Thursday, June 10, 2010, noon

Place: GHC 6115

Title: Exploiting Sequential Relationships for Familial Classification

Information hidden in sequences of related data provides a significant exploitable source for information extraction. In this work we demonstrate techniques for increasing the signal of classifications from sequences of dependent data. This includes using algorithms, as well as generating features, that take advantage of the sequence-oriented nature of the data. The efficacy of these techniques is evaluated with the classification of familial relationships within United States census data. The census data proves an interesting corpus for this work because of the highly sequential nature of the instances, and the explicit classification relationship between one instance and an instance previously found in the sequence.

Thursday, April 8, 2010

Vasudeva Varma -- Friday April 16, 2010, noon

Speaker: Dr. Vasudeva Varma (International Institute of Information Technology, Hyderabad, India)

Date/Time: Friday April 16, 2010, noon

Place: Newell-Simon Hall 3002

Title: Cross Language Information Access - An Indian Experience

In this talk, I will briefly explain the cross language information retrieval research for Indian languages. There is a nationwide mission-mode project on cross language information access, sponsored by the Ministry of Communications and Information Technology, Government of India, involving ten universities. I will talk about the major research issues and a possible roadmap for CLIR research in the Indian setting. I will also share our experiences with multi-lingual summarization.

Vasudeva Varma has been a faculty member at the International Institute of Information Technology, Hyderabad, since 2002. His research interests include search (information retrieval), information extraction, information access, knowledge management, cloud computing and software engineering. He heads the Search and Information Extraction Lab and the Software Engineering Research Lab at IIIT Hyderabad. He has also been the chair of Post Graduate Programs since 2009. He has published a book on Software Architecture (Pearson Education) and over sixty technical papers in journals and conferences. In 2004, he received a young scientist award and grant from the Department of Science and Technology, Government of India, for his proposal on personalized search engines. In 2007, he was given a Research Faculty Award by AOL Labs.