Friday, October 3, 2008

Rosie Jones - Thursday October 9th, 2008

3002 Newell-Simon Hall
Thursday, October 9, 2008

Speaker: Rosie Jones

Title: Web Search Sessions

Abstract: Traditionally, information retrieval examines the search query in isolation: a query is used to retrieve documents, and the relevance of the documents returned is evaluated in relation to that query. However, users typically conduct web and other types of searches in sessions, issuing a query, examining results, and then re-issuing a modified query to improve the results. We describe the properties of real web search sessions, and show that users conduct searches for both broad and finer-grained tasks, which can be both interleaved and nested. We show that user search reformulations can be mined to identify related terms, and that we can identify the boundaries between tasks with greater accuracy than previous methods.

Rosie Jones is a Senior Research Scientist at Yahoo!. Her research interests include web search, geographic information retrieval, and natural language processing. She received her PhD from the Language Technologies Institute at Carnegie Mellon University under the supervision of Tom Mitchell, where her doctoral thesis was titled Learning to Extract Entities from Labeled and Unlabeled Text. She is co-organizing the WSDM 2009 Workshop on Web Search Click Data (WSCD09). She served on the Senior PC for SIGIR in 2007 and 2008, and is a Senior Member of the ACM.

Thursday, October 2, 2008

Le Zhao - Thursday 2nd Oct 2008

Le Zhao will give our first IR talk of the
semester. A reception will be provided by Yahoo!. Here is the talk information:

Date: Thursday 2nd Oct 2008
Time: 2pm
Place: Wean Hall 7220

Speaker: Le Zhao
Title: A Generative Retrieval Model for Structured Documents

Structured documents contain elements defined by the author(s) and annotations assigned by other people or processes. Structured documents pose challenges for probabilistic retrieval models when there are mismatches between the structured query and the actual structure in a relevant document or erroneous structure introduced by an annotator. This paper makes three contributions. First, a new generative retrieval model is proposed to deal with the mismatch problem. This new model extends the basic keyword language model by treating structure as a hidden variable during the generation process. Second, variations of the model are compared. Third, term-level and structure-level smoothing strategies are studied. Evaluation was conducted with INEX XML retrieval and question-answering retrieval tasks. Experimental results indicate that the optimal structured retrieval model is task dependent, that two-level Dirichlet smoothing significantly outperforms two-level Jelinek-Mercer smoothing, and that, with accurate structured queries, the proposed structured retrieval model significantly outperforms keyword retrieval on both QA and INEX datasets.

Based on work accepted at CIKM'08.

Friday, May 16, 2008

Justin Betteridge - Friday May 23rd

Please join us for an upcoming talk.

Lunch will be provided by Yahoo!

Linguistic Pattern Learning for Web Information Extraction

Who: Justin Betteridge
When: Friday, May 23rd, 12:00pm
Where: NSH 3002

Most approaches to automatically extracting structured information from the web
rely on surface text patterns. However, the manner in which such patterns are
defined, learned, and employed in the larger system varies with each case. In
this talk, I will outline the spectrum of previous work in this area and argue
for a linguistically-motivated definition, a hybrid heuristic/classifier-based
assessment, and a multi-purpose employment of textual patterns in the context of
Web Information Extraction (WIE). I will also give preliminary results from
adopting such an approach in our WIE system.

Wednesday, May 7, 2008

Grace, Hui Yang - Friday May 16th

Please join us for an upcoming talk.

Lunch will be provided by Yahoo!

Ontology Learning by Supervised Hierarchical Clustering

Who: Grace, Hui Yang
When: Friday, May 16th, 12:00pm
Where: NSH 3002

This work makes novel use of supervised clustering as the basic
framework to construct concept ontologies interactively or
automatically. Supervised hierarchical clustering is used to
organize ontology fragments, which are identified by techniques in
natural language processing and information retrieval, into
hierarchies. At each clustering iteration, a distance metric is
learned from the clustering given by either pseudo or real
feedback. K-medoids clustering with sampling is then used to group
the concepts at the higher level. A web-based cluster naming
algorithm is also presented. A user evaluation shows that the
system effectively saves human effort in the
interactive runs. Both automatic and interactive runs of the
experiments show that the approach is effective.

Friday, March 28, 2008

Nico Schlaefer - Friday, April 4, 12:00pm, NSH 3002

Please join us for an upcoming talk from Nico Schlaefer.

Lunch will be provided!

The Ephyra Question Answering System: Recent Results and Current Directions

Who: Nico Schlaefer
When: Friday, April 4, 12:00pm
Where: NSH 3002

This talk gives an overview of recent work on English question answering (QA) at CMU and our participation in last year’s TREC evaluation. QA is the task of retrieving accurate answers to natural language questions from a knowledge source such as the Web. The presentation includes a brief introduction to QA and the TREC competition, so prior knowledge of QA is helpful but not required.

The talk focuses on the challenges that an end-to-end QA system needs to address, and the architectural and algorithmic solutions implemented in Ephyra, our English QA system. Ephyra is a modular and extensible framework that facilitates the integration of different QA techniques. The system is organized as a pipeline of reusable standard components for question analysis, query generation, search, answer extraction, and answer selection. The most recent setup combines a syntactic pattern learning and matching approach with answer-type based extraction techniques and a semantic answer extractor that is based on semantic role labeling.

Recently we have released the Ephyra QA system as open source, making most of our code available to the research community. I will discuss why we took this step, and how you may benefit from our open source system - OpenEphyra - for your own research.

Wednesday, February 20, 2008

Monday, February 18, 2008

Jan Wiebe -- Subjectivity Analysis -- Friday, February 22nd 2008, 12:00 pm (noon)

Please join us for our first IR Series talk this spring!

Lunch will be provided by Yahoo!

Speaker: Jan Wiebe
Professor, Department of Computer Science
Director, Intelligent Systems Program
University of Pittsburgh

Date/Time: Friday, February 22nd, 12:00 pm (noon)

Location: 3002 Newell-Simon Hall (NSH)

Title: Subjectivity Analysis

Abstract: A growing area of research, "subjectivity analysis", is the computational study of affect, opinions, and sentiments expressed in text. Blogs, editorials, reviews (of products, movies, books, etc.), and even "objective" newspaper articles (which include many opinions and sentiments) are just some of the genres for which accurate identification and interpretation of opinions is critical for full text understanding. Subjectivity analysis will support developing tools for information analysts in governmental, commercial, and political domains who want to automatically track attitudes and feelings in the news and on-line forums. How do people feel about the latest iPod? Is there a change in the support for the new Medicare bill? A system able to automatically identify and extract opinions and sentiments from text would be an enormous help to someone sifting through the vast amounts of news and web data, trying to answer these kinds of questions. In this talk, I will first give an overview of our work in subjectivity analysis, and then will focus on experiments exploring interactions between subjectivity and word sense, showing that subjectivity is a property that can be associated with word meanings and that subjectivity classification can be beneficial for word sense disambiguation.

Bio: My research areas are artificial intelligence and natural language processing (NLP). My work with students and colleagues has been in discourse processing, pragmatics, word-sense disambiguation, and probabilistic classification in NLP. Our most recent work investigates automatically recognizing and interpreting expressions of opinions and sentiments in text, to support NLP applications such as question answering, information extraction, text categorization, and summarization.

New Home

Welcome to the new home of the CMU Information Retrieval Discussion Series. We're in the process of moving the old site here, so please be patient.

Tuesday, January 1, 2008

Past IR-Series Presentations

Friday, November 2, 2007 - 12:00-1:00 pm, Newell-Simon Hall (NSH) 3002
Title: CMU at TREC 2007
Speakers: Jonathan Elsas, Le Zhao and Yangbo Zhu (CMU)

Friday, October 5, 2007 - 12:00-1:00 pm, Newell-Simon Hall (NSH) 3002
Title: Estimating and Exploiting Uncertainty in Pseudo-Relevance Feedback
Speaker: Kevyn Collins-Thompson (CMU)

Friday, July 13, 2007 - 12:00-1:00 pm, Newell-Simon Hall (NSH) 3002
Title: Utility-based Information Distillation Over Temporally Sequenced Documents
Speaker: Yiming Yang (CMU)

Friday, May 18, 2007 - 12:00-1:00 pm, Newell-Simon Hall (NSH) 3002
Title: Collaborative Web Search - Exploiting User Activity for User Benefit
Speaker: Jill Freyne (University College Dublin)

Friday, January 19, 2007 - 12:00 NSH 3002
Title: Using Graphs and Random Walks to Discover Latent Similarities in Text
Speaker: Gunes Erkan

Friday, November 10, 2006 - 12:00 NSH 3002
Title: Personal Metasearch
Speaker: Paul Thomas

Friday, May 19, 2006 - 12:00 NSH 3002
Title: Collaborative Adaptive User Profile with Implicit and Explicit User Feedback
Speaker: Yi Zhang

Wednesday, April 19, 2006 - 12:00, NSH 3002
Title: Deriving Marketing Intelligence from Online Discussion
Speakers: Matthew Hurst and Natalie Glance

Wednesday, April 5, 2006 - 12:00, NSH 3002
Title: A Graphical Framework for Contextual Search and Name Disambiguation in Email
Speaker: Einat Minkov

Wednesday, March 8, 2006 - 12:00, NSH 3002
Title: Structured and Dynamic Topic Models
Speaker: John Lafferty

Wednesday, February 22, 2006 - 12:00, NSH 3002
Title: Automatically Labeling Hierarchical Clusters
Speaker: Pucktada (Puck) Treeratpituk

Friday, June 3, 2005 - 3:30, WeH 5409
Title: PageRank without Hyperlinks: Structural Re-ranking using Links Induced by Language Models
Speaker: Oren Kurland

Wednesday, April 27, 2005 - 4:30, WeH 4601
Title: Dynamic Construction of Content-Based Topologies in Hierarchical Peer-to-Peer Networks
Speaker: Jie Lu

Wednesday, March 16, 2005 - 4:30, WeH 4601
Title: Modeling Search Engine Effectiveness for Federated Search
Speaker: Luo Si

Wednesday, March 2, 2005 - 4:30, WeH 4623
Title: What is the matter? Explorations in text categorization
Speaker: Lillian Lee

Wednesday, January 19th, 2005 - 4:30, WeH 4601
Title: Detecting Action-Items in E-mail
Speaker: Paul N. Bennett

Wednesday, December 1, 2004 - 3:00, WeH 4625
Title: Probabilistic Models of Text and Images
Speaker: David Blei

Wednesday, November 17, 2004 - 3:00, WeH 4625
Title: Merging Rank Lists from Multiple Sources in Video Classification
Speaker: Wei-Hao Lin

Wednesday, November 10, 2004 - 3:00, WeH 4625
Title: Associating Names with Persons in Broadcast News Video
Speaker: Jun Yang

Wednesday, October 20, 2004 - 3:00, WeH 4625
Title: Graph Mining
Speaker: Christos Faloutsos

Wednesday, October 6, 2004 - 3:00, WeH 4625
Topic: Review of the SIGIR 2004 Best Paper, “A Formal Study of Information Retrieval Heuristics” by Hui Fang, Tao Tao, and ChengXiang Zhai
Speaker: Kevyn Collins-Thompson

Friday, October 1, 2004 - 1:30, NSH 4513
Title: Combining Language Modeling Approach with String-matching in Near-Duplicate Detection in E-Rulemaking
Speaker: Puck Treeratpituk

Wednesday, September 22, 2004 - 2:30, NSH 4632
Learning to Summarize Interviews for Project Reports
Nikesh Garera

Thursday, August 26, 2004 - 3:30, WeH 4625
Analyzing Time Series Gene Expression Data
Jason Ernst

Tuesday, August 17, 2004 - 2:00, WeH 4625
Learning Table Extraction from Examples
Ashwin Tengli

Thursday, August 12, 2004 - 3:30, WeH 4625
Learning to Classify Email into "Speech Acts"
Vitor Carvalho

Thursday, July 8, 2004 - 3:30, WeH 4625
Resource Selection for Domain-Specific Cross-Lingual IR
Monica Rogati

Tuesday, January 22, 2004 - 12:00, NSH 4513
Dynamic Recommender System on User Taste Tendency Model
Soojung Lee

Thursday, December 4, 2003 - 12:00, NSH 4513
The Robustness of Content-Based Search in Hierarchical Peer to Peer Networks
M. Elena Renda

Thursday, October 30, 2003 - 12:00, NSH 4632
Boosting Support Vector Machines for Text Classification through Parameter-free Threshold Relaxation
Dr. James G. Shanahan

Thursday, October 23, 2003 - 12:00, NSH 4632
Content-Based Retrieval in Hybrid Peer-to-Peer Networks
Jie Lu

Thursday, October 16, 2003 - 12:00, NSH 4632
The Utility of Question Analysis in an Open-Domain Question Answering System
Yifen Huang

Thursday, August 28, 2003 - 3:30, NSH 4632
Searching Peer-to-Peer Networks
Dr. Bin Yu

Thursday, August 14, 2003 - 3:30, NSH 3001
Flexible Mixture Model for Collaborative Filtering
Luo Si

Modified Logistic Regression: An Approximation to SVM and its Applications in Large-Scale Text Categorization
Jian Zhang

Thursday, June 19, 2003 - 3:30, NSH 3001
Improving Text Classifier Probability Estimates
Paul Bennett

Thursday, June 5, 2003 - 3:30, NSH 3002
Radio Station Playlist Generation
Andrew P. Widdowson

Thursday, May 22, 2003 - 3:30, NSH 3001
Discussion on Secondary Structure Prediction for Protein Sequences
Yan Liu

Thursday, May 8, 2003 - 3:30, NSH 3001
Negative Pseudo Relevance Feedback for Multimedia Retrieval
Rong Yan

Thursday, April 10, 2003 - 3:30, NSH 3001
Web Image Retrieval Re-Ranking with Relevance Model
Wei-Hao Lin

Thursday, March 27, 2003 - 3:30, NSH 3001
Clustering Genes
Fan Li

Thursday, March 13, 2003 - 3:30, NSH 3001
Exploration and Exploitation in Adaptive Filtering Based on Bayesian Active Learning
Yi Zhang

Thursday, February 27, 2003 - 3:30, NSH 3001
Beyond Independent Topical Relevance: Evaluation Metrics and Methods for Aspect Retrieval
Dr. William Cohen

Thursday, February 13, 2003 - 3:30, NSH 3001
Overview of Database Selection Methods
Luo Si

Tuesday, January 14, 2003 - 11:00-12:30, Wean 4632
Topics and Techniques in (Structured) Document Retrieval
Paul Ogilvie

Instructions for Presenters

As the discussion series has progressed, our goals for the presentations have evolved. In addition to generating discussion among researchers, we have come to recognize the value of the series as a means to refine our presentation skills. In light of this, here are some guidelines for creating your presentation:

  • When preparing your presentation, view this as a normal conference talk and prepare accordingly.
  • Please prepare a short (20-30 minute) talk or a long (45 minute) talk according to the time slot the organizer has reserved.
  • You should assume that the audience is knowledgeable in IR and many of the techniques commonly used in the field. Unless the purpose of your talk is a general overview of a research problem, you should assume that the related research can be covered very briefly (one or two slides).
  • Focus on presenting your thoughts, issues, and contributions to the problem at hand.
  • If you have extra material that won't fit in the talk, prepare slides for it, as we will very likely want to hear more about the subject after the main talk is over.
  • Please don't be afraid to present work in progress. Even with the change of presentation format to conference talk style, we are still driven by our original goals of learning about current research and fostering collaboration on work in progress.

Thanks are due to Yiming Yang for some helpful suggestions.

About the Series

The CMU-IR Discussion Series is a student-run initiative to encourage discussion and foster research between the multiple Information Retrieval research groups here at Carnegie Mellon. We intend the informal presentations to be a vehicle for:
  • learning about each other's research,
  • discussing the big (and little) problems of a research area, and
  • fostering collaboration across groups.
The presentations in this series are of an informal nature. Interruptions, questions, and discussions on tangents are encouraged and welcomed. We wish this to be a low-stress, relaxed opportunity for our presenters to discuss topics of interest with us. Our interpretation of the meaning of "information retrieval" is fairly broad; we welcome discussions of related fields (Machine Learning, Agents, Statistics, and so on). While most of the presentations are from students researching Information Retrieval here at CMU, we also welcome discussions from students and faculty in other programs and institutions as well as industry professionals.

One of the goals of the series is to strike a nice balance between area overview presentations and technical presentations on specific approaches. Because many of us work on quite different areas of Information Retrieval, we often find it beneficial to have discussions that focus on the important problems in our respective research areas and the techniques that have been found to be broadly useful (and occasionally the spectacular failures). In order to keep grounded, we also have some technical discussions on specific techniques and approaches.

In an effort to make this series a valuable resource for others, we plan on posting the authors' slides (with permission). We also ask that authors provide a short reading list of articles (preferably online) for people who want to learn about the topics in more depth.

For LTI students: a presentation in the IR Discussion Series can fulfill your annual LTI talk requirement. If you wish to do this, let us know more than a week in advance so that we can advertise the talk according to policy. You will still be required to make sure two faculty are present and that they fill out the form after your presentation.