AutoBib Home Page


Introduction

Welcome to the homepage of the AutoBib Project! In this project, we propose and implement a framework for automatically extracting and integrating bibliographic information on the Web using Hidden Markov Models. Here you will find code and documentation related to the project, and you can also browse the experimental bibliographic data and check its quality.

The Web has greatly improved the availability of scientific bibliographic information and the ease of retrieving it. Unfortunately, because data on the Web is mainly intended to be browsed by humans, this information is not easily manipulated by computers. Thus, the problem of automatically extracting specific pieces of data from given sources, referred to as Information Extraction, remains an important and challenging task. In this work we focus on the automatic extraction of bibliographic information on the Web. Unlike the conventional rule-learning approach, we use a powerful statistical tool, Hidden Markov Models (HMMs), to parse unknown bibliographic records. The advantage of our approach is that the system can operate completely autonomously once the initial structured record database is given, and thus requires minimal maintenance.

Specifically, our approach consists of three steps. First, we extract raw records from bibliographic Web sites using the state-of-the-art "record-boundary discovery" technique. Second, we automatically generate high-quality training data for the HMMs using common-sense heuristics (a maximum hit rate of 95.8% when tokens that are impossible to label are excluded). Finally, we demonstrate that a specialized HMM with multiple delimiter and tag states can parse unknown bibliographic records with success rates of 98.9% and 93.4% on two different sites, using only about 115 training entries per site.
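To give a feel for the last step, here is a minimal Python sketch of how an HMM can label the tokens of a bibliographic record. It is not the AutoBib implementation: the state names (author, title, delim, year), the token classes, the two training records, and the add-one smoothing are all assumptions made for illustration, and the sketch uses a single delimiter state and no tag states, unlike the specialized model described above. It estimates the model by counting over labeled records and then decodes a new record with the standard Viterbi algorithm.

```python
import math
from collections import defaultdict

# Hypothetical states and observation symbols (not the project's actual model).
STATES = ["author", "title", "delim", "year"]
SYMBOLS = ["INITIAL", "CAPWORD", "PUNCT", "YEAR4", "WORD"]

# Hypothetical labeled training records: lists of (token, state) pairs.
TRAINING = [
    [("J.", "author"), ("Smith", "author"), (",", "delim"),
     ("Parsing", "title"), ("Records", "title"), (",", "delim"),
     ("2001", "year")],
    [("A.", "author"), ("Lee", "author"), (",", "delim"),
     ("Hidden", "title"), ("Markov", "title"), ("Models", "title"),
     (",", "delim"), ("1999", "year")],
]

def token_class(token):
    """Map a raw token to a coarse observation symbol (assumed feature scheme)."""
    if token.isdigit() and len(token) == 4:
        return "YEAR4"
    if len(token) == 2 and token[0].isupper() and token[1] == ".":
        return "INITIAL"
    if not any(c.isalnum() for c in token):
        return "PUNCT"
    return "CAPWORD" if token[0].isupper() else "WORD"

def train(records):
    """Estimate log start/transition/emission probabilities with add-one smoothing."""
    start, trans, emit = defaultdict(int), defaultdict(int), defaultdict(int)
    for record in records:
        prev = None
        for token, state in record:
            if prev is None:
                start[state] += 1
            else:
                trans[(prev, state)] += 1
            emit[(state, token_class(token))] += 1
            prev = state
    n, m = len(STATES), len(SYMBOLS)
    start_total = sum(start[s] for s in STATES)
    log_start = {s: math.log((start[s] + 1) / (start_total + n)) for s in STATES}
    log_trans, log_emit = {}, {}
    for s in STATES:
        out_total = sum(trans[(s, t)] for t in STATES)
        emit_total = sum(emit[(s, o)] for o in SYMBOLS)
        for t in STATES:
            log_trans[(s, t)] = math.log((trans[(s, t)] + 1) / (out_total + n))
        for o in SYMBOLS:
            log_emit[(s, o)] = math.log((emit[(s, o)] + 1) / (emit_total + m))
    return log_start, log_trans, log_emit

def viterbi(tokens, model):
    """Return the most likely state sequence for a non-empty token list."""
    log_start, log_trans, log_emit = model
    obs = [token_class(t) for t in tokens]
    score = {s: log_start[s] + log_emit[(s, obs[0])] for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_trans[(p, s)])
            new_score[s] = score[best_prev] + log_trans[(best_prev, s)] + log_emit[(s, o)]
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))

if __name__ == "__main__":
    tokens = "B. Jones , Learning Structure , 2002".split()
    for token, label in zip(tokens, viterbi(tokens, train(TRAINING))):
        print(f"{token}\t{label}")
```

In this toy setup, author names and title words both map to the same CAPWORD symbol, so it is the learned transitions around the delimiter state that separate the two fields. This is one plausible reason why delimiter states are useful in a model of this kind.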

The results suggest that a well-constructed HMM, supported by the necessary software components, can be an effective tool for automatic information extraction.

This project was carried out in the Computer Science Department at Duke University, under the supervision of Prof. Jun Yang.

Organization

This Web site is organized as follows:

  • Home: The home page for the AutoBib project.
  • Data: Browse the experimental data and information extraction process.
  • Code & Docs: System overview, code, and documentation.
  • Related Links: Useful links related to the project.

THE AUTOBIB PROJECT 
Duke University, 
Durham, NC 27708, USA 
TEL: 919-660-6587
URL: http://www.cs.duke.edu/~geng/autobib/
Last Modified: Jan 2003
Email  Mr. Junfei Geng
Email  Prof. Jun Yang