Object Mining and Extraction System

The web has become one of the primary ways that people, businesses, and organizations share information. However, the sheer volume of information available has made finding and reusing it much more difficult.

Search engines have tried to address the problem of finding information by indexing all of the pages on the web, but they have several shortcomings. First, they cannot keep pace with the expansion of the web. Second, they ignore most of the information available, because they only look at static pages. By some estimates, more than 90% of the information on the web is "hidden" and only available through forms. Finally, they return results only at the granularity of a whole page, not of the individual items within it.

The goal of Omini is to get at the data behind Web forms. Omini software automatically extracts content objects and ignores irrelevant parts of the page. One of the key design features of Omini is that it remains robust even as the web pages from which it extracts data evolve, which eliminates the need for a programmer to manually determine where objects are.
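
To make the idea of extracting content objects while ignoring the rest of the page more concrete, here is a minimal sketch in Python. It is not Omini's actual algorithm, which this page does not describe. It assumes a much simpler stand-in heuristic: the element with the most same-tag children is taken to be the object-rich region, and each of those children is treated as one object. The names Node, TreeBuilder, text_of and extract_objects, and the sample page, are all hypothetical and exist only for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    """One element of a simplified tag tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.content = []  # interleaved text strings and child Nodes

    @property
    def children(self):
        return [c for c in self.content if isinstance(c, Node)]


class TreeBuilder(HTMLParser):
    """Builds a tag tree; assumes reasonably well-formed HTML."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.content.append(node)
        if tag not in self.VOID:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.content.append(data.strip())


def walk(node):
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def text_of(node):
    """Concatenate the text under a node, in document order."""
    parts = [c if isinstance(c, str) else text_of(c) for c in node.content]
    return " ".join(p for p in parts if p)


def extract_objects(html):
    """Return the text of each repeated sibling in the densest region.

    Assumed heuristic (not Omini's): pick the element whose children
    repeat one tag most often; fewer than two repeats means no objects.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best_count, best_parent, best_tag = 0, None, None
    for node in walk(builder.root):
        counts = Counter(child.tag for child in node.children)
        if not counts:
            continue
        tag, count = counts.most_common(1)[0]
        if count > best_count:
            best_count, best_parent, best_tag = count, node, tag
    if best_parent is None or best_count < 2:
        return []
    return [text_of(c) for c in best_parent.children if c.tag == best_tag]


# Hypothetical result page: navigation and footer surround three records.
SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/help">Help</a></div>
<table>
<tr><td>Book A</td><td>$10</td></tr>
<tr><td>Book B</td><td>$12</td></tr>
<tr><td>Book C</td><td>$15</td></tr>
</table>
<p>Copyright notice</p>
</body></html>
"""

if __name__ == "__main__":
    for obj in extract_objects(SAMPLE_PAGE):
        print(obj)
```

On the sample page, the table rows are selected as the objects, while the navigation links and the copyright notice are left out. Because the sketch looks only at repetition in the tag tree and not at fixed positions or element names, cosmetic changes elsewhere on the page do not affect it, which is the kind of robustness described above. A production system would need considerably more careful heuristics.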

Omini's technology is useful in several different domains. We have already applied Omini as the foundation of XWRAPElite. XWRAPElite is an interactive online toolkit that generates wrappers which extract data from web sites and convert it into semantically relevant XML.
We are also in the process of constructing a search engine for dynamic web sites that is based on Omini. The search engine will be able to locate relevant data objects in web sites that are appropriate to the context of a search. This approach complements traditional search engines, such as HyperBee or Google, which index static web pages.


This material is based upon work partially supported by the National Science Foundation under Grant No. 9988452. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

Last Update Dec 2002
