Unsupervised Document Set Exploration Using Divisive Partitioning

Daniel Boley
Department of Computer Science and Engineering
University of Minnesota

Contact Information

4-192 EE/CS, 200 Union St. SE,
Minneapolis, MN 55455
Phone: (612) 625-3887
Fax: (612) 625-0572
Email: lastname at cs.umn.edu
Project WWW Page: "http://www.cs.umn.edu/~boley/PDDP.html"
Personal Home Page: "http://www.cs.umn.edu/~boley"

Sponsors

National Science Foundation under grant: IIS-9811229, IIS-0208621.

Keywords

Unsupervised clustering, data mining, classification, hierarchical taxomonies, Web agents.

Project References and Selected Publications


Project Software

Project Summary

This project is devoted to the research and development of a hierarchical divisive clustering algorithm. The basic underlying activity is the basic research needed to fully develop the divisive partitioning method and the applied research needed to demonstrate its use on specific applications and to make it easy for humans to use. The conceptual framework for the basic divisive partitioning method has been developed, and an efficient prototype implementation has been installed locally, on which all experimental results have been based. The research issues to be addressed in this project include completion of the theoretical foundations of the algorithm, experiments on larger datasets to measure the scalability of the methods, experiments on a datasets from a variety of sources as a way to gain understanding about the theory behind the method, extensions to the algorithm capabilities, extensions to other applications beyond text document clustering, and validation of algorithm results. Technology transfer forms an additional part of this project, and will be addressed by applying the methods developed as part of the project to other application datasets (e.g. genome, toxicology, medical literature datasets), as well as extending it to other paradigms such as hierarchical taxonomies, document rating aids and collaborative filtering. By incorporating it into a Web agent, the methods will be made available to a wide audience. All application datasets have either been collected by colleagues and will be shared as part of ongoing projects, or are available over the Web.

Goals, Objectives And Targeted Activities

The project goals are to develop the Principal Direction Divisive Partitioning (PDDP) Algorithm and demonstrate its effectiveness in a variety of applications.

The specific goals can be grouped into three main categories:
(1) demonstrate and test the applicability and effectiveness of the PDDP algorithm to new application areas, including very large data sets;
(2) extend the capabilities of the algorithm to include features for automatically determining the number of clusters or automatically producing labels to help users navigate through large datasets;
(3) develop user interfaces to allow easy access to the functionality of the algorithm, allowing users to apply it to a variety of datasets in a straightforward way.

The research plan included the following tasks:
(a) Demonstrate the scalability compared to older methods,
(b) Demonstrate the effectiveness in terms of quality of resulting clusters,
(c) Incorporate an autonomous stopping test,
(d) Develop methods for producing hierarchical taxonomic labels,
(e) Develop a user interface based on a Web agent.

Indication Of Success

Success is measured in the progress made within each task in the research plan. This project has been in progress for only 6 months during which we have actively pursued all 5 tasks mentioned above. In this period we have accomplished the following:
(i) We have developed the conceptual framework necessary to incorporate essential new features of the methods. The features are necessary to make the method a practical and useful method in the domain of Web exploration and include an automatic stopping test and a dynamic incremental updating feature [Ref. 1].
(ii) We have completed the design of a client-side Web agent incorporating this algorithm as a tool to automatically [i.e. without user participation] organize a large collection of Web pages visited by a user using a normal Web browser. The feasibility and practicality of the design has been demonstrated [Ref. 3], but robustness issues still need to be addressed.
  • Discoveries at and across the frontier of science and engineering.

    We have developed a new paradigm for partitioning large datasets and the conceptual framework to incorporate new features. This framework has been made possible only by a synergy of a variety of different scientific disciplines, with the result that methods have new capabilities not possible by using methods from only one scientific discipline.

  • Connections between discoveries and their use in service to society.
    The methods begin investigated here are of immediate and practical value to the analysis of large data sets from different applications: text documents, texture images, chemical toxicity parameters, etc. The automated analysis of large bodies of data from a variety of sources is a critical issue in the age of digital technology, and automated and unsupervised analysis tools are essential in this effort.

  • A diverse, globally-oriented workforce of scientists and engineers.
    Even at the early stages of this project, we have promoted the education and incorporation of women in the workforce of scientists and engineers. The interdisciplinary nature of the project also encourages exposure and learning of concepts across many scientific disciplines, leading to a more well-rounded preparation to be members of the scientific workforce.