Unsupervised Document Set Exploration Using Divisive Partitioning

Daniel Boley
Department of Computer Science and Engineering
University of Minnesota

Contact Information

4-192 EE/CS, 200 Union St. SE,
Minneapolis, MN 55455
Phone: (612) 625-3887
Fax: (612) 625-0572
Email: lastname at cs.umn.edu
Project WWW Page: "http://www.cs.umn.edu/~boley/PDDP.html"
Personal Home Page: "http://www.cs.umn.edu/~boley"

Keywords

Unsupervised clustering, data mining, classification, hierarchical taxomonies, Web agents.

Project References and Selected Publications

Streaming Data Reduction Using Low-Memory Factored Representations, by David Littau and Daniel Boley: Journal of Information Sciences, Special issue on Mining Streaming Data (to appear), 2004.
A comparative analysis on the bisecting K-means and the PDDP clustering algorithms, by Sergio M. Savaresi and Daniel Boley: Intelligent Data Analysis (to appear), 2004.
A Parallel Formulation of the Spatial Auto-Regression Model for Mining Large Geo-Spatial Datasets, by Baris M. Kazar, Shashi Shekhar, David J. Lilja, and Daniel Boley, MSI report UMSI 2004/33; SIAM 04 Datamining Conf. International Workshop on High Performance and Distributed Mining, 2004.
Online Motion Classification using Support Vector Machine, by D. Cao, O. Masoud, D. Boley, N. Papanikolopoulos: IEEE 2004 International Conference on Robotics and Automation, 2004.
Training Support Vector Machine using Adaptive Clustering, by D. Boley and D Cao: 2004 SIAM International Conference on Data Mining, April 22 - April 24, Lake Buena Vista, FL, USA, 2004.
Using Low-Memory Representations to Cluster Very Large Data Sets, by D. Littau and D. Boley: SDM'03 Third SIAM Conference on Data Mining, May 2003.
Transcriptional response of Pasteurella multocida to defined iron sources, by M.L. Paustian, B.J. May, D. Cao, D. Boley, V. Kapur: J. Bacteriol. 184(23):6714-6720, Dec 2002.
Incremental PDDP for the Clustering of Large Data Sets, by D. Littau and D. L. Boley: TR-01-016, 2001.
On the performance of bisecting K-means and PDDP, by D. Boley and S. Savaresi: SIAM Data Mining 2001.
A Scalable Hierarchical Algorithm for Unsupervised Clustering by D. Boley, in Data Mining for Scientific and Engineering Applications (R. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, R. Namburu, eds), p 383-400, Kluwer, 2001.
Choosing the cluster to split in bisecting divisive clustering algorithms, by S. Savaresi and D. Boley and S. Bittanti and G. Gazzaniga: TR-00-053, 2000.
Error Analysis of Automatic Speech Recognition Using Principal Direction Divisive Partitioning, by D. Mckoskey and D. Boley: in European Conference on Machine Learning (R. Lopez de Mantanaras and E. Plaza, ed.), p.263-270, Lecture Notes in Artificial Intelligence, 1810, Springer Verlag, 2000.
Principal Direction Partitioning in Data Mining, by D. Boley, slides of talk given at Stanford, February, 2000.
Unsupervised Clustering: A Fast Scalable Method for Large Datasets, by D. Boley and V. Borst, U of Mn CSE Report TR-99-029. An informal introduction to the methods.
D.Unsupervised Updating of a Classification Tree in a Dynamic Environment, by D. Boley and V. Borst, in Autonomous Agents'99 Conference, 1999, p.390-391.
Partitioning-Based Clustering for Web Document Categorization, D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore, Decision Support Systems, 1999, to appear.
Document Categorization and Query Generation on the World Wide Web Using WebACE, D. Boley, M. Gini, R. Gross, E.-H. (Sam) Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore, AI Review, 13:365-391, 1999.
Principal Direction Divisive Partitioning, D. L. Boley, Data Mining and Knowledge Discovery 2(4):325-344, 1998.
Raw text document data used in the last three reports above.
Selected Reprints by Boley annotated with author's own unsupervised clustering method.

Project Software

Principal Direction Divisive Partitioning Software (Version 2d, updated as of 24 Apr 2008, but now as a ZIP file.)

Project Summary

This project is devoted to the research and development of a hierarchical divisive clustering algorithm. The basic underlying activity is the basic research needed to fully develop the divisive partitioning method and the applied research needed to demonstrate its use on specific applications and to make it easy for humans to use. The conceptual framework for the basic divisive partitioning method has been developed, and an efficient prototype implementation has been installed locally, on which all experimental results have been based. The research issues to be addressed in this project include completion of the theoretical foundations of the algorithm, experiments on larger datasets to measure the scalability of the methods, experiments on a datasets from a variety of sources as a way to gain understanding about the theory behind the method, extensions to the algorithm capabilities, extensions to other applications beyond text document clustering, and validation of algorithm results. Technology transfer forms an additional part of this project, and will be addressed by applying the methods developed as part of the project to other application datasets (e.g. genome, toxicology, medical literature datasets), as well as extending it to other paradigms such as hierarchical taxonomies, document rating aids and collaborative filtering. By incorporating it into a Web agent, the methods will be made available to a wide audience. All application datasets have either been collected by colleagues and will be shared as part of ongoing projects, or are available over the Web.

Goals, Objectives And Targeted Activities

The project goals are to develop the Principal Direction Divisive Partitioning (PDDP) Algorithm and demonstrate its effectiveness in a variety of applications.

The specific goals can be grouped into three main categories:
(1) demonstrate and test the applicability and effectiveness of the PDDP algorithm to new application areas, including very large data sets;
(2) extend the capabilities of the algorithm to include features for automatically determining the number of clusters or automatically producing labels to help users navigate through large datasets;
(3) develop user interfaces to allow easy access to the functionality of the algorithm, allowing users to apply it to a variety of datasets in a straightforward way.

The research plan included the following tasks:
(a) Demonstrate the scalability compared to older methods,
(b) Demonstrate the effectiveness in terms of quality of resulting clusters,
(c) Incorporate an autonomous stopping test,
(d) Develop methods for producing hierarchical taxonomic labels,
(e) Develop a user interface based on a Web agent.

Indication Of Success

Success is measured in the progress made within each task in the research plan. This project has been in progress for only 6 months during which we have actively pursued all 5 tasks mentioned above. In this period we have accomplished the following:
(i) We have developed the conceptual framework necessary to incorporate essential new features of the methods. The features are necessary to make the method a practical and useful method in the domain of Web exploration and include an automatic stopping test and a dynamic incremental updating feature [Ref. 1].
(ii) We have completed the design of a client-side Web agent incorporating this algorithm as a tool to automatically [i.e. without user participation] organize a large collection of Web pages visited by a user using a normal Web browser. The feasibility and practicality of the design has been demonstrated [Ref. 3], but robustness issues still need to be addressed.

Discoveries at and across the frontier of science and engineering.

We have developed a new paradigm for partitioning large datasets and the conceptual framework to incorporate new features. This framework has been made possible only by a synergy of a variety of different scientific disciplines, with the result that methods have new capabilities not possible by using methods from only one scientific discipline.

Connections between discoveries and their use in service to society.
The methods begin investigated here are of immediate and practical value to the analysis of large data sets from different applications: text documents, texture images, chemical toxicity parameters, etc. The automated analysis of large bodies of data from a variety of sources is a critical issue in the age of digital technology, and automated and unsupervised analysis tools are essential in this effort.

A diverse, globally-oriented workforce of scientists and engineers.
Even at the early stages of this project, we have promoted the education and incorporation of women in the workforce of scientists and engineers. The interdisciplinary nature of the project also encourages exposure and learning of concepts across many scientific disciplines, leading to a more well-rounded preparation to be members of the scientific workforce.