Unsupervised Document Set Exploration Using Divisive Partitioning
Daniel Boley
Department of Computer Science and Engineering
University of Minnesota
Contact Information
4-192 EE/CS, 200 Union St. SE,
Minneapolis, MN 55455
Phone: (612) 625-3887
Fax: (612) 625-0572
Email: lastname at cs.umn.edu
Project WWW Page: "http://www.cs.umn.edu/~boley/PDDP.html"
Personal Home Page: "http://www.cs.umn.edu/~boley"
Sponsors
National Science Foundation under grant: IIS-9811229, IIS-0208621.
Keywords
Unsupervised clustering, data mining, classification, hierarchical
taxomonies, Web agents.
Project References and Selected Publications
-
Streaming Data Reduction Using Low-Memory Factored Representations, by David Littau and Daniel Boley: Journal of Information Sciences, Special issue on Mining Streaming Data (to appear), 2004.
-
A comparative analysis on the bisecting K-means and the PDDP clustering algorithms, by Sergio M. Savaresi and Daniel Boley: Intelligent Data Analysis (to appear), 2004.
-
A Parallel Formulation of the Spatial Auto-Regression Model for Mining Large Geo-Spatial Datasets, by Baris M. Kazar, Shashi Shekhar, David J. Lilja, and Daniel Boley, MSI report UMSI 2004/33; SIAM 04 Datamining Conf. International Workshop on High Performance and Distributed Mining, 2004.
-
Online Motion Classification using Support Vector Machine, by D. Cao, O. Masoud, D. Boley, N. Papanikolopoulos: IEEE 2004 International Conference on Robotics and Automation, 2004.
-
Training Support Vector Machine using Adaptive Clustering, by D. Boley and D Cao: 2004 SIAM International Conference on Data Mining, April 22 - April 24, Lake Buena Vista, FL, USA, 2004.
-
Using Low-Memory Representations to Cluster Very Large Data Sets, by D. Littau and D. Boley: SDM'03 Third SIAM Conference on Data Mining, May 2003.
-
Transcriptional response of Pasteurella multocida to defined iron sources, by M.L. Paustian, B.J. May, D. Cao, D. Boley, V. Kapur: J. Bacteriol. 184(23):6714-6720, Dec 2002.
-
Incremental PDDP for the Clustering of Large Data Sets, by D. Littau and D. L. Boley: TR-01-016, 2001.
-
On the performance of bisecting K-means and PDDP, by D. Boley and S. Savaresi: SIAM Data Mining 2001.
-
A Scalable Hierarchical Algorithm for Unsupervised Clustering by D. Boley,
in Data Mining for Scientific and Engineering Applications (R. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, R. Namburu, eds), p 383-400, Kluwer, 2001.
-
Choosing the cluster to split in bisecting divisive clustering algorithms, by S. Savaresi and D. Boley and S. Bittanti and G. Gazzaniga: TR-00-053, 2000.
-
Error Analysis of Automatic Speech Recognition Using Principal Direction Divisive Partitioning, by D. Mckoskey and D. Boley: in European Conference on Machine Learning (R. Lopez de Mantanaras and E. Plaza, ed.), p.263-270, Lecture Notes in Artificial Intelligence, 1810, Springer Verlag, 2000.
-
Principal Direction Partitioning in Data Mining, by D. Boley,
slides of talk given at Stanford, February, 2000.
-
Unsupervised Clustering: A Fast Scalable
Method for Large Datasets, by D. Boley and V. Borst,
U of Mn CSE Report TR-99-029. An informal introduction to the
methods.
-
D.Unsupervised Updating of a Classification Tree
in a Dynamic Environment, by D. Boley and V. Borst, in Autonomous Agents'99 Conference, 1999, p.390-391.
-
Partitioning-Based Clustering for Web Document Categorization,
D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore,
Decision Support Systems, 1999, to appear.
-
Document Categorization and Query Generation on the World Wide Web Using WebACE,
D. Boley, M. Gini, R. Gross, E.-H. (Sam) Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore,
AI Review, 13:365-391, 1999.
-
Principal Direction Divisive Partitioning,
D. L. Boley,
Data Mining and Knowledge Discovery 2(4):325-344, 1998.
-
Raw text document data used in the last three reports above.
-
Selected Reprints by Boley annotated with author's own unsupervised clustering method.
Project Software
Project Summary
This project is devoted to the research and development of a hierarchical
divisive clustering algorithm. The basic underlying activity is the basic
research needed to fully develop the divisive partitioning method and the
applied research needed to demonstrate its use on specific applications and
to make it easy for humans to use. The conceptual framework for the basic
divisive partitioning method has been developed, and an efficient prototype
implementation has been installed locally, on which all experimental results
have been based.
The research issues to be addressed
in this project include completion of the theoretical foundations of the
algorithm, experiments on larger datasets to measure the
scalability of the methods, experiments on a datasets from a variety
of sources as a way to gain understanding about the theory behind the
method, extensions to the algorithm capabilities, extensions to other
applications beyond text document clustering, and validation of algorithm
results. Technology transfer forms an additional part of this project,
and will be addressed by applying the methods developed as part of the
project to other application datasets (e.g. genome, toxicology, medical
literature datasets), as well as extending it to other paradigms such as
hierarchical taxonomies, document rating aids and collaborative filtering.
By incorporating it into a Web agent, the methods will be made available
to a wide audience. All application datasets have either been collected by
colleagues and will be shared as part of ongoing projects, or are available
over the Web.
Goals, Objectives And Targeted Activities
The project goals are to develop the Principal
Direction Divisive Partitioning (PDDP) Algorithm and demonstrate its effectiveness in a
variety of applications.
The specific goals can be grouped into three main
categories:
(1) demonstrate and test the applicability and effectiveness
of the PDDP algorithm
to new application areas, including very large data sets;
(2) extend the capabilities of the algorithm to include features for
automatically determining the number of clusters or automatically producing
labels to help users navigate through large datasets;
(3) develop user interfaces to allow easy access to the functionality of the
algorithm, allowing users to apply it to a variety of datasets in a
straightforward way.
The research plan
included the following tasks:
(a) Demonstrate the scalability compared to older methods,
(b) Demonstrate the effectiveness in terms of quality of resulting clusters,
(c) Incorporate an autonomous stopping test,
(d) Develop methods for producing hierarchical taxonomic labels,
(e) Develop a user interface based on a Web agent.
Indication Of Success
Success is measured in the progress made within each task in the research
plan. This project has been in progress for only 6 months during which we
have actively pursued all 5 tasks mentioned above.
In this period we have accomplished the following:
(i) We have developed the conceptual framework necessary to
incorporate essential new features of the methods. The features are
necessary to make the method a practical and useful method in the domain of
Web exploration and include an automatic stopping test and a dynamic
incremental updating feature [Ref. 1].
(ii) We have completed the design of a client-side Web agent incorporating
this algorithm as a tool to automatically [i.e. without user participation]
organize a large collection of Web
pages visited by a user using a normal Web browser.
The feasibility and practicality of the design has been demonstrated
[Ref. 3], but robustness issues still need to be addressed.
Discoveries at and across the frontier of science and engineering.
We have developed a new paradigm for partitioning large datasets
and the conceptual framework to incorporate new features.
This framework has been made possible only by a synergy of a variety of
different scientific disciplines, with the result that methods have new
capabilities not possible by using methods from only one scientific
discipline.
Connections between discoveries and their use in service to society.
The methods begin investigated here are of immediate and practical value to
the analysis of large data sets from different applications: text documents,
texture images, chemical toxicity parameters, etc. The automated analysis
of large bodies of data from a variety of sources is a critical issue in the
age of digital technology, and automated and unsupervised analysis tools are
essential in this effort.
A diverse, globally-oriented workforce of scientists and engineers.
Even at the early stages of this project, we have promoted the education and
incorporation of women in the workforce of scientists and engineers.
The interdisciplinary nature of the project also encourages exposure and
learning of concepts across many scientific disciplines, leading to a more
well-rounded preparation to be members of the scientific workforce.