Scalable Methods in Machine Learning
Daniel Boley
Many large scale machine learning problems are formulated as optimization problems, in which some measure of error or loss is to be minimized over a suitable training corpus. Real problems have too many data points to fit in a single computer. Hence the data and/or computation must be distributed over a network of computers. Often the only practical methods for extremely large problems are so-called splitting methods, but their convergence properties are extremely variable: sometimes very fast, sometimes very slow, in ways that can be hard to predict. The goal of this project is to gain a better understanding of the convergence behavior and to use this understanding to construct accelerated algorithms with more consistent convergence properties. This will allow the application of machine learning techniques to a much wider class of problems.
Click here for more information on this project.
Unsupervised Document Set Exploration Using Divisive Partitioning
Daniel Boley
Click here for more information on this project.
This is part of a larger project entitled "WebACE: Web Agent for Automated
Categorization and Exploration,"
with
Daniel Boley, Maria Gini, Vipin Kumar, Bamshad Mobasher, and others.
The challenge addressed in this project is the unsupervised exploration
and categorization of large unstructured data sets using a recently
developed algorithm based on divisive partitioning. Initial experiments
on World Wide Web document sets have demonstrated that this algorithm,
called Principal Direction Divisive Partitioning, is effective at
extraction of meaningful clusters while being scalable to large document
sets. The investigations within this project will include:
(i) a thorough study of its scalability properties and comparison
with other methods,
(ii) utilization of the hierarchical structures already generated within
the algorithm to create taxonomies on the document set and to classify
new documents autonomously,
(iii) adaptation and utilization as a supervised learning method,
(iv) extension to other applications and domains, including a study
of the algorithm's effectiveness in domains beyond document sets.
Potential applications and domains to be explored in this project include
(a) incorporation within autonomous agents for exploring the World
Wide Web, such as WebACE,
(b) classification of documents in large but more homogeneous medical
or legal databases,
(c) discovery of patterns within chemical toxicology data,
(d) extraction of clusters within genetic sequence databases. The goal is
to establish the divisive partitioning method as an effective alternative
among existing unsupervised clustering tools for use on document sets
and other domains.
Numerical Algorithms for Very Large, Sparse Dynamical Systems
Daniel Boley, Gary Balas, Youcef Saad
We develop and experiment with a variety of core numerical algorithms for large sparse or structured linear algebra problems. Part of the research is the development of effective algorithms for the analysis and design of very large control models such as those related to large flexible space structures. To accomplish this, a major focus of this research will be the design of effective and efficient linear algebra algorithms. Examples of problems we address are large Lyapunov equations, large order matrix pencils, large eigenvalue and singular values solvers. Parallel algorithms will be explored and exploited. Approximate solutions obtained via Krylov space methods, or other methods, will be analyzed. A standard sparse matrix library (SPARSKIT) will be used as the foundation for the software development. To evaluate the effectiveness of the algorithms, the quality of controller designs obtained from the approximate solutions will be explored.
Vehicle Navigation and Localization using Multiple Navigation Aids
Daniel Boley and Nikolaos Papanikolopoulos
This research aims to test the feasibility of combining two or more navigation methods in a simple manner to achieve higher accuracy and robustness than is possible with any single system. The aim is to do it with algorithms and computer processing that is fast enough on processors simple enough to consider installing in a vehicle. The methods combine fixes from the different aids, update a least-squares fit to the fixes, to obtain an improved fix and an estimate of the error in the fix. We will demonstrate a new paradigm that can deliver locations estimates of higher overall accuracy using fewer readings than the Kalman filter, and is capable of combining fixes from several different sources and paradigms, even if some sources yield fixes only sporadically. We use a total least squares estimate that can treat errors from dead reckoning with the same importance as errors from GPS or visual tracking. The method will also be able to combine partial fixes from each system to obtain a full fix, in cases where one or more sensors have a intermittent or temporary failure, or the environment changes.
Backward Error Assertions
We develop an assertion scheme based on the backward error analysis for error detection in algorithms that solve dense systems of linear equations, A x = b. Unlike previous methods, this Backward Error Assertion Model is specifically designed to operate in an environment of floating point arithmetic subject to round-off errors, and can be easily instrumented in a Watchdog processor environment. The complexity of verifying assertions is order n squared compared to the order n cubed complexity of algorithms solving A x = b. Unlike other proposed error detection methods, this assertion model does not require any encoding of the matrix A. Experimental results under various error models are presented to validate the effectiveness of this assertion scheme.
Genetic Algorithms for Robotics Applications
We use a Genetic Programming paradigm to have a robot learn to carry out simple navigation tasks in the presence of unknown obstacles. We have built a package to allow the use to design different room layouts and evaluate the performance of the generated algorithms under different conditions. A simple graphical interface makes it easy to design different layouts and operating conditions.