Data Mining for Rare Class Analysis

Project Award Number: 0308264

Principal Investigator
Vipin Kumar
Computer Science Department, University of Minnesota
200 Union Street SE, 4-192, Minneapolis, MN 55455, USA
612-626-8074, 612-626-1596
kumar@cs.umn.edu
www.cs.umn.edu/~kumar

Co-PI
Jaideep Srivastava
Computer Science Department, University of Minnesota
200 Union Street SE, 4-192, Minneapolis, MN 55455, USA
612-625-4012, 612-626-1596
srivasta@cs.umn.edu
www.cs.umn.edu/faculty/srivasta.html

Collaborator
Aleksandar Lazarevic
Computer Science Department, University of Minnesota
200 Union Street SE, 4-192, Minneapolis, MN 55455, USA
612-626-8096, 612-626-1596
aleks@cs.umn.edu
www.cs.umn.edu/~aleks

Keywords

rare class analysis

data mining

predictive models

feature construction

data streams

Project Summary

“Rare events” are events that occur very infrequently and are therefore difficult to detect. When they do occur, however, their consequences can be dramatic, and often negative. Examples include network intrusions and security breaches, cardiac events, credit card and other financial fraud, telecom circuit overloads, and traffic accidents. Timely detection of such events has long been of interest.

The problem of analyzing rare events has been variously called deviation detection, outlier analysis, anomaly detection, exception mining, etc. We use the term rare class analysis, which collectively refers to techniques for a number of problems related to rare events. These include (i) techniques for selecting the appropriate feature space in which to analyze rare events in an application domain, (ii) algorithms for building models that detect and characterize rare events, and (iii) adaptive techniques for handling evolving concepts in data streams.

This project conducts a research program to investigate the issues in rare class analysis described above and to develop a suite of techniques that address them.

Our approach is based on the key observation that models that describe rare events are fundamentally different from those that describe normal events, due to a number of factors such as heterogeneity and multi-modality. Specifically, we will address the following computer science challenges:

· It has been shown that rare class analysis requires attention to features that make the rare events stand out; otherwise they are masked by the overwhelming volume of normal events. We are developing techniques that create the appropriate feature space for rare class analysis in a given application.

· Standard predictive model building techniques do not work well for rare class problems. We have developed a novel two-phase algorithm, PNrule, and variants of it that have shown excellent promise. We are currently extending this work to address other model-building problems in rare class analysis.

· In many rare class applications, data arrives in a continuous stream (e.g., network traffic, Web logs), so an incremental analysis approach is needed that can continuously adapt to the evolving data stream. We plan to develop techniques for adapting rare class prediction models.
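One commonly cited reason that standard model building struggles with rare classes is that overall accuracy rewards ignoring them. The sketch below is an illustration on made-up labels, not one of the project's algorithms: it scores a trivial classifier that always predicts the normal class on a synthetic sample of 990 normal and 10 rare events. The function names and the data are assumptions chosen for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of events whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly rare (positive) events that were predicted as rare."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


# Synthetic, illustrative labels: 0 = normal event, 1 = rare event.
y_true = [0] * 990 + [1] * 10

# A trivial "model" that labels every event as normal.
y_majority = [0] * len(y_true)

print("accuracy:", accuracy(y_true, y_majority))
print("rare-class recall:", recall(y_true, y_majority))
```

The trivial classifier is right on 990 of 1000 events yet finds none of the rare ones, which is why rare class work is usually evaluated with measures that focus on the rare class.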

Publications and Products

1. M. Joshi, R. Agarwal, V. Kumar, Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong?, Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada, July 2002.

2. M. Joshi, V. Kumar, CREDOS: Classification Using Ripple Down Structure (A Case for Rare Classes), IEEE International Conference on Data Engineering, 2003.

3. N. Chawla, A. Lazarevic, L. Hall, K. Bowyer, SMOTEBoost: Improving the Prediction of Minority Class in Boosting, European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Cavtat, Croatia, 2003.

Project Impact

Detection of rare events is a true example of the proverbial “needle-in-the-haystack” problem that data mining has real potential to solve. With the increasingly fine resolution of data collection instrumentation, together with the dramatically increasing capacity and falling cost of on-line storage, all kinds of processes are being monitored, from users browsing the Web, to cardiac pacemakers, to computer network activity. In many of these cases, we are interested in finding the few unusual cases that are highly significant and potentially of very high value. We believe the techniques developed as part of this research will be applicable across this wide spectrum of applications.

Goals, Objectives and Targeted Activities

During the first project year, we plan to pursue the following research activities:

1. Feature Space Creation for Rare Class Analysis. We are adapting existing feature selection schemes to select the features most suitable for building prediction models for rare classes. We are also using association patterns (frequent itemsets or episodes) to construct new features that benefit such models. Compared with standard data mining algorithms, these schemes are expected to yield significantly better classification performance (measured using precision, recall, and F-value) on a variety of data sets that exhibit the rare class phenomenon. The effectiveness of these schemes will be evaluated through the performance of classification algorithms that use the selected or constructed features: an improvement in prediction performance will indicate that the features are indeed beneficial for the prediction task. A suitable number of selected or constructed features will also be estimated by varying the number of features and measuring the change in prediction performance. The number of features that achieves the best prediction performance will be taken as the optimal number.

2. Model Selection Criteria for Rare Classes. We plan to modify the MDL (Minimum Description Length) principle to optimally prune prediction models meant for capturing rare classes. The effectiveness of the modified MDL principle will be assessed by comparing the prediction performance of classification models that employ different pruning criteria. In addition, it will be compared with the standard MDL principle in terms of both prediction power and model complexity on a variety of data sets that exhibit rare class phenomena (e.g., the UCI repository, Web logs, network intrusion, damage detection).
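The evaluation described in activity 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it takes the F-value to be the usual harmonic mean of precision and recall, and it assumes a hypothetical `evaluate` callback that returns the F-value of a classifier trained on a given list of features. The feature names and scores in the usage part are invented for the example.

```python
def precision_recall_f(y_true, y_pred, positive=1):
    """Precision, recall, and F-value (harmonic mean) for the rare class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f_value = 2 * precision * rec / (precision + rec) if precision + rec else 0.0
    return precision, rec, f_value


def best_feature_count(ranked_features, evaluate):
    """Try the top-k ranked features for k = 1..n and keep the best-scoring k.

    `evaluate` is a hypothetical callback: it receives a list of features and
    returns the F-value of a classifier built on them.
    """
    best_k, best_f = 0, -1.0
    for k in range(1, len(ranked_features) + 1):
        f_value = evaluate(ranked_features[:k])
        if f_value > best_f:
            best_k, best_f = k, f_value
    return best_k, best_f


# Toy labels: 3 rare events (1) among 10; the classifier finds 2 and raises 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(precision_recall_f(y_true, y_pred))

# Invented F-values for the top-k features, standing in for real training runs.
toy_scores = {1: 0.40, 2: 0.55, 3: 0.62, 4: 0.60}
features = ["f_a", "f_b", "f_c", "f_d"]  # hypothetical feature names
print(best_feature_count(features, lambda subset: toy_scores[len(subset)]))
```

In practice the callback would train and validate a classifier on held-out data for each candidate feature subset; the sweep itself stays the same.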

Area Background

The problem of detecting rare events has been variously called deviation detection, outlier analysis, anomaly detection, exception mining, etc. Techniques based on both supervised and unsupervised learning have been developed for this problem. Unsupervised learning methods analyze each event to determine how similar (or dissimilar) it is to the majority, and their success depends on the choice of similarity measures, dimension weighting, etc. Supervised learning methods build a model for rare events based on labeled data (the training set), and use it to classify each event. An additional desirable feature of supervised learning methods is that they produce models that can be easily understood. In this proposal we focus on supervised learning methods for detecting rare events.
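For contrast with the supervised focus of this project, the unsupervised idea of scoring each event by its dissimilarity to the majority can be illustrated with a simple distance-based score. The sketch below is only one possible choice, assumed for illustration: it uses Euclidean distance to the k-th nearest neighbor as the outlier score, and the points are made up.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by its Euclidean distance to its k-th nearest neighbor.

    Larger scores mean the point lies farther from the bulk of the data.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Illustrative data: four events close together and one far from the rest.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_outlier_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print("most anomalous point:", points[most_anomalous])
```

As the paragraph above notes, the outcome of such methods depends on the similarity measure and on how dimensions are weighted; rescaling one coordinate here would change the scores.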

Area References

R. Agarwal, M. Joshi, PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-study in Network Intrusion Detection), Proceedings of First SIAM International Conference on Data Mining, Chicago, April 2001.

P. Domingos, MetaCost: A General Method for Making Classifiers Cost-sensitive, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 155-164, San Diego, CA, ACM Press, 1999.

C. Elkan, The Foundations of Cost-Sensitive Learning, Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, August 2001.

W. Fan, S.J. Stolfo, J. Zhang, P.K. Chan, AdaCost: Misclassification Cost-sensitive Boosting, Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), Bled, Slovenia, 1999.

N. Japkowicz, The Class Imbalance Problem: Significance and Strategies, Proceedings of the 2000 International Conference on Artificial Intelligence (IC-AI'2000): Special Track on Inductive Learning, Las Vegas, Nevada, 2000.

M. Joshi, V. Kumar, R. Agarwal, Evaluating Boosting Algorithms to Classify Rare Classes: Comparison and Improvements, First IEEE International Conference on Data Mining, San Jose, CA, 2001.

M. Joshi, R. Agarwal, V. Kumar, Mining Needles in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction, Proceedings of ACM SIGMOD'01 Conference on Management of Data, Santa Barbara, May 2001.

M. Kubat, S. Matwin, Addressing the Curse of Imbalanced Training Sets: One Sided Selection, Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179-186, Nashville, Tennessee, Morgan Kaufmann, 1997.

F. Provost and T. Fawcett, Robust Classification for Imprecise Environments, Machine Learning, vol. 42/3, pp. 203-231, 2001.

Project Websites

www.cs.umn.edu/~aleks/rare_class
The web site contains the grant report.

Illustrations

www.cs.umn.edu/~aleks/rare_class

Online Software

Currently not available.

Online Data

UCI Machine Learning Data Repository:

http://www.ics.uci.edu/~mlearn/MLRepository.html

Other Resources

Currently not available.