Project Topics:

Please send ideas for topics to dmbook@cs.umn.edu

 

 

  1. Evaluating Performance of Classifiers

·        Compare the bias and variance of the accuracy estimates produced by different evaluation methods (leave-one-out, cross-validation, bootstrap, stratification, etc.)

·        References:

a.       Kohavi, R., A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (1995)

b.      Efron, B. and Tibshirani, R., Cross-Validation and the Bootstrap: Estimating the Error Rate of a Prediction Rule (1995)

c.       Martin, J.K., and Hirschberg, D.S., Small Sample Statistics for Classification Error Rates I: Error Rate Measurements (1996)

d.      Dietterich, T.G., Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms (1998)
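As a starting point for this topic, the resampling schemes named above can be sketched in a few lines. The following is a minimal illustration, not a method taken from any of the references: the function names, the toy majority-class classifier, and the synthetic labels are all assumptions made for the example. It shows how k-fold cross-validation (leave-one-out is the special case k = n) and the out-of-bag bootstrap produce train/test splits, and how the mean and spread of the resulting accuracy estimates could be compared.

```python
import random


def kfold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k disjoint folds over range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def bootstrap_indices(n, rounds, seed=0):
    """Yield (train, test): train is n draws with replacement, test is the out-of-bag rest."""
    rng = random.Random(seed)
    for _ in range(rounds):
        train = [rng.randrange(n) for _ in range(n)]
        chosen = set(train)
        test = [j for j in range(n) if j not in chosen]
        yield train, test


def majority_class_accuracy(labels, train, test):
    """Accuracy of a toy classifier that always predicts the most common training label."""
    train_labels = [labels[j] for j in train]
    guess = max(sorted(set(train_labels)), key=train_labels.count)
    return sum(labels[j] == guess for j in test) / len(test)


def mean_and_variance(values):
    """Mean and (population) variance of a list of accuracy estimates."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


if __name__ == "__main__":
    # Hypothetical data: 60 synthetic labels, roughly 70% of class "a".
    rng = random.Random(1)
    labels = ["a" if rng.random() < 0.7 else "b" for _ in range(60)]
    n = len(labels)

    schemes = {
        "10-fold CV": kfold_indices(n, 10),
        "leave-one-out": kfold_indices(n, n),
        "bootstrap (out-of-bag)": bootstrap_indices(n, 50),
    }
    for name, splits in schemes.items():
        # Skip the (very unlikely) bootstrap round with an empty test set.
        accs = [majority_class_accuracy(labels, tr, te) for tr, te in splits if te]
        m, v = mean_and_variance(accs)
        print(f"{name}: mean accuracy {m:.3f}, variance across splits {v:.4f}")
```

A real project would replace the toy classifier with actual learners and repeat each scheme over many random seeds, since the variance printed here is only the spread across the splits of one run.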

 

  2. Support Vector Machines (SVM)

·        Present an overview of SVMs, or apply Support Vector Machines to one or more application domains.

·        References:

a.       Mangasarian, O.L., Data Mining via Support Vector Machines (2001)

b.      Burges, C.J.C., A Tutorial on Support Vector Machines for Pattern Recognition (1998)

c.       Joachims, T., Text Categorization with Support Vector Machines: Learning with Many Relevant Features (1998)

d.      Salomon, J., Support Vector Machines for Phoneme Classification (2001)

 

  3. Ensemble Learning

·        A comparative study and implementation of different techniques for ensemble learning such as bagging, boosting, etc.

·        References:

a.       Freund, Y. and Schapire, R.E., A Short Introduction to Boosting (1999)

b.      Joshi, M.V., Kumar, V., Agrawal, R., Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong? (2002)

c.       Quinlan, J.R., Boosting, Bagging and C4.5 (1996)

d.      Bauer, E., Kohavi, R., An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants (1999)

 

  4. Semi-supervised learning (classification with labeled and unlabeled data)

·        Applying different semi-supervised learning techniques to UCI data sets.

·        References:

a.       Nigam, K., Using Unlabeled Data to Improve Text Classification (2001)

b.      Seeger, M., Learning with Labeled and Unlabeled Data (2001)

c.       Nigam, K. and Ghani, R., Analyzing the Effectiveness and Applicability of Co-training (2000)

d.      Vittaut, J.N., Amini, M-R., Gallinari, P., Learning Classification with Both Labeled and Unlabeled Data (2002)

 

  5. Classification for rare-class problems

·        A comparative study and/or implementation of different classification techniques to analyze rare class problems

·        References:

a.       Joshi, M.V., and Agrawal, R., PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-study in Network Intrusion Detection) (2001)

b.      Joshi, M.V., Agrawal, R., and Kumar, V., Mining Needles in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction (2001)

c.       Joshi, M.V., Kumar, V., Agrawal, R., Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong? (2002)

d.      Joshi, M.V., Kumar, V., Agrawal, R., On Evaluating Performance of Classifiers for Rare Classes (2002)

 

  6. Time Series Prediction/Classification

·        A comparative study and/or implementation of time series prediction/classification techniques

·        References:

a.       Geurts, P., Pattern Extraction for Time Series Classification (2001)

b.      Kadous, M.W., A General Architecture for Supervised Classification of Multivariate Time Series (1998)

c.       Giles, C.L., Lawrence, S. and Tsoi, A.C., Noisy Time Series Prediction using a Recurrent Neural Network and Grammatical Inference (2001)

d.      Keogh, E.J. and Pazzani, M.J., An Enhanced Representation of Time Series Which Allows Fast and Accurate Classification, Clustering and Relevance Feedback (1998)

e.       Chatfield, C., The Analysis of Time Series, Chapman & Hall (1989)

 

  7. Sequence Prediction

·        A comparative study and implementation of sequence prediction techniques

·        References:

a.       Laird, P.D. and Saul, R., Discrete Sequence Prediction and Its Applications, Machine Learning, 15(1): 43-68 (1994)

b.      Sun, R. and Giles, C.L., Sequence Learning: From Recognition and Prediction to Sequential Decision Making (2001)

c.       Lesh, N., Zaki, M.J., and Ogihara, M., Mining Features for Sequence Classification (1999)

 

  8. Association Rules for Classification

·        A comparative study and implementation of classification using association patterns (rules and itemsets)

·        References:

a.       Liu, B., Hsu, W., and Ma, Y., Integrating Classification and Association Rule Mining (1998)

b.      Liu, B., Ma, Y. and Wong, C-K, Classification Using Association Rules: Weaknesses and Enhancements (2001)

c.       Li, W., Han, J. and Pei, J., CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules (2001)

d.      Deshpande, M. and Karypis, G., Using Conjunction of Attribute Values for Classification (2002)

 

  9. Spatial Association Rule Mining

·        A comparative study of spatial association rule mining techniques.

·        References:

a.       Koperski, K., and Han, J., Discovery of Spatial Association Rules in Geographic Information Databases (1995)

b.      Shekhar, S. and Huang, Y., Discovering Spatial Co-location Patterns: A Summary of Results (2001)

c.       Malerba, D., Esposito, F. and Lisi, F., Mining Spatial Association Rules in Census Data (2001)

 

  10. Temporal Association Rule Mining

·        A comparative study and/or implementation of temporal association rule mining techniques

·        References:

a.       Li, Y., Ning, P., Wang, X.S., and Jajodia, S., Discovering Calendar-based Temporal Association Rules (2001)

b.      Chen, X. and Petrounias, I., Mining Temporal Features in Association Rules

c.       Lee, C.H., Lin, C.R. and Chen, M.S., On Mining General Temporal Association Rules in a Publication Database (2001)

d.      Ozden, B., Ramaswamy, S., and Silberschatz, A., Cyclic Association Rules (1998)

e.       See also the literature on Sequential Association Rule Mining below

 

  11. Sequential Association Rule Mining

·        A comparative study and/or implementation of sequential association rule mining techniques

·        References:

a.       Srikant, R. and Agrawal, R., Mining Sequential Patterns: Generalizations and Performance Improvements (1996)

b.      Mannila, H., Toivonen, H., and Verkamo, A.I., Discovery of Frequent Episodes in Event Sequences (1997)

c.       Joshi, M., Karypis, G., and Kumar, V., A Universal Formulation of Sequential Patterns (1999)

d.      Borges, J. and Levene, M., Mining Association Rules in Hypertext Databases (1998)

 

  12. Outlier Detection

·        A comparative study and/or implementation of outlier detection techniques.

·        References:

a.       Knorr, Ng, A Unified Notion of Outliers: Properties and Computation (1997)

b.      Knorr, Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets (1998)

c.       Breunig, Kriegel, Ng, Sander, LOF: Identifying Density-Based Local Outliers (2000)

d.      Aggarwal, Yu, Outlier Detection for High Dimensional Data (2001)

e.       Tang, Chen, Fu, Cheung, A Robust Outlier Detection Scheme for Large Data Sets (2001)

  13. Parallel Formulations of Clustering

·        Study and possible implementation of parallel formulations of clustering techniques.

·        References:

a.       Olson, Parallel Algorithms for Hierarchical Clustering (1993)

b.      Nagesh, High Performance Subspace Clustering for Massive Data Sets (1999)

c.       Skillicorn, Strategies for Parallel Data Mining (1999)

d.      Dhillon, Modha, A Data-Clustering Algorithm On Distributed Memory Multiprocessors (2000)

  14. Clustering of Time Series

·        Study and possible implementation of time series clustering techniques on actual NASA time series data.

·        References:

a.       Oates, Clustering Time Series with Hidden Markov Models and Dynamic Time Warping (1999)

b.      Kalpakis, Gada, Puttagunta, Distance Measures for Effective Clustering of ARIMA Time Series

c.       Oates, Identifying Distinctive Subsequences in Multivariate Time Series by Clustering (1999)

  15. Scalable clustering algorithms

·        A comparative study of scalable clustering techniques.

·        References:

a.       Zhang, Ramakrishnan, Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases (1996)

b.      Ganti, Ramakrishnan, Clustering Large Datasets in Arbitrary Metric Spaces (1998)

c.       Bradley, Fayyad, Reina, Scaling Clustering Algorithms to Large Databases (1998)

d.      Farnstrom, Lewis, Elkan, Scalability for Clustering Algorithms Revisited (2000)

  16. Clustering association rules and frequent item sets

·        A comparative study of techniques for clustering association rules.

·        References:

a.       Toivonen, Klemettinen, Pruning and Grouping Discovered Association Rules (1995)

b.      Lent, Swami, Widom, Clustering Association Rules (1997)

c.       Gupta, Strehl, Ghosh, Distance Based Clustering of Association Rules