Scalable Benchmarks, Software and Data for Data Mining, Analytics and Scientific Discoveries
Contact Information:
Vipin Kumar, PI
Department of Computer Science and Engineering
4-192, EE/CSci Building
University of Minnesota
Minneapolis, MN 55455
Phone (612) 625 0726
E-mail: kumar at cs.umn.edu
URL: http://www.cs.umn.edu/~kumar
Michael Steinbach, Co-PI
Department of Computer Science and Engineering
5-225 E, EE/CSci Building
University of Minnesota
Minneapolis, MN 55455
Phone (612) 625-7503
E-mail: steinbach at cs.umn.edu
URL: http://www.cs.umn.edu/~steinbac
List of Supported Students and Staff:
Graduate Students:
Undergraduate Students:
Project Award Information:
- Award Number: 0551551
- Duration: March 15, 2006 - March 14, 2009
- Title: Collaborative Research: CRI - Scalable Benchmarks, Software and Data for Data Mining, Analytics and Scientific Discoveries
- NSF directorate and division: NSF ORG: CNS, Division of Computer and Network Systems
- Award Abstract: http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0551551
- Keywords: data mining, data mining applications, data analysis
Project Summary:
Today's connect-anytime, connect-anywhere digital society is fueling tremendous data growth, transforming the way business, science, and society function. Data sets in
the terabyte range are not uncommon today and are expected to reach petabytes in the near future in many application domains in science, engineering, business,
bioinformatics, and medicine. The complexity of the data is also increasing. For these reasons, there is a growing need for automated data analysis and
mining to extract the required information and knowledge from these data sets. However, the computational complexity of data mining algorithms, combined with this
deluge of data, creates an important challenge. Without a significant leap forward in computing capabilities and technological innovation, the opportunity to
harvest this wealth of data will be lost. In this work, we aim to take important first steps towards such a revolution in computing capabilities and to develop the
underlying infrastructure that will allow other researchers to take up this challenge. In particular, our goal is to (a) develop a benchmarking suite
that will be used to understand the bottlenecks in high-performance data mining and to guide the development of next-generation processors and (b) devise data mining
kernels that can be executed efficiently on existing and future processors.
Duration: 3 years
Journal Publications:
- Van Ness, B; Ramos, C; Haznadar, M; Hoering, A; Haessler, J; Crowley, J; Jacobus, S; Oken, M; Rajkumar, V; Greipp, P; Barlogie, B; Durie,
B; Katz, M; Atluri, G; Fang, G; Gupta, R; Steinbach, M; Kumar, V; Mushlin, R; Johnson, D; Morgan, G, Genomic variation in myeloma:
design, content, and initial application of the Bank On A Cure SNP Panel to detect associations with progression-free survival, BMC
Medicine, vol. 6, (2008). Published, doi:10.1186/1741-7015-6-2
- Varun Chandola, Arindam Banerjee, and Vipin Kumar, Anomaly Detection: A Survey, ACM Computing Surveys, Volume 41(3), July 2009. Tech Report
Books or Other One-time Publications:
- Rohit Gupta, Tushar Garg, Gaurav Pandey, Michael Steinbach, and Vipin Kumar, Comparative Study of Various Genomic Data Sets for Protein Function Prediction and Enhancements Using Association Analysis, SIAM International Data Mining Conference, (2007). Workshop proceedings published as CD. Published. Collection: Petros Drineas, Vipin Kumar, Michael W. Mahoney, "Workshop on Biomedical Informatics"
- Gaurav Pandey, Michael Steinbach, Rohit Gupta, Tushar Garg, and Vipin Kumar, Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study, ACM SIGKDD 2007, pp. 540-549.
- Gaurav Pandey, Lakshmi Naarayanan Ramakrishnan, Michael Steinbach, and Vipin Kumar, Systematic Evaluation of Scaling Methods for
Gene Expression Data, BIBM 2008, pp. 376-381, Philadelphia, PA, 3-5 Nov. 2008
- Boriah, S., Kumar, V., Steinbach, M., Potter, C., and Klooster, S., Land cover change detection: a case study, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008. DOI: http://doi.acm.org/10.1145/1401890.1401993
- Shyam Boriah, Varun Chandola, and Vipin Kumar, Similarity Measures for Categorical Data: A Comparative Evaluation, Proceedings of the SIAM International Conference on Data Mining, SDM 2008, pp. 243-254, April 24-26, 2008, Atlanta, Georgia
- Varun Chandola, Deepthi Cheboli, and Vipin Kumar, Detecting Anomalies in a Time Series Database, (2009). Technical Report, Computer Science Technical Report TR09-004
- Rohit Gupta, Gang Fang, Blayne Field, Michael Steinbach, and Vipin Kumar, Quantitative Evaluation of Approximate Frequent Pattern
Mining Algorithms, (2009). Technical Report, TR 09-005
Research Contributions:
The data sets and kernel algorithms being developed by our group
will be made available to the community at large via the NU-MineBench
suite. Some of these algorithms and data sets have already been
requested by other data mining researchers who would like to try these
techniques on large climate data sets or to use the algorithms
developed in our group to solve their own problems. Over the long term,
most of the distribution of data sets and kernels will be done
via NU-MineBench, which is maintained by our collaborators at
Northwestern University. During summer 2006 we contributed a parallel
implementation of Error-Tolerant Itemset routines.
Contributions to Resources for Research and Education:
PIs Kumar and Steinbach co-taught an introductory data mining course
at the University of Minnesota during Fall 2007. The course included
several lectures on the applications of data mining to climate
and bioinformatics, as well as on the importance of computationally
efficient algorithms given the scale of the data.
Software, Metadata:
The software implements a number of Error-Tolerant association mining algorithms.
These algorithms are important for finding association patterns in noisy
data, e.g., many types of biomedical data. This shareware is available with MineBench and from the following website: http://www.cs.umn.edu/vk/nwu
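
To illustrate why error tolerance matters for noisy data, the sketch below counts support for an itemset while allowing each transaction to miss a small fraction of the items. It is a simplified, row-wise notion of tolerance chosen for illustration only; the function name `tolerant_support` and the parameter `eps` are hypothetical and are not taken from the software described above, whose exact definitions may differ.

```python
from typing import Hashable, Iterable, Sequence, Set


def tolerant_support(
    transactions: Sequence[Set[Hashable]],
    itemset: Iterable[Hashable],
    eps: float,
) -> float:
    """Return the fraction of transactions that contain the itemset,
    allowing each transaction to miss up to floor(eps * |itemset|) items.

    Assumption: this per-transaction tolerance is an illustrative
    simplification, not the definition used by the project's software.
    """
    items = frozenset(itemset)
    if not items:
        raise ValueError("itemset must not be empty")
    if not 0.0 <= eps < 1.0:
        raise ValueError("eps must be in [0, 1)")
    if not transactions:
        return 0.0

    # Largest number of itemset members a transaction may lack.
    max_missing = int(eps * len(items))

    # A transaction supports the itemset if it lacks few enough items.
    hits = sum(1 for t in transactions if len(items - set(t)) <= max_missing)
    return hits / len(transactions)


# Hypothetical toy data: the second transaction has lost item "d" to noise.
toy_transactions = [
    {"a", "b", "c", "d"},
    {"a", "b", "c"},
    {"a", "b"},
    {"d"},
]
exact = tolerant_support(toy_transactions, {"a", "b", "c", "d"}, eps=0.0)
tolerant = tolerant_support(toy_transactions, {"a", "b", "c", "d"}, eps=0.25)
```

With `eps=0.0` the function reduces to ordinary exact support, so a single missing item removes a transaction from the count; a small positive `eps` lets such near-matches contribute, which is the behavior that makes patterns recoverable in noisy data.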