Generalization of the Association Analysis Framework

National Science Foundation Award Number: IIS-0916439 (August 1, 2009 - July 31, 2012)



Personnel:

Vipin Kumar, PI
Department of Computer Science and Engineering
4-192, EE/CSci Building
University of Minnesota
Minneapolis, MN 55455
Phone (612) 625-0726
E-mail: kumar at cs.umn.edu     URL: http://www.cs.umn.edu/~kumar

Michael Steinbach, co-PI
Department of Computer Science and Engineering
5-225C, EE/CSci Building
University of Minnesota
Minneapolis, MN 55455
Phone (612) 626-7503
E-mail: steinbac at cs.umn.edu     URL: http://www-users.cs.umn.edu/~steinbac/

List of Supported Students:

Graduate student(s):
Undergraduate Students:

Collaborators:


Project Activities and Findings:

The area of data mining known as association analysis seeks to find patterns that describe the relationships among the binary attributes (variables) used to characterize a set of objects. The iconic example is market basket data, where the objects are transactions consisting of sets of items purchased by a customer, and the attributes are binary variables that indicate whether or not an item was purchased by a particular customer. The patterns are either sets of items that are frequently purchased together (frequent itemset patterns) or rules that capture the fact that the purchase of one set of items often implies the purchase of a second set of items (association rule patterns). A key strength of association pattern mining is that the potentially exponential search can often be made tractable by support-based pruning of patterns, i.e., eliminating patterns supported by few transactions. Efforts to date have produced a well-developed conceptual (theoretical) foundation and an efficient set of algorithms, and the resulting framework has been extended well beyond its original application to market basket data to encompass new applications.
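As an illustration of the support-based pruning described above, the following is a minimal Apriori-style sketch in Python. It is not one of the algorithms developed in this project; the function names (frequent_itemsets, rule_confidence) and the toy market basket data are hypothetical and chosen only for the example. Support is taken here to be an absolute count of transactions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets in >= min_support transactions.

    Illustrative level-wise (Apriori-style) search; names are hypothetical.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Build size-k candidates from pairs of frequent (k-1)-itemsets.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            # Support-based pruning: a candidate is kept only if every
            # (k-1)-subset is itself frequent, since a superset can never
            # have higher support than any of its subsets.
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count support only for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result


def rule_confidence(supports, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Both the antecedent and the combined itemset must be frequent
    (present in `supports`); otherwise a KeyError is raised.
    """
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    return supports[both] / supports[antecedent]


# Toy market basket data (hypothetical).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
supports = frequent_itemsets(transactions, min_support=3)
```

With a minimum support of 3, candidates such as {bread, diapers, beer} are never counted because one of their subsets ({bread, beer}) is already infrequent. This is the sense in which support-based pruning limits the search.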

Despite the solid foundations of association analysis and the potential economic and intellectual benefits of pattern discovery and its various applications, this group of techniques is not widely used as a data analysis tool in most scientific and commercial domains. The reason is that in many areas, such as those involving continuous and dense data with labels, such techniques would be very useful but cannot currently be applied easily and effectively. Our work on this project aims to make association analysis more widely applicable. Our focus has been on biomedical data, although most of our work could be adapted to non-biological data as well.

Publications:

  1. Gang Fang, Wen Wang, Benjamin Oatley, Brian Van Ness, Michael Steinbach and Vipin Kumar, Characterizing Discriminative Patterns, Manuscript, arXiv: 1102.4104, communicated Feb 2011.
  2. Gang Fang, Wen Wang, Vanja Paunic, Benjamin Oatley, Majda Haznadar, Michael Steinbach, Brian Van Ness, Chad L. Myers and Vipin Kumar, Construction and Functional Analysis of Human Genetic Interaction Networks with Genome-wide Association Data.
  3. Gang Fang, Michael Steinbach, Chad L. Myers and Vipin Kumar, Integration of Differential Gene-combination Search and Gene Set Enrichment Analysis: A General Approach.
  4. Michael Steinbach, Haoyu Yu, Gang Fang, Vipin Kumar, Using Constraints to Generate and Explore Higher Order Discriminative Patterns, 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2011), Shenzhen, China, pp. 338-350, May 24-27, 2011.
  5. Michael Steinbach, Haoyu Yu, and Vipin Kumar, Identification of Co-occurring Insertions in Cancer Genomes Using Association Analysis, International Journal of Data Mining and Bioinformatics, special issue for the 2nd International Workshop on Data Mining for Biomarker Discovery (DMBD 2010), to appear in 2011.
  6. Bonnie Westra, Sanjoy Dey, Gang Fang, Michael Steinbach, Kay Savik, Cristina Oancea and Vipin Kumar, Interpretable Predictive Models for Knowledge Discovery from Home Care Electronic Health Records, Journal of Healthcare Engineering, pp. 55-74, Volume 2, Number 1 / March 2011.
  7. Gowtham Atluri, Jeremy Bellay, Gaurav Pandey, Chad Myers, Vipin Kumar, Discovering Coherent Value Bicliques In Genetic Interaction Data, In Proceedings of the 9th International Workshop on Data Mining in Bioinformatics (BIOKDD'10), held in conjunction with the 16th ACM Conference on Knowledge Discovery and Data Mining (KDD), Washington, D.C., July 2010.
  8. Gang Fang, Rui Kuang, Gaurav Pandey, Michael Steinbach, Chad L. Myers, and Vipin Kumar, Subspace Differential Coexpression Analysis: Problem Definition and a General Approach, Proceedings of the 15th Pacific Symposium on Biocomputing (PSB), 15:145-156, 2010.
  9. Gang Fang, Gaurav Pandey, Wen Wang, Manish Gupta, Michael Steinbach, Vipin Kumar, Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data, IEEE Transactions on Knowledge and Data Engineering (TKDE), to appear. Available in vol. 99, no. PrePrints, 2010.
  10. Rohit Gupta, Smita Agrawal, Navneet Rao, Ze Tian, Rui Kuang, Vipin Kumar, Integrative Biomarker Discovery for Breast Cancer Metastasis from Gene Expression and Protein Interaction Data Using Error-tolerant Pattern Mining, In Proceedings of the International Conference on Bioinformatics and Computational Biology (BICoB), March 2010.
  11. Rohit Gupta, Navneet Rao, Vipin Kumar, Discovery of Error-tolerant Biclusters from Noisy Gene Expression Data, In Proceedings of the 9th International Workshop on Data Mining in Bioinformatics (BIOKDD'10), held in conjunction with the 16th ACM Conference on Knowledge Discovery and Data Mining (KDD), Washington, D.C., July 2010.

Contributions within discipline:

We have created a number of new algorithms for characterizing discriminative patterns, for creating and analyzing genetic interaction data sets from SNP data, and for finding co-occurrence patterns among insertions in tumor genomes. Some of the techniques we developed are for data with class labels, while others apply to unlabeled data. We also created novel approaches for analyzing the maximum discriminative power of discriminative patterns and incorporated Gene Set Enrichment Analysis into our analysis approach.

Contributions to Human Resource Development:

The project has provided partial support for six graduate students who are pursuing studies in computer science. It has also provided support for four undergraduate students.

Contributions to Resources for Research and Education:

Software for several of the algorithms has been made available via the web. This will allow other researchers to use these algorithms and to build on the techniques we have developed.
We presented a graduate-level seminar course, Mining Biomedical Data Sets, that incorporated some of our work, in Spring 2011. More specifically, our work on discriminative patterns was presented in some detail, and a number of the projects selected by students were related to this topic.

Invited Talks:

Mining Scientific Data: Past, Present, and Future, Keynote Presentation at SIAM International Conference on Data Mining 2010, April 29 - May 1, 2010, Columbus, Ohio