Research

My research interests lie in developing machine learning, statistics, and optimization methods to answer questions raised in biology, with a particular emphasis on protein structure problems. My doctoral work focused on computational studies of protein interactions with small molecules and DNA, for which I developed machine learning and sequence alignment techniques. I also studied how representing protein structures as distance matrices allows tools from convex optimization, such as semidefinite programming, to be leveraged. Below I outline my contributions and my agenda moving forward. In particular, my experience has shown that the complexity of protein structure data is not easily handled by current machine learning methods, which motivates me to investigate methods that directly learn and predict geometric objects.
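
The link between distance matrices and semidefinite programming can be illustrated with a short sketch. It is only an illustration of the standard underlying fact, not the model from the working draft listed below: a matrix of squared Euclidean distances between points in 3D corresponds, after double centering, to a Gram matrix that is positive semidefinite with rank at most 3. The function names and the random "structure" used here are hypothetical stand-ins.

```python
import numpy as np

def squared_distance_matrix(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def gram_from_squared_distances(D2):
    """Double centering: G = -1/2 * J @ D2 @ J with J = I - (1/n) * ones."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ D2 @ J

def coords_from_gram(G, dim=3):
    """Recover coordinates (up to rotation, reflection, translation)
    from the largest eigenpairs of the Gram matrix."""
    vals, vecs = np.linalg.eigh(G)            # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]        # indices of the largest ones
    top = np.clip(vals[idx], 0.0, None)       # guard against tiny negatives
    return vecs[:, idx] * np.sqrt(top)

# Hypothetical stand-in for 8 residue positions in 3D (not real protein data).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))

D2 = squared_distance_matrix(X)
G = gram_from_squared_distances(D2)
eigvals = np.linalg.eigvalsh(G)

# Coordinates rebuilt from G reproduce the original distances.
Y = coords_from_gram(G)
D2_rec = squared_distance_matrix(Y)
```

The positive semidefinite constraint on the Gram matrix is convex, which is what makes semidefinite programming applicable. The rank constraint is not convex and is typically relaxed or handled separately.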

Publications

Author Title Year Journal/Proceedings PDF
Chris Kauffman & Karypis, G. Ligand Binding Residue Prediction 2010 Introduction to Protein Structure Prediction: Methods and Algorithms  PDF
Abstract: MOTIVATION: Identifying residues that interact with ligands is useful as a first step to understanding protein function and as an aid to designing small molecules that target the protein for interaction. Several studies have shown sequence features are very informative for this type of prediction while structure features have also been useful when structure is available. We develop a sequence-based method, called LIBRUS, that combines homology-based transfer and direct prediction using machine learning and compare it to previous sequence-based work and current structure-based methods.
RESULTS: Our analysis shows that homology-based transfer is slightly more discriminating than a support vector machine learner using profiles and predicted secondary structure. We combine these two approaches in a method called LIBRUS. On a benchmark of 885 sequence-independent proteins, it achieves an area under the ROC curve (ROC) of 0.83 with 45% precision at 50% recall, a significant improvement over previous sequence-based efforts. On an independent benchmark set, a current method, FINDSITE, based on structure features achieves an ROC of 0.81 with 54% precision at 50% recall, while LIBRUS achieves an ROC of 0.82 with 39% precision at 50% recall at a smaller computational cost. When LIBRUS and FINDSITE predictions are combined, performance is increased beyond either, reaching an ROC of 0.86 and 59% precision at 50% recall.
AVAILABILITY: Software developed for this study is available at http://bioinfo.cs.umn.edu/supplements/binf2009 along with supplemental data on the study.
BibTeX:
@incollection{Kauffman2010,
  author = { Chris Kauffman and George Karypis},
  title = {Ligand Binding Residue Prediction},
  booktitle = {Introduction to Protein Structure Prediction: Methods and Algorithms},
  publisher = {Wiley},
  year = {2010}
}
Chris Kauffman & Karypis, G. LIBRUS: combined machine learning and homology information for sequence-based ligand-binding residue prediction 2009 Bioinformatics
Vol. 25(23), pp. 3099-3107 
PDF
Abstract: Motivation: Identifying residues that interact with ligands is useful as a first step to understanding protein function and as an aid to designing small molecules that target the protein for interaction. Several studies have shown that sequence features are very informative for this type of prediction, while structure features have also been useful when structure is available. We develop a sequence-based method, called LIBRUS, that combines homology-based transfer and direct prediction using machine learning and compare it to previous sequence-based work and current structure-based methods. Results: Our analysis shows that homology-based transfer is slightly more discriminating than a support vector machine learner using profiles and predicted secondary structure. We combine these two approaches in a method called LIBRUS. On a benchmark of 885 sequence-independent proteins, it achieves an area under the ROC curve (ROC) of 0.83 with 45% precision at 50% recall, a significant improvement over previous sequence-based efforts. On an independent benchmark set, a current method, FINDSITE, based on structure features achieves an ROC of 0.81 with 54% precision at 50% recall, while LIBRUS achieves an ROC of 0.82 with 39% precision at 50% recall at a smaller computational cost. When LIBRUS and FINDSITE predictions are combined, performance is increased beyond either reaching an ROC of 0.86 and 59% precision at 50% recall. Availability: Software developed for this study is available at http://bioinfo.cs.umn.edu/supplements/binf2009 along with Supplementary data on the study. Contact: kauffman@cs.umn.edu; karypis@cs.umn.edu
BibTeX:
@article{Kauffman2009,
  author = { Chris Kauffman and Karypis, George},
  title = {LIBRUS: combined machine learning and homology information for sequence-based ligand-binding residue prediction},
  journal = {Bioinformatics},
  year = {2009},
  volume = {25},
  number = {23},
  pages = {3099-3107},
  url = {http://bioinformatics.oxfordjournals.org/cgi/content/abstract/25/23/3099},
  doi = {http://dx.doi.org/10.1093/bioinformatics/btp561}
}
Chris Kauffman & Karypis, G. An Analysis of Information Content Present in Protein-DNA Interactions 2008 Proceedings of the Pacific Symposium on Biocomputing, pp. 477-488  PDF
Abstract: Understanding the role proteins play in regulating DNA replication is essential to forming a complete picture of how the genome manifests itself. In this work, we examine the feasibility of predicting the residues of a protein essential to binding by analyzing protein-DNA interactions from an information theoretic perspective. Through the lens of mutual information, we explore which properties of protein sequence and structure are most useful in determining binding residues with a particular focus on sequence features. We find that the quantity of information carried in most features is small with respect to DNA-contacting residues, the bulk being provided by sequence features along with a select few structural features. Supplemental information for this article is available at http://www.cs.umn.edu/~kauffman/supplements/psb2008
BibTeX:
@inproceedings{Kauffman2008,
  author = { Chris Kauffman and George Karypis},
  title = {An Analysis of Information Content Present in Protein-DNA Interactions},
  booktitle = {Proceedings of the Pacific Symposium on Biocomputing},
  year = {2008},
  pages = {477-488},
  url = {http://psb.stanford.edu/psb-online/proceedings/psb08/abstracts/2008_p477.html}
}
Chris Kauffman, Rangwala, H. & Karypis, G. Improving Homology Models for Protein-Ligand Binding Sites 2008 LSS Computational Systems Bioinformatics Conference  PDF
Abstract: In order to improve the prediction of protein-ligand binding sites through homology modeling, we incorporate knowledge of the binding residues into the modeling framework. Residues are identified as binding or nonbinding based on their true labels as well as labels predicted from structure and sequence. The sequence predictions were made using a support vector machine framework which employs a sophisticated window-based kernel. Binding labels are used with a very sensitive sequence alignment method to align the target and template. Relevant parameters governing the alignment process are searched for optimal values. Based on our results, homology models of the binding site can be improved if a priori knowledge of the binding residues is available. For target-template pairs with low sequence identity and high structural diversity our sequence-based prediction method provided sufficient information to realize this improvement.
BibTeX:
@inproceedings{kauffman08tr,
  author = { Chris Kauffman and Huzefa Rangwala and George Karypis},
  title = {Improving Homology Models for Protein-Ligand Binding Sites},
  booktitle = {LSS Computational Systems Bioinformatics Conference},
  year = {2008},
  url = {http://csb2008.org/program8.html}
}
Christopher Kauffman & Karypis, G. Computational tools for protein-DNA interactions 2012 Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
Vol. 2(1), pp. 14-28 
PDF
Abstract: Interactions between DNA and proteins are central to living systems, and characterizing how and when they occur would greatly enhance our understanding of working genomes. We review the different computational problems associated with protein-DNA interactions and the various methods used to solve them. A wide range of topics is covered including physics-based models for direct and indirect recognition, identification of transcription factor binding sites, and methods to predict DNA-binding proteins. Our goal is to introduce this important problem domain to data mining researchers by identifying the key issues and challenges inherent to the area as well as provide directions for fruitful future research.
BibTeX:
@article{Kauffman2011,
  author = { Christopher Kauffman and George Karypis},
  title = {Computational tools for protein-DNA interactions},
  journal = {Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery},
  publisher = {John Wiley \& Sons, Inc.},
  year = {2012},
  volume = {2},
  number = {1},
  pages = {14--28},
  url = {http://dx.doi.org/10.1002/widm.48},
  doi = {http://dx.doi.org/10.1002/widm.48}
}
Kauffman, C. & George A Convex Programming Model for Protein Structure Prediction Domo working draft PDF
BibTeX:
@unpublished{kauffman2012,
  author = {Chris Kauffman and George},
  title = {A Convex Programming Model for Protein Structure Prediction}
}
Nakken, S., Christopher Kauffman & Karypis, G. Finding Functionally Related Genes by Local and Global Analysis of Medline Abstracts 2004 SIGIR04 Bio Workshop: Search and Discovery in Bioinformatics  PDF
Abstract: Discovery of biological relationships between genes is one of the keys to understanding the complex functional nature of the human genome. Currently, most of the knowledge about interrelating genes is found in immense amounts of various biomedical literature. Hence, extraction of biological contexts occurring in free text represents a valuable tool in gaining knowledge about gene interactions. We present a textual analysis of documents associated with pairs of genes, and describe how this approach can be used to discover and annotate functional relationships among genes. A study on a subset of human genes shows that our analysis tool can act as a ranking mechanism for sets of genes based on their functional relatedness.
BibTeX:
@inproceedings{Nakken2004,
  author = {Sigve Nakken and  Christopher Kauffman and George Karypis},
  title = {Finding Functionally Related Genes by Local and Global Analysis of Medline Abstracts},
  booktitle = {SIGIR04 Bio Workshop: Search and Discovery in Bioinformatics},
  year = {2004}
}
Rangwala, H., Chris Kauffman & Karypis, G. A kernel framework for protein residue annotation 2009 Proceedings of the 13th Pacific--Asia Conference on Knowledge Discovery and Data-Mining  PDF
Abstract: Over the last decade several prediction methods have been developed for determining structural and functional properties of individual protein residues using sequence and sequence-derived information. Most of these methods are based on support vector machines as they provide accurate and generalizable prediction models. We developed a general purpose protein residue annotation toolkit (ProSAT) to allow biologists to formulate residue-wise prediction problems. ProSAT formulates the annotation problem as a classification or regression problem using support vector machines. For every residue ProSAT captures local information (any sequence-derived information) around the residue to create fixed length feature vectors. ProSAT implements accurate and fast kernel functions, and also introduces a flexible window-based encoding scheme that allows better capture of signals for certain prediction problems. In this work we evaluate the performance of ProSAT on the disorder prediction and contact order estimation problems, studying the effect of the different kernels introduced here. ProSAT shows better or at least comparable performance to state-of-the-art prediction systems. In particular ProSAT has proven to be the best performing transmembrane-helix predictor on an independent blind benchmark.
BibTeX:
@inproceedings{rangwala09pakdd,
  author = {Huzefa Rangwala and  Chris Kauffman and George Karypis},
  title = {A kernel framework for protein residue annotation},
  booktitle = {Proceedings of the 13th Pacific--Asia Conference on Knowledge Discovery and Data-Mining},
  year = {2009}
}
Rangwala, H., Christopher Kauffman & Karypis, G. svmPRAT: SVM-based protein residue annotation toolkit. 2009 BMC Bioinformatics
Vol. 10, pp. 439 
PDF
Abstract: BACKGROUND: Over the last decade several prediction methods have been developed for determining the structural and functional properties of individual protein residues using sequence and sequence-derived information. Most of these methods are based on support vector machines as they provide accurate and generalizable prediction models. RESULTS: We present a general purpose protein residue annotation toolkit (svmPRAT) to allow biologists to formulate residue-wise prediction problems. svmPRAT formulates the annotation problem as a classification or regression problem using support vector machines. One of the key features of svmPRAT is its ease of use in incorporating any user-provided information in the form of feature matrices. For every residue svmPRAT captures local information around the residue to create fixed length feature vectors. svmPRAT implements accurate and fast kernel functions, and also introduces a flexible window-based encoding scheme that accurately captures signals and patterns for training effective predictive models. CONCLUSIONS: In this work we evaluate svmPRAT on several classification and regression problems including disorder prediction, residue-wise contact order estimation, DNA-binding site prediction, and local structure alphabet prediction. svmPRAT has also been used for the development of a state-of-the-art transmembrane helix prediction method called TOPTMH, and a secondary structure prediction method called YASSPP. The toolkit provides practitioners an efficient and easy-to-use tool for a wide variety of annotation problems. Availability: http://www.cs.gmu.edu/~mlbio/svmprat.
BibTeX:
@article{rangwala09bmcbio,
  author = {Huzefa Rangwala and  Christopher Kauffman and George Karypis},
  title = {svmPRAT: SVM-based protein residue annotation toolkit.},
  journal = {BMC Bioinformatics},
  year = {2009},
  volume = {10},
  pages = {439},
  url = {http://dx.doi.org/10.1186/1471-2105-10-439},
  doi = {http://dx.doi.org/10.1186/1471-2105-10-439}
}
Rangwala, H., Christopher Kauffman & Karypis, G. A Generalized Framework for Protein Sequence Annotation 2007 Proceedings of the NIPS Workshop on Machine Learning in Computational Biology  PDF
Abstract: Over the last decade several data mining techniques have been developed for determining structural and functional properties of individual protein residues using sequence and sequence-derived information. These protein residue annotation problems are often formulated as either classification or regression problems and solved using a common set of techniques. We develop a generalized protein sequence annotation toolkit (ProSAT) for solving classification or regression problems using support vector machines. The key characteristic of our method is its effective use of window-based information to capture the local environment of a protein sequence residue. This window information is used with several kernel functions available within our framework. We show the effectiveness of using the previously developed normalized second order exponential kernel function and experiment with local window-based information at different levels of granularity. We report empirical results on a diverse set of classification and regression problems: prediction of solvent accessibility, secondary structure, local structure alphabet, transmembrane helices, DNA-protein interaction sites, contact order, and regions of disorder are all explored. Our methods show either comparable or superior results to several state-of-the-art application tuned prediction methods for these problems. ProSAT provides practitioners an efficient and easy-to-use tool for a wide variety of annotation problems. The results of some of these predictions can be used to assist in solving the overarching 3D structure prediction problem.
BibTeX:
@inproceedings{rangwala07mlcb,
  author = {Huzefa Rangwala and  Christopher Kauffman and George Karypis},
  title = {A Generalized Framework for Protein Sequence Annotation},
  booktitle = {Proceedings of the NIPS Workshop on Machine Learning in Computational Biology},
  year = {2007}
}