Karsten Steinhaeuser is a Research Associate in the
Department of Computer Science and Engineering at the
University of Minnesota. His primary responsibilities currently include two major research efforts: an NSF Expeditions in Computing on
Understanding Climate Change: A Data Driven Approach and the
GOPHER project, which is an R&D partner in the
Planetary Skin Institute.
His research interests are centered around data mining and machine learning, in particular the construction and analysis of complex networks, with applications in diverse domains including (but not limited to) climate, ecology, and social networks. He is actively involved in shaping an emerging research area called
climate informatics, which lies at the intersection of computer science and climate sciences, and his interests are more generally in interdisciplinary research and scientific problems relating to climate change and sustainability. He co-organizes the
IEEE ICDM Workshop on Knowledge Discovery from Climate Data and the
International Workshop on Climate Informatics, among others, and is engaged in numerous other professional service and community building activities.
Publications
Book Chapters
Claire Monteleoni, Gavin A. Schmidt, Francis Alexander, Alexandru Niculescu-Mizil,
Karsten Steinhaeuser, Michael Tippett, Arindam Banerjee, M. Benno Blumenthal, Auroop R. Ganguly, Jason E. Smerdon, Marco Tedesco (2013).
Climate Informatics.
Computational Intelligent Data Analysis for Sustainable Development, T. Yu, N. Chawla, S. Simoff (Eds.), Taylor & Francis, 81-126.
Keywords: climate informatics, machine learning, climate science, interdisciplinary research
© Taylor & Francis 2013
The goal of this chapter is to define climate informatics and to propose some grand challenges for this nascent field. Recent progress on climate informatics, by the authors as well as by other groups, reveals that collaborations with climate scientists also open up interesting new problems for machine learning. There are a myriad of collaborations possible at the intersection of these two fields. This chapter uses both top-down and bottom-up approaches to stimulate research progress on a range of problems in climate informatics, some of which have yet to be proposed. For the former, we present challenge problems posed by climate scientists, and discussed with machine learning, data mining, and statistics researchers at Climate Informatics 2011, the First International Workshop on Climate Informatics, the inaugural event of a new annual workshop in which all co-authors participated. To spur innovation from the bottom-up, we also describe and discuss some of the types of data available. In addition to summarizing some of the key challenges for climate informatics, this chapter also draws on some of the recent climate informatics research of the co-authors.
Pursuit of preventive healthcare relies on fundamental knowledge of the complex relationships between diseases and individuals. We take a step towards understanding these connections by employing a network-based approach to explore a large medical database. Here we report on two distinct tasks. First, we characterize networks of diseases in terms of their physical properties and emergent behavior over time. Our analysis reveals important insights with implications for modeling and prediction. Second, we immediately apply this knowledge to construct patient networks and build a predictive model to assess disease risk for individuals based on medical history. We evaluate the ability of our model to identify conditions a person is likely to develop in the future and study the benefits of demographic data partitioning. We discuss strengths and limitations of our method as well as the data itself to provide direction for future work.
K. Steinhaeuser and N. V. Chawla (2008).
Community Detection in a Large Real-World Social Network.
Social Computing, Behavioral Modeling, and Prediction, H. Liu, J.J. Salerno, M.J. Young (Eds.), Springer, 168-175.
Keywords: social network analysis, telecommunications data, community detection, node attributes
© Springer Science + Business Media, LLC 2008
Identifying meaningful community structure in social networks is a hard problem, and extreme network size or sparseness of the network compounds the difficulty of the task. With a proliferation of real-world network datasets there has been an increasing demand for algorithms that work effectively and efficiently. Existing methods are limited by their computational requirements and rely heavily on the network topology, which fails in scale-free networks. Yet, in addition to the network connectivity, many datasets also include attributes of individual nodes, but current methods are unable to incorporate this data. Cognizant of these requirements we propose a simple approach that steers away from complex algorithms, focusing instead on the edge weights; more specifically, we leverage the node attributes to compute better weights. Our experimental results on a real-world social network show that a simple thresholding method with edge weights based on node attributes is sufficient to identify a very strong community structure.
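The thresholding idea in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the Jaccard similarity used as the attribute-based edge weight, the use of connected components as communities, and the names attribute_similarity and threshold_communities are all assumptions made for illustration.

```python
from collections import defaultdict


def attribute_similarity(a, b):
    """Jaccard similarity of two attribute sets (an assumed weighting scheme)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def threshold_communities(edges, attributes, threshold):
    """Drop edges whose attribute-based weight is below the threshold,
    then return the connected components of what remains."""
    adjacency = defaultdict(set)
    for u, v in edges:
        if attribute_similarity(attributes[u], attributes[v]) >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, communities = set(), []
    for node in sorted(attributes):
        if node in seen:
            continue
        # Depth-first traversal collects one component.
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        communities.append(component)
    return communities


edges = [("a", "b"), ("b", "c"), ("c", "d")]
attributes = {"a": {"x", "y"}, "b": {"x", "y"}, "c": {"z"}, "d": {"z"}}
communities = threshold_communities(edges, attributes, threshold=0.5)
```

In this toy example the edge between b and c has weight zero and is removed, so the nodes split into two groups. The appeal of such a scheme is that it needs only one pass over the edges and one graph traversal.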
Refereed Journal Articles
S. Liess, A. Kumar, P. K. Snyder, J. Kawale,
K. Steinhaeuser, F. H. Semazzi, A. R. Ganguly, N. F. Samatova, and V. Kumar (2014). Different modes of variability over the Tasman Sea: Implications for Regional Climate.
Journal of Climate, in press.
Keywords: teleconnections, non-orthogonal modes of variability, Tasman Sea
© AMS 2014
A new approach is used to detect atmospheric teleconnections without being bound by orthogonality (such as Empirical Orthogonal Functions). This method employs negative correlations in a global dataset to detect potential teleconnections. One teleconnection occurs between the Tasman Sea and the Southern Ocean. It is related to the El Niño Southern Oscillation (ENSO), the Indian Ocean Dipole (IOD), and the Southern Annular Mode (SAM). This teleconnection is significantly correlated with SAM during austral summer, fall and winter, with IOD during spring, and with ENSO in summer. It can thus be described as a hybrid between these modes. Given previously found relationships between IOD and ENSO, and IOD's proximity to the teleconnection centers, correlations to IOD are generally stronger than to ENSO.
Increasing pressure over the Tasman Sea leads to higher (lower) surface temperature over eastern Australia (southwestern Pacific) in all seasons, and is related to reduced surface temperature over Wilkes Land and Adelie Land in Antarctica during fall and winter. Precipitation responses are generally negative over New Zealand. For one standard deviation of the teleconnection index, precipitation anomalies are positive over Australia in fall, negative over southern Australia in winter and spring, and negative over eastern Australia in summer. When doubling the threshold, the size of the anomalous high-pressure center increases and annual precipitation anomalies are negative over southeastern Australia and northern New Zealand. Eliassen-Palm fluxes quantify the seasonal dependence of SAM, ENSO and IOD influences. Analysis of the dynamical interactions between these teleconnection patterns can improve prediction of seasonal temperature and precipitation patterns in Australia and New Zealand.
A. R. Ganguly, E. Kodra, A. Banerjee, S. Boriah, S. Chatterjee, S. Chatterjee, A. Choudhary, D. Das, J. Faghmous, P. Ganguli, S. Ghosh, K. Hayhoe, C. Hays, W. Hendrix, Q. Fu, J. Kawale, D. Kumar, V. Kumar, S. Liess, R. Mawalagedara, V. Mithal, R. Oglesby, K. Salvi, P. K. Snyder,
K. Steinhaeuser, D. Wang, and D. Wuebbles (2014).
Toward enhanced understanding and prediction of climate extremes using physics-guided data mining techniques.
Nonlinear Processes in Geophysics,
21, 777-795.
Keywords: climate modeling, data mining, climate extremes, big data
Extreme events such as heat waves, cold spells, floods, droughts, tropical cyclones, and tornadoes have potentially devastating impacts on natural and engineered systems, and human communities, worldwide. Stakeholder decisions about critical infrastructures, natural resources, emergency preparedness and humanitarian aid typically need to be made at local to regional scales over seasonal to decadal planning horizons. However, credible climate change attribution and reliable projections at more localized and shorter time scales remain grand challenges. Long-standing gaps include inadequate understanding of processes such as cloud physics and ocean-land-atmosphere interactions, limitations of physics-based computer models, and the importance of intrinsic climate system variability at decadal horizons. Meanwhile, the growing size and complexity of climate data from model simulations and remote sensors increases opportunities to address these scientific gaps. This perspectives article explores the possibility that physically cognizant mining of massive climate data may lead to significant advances in generating credible predictive insights about climate extremes and in turn translating them to actionable metrics and information for adaptation and policy. Specifically, we propose that data mining techniques geared towards extremes can help tackle the grand challenges in the development of interpretable climate projections, predictability, and uncertainty assessments. To be successful, scalable methods will need to handle what has been called "Big Data" to tease out elusive but robust statistics of extremes and change from what is ultimately small data. Physically-based relationships (where available) and conceptual understanding (where appropriate) are needed to guide methods development and interpretation of results. 
Such approaches may be especially relevant in situations where computer models may not be able to fully encapsulate current process understanding, yet the wealth of data may offer additional insights. Large-scale interdisciplinary team efforts, involving domain experts and individual researchers who span disciplines, will be necessary to address the challenge.
Until now, climate model intercomparison has focused primarily on annual and global averages of various quantities or on specific components, not on how well the general dynamics in the models compare to each other. In order to address how well models agree when it comes to the dynamics they generate, we have adopted a new approach based on climate networks. We have considered 28 pre-industrial control runs as well as 70 20th-century forced runs from 23 climate models and have constructed networks for the 500 hPa, surface air temperature (SAT), sea level pressure (SLP), and precipitation fields for each run. We then employed a widely used algorithm to derive the community structure in these networks. Communities separate "nodes" in the network sharing similar dynamics. It has been shown that these communities, or sub-systems, in the climate system are associated with major climate modes and physics of the atmosphere (Tsonis AA, Swanson KL, Wang G, J Clim 21: 2990-3001 in 2008; Tsonis AA, Wang G, Swanson KL, Rodrigues F, da Fontura Costa L, Clim Dyn, 37: 933-940 in 2011; Steinhaeuser K, Ganguly AR, Chawla NV, Clim Dyn 39: 889-895 in 2012). Once the community structure for all runs is derived, we use a pattern matching statistic to obtain a measure of how well any two models agree with each other. We find that, with the possible exception of the 500 hPa field, consistency for the SAT, SLP, and precipitation fields is questionable. More importantly, none of the models comes close to the community structure of the actual observations (reality). This is a significant finding especially for the temperature and precipitation fields, as these are the fields widely used to produce future projections in time and in space.
A systematic characterization of multivariate dependence at multiple spatio-temporal scales is critical to understanding climate system dynamics and improving predictive ability from models and data. However, dependence structures in climate are complex due to nonlinear dynamical generating processes, long-range spatial and long-memory temporal relationships, as well as low-frequency variability. Here we utilize complex networks to explore dependence in climate data. Specifically, networks constructed from reanalysis-based atmospheric variables over oceans and partitioned with community detection methods demonstrate the potential to capture regional and global dependence structures within and among climate variables. Proximity-based dependence as well as long-range spatial relationships are examined along with their evolution over time, yielding new insights on ocean meteorology. The tools are implicitly validated by confirming conceptual understanding about aggregate correlations and teleconnections. Our results also suggest a close similarity of observed dependence patterns in relative humidity and horizontal wind speed over oceans. In addition, updraft velocity, which relates to convective activity over the oceans, exhibits short spatiotemporal decorrelation scales but long-range dependence over time. The multivariate and multi-scale dependence patterns broadly persist over multiple time windows. Our findings motivate further investigations of dependence structures among observations, reanalysis and model-simulated data to enhance process understanding, assess model reliability and improve regional climate predictions.
Human populations are profoundly affected by water stress, or the lack of sufficient per capita available freshwater. Water stress can result from overuse of available freshwater resources or from a reduction in the amount of available water due to decreases in rainfall and stored water supplies. Analyzing the interrelationship between human populations and water availability is complicated by the uncertainties associated with climate change projections and population projections. We present a simple methodology developed to integrate disparate climate and population data sources and develop first-order per capita water availability projections at the global scale. Simulations from the coupled land-ocean-atmosphere Community Climate System Model version 3 (CCSM3) forced with a range of hypothetical greenhouse gas emissions scenarios are used to project grid-based changes in precipitation minus evapotranspiration as proxies for changes in runoff, or fresh water supply. Population growth changes, according to Intergovernmental Panel on Climate Change (IPCC) storylines, are used as proxies for changes in fresh water demand by 2025, 2050 and 2100. These freshwater supply and demand projections are then combined to yield estimates of per capita water availability aggregated by watershed and political unit. Results suggest that important insights might be extracted from the use of the process developed here, notably including the identification of the globe's most vulnerable regions in need of more detailed analysis and the relative importance of population growth versus climate change in altering future freshwater supplies. However, these are only exemplary insights and, as such, could be considered hypotheses that should be rigorously tested with multiple climate models, multiple observational climate datasets, and more comprehensive population change storylines.
The analysis of climate data has relied heavily on hypothesis-driven statistical methods, while projections of future climate are based primarily on physics-based computational models. However, in recent years a wealth of new datasets has become available. Therefore, we take a more data-centric approach and propose a unified framework for studying climate, with an aim towards characterizing observed phenomena as well as discovering new knowledge in the climate domain. Specifically, we posit that complex networks are well-suited for both descriptive analysis and predictive modeling tasks. We show that the structural properties of "climate networks" have useful interpretation within the domain. Further, we extract clusters from these networks and demonstrate their predictive power as climate indices. Our experimental results establish that the network clusters are statistically significantly better predictors than clusters derived using a more traditional clustering approach. Using complex networks as data representation thus enables the unique opportunity for descriptive and predictive modeling to inform each other.
Analyses of climate model simulations and observations reveal that extreme cold events are likely to persist across each land-continent even under 21st-century warming scenarios. The grid-based intensity, duration and frequency of cold extreme events are calculated annually through three indices: the coldest annual consecutive three-day average of daily maximum temperature, the annual maximum of consecutive frost days, and the total number of frost days. Nine global climate models forced with a moderate greenhouse-gas emissions scenario are used to compare the indices over 2091-2100 versus 1991-2000. The credibility of model-simulated cold extremes is evaluated through both bias scores relative to reanalysis data in the past and multi-model agreement in the future. The number of times the value of each annual index in 2091-2100 exceeds the decadal average of the corresponding index in 1991-2000 is counted. The results indicate that intensity and duration of grid-based cold extremes, when viewed as a global total, will often be as severe as current typical conditions in many regions, but the corresponding frequency does not show this persistence. While the models agree on the projected persistence of cold extremes in terms of global counts, regionally, inter-model variability and disparity in model performance tend to dominate. Our findings suggest that, despite a general warming trend, regional preparedness for extreme cold events cannot be compromised even towards the end of the century.
Climate change is a pressing focus of research, social and economic concern, and political attention. According to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC), increased frequency of extreme events will only intensify the occurrence of natural hazards, affecting global population, health, and economies. It is of keen interest to identify regions of similar climatological behavior to discover spatial relationships in climate variables, including long-range teleconnections. To that end, we consider a complex networks-based representation of climate data. Cross correlation is used to weight network edges, thus respecting the temporal nature of the data, and a community detection algorithm identifies multivariate clusters. Examining networks for consecutive periods allows us to study structural changes over time. We show that communities have a climatological interpretation and that disturbances in structure can be an indicator of climate events (or lack thereof). Finally, we discuss how this model can be applied for the discovery of more complex concepts such as unknown teleconnections or the development of multivariate climate indices and predictive insights.
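As a rough illustration of the edge-weighting step described above, the following sketch builds a small network whose edges are weighted by lagged cross correlation. The lag range, the minimum-weight cutoff, and the function names pearson, cross_correlation and build_network are assumptions for this example and are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def cross_correlation(x, y, max_lag):
    """Largest-magnitude correlation between x and y over lags -max_lag..max_lag."""
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[lag:], y[:len(y) - lag]
        else:
            a, b = x[:len(x) + lag], y[-lag:]
        r = pearson(a, b)
        if abs(r) > abs(best):
            best = r
    return best


def build_network(series_by_node, max_lag=2, min_weight=0.5):
    """Return {(u, v): weight} for node pairs whose cross correlation is strong enough."""
    nodes = sorted(series_by_node)
    edges = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            weight = cross_correlation(series_by_node[u], series_by_node[v], max_lag)
            if abs(weight) >= min_weight:
                edges[(u, v)] = weight
    return edges


series = {
    "a": [1.0, 2.0, 4.0, 7.0, 3.0, 0.0, 5.0, 6.0],
    "b": [2.0, 4.0, 7.0, 3.0, 0.0, 5.0, 6.0, 9.0],  # "a" shifted by one step
    "c": [3.0] * 8,                                  # constant, correlates with nothing
}
network = build_network(series, max_lag=1, min_weight=0.9)
```

Searching over lags lets two locations be linked even when one responds to the other with a delay, which a zero-lag correlation would miss. A community detection algorithm would then be run on the resulting weighted edges.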
We compare and evaluate different metrics for community structure in networks. In this context we also discuss a simple approach to community detection, and show that it performs as well as other methods, but at lower computational complexity.
A. R. Ganguly,
K. Steinhaeuser, D. J. Erickson III, M. L. Branstetter, E. Parish, N. Singh, J. B. Drake and L. Buja (2009).
Higher trends but larger uncertainty and geographic variability in 21st century temperature and heat waves.
Proceedings of the National Academy of Sciences USA,
106(37), 15555-15559.
Keywords: climate change, extremes, uncertainty, regional analysis
Generating credible climate change and extremes projections remains a high-priority challenge, especially since recent observed emissions are above the worst-case scenario. Bias and uncertainty analyses of ensemble simulations from a global earth systems model show increased warming and more intense heat waves combined with greater uncertainty and large regional variability in the 21st century. Global warming trends are statistically validated across ensembles and investigated at regional scales. Observed heat wave intensities in the current decade are larger than worst-case projections. Model projections are relatively insensitive to initial conditions, while uncertainty bounds obtained by comparison with recent observations are wider than ensemble ranges. Increased trends in temperature and heat waves, concurrent with larger uncertainty and variability, suggest greater urgency and complexity of adaptation or mitigation decisions.
Refereed Conference and Workshop Publications
J. Xu, T. L. Wickramarathne, N. V. Chawla, E. K. Grey,
K. Steinhaeuser, R. P. Keller, J. M. Keller, J. M. Drake, and D. M. Lodge (2014). Improving Management of Aquatic Invasions by Integrating Shipping Network, Ecological, and Environmental Data: Data Mining for Social Good.
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), New York, NY.
Keywords: global shipping network, invasive species, biodiversity
© ACM 2014
The unintentional transport of invasive species (i.e., non-native and harmful species that adversely affect habitats and native species) through the Global Shipping Network (GSN) causes substantial losses to social and economic welfare (e.g., annual losses due to ship-borne invasions in the Laurentian Great Lakes are estimated to be as high as USD 800 million). Despite the huge negative impacts, management of such invasions remains challenging because of the complex processes that lead to species transport and establishment. Numerous difficulties associated with quantitative risk assessments (e.g., inadequate characterizations of invasion processes, lack of crucial data, large uncertainties associated with available data, etc.) have hampered the usefulness of such estimates in the task of supporting the authorities who are battling to manage invasions with limited resources. We present here an approach for addressing the problem at hand via creative use of computational techniques and multiple data sources, thus illustrating how data mining can be used for solving crucial, yet very complex problems towards social good. By modeling implicit species exchanges as a network that we refer to as the Species Flow Network (SFN), large-scale species flow dynamics are studied via a graph clustering approach that decomposes the SFN into clusters of ports and inter-cluster connections. We then exploit this decomposition to discover crucial knowledge on how patterns in GSN affect aquatic invasions, and then illustrate how such knowledge can be used to devise effective and economical invasive species management strategies. By experimenting on actual GSN traffic data for years 1997-2006, we have discovered crucial knowledge that can significantly aid the management authorities.
V. Mithal, A. Khandelwal, S. Boriah,
K. Steinhaeuser, V. Kumar (2013). Change Detection from Temporal Sequences of Class Labels: Application to Land Cover Change Mapping.
SIAM International Conference on Data Mining (SDM), Austin, TX.
Keywords: time series analysis, change detection, remote sensing, land cover change
© SIAM 2013
Mapping land cover change is an important problem for the scientific community as well as policy makers. Traditionally, bi-temporal classification of satellite data is used to identify areas of land cover change. However, these classification products often have errors due to classifier inaccuracy or poor data, which poses significant issues when using them for land cover change detection. In this paper, we propose a generative model for land cover label sequences and use it to reassign a more accurate sequence of land cover labels to every pixel. Empirical evaluation on real and synthetic data suggests that the proposed approach is effective in capturing the characteristics of land cover classification and change processes, and produces significantly improved classification and change detection products.
X. Chen,
K. Steinhaeuser, S. Boriah, S. Chatterjee, V. Kumar (2013). Contextual Time Series Change Detection.
SIAM International Conference on Data Mining (SDM), Austin, TX.
Keywords: time series analysis, change detection, contextual change
© SIAM 2013
Time series data are common in a variety of fields ranging from economics to medicine and manufacturing. As a result, time series analysis and modeling has become an active research area in statistics and data mining. In this paper, we focus on a type of change we call contextual time series change (CTC) and propose a novel two-stage algorithm to address it. In contrast to traditional change detection methods, which consider each time series separately, CTC is defined as a change relative to the behavior of a group of related time series. As a result, our proposed method is able to identify novel types of changes not found by other algorithms. We demonstrate the unique capabilities of our approach with several case studies on real-world datasets from the financial and Earth science domains.
A. Karpatne, M. Blank, M. Lau, S. Boriah,
K. Steinhaeuser, M. Steinbach, V. Kumar (2012). Importance of Vegetation Type in Forest Cover Estimation.
Conference on Intelligent Data Understanding (CIDU), Boulder, CO.
Keywords: remote sensing, land cover, forest cover estimation
© IEEE 2012
Forests are an important natural resource that play a major role in sustaining a number of vital geochemical and bioclimatic processes. Since damage to forests due to natural and anthropogenic factors can have long-lasting impacts on the ecosystem of the planet, monitoring and estimating forest cover and its losses at global, regional and local scales is of primary concern. Developing forest cover estimation techniques that utilize remote sensing datasets offers global applicability at high temporal frequencies. However, estimating forest cover using satellite observations is challenging in the presence of heterogeneous vegetation types, each having its unique data characteristics. In this paper, we explore techniques for incorporating information about the vegetation type in forest cover estimation algorithms. We show that utilizing the vegetation type improves performance regardless of the choice of input data or forest cover learning algorithm. We also provide a mechanism to automatically extract information about the vegetation type by partitioning the input data using clustering.
V. Mithal, Z. O'Connor,
K. Steinhaeuser, S. Boriah, V. Kumar, C. Potter, S. Klooster (2012). Time Series Change Detection using Segmentation: A Case Study for Land Cover Monitoring.
Conference on Intelligent Data Understanding (CIDU), Boulder, CO.
Keywords: time series analysis, segmentation, remote sensing, land cover change
© IEEE 2012
Automatic identification of changes in land cover from remote sensing data is a critical aspect of monitoring the planet's ecosystems. We use time series segmentation methodology for detecting land cover changes from a Moderate Resolution Imaging Spectroradiometer-based vegetation index. In this paper, we investigate segmentation scores based on the difference between models and propose two approaches for normalizing the difference-based score. The first approach uses permutation testing to assign a p-value to model difference. The second approach builds on bootstrapping methodology used in statistics, which estimates the null distribution of complex statistics whose standard errors are not analytically derivable by generating alternative versions of the data by a resampling strategy. More specifically, given a time series with either a single or two segments, we propose a method to estimate the distribution of the model difference statistic for each segment. The proposed approach allows normalizing the model difference statistic when complex models are being used in the segmentation algorithm. We study the strengths and weaknesses of the two normalizing approaches in the context of characteristics of land cover data such as seasonality and noise using synthetic and real data sets. We show that relative performance of normalization approaches can vary significantly depending on the characteristics of the data. We illustrate the utility of these approaches for detection of deforestation in Mato Grosso (Brazil).
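The permutation-testing idea mentioned in this abstract can be sketched as follows. This is only a simplified illustration: the "model" for each segment is assumed to be its mean, the split point is assumed known, and the names mean_difference and permutation_p_value are hypothetical. The paper's actual segment models and scores may differ.

```python
import random


def mean_difference(series, split):
    """Absolute difference between the means of the two segments (an assumed score)."""
    left, right = series[:split], series[split:]
    return abs(sum(left) / len(left) - sum(right) / len(right))


def permutation_p_value(series, split, n_permutations=999, seed=0):
    """Share of shuffled series whose score is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mean_difference(series, split)
    shuffled = list(series)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if mean_difference(shuffled, split) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (exceed + 1) / (n_permutations + 1)


changed = [1.0] * 10 + [5.0] * 10   # clear level shift at index 10
flat = [2.0] * 20                   # no change at all
p_changed = permutation_p_value(changed, split=10)
p_flat = permutation_p_value(flat, split=10)
```

A small p-value means the observed difference between segments is rarely matched when the temporal order is destroyed. Note that plain shuffling ignores seasonality and autocorrelation, which is one reason a bootstrap-based alternative may be preferable for land cover data.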
X. Chen†, A. Karpatne†, Y. Chamber†, V. Mithal, M. Lau,
K. Steinhaeuser, S. Boriah, M. Steinbach, V. Kumar, C. Potter, S. Klooster, T. Abraham, J.D. Stanley (2012). A New Data Mining Framework for Forest Fire Mapping.
Conference on Intelligent Data Understanding (CIDU), Boulder, CO. † Equal Contribution
Keywords: remote sensing, land cover change, forest fire mapping, time series analysis, change detection
© IEEE 2012
Forests are an important natural resource that support economic activity and play a significant role in regulating the climate and the carbon cycle, yet forest ecosystems are increasingly threatened by fires caused by a range of natural and anthropogenic factors. Mapping these fires, which can range in size from less than an acre to hundreds of thousands of acres, is an important task for supporting climate and carbon cycle studies as well as informing forest management. Currently, there are two primary approaches to fire mapping: field- and aerial-based surveys, which are costly and limited in their extent; and remote sensing-based approaches, which are more cost-effective but pose several interesting methodological and algorithmic challenges. In this paper, we introduce a new framework for mapping forest fires based on satellite observations. Specifically, we develop unsupervised spatio-temporal data mining methods for Moderate Resolution Imaging Spectroradiometer (MODIS) data to generate a history of forest fires. A systematic comparison with alternate approaches in two diverse geographic regions demonstrates that our algorithmic paradigm is able to overcome some of the limitations in both data and methods employed by prior efforts.
J. Kawale, S. Chatterjee, D. Ormsby,
K. Steinhaeuser, S. Liess, V. Kumar (2012). Testing the Significance of Spatio-temporal Teleconnection Patterns.
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Beijing, China.
Keywords: spatio-temporal data mining, teleconnections, climate data, dipoles, significance testing
© ACM 2012
Dipoles represent long distance connections between the pressure anomalies of two distant regions that are negatively correlated with each other. Such dipoles have proven important for understanding and explaining the variability in climate in many regions of the world, e.g., the El Niño climate phenomenon is known to be responsible for precipitation and temperature anomalies worldwide. Systematic approaches for dipole detection generate a large number of candidate dipoles, but there exists no method to evaluate the significance of the candidate teleconnections. Statistical significance testing is an important mechanism that helps in assessing the relevance of the patterns generated to determine whether they are interesting or spurious, i.e., generated by random chance. In this paper, we present a novel method for testing the statistical significance of a class of spatio-temporal patterns called teleconnections or dipoles. One of the most important challenges in addressing significance testing in a spatio-temporal context is how to address the spatial and temporal dependencies that show up as high autocorrelation. We present a novel approach that uses the wild bootstrap to capture the spatio-temporal dependencies, in the special use case of teleconnections in climate data. Our approach to find the statistical significance takes into account the autocorrelation, the seasonality and the trend in the time series over a period of time. This framework is applicable to other problems in spatio-temporal data mining to assess the significance of the patterns.
S. Chatterjee*,
K. Steinhaeuser, A. Banerjee, S. Chatterjee, and A. R. Ganguly (2012). Sparse Group Lasso: Consistency and Climate Applications.
SIAM International Conference on Data Mining (SDM), Anaheim, CA. * Best Student Paper
Keywords: sparse regression, group lasso, climate data, multivariate predictive modeling
© SIAM 2012
The design of statistical predictive models for climate data gives rise to some unique challenges due to the high dimensionality and spatio-temporal nature of the datasets, which dictate that models should exhibit parsimony in variable selection. Recently, a class of methods that promote structured sparsity in the model has been developed and is well suited to this task. In this paper, we prove theoretical statistical consistency of estimators with tree-structured norm regularizers. We consider one particular model, the Sparse Group Lasso (SGL), to construct predictors of land climate using ocean climate variables. Our experimental results demonstrate that the SGL model provides better predictive performance than the current state-of-the-art, remains climatologically interpretable, and is robust in its variable selection.
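As a rough illustration of the kind of regularizer the abstract above refers to, the sketch below computes a commonly used form of the sparse group lasso penalty: an L1 term over all coefficients plus a sum of L2 norms over groups of coefficients. The function name, the `groups` representation, and the weights `lam1` and `lam2` are assumptions made for this example; the abstract does not state the paper's exact formulation (for instance, any per-group weights), so this should not be read as the authors' implementation.

```python
import numpy as np

def sparse_group_lasso_penalty(beta, groups, lam1, lam2):
    """Commonly used SGL penalty (hypothetical helper, not from the paper).

    beta   : 1-D array of regression coefficients
    groups : list of index lists, one per coefficient group
    lam1   : weight of the element-wise L1 term
    lam2   : weight of the group-wise L2 term
    """
    beta = np.asarray(beta, dtype=float)
    # L1 term encourages sparsity within groups.
    l1_term = np.sum(np.abs(beta))
    # Sum of unsquared L2 norms encourages entire groups to vanish.
    group_term = sum(np.linalg.norm(beta[idx]) for idx in groups)
    return lam1 * l1_term + lam2 * group_term
```

In a climate setting, one plausible grouping (again an assumption) would collect all variables measured at the same ocean location into one group, so that the group term can discard whole locations while the L1 term selects variables within the retained ones.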
Various clustering methods have been applied to climate, ecological, and other environmental datasets, for example to define climate zones, automate land-use classification, and similar tasks. Measuring the "goodness" of such clusters is generally application-dependent and highly subjective, often requiring domain expertise and/or validation with field data (which can be costly or even impossible to acquire). Here we focus on one particular task: the extraction of ocean climate indices from observed climatological data. In this case, it is possible to quantify the relative performance of different methods. Specifically, we propose to extract indices with complex networks constructed from climate data, which have been shown to effectively capture the dynamical behavior of the global climate system, and compare their predictive power to candidate indices obtained using other popular clustering methods. Our results demonstrate that network-based clusters are statistically significantly better predictors of land climate than any other clustering method, which could lead to a deeper understanding of climate processes and complement physics-based climate models.
A. Pelan,
K. Steinhaeuser, N. V. Chawla, D. A. de Alwis Pitts and A. R. Ganguly (2011). Empirical Comparison of Correlation Measures and Pruning Levels in Complex Networks Representing the Global Climate System.
IEEE Symposium Series on Computational Intelligence and Data Mining (CIDM), Paris, France.
Keywords: complex networks, climate data, correlation measures, network properties
© IEEE 2011
Climate change is an issue of growing economic, social, and political concern. Continued rise in the average temperatures of the Earth could lead to drastic climate change or an increased frequency of extreme events, which would negatively affect agriculture, population, and global health. One way of studying the dynamics of the Earth's changing climate is by attempting to identify regions that exhibit similar climatic behavior in terms of long-term variability. Climate networks have emerged as a strong analytics framework for both descriptive analysis and predictive modeling of the emergent phenomena. Previously, the networks were constructed using only one measure of similarity, namely the (linear) Pearson cross correlation, and were then clustered using a community detection algorithm. However, nonlinear dependencies are known to exist in climate, which raises the question of whether more complex correlation measures are able to capture any such relationships. In this paper, we present a systematic study of different univariate measures of similarity and compare how each affects both the network structure and the predictive power of the clusters.
K. Steinhaeuser, N. V. Chawla and A. R. Ganguly (2010). Complex Networks in Climate Science: Progress, Opportunities and Challenges.
NASA Conference on Intelligent Data Understanding (CIDU), Mountain View, CA.
Keywords: complex networks, climate data, network properties, community detection, open questions
© NASA 2010
Networks have been used to describe and model a wide range of complex systems, both natural and man-made. One particularly interesting application in the earth sciences is the use of complex networks to represent and study the global climate system. In this paper, we motivate this general approach, explain the basic methodology, report on the state of the art (including our contributions), and outline open questions and opportunities for future research.
While data mining aims to identify hidden knowledge from massive and high dimensional datasets, the importance of dependence structure among time, space, and between different variables is less emphasized. Analogous to the use of probability density functions in modeling individual variables, it is now possible to characterize the complete dependence space mathematically through the application of copulas. By adopting copulas, the multivariate joint probability distribution can be constructed without constraint to specific types of marginal distributions. Some common assumptions, like normality and independence between variables, can also be relaxed. This study provides a fundamental introduction to and illustration of dependence structure, aimed at the potential applicability of copulas in general data mining. The case study in hydro-climatic anomaly detection shows that the frequency of multivariate anomalies is affected by the dependence level between variables. The appropriate multivariate thresholds can be determined through a copula-based approach.
To discover patterns in historical data, climate scientists have applied various clustering methods with the goal of identifying regions that share some common climatological behavior. However, past approaches are limited by the fact that they either consider only a single time period (snapshot) of multivariate data, or they consider only a single variable by using the time series data as a multi-dimensional feature vector. In both cases, potentially useful information may be lost. Moreover, clusters in high-dimensional data space can be difficult to interpret, prompting the need for a more effective data representation. We address both of these issues by employing a complex network (graph) to represent climate data, a more intuitive model that can be used for analysis while also having a direct mapping to the physical world for interpretation. A cross correlation function is used to weight network edges, thus respecting the temporal nature of the data, and a community detection algorithm identifies multivariate clusters. Examining networks for consecutive periods allows us to study structural changes over time. We show that communities have a climatological interpretation and that disturbances in structure can be an indicator of climate events (or lack thereof). Finally, we discuss how this model can be applied for the discovery of more complex concepts such as unknown teleconnections or the development of multivariate climate indices and predictive insights.
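The network construction described in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' method: it uses the lag-zero Pearson correlation in place of a full cross correlation function, prunes edges with a hypothetical `threshold` parameter, and omits the community detection step entirely. The function name and the array layout are also assumptions.

```python
import numpy as np

def correlation_network(series, threshold=0.5):
    """Build a weighted adjacency matrix from time series (illustrative only).

    series    : array of shape (n_locations, n_timesteps), one row per location
    threshold : hypothetical pruning level; weaker edges are dropped
    """
    series = np.asarray(series, dtype=float)
    # Pairwise lag-zero Pearson correlation between locations (rows).
    corr = np.corrcoef(series)
    # Use correlation magnitude as edge weight, so strong negative
    # relationships are kept as well as strong positive ones.
    weights = np.abs(corr)
    # Remove self-loops, then prune weak edges.
    np.fill_diagonal(weights, 0.0)
    weights[weights < threshold] = 0.0
    return weights
```

The resulting matrix could then be passed to a community detection routine of the reader's choice. Repeating the construction on consecutive time windows would give one network per period, in the spirit of the structural comparison over time mentioned in the abstract.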
C. Moretti†,
K. Steinhaeuser†, D. Thain and N. V. Chawla (2008). Scaling Up Classifiers to Cloud Computers.
IEEE International Conference on Data Mining (ICDM), Pisa, Italy. † Equal Contribution
Keywords: distributed data mining, cloud computing, large datasets, scalability analysis
© IEEE 2008
As the size of available datasets has grown from Megabytes to Gigabytes and now into Terabytes, machine learning algorithms and computing infrastructures have continuously evolved in an effort to keep pace. But at large scales, mining for useful patterns still presents challenges in terms of data management as well as computation. These issues can be addressed by dividing both data and computation to build ensembles of classifiers in a distributed fashion, but trade-offs in cost, performance, and accuracy must be considered when designing or selecting an appropriate architecture. In this paper, we present an abstraction for scalable data mining that allows us to explore these tradeoffs. Data and computation are distributed to a computing cloud with minimal effort from the user, and multiple models for data management are available depending on the workload and system configuration. We demonstrate the performance and scalability characteristics of our ensembles using a wide variety of datasets and algorithms on a Condor-based pool with Chirp to handle the storage.
Knowledge discovery from temporal, spatial and spatiotemporal data is critical for climate change science and climate impacts. Climate statistics is a mature area. However, recent growth in observations and model outputs, combined with the increased availability of geographical data, presents new opportunities for data miners. This paper maps climate requirements to solutions available in temporal, spatial and spatiotemporal data mining. The challenges result from long-range, long-memory and possibly nonlinear dependence, nonlinear dynamical behavior, presence of thresholds, importance of extreme events or extreme regional stresses caused by global climate change, uncertainty quantification, and the interaction of climate change with the natural and built environments. This paper makes a case for the development of novel algorithms to address these issues, discusses the recent literature, and proposes new directions. An illustrative case study presented here suggests that even relatively simple data mining approaches can provide new scientific insights with high societal impacts.
K. Steinhaeuser and N. V. Chawla (2008). Is Modularity the Answer to Evaluating Community Structure in Networks?
International Conference on Network Science (NetSci), Norwich, UK.
Keywords: complex networks, community detection, evaluation metrics, modularity, rand index
A significant increase in the ability to collect and store diverse information over the past decade has led to an outright data explosion, providing larger and richer datasets than ever before. This proliferation in dataset size is accompanied by the dilemma of successfully analyzing this data to discover patterns of interest. Extreme dataset sizes place unprecedented demands on high-performance computing infrastructures, and a gap has developed between the available real-world datasets and our ability to process them; data volumes are quickly approaching Tera and Petabytes. This rate of increase also defies the subsampling paradigm, as even a subsample of data runs well into Gigabytes. To counter this challenge, we exploit advances in multi-threaded processor technology. We explore massive thread-level parallelism -- provided by the Cray MTA-2 -- as a platform for scalable data mining. We conjecture that such an architecture is well suited for the application of machine learning to large datasets. To this end, we present a thorough complexity analysis and experimental evaluation of a popular decision tree algorithm implemented using fine-grain parallelism, including a comparison to two more conventional architectures. We use diverse datasets with sizes varying in both dimensions (number of records and attributes). Our results lead us to the conclusion that a massively parallel architecture is an appropriate platform for the implementation of highly scalable learning algorithms.
K. Steinhaeuser, N. V. Chawla and P. M. Kogge (2006). Exploiting Thread-Level Parallelism to Build Decision Trees.
ECML/PKDD Workshop on Parallel Data Mining (PDM), Berlin, Germany.
Keywords: high-performance data mining, large datasets, cray mta-2
© Springer 2006
Classification is an important data mining task, and decision trees have emerged as a popular classifier due to their simplicity and relatively low computational complexity. However, as datasets get extremely large, the time required to build a decision tree still becomes intractable. Hence, there is an increasing need for more efficient tree-building algorithms. One approach to this problem involves using a parallel mode of computation. Prior work has successfully used processor-level parallelism to partition the data and computation. We propose to use Cray's Multi-Threaded Architecture (MTA) and extend the idea by employing thread-level parallelism to reduce the execution time of the tree building process. Decision tree building is well-suited for such low-level parallelism as it requires a large number of independent computations. In this paper, we present the analysis and parallel implementation of the ID3 algorithm, along with experimental results.
The management of wireless sensor networks in the presence of multiple constraints is an open problem in systems research. Existing methods perform well when optimized for a single parameter (such as energy, delay, network bandwidth). However, we might want to establish trade-offs on the fly, and optimize the information flow/exchange. This position paper shall serve as a preliminary proof-of-concept that techniques and algorithms from the machine learning and data mining domains can be applied to network data to learn relevant information about the routing behavior of individual nodes and the overall state of the network. We describe two simple examples which demonstrate the application of existing algorithms and analyze the results to illustrate their usefulness.
Last modified: June 25, 2014