Avoiding False Discoveries: A completely new addition in the second edition is a chapter on how to avoid false discoveries and produce valid results, which is novel among other contemporary textbooks on data mining. It supplements the discussions in the other chapters with a discussion of the statistical concepts (statistical significance, p-values, false discovery rate, permutation testing, etc.) relevant to avoiding spurious results, and then illustrates these concepts in the context of data mining techniques. This chapter addresses the increasing concern over the validity and reproducibility of results obtained from data analysis. The addition of this chapter is a recognition of the importance of this topic and an acknowledgment that a deeper understanding of this area is needed for those analyzing data.
-
Classification: Some of the most significant improvements in the text have been in the two chapters on classification. The introductory chapter uses the decision tree classifier for illustration, but the discussion on many topics—those that apply across all classification approaches—has been greatly expanded and clarified, including topics such as overfitting, underfitting, the impact of training size, model complexity, model selection, and common pitfalls in model evaluation. Almost every section of the advanced classification chapter has been significantly updated. The material on Bayesian networks, support vector machines, and artificial neural networks has been significantly expanded. We have added a separate section on deep networks to address the current developments in this area. The discussion of evaluation, which occurs in the section on imbalanced classes, has also been updated and improved.
Anomaly Detection: Anomaly detection has been greatly revised and expanded. Existing approaches—statistical, nearest neighbor/density-based, and clustering based—have been retained and updated, while new approaches have been added: reconstruction-based, one-class classification, and information-theoretic. The reconstruction-based approach is illustrated using autoencoder networks that are part of the deep learning paradigm.
Association Analysis: The changes in association analysis are more localized. We have completely reworked the section on the evaluation of association patterns (introductory chapter), as well as the sections on sequence and graph mining (advanced chapter).
Clustering: Changes to cluster analysis are also localized. The introductory chapter added the K-means initialization technique and an updated discussion of cluster evaluation. The advanced clustering chapter adds a new section on spectral graph clustering.
Data: The data chapter has been updated to include discussions of mutual information and kernel-based techniques.
Exploring Data: The data exploration chapter has been removed from the print edition of the book, but is available on the web.
- Appendices: All appendices are available on the web. A new appendix provides a brief discussion of scalability in the context of big data.
|