Practical Data Science with R, Second Edition / Mount, John.

Practical Data Science with R, Second Edition is a task-based tutorial that leads readers through dozens of useful data analysis practices using the R language. By concentrating on the most important tasks you'll face on the job, this friendly guide is comfortable both for business analysts an...

Bibliographic Details
Online Access: Full Text (via O'Reilly/Safari)
Main Authors: Mount, John (Author), Zumel, Nina (Author)
Corporate Author: Safari, an O'Reilly Media Company
Format: eBook
Language: English
Published: Manning Publications, 2019.
Edition: 2nd edition.
Table of Contents:
  • Intro
  • Practical Data Science with R, Second Edition
  • Nina Zumel and John Mount
  • Copyright
  • Dedication
  • Brief Table of Contents
  • Table of Contents
  • Praise for the First Edition
  • front matter
  • Foreword
  • Preface
  • Acknowledgments
  • About This Book
  • What is data science?
  • Roadmap
  • Audience
  • What is not in this book?
  • Code conventions and downloads
  • Working with this book
  • Downloading the book's supporting materials/repository
  • Book forum
  • About the Authors
  • About the Foreword Authors
  • About the Cover Illustration
  • Part 1. Introduction to data science
  • Chapter 1. The data science process
  • 1.1. The roles in a data science project
  • 1.1.1. Project roles
  • 1.2. Stages of a data science project
  • 1.2.1. Defining the goal
  • 1.2.2. Data collection and management
  • 1.2.3. Modeling
  • 1.2.4. Model evaluation and critique
  • 1.2.5. Presentation and documentation
  • 1.2.6. Model deployment and maintenance
  • 1.3. Setting expectations
  • 1.3.1. Determining lower bounds on model performance
  • Summary
  • Chapter 2. Starting with R and data
  • 2.1. Starting with R
  • 2.1.1. Installing R, tools, and examples
  • 2.1.2. R programming
  • 2.2. Working with data from files
  • 2.2.1. Working with well-structured data from files or URLs
  • 2.2.2. Using R with less-structured data
  • 2.3. Working with relational databases
  • 2.3.1. A production-size example
  • Summary
  • Chapter 3. Exploring data
  • 3.1. Using summary statistics to spot problems
  • 3.1.1. Typical problems revealed by data summaries
  • 3.2. Spotting problems using graphics and visualization
  • 3.2.1. Visually checking distributions for a single variable
  • 3.2.2. Visually checking relationships between two variables
  • Summary
  • Chapter 4. Managing data
  • 4.1. Cleaning data
  • 4.1.1. Domain-specific data cleaning
  • 4.1.2. Treating missing values
  • 4.1.3. The vtreat package for automatically treating missing variables
  • 4.2. Data transformations
  • 4.2.1. Normalization
  • 4.2.2. Centering and scaling
  • 4.2.3. Log transformations for skewed and wide distributions
  • 4.3. Sampling for modeling and validation
  • 4.3.1. Test and training splits
  • 4.3.2. Creating a sample group column
  • 4.3.3. Record grouping
  • 4.3.4. Data provenance
  • Summary
  • Chapter 5. Data engineering and data shaping
  • 5.1. Data selection
  • 5.1.1. Subsetting rows and columns
  • 5.1.2. Removing records with incomplete data
  • 5.1.3. Ordering rows
  • 5.2. Basic data transforms
  • 5.2.1. Adding new columns
  • 5.2.2. Other simple operations
  • 5.3. Aggregating transforms
  • 5.3.1. Combining many rows into summary rows
  • 5.4. Multitable data transforms
  • 5.4.1. Combining two or more ordered data frames quickly
  • 5.4.2. Principal methods to combine data from multiple tables
  • 5.5. Reshaping transforms
  • 5.5.1. Moving data from wide to tall form
  • 5.5.2. Moving data from tall to wide form
  • 5.5.3. Data coordinates
  • Summary
  • Part 2. Modeling methods
  • Chapter 6. Choosing and evaluating models
  • 6.1. Mapping problems to machine learning tasks
  • 6.1.1. Classification problems
  • 6.1.2. Scoring problems
  • 6.1.3. Grouping: working without known targets
  • 6.1.4. Problem-to-method mapping
  • 6.2. Evaluating models
  • 6.2.1. Overfitting
  • 6.2.2. Measures of model performance
  • 6.2.3. Evaluating classification models
  • 6.2.4. Evaluating scoring models
  • 6.2.5. Evaluating probability models
  • 6.3. Local interpretable model-agnostic explanations (LIME) for explaining model predictions
  • 6.3.1. LIME: Automated sanity checking
  • 6.3.2. Walking through LIME: A small example
  • 6.3.3. LIME for text classification
  • 6.3.4. Training the text classifier
  • 6.3.5. Explaining the classifier's predictions
  • Summary
  • Chapter 7. Linear and logistic regression
  • 7.1. Using linear regression
  • 7.1.1. Understanding linear regression
  • 7.1.2. Building a linear regression model
  • 7.1.3. Making predictions
  • 7.1.4. Finding relations and extracting advice
  • 7.1.5. Reading the model summary and characterizing coefficient quality
  • 7.1.6. Linear regression takeaways
  • 7.2. Using logistic regression
  • 7.2.1. Understanding logistic regression
  • 7.2.2. Building a logistic regression model
  • 7.2.3. Making predictions
  • 7.2.4. Finding relations and extracting advice from logistic models
  • 7.2.5. Reading the model summary and characterizing coefficients
  • 7.2.6. Logistic regression takeaways
  • 7.3. Regularization
  • 7.3.1. An example of quasi-separation
  • 7.3.2. The types of regularized regression
  • 7.3.3. Regularized regression with glmnet
  • Summary
  • Chapter 8. Advanced data preparation
  • 8.1. The purpose of the vtreat package
  • 8.2. KDD and KDD Cup 2009
  • 8.2.1. Getting started with KDD Cup 2009 data
  • 8.2.2. The bull-in-the-china-shop approach
  • 8.3. Basic data preparation for classification
  • 8.3.1. The variable score frame
  • 8.4. Advanced data preparation for classification
  • 8.4.1. Using mkCrossFrameCExperiment()
  • 8.4.2. Building a model
  • Building a multivariable model
  • Evaluating the model
  • 8.5. Preparing data for regression modeling
  • 8.6. Mastering the vtreat package
  • 8.6.1. The vtreat phases
  • 8.6.2. Missing values
  • 8.6.3. Indicator variables
  • 8.6.4. Impact coding
  • 8.6.5. The treatment plan
  • 8.6.6. The cross-frame
  • Summary
  • Chapter 9. Unsupervised methods
  • 9.1. Cluster analysis
  • 9.1.1. Distances
  • 9.1.2. Preparing the data
  • 9.1.3. Hierarchical clustering with hclust
  • 9.1.4. The k-means algorithm
  • 9.1.5. Assigning new points to clusters
  • 9.1.6. Clustering takeaways
  • 9.2. Association rules
  • 9.2.1. Overview of association rules
  • 9.2.2. The example problem
  • 9.2.3. Mining association rules with the arules package
  • 9.2.4. Association rule takeaways
  • Summary
  • Chapter 10. Exploring advanced methods
  • 10.1. Tree-based methods
  • 10.1.1. A basic decision tree
  • 10.1.2. Using bagging to improve prediction
  • 10.1.3. Using random forests to further improve prediction
  • 10.1.4. Gradient-boosted trees
  • 10.1.5. Tree-based model takeaways
  • 10.2. Using generalized additive models (GAMs) to learn non-monotone relationships
  • 10.2.1. Understanding GAMs
  • 10.2.2. A one-dimensional regression example
  • 10.2.3. Extracting the non-linear relationships
  • 10.2.4. Using GAM on actual data
  • 10.2.5. Using GAM for logistic regression
  • 10.2.6. GAM takeaways
  • 10.3. Solving "inseparable" problems using support vector machines
  • 10.3.1. Using an SVM to solve a problem
  • 10.3.2. Understanding support vector machines
  • 10.3.3. Understanding kernel functions
  • 10.3.4. Support vector machine and kernel methods takeaways
  • Summary
  • Part 3. Working in the real world
  • Chapter 11. Documentation and deployment
  • 11.1. Predicting buzz
  • 11.2. Using R markdown to produce milestone documentation
  • 11.2.1. What is R markdown?
  • 11.2.2. knitr technical details
  • 11.2.3. Using knitr to document the Buzz data and produce the model
  • 11.3. Using comments and version control for running documentation
  • 11.3.1. Writing effective comments
  • 11.3.2. Using version control to record history
  • 11.3.3. Using version control to explore your project
  • 11.3.4. Using version control to share work
  • 11.4. Deploying models
  • 11.4.1. Deploying demonstrations using Shiny
  • 11.4.2. Deploying models as HTTP services
  • 11.4.3. Deploying models by export
  • 11.4.4. What to take away
  • Summary
  • Chapter 12. Producing effective presentations
  • 12.1. Presenting your results to the project sponsor
  • 12.1.1. Summarizing the project's goals
  • 12.1.2. Stating the project's results
  • 12.1.3. Filling in the details
  • 12.1.4. Making recommendations and discussing future work
  • 12.1.5. Project sponsor presentation takeaways
  • 12.2. Presenting your model to end users
  • 12.2.1. Summarizing the project goals
  • 12.2.2. Showing how the model fits user workflow
  • 12.2.3. Showing how to use the model
  • 12.2.4. End user presentation takeaways
  • 12.3. Presenting your work to other data scientists
  • 12.3.1. Introducing the problem
  • 12.3.2. Discussing related work
  • 12.3.3. Discussing your approach
  • 12.3.4. Discussing results and future work
  • 12.3.5. Peer presentation takeaways
  • Summary
  • Appendix A. Starting with R and other tools
  • A.1. Installing the tools
  • A.1.1. Installing Tools
  • A.1.2. The R package system
  • A.1.3. Installing Git
  • A.1.4. Installing RStudio
  • A.1.5. R resources
  • A.2. Starting with R
  • A.2.1. Primary features of R
  • A.2.2. Primary R data types
  • A.3. Using databases with R
  • A.3.1. Running database queries using a query generator
  • A.3.2. How to think relationally about data
  • A.4. The takeaway
  • Appendix B. Important statistical concepts
  • B.1. Distributions
  • B.1.1. Normal distribution
  • B.1.2. Summarizing R's distribution naming conventions
  • B.1.3. Lognormal distribution
  • B.1.4. Binomial distribution
  • B.1.5. More R tools for distributions
  • B.2. Statistical theory
  • B.2.1. Statistical philosophy
  • B.2.2. A/B tests
  • B.2.3. Power of tests
  • B.2.4. Specialized statistical tests
  • B.3. Examples of the statistical view of data
  • B.3.1. Sampling bias
  • B.3.2. Omitted variable bias
  • B.4. The takeaway
  • Appendix C. Bibliography
  • Practical Data Science with R
  • Index
  • List of Figures
  • List of Tables
  • List of Listings