Course Materials

Unit 1: Overview

Reading

Yarkoni and Westfall (2017) paper
James et al. (2023) Chapter 2, pp 15 - 42

Slide decks

Videos

Lecture 1: An Introductory Framework ~ 9 mins
Lecture 2: More Details on Supervised Techniques ~ 23 mins
Lecture 3: Key Terminology in Context ~ 11 mins
Lecture 4: An Example of Bias-Variance Tradeoff ~ 27 mins
Discussion - Tuesday
Discussion - Thursday

Lab Materials

None this week

Application Assignment

No assignment this week

Quiz

Submit the unit quiz by 8 pm on Wednesday, January 22nd

Unit 2: Exploratory Data Analysis

Reading

[NOTE: These are short chapters. You are reading to understand the framework of visualizing data in R. Don’t feel like you have to memorize the details. These are reference materials that you can turn back to when you need to write code!]

Wickham, Çetinkaya-Rundel, and Grolemund (2023) Chapter 1, Data Visualization
Wickham, Çetinkaya-Rundel, and Grolemund (2023) Chapter 9, Layers
Wickham, Çetinkaya-Rundel, and Grolemund (2023) Chapter 10, Exploratory Data Analysis

Slide decks

Videos

Lecture 1: Stages of Data Analysis and Model Development ~ 10 mins
Lecture 2: Best Practices and Other Recommendations ~ 27 mins
Lecture 3: EDA for Data Cleaning ~ 41 mins
Lecture 4: EDA for Modeling - Univariate ~ 24 mins
Lecture 5: EDA for Modeling - Bivariate ~ 20 mins
Lecture 6: Working with Recipes ~ 13 mins
Discussion
Lab w/Zihan

Lab Materials (Zihan - Jan 28th)

Lab 1 w/Zihan

Application Assignment

data
data dictionary
cleaning EDA: qmd
modeling EDA: qmd
solutions: cleaning EDA; modeling EDA
Submit the application assignment by 8 pm on Wednesday, January 29th.

Quiz

Submit the unit quiz by 8 pm on Wednesday, January 29th.

Unit 3: Introduction to Regression Models

Reading

James et al. (2023) Chapter 3, pp 59 - 109

Slide decks

Videos

Lecture 1: Overview ~ 13 mins
Lecture 2: The Simple Linear Model, Part 1 ~ 16 mins
Lecture 3: The Simple Linear Model, Part 2 ~ 12 mins
Lecture 4: Extension to Multiple Predictors ~ 15 mins
Lecture 5: Extension to Categorical Predictors ~ 30 mins
Lecture 6: Extension to Interactions and Non-Linear Effects ~ 11 mins
Lecture 7: Introduction to KNN ~ 9 mins
Lecture 8: The hyperparameter k ~ 13 mins
Lecture 9: Distance and Scaling in KNN ~ 9 mins
Lecture 10: KNN with Ames ~ 12 mins
Discussion
Lab w/Coco - Feb 4th

Lab Materials

Lab 2 w/Coco - Feb 4th

Application Assignment

clean data: train; validate; test
data dictionary
lm qmd
knn qmd
solution: lm; knn
Submit the application assignment by 8 pm on Wednesday, February 5th.

Quiz

Submit the unit quiz by 8 pm on Wednesday, February 5th.

Unit 4: Introduction to Classification Models

Reading

James et al. (2023) Chapter 4, pp 129 - 164

Slide decks

Videos

Lecture 1: The Bayes Classifier ~ 9 mins
Lecture 2: Conceptual Overview of Logistic Regression ~ 19 mins
Lecture 3: EDA with the Cars Dataset ~ 12 mins
Lecture 4: Logistic Regression with Cars Dataset ~ 32 mins
Lecture 5: KNN with Cars Dataset ~ 19 mins
Lecture 6: LDA, QDA, RDA with Cars Dataset ~ 16 mins
Lecture 7: Comparisons among Classifiers ~ 11 mins
Discussion
Lab w/Zihan - Feb 11

Lab Materials

Lab 3 w/Zihan - Feb 11th

Application Assignment

data: raw; test
data dictionary
shells: cleaning EDA qmd; rda qmd; knn qmd
solution: modeling EDA; rda; knn
Submit the application assignment by 8 pm on Wednesday, February 12th.

Quiz

Submit the unit quiz by 8 pm on Wednesday, February 12th.

Unit 5: Resampling Methods for Model Selection and Evaluation

Reading

Kuhn and Johnson (2018) Chapter 4, pp 61 - 80
Supplemental: James et al. (2023) Chapter 5, pp 197 - 208

Slide decks

Videos

Lecture 1: Overview & Parallel Processing ~ 16 mins
Lecture 2: Introduction to Resampling ~ 11 mins
Lecture 3: Single Validation/Test Set Approach ~ 26 mins
Lecture 4: Leave One Out Cross Validation ~ 9 mins
Lecture 5: K-fold Cross Validation Approaches ~ 21 mins
Lecture 6: Repeated and Grouped K-fold Approaches ~ 11 mins
Lecture 7: Bootstrap Resampling ~ 11 mins
Lecture 8: Using Resampling to Select Best Model Configurations ~ 17 mins
Lecture 9: Resampling for Both Model Selection and Evaluation ~ 11 mins
Lecture 10: Nested Resampling ~ 14 mins
Discussion
Lab

Lab Materials

Lab - Feb 18 w/Coco

Application Assignment

data
data dictionary
qmd shell
solution
Submit the application assignment by 8 pm on Wednesday, February 19th.

Quiz

Submit the unit quiz by 8 pm on Wednesday, February 19th.

Unit 6: Regularization and Penalized Models

Reading

James et al. (2023) Chapter 6, pp 225 - 267

Slide decks

Videos

Lecture 1: An Introduction to Penalized/Regularized Algorithms ~ 15 mins
Lecture 2: Intuitions about Penalized Cost Functions and Regularization ~ 11 mins
Lecture 3: Ridge Regression ~ 9 mins
Lecture 4: LASSO ~ 8 mins
Lecture 5: The Elastic net ~ 4 mins
Lecture 6: Empirical Example - Many good predictors ~ 23 mins
Lecture 7: Empirical Example - Good and zero predictors ~ 9 mins
Lecture 8: Empirical Example - LASSO for covariate selection ~ 8 mins
Discussion
Lab w/Zihan - Feb 25th

Lab Materials

Application Assignment

data
data dictionary
qmd shell
solution
Submit the application assignment by 8 pm on Wednesday, February 26th.

Quiz

Submit the unit quiz by 8 pm on Wednesday, February 26th.

Unit 8: Advanced Performance Metrics

Reading

Kuhn and Johnson (2018) Chapter 11, pp 247-266
Kuhn and Johnson (2018) Chapter 16, pp 419-435
Wyant et al., in press

Slide decks

Videos

Lecture 1: Unit Introduction ~ 15 mins
Lecture 2: The Confusion Matrix, Part 1 ~ 33 mins
Lecture 3: The Confusion Matrix, Part 2 ~ 11 mins
Lecture 4: The Receiver Operating Characteristic (ROC) Curve ~ 25 mins
Lecture 5: Selecting Model Configurations with Other Metrics ~ 10 mins
Lecture 6: Addressing Class Imbalance ~ 24 mins
Discussion
Lab - March 11th

Lab Materials

Application Assignment

data
qmd shell
solution
Submit the application assignment by noon on Friday, March 14th.

Quiz

Submit the unit quiz by 8 pm on Wednesday, March 12th.

Unit 9: Decision Trees, Bagging, and Random Forest

Reading

James et al. (2023) Chapter 8, Tree Based Methods; pp 327 - 352

In addition, much of the content from this unit has been drawn from four chapters in a book called Hands-On Machine Learning with R. It is a great book and I used it heavily (and at times verbatim) b/c it is quite clear in its coverage of these algorithms. If you want more depth, you might read chapters 9-12 from this book as a supplement to this unit in our course.

Slide decks - Lecture - Discussion

Videos

Lecture 1: Decision Trees ~ 30 mins
Lecture 2: Decision Trees in Ames ~ 20 mins
Lecture 3: Bagged Trees ~ 10 mins
Lecture 4: Bagged Trees in Ames ~ 6 mins
Lecture 5: Random Forest ~ 16 mins
Discussion
Lab - March 18th

Lab Materials

Application Assignment

data
qmd shell
solution
Submit the application assignment by noon on Friday, March 21st.

Quiz

Submit the unit quiz by 8 pm on Wednesday, March 19th.

Unit 10: Neural Networks

Reading

Neural Networks and Deep Learning, Chapter 1: Using neural networks to recognize handwritten digits

Slide decks

Videos

Lecture 1: But what is a Neural Network? ~ 19 mins
Lecture 2: Gradient descent, how neural networks learn ~ 21 mins
Lecture 3: What is backpropagation really doing? ~ 13 mins
Optional Lecture 4: Backpropagation calculus ~ 10 mins
Lecture 5: Introduction and the MNIST dataset ~ 14 mins
Lecture 6: Fitting neural networks in tidymodels with Keras ~ 18 mins
Lecture 7: Addressing overfitting - Intro and L2 ~ 5 mins
Lecture 8: Addressing overfitting - Dropout ~ 4 mins
Lecture 9: Addressing overfitting - Early stopping ~ 8 mins
Lecture 10: Selecting model configurations and final remarks ~ 8 mins
Discussion
Lab

Lab Materials

Application Assignment

wine quality datasets: training; test
qmd shell
solution

Submit the application assignment here by noon on Friday, April 4th

Quiz

Complete the unit quiz by 8 pm on Wednesday, April 2nd

Unit 11: Explanatory Approaches

Reading

Benavoli et al. (2017) paper: Read pages 1-9 that describe the correlated t-test and its limitations.
Kruschke (2018) paper: Describes Bayesian estimation and the ROPE (generally, not in the context of machine learning and model comparisons)

And these chapters in the book Interpretable Machine Learning. They are all short!

Molnar (2023) Chapter 2 - Interpretability
Molnar (2023) Chapter 3 - Goals of Interpretability
Molnar (2023) Chapter 4 - Methods Overview
Molnar (2023) Chapter 17 - Shapley Values
Molnar (2023) Chapter 18 - SHAP
Molnar (2023) Chapter 19 - Partial Dependence Plots
Molnar (2023) Chapter 20 - Accumulated Local Effects (ALE)
Molnar (2023) Chapter 21 - Feature Interactions
Molnar (2023) Chapter 23 - Permutation Feature Importance

Slide decks

Videos

Introduction to Model Comparisons ~ 6 mins
An Empirical Example of Feature Ablation ~ 13 mins
The Nadeau & Bengio Correlated t-test for Model Comparisons ~ 9 mins
Bayesian Estimation for Model Comparisons ~ 28 mins
Introduction to Feature Importance and the DALEX package ~ 11 mins
Permutation Feature Importance ~ 7 mins
SHAP Feature Importance ~ 14 mins
Visual Approaches to Understand Models ~ 11 mins
Discussion
Lab

Note: this week's lab recording was mistakenly limited to the speaker view, not the screen, until about 11 minutes in. The contents are all in the lab html. The Keras demo and early stopping usage start at around 45 minutes.

Lab Materials

Application Assignment

Submit the application assignment here by noon on Friday, April 11th

Quiz

Complete the unit quiz by 8 pm on Wednesday, April 9th

Unit 12: NLP

Reading

Hvitfeldt and Silge (2022) Chapter 2: Tokenization
Hvitfeldt and Silge (2022) Chapter 5: Word Embeddings

NOTES: Please read the above chapters more with an eye toward concepts and issues rather than code. I will demonstrate a minimum set of functions to accomplish the NLP modeling tasks for this unit.

Also know that the entire Hvitfeldt and Silge (2022) book is really mandatory reading. I would also strongly recommend the entire Silge and Robinson (2017) book. Both will be important references at a minimum.

Slide decks

Videos

Lecture 1: General Text (Pre-) Processing - the stringr package ~ 9 mins
Lecture 2: General Text (Pre-) Processing - regular expressions ~ 13 mins
Lecture 3: The IMDB Reviews Dataset ~ 6 mins
Lecture 4: Tokenization - Part 1 ~ 27 mins
Lecture 5: Tokenization - Part 2 ~ 13 mins
Lecture 6: Stopwords ~ 12 mins
Lecture 7: Stemming ~ 12 mins
Lecture 8: Bag of Words ~ 19 mins
Lecture 9: NLP in Action - Part 1 ~ 17 mins
Lecture 10: NLP in Action - Part 2 ~ 20 mins
Discussion
Lab

Lab Materials

Application Assignment

Submit the application assignment here by noon on Friday, April 18th

Quiz

Complete the unit quiz by 8 pm on Wednesday, April 16th

Unit 13: Applications

Reading

Ng (2018) pdf

Slide decks

Discussion

Videos

No lectures this week. Only lab and discussion section.

Lab Materials

Application Assignment

No assignment this week!

Quiz

Complete the unit quiz by 8 pm on Wednesday, April 23rd

Unit 14: Ethics

Reading

The readings this week will come from O’Neil (2016). We will read the introduction, chapters 1, 3, 5, and the conclusion and afterword sections. A pdf of the book will be shared directly with you.
We will also read this article on emerging methods and tools for assessing model fairness.

Slide decks

Discussion

Videos

No lectures this week. Only discussion section.

Application Assignment

No assignment this week!

Quiz

Complete the unit quiz by 8 pm on Wednesday, April 30th

Unit 15: Final Exam Review

References

Benavoli, Alessio, Giorgio Corani, Janez Demšar, and Marco Zaffalon. 2017. “Time for a Change: A Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis.” Journal of Machine Learning Research 18: 1–36.

Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R. https://smltar.com/.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2023. An Introduction to Statistical Learning: With Applications in R. 2nd ed. Springer Texts in Statistics. New York: Springer-Verlag.

Kruschke, John K. 2018. “Rejecting or Accepting Parameter Values in Bayesian Estimation.” Advances in Methods and Practices in Psychological Science 1: 270–80.

Kuhn, Max, and Kjell Johnson. 2018. Applied Predictive Modeling. 1st ed. 2013, Corr. 2nd printing 2018 edition. New York: Springer.

Molnar, Christoph. 2023. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 2nd ed. https://christophm.github.io/interpretable-ml-book/.

Ng, Andrew. 2018. Machine Learning Yearning: Technical Strategy for AI Engineers in the Age of Deep Learning. DeepLearning.AI.

O’Neil, Cathy. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Reprint Edition. Broadway Books.

Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. 1st ed. Beijing; Boston: O’Reilly Media.

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Visualize, Model, Transform, and Import Data. 2nd ed. https://r4ds.hadley.nz/.

Yarkoni, Tal, and Jacob Westfall. 2017. “Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning.” Perspectives on Psychological Science 12 (6): 1100–1122.