Midterm Concepts Exam Review
Unit 1
- Differences between association and prediction
- What is supervised vs unsupervised machine learning and examples of each
- What is regression vs. classification and examples of each
- What is reducible vs. irreducible error and what factors contribute to each
- What is the difference between predictors and features
- What is a model configuration and what are the components/dimensions across which model configurations vary
- What is bias, variance, and the bias-variance tradeoff (a worked decomposition is sketched after this list)
- What is overfitting
- What factors affect the bias and variance of a model
- What are pros/cons of model flexibility
- What are pros/cons of model interpretability
- Why do we evaluate models using error in a held-out (validation or test) set?
- How is p-hacking related to overfitting?
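A worked equation may help connect the reducible/irreducible error and bias-variance bullets above. This is the standard decomposition of expected test MSE at a point x_0; the notation is generic, not taken from the course materials:

```latex
% y_0 = f(x_0) + \epsilon, with E[\epsilon] = 0
\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
  = \underbrace{\mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2}_{\text{reducible error}}
  + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible error}}
```

Flexible models tend to lower the bias term and raise the variance term; the irreducible error sets the floor that no model can beat.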
Unit 2
- What is exploratory data analysis (EDA) and why is it important?
- What are the stages of analysis
- What is data leakage, what are examples of it, and how do we prevent it?
- What can you do and not do with training, validation, and test sets to prevent data leakage (see the sketch after this list)
- What are typical visualizations for EDA depending on the measurement of the features/outcome
- What are typical summary statistics for EDA depending on the measurement of the features/outcome
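To make the leakage bullets concrete, here is a minimal scikit-learn sketch (Python/scikit-learn is my illustration choice, not necessarily the course's tooling, and the simulated data are hypothetical): all preprocessing statistics are learned from the training set only, so nothing from the held-out set leaks into fitting.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data standing in for a real dataset
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=1)

# Split before any preprocessing decisions are learned from the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# The scaler is fit on the training set only (inside the pipeline); the test set is
# transformed with training-set means/SDs, so no test information enters model fitting.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))
```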
Unit 3
- What are examples of performance metrics that can be used for regression models?
- What is the general linear model?
- How does it work (how are parameters estimated)
- What assumptions does it make, and what are the consequences of violating those assumptions?
- What is it good for, what is it less good for?
- What transformations and other feature engineering steps are often useful for GLM
- How does KNN work
- What are its assumptions and requirements
- How does it make predictions?
- What does K affect and why would you use higher or lower values
- How do you calculate distance
- What transformations and other feature engineering steps are often useful for KNN
- Compare the strengths and weaknesses of GLM vs. KNN (a brief comparison is sketched below)
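A quick sketch of the GLM vs. KNN contrast, again in scikit-learn with simulated data (an illustrative assumption, not course code): the linear model is parametric and insensitive to feature scale, while KNN is nonparametric, needs standardized features so distances are meaningful, and trades bias for variance as k shrinks.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=5, noise=15.0, random_state=2)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=2)

# GLM-style linear regression: parameters estimated by least squares
lm = LinearRegression().fit(X_train, y_train)
lm_rmse = np.sqrt(mean_squared_error(y_val, lm.predict(X_val)))

# KNN regression: predictions are the mean outcome of the k nearest (standardized) neighbors.
# Smaller k -> more flexible (lower bias, higher variance); larger k -> smoother fit.
for k in (1, 5, 25):
    knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k)).fit(X_train, y_train)
    knn_rmse = np.sqrt(mean_squared_error(y_val, knn.predict(X_val)))
    print(f"k={k:>2}  KNN RMSE = {knn_rmse:6.1f}   linear model RMSE = {lm_rmse:6.1f}")
```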
Unit 4
- What is the Bayes classifier?
- How do we use probability to make class predictions
- What is the error rate of the Bayes classifier?
- What is probability, odds, and odds ratios in classification
- What is logistic regression?
- How does it make predictions?
- What decision boundaries does it support?
- How is KNN adapted for classification and how does it make predictions
- What are its assumptions and requirements
- What decision boundaries does it support
- What transformations and other feature engineering steps are often useful for KNN
- How does Linear discriminant analysis work
- What are its assumptions and requirements
- What decision boundaries does it support
- What transformations and other feature engineering steps are often useful for LDA
- How does Quadratic discriminant analysis work
- What are its assumptions and requirements
- What decision boundaries does it support
- What transformations and other feature engineering steps are often useful for QDA
- What are the relative costs and benefits of these different statistical algorithms (a brief comparison is sketched below)
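As a rough illustration of the Unit 4 algorithms side by side (a scikit-learn sketch on simulated data; the course itself may use different tools), note the decision boundaries each supports: logistic regression and LDA produce linear boundaries, QDA quadratic boundaries, and KNN fully flexible ones.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=3, stratify=y)

models = {
    "logistic regression (linear boundary)": LogisticRegression(),
    "KNN, k=15 (flexible boundary)": make_pipeline(StandardScaler(),
                                                   KNeighborsClassifier(n_neighbors=15)),
    "LDA (linear boundary)": LinearDiscriminantAnalysis(),
    "QDA (quadratic boundary)": QuadraticDiscriminantAnalysis(),
}
for name, clf in models.items():
    acc = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name:40s} held-out accuracy = {acc:.3f}")
```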
Unit 5
What is bias vs. variance with respect to model performance estimates
- How is this different from bias vs. variance of the model itself
- What factors affect model bias/variance
- What factors affect bias and variance of performance estimate
Why do we need training, validation, and test sets, and what do we use each for?
What are the important/common types of resampling and how do you do each of them (see the sketch after this list)?
- Validation set approach
- Leave One Out CV
- K-Fold and Repeated K-Fold
- Bootstrap resampling
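Here is one way to run each resampling scheme, sketched in scikit-learn (an illustration choice, not the course's code; variable names like `boot_mses` are mine). The same model is evaluated under the validation set approach, LOOCV, k-fold, repeated k-fold, and the bootstrap.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import (KFold, LeaveOneOut, RepeatedKFold,
                                     cross_val_score, train_test_split)
from sklearn.utils import resample

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=4)
model = LinearRegression()

# 1) Validation set approach: a single random split (simple, but a noisy estimate)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=4)
val_mse = np.mean((model.fit(X_tr, y_tr).predict(X_va) - y_va) ** 2)

# 2) Leave-one-out CV: n folds of size 1 (nearly unbiased, computationally costly)
loo_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()

# 3) K-fold and repeated k-fold CV (the usual compromise on bias/variance/cost)
kf = KFold(n_splits=10, shuffle=True, random_state=4)
kf_mse = -cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error").mean()
rkf = RepeatedKFold(n_splits=10, n_repeats=5, random_state=4)
rkf_mse = -cross_val_score(model, X, y, cv=rkf, scoring="neg_mean_squared_error").mean()

# 4) Bootstrap: resample rows with replacement, evaluate on the out-of-bag rows
idx = np.arange(len(y))
boot_mses = []
for b in range(100):
    boot = resample(idx, replace=True, n_samples=len(idx), random_state=b)
    oob = np.setdiff1d(idx, boot)
    fit = model.fit(X[boot], y[boot])
    boot_mses.append(np.mean((fit.predict(X[oob]) - y[oob]) ** 2))

print(val_mse, loo_mse, kf_mse, rkf_mse, np.mean(boot_mses))
```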
How do these procedures compare with respect to
- bias of performance estimate
- variance of performance estimate
- computational cost
When/why do you need to do grouped resampling (e.g. Grouped K-fold)
How does varying k in k-fold affect bias and variance of performance estimate?
What is optimization bias and how do we prevent it?
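One standard way to prevent optimization bias is nested cross-validation: tune in an inner loop, then estimate performance in an outer loop whose held-out folds were never used to pick the configuration. A minimal scikit-learn sketch (illustrative only; the tuned hyperparameter here is KNN's k):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=5)

# Inner loop: select k by 5-fold CV on the training portion of each outer fold
inner = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": [1, 5, 15, 25, 51]},
    cv=KFold(n_splits=5, shuffle=True, random_state=5),
)

# Outer loop: estimate the performance of the whole select-then-fit procedure on
# folds never used for selection, so the estimate is not inflated by picking the
# best-looking configuration (optimization bias)
outer_scores = cross_val_score(inner, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=6))
print("nested-CV accuracy:", outer_scores.mean())
```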
Unit 6
What are the subset selection approaches: Forward, Backward, Best Subset (covered in reading only)
- What are their pros/cons and when can they not be used
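For the subset selection bullets, scikit-learn's greedy stepwise selector can stand in for forward and backward selection (a sketch under that assumption; it scores candidate subsets by CV rather than the RSS/Cp-style criteria in the reading, and best subset selection has no built-in implementation because it requires fitting all 2^p subsets):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=20, n_informative=5, random_state=6)

# Forward selection: start empty, greedily add the feature that most improves CV score
forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                    direction="forward", cv=5).fit(X, y)
# Backward selection: start with all features, greedily drop the least useful one
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                     direction="backward", cv=5).fit(X, y)

print("forward keeps features:", forward.get_support(indices=True))
print("backward keeps features:", backward.get_support(indices=True))
```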
Cost and Loss functions
- What are they and how are they used
- What are the specific formulas for the linear model, logistic regression, and variants of glmnet (ridge, LASSO, full elastic net)
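For reference, the cost functions in their common forms (constant scaling factors such as 1/n or 1/(2n) vary across texts and implementations, so treat these as the general shape rather than the exact glmnet parameterization):

```latex
% Linear model: least squares / residual sum of squares
L_{\text{OLS}}(\beta) = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2

% Logistic regression: negative log-likelihood (log loss),
% with p_i = 1 / (1 + e^{-(\beta_0 + \sum_j \beta_j x_{ij})})
L_{\text{logistic}}(\beta) = -\sum_{i=1}^{n}\big[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\,\big]

% glmnet-style penalized cost: loss above plus a penalty controlled by lambda and alpha
L_{\text{glmnet}}(\beta) = L(\beta)
  + \lambda\Big[\alpha \sum_{j=1}^{p}|\beta_j| + \tfrac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2\Big]
% alpha = 1: LASSO (L1 only); alpha = 0: ridge (L2 only); 0 < alpha < 1: full elastic net
```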
What is regularization
- What are its benefits?
- What are its costs?
How does lambda affect the bias-variance trade-off in glmnet?
What does alpha do?
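A small sketch of lambda's effect, using scikit-learn's ElasticNet as a stand-in for glmnet (note the naming clash: scikit-learn's `alpha` plays the role of glmnet's lambda, and `l1_ratio` plays the role of glmnet's alpha; the simulated data are hypothetical):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=20.0, random_state=7)

# Larger lambda -> stronger shrinkage: bias goes up, variance goes down;
# an intermediate value usually minimizes cross-validated error.
for lam in (0.01, 0.1, 1.0, 10.0):
    fit = make_pipeline(StandardScaler(),
                        ElasticNet(alpha=lam, l1_ratio=0.5, max_iter=10_000))
    cv_mse = -cross_val_score(fit, X, y, cv=10, scoring="neg_mean_squared_error").mean()
    print(f"lambda = {lam:<5}  10-fold CV MSE = {cv_mse:,.0f}")
```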
Feature engineering approaches for dimensionality reduction: PCA (covered in reading only; and see appendix)
Other algorithms that do feature selection/dimensionality reduction: PCR and PLS (covered in reading only)
Contrasts of PCA, PCR, PLS, and glmnet/LASSO for dimensionality reduction (covered in reading only)
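To contrast the dimensionality reduction routes, a final sketch (scikit-learn, simulated data, purely illustrative): PCR reduces dimensions without looking at the outcome, PLS chooses components that covary with the outcome, and LASSO keeps the original features but shrinks many coefficients to exactly zero.

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=20.0, random_state=8)

models = {
    # PCR: unsupervised reduction (PCA ignores y), then least squares on the component scores
    "PCR, 5 components": make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression()),
    # PLS: supervised reduction; components are chosen to covary with y
    "PLS, 5 components": make_pipeline(StandardScaler(), PLSRegression(n_components=5)),
    # LASSO: no new components; original features with L1 shrinkage/selection
    "LASSO": make_pipeline(StandardScaler(), Lasso(alpha=1.0, max_iter=10_000)),
}
for name, m in models.items():
    mse = -cross_val_score(m, X, y, cv=10, scoring="neg_mean_squared_error").mean()
    print(f"{name:20s} 10-fold CV MSE = {mse:,.0f}")
```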