Midterm Exam Concepts Review

Unit 1

  • Differences between association and prediction
  • What is supervised vs unsupervised machine learning and examples of each
  • What is regression vs. classification and examples of each
  • What is reducible vs. irreducible error and what factors contribute to each
  • What is the difference between predictors vs. features
  • What is a model configuration and what are the components/dimensions across which model configurations vary
  • What are bias, variance, and the bias-variance tradeoff
  • What is overfitting
  • What factors affect the bias and variance of a model
  • What are pros/cons of model flexibility
  • What are pros/cons of model interpretability
  • Why do we evaluate models using error in a held-out (validation or test) set?
  • How is p-hacking related to overfitting?
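To make overfitting concrete, here is a minimal sketch (not from the course materials; the noisy-sine data and degree choices are hypothetical, and it uses Python/NumPy even though the course tooling may differ). A flexible model always fits the training data at least as well as a simpler one it nests, which is exactly why training error alone cannot detect overfitting and why we need a held-out set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: noisy sine, split into train and validation halves
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)
x_tr, y_tr, x_va, y_va = x[:20], y[:20], x[20:], y[20:]

def mse(deg):
    """Fit a degree-`deg` polynomial on the training half;
    return (training MSE, validation MSE)."""
    coefs = np.polyfit(x_tr, y_tr, deg)
    return (np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2),
            np.mean((np.polyval(coefs, x_va) - y_va) ** 2))

train_simple, val_simple = mse(1)   # inflexible: higher bias, lower variance
train_flex, val_flex = mse(10)      # flexible: lower bias, higher variance

# The flexible fit always wins on training error (it nests the simple fit)...
assert train_flex <= train_simple
# ...so we judge model configurations on held-out error instead.
```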

Unit 2

  • What is exploratory data analysis (EDA) and why is it important?
  • What are the stages of analysis
  • What is data leakage, and what are examples of it? How do we prevent it?
  • What can you do and not do with training, validation, and test sets to prevent data leakage
  • What are typical visualizations for EDA depending on the measurement of the features/outcome
  • What are typical summary statistics for EDA depending on the measurement of the features/outcome
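As a concrete reminder of how summary statistics depend on measurement type, here is a minimal stdlib-only sketch with hypothetical toy data (numeric features get central tendency and spread; categorical features get counts and proportions):

```python
import statistics
from collections import Counter

# Hypothetical toy data: one numeric feature, one categorical feature
age = [23, 35, 31, 40, 29, 35]
species = ["cat", "dog", "dog", "cat", "dog", "bird"]

# Numeric feature: central tendency and spread
numeric_summary = {
    "mean": statistics.mean(age),
    "median": statistics.median(age),
    "sd": statistics.stdev(age),
    "min": min(age),
    "max": max(age),
}

# Categorical feature: counts and proportions
counts = Counter(species)
proportions = {k: n / len(species) for k, n in counts.items()}

print(numeric_summary)
print(counts, proportions)
```

The same split drives visualization choices: histograms/density plots and boxplots for numeric features, bar plots for categorical ones.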

Unit 3

  • What are examples of performance metrics that can be used for regression models?
  • What is the general linear model?
    • How does it work (how are parameters estimated)
    • What assumptions does it make, and what are the consequences of violating those assumptions?
    • What is it good for, what is it less good for?
    • What transformations and other feature engineering steps are often useful for GLM
  • How does KNN work
    • What are its assumptions and requirements
    • How does it make predictions?
    • What does K affect and why would you use higher or lower values
    • How do you calculate distance
    • What transformations and other feature engineering steps are often useful for KNN
  • Compare the strengths and weaknesses of GLM vs. KNN
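To make the KNN mechanics concrete, here is a minimal from-scratch sketch (hypothetical data; illustration only, not the course implementation): compute Euclidean distances to all training points, take the k nearest, and average their outcomes. Note the comment on scaling, since distance-based methods require comparable feature scales:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict for one new point: average y over the k nearest
    training points by Euclidean distance. Features should be
    standardized first, because distances are scale-sensitive."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Hypothetical one-feature training data
X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.0, 2.0, 3.0, 10.0])

print(knn_predict(X, y, np.array([2.1]), k=3))  # averages y at x=1,2,3 -> 2.0
```

Smaller k gives a more flexible fit (lower bias, higher variance); larger k smooths the prediction (higher bias, lower variance).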

Unit 4

  • What is the Bayes classifier?
  • How do we use probability to make class predictions
  • What is the error rate of the Bayes classifier?
  • What are probability, odds, and odds ratios in classification
  • What is logistic regression?
    • How does it make predictions?
    • What decision boundaries does it support?
  • How is KNN adapted for classification and how does it make predictions
    • What are its assumptions and requirements
    • What decision boundaries does it support
    • What transformations and other feature engineering steps are often useful for KNN
  • How does Linear discriminant analysis work
    • What are its assumptions and requirements
    • What decision boundaries does it support
    • What transformations and other feature engineering steps are often useful for LDA
  • How does Quadratic discriminant analysis work
    • What are its assumptions and requirements
    • What decision boundaries does it support
    • What transformations and other feature engineering steps are often useful for QDA
  • What are the relative costs and benefits of these different statistical algorithms
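The probability/odds/log-odds relationships and the logistic prediction rule can be sketched in a few lines (hypothetical coefficients b0 and b1; a minimal illustration, not the course's fitting code):

```python
import math

def sigmoid(z):
    """Inverse logit: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

p = 0.8
odds = p / (1 - p)          # 4.0: "4 to 1 in favor"
log_odds = math.log(odds)

# Round-trip: probability -> log-odds -> probability
assert abs(sigmoid(log_odds) - p) < 1e-12

# Logistic regression with hypothetical coefficients: the model is
# linear in the log-odds, so a 0.5 probability threshold (log-odds = 0)
# gives a linear decision boundary.
b0, b1 = -1.0, 2.0
x = 1.5
p_hat = sigmoid(b0 + b1 * x)
label = int(p_hat > 0.5)
print(p_hat, label)
```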

Unit 5

  • What is bias vs. variance with respect to model performance estimates

    • How is this different from bias vs. variance of the model itself
    • What factors affect model bias/variance
    • What factors affect bias and variance of performance estimate
  • Why do we need training, validation, and test sets, and what do we use each for?

  • What are the important/common types of resampling and how do you do each of them?

    • Validation set approach
    • Leave One Out CV
    • K-Fold and Repeated K-Fold
    • Bootstrap resampling
  • How do these procedures compare with respect to

    • bias of performance estimate
    • variance of performance estimate
    • computational cost
  • When/why do you need to do grouped resampling (e.g. Grouped K-fold)

  • How does varying k in k-fold affect bias and variance of performance estimate?

  • What is optimization bias and how do we prevent it?
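The mechanics of K-fold can be sketched from scratch (an illustration with hypothetical n and k, not a library call): shuffle the indices, split into k folds, and rotate which fold is held out so every observation lands in a validation set exactly once:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1, split into k folds, and yield
    (train_idx, val_idx) pairs; each observation appears in a
    validation fold exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# Sanity check: with n=10, k=5, the validation folds partition the data
all_val = np.concatenate([val for _, val in kfold_indices(10, 5)])
assert sorted(all_val.tolist()) == list(range(10))
```

For grouped K-fold, the same rotation is applied to group labels rather than row indices, so all rows from a group stay on the same side of the split.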

Unit 6

  • What are the subset selection approaches: forward, backward, and best subset selection (covered in reading only)

    • What are their pros/cons and when can they not be used
  • Cost and Loss functions

    • What are they and how are they used
    • What are the specific formulas for linear model, logistic regression, and variants of glmnet (ridge, LASSO, full elasticnet)
  • What is regularization

    • What are its benefits?
    • What are its costs?
  • How does lambda affect the bias-variance trade-off in glmnet

  • What does alpha do?

  • Feature engineering approaches for dimensionality reduction: PCA (covered in reading only; and see appendix)

  • Other algorithms that do feature selection/dimensionality reduction: PCR and PLS (covered in reading only)

  • Contrasts of PCA, PCR, PLS, and glmnet/LASSO for dimensionality reduction (covered in reading only)
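To see what lambda does, here is a minimal NumPy sketch of ridge regression via its closed form (the course uses glmnet, which scales the penalty differently and has no closed form for LASSO/elastic net; the data here are simulated, and this is an illustration of the shrinkage idea only):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution minimizing ||y - Xb||^2 + lam * ||b||^2.
    Assumes X is standardized and y centered, so there is no
    (unpenalized) intercept to handle."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=50)
y = y - y.mean()

b_small = ridge(X, y, lam=0.01)   # near-OLS: low bias, higher variance
b_large = ridge(X, y, lam=100.0)  # heavy shrinkage: more bias, less variance

# Larger lambda shrinks the coefficient vector toward zero
assert np.linalg.norm(b_large) < np.linalg.norm(b_small)
```

In glmnet terms, lambda sets the overall penalty strength (the bias-variance dial), while alpha mixes the ridge (L2) and LASSO (L1) penalties; only the L1 part zeroes coefficients out entirely.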