Review Final Concepts Exam

Unit 8: Advanced Performance Metrics

  • Understand costs and benefits of accuracy
  • What is a confusion matrix? How to interpret rows and columns?
  • Understand costs and benefits of other performance metrics based on confusion matrix
    • sensitivity
    • specificity
    • positive predictive value
    • negative predictive value
    • balanced accuracy
    • F score
    • Kappa
  • What is an ROC curve, what does it tell you? What is the role of a threshold in the curve?
  • What is the area under the ROC curve? How can it be interpreted and used as a performance metrix
  • What is unbalanced data? What problems does it cause
  • What are solutions to problems with unbalanced data and how do they work?
    • Selection of performance metric
    • Selection of classification threshold
    • Sampling and resampling approaches

Unit 9: Decision Trees, Bagged Trees and Random Forest

  • What is a decision tree?
    • How does it make predictions?
    • What are the components of a decision tree?
    • How are splits determined?
    • What are the advantages and disadvantages of decision trees?
    • What hyperparameters are important for decision trees?
  • How are bagged trees different from simple decision trees?
    • What is the bagging process?
    • How does bagging improve model performance?
    • What hyperparameters are important for bagged trees?
    • What is the effect of these hyperparametrers on the model?
    • What are the advantages and disadvantages of bagged trees?
  • How is Random Forest different from bagged trees?
    • What is the random forest process?
    • How does random forest improve model performance?
    • What hyperparameters are important for random forests?
    • What is the effect of these hyperparametrers on the model?
    • What why is mtry important?
    • What are the advantages and disadvantages of random forests?

Unit 10: Neural Networks

  • What is a neural network?
    • What are the components of a neural network?
    • What is an activation function? What are common activation functions and their costs/benefits
    • How are neural networks different for regression vs. classification vs. multi-class classification?
    • What are the typical hyperparameters for a neural network?
    • How do these hyperparameters affect the model? How do they affect bias/variance?
    • What are the advantages and disadvantages of neural networks?
    • What makes a model a deep learning model?

Unit 11: Explanatory Methods

  • How do we use feature ablation to statistically compare model configurations
  • How do we use Bayesian estimation for model comparisons.
  • How are p-values and posterior probabilities different
  • What is a Bayesian ROPE
  • What are examples of model specific approaches for feature importance
  • What are examples of model agnostic approaches for feature importance
  • How does permutation feature importance work (how are scores calculated) and what does it tell us?
  • What are Shapley values? What are the benefits of this approach?
  • What is the difference between local and global feature importance?
  • What Partial Dependence plots and Accumulated Local Effects (ALE) plots. WHat do they tell us? What are the advantages and disadvantages of each?

Unit 12: NLP

  • What are tokens, what types of tokens are used regularly, and how do you create them?
  • What are n-grams
  • What is stemming and lemmatization? How do they work? What are the advantages and disadvantages of each?
  • What are stop-words? How do you identify them? What are the advantages and disadvantages of removing stop-words?
  • How can you use one-hot encoding in the bag of words approach to create features from words (or other tokens)?
  • What are the disadvantages of using one-hot encoding for words?
  • What is a word embedding? How do you create them using Word2Vec (skip-gram and CBOW)? What are the advantages and disadvantages of word embeddings?
  • What is a sentiment analysis and how might you do it?

Unit 13: Applications

  • What are error analyses?
    • How do you do it?
    • What is an eyeball sample?
    • When is it useful?
    • What are the risks of error analysis?
    • How can you do error analysis without an eyeball sample?
    • How can you do error analysis for regression?
  • What can you learn from comparing training error to validation error?
    • Can you identify situations that indicate bias vs. variance vs. both sources of error
    • What is optimal error and how might you calculate it?
  • What are learning curves?
    • How do you calculate them?
    • What do they tell you?
    • Can you use them to identify more N will improve performance?
    • Can you use the to tell if better features or more complicated model will improve perforance

Unit 14: Ethics

  • What are some key issues to consider when evaluating the ethics of implementing a model. For each of these, can you define the issue, give examples, connect to the factors that affect this issue, and discuss the impact of this dimension on the ethics of the model.
    • model scale
    • model opacity
    • use of a proxy outcome
    • self-perpetuating vs. self-correcting errors
    • factors that contribute to potential damage if model is implemented