Sentiment Analysis - NLTK allowed

k-Fold Cross Validation

  • Break the data into k (e.g. k = 10) train/test splits
  • Train and evaluate on each split, report the average test accuracy (see the sketch below)
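
A minimal sketch of k-fold cross validation, assuming scikit-learn is available; the dataset and classifier here are just placeholders:

    # k-fold cross validation: split into k folds, train on k-1, test on the held-out fold
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)          # placeholder dataset
    clf = DecisionTreeClassifier(max_depth=3)  # placeholder model

    # cv=10 gives the k = 10 splits mentioned above
    scores = cross_val_score(clf, X, y, cv=10)
    print(scores.mean())  # average test accuracy across the 10 folds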

Leave One Out Cross Validation

  • Basically n-fold cross validation, where n is the number of examples
  • n iterations; leave only one example out to test on every iteration (see the sketch below)
  • Good to see what your model is failing on
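
A sketch of leave-one-out cross validation with scikit-learn's LeaveOneOut splitter (again an assumption that scikit-learn is in play; dataset and model are placeholders):

    # Leave-one-out: n iterations, each holding out exactly one example for testing
    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=3)

    scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
    print(scores.mean())  # fraction of held-out examples classified correctly
    # Inspecting which iterations scored 0 shows exactly which examples the model fails on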

Validation Set

Split the original labeled data into a training set, a validation set, and a test set.

“Cultivate a level of paranoia that is tantamount to fear about your model doing some weird shit” - Max
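
One possible way to carve out the three sets, assuming scikit-learn's train_test_split; the 60/20/20 proportions are just an illustrative choice:

    # Split labeled data into train / validation / test (60% / 20% / 20% here)
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # First carve off the test set, then split the remainder into train and validation
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
    # 0.25 of the remaining 80% is 20% of the original data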

Training Failures

  • Overfitting
  • Underfitting
  • Lack of model power
  • Lack of signal in the data

Overfitting

Model fails to generalize - too tuned to the training set (like taking a practice exam over and over and doing really well because you’ve already taken it)

Pruning

Limiting the number of branches of the decision tree. A decision tree with only 2 branches is called a stump.
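
One common way to limit tree size, assuming scikit-learn's DecisionTreeClassifier; the depth values are only illustrative:

    # Restricting tree depth is a simple form of pruning
    from sklearn.tree import DecisionTreeClassifier

    full_tree = DecisionTreeClassifier()              # grows until leaves are pure - prone to overfitting
    pruned    = DecisionTreeClassifier(max_depth=5)   # limited number of levels / branches
    stump     = DecisionTreeClassifier(max_depth=1)   # a single split: a decision stump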

Underfitting

Too general - the model memorized one rule and applies it everywhere. If it can’t do anything on its training set (accuracy low), it’s definitely underfit.

Insufficient Model Power

Some patterns are too complicated to represent with a decision tree. Imagine your label is determined by some complicated function of the features; a decision tree will struggle with this.

Lack of Signal

The pattern cannot be determined from the current features you have. You might just need more or different features.

Class Imbalance

Serious issue - when an overwhelming majority of your data is a single class. Positive vs. negative class:

  • Positive - requires action
  • Negative - default behavior

The positive class is often an uncommon signal we’re trying to detect.

Confusion Matrix
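
A quick sketch of building a confusion matrix with scikit-learn; the labels here are made up for illustration:

    # Confusion matrix: rows = actual class, columns = predicted class
    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 0, 1, 0, 1, 0, 0]   # actual labels (1 = positive, 0 = negative)
    y_pred = [1, 0, 1, 0, 0, 1, 0, 0]   # model predictions

    # [[TN, FP],
    #  [FN, TP]]
    print(confusion_matrix(y_true, y_pred))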

Precision

Measures how many of the things we classified as positive were actually positive

We use precision when we only care about being correct about the things we identify as positive.

  • Google does not care if it turns away 1000 good engineers; Google just wants to make sure the engineers it DOES hire are good
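
In confusion-matrix terms (TP = true positives, FP = false positives):

    Precision = TP / (TP + FP)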

Recall

Recall measures how many of the actual positives we caught - equivalently, how few of the positive class we missed.

Google doesn’t care about recall; cancer screens and TSA screening do. We use recall when we want to make sure we don’t miss anything.
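
In confusion-matrix terms (FN = false negatives):

    Recall = TP / (TP + FN)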

F1 Score

Harmonic mean of precision and recall
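
The harmonic mean works out to F1 = 2 · Precision · Recall / (Precision + Recall). A sketch of computing all three metrics, assuming scikit-learn; the labels are made up for illustration:

    # Precision, recall, and F1 from predictions
    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 0, 0, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

    print(precision_score(y_true, y_pred))  # TP / (TP + FP)
    print(recall_score(y_true, y_pred))     # TP / (TP + FN)
    print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall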

Confidence

ML algorithms can tell you how confident they are in an answer. You may want to treat 51% confidence in the positive class and 99% confidence differently. Many times you might need to choose a threshold for when to classify something as positive vs. negative.
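
A sketch of thresholding predicted probabilities, assuming a scikit-learn classifier with predict_proba; the 0.8 cutoff is just an example:

    # Classify as positive only when the model is confident enough
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)
    clf = LogisticRegression(max_iter=5000).fit(X, y)

    proba_positive = clf.predict_proba(X)[:, 1]      # confidence in the positive class
    y_pred = (proba_positive >= 0.8).astype(int)     # custom threshold instead of the default 0.5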

ROC Curve

As we change the threshold for classification, the model will begin to get more true positives, as well as more false positives. We can graph this as the threshold changes. This is called the Receiver-operating characteristic curve, or ROC curve

Parametric curve: false positive rate (x) vs. true positive rate (y)
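
A sketch of computing the curve’s points with scikit-learn’s roc_curve (the dataset and model are placeholders):

    # ROC curve: sweep the threshold and record (false positive rate, true positive rate) pairs
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve

    X, y = load_breast_cancer(return_X_y=True)
    proba = LogisticRegression(max_iter=5000).fit(X, y).predict_proba(X)[:, 1]

    fpr, tpr, thresholds = roc_curve(y, proba)   # one (fpr, tpr) point per threshold
    # plotting fpr on x and tpr on y draws the ROC curve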

Log Loss

A measure of accuracy that penalizes overconfidence
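
For binary labels y and predicted probabilities p, the usual formulation is the average of -[y·log(p) + (1-y)·log(1-p)]. A small sketch using scikit-learn’s log_loss; the numbers are illustrative:

    # Log loss penalizes confident wrong answers much more than unsure ones
    from sklearn.metrics import log_loss

    y_true    = [1, 0, 1, 1]
    confident = [0.99, 0.01, 0.99, 0.01]   # very sure, but badly wrong on the last example
    cautious  = [0.80, 0.20, 0.80, 0.40]   # hedged predictions, mildly wrong on the last example

    print(log_loss(y_true, confident))   # large penalty from the confident mistake
    print(log_loss(y_true, cautious))    # smaller loss despite lower confidence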

Tuning Hyperparameters

Examples include max-depth (for decision trees) and k (for KNN).

  • Max-depth: between 3 and 7
  • KNN: between 3 and 13
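
A sketch of searching those ranges with scikit-learn’s GridSearchCV (assuming scikit-learn; the ranges mirror the suggestions above and the dataset is a placeholder):

    # Try each hyperparameter value with cross validation and keep the best one
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    tree_search = GridSearchCV(DecisionTreeClassifier(), {"max_depth": list(range(3, 8))}, cv=5)
    knn_search  = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(3, 14))}, cv=5)

    tree_search.fit(X, y)
    knn_search.fit(X, y)
    print(tree_search.best_params_, knn_search.best_params_)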