Sentiment Analysis - we're allowed to use NLTK
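One possible starting point (just a sketch, not necessarily the intended approach for the assignment) is NLTK's built-in VADER analyzer; the example sentence is made up:

```python
# Minimal lexicon-based sentiment scoring with NLTK's VADER analyzer.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time resource download

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("This movie was surprisingly great!")
print(scores)  # dict with 'neg', 'neu', 'pos', and 'compound' scores
```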
k-Fold Cross Validation
- Break the data into k (e.g., k = 10) train/test splits
- Train and evaluate on each split; report the average test accuracy (see the sketch below)
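A minimal sketch of 10-fold cross validation with scikit-learn; the decision tree and synthetic data are placeholders, not from the notes:

```python
# 10-fold cross validation: train on 9 folds, test on the held-out fold, repeat.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(scores.mean())  # average test accuracy across the 10 folds
```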
Leave One Out Cross Validation
- Basically n-fold cross validation, where n is the number of examples
- n iterations; leave only one example out to test on in each iteration (sketch after this list)
- Good to see what your model is failing on
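Same idea with scikit-learn's LeaveOneOut splitter; again the data and model are placeholder assumptions:

```python
# Leave-one-out CV: one fold per example, so n_samples separate fits.
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print(scores.mean())  # fraction of held-out points classified correctly
```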
Validation Set
Original labeled data → (training set, validation set, test set)
“Cultivate a level of paranoia that is tantamount to fear about your model doing some weird shit” - Max
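One way to do the three-way split with scikit-learn (the 60/20/20 proportions are an assumption, not from the notes):

```python
# Split labeled data into train / validation / test sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
# First hold out 20% as the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# Result: 60% train, 20% validation, 20% test.
```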
Training Failures
- Overfitting
- Underfitting
- Lack of model power
- Lack of signal in the data
Overfitting
Model fails to generalize - too tuned on the training set (Like taking a practice exam over and over and doing really well because you’ve already taken it)
Pruning
Limiting the number of branches of the decision tree
A decision tree with only 2 branches (a single split) is called a stump
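A sketch of limiting tree size in scikit-learn; max_depth and ccp_alpha are that library's knobs, and the data is synthetic:

```python
# max_depth caps how deep the tree can grow; max_depth=1 gives a stump.
# ccp_alpha enables cost-complexity pruning of grown branches.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
pruned = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.01).fit(X, y)
print(stump.get_depth(), pruned.get_depth())
```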
Underfitting
Too general - the model learned one rule and applies it everywhere. If it can’t even do well on its own training set (low training accuracy), it’s definitely underfit
Insufficient Model Power
Some patterns are too complicated to represent with a decision tree. Imagine your label is determined by some complex function of the features; a decision tree will struggle with this
Lack of Signal
The pattern cannot be determined from the features you currently have. You might just need more or different features
Class Imbalance
Serious issue - when an overwhelming majority of your data is a single class
Positive vs. negative class
- Positive requires action
- Negative is the default behavior
The positive class is the uncommon signal we’re trying to detect
Confusion Matrix
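A quick sketch of reading true/false positives and negatives out of scikit-learn's confusion matrix; the labels here are made up:

```python
# Confusion matrix for a binary problem: rows are true labels, columns are predictions.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)  # true positives, false positives, false negatives, true negatives
```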
Precision
Measures how many of the things we classified as positive were actually positive
We use precision when we only care about being correct about the things we identify as positive.
- Google does not care if it turns away 1000 good engineers; Google just wants to make sure the engineers it DOES hire are good
Recall
Recall measures how much of the actual positive class we caught (equivalently, how many positives we missed).
Google doesn’t care about recall. Cancer screens do; so does TSA screening. We use recall when we want to make sure we don’t miss anything.
F1 Score
Harmonic mean of precision and recall
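Putting the three metrics together from raw counts (the counts are hypothetical):

```python
# Precision, recall, and F1 from true/false positive and false negative counts.
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # of everything we called positive, how much was right
recall = tp / (tp + fn)     # of all actual positives, how many we caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)  # 0.75, 0.6, ~0.667
```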
Confidence
ML algorithms can tell you how confident they are in an answer. You may want to treat 51% confidence in the positive class differently from 99% confidence. Many times you’ll need to choose a threshold for when to classify something as positive vs. negative
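A tiny sketch of applying an explicit threshold to confidence scores; the probabilities and the 0.8 cutoff are arbitrary:

```python
# Turn confidence scores into hard labels with a chosen threshold.
import numpy as np

probs = np.array([0.51, 0.99, 0.30, 0.85])  # model's confidence in the positive class
threshold = 0.8
labels = (probs >= threshold).astype(int)
print(labels)  # [0 1 0 1] -- only the high-confidence cases are called positive
```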
ROC Curve
As we lower the threshold for classification, the model begins to get more true positives, as well as more false positives. We can graph this trade-off as the threshold changes. This is called the receiver operating characteristic curve, or ROC curve
Parametric curve: false positive rate (x) vs. true positive rate (y)
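A sketch of computing the curve with scikit-learn; the logistic regression model and synthetic data are stand-ins:

```python
# ROC curve: one (false positive rate, true positive rate) point per threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)
print(roc_auc_score(y_te, probs))  # area under the ROC curve
```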
Log Loss
A measure of accuracy that penalizes overconfidence
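For a single positive example the loss is -log(p), where p is the predicted probability of the true class; a few made-up values show how overconfident mistakes blow up:

```python
# Log loss penalizes confident wrong answers far more than hesitant ones.
import math

for p in (0.9, 0.6, 0.1, 0.01):
    print(p, -math.log(p))
# 0.9 -> ~0.105, 0.6 -> ~0.511, 0.1 -> ~2.303, 0.01 -> ~4.605
```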
Tuning Hyperparameters
e.g., k (for KNN) and max-depth (for decision trees)
- Max-depth: between 3 and 7
- KNN k: between 3 and 13
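A sketch of searching those ranges with scikit-learn's GridSearchCV; the synthetic data and 5-fold CV setting are assumptions:

```python
# Grid search over max_depth 3-7 for a tree and k 3-13 for KNN.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

tree_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           {"max_depth": list(range(3, 8))}, cv=5).fit(X, y)
knn_search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": list(range(3, 14))}, cv=5).fit(X, y)
print(tree_search.best_params_, knn_search.best_params_)
```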