5 questions to test your understanding
A fraud detection dataset contains 99.9% legitimate transactions and 0.1% fraudulent ones. A classifier that always predicts 'not fraud' achieves 99.9% accuracy. What does this reveal?
Why does fine-tuning a pretrained language model like BERT typically require far less labeled training data than training a classifier from scratch on TF-IDF features?
True or false: Bag-of-words models discard word order entirely, yet they can still achieve reasonable performance on many text classification tasks such as spam detection and topic classification.
True or false: Preprocessing steps like lowercasing and stop word removal usually improve text classification performance and should be applied universally.
Explain why overall accuracy is an insufficient evaluation metric for a text classifier trained on a severely imbalanced dataset, and what metrics should be used instead.
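To check your reasoning on the two imbalanced-data questions, here is a minimal sketch in plain Python. It assumes a hypothetical set of 10,000 transactions with the 99.9% / 0.1% split from the first question; the helper names (`confusion_counts`, `metrics`) and the label encoding (1 = fraud, 0 = legitimate) are illustrative choices, not part of any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return accuracy plus precision, recall, and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Define a ratio as 0.0 when its denominator is zero (a convention, assumed here).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical data: 10 fraudulent (0.1%) and 9,990 legitimate transactions.
y_true = [1] * 10 + [0] * 9990

# A "classifier" that always predicts 'not fraud'.
always_legit = [0] * len(y_true)

baseline = metrics(y_true, always_legit)
print(baseline)
```

Under these assumptions the always-negative baseline reaches 0.999 accuracy while its recall and F1 on the fraud class are both 0.0, which is the gap the questions ask you to explain.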