A data scientist builds an anomaly detection model for factory machine failures and says 'I'll use the statistically optimal threshold.' What is the fundamental problem with this statement?
A. Anomaly detection models do not produce continuous scores, so threshold selection is not applicable
B. Statistical optimality requires labeled anomaly data, which is never available in practice
C. There is no universally optimal threshold: the right cutoff depends on the business cost of false negatives versus false positives, which cannot be determined from the data alone
D. The threshold should always be set at 3 standard deviations from the mean, making the choice straightforward
Answer: C
Every anomaly detection threshold encodes a decision about relative error costs. In a factory, a false negative (missed failure) might cause catastrophic downtime; a false positive (unnecessary halt) wastes production time. These costs come from domain context, not from statistics. The same model can call for different thresholds in different applications, and each can be the right choice for its setting. No statistically derived number captures this business tradeoff. The threshold must be set intentionally, informed by the cost structure of the specific application.
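The dependence on cost can be made concrete with a minimal sketch. It assumes a small labeled validation set is available (often not the case in practice), and the scores, labels, and cost figures are invented purely for illustration. The helper names `total_cost` and `best_threshold` are hypothetical.

```python
def total_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Total cost when every score >= threshold is flagged as an anomaly."""
    cost = 0.0
    for score, is_anomaly in zip(scores, labels):
        flagged = score >= threshold
        if is_anomaly and not flagged:
            cost += cost_fn  # missed anomaly (false negative)
        elif flagged and not is_anomaly:
            cost += cost_fp  # unnecessary alert (false positive)
    return cost


def best_threshold(scores, labels, cost_fn, cost_fp):
    """Try each observed score (plus 'flag nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, cost_fn, cost_fp),
    )


# Invented validation data: anomaly scores and whether each was a real failure.
scores = [0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.90]
labels = [False, False, False, True, False, False, True, True]

# Missed failures are expensive: the cheapest cutoff is low (sensitive).
sensitive = best_threshold(scores, labels, cost_fn=100, cost_fp=1)
# False alarms are expensive: the cheapest cutoff is higher (conservative).
conservative = best_threshold(scores, labels, cost_fn=1, cost_fp=10)
print(sensitive, conservative)
```

The scores and labels are identical in both calls; only the cost assumptions change, and the selected cutoff moves with them.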
Question 2 Multiple Choice
Why do isolation forests use the average depth at which a point is isolated in random decision trees as its anomaly score?
A. Deeper isolation indicates the point is in a denser region, requiring more splits to separate from similar points
B. Anomalies in sparse regions are isolated in very few random splits, while normal points in dense clusters require many splits, so a short isolation path signals a likely anomaly
C. Random trees with more splits achieve higher accuracy, so deeper isolation paths produce more reliable scores
D. Isolation depth is directly proportional to the z-score, providing a familiar statistical interpretation
Answer: B
Isolation forests exploit a geometric intuition: anomalies sit far from the crowd in sparse regions of feature space. A random split that falls between an anomaly and the bulk of the data separates it from all other points, and such splits are likely because the gap is wide. Normal points, clustered together, require many successive splits before one of them is finally isolated from its neighbors. No distance calculations or density estimates are needed. The algorithm uses the efficiency of random isolation as its signal, which helps it scale to large and high-dimensional datasets.
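The intuition can be checked with a small sketch. This is a simplified illustration, not a full isolation forest: it omits the subsampling and path-length normalization that real implementations use, and the data, seeds, and function names are assumptions made for the example.

```python
import random


def isolation_depth(point, data, rng, max_depth=50):
    """Count random axis-aligned splits until `point` is alone in its partition."""
    subset = data
    for depth in range(max_depth):
        if len(subset) <= 1:
            return depth
        dim = rng.randrange(len(point))
        lo = min(p[dim] for p in subset)
        hi = max(p[dim] for p in subset)
        if lo == hi:
            continue  # no spread on this feature; the attempt still counts toward max_depth
        split = rng.uniform(lo, hi)
        # Keep only the points that fall on the same side of the split as `point`.
        side = point[dim] < split
        subset = [p for p in subset if (p[dim] < split) == side]
    return max_depth


def average_depth(point, data, n_trees=100, seed=0):
    """Average isolation depth over many independent random partitionings."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees


# Synthetic data: one dense Gaussian cluster, a point at its center, one far outlier.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
center = (0.0, 0.0)
outlier = (8.0, 8.0)
data = cluster + [center, outlier]

center_depth = average_depth(center, data)
outlier_depth = average_depth(outlier, data)
print(f"center: {center_depth:.1f} splits, outlier: {outlier_depth:.1f} splits")
```

The outlier is isolated after far fewer splits on average than the point in the middle of the cluster, which is the signal the method turns into an anomaly score.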
Question 3 True / False
In anomaly detection for credit card fraud, it is generally better to use a lower detection threshold (more sensitive, more alerts) than in a manufacturing quality control application.
T. True
F. False
Answer: True
The cost structure differs between applications. In credit card fraud, a false negative (missed fraud) means real financial harm to a customer, while a false positive (flagging a legitimate transaction) causes minor inconvenience and a quick verification step. The asymmetry favors sensitivity. In manufacturing, a false positive that halts a production line can be extremely costly in lost output, while letting a small number of defects through may be tolerable. Different cost structures demand different thresholds, so the same model must be calibrated differently for each application.
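One way to see the asymmetry numerically is the standard expected-cost rule: if the model outputs a calibrated anomaly probability p, flagging is cheaper than not flagging when p * cost_fn > (1 - p) * cost_fp. The sketch below assumes such a calibrated probability exists (many anomaly detectors do not provide one), and the cost figures are hypothetical.

```python
def flag_probability_cutoff(cost_fp, cost_fn):
    """Probability above which flagging has lower expected cost than ignoring.

    Derived from p * cost_fn > (1 - p) * cost_fp.
    """
    return cost_fp / (cost_fp + cost_fn)


# Hypothetical costs: a missed fraud is far worse than a verification prompt.
fraud_cutoff = flag_probability_cutoff(cost_fp=5, cost_fn=500)
# Hypothetical costs: halting the line is far worse than one escaped defect.
factory_cutoff = flag_probability_cutoff(cost_fp=2000, cost_fn=200)

print(f"fraud: flag when p > {fraud_cutoff:.3f}")
print(f"factory: flag when p > {factory_cutoff:.3f}")
```

Under these assumed costs the fraud system flags at roughly a 1% probability, while the factory system waits until it is about 91% sure.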
Question 4 True / False
The Local Outlier Factor (LOF) method uses a global density threshold to identify anomalies, which is why it performs better than isolation forests on datasets with clusters of varying density.
T. True
F. False
Answer: False
LOF's key strength is precisely that it uses *local* density comparisons relative to a point's neighbors, not a global threshold. It asks: is this point's local density much lower than the density of its neighbors? A point in a naturally sparse cluster will not be flagged if its neighbors are equally sparse. A global threshold would fail on datasets with multiple clusters of different densities because what counts as 'anomalous' varies by region. LOF handles this by making each point's score relative to its local neighborhood.
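A compact sketch of the LOF computation shows the local comparison at work. It is a simplification: each point uses exactly k neighbors rather than every point within the k-distance, and the grid data and the value k=3 are chosen only for illustration.

```python
import math


def lof_scores(points, k):
    """Simplified Local Outlier Factor using exactly k nearest neighbors per point."""
    n = len(points)
    # For each point: its k nearest neighbors as (distance, index) pairs.
    neighbors = [
        sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)[:k]
        for i in range(n)
    ]
    k_dist = [nbrs[-1][0] for nbrs in neighbors]

    def local_reachability_density(i):
        # Reachability distance to neighbor j is max(k-distance of j, actual distance).
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average neighbor density divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Dense 5x5 grid (spacing 0.1), sparse 5x5 grid (spacing 2.0), one stray point.
dense = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
sparse = [(10 + 2.0 * i, 10 + 2.0 * j) for i in range(5) for j in range(5)]
stray = (1.0, 1.0)
points = dense + sparse + [stray]

scores = lof_scores(points, k=3)
stray_score = scores[-1]
max_sparse_score = max(scores[len(dense):len(dense) + len(sparse)])
print(f"stray point LOF: {stray_score:.2f}, largest sparse-cluster LOF: {max_sparse_score:.2f}")
```

The stray point sits closer to its neighbors (about 0.85 away) than sparse-cluster points sit to theirs (2.0 away), so a global distance cutoff would rank it as less suspicious. Its LOF is nonetheless much larger, because its neighbors are far denser than it is, while the sparse-cluster points score close to 1.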
Question 5 Short Answer
Why is anomaly detection fundamentally different from a standard binary classification problem, and how does this difference affect how the methods are trained?
Model answer: Binary classification trains on labeled examples of both classes. Anomaly detection cannot do this because anomalies are rare, diverse, and unpredictable — you may have no labeled anomaly examples, and future anomalies may be unlike anything seen before. Instead, anomaly detection methods learn what 'normal' looks like from unlabeled or predominantly normal data, then flag deviations from that learned normal. This means the methods must generalize to unseen anomaly types, not just distinguish known anomaly patterns from normal ones.
This distinction has practical consequences: you cannot evaluate an anomaly detector the same way you evaluate a classifier. If you train only on normal data, you have no held-out anomaly examples for cross-validation. Performance must often be evaluated on carefully curated test sets or via domain expert review. The fundamental challenge is that the model is learning an open-world definition of normality, not a closed-world boundary between two known classes.
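The workflow described above can be sketched as follows: fit a description of normal behavior on normal-only readings, then measure precision and recall on a separate, hand-labeled test set. The z-score detector is a stand-in chosen for brevity, and all readings, labels, and the threshold of 3.0 are invented for the example.

```python
import statistics

# Training data: sensor readings assumed to be normal operation only (no labels needed).
normal_train = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7, 20.0, 19.9]
mean = statistics.mean(normal_train)
std = statistics.stdev(normal_train)


def anomaly_score(reading):
    """Distance from the learned 'normal' mean, in standard deviations."""
    return abs(reading - mean) / std


def precision_recall(test_set, threshold):
    """Evaluate on (reading, is_anomaly) pairs labeled by a domain expert."""
    tp = sum(1 for x, y in test_set if y and anomaly_score(x) >= threshold)
    fp = sum(1 for x, y in test_set if not y and anomaly_score(x) >= threshold)
    fn = sum(1 for x, y in test_set if y and anomaly_score(x) < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Curated test set (hypothetical): includes a subtle drift (20.4) labeled anomalous
# and a harmless excursion (20.6) labeled normal.
curated_test = [
    (20.1, False), (19.8, False), (20.6, False),
    (23.5, True), (16.0, True), (20.4, True),
]

precision, recall = precision_recall(curated_test, threshold=3.0)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

The detector never saw an anomaly during fitting, so the curated set is the only place its misses (the subtle drift) and false alarms (the harmless excursion) become visible.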