Anomaly Detection

Finding the needle in the haystack of regular data.

The idea

Anomaly detection (or outlier detection) identifies data points that deviate significantly from normal behavior. It's used everywhere from credit card fraud detection to server monitoring. Rather than explicitly defining what "bad" looks like, we define what "normal" looks like, and flag everything else.

Step 1: Normal data clusters together.

How it works (Z-Score)

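A simple statistical approach is the Z-score, which measures how many standard deviations a data point lies from the mean: z = (x - mean) / std_dev. A common rule of thumb flags a point as an anomaly when |z| > 3. The absolute value matters because unusually low values are anomalies too. For roughly normally distributed data, about 99.7% of points fall within three standard deviations of the mean, so anything beyond that is rare.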

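import numpy as np

def detect_anomalies(data, threshold=3):
    """Return the items whose absolute Z-score exceeds the threshold."""
    mean = np.mean(data)
    std_dev = np.std(data)  # population standard deviation (ddof=0)

    # All values identical: nothing deviates, and dividing by zero
    # would only produce inf/nan scores.
    if std_dev == 0:
        return []

    anomalies = []
    for item in data:
        z_score = abs(item - mean) / std_dev
        if z_score > threshold:
            anomalies.append(item)

    return anomalies

# With fewer than 11 points, no |z| can exceed 3 (the maximum is sqrt(n - 1)),
# so the sample needs enough readings for the spike to stand out.
cpu_usage = [20, 22, 19, 21, 24, 99, 20, 21, 23, 20, 22, 21, 19, 22, 20]
print(detect_anomalies(cpu_usage))  # [99]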
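
One caveat with small samples is worth checking numerically. When the standard deviation is the population version (NumPy's default, ddof=0), no point in a sample of n values can have |z| above sqrt(n - 1), so a threshold of 3 cannot fire until there are at least 11 points. The sketch below illustrates this with a hypothetical helper, max_abs_z, and made-up samples in which every value is identical except one spike (the most extreme case).

```python
import math
import numpy as np

def max_abs_z(values):
    """Largest |z| in the sample, using the population std (ddof=0)."""
    arr = np.asarray(values, dtype=float)
    std = arr.std()
    if std == 0:
        return 0.0
    return float(np.max(np.abs(arr - arr.mean()) / std))

# Most extreme case: every value identical except a single spike.
for n in (8, 10, 11, 20):
    sample = [20.0] * (n - 1) + [99.0]
    # Prints n, the observed maximum |z|, and the bound sqrt(n - 1).
    print(n, round(max_abs_z(sample), 3), round(math.sqrt(n - 1), 3))
```

If the dataset is short, either collect more history or lower the threshold deliberately rather than relying on the default of 3.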

Cost

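Time Complexity: O(N) overall: one pass to compute the mean and standard deviation, and a second pass to compute the Z-scores.
Space Complexity: O(N) when the whole dataset is held in memory, as in the batch function above; O(1) if the mean and variance are maintained incrementally in a streaming fashion.
More complex ML approaches cost more. An Isolation Forest, for example, takes roughly O(t * s log s) to build t trees on subsamples of size s, which approaches O(N log N) per tree when each tree is built on the full dataset.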
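
The streaming O(1)-space variant can be sketched with Welford's algorithm, which updates the mean and the sum of squared deviations one reading at a time. This is a minimal illustration, not a reference implementation: the class name StreamingZScore and the min_samples warm-up parameter are assumptions made for this example. Unlike the batch version, each reading is scored against the history seen before it, and flagged readings are still folded into the running statistics.

```python
import math

class StreamingZScore:
    """Running mean/variance via Welford's algorithm; O(1) memory."""

    def __init__(self, threshold=3.0, min_samples=30):
        self.threshold = threshold
        self.min_samples = min_samples  # warm-up length (assumed value)
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        """Score x against the history so far, then fold it in."""
        is_anomaly = False
        if self.count >= self.min_samples:
            std = math.sqrt(self.m2 / self.count)
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford update of the running statistics
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        return is_anomaly

detector = StreamingZScore(threshold=3.0, min_samples=5)
readings = [20, 22, 19, 21, 24, 99, 20, 21]
flagged = [x for x in readings if detector.update(x)]
print(flagged)  # [99]
```

Because the spike is compared with the five readings before it rather than with a sample that already contains it, it stands out even in a short series. Whether to exclude flagged readings from the running statistics is a design choice that depends on how often genuine level shifts occur.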

Watch out for

Seasonality: CPU usage might normally spike at 3 AM for backups. A static threshold computed over all hours will flag this expected spike as an anomaly, producing a false positive.
Cold Starts: You need enough historical data to accurately define "normal" before you can detect an anomaly.
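
One way to soften both pitfalls is to keep a separate baseline per hour of day and to stay silent until that hour has enough history. The sketch below is an illustration under assumptions: the function name detect_seasonal_anomalies, the (hour, value) input shape, and the min_history default are all hypothetical, and real seasonality may need weekday or holiday buckets as well.

```python
from collections import defaultdict
import numpy as np

def detect_seasonal_anomalies(samples, threshold=3.0, min_history=10):
    """samples: iterable of (hour_of_day, value) pairs in arrival order.

    Each value is compared only with earlier values from the same hour.
    """
    history = defaultdict(list)
    anomalies = []
    for hour, value in samples:
        past = history[hour]
        # Cold start: skip scoring until this hour has enough history.
        if len(past) >= min_history:
            mean, std = np.mean(past), np.std(past)
            if std > 0 and abs(value - mean) / std > threshold:
                anomalies.append((hour, value))
        past.append(value)
    return anomalies

# Made-up data: 12 days where 3 AM runs hot (backups) and noon stays low.
days = [(h, base + d % 3) for d in range(12) for h, base in ((3, 88), (12, 20))]
# A reading of 90 is normal at 3 AM but unusual at noon.
print(detect_seasonal_anomalies(days + [(3, 90), (12, 90)]))  # [(12, 90)]
```

The trade-off is that each bucket needs its own warm-up period, so finer-grained seasonality means a longer cold start.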