How to train an AI to find needles in a haystack of legitimate transactions.
Banks use machine learning to block stolen credit cards almost instantly. But training these models is notoriously difficult because of class imbalance. Fraud is rare: for illustration, say 99.9% of transactions are perfectly legitimate and only 0.1% are fraud. If you train a naive model on this data, it will quickly learn a "genius" trick: just guess "Legitimate" every single time! It will score 99.9% accuracy while catching zero fraud, which makes it completely useless. To actually catch fraud, we must either artificially balance the dataset or heavily penalize the model for missing the rare fraudulent events.
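To make the accuracy trap concrete, here is a minimal sketch in plain Python. The counts (100,000 transactions, 100 of them fraud) are illustrative assumptions chosen to match the 99.9% / 0.1% split above, not real bank data.

```python
# Illustrative numbers only: 100,000 transactions, 0.1% of them fraud.
n_total = 100_000
n_fraud = 100

# Labels: 1 = fraud, 0 = legitimate.
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# The "genius" model: predict legitimate for every transaction.
y_pred = [0] * n_total

# Accuracy: share of all transactions labeled correctly.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / n_total

# Recall: share of actual fraud that was caught.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / n_fraud

print(f"accuracy = {accuracy:.3f}")  # 0.999
print(f"recall   = {recall:.3f}")    # 0.000
```

The accuracy looks superb, yet not a single fraudulent transaction is caught.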
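We cannot deploy a model that just guesses "Legitimate". We have two main strategies to force the model to care about the minority class: resample the training data so fraud is no longer rare (oversample fraud or undersample legitimate transactions), or keep the data as it is and assign class weights so that missing a fraud costs far more than misjudging a legitimate transaction. The code below takes the class-weight route: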
from sklearn.ensemble import RandomForestClassifier
# BAD: Naive training on imbalanced data
model = RandomForestClassifier()
model.fit(X_train, y_train)  # Tends to predict 0 (Legitimate) for nearly every transaction
# GOOD: Using Class Weights to penalize missing fraud
# Label 0 is Legitimate (weight 1), label 1 is Fraud (weight 100); 100 is an illustrative value, not a tuned one
model = RandomForestClassifier(class_weight={0: 1, 1: 100})
model.fit(X_train, y_train)
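# Better metrics than 'Accuracy'
from sklearn.metrics import recall_score
# Recall tells us: "Out of all actual fraud, how much did we catch?"
# Assumes a held-out test split named X_test, y_test (not defined above)
y_pred = model.predict(X_test)
print("Recall:", recall_score(y_test, y_pred))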
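The other strategy, artificially balancing the dataset, is not shown in the code above. A minimal sketch of random oversampling in plain Python might look like the following. The helper name `oversample_minority` and the toy data are hypothetical, invented for this illustration; dedicated libraries offer more refined resampling methods.

```python
import random

def oversample_minority(X, y, minority_label=1, seed=0):
    """Duplicate random minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    if not minority or len(minority) >= len(majority):
        return list(X), list(y)  # nothing to balance
    # Sample minority indices with replacement to fill the gap.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    indices = majority + minority + extra
    rng.shuffle(indices)
    return [X[i] for i in indices], [y[i] for i in indices]

# Toy data (hypothetical): 995 legitimate rows, 5 fraud rows, one placeholder feature.
X_train = [[float(i)] for i in range(1000)]
y_train = [1 if i < 5 else 0 for i in range(1000)]

X_bal, y_bal = oversample_minority(X_train, y_train)
print(sum(y_bal), len(y_bal) - sum(y_bal))  # fraud count, legitimate count
```

Resample only the training split. The test set should keep its original imbalance, otherwise the reported metrics will not reflect what the model sees in production.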
There is a price, though. A model that is hyper-sensitive to fraud also has a higher false positive rate: it starts flagging legitimate transactions (like buying a coffee in a new city) as fraud and declining the user's card. That creates friction for customers and a flood of angry support calls. Fraud detection is a constant tug-of-war between catching bad guys and annoying good guys.
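The tug-of-war can be seen in a few lines of Python. This sketch uses a decision threshold on hypothetical fraud scores rather than class weights, as a simpler stand-in for "making the model more sensitive". The scores and labels are made up for illustration.

```python
# Hypothetical model scores (estimated probability of fraud) with true labels.
# 1 = fraud, 0 = legitimate. Made-up values for illustration.
scored = [
    (0.95, 1), (0.80, 1), (0.60, 0), (0.55, 1), (0.40, 0),
    (0.35, 0), (0.30, 1), (0.20, 0), (0.10, 0), (0.05, 0),
]

def recall_and_fpr(scored, threshold):
    """Flag a transaction as fraud when its score is >= threshold."""
    tp = sum(1 for s, y in scored if y == 1 and s >= threshold)
    fn = sum(1 for s, y in scored if y == 1 and s < threshold)
    fp = sum(1 for s, y in scored if y == 0 and s >= threshold)
    tn = sum(1 for s, y in scored if y == 0 and s < threshold)
    # Recall = TP / (TP + FN); false positive rate = FP / (FP + TN).
    return tp / (tp + fn), fp / (fp + tn)

# A lower threshold means a more sensitive model.
for threshold in (0.9, 0.5, 0.25):
    recall, fpr = recall_and_fpr(scored, threshold)
    print(f"threshold={threshold:.2f}  recall={recall:.2f}  false_positive_rate={fpr:.2f}")
```

On this toy data, dropping the threshold from 0.9 to 0.25 lifts recall from 0.25 to 1.00, but the false positive rate climbs from 0.00 to 0.50. Where to stop is a business decision as much as a modeling one.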