The ML flywheel: How we continually improve our models to reduce card testing

Ryan Drapeau Payment Intelligence
Inayat Khosla Advanced Attacks

Illustration by Carolina Moscoso

Card testing is one of the most significant fraud threats to Stripe, its users, and the broader financial ecosystem. It is also one of the most challenging to detect and block, both because it blends in easily with legitimate traffic and because bad actors are constantly changing their tactics.

That’s why the systems we use to prevent it need to be both accurate (correctly distinguishing card testing from legitimate traffic) and nimble, built to adapt to the evolving landscape of threats. Stripe’s machine learning–based approach to combating card testing prioritizes these qualities through rapid detection and retraining. It’s a flywheel: detection enables us to add new data labels and features, which we then feed back into ML models (at different levels of abstraction) for retraining and redeployment. As a result of this approach, successful attacks on Stripe have decreased by 80% over the last two years, even as Stripe’s payment volume expanded to over $1 trillion last year.

Testing for stolen card details that are still active

Card testing is a form of payment fraud in which bad actors seek to validate which stolen card numbers can be successfully charged in the future. Attacks usually occur in one of two ways: verification or enumeration. 

A verification attack is when a tester iterates through a known and finite set of stolen card credentials by attempting small or zero-dollar transactions in order to “verify” which ones haven’t been canceled or expired. An enumeration attack is more like guessing: a fraudulent actor tries to “enumerate” card numbers, often within a specific numeric range, in an attempt to differentiate between chargeable and nonchargeable cards. Once a card passes this initial testing phase, it is deemed active and unblocked, and its value to the fraudulent actor dramatically increases. They might then use the card to make substantial purchases, or they might sell the card details on the illegal market, both of which can have a significant impact on targeted businesses. These attacks represent a small percentage of all transactions, and the tactics underlying them are always changing—two factors that make identifying card testing with high precision a challenging ML problem.

Setting block thresholds with multiple layers of ML

At its most basic, ML generates a probabilistic guess about whether a transaction is card testing: above a certain threshold, the transaction is blocked; below that threshold, it proceeds. But determining the right threshold is a nuanced calculation. To generate it, we apply ML models at multiple levels of abstraction:

  1. At the highest level of abstraction, we use ML to estimate the overall prevalence of card testing on Stripe. This allows us to update our risk posture on a daily basis.
  2. At the next level down, we apply ML to estimate where card testing is likely to be taking place—which businesses, issuers, or surfaces are experiencing an attack. It’s not always obvious: a spike in transactions could be card testing, or it could be a flash sale.
  3. Finally, at the bottom of the abstraction hierarchy, we apply ML to individual transactions, leveraging a wide variety of signals.

The outputs of these three models work together to dynamically update the threshold at which we block potential card testing attacks. This allows us to intervene in real time on a small, precise slice of traffic, and ease our controls as soon as the attack subsides—thereby minimizing the impact on legitimate traffic. 
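To make the interaction between the three levels concrete, here is a minimal sketch of one way a block threshold could move with higher-level risk estimates. It is an illustration under stated assumptions, not a description of Stripe’s production logic: the names `RiskContext`, `block_threshold`, and `should_block`, the `base` and `floor` values, and the rule for combining the two higher-level estimates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RiskContext:
    # Hypothetical output of the top-level model: estimated share of
    # current traffic that is card testing, in [0, 1].
    global_prevalence: float
    # Hypothetical output of the mid-level model: probability that this
    # business, issuer, or surface is under attack, in [0, 1].
    entity_attack_prob: float


def block_threshold(ctx: RiskContext, base: float = 0.9, floor: float = 0.5) -> float:
    """Return the score at or above which a transaction is blocked.

    Assumption: the two higher-level estimates are merged into a single
    "pressure" value, and the threshold slides linearly from `base`
    (no pressure) down to `floor` (maximum pressure).
    """
    pressure = 1.0 - (1.0 - ctx.global_prevalence) * (1.0 - ctx.entity_attack_prob)
    return base - (base - floor) * pressure


def should_block(txn_score: float, ctx: RiskContext) -> bool:
    """Compare the transaction-level score against the dynamic threshold."""
    return txn_score >= block_threshold(ctx)


quiet = RiskContext(global_prevalence=0.0, entity_attack_prob=0.0)
under_attack = RiskContext(global_prevalence=0.1, entity_attack_prob=0.9)

# The same transaction score is allowed in quiet conditions and blocked
# while the entity appears to be under attack.
print(should_block(0.7, quiet), should_block(0.7, under_attack))
```

In this sketch the threshold relaxes on its own as the higher-level estimates fall, which mirrors the idea of easing controls once an attack subsides.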

But the models are only as good as the data they’re trained on. To make sure they are tuned to the most current card testing trends, we’ve created an operating framework that enables rapid data labeling, retraining, and redeployment.

Labeling breakthrough attempts

Unlike disputes or declines, card testing doesn’t yield explicit labels that can be used to train models or evaluate prevalence or performance. Instead, labels need to be derived. We do this through a combination of processes: consolidating intelligence around newly identified attack vectors, automating discovery of hidden patterns by combining weaker signals, and applying manual expert review. The output of this combined review exercise is a refined set of transactions that we can label as card testing fraud. This allows us both to act with confidence on the suspicious transactions we were initially unsure about and to proactively block new forms of card testing. It’s the first step in a virtuous cycle where reactive recognition of breakthrough attacks turns into proactive identification within hours.
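As a rough illustration of what “combining weaker signals” can mean, the sketch below merges several individually inconclusive signal probabilities with a noisy-OR rule and routes borderline cases to manual review. The noisy-OR rule, the function names, and both cutoffs are assumptions chosen for the example; the post does not say which combination method is actually used.

```python
def noisy_or(signal_probs):
    """Combine independent weak signals into one probability.

    Each entry is the (hypothetical) probability that a single signal,
    taken alone, indicates card testing. Noisy-OR assumes the signals
    are independent, which real signals often are not.
    """
    p_all_benign = 1.0
    for p in signal_probs:
        p_all_benign *= 1.0 - p
    return 1.0 - p_all_benign


def derive_label(signal_probs, label_at=0.95, review_at=0.6):
    """Turn a combined score into a derived label (cutoffs are illustrative)."""
    score = noisy_or(signal_probs)
    if score >= label_at:
        return "card_testing"
    if score >= review_at:
        return "manual_review"
    return "unlabeled"


# Three moderately suspicious signals reinforce each other.
print(derive_label([0.7, 0.7, 0.7]))
# Two weaker signals are not enough for an automatic label.
print(derive_label([0.5, 0.4]))
```

The middle band is where expert review fits in this sketch: transactions that are suspicious but not conclusive get a human decision before they become training labels.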

Rapid retraining

Once we have labels, we need to engineer features that will become new inputs to our redeployed ML models. And given how quickly card testing attacks can manifest, we need to be able to do this quickly. Our next-generation feature engineering platform, Shepherd, makes this easy. Shepherd—which we built through a partnership with Airbnb—allows multiple teams at Stripe with different skill sets to generate new features with minimal code changes.

Once we have these features, we need to test them. We do this using Shepherd and Flyte, an ML orchestration platform that facilitates experimentation through standardized, automated workflows. We retrain our models using the proposed features, run them on offline data, and evaluate the results. Then we select the highest-precision features and conduct blue-green tests between the old model and the new one; if those go well, we deploy the refreshed model. All of this takes place as part of our immediate incident response.
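The offline evaluation step can be sketched as follows: score the same labeled transactions with models retrained on each candidate feature, then rank the candidates by precision at a fixed block threshold. This is a simplified, hypothetical example in plain Python; the function names, feature names, and toy scores are invented for illustration, and it does not use the Shepherd or Flyte APIs.

```python
def precision_at_threshold(scores, labels, threshold):
    """Fraction of blocked transactions that were truly card testing.

    `labels` uses 1 for card testing and 0 for legitimate traffic.
    Returns 0.0 when nothing is blocked, to avoid dividing by zero.
    """
    blocked = [label for score, label in zip(scores, labels) if score >= threshold]
    if not blocked:
        return 0.0
    return sum(blocked) / len(blocked)


def rank_candidates(candidate_scores, labels, threshold=0.5):
    """Rank candidate features by offline precision, best first.

    `candidate_scores` maps a feature name to the scores produced by a
    model retrained with that feature on the same offline dataset.
    """
    results = {
        name: precision_at_threshold(scores, labels, threshold)
        for name, scores in candidate_scores.items()
    }
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


labels = [1, 1, 0, 0, 1]
candidate_scores = {
    "feature_a": [0.9, 0.8, 0.7, 0.1, 0.6],  # blocks one legitimate payment
    "feature_b": [0.9, 0.8, 0.2, 0.1, 0.6],  # blocks only card testing
}

for name, precision in rank_candidates(candidate_scores, labels):
    print(name, precision)
```

Precision alone can hide a drop in recall, so a fuller evaluation would track both; the sketch keeps only the metric the selection step names.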

Foundation models also play a role

This flywheel process has allowed us to reduce card testing on Stripe while keeping the false positive rate low. While this approach is effective for identifying historical patterns, it’s less able to spot fully novel or particularly subtle types of attacks on large businesses, where card testing attacks blend in with much higher transaction volume. To address these limitations, we augment our smaller, card testing–specific models with a large transformer model trained on billions of global transactions. It can detect patterns that are not easily discerned by simpler models. It also compresses payments into atomic embeddings, which we then leverage across multiple card testing use cases, such as training classifiers on sequences of embeddings to determine whether an entity is undergoing an attack in real time.
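To show the general shape of a classifier built on sequences of embeddings, here is a deliberately small sketch. It assumes each payment has already been turned into an embedding vector by some upstream model, mean-pools an entity’s recent embeddings into one vector, and fits a logistic regression on top. Mean pooling (which ignores order), the two-dimensional toy embeddings, and every function name are simplifications chosen for the example, not details from the post.

```python
import math


def mean_pool(sequence):
    """Average a sequence of embedding vectors into one fixed-size vector."""
    dim = len(sequence[0])
    return [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression with batch gradient descent."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i in range(dim):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def attack_probability(sequence, weights, bias):
    """Score an entity's recent payment embeddings as attack vs. normal."""
    pooled = mean_pool(sequence)
    return sigmoid(sum(w * xi for w, xi in zip(weights, pooled)) + bias)


# Toy training data: each item is one entity's sequence of 2-d embeddings.
attack_sequences = [[[0.9, 0.1], [1.0, 0.0]], [[0.8, 0.2], [0.9, 0.0]]]
normal_sequences = [[[0.1, 0.9], [0.0, 1.0]], [[0.2, 0.8], [0.1, 1.0]]]

features = [mean_pool(seq) for seq in attack_sequences + normal_sequences]
labels = [1, 1, 0, 0]
weights, bias = train_logistic(features, labels)

print(attack_probability([[1.0, 0.1], [0.9, 0.0]], weights, bias) > 0.5)
print(attack_probability([[0.0, 0.9]], weights, bias) > 0.5)
```

A production system would more plausibly use an order-aware sequence model over much higher-dimensional embeddings; the point here is only that once payments are compressed into vectors, a lightweight classifier can consume them.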

Attempts are up, but successful attempts are way down 

The overall result of this work is that successful card testing attacks on Stripe have declined significantly. If you’re a Stripe user interested in learning more about how Stripe protects your business from card testing, check out our documentation. And if you find this work exciting, come join us.
