A primer on machine learning for fraud detection

This guide introduces Stripe Radar and how we leverage the Stripe network to detect fraud.

Last updated December 15, 2021
  1. Introduction
  2. Introduction to online credit card fraud
  3. Stripe Radar and the Stripe network
  4. The basics of machine learning
    1. How does machine learning work?
    2. Feature engineering
  5. Evaluating machine learning models
    1. Key terms
    2. Precision-recall and ROC curves
    3. Score distributions
    4. Computing precision and recall
  6. Machine learning operations: deploying models safely and frequently
  7. How Stripe can help
    1. Improving performance with rules and manual reviews
  8. Next steps

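The recent sharp acceleration of e-commerce has brought a corresponding increase in online payment fraud. Globally, fraud is estimated to cost businesses $20 billion each year. Moreover, because fraud also drives up operational costs, network fees, and customer churn, the true total cost to businesses is considerably higher.

Fraud is not only costly; sophisticated fraudsters also keep finding new ways to exploit your weaknesses, which makes fraud harder to combat. That is why we built Stripe Radar, a machine-learning-based fraud prevention tool that is fully integrated into the Stripe platform. Radar’s machine learning draws on data from the hundreds of billions of dollars in payments processed across the Stripe network each year to detect fraud accurately and adapt quickly to the latest trends, so you can grow without increasing fraud.

This guide introduces Stripe Radar and how we leverage the Stripe network to detect fraud, provides an overview of the machine learning techniques we use, explains how we think about the effectiveness and performance of fraud detection systems, and describes how other tools in the Radar suite can help businesses optimize their fraud performance.

Introduction to online credit card fraud

We consider a payment fraudulent if the cardholder did not authorize it. For example, if a fraudster makes a purchase with a stolen card number that has not yet been reported, the payment may be processed successfully. Then, when the cardholder discovers the fraudulent use of the card, they file a dispute with their bank (also known as a “chargeback”) and request a refund.

Businesses can contest a chargeback by submitting evidence that the payment was valid. However, for card-not-present transactions, if the card network deems the payment truly fraudulent, the cardholder wins and the business bears the loss of the goods along with other fees.

Historically, many businesses have relied on hard-coded rules to predict and block suspected fraud. But hard-coded rules, such as blocking all credit cards used abroad, can end up blocking many legitimate transactions. Machine learning, on the other hand, can detect more nuanced patterns and help you maximize revenue. In machine learning parlance, a “false negative” is when the system misses something it is designed to detect: in our case, a fraudulent transaction. A “false positive” is when the system flags something it should not have, such as blocking a legitimate customer. Before we get into the details of machine learning, it is important to understand the trade-offs involved.

With false negatives, businesses are usually responsible for the original transaction amount plus chargeback fees (the costs associated with the bank reversing the card payment), higher network fees as a result of the dispute, and higher operational costs from reviewing charges or fighting the dispute. In addition, if you incur too many disputes, you could end up in a card network’s chargeback monitoring program, which can lead to higher costs or, in some cases, the inability to accept card payments at all.

False positives, or “false declines,” occur when a legitimate customer’s attempted purchase is blocked. False declines hurt a business’s gross profit and reputation. In fact, in a recent survey, 33% of consumers said they would not shop at a business again after a false decline.

There is a trade-off between preventing more disputes (false negatives) and reducing friction for legitimate customers (false positives): the fewer you have of one, the more you must tolerate of the other. Blocking more fraud means blocking more good customers, while reducing false positives usually lets more fraud slip through. Businesses need to decide how to balance the two based on their margins, growth profile, and other factors.

If a business has low margins (for example, it sells food online), the cost of one fraudulent transaction may need to be offset by hundreds of good transactions, making each false negative very expensive. Businesses with this profile may lean toward casting a wide net when trying to stop potential fraud. Conversely, if a business has high margins, as a SaaS business does, the opposite is true: the revenue lost from one blocked legitimate customer may outweigh the cost of additional fraud.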

Stripe Radar and the Stripe network

Radar is Stripe’s fraud prevention solution that protects businesses against online credit card fraud. It is powered by adaptive machine learning, the result of years of data science and infrastructure work by Stripe’s dedicated machine learning teams. Radar’s algorithms evaluate every transaction for fraud risk and take action appropriately. High-scoring payments are blocked, and Radar for Fraud Teams provides tools so users can specify when other actions should be taken.

Stripe processes hundreds of billions in payments from millions of businesses and interacts with thousands of partner banks across the globe each year. This scale means we often can see signals and patterns much earlier than smaller networks. Aggregate data relevant to fraud from all Stripe transactions—collected automatically through the payments flow—is used to improve our fraud detection ability. Signals like the country in which the card was issued or the IP address from which the payment was made provide valuable insights when predicting whether the payment is likely to be fraudulent.

Previous encounters with a card across the Stripe network also offer a significant amount of data to inform our risk assessments. Ninety percent of the cards used on the Stripe network have been seen more than once, giving us much richer data to make assessments on whether they are being used legitimately or fraudulently.

Another key advantage to our machine learning is that Radar is built directly into Stripe and works out of the box. Other fraud prevention solutions generally require a substantial amount of both upfront and ongoing investment. First, businesses must integrate with the fraud product. This involves engineering work to send data on relevant events and payments. Second, businesses must complete an integration to pass payment labels—a categorization of whether or not the transaction was fraudulent—from their payment processor to their fraud provider or manually label payments themselves, which can be incredibly time consuming and error prone. Radar, on the other hand, receives “ground truth” information directly from the usual Stripe payment flow and taps into timely and accurate data directly from card networks and issuers—no engineering time or coding required.

Let’s dive into a more detailed look at machine learning and how we use it at Stripe.

The basics of machine learning

Machine learning refers to a body of techniques for taking large amounts of data and using that data to produce models that predict outcomes, such as the likelihood a charge will result in a fraud dispute.

One of the main applications of machine learning is prediction: We want to predict the value of some output variable given some input values. In our case, the output value is true if the payment is fraudulent and false otherwise (such binary values are called booleans), and an example of an input value could be the country the card was issued in or the number of distinct countries where the card was used across the Stripe network in the past day. We determine how to make a prediction based on previous examples of input and output data.

The data used to train (or generate) the models consists of records (often obtained from historical data) with both the output value and the various input values as we have in the following (highly simplified) example:

Amount in USD   Card country   Countries card used from (24h)   Fraud?
$10.00          US             1                                No
$10.00          CA             2                                No
$10.00          CA             1                                No
$10.00          US             1                                Yes
$30.00          US             1                                Yes
$99.00          CA             1                                Yes

While there are only three inputs in this example, in practice machine learning models often have hundreds or thousands of inputs. The output of the machine learning algorithm might be a model like the following decision tree:

When we observe a new transaction, we look at the input values and traverse the tree “20-questions style” until we reach one of its “leaves.” Each leaf consists of all the samples in the data set (the table above) satisfying the question-answer pairs along the path we followed down the tree, and the probability that we think the new transaction is fraudulent is the number of samples in the leaf that are fraudulent divided by the total number of samples in the leaf. Put another way, the tree answers the question, “Of transactions in our data set with properties similar to the transaction we’re examining now, what fraction was actually fraudulent?” The machine learning part is concerned with the construction of the tree—what questions do we ask, in what order, to maximize the chances that we can distinguish between the two classes accurately? Decision trees are particularly easy to visualize and reason about, but there are many different learning algorithms, each with their own unique way of representing the relationships we are trying to model.
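
To make the leaf-probability idea concrete, here is a minimal Python sketch. The tree is hand-written and hypothetical: it is not the tree from the original figure, and the split points (`amount > 20` and `card_country == "US"`) are assumptions chosen for illustration. Only the six training rows come from the table above.

```python
# Toy training data copied from the table above:
# (amount in USD, card country, countries card used from in 24h, fraud?)
TRAINING_DATA = [
    (10.00, "US", 1, False),
    (10.00, "CA", 2, False),
    (10.00, "CA", 1, False),
    (10.00, "US", 1, True),
    (30.00, "US", 1, True),
    (99.00, "CA", 1, True),
]


def leaf_for(amount, card_country):
    """Walk a hand-written (hypothetical) tree and return the name of the leaf."""
    if amount > 20.00:
        return "amount > $20"
    if card_country == "US":
        return "amount <= $20, US card"
    return "amount <= $20, non-US card"


def fraud_probability(amount, card_country):
    """Fraction of training samples in the same leaf that were fraudulent."""
    leaf = leaf_for(amount, card_country)
    labels = [
        is_fraud
        for (amt, country, _n_countries, is_fraud) in TRAINING_DATA
        if leaf_for(amt, country) == leaf
    ]
    return sum(labels) / len(labels)


print(fraud_probability(45.00, "CA"))  # leaf holds 2 samples, both fraudulent
print(fraud_probability(10.00, "US"))  # leaf holds 2 samples, 1 fraudulent
print(fraud_probability(10.00, "CA"))  # leaf holds 2 samples, none fraudulent
```

With this toy tree, the three calls give 1.0, 0.5, and 0.0. A real learning algorithm would choose the questions and their order from the data instead of having them written by hand.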

Today’s machine learning models are prevalent—powering, behind the scenes, many of the products we frequently interact with—and generally much more sophisticated than the toy model above:

  • Google accurately and precisely provides spelling suggestions with its “Did you mean?” feature in Search using machine learning to model millions of language-related parameters in less than three seconds.
  • Amazon uses machine learning to predict purchases with its recommendation system based on the needs, preferences, and changing behaviors of users across its entire platform, even for new users with no historical data.

And, most relevant to this discussion, machine learning is the basis for Stripe Radar, which seeks to predict which of your payments are fraudulent.

How does machine learning work?

Academic machine learning courses will usually focus on the modeling process—the methods for translating data (e.g., the table above) into the models (e.g., the decision tree), which are the algorithms that tell you how input values (the country in which the card was issued, the number of countries where the card was used, etc.) map to output values (was the transaction fraudulent or not?). The process that takes the input data table above and produces the “best” tree is an example of a particular machine learning method. Modeling involves a number of steps, which depend on the nature of your data and the models you chose to use. While we won’t go into too much detail, a high-level overview follows.

First, we need to obtain training data. Before we can begin automatically detecting fraud, we need a dataset with examples of it. For each example, we need to have recorded (or be able to compute retrospectively) a range of input properties that could be useful in making future predictions about the output value. These input properties are called features. The collection of inputs together for a given sample is a feature vector. In our example above, the feature vector had a length of three (the country in which the card was issued, the number of countries where the card was used in the past day, and the payment amount in USD).

However, feature vectors with hundreds or thousands of features are not uncommon. In fact, Radar uses hundreds of features, and most of them are aggregates computed from across the Stripe network. As our network grows, each feature becomes more informative because our training data becomes more representative of the full range of values that feature can take, including activity outside Stripe. The output value (in our running example, the boolean indicating whether or not the transaction was fraudulent) is often called a target or label. The training data thus consists of a large number of feature vectors and their corresponding output values.

Second, we need to train a model. Given the training data, we need a method for producing our predictive model. Machine learning classifiers generally do not just output a class label—they typically assign probabilities that the given sample belongs to each possible class. For example, the output of a fraud classifier might be an assessment that the payment has a 65% chance of being fraudulent and a 35% chance of being legitimate.

There are many machine learning techniques that can be used to train models. For most industrial machine learning applications, traditional approaches like linear regression, decision trees, or random forests do just fine.

However, sophisticated techniques, namely neural nets and deep learning, inspired by the architecture of neurons in the brain, are responsible for many advances in the field, including AlphaFold’s predictions for 98% of all human proteins. The real advantages of neural nets only come when they’re trained on very large datasets, so in practice, many businesses aren’t able to take full advantage of them. Because of the size of our network, Stripe is able to take this more cutting-edge approach to deliver real results to our users. Our new models have improved Radar’s machine learning performance by over 20% year over year, helping us detect more fraud while keeping false positives low.

Feature engineering

One of the most involved parts of industrial machine learning is feature engineering. Feature engineering consists of two parts: (1) formulation of features that have predictive value based on extensive knowledge of the problem domain and (2) engineering to make the values of those features available both for model training and for model evaluation in “production.”

In formulating a feature, a Stripe data scientist may have a hunch that a useful feature would be to compute whether the card payment is coming from an IP address that is common for that card. For example, a card payment originating from IP addresses seen before (like the home or workplace of the cardholder) is less likely to be fraudulent than if the IP address was from a different state. In this case, the idea is intuitive, but generally these hunches come from examining thousands of cases of fraud. For example, you may be surprised to learn that computing the difference between the time on the user device and the current Coordinated Universal Time (UTC) or the count of countries in which the card was successfully authorized helps detect fraud.

Once we have the feature idea, we need to compute its historical values so that we can train a new model including the feature—this is the process of adding a new column to the “table” of data we use to produce our model. To do this for our candidate feature, for every payment in Stripe’s history, we need to compute the two most frequent IP addresses from which preceding payments were made with the card. We might do this in a distributed fashion with a Hadoop job, but even then we may find that the job takes too much time (or memory). We might then try optimizing the computation by using a space-saving probabilistic data structure. Even for features that are intuitively simple, producing data for model training requires dedicated infrastructure and established workflows.
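
As a rough illustration of the backfill described above, the following sketch computes, for each payment, the two most frequent IP addresses among that card's preceding payments. It is a single-machine toy, not the distributed job the text mentions; the function name, the `(card_id, ip)` record layout, and the sample addresses are all assumptions.

```python
from collections import Counter, defaultdict


def top_ips_before_each_payment(payments, k=2):
    """For each payment, list the k most frequent IPs in that card's earlier payments.

    `payments` is an iterable of (card_id, ip) pairs sorted oldest first.
    """
    seen = defaultdict(Counter)  # card_id -> counts of IPs seen so far
    rows = []
    for card_id, ip in payments:
        history = seen[card_id]
        rows.append([addr for addr, _count in history.most_common(k)])
        # Update only after emitting the row, so the feature never
        # includes the payment it is being computed for.
        history[ip] += 1
    return rows


payments = [
    ("card_1", "192.0.2.1"),
    ("card_1", "192.0.2.1"),
    ("card_1", "192.0.2.2"),
    ("card_2", "192.0.2.9"),
    ("card_1", "192.0.2.3"),
]
feature_column = top_ips_before_each_payment(payments)
# Derived boolean feature: is the payment coming from a common IP for the card?
ip_is_common = [ip in top for (_card, ip), top in zip(payments, feature_column)]
print(feature_column)
print(ip_is_common)
```

Keeping the counter update after the row is emitted matters: a training feature that includes the current payment would leak information the model will not have at prediction time.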

Not all features are handcrafted by engineers; some can be left for the model to compute with subsequent testing before deployment. Categorical values, such as the country of origin of a card or the merchant that processed a transaction (as opposed to numerical features), lend themselves well to this approach. These features often have a wide range of values, and defining a good representation for them can be challenging.

At Stripe, we train our models to learn an embedding for each merchant based on transaction patterns. An embedding can be thought of as the coordinates of the individual merchant compared to others. Similar merchants will often have similar embeddings (as measured by cosine distance), allowing the model to transfer learnings from one merchant to the next. The table below shows how these embeddings could look, given that Uber and Lyft are likely more similar to each other than to Slack. At Stripe, we use embeddings for a variety of categorical features, such as issuing bank, merchant and user country, day of the week, and more.

Illustrative embedding coordinates

Uber    2.34   1.1   -3.5
Lyft    2.1    1.2   -2
Slack   7      -2    1
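
The similarity claim can be checked directly on the illustrative coordinates above. This sketch computes cosine similarity (one minus the cosine distance mentioned earlier) in plain Python; the dictionary and function names are assumptions, and the three-dimensional vectors are the toy values from the table, not real Stripe embeddings.

```python
import math

# Illustrative coordinates from the table above (not real embeddings).
EMBEDDINGS = {
    "Uber": (2.34, 1.1, -3.5),
    "Lyft": (2.1, 1.2, -2.0),
    "Slack": (7.0, -2.0, 1.0),
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


uber_lyft = cosine_similarity(EMBEDDINGS["Uber"], EMBEDDINGS["Lyft"])
uber_slack = cosine_similarity(EMBEDDINGS["Uber"], EMBEDDINGS["Slack"])
print(f"Uber vs Lyft:  {uber_lyft:.2f}")
print(f"Uber vs Slack: {uber_slack:.2f}")
```

On these toy values, Uber and Lyft score roughly 0.97 while Uber and Slack score roughly 0.33, which matches the intuition that the two ride-sharing businesses sit closer together.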

The use of embeddings is increasingly common in large-scale industrial applications of machine learning. Word embeddings like these, for example, help capture the complex semantic relationships between words and have been involved in natural language processing milestones like Word2Vec, BERT, and GPT-3. Stripe produces embeddings to capture similarity relationships between different entities on the Stripe network the same way that the methods above capture similarities between words. Embeddings are a powerful way to learn higher-level concepts without explicit training. For example, fraud patterns are often unevenly distributed geographically. With embeddings, if our system identifies a new fraud pattern in Brazil, it can automatically identify the same pattern if it appears in the US, without further training. In this way, algorithmic advances help stay ahead of shifting fraud patterns, protecting our customers.

If you are interested in working on machine learning products at Stripe, get in touch!

Evaluating machine learning models

Once we’ve developed a machine learning classifier for fraud that uses hundreds of features and assigns a probability (or score) that the payment is fraud to every incoming transaction, we need to determine how effective the model is at detecting fraud.

Key terms

To better understand how we evaluate our machine learning systems, it’s useful to define some key terms.

Let’s start by supposing we’ve created a policy to block a payment if the machine learning model assigns the transaction a probability of being fraudulent greater than 0.7. (We write this as P(fraud)>0.7.) Here are some quantities useful for reasoning about the performance of our model and policy:

  • Precision: The precision of our policy is the fraction of transactions we block that are actually fraudulent. The higher the precision is, the fewer false positives there are. Let’s say out of 10 transactions, P(fraud)>0.7 for six and, of those six, four are actually fraudulent. The precision is then 4/6≈0.67.

  • Recall: Also known as sensitivity or the true positive rate, recall is the fraction of all fraud that is caught by our policy; that is, the fraction of fraud for which P(fraud)>0.7. The higher the recall is, the fewer false negatives there are. Let’s say out of 10 transactions, five are actually fraudulent. If four of these transactions are assigned a P(fraud)>0.7 by our model, then recall is 4/5=0.8.

  • False positive rate: The false positive rate is the fraction of all legitimate payments that are incorrectly blocked by our policy. Let’s say out of 10 transactions, five are legitimate. If two of these transactions are assigned a P(fraud)>0.7 by our model, then the false positive rate is 2/5=0.4.

While there are other quantities that are used when evaluating a classifier, we’ll focus on these three.
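
The three worked examples above fit one consistent set of 10 transactions: five fraudulent, five legitimate, six blocked. The sketch below recomputes the three quantities from such a set. The individual scores are made up for illustration; only the counts match the text.

```python
THRESHOLD = 0.7

# (model score, actually fraudulent?) for 10 hypothetical transactions.
TRANSACTIONS = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, True),
    (0.75, False), (0.72, False),
    (0.60, True),
    (0.40, False), (0.20, False), (0.10, False),
]


def confusion_counts(transactions, threshold):
    """Count true/false positives and negatives for a block-above-threshold policy."""
    tp = sum(1 for s, fraud in transactions if s > threshold and fraud)
    fp = sum(1 for s, fraud in transactions if s > threshold and not fraud)
    fn = sum(1 for s, fraud in transactions if s <= threshold and fraud)
    tn = sum(1 for s, fraud in transactions if s <= threshold and not fraud)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(TRANSACTIONS, THRESHOLD)
precision = tp / (tp + fp)            # blocked payments that were fraud
recall = tp / (tp + fn)               # fraud that was blocked
false_positive_rate = fp / (fp + tn)  # legitimate payments that were blocked

print(f"precision={precision:.2f} recall={recall:.2f} fpr={false_positive_rate:.2f}")
```

This prints a precision of 0.67, a recall of 0.80, and a false positive rate of 0.40, the same values as in the definitions above.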

Precision-recall and ROC curves

The next natural question is what good values are for the precision, recall, and false positive rate. In a theoretically ideal world, precision would be 1.0 (that is, 100% of transactions that you classify as fraud are actually fraud), which would make your false positive rate 0 (you didn’t incorrectly classify a single legitimate transaction as fraudulent), and recall would also be 1.0 (100% of fraud is identified as such).

In reality, there is a tradeoff between precision and recall—as you increase the probability threshold for blocking, precision will increase (since the criterion for blocking is more stringent) and recall will decrease (since fewer transactions match the high probability criterion). For a given model, a precision-recall curve captures the tradeoff between precision and recall as the policy threshold is varied:

As our model gets better overall (due to training on more and more data from across the Stripe network, adding features that are good predictors of fraud, and tweaking other model parameters), the precision-recall curve will change, as depicted in the example above. Because the curve controls the trade-off for businesses on Stripe, we closely monitor the impact on the precision-recall curve when our data scientists and machine learning engineers modify models.

When considering a precision-recall graph, it’s important to distinguish between two notions of “performance.” On its own, a model is better overall the closer its curve hugs the top right of the chart (that is, where precision and recall are both 1.0). However, operationalizing a model usually requires the selection of an operating point on the precision-recall curve (in our case, the policy threshold for blocking a transaction), which controls the concrete impact the model has on a business.

Put simply, there are two problems:

  • The data science problem of producing a good machine learning model by adding the right features. The model controls the shape of the precision-recall curve.

  • The business problem of picking a policy to decide how much potential fraud to block. The policy controls where on the curve we’re operating.
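
To illustrate the split between the two problems, this sketch holds a hypothetical model's scores fixed and sweeps only the policy threshold. Each threshold yields one operating point on the same precision-recall curve. The scores, labels, and thresholds are invented for illustration.

```python
# (model score, actually fraudulent?) for 10 hypothetical transactions.
SCORED = [
    (0.95, True), (0.90, True), (0.82, False), (0.80, True), (0.65, True),
    (0.60, False), (0.45, True), (0.40, False), (0.20, False), (0.10, False),
]
THRESHOLDS = (0.3, 0.5, 0.7, 0.85)


def precision_recall_at(scored, threshold):
    """Precision and recall of the policy 'block if score > threshold'."""
    blocked = [fraud for score, fraud in scored if score > threshold]
    total_fraud = sum(1 for _score, fraud in scored if fraud)
    true_positives = sum(blocked)
    # Convention: precision is 1.0 when nothing is blocked.
    precision = true_positives / len(blocked) if blocked else 1.0
    recall = true_positives / total_fraud
    return precision, recall


curve = [(t, *precision_recall_at(SCORED, t)) for t in THRESHOLDS]
for threshold, precision, recall in curve:
    print(f"threshold={threshold:.2f} precision={precision:.3f} recall={recall:.3f}")
```

On this toy data, raising the threshold moves precision up (0.625, 0.667, 0.75, 1.0) while recall falls (1.0, 0.8, 0.6, 0.4). Changing the model would change the points themselves; changing the threshold only chooses among them.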

Another curve that is examined when evaluating a machine learning model is the ROC curve. (ROC is short for “receiver operating characteristic,” a relic of the curve’s origin in signal processing applications.) The ROC curve plots the false positive rate on the x-axis against the true positive rate (which is the same as recall) on the y-axis for various values of the policy threshold.

The ideal ROC will hug the top left of the graph (where recall is 1.0 and the false positive rate is 0.0), and as the model improves, the ROC will move more in that direction. One way to capture the overall quality of the model is by computing the area under the curve (or AUC); in the ideal case, the AUC will be 1.0. When developing our models, we look to see how the precision-recall curve, the ROC curve, and the AUC change.
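
As a sketch of how the ROC curve and AUC can be computed, the code below lowers the threshold one score at a time, records the resulting (false positive rate, true positive rate) points, and integrates with the trapezoid rule. It assumes all scores are distinct, and the scored transactions are hypothetical.

```python
# (model score, actually fraudulent?) for 10 hypothetical transactions.
SCORED = [
    (0.95, True), (0.90, True), (0.82, False), (0.80, True), (0.65, True),
    (0.60, False), (0.45, True), (0.40, False), (0.20, False), (0.10, False),
]


def roc_points(scored):
    """Return (false positive rate, true positive rate) points from strictest to loosest threshold."""
    positives = sum(1 for _s, fraud in scored if fraud)
    negatives = len(scored) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is blocked
    tp = fp = 0
    # Lowering the threshold past each score (highest first) blocks one more payment.
    for _score, fraud in sorted(scored, key=lambda row: row[0], reverse=True):
        if fraud:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


auc = area_under_curve(roc_points(SCORED))
print(f"AUC = {auc:.2f}")
```

For this toy data the AUC is 0.84. A model that ranked every fraudulent payment above every legitimate one would score 1.0.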

Score distributions

Imagine that we have a model that randomly assigns a probability of fraud between 0.0 and 1.0 to a transaction. Practically, this model does nothing to discriminate between legitimate and fraudulent transactions and is of little use to us. This randomness is captured by the score distribution of the model—the fraction of transactions getting each possible score. In the completely random case, the score distribution would be close to uniform:

A model will have a uniform score distribution like the above if, for example, the model has no features that are even remotely predictive of fraud. As a model is improved—by adding predictive features, training on more data, and so forth—its power to discriminate between the fraudulent and legitimate classes will increase and the score distribution will become more bimodal, with peaks around the scores of 0.0 and 1.0.

On its own, a bimodal distribution does not tell you that a model is good. (A vacuous model that randomly assigns probabilities of just 0.0 and 1.0 would also have a bimodal score distribution.) However, in the presence of evidence that transactions with a low score are not fraudulent and transactions with a high score are fraudulent, an increasingly bimodal distribution is a sign of improved efficacy for a model.
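
A score distribution is a histogram of model scores. The sketch below builds one for two simulated score sets: one drawn uniformly at random and one drawn from a U-shaped Beta(0.3, 0.3) distribution as a stand-in for a bimodal model. Both samples are synthetic, and the choice of Beta parameters is an assumption made only to get peaks near 0.0 and 1.0.

```python
import random


def score_histogram(scores, bins=10):
    """Fraction of scores falling into each of `bins` equal-width buckets on [0, 1]."""
    counts = [0] * bins
    for score in scores:
        index = min(int(score * bins), bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[index] += 1
    return [count / len(scores) for count in counts]


rng = random.Random(0)  # fixed seed so the run is repeatable
uniform_scores = [rng.random() for _ in range(10_000)]
bimodal_scores = [rng.betavariate(0.3, 0.3) for _ in range(10_000)]

uniform_hist = score_histogram(uniform_scores)
bimodal_hist = score_histogram(bimodal_scores)
print("uniform:", [round(f, 2) for f in uniform_hist])
print("bimodal:", [round(f, 2) for f in bimodal_hist])
```

The uniform histogram has roughly a tenth of the scores in every bin, while the simulated bimodal one concentrates most of its mass in the first and last bins. As noted above, that shape alone says nothing about whether the scores are correct.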

Different models will often have different score distributions. When we release new models, we compare the old and updated distributions, in order to minimize any disruptive changes caused by a sudden shift in scores. In particular, we take into account merchants’ current block policies as measured by the threshold at which they block transactions, and aim to keep the proportion of transactions that falls above the threshold stable.

Computing precision and recall

We can compute the metrics above in two different contexts: during model training, using the historical data that drives the model development process, and after model deployment, using production data; that is, data from the world when the model is already being used to take action by, say, blocking transactions if P(fraud)>0.7.

For the former, data scientists will typically take the training data they have (see the table above) and randomly assign some fraction of the records to a training set and the other records to a validation set. For example, 80% of the rows might go into the former and the remaining 20% into the latter.

The training set is the data fed into a machine learning method to produce a model as described above. Once we have a candidate model, we can then use it to assign scores to each sample in the validation set. The validation set scores together with their output values are used to compute the ROC and precision-recall curves, the score distributions, and so forth. The reason we use a separate validation set that is held out from the training set is that the model has already “seen the answer” for its training examples and learned from these answers. A validation set helps us generate metrics that are an accurate measure of the predictive power of the model on new data.
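
A random 80/20 split like the one described can be sketched in a few lines. The function name, the default fraction, and the fixed seed are assumptions for illustration.

```python
import random


def train_validation_split(records, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


records = list(range(10))  # stand-ins for rows of the training table
train, validation = train_validation_split(records)
print(len(train), len(validation))
```

Every record lands in exactly one of the two sets, so the model is never scored on an example it was trained on.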

Machine learning operations: deploying models safely and frequently

Once a model has been shown to outperform the current production model on a held-out set, the next step is to deploy it to production. There are two key challenges to this process:

  • Real-time computations: We need to be able to compute the value of every feature for every new payment in real time because we want to be able to block all transactions that our classifier believes are likely to be fraudulent. This computation is entirely separate from the one used to produce training data—we need to maintain an up-to-date state on the two most frequently used IP addresses for every card ever seen at Stripe, and fetching and updating those counts needs to be fast because those operations happen as part of the Stripe API flow. Machine learning infrastructure teams at Stripe have made this easier by building systems to specify features in a declarative way and making the current values of the features available automatically in production with low latency.

  • Real-world user application: Deploying a machine learning model is different from deploying code. While code changes are often validated with precise test cases, model changes are usually tested on a large aggregate dataset using metrics such as the ones we defined above. But a model that is better at catching fraud in aggregate may not be better for every Stripe user. It may be that the improvement in performance is unevenly distributed, with a few large merchants seeing large gains while many small merchants see small regressions. A model may have higher recall but cause a spike in block rate, which would be disruptive to businesses (and their customers). Before we release a model, we verify that it performs well in practice. To do so, we measure the change each model would cause to a variety of metrics, such as false positive rate, block rate, and authorization rate on an aggregated and per merchant basis for a subset of Stripe users. If we find that a new model would cause an undesirable shift in one of those guardrail metrics, we adjust it for different subsets of users before releasing it to minimize disruptions and ensure optimal performance.
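
One way to picture a per-merchant guardrail check is to score the same payments with the old and new models and compare block rates merchant by merchant. The sketch below does this for block rate only; the record layout, the 0.7 threshold, and the 0.05 tolerance are hypothetical and are not Stripe's actual guardrails.

```python
from collections import defaultdict

# The same hypothetical payments scored by the current and the candidate model.
PAYMENTS = [
    {"merchant": "merchant_a", "old_score": 0.9, "new_score": 0.95},
    {"merchant": "merchant_a", "old_score": 0.2, "new_score": 0.1},
    {"merchant": "merchant_a", "old_score": 0.3, "new_score": 0.2},
    {"merchant": "merchant_a", "old_score": 0.1, "new_score": 0.1},
    {"merchant": "merchant_b", "old_score": 0.8, "new_score": 0.9},
    {"merchant": "merchant_b", "old_score": 0.6, "new_score": 0.8},
    {"merchant": "merchant_b", "old_score": 0.5, "new_score": 0.75},
    {"merchant": "merchant_b", "old_score": 0.1, "new_score": 0.1},
]


def block_rate_by_merchant(payments, score_key, threshold):
    """Fraction of each merchant's payments that would be blocked."""
    blocked = defaultdict(int)
    total = defaultdict(int)
    for payment in payments:
        total[payment["merchant"]] += 1
        if payment[score_key] > threshold:
            blocked[payment["merchant"]] += 1
    return {merchant: blocked[merchant] / total[merchant] for merchant in total}


def guardrail_violations(payments, threshold=0.7, max_shift=0.05):
    """Merchants whose block rate would move by more than max_shift."""
    old = block_rate_by_merchant(payments, "old_score", threshold)
    new = block_rate_by_merchant(payments, "new_score", threshold)
    return {m: (old[m], new[m]) for m in old if abs(new[m] - old[m]) > max_shift}


violations = guardrail_violations(PAYMENTS)
print(violations)
```

Here the aggregate view hides the problem: merchant_a is unaffected, but merchant_b's block rate would jump from 25% to 75%, which is the kind of uneven shift the per-merchant check is meant to surface.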

We’ve found that automating as much of the training and evaluation process as possible provides compounding benefits to model iteration speed. In the last year, we’ve invested in tooling to automatically and regularly train, tune, and evaluate models using our latest features and model architecture. For example, we continuously update performance dashboards after a model is trained—before it is released. That way, an engineer can easily detect if a model candidate has gotten stale on a subset of traffic before even releasing it and proactively retrain it.

After we release a model, we monitor its performance and start working on the next release. Because fraud trends change quickly, machine learning models quickly start experiencing drift: the data they were trained on is no longer representative of fraud today.

Using these tools, we’ve tripled the speed at which we release models, translating directly to large performance gains in production. In fact, even retraining a model from last month on more recent data (using the same feature definitions and architecture) and releasing it allows us to increase our recall by as much as half a percentage point each month. Being able to release models frequently and safely allows us to capitalize on and compound the gains of feature engineering and modeling work and adapt to changing fraud patterns for Radar users.

Once we put a model into production, we continuously monitor the performance of our model-policy pair. For payments that have scores below the threshold for blocking, we can observe the ultimate outcome—was the transaction disputed by the cardholder as fraud? Payments that have scores above the threshold, however, are blocked, and so we can’t know what their outcomes would have been. Computing the full production precision-recall or ROC curve is thus more involved than computing the validation curves because it involves counterfactual analysis—we need to obtain statistically sound estimates of what would have happened even to the payments we blocked. Over the years, Stripe has developed methods to do this, which you can learn more about in this talk.

We’ve just described a few of the measures of model efficacy that data scientists and machine learning engineers look at when developing machine learning models. Next, we’ll talk about how businesses should think about fraud prevention.

How Stripe can help

Fixating on just one number to capture your fraud performance may result in choices that are not optimal for your business. We’ve found that businesses will often overemphasize false negatives—they’re very concerned about fraud that is missed—and underemphasize false positives. This mindset often results in ineffective and costly brute-force measures like blocking all international cards. In general, you should be thinking about how all the various performance measures relate and what the right trade-offs are given your particular circumstances. Here’s an example of how these metrics tie together to help you determine the efficacy of your fraud prevention system:

APPROXIMATE MODEL FOR BREAK-EVEN PRECISION

If your average sale is $26 with a margin of 8%, your profit per sale is $26.00 × 8.00% = $2.08. On average, if your product costs $26.00 – $2.08 = $23.92 to produce and you’re levied a chargeback fee of $15, your total loss for a fraudulent sale is $23.92 + $15.00 = $38.92. Therefore, one fraudulent sale costs you the profit of $38.92 / $2.08 = 18.71 legitimate sales, and your break-even precision is 1 / (1 + 18.71) = 5.07%.
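
The arithmetic above generalizes to any sale size, margin, and fee. Blocking breaks even when the loss avoided on blocked fraud equals the profit forgone on blocked legitimate sales, which gives a break-even precision of profit / (profit + loss). The sketch below reproduces the worked example; the function and parameter names are assumptions.

```python
def break_even_precision(average_sale, margin, chargeback_fee):
    """Precision at which blocking neither gains nor loses money (approximate model)."""
    profit_per_sale = average_sale * margin
    cost_of_goods = average_sale - profit_per_sale
    loss_per_fraud = cost_of_goods + chargeback_fee
    # Number of legitimate sales needed to pay for one fraudulent sale.
    legit_sales_per_fraud = loss_per_fraud / profit_per_sale
    return 1 / (1 + legit_sales_per_fraud)


precision = break_even_precision(average_sale=26.00, margin=0.08, chargeback_fee=15.00)
print(f"{precision:.2%}")  # 5.07%
```

Under this approximate model, a policy is worth running only if more than about 1 in 20 of the payments it blocks is actually fraudulent. A higher margin raises that bar, which is why high-margin businesses should worry more about false positives.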

Radar’s machine learning thresholds trade off optimizing for merchants’ margins and keeping block rates stable across our user base. You can access a dashboard to see how Radar’s machine learning is performing for your business, as well as your custom rules performance if you’re using Radar for Fraud Teams. These tools enable you to easily compare your fraudulent dispute rates, false positive rates, and block rates to other similar businesses based on aggregated, custom cohorts of businesses that are in similar verticals or sizes to yours.

Improving performance with rules and manual reviews

With Radar for Fraud Teams, you can fine tune your protection by directly adjusting your risk threshold to block or allow more payments. Alongside the more automatic machine learning algorithms, Radar for Fraud Teams also lets individual businesses compose customized rules (for example, “block all transactions above $1,000 when the IP country does not match the card’s country”), request interventions, and manually review flagged payments in the Dashboard.

Such rules can be seen as simple “models” (they can be represented as decision trees, after all!), and they should be evaluated—with a full consideration of the tradeoff between precision and recall—in the same way as models. When you create a rule with Radar, we’ll present historical statistics on the number of matching transactions that were actually disputed, refunded, or accepted to help aid with these calculations before the rule is even implemented. Once live, you can see the impact on false positive and dispute rates by rule.
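
To show what evaluating a rule like a model means in practice, the sketch below expresses the example rule as a plain Python predicate and measures its precision and recall on a handful of made-up historical payments. This is ordinary Python, not Radar's rule syntax, and the field names and history are hypothetical.

```python
def rule_matches(payment):
    """Example rule: amount above $1,000 and IP country differs from card country."""
    return (
        payment["amount_usd"] > 1000
        and payment["ip_country"] != payment["card_country"]
    )


# Hypothetical historical payments with their eventual outcome.
HISTORY = [
    {"amount_usd": 1500, "ip_country": "BR", "card_country": "US", "disputed": True},
    {"amount_usd": 1200, "ip_country": "FR", "card_country": "US", "disputed": True},
    {"amount_usd": 2000, "ip_country": "CA", "card_country": "US", "disputed": False},
    {"amount_usd": 1800, "ip_country": "US", "card_country": "US", "disputed": False},
    {"amount_usd": 40, "ip_country": "DE", "card_country": "US", "disputed": True},
    {"amount_usd": 25, "ip_country": "US", "card_country": "US", "disputed": False},
]

matched = [p for p in HISTORY if rule_matches(p)]
disputed_matched = sum(p["disputed"] for p in matched)
disputed_total = sum(p["disputed"] for p in HISTORY)

rule_precision = disputed_matched / len(matched)  # matched payments that were disputed
rule_recall = disputed_matched / disputed_total   # disputed payments the rule catches
print(f"precision={rule_precision:.2f} recall={rule_recall:.2f}")
```

On this toy history the rule would have blocked three payments, two of them disputed, and missed one small disputed payment, so precision and recall both come out at about 0.67.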

Just as important, rules, interventions, and manual reviews allow users to change the shape of the precision-recall curve in their favor by adding in proprietary, business-specific logic (rules) or by expending some additional effort (manual review).

If you realize that the machine learning algorithms are frequently missing a certain type of fraud particular to your business (and that fraud is easily identifiable to you), you can compose a rule to block it. That specific intervention will typically increase recall with little cost to precision, in effect moving the operating point along a less steep, more favorable precision-recall curve.

By sending some classes of transactions to manual review instead of blocking them outright, you can gain precision without a hit to recall. Similarly, by sending some transactions to manual review instead of allowing them outright, you can gain recall without a hit to precision. Of course, in these cases, you are paying for these gains with additional human work (and exposing yourself to the accuracy of your team’s assessments), but having manual review, rules, and interventions to authenticate high-risk customers as additional tools gives you another lever to optimize fraud outcomes.

Next steps

We hope this guide helps you understand how machine learning is applied to fraud prevention at Stripe and how to gauge the efficacy of your fraud systems. You can learn more about Radar’s features or explore our docs.

If you have any questions or would like to learn more about Stripe Radar, please reach out.
