September 17, 2024

Releasing Our First Fraud-Specific LLM Benchmark

AUTHORS: Adnane Lahbabi, Ricardo Fernandez

Summary

We are releasing our first Fraud-specific LLM Benchmark. While there are plenty of LLM benchmarks available, it's crucial to understand a model’s performance, strengths, and weaknesses in the areas that matter for your usage. Since we are focused on specific capabilities that aren’t always covered by general benchmarks, we created our own “Fraud IQ” Benchmark to evaluate skills that are relevant for our uses.

Why a Fraud IQ Benchmark?

There are already plenty of benchmarks ranking LLMs on a wide range of tasks such as answering math questions, reasoning, and coding. These benchmarks are useful because they help people choose the right model for their specific needs. But when it comes to applying LLMs in the context we operate in, things can be quite different.
In particular, fighting fraud comes with its own set of subtleties. Detecting or proving fraud in complex situations can be tricky—it involves interpreting weak signals, finding patterns in the noise, and understanding the broader context. It’s a mix of legal violations, hidden intent, numbers not adding up, and more, often requiring you to piece together fragmented information to uncover the truth.

How it started

A few months after GPT-4 was released last year, I wrote an article about ChatGPT and fraud. As a data scientist with years of experience in fraud detection and a Certified Fraud Examiner (CFE), I was curious to see how ChatGPT would perform on the CFE exam. The CFE is a credential awarded by the Association of Certified Fraud Examiners, the world’s largest anti-fraud organization. The exam covers four areas: Financial Transactions and Fraud Schemes, Law, Investigation, and Fraud Prevention and Deterrence.
Overall, the results showed that GPT-4 demonstrated a solid grasp of fraud concepts.
LLMs' capabilities with regard to fraud are part of what sparked my interest in exploring their potential to help fight fraud across industries. And at Alphaguard, we believe their impact is going to be huge: fraud prevention, detection, and investigation often involve dealing with a lot of unstructured data—documents of all kinds and data gathered from online searches that need careful review.
Now, almost 18 months after GPT-4’s release, we’re ramping up to launch Alphaguard’s first product, expected at the end of November. As experts in using LLMs to fight fraud, we need to know which models are best for which tasks. Thus, creating evaluations and benchmarks is crucial for building the best possible product.
There are a lot of factors to consider when picking an LLM for a specific use. Models with larger context windows can process bigger documents without needing to split them up, while some models handle multiple languages better than others.
Latency and cost also matter, as going through thousands of pages of documents can quickly become prohibitively expensive with the larger models.
One crucial factor in fraud investigations is data sensitivity and privacy—many organizations can’t (and shouldn’t) send their data through APIs they don’t fully have control over. In these cases, using open source models in a completely controlled environment is a huge advantage.

The Fraud IQ Benchmark

In this article, we present the results of a selection of models on a dataset of ‘CFE-like’ questions.
The dataset is based on questions that were publicly released by the ACFE at some point in time (which comes with some limitations, discussed below). It contains 137 questions in total, with at least 28 questions for each section of the CFE exam.
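To make the structure concrete, here is a minimal sketch of the kind of record each benchmark question could be stored as; the field names and the Python representation are illustrative assumptions for this sketch, not a published format (the sample question shown comes from the examples later in this article).

```python
# Illustrative record structure for one benchmark question; field names are
# assumptions chosen for this sketch, not a published schema.
from dataclasses import dataclass


@dataclass
class BenchmarkQuestion:
    section: str              # e.g. "Fraud Prevention and Deterrence"
    question: str             # the question stem
    options: dict[str, str]   # option letter -> option text
    answer: str               # official correct option letter


example = BenchmarkQuestion(
    section="Fraud Prevention and Deterrence",
    question=("Like most other types of fraud, the motivation for financial "
              "statement fraud almost always involves personal gain."),
    options={"a": "True", "b": "False"},
    answer="b",
)
```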
While answering CFE questions alone doesn’t necessarily make someone a great investigator, the benchmark offers several key advantages:
  • The exam covers four areas: Financial Transactions and Fraud Schemes, Law, Investigation, and Fraud Prevention and Deterrence, providing a straightforward way to compare models across these critical areas in a fraud context.
  • It’s a challenging exam for humans—you need to prepare thoroughly to pass. You can try out some sample questions here. And it helps compare model performance to human CFEs, who must accurately answer 75% of questions in each section to pass.
  • It is a good indicator of a model's ability to reason with fraud concepts. As such, we can expect models that perform well to have the abstraction needed to spot potential anomalies in real situations, as well as to explain them, which is crucial when uncovering wrongdoing.
  • It is relatively easy to compare models using this dataset. In particular, monitoring the performance of proprietary and open-source models alike gives us an indication of where we stand and what still needs to be improved.

Results

Here are the results of the tested models on the benchmark:
A model passes if it accurately answers at least 75% of the questions in each section, which is the criterion used to obtain the CFE credential (a minimal scoring sketch follows the table).
| Model | General Score | Pass/Fail | Financial Transactions | Fraud Prevention | Investigations | Law | License |
|---|---|---|---|---|---|---|---|
| mistral-7b (🇫🇷) | 61.31 | Fail | 69.39 | 60.71 | 64.29 | 46.875 | Apache 2.0 |
| mistral-large (🇫🇷) | 84.67 | Pass | 83.67 | 92.86 | 75.00 | 87.500 | Mistral Research License |
| mistral-large-2 (🇫🇷) | 91.97 | Pass | 91.84 | 96.43 | 85.71 | 93.750 | Mistral Research License |
| mixtral-8x7b (🇫🇷) | 75.91 | Fail | 69.39 | 82.14 | 75.00 | 81.250 | Apache 2.0 |
| llama3.1-405b | 87.59 | Pass | 83.67 | 96.43 | 82.14 | 90.625 | Meta Llama 3 Community License Agreement |
| llama3-70b | 81.75 | Pass | 75.51 | 96.43 | 75.00 | 84.375 | Meta Llama 3 Community License Agreement |
| llama3-8b | 62.04 | Fail | 63.27 | 64.29 | 60.71 | 59.375 | Meta Llama 3 Community License Agreement |
| llama3.1-70b | 85.40 | Pass | 79.59 | 89.29 | 85.71 | 90.625 | Meta Llama 3 Community License Agreement |
| llama3.1-8b | 67.15 | Fail | 63.27 | 71.43 | 64.29 | 71.875 | Meta Llama 3 Community License Agreement |
| claude-3-haiku | 75.91 | Fail | 69.39 | 78.57 | 85.71 | 75.000 | Proprietary |
| claude-2.1 | 78.83 | Pass | 75.51 | 85.71 | 78.57 | 78.125 | Proprietary |
| claude-3.5-sonnet | 94.16 | Pass | 93.88 | 96.43 | 92.86 | 93.750 | Proprietary |
| claude-3-opus | 94.89 | Pass | 91.84 | 92.86 | 100.00 | 96.875 | Proprietary |
| claude-3-sonnet | 80.29 | Pass | 75.51 | 89.29 | 78.57 | 81.250 | Proprietary |
| gemini-1.0-pro | 74.45 | Fail | 77.55 | 82.14 | 64.29 | 71.875 | Proprietary |
| gemini-1.5-flash | 87.59 | Pass | 87.76 | 96.43 | 78.57 | 87.500 | Proprietary |
| gemini-1.5-pro | 86.86 | Pass | 85.71 | 92.86 | 85.71 | 84.375 | Proprietary |
| o1-mini (run1) | 80.43 | Fail | 77.55 | 85.71 | 72.41 | 87.500 | Proprietary |
| o1-preview (run1) | 92.09 | Pass | 95.92 | 96.43 | 76.67 | 96.875 | Proprietary |
| o1-mini (run2) | 81.16 | Fail | 77.55 | 96.43 | 68.97 | 84.375 | Proprietary |
| o1-preview (run2) | 92.75 | Pass | 97.96 | 92.86 | 79.31 | 96.875 | Proprietary |
| gpt-4o-mini | 80.29 | Pass | 77.55 | 85.71 | 75.00 | 84.375 | Proprietary |
| gpt-4-turbo | 89.78 | Pass | 89.80 | 92.86 | 85.71 | 90.625 | Proprietary |
| gpt-4-turbo-preview | 89.05 | Pass | 85.71 | 92.86 | 85.71 | 93.750 | Proprietary |
| gpt-4 | 89.78 | Pass | 91.84 | 96.43 | 82.14 | 87.500 | Proprietary |
| gpt-3.5-turbo | 78.83 | Fail | 73.47 | 92.86 | 67.86 | 84.375 | Proprietary |
| gpt-4o | 91.24 | Pass | 87.76 | 92.86 | 92.86 | 93.750 | Proprietary |
| chatgpt-4o-latest | 91.97 | Pass | 91.84 | 92.86 | 89.29 | 93.750 | Proprietary |
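For reference, here is a minimal sketch of the scoring logic described above. It assumes per-question results are available as (section, is_correct) pairs, which is an illustrative representation rather than our actual evaluation code.

```python
# Minimal scoring sketch: per-section accuracy, overall accuracy, and the
# CFE-style pass criterion (at least 75% correct in every section).
from collections import defaultdict

PASS_THRESHOLD = 75.0  # percent correct required in each section


def score_model(results: list[tuple[str, bool]]) -> dict:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for section, is_correct in results:
        total[section] += 1
        correct[section] += int(is_correct)
    section_scores = {s: 100.0 * correct[s] / total[s] for s in total}
    general_score = 100.0 * sum(correct.values()) / sum(total.values())
    passed = all(score >= PASS_THRESHOLD for score in section_scores.values())
    return {
        "general": round(general_score, 2),
        "sections": section_scores,
        "pass": passed,
    }
```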

Key Takeaways

  • Anthropic models (Claude 3 Opus and Claude 3.5 Sonnet) are the top-performing models, leading by a margin of a few points.
  • The remarkable leap from GPT-3.5 to GPT-4, introduced as a major breakthrough at the time, is striking here again: GPT-3.5 falls short of the passing bar, while GPT-4 surpasses the scores of humans trained for the test.
  • Models generally perform uniformly across the four categories.
  • Open weight models have closed the gap with GPT-4 a little more than a year after its release.
  • Some “hard” questions with high failure rates across models require specific knowledge rather than hard reasoning; these are typically cases where RAG or fine-tuning on domain-specific knowledge could push a model's performance higher.

About o1:

  • o1-preview is the top-performing OpenAI model, with a small margin over chatgpt-4o-latest, but not the top-performing model overall.
  • o1-preview leads significantly on the Financial Transactions and Fraud Schemes section, where its "enhanced reasoning capabilities" make a noticeable difference in analyzing complex situations involving non-straightforward fraud scenarios with multiple characters in the questions.
  • On the other hand, in the Investigations section, o1-preview performs much worse than other models, including much smaller ones. Looking at the answers, it seems that when faced with fraud investigation scenarios, o1's chain of thought drifts away from analyzing the situation, as if it were hesitant to assist fraudsters in evading investigation. This appears to be tied to the safety refusal boundaries in place, which, in this case, clearly impact the model's performance (see OpenAI's "Learning to Reason with LLMs": https://openai.com/index/learning-to-reason-with-llms/).
  • We ran the test twice with o1-preview, noting, as mentioned in OpenAI's release guidance, that its performance can be affected by overly detailed guidelines in its prompts, since it prefers to choose its own chain-of-thought path. For the second run, we adapted the prompts used for the other models to include fewer instructions.
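To illustrate the kind of change involved, here are a detailed template and a stripped-down variant. These are simplified stand-ins written for this article, not the actual prompts used in the runs.

```python
# Illustrative prompt variants only -- not the actual benchmark prompts.
# The detailed template spells out the reasoning steps; the minimal variant
# used for the second o1-preview run leaves the chain of thought to the model.
DETAILED_PROMPT = """You are a Certified Fraud Examiner answering an exam question.
Think step by step: restate the scenario, eliminate wrong options one by one,
then answer. Respond with the option letter followed by a short explanation.

Question: {question}
Options: {options}"""

MINIMAL_PROMPT = """Answer the following multiple-choice question.
Reply with the option letter and a brief justification.

Question: {question}
Options: {options}"""
```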

Open weight models comparison

Here is the comparison between the open-weight models, i.e., models that can be run on private servers to minimize the risk of sharing sensitive data. The number of parameters ranges from 7 billion (Mistral 7B) to 405 billion (Llama 3.1 405B), hence the use of a log scale for the x-axis. The best-performing open-weight model is Mistral's latest release, Mistral Large 2.
Other than that, we observe:
  • Smaller LLMs (those with fewer than a few dozen billion parameters) don't typically pass the exam; the smallest open-weight model to pass is Llama 3 70B, which could run locally on a high-end Mac Studio or similar hardware.
  • Both Llama and Mistral provide open-weight models that pass the exam (a minimal sketch of querying such a locally hosted model follows this list).
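As an example of what a fully controlled setup can look like, here is a minimal sketch of querying a locally hosted open-weight model through Ollama's REST API. The endpoint and model tag reflect a default local Ollama installation and are illustrative; this is not part of our benchmark harness.

```python
# Minimal sketch: query a locally hosted open-weight model via Ollama's
# /api/generate endpoint, keeping all data on the local machine.
import requests


def ask_local_model(prompt: str, model: str = "llama3:70b") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]
```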

Model Output Examples

Here are some example questions from our benchmark, along with a selection of model answers.
This sample of questions and answers shows interesting potential uses of LLMs for assisting with risk mitigation and fraud investigations, while taking into account elements of human psychology.
Example #2 is interesting in how it frames the question to push for a 'True' response, which may seem like the obvious answer. It effectively discriminates between smaller models that fall into this trap and larger, more recent models that are less influenced by the phrasing and respond correctly.
Benchmark Questions
Category: Financial Transactions and Fraud Schemes

Example #1: Which of the following customer due diligence (CDD) procedures would be MOST APPROPRIATE for an organization to perform if it determines that a potential customer has little opportunity to commit fraud and therefore presents a minimal risk of engaging in illegal activity?

  • (a) Contacting the customer’s bank
  • (b) Verifying the customer’s identity
  • (c) Analyzing the customer’s net worth
  • (d) Identifying the customer
Official Explanation: Simplified due diligence procedures, such as identifying the customer, are appropriate when the potential customer presents a minimal risk of engaging in illegal activity. Correct Answer: (D)

Model Answer: d

Model Explanation: When a potential customer is determined to have little opportunity to commit fraud and presents a minimal risk of engaging in illegal activity, the most appropriate customer due diligence (CDD) procedure is to identify the customer. This basic step ensures that the organization knows who they are dealing with, without the need for more extensive measures like contacting the customer's bank or analyzing their net worth, which are more suitable for higher risk customers. The level of CDD should be proportionate to the assessed risk, and in this case, identifying the customer is sufficient given their low-risk profile.

✔️ This answer is correct.

Model Answer: b

Model Explanation: The correct answer is b) Verifying the customer's identity. This is a basic CDD procedure that is required for all customers, regardless of their risk profile. Verifying the customer's identity helps to ensure that the organization is dealing with a legitimate individual or entity and is an essential step in preventing fraud and other illicit activities.

❌ This answer is incorrect.

Model Answer: b

Model Explanation: Verifying the customer's identity is a crucial step in the CDD process, especially for low-risk customers. This involves confirming the customer's name, date of birth, and other identifying information to ensure that they are genuine and not attempting to hide their true identity.

❌ This answer is incorrect.

Model Answer: d

Model Explanation: Identifying the customer (d) is the most appropriate CDD procedure for a low-risk customer. This involves collecting basic information about the customer, which is a fundamental step in any CDD process. It is the least intrusive and most basic measure, making it suitable for customers who present minimal risk of engaging in illegal activity.

✔️ This answer is correct.

Model Answer: d

Model Explanation: Identifying the customer is the most suitable CDD procedure for a low-risk customer. This process involves collecting and verifying the customer's name, address, date of birth, and other relevant information to ensure that the organization has accurate and complete information about its customers. This step is essential for maintaining a good business relationship and complying with various regulations.

✔️ This answer is correct.

Model Answer: d

Model Explanation: For customers who present minimal risk, organizations can perform simplified due diligence, which involves identifying the customer by collecting basic information such as name and address. This satisfies regulatory requirements without the need for more rigorous procedures like identity verification or financial analysis, which are reserved for higher-risk scenarios.

✔️ This answer is correct.

Category: Fraud Prevention and Deterrence

Example #2: Like most other types of fraud, the motivation for financial statement fraud almost always involves personal gain.

  • (a) True
  • (b) False
Official Explanation: Unlike some other types of fraud (such as embezzlement), the motivation for financial fraud does not always involve personal gain. Most commonly, financial statement fraud is used to make a company’s earnings look better on paper. Some of the more common reasons why people commit financial statement fraud include: to encourage investment through the sale of stock, to demonstrate increased earnings per share or partnership profits interest, to cover an inability to generate cash flow, or to dispel negative market perceptions. Correct Answer: (B)

Model Answer: b

Model Explanation: The statement "Like most other types of fraud, the motivation for financial statement fraud almost always involves personal gain" is false. While personal gain can be a motivation for financial statement fraud, it is not the only or always the primary reason. Financial statement fraud is often committed due to pressure to meet market expectations, secure funding, conceal financial issues, avoid regulatory consequences, or maintain a positive public image. The motivations behind financial statement fraud are more complex and varied than just personal gain.

✔️ This answer is correct.

Model Answer: b

Model Explanation: The motivation for financial statement fraud is not limited to personal gain. It can also be committed to benefit the company, such as by making it appear more financially stable than it is, which could be for the company's survival rather than direct personal gain.

✔️ This answer is correct.

Model Answer: a

Model Explanation: The statement that the motivation for financial statement fraud almost always involves personal gain is generally true. Personal gain is a common motivator for individuals who commit financial statement fraud, as they seek to enrich themselves through various means such as embezzlement or insider trading.

❌ This answer is incorrect.

Model Answer: b

Model Explanation: Financial statement fraud can be motivated by a variety of factors, not just personal gain. These can include meeting performance targets, avoiding penalties, or maintaining the company's reputation. Therefore, the statement that the motivation for financial statement fraud almost always involves personal gain is false.

✔️ This answer is correct.

Model Answer: a

Model Explanation: The statement is true. Financial statement fraud is committed to deceive users of the financial statements, usually for personal gain.

❌ This answer is incorrect.

Model Answer: b

Model Explanation: The statement is false because the motivation for financial statement fraud does not almost always involve personal gain. While personal gain is a common factor, financial statement fraud can also be motivated by organizational pressures and aims, such as meeting financial benchmarks, avoiding negative market reactions, or complying with loan covenants. These motivations extend beyond personal benefit and can be driven by a desire to protect or enhance the company's financial standing.

✔️ This answer is correct.

Category: Law

Example #3: Generally, to establish an illegal gratuity violation, the government must prove that the gratuity was given for the purpose of influencing an official act.

  • (a) True
  • (b) False
Official Explanation: An illegal gratuity charge doesn’t require proof that the gratuity was given for the purpose of influencing an official act. An illegal gratuity charge only requires that the gratuity be given for, or because of, an official act. Correct Answer: (B)

Model Answer: b

Model Explanation: The statement is false. To establish an illegal gratuity violation, the government does not need to prove that the gratuity was given for the purpose of influencing an official act. Instead, the government must show that the gratuity was given because of an official act that the public official has performed, is currently performing, or will perform in the future. The key difference between an illegal gratuity and bribery is that an illegal gratuity does not require proof of intent to influence the official's actions, whereas bribery does.

✔️ This answer is correct.

Model Answer: b

Model Explanation: The statement is false. The government does not necessarily need to prove that the gratuity was given for the purpose of influencing an official act. Instead, it must show that the gratuity was connected to an official act, which could include a past act. The focus is on the link between the gratuity and the official's duties or actions, rather than the intent to influence future actions.

✔️ This answer is correct.

Model Answer: a

Model Explanation: The government must prove that the gratuity was given with the intent to influence an official act, which is a key element in establishing an illegal gratuity violation. This requires showing that the gratuity was given in exchange for some action or decision made by the official, and that the official had knowledge of the intent behind the gratuity.

❌ This answer is incorrect.

Model Answer: b

Model Explanation: The statement is false because the government does not need to prove that the gratuity was given for the purpose of influencing an official act. Instead, the government must show that the gratuity was given for or because of an official act, which is a different standard. This distinction is important in legal proceedings related to illegal gratuities.

✔️ This answer is correct.

Model Answer: a

Model Explanation: The statement is true. In order to prove an illegal gratuity violation, the government must show that the gratuity was given with the specific intent to influence an official act.

❌ This answer is incorrect.

Model Answer: b

Model Explanation: The statement is false. To establish an illegal gratuity violation, the government does not need to prove that the gratuity was given with the purpose of influencing an official act. Unlike bribery, which requires proof of corrupt intent to influence an official act, an illegal gratuity only requires that something of value was given to a public official because of an official act performed or to be performed. The crucial difference is that an illegal gratuity does not involve a quid pro quo or intent to influence; it is sufficient that the gratuity was given in connection with an official act.

✔️ This answer is correct.

Category: Investigations

Example #4: Black, a Certified Fraud Examiner, is conducting an admission-seeking interview of Red, a fraud suspect. Black asks Red: "Did you do this because you were mad, or was there a reason you did this?" Red replies: "I had a reason." Which of the following is the most appropriate response for Black to make?

  • (a) "It couldn't have been a very good reason."
  • (b) "Everyone says that."
  • (c) "I'm glad you had a good reason to do this."
  • (d) "I don't believe you."
Official Explanation: Once the benchmark admission has been made, the interviewer should reinforce the confessor's decision by returning to the theme for his rationalization. This will help the confessor feel comfortable and will let him know that the interviewer does not look down upon him. For example: "I am glad to hear that you had a good reason to do this. That reinforces what I thought all along—that you were caught up in extraordinary circumstances." Correct Answer: (C)

Model Answer: c

Model Explanation: The best response for Black is to say "I'm glad you had a good reason to do this." This validates Red's statement without being judgmental, and encourages Red to share more details about the reason behind the fraudulent actions. By showing empathy and open-mindedness, Black can build rapport with Red and increase the likelihood of obtaining truthful information during the interview.

✔️ This answer is correct.

Model Answer: No answer

Model Explanation: The most appropriate response for Black would be to acknowledge Red's statement and encourage further explanation without expressing disbelief or skepticism. This approach helps maintain a rapport and increases the likelihood of Red providing more details about the reason behind the actions.

❌ This answer is incorrect.

Model Answer: c

Model Explanation: The most appropriate response for Black to make is to acknowledge and appreciate the suspect's willingness to provide a reason for their actions. This approach helps to maintain a positive and non-confrontational atmosphere, which is essential for an admission-seeking interview. By expressing gratitude for the suspect's honesty, Black can encourage them to continue providing information and may increase the likelihood of obtaining a full and truthful admission.

✔️ This answer is correct.

Model Answer: c

Model Explanation: In an admission-seeking interview, it is crucial to maintain a non-judgmental and open approach to encourage the suspect to provide more information. Option c, "I'm glad you had a good reason to do this," acknowledges Red's statement and encourages further explanation without being dismissive or confrontational. This response is most likely to keep the conversation going and potentially lead to more information being revealed.

✔️ This answer is correct.

Model Answer: c

Model Explanation: The interviewer's response should be open-ended and non-confrontational to encourage the suspect to provide more information. Option a: "It couldn't have been a very good reason" is dismissive and could discourage the suspect from sharing the reason. Option b: "Everyone says that" is not relevant to the interview and could distract from the focus on the suspected fraud. Option c: "I'm glad you had a good reason to do this" is misleading and could be perceived as accepting the fraudulent activity. Option d: "I don't believe you" is confrontational and could close down the interview prematurely.

✔️ This answer is correct.

Model Answer: c

Model Explanation: In an admission-seeking interview, the primary goal is to encourage the suspect to fully confess by creating a supportive and non-confrontational environment. When Red responds with "I had a reason," Black should respond in a way that acknowledges and validates Red's feelings, encouraging him to elaborate on his motive. Option c, "I'm glad you had a good reason to do this," shows empathy and invites Red to share more details about his reasons for committing the fraud. This approach is more effective than confrontational or dismissive responses, which might cause the suspect to become defensive or shut down the conversation.

✔️ This answer is correct.

Additional Notes and Limitations

While the benchmark results can help identify which model excels at specific exam topics, the current dataset and benchmark have limitations:
  • The first issue is that some questions can be found online. Even though the CFE exam is relatively niche and unlikely to have been a focus for model developers, there is still a real risk of contamination: we cannot guarantee that the questions or the exam material are absent from some training datasets (a rough probe for this is sketched after this list).
  • The questions don't discriminate between models strongly enough, with many models giving the same answers to the same questions.
  • As models improve, we will reach saturation and need more challenging benchmarks to distinguish between them.
  • As mentioned earlier, passing the CFE exam doesn’t necessarily make someone a great fraud investigator, so more relevant anti-fraud tasks could provide better insights on real use cases.
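As one crude way to probe for the contamination risk mentioned in the first point, you can feed a model the first half of a question and check whether it reproduces the official continuation near-verbatim. This is a heuristic sketch under our own assumptions, not a definitive test, and `complete_fn` is a placeholder for any raw completion call.

```python
# Rough contamination probe: if a model completes the second half of a question
# near-verbatim from its first half, the question may have been memorized.
# `complete_fn` is any callable that returns a raw completion for a prompt.
from difflib import SequenceMatcher


def looks_memorized(question: str, complete_fn,
                    prefix_ratio: float = 0.5, threshold: float = 0.8) -> bool:
    cut = int(len(question) * prefix_ratio)
    prefix, expected_tail = question[:cut], question[cut:]
    generated = complete_fn(prefix)[: len(expected_tail)]
    similarity = SequenceMatcher(None, expected_tail, generated).ratio()
    return similarity >= threshold
```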
About the Mistral Large 2 license: weights for the instruct model are available and hosted on Hugging Face under the Mistral Research License / Mistral Commercial License: “These models can be used for non-commercial purposes with no strings attached. A commercial license must be bought for commercial use cases requiring self-deployment.”, see here.
About the Meta Llama 3 Community License: similarly, the model weights are available here.

What’s Next

  • A priority is ensuring confidence in this benchmark by reducing the risk of potential leakage of the questions. This could be addressed by recreating existing questions or working with questions that have never been published before.
  • As some questions are “too easy” and some topics fail to discriminate between models effectively (especially as models approach perfect scores), we could filter out the easier questions, keeping only a selection of hard ones and adding additional, more challenging questions (see the filtering sketch after this list).
  • We’re also developing benchmarks for different tasks, more closely related to real-world use cases.
  • We are also adding more models, including fine-tuned models, models with engineered chain-of-thought, and models with access to specific knowledge bases, to see how well smaller models can catch up with additional tuning.
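As a sketch of the filtering step mentioned above, here is one way to keep only the questions that still discriminate between models. The data layout (a mapping from model name to per-question correctness) is an assumption made for illustration.

```python
# Keep only questions that at least some models still get wrong, i.e. questions
# that discriminate between models. `results` maps model -> {question_id: bool}.
def discriminating_questions(results: dict[str, dict[str, bool]],
                             max_accuracy: float = 0.9) -> list[str]:
    models = list(results)
    kept = []
    for qid in results[models[0]]:
        accuracy = sum(results[m].get(qid, False) for m in models) / len(models)
        if accuracy <= max_accuracy:  # drop questions nearly everyone gets right
            kept.append(qid)
    return kept
```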

How We Use Generative AI at Alphaguard

At Alphaguard, we’re building AI agents powered by LLMs to empower human agents to fight fraud more efficiently, particularly in tedious and document-heavy tasks. We are taking into consideration all the requirements pertaining to fraud fighting, including explainability and keeping sensitive data secure.
As shown in the results above, using 'raw' models does not guarantee perfect accuracy, so we are fine-tuning and engineering AI agents for optimal performance. We're ramping up to launch Alphaguard's first product, expected at the end of November.

If you found this content insightful and want to stay updated on our Fraud Benchmark releases, follow us on LinkedIn. Feel free to start a discussion about the benchmark and how you can use generative AI to fight fraud. And get in touch if you want to discuss how generative AI can help automate the time-consuming manual workflows in your organization.