AUTHORS: Adnane Lahbabi, Ricardo Fernandez
Summary
We are releasing our first fraud-specific LLM benchmark. While there are plenty of LLM benchmarks available, it's crucial to understand a model’s performance, strengths, and weaknesses in the areas that matter for your usage. Since we focus on specific capabilities that aren’t always covered by general benchmarks, we created our own “Fraud IQ” Benchmark to evaluate the skills that are relevant to our use cases.
Why a Fraud IQ Benchmark?
There are already plenty of benchmarks ranking LLMs on a wide range of tasks such as answering math questions, reasoning, and coding. These benchmarks are useful because they help people choose the right model for their needs. But when it comes to the context we operate in, things can be quite different.
In particular, fighting fraud comes with its own set of subtleties. Detecting or proving fraud in complex situations can be tricky—it involves interpreting weak signals, finding patterns in the noise, and understanding the broader context. It’s a mix of legal violations, hidden intent, numbers not adding up, and more, often requiring you to piece together fragmented information to uncover the truth.
How it started
A few months after GPT-4 was released last year, I wrote an article about ChatGPT and fraud. As a data scientist with years of experience in fraud detection and a Certified Fraud Examiner (CFE), I was curious to see how ChatGPT would perform on the CFE exam. The CFE is a credential awarded by the Association of Certified Fraud Examiners, the world’s largest anti-fraud organization. The exam covers four areas: Financial Transactions and Fraud Schemes, Law, Investigation, and Fraud Prevention and Deterrence.
Overall, the results showed that GPT-4 had a surprisingly solid grasp of fraud concepts.
LLMs' capabilities with regard to fraud are part of what sparked my interest in exploring their potential to help fight fraud across industries. And at Alphaguard, we believe their impact is going to be huge: fraud prevention, detection, and investigation often involve dealing with a lot of unstructured data, from documents of all kinds to data gathered from online searches, all of which needs careful review.
Now, almost 18 months after GPT-4’s release, we’re ramping up to launch Alphaguard’s first product, expected at the end of November. As experts in using LLMs to fight fraud, we need to know which models are best for which tasks. Thus, creating evaluations and benchmarks is crucial for building the best possible product.
There are a lot of factors to consider when picking an LLM for a specific use. Models with larger context windows can process bigger documents without needing to split them up, while some models handle multiple languages better than others.
Latency and cost are also important, as going through thousands of pages of documents can quickly become prohibitively expensive with the larger models.
One crucial factor in fraud investigations is data sensitivity and privacy—many organizations can’t (and shouldn’t) send their data through APIs they don’t fully have control over. In these cases, using open source models in a completely controlled environment is a huge advantage.
The Fraud IQ Benchmark
In this article, we present the results of a selection of models on a dataset of ‘CFE-like’ questions.
The dataset is based on questions that were publicly released by the ACFE at some point in time (which comes with some limitations, discussed below). It contains 137 questions in total, with at least 28 questions for each of the four sections of the CFE exam.
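To make the evaluation concrete, here is a minimal sketch of the kind of harness used to run such a benchmark. This is illustrative rather than our actual pipeline: the question file, its schema, and the `query_model` callable are hypothetical placeholders.

```python
import json
from collections import defaultdict

SECTIONS = [
    "Financial Transactions and Fraud Schemes",
    "Law",
    "Investigation",
    "Fraud Prevention and Deterrence",
]

def build_prompt(question: dict) -> str:
    # Format one multiple-choice question; the schema (text/options/section) is hypothetical.
    options = "\n".join(f"{letter}. {text}" for letter, text in sorted(question["options"].items()))
    return (
        "Answer the following multiple-choice question about fraud examination.\n\n"
        f"{question['text']}\n{options}\n\n"
        "Reply with the letter of the correct answer only."
    )

def evaluate(questions: list[dict], query_model) -> dict[str, float]:
    # query_model(prompt) -> model reply as a string (e.g. a thin wrapper around an LLM API call)
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        reply = query_model(build_prompt(q)).strip().upper()
        total[q["section"]] += 1
        if reply[:1] == q["correct_option"]:  # grade on the first letter of the reply
            correct[q["section"]] += 1
    return {s: correct[s] / total[s] for s in SECTIONS if total[s]}

# Usage (hypothetical file name and model wrapper):
# questions = json.load(open("fraud_iq_questions.json"))
# section_scores = evaluate(questions, query_model=call_my_llm)
```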
While answering CFE questions alone doesn’t necessarily make someone a great investigator, the benchmark offers several key advantages:
- The exam covers four areas: Financial Transactions and Fraud Schemes, Law, Investigation, and Fraud Prevention and Deterrence, providing a straightforward way to compare models across these critical areas in a fraud context.
- It’s a challenging exam for humans that requires thorough preparation to pass (you can try out some sample questions here), and it lets us compare model performance to human CFEs, who must accurately answer 75% of questions in each section to pass.
- It’s a good indicator of models’ performance when reasoning with fraud concepts. As such, we can expect models that perform well to have the abstraction needed to find potential anomalies in real situations, as well as to explain them, which is crucial when uncovering wrongdoing.
- It’s relatively easy to compare models using this dataset; in particular, monitoring the performance of proprietary and open-source models alike gives us an indication of where we stand and what still needs to improve.
Results
Here are the results of the tested models on the benchmark:
We consider a model to pass if it accurately answers at least 75% of the questions in each section, which is the criterion used to obtain the CFE.
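As a sketch (reusing the hypothetical section_scores dictionary from the harness above), the pass rule requires 75% accuracy in every section, not just on average:

```python
PASS_THRESHOLD = 0.75  # the CFE pass mark: 75% correct in each of the four sections

def passes_cfe(section_scores: dict[str, float]) -> bool:
    # A model "passes" only if every section is at or above the threshold.
    return all(score >= PASS_THRESHOLD for score in section_scores.values())

# Hypothetical example: strong overall average, but failing because one section is below 75%.
# passes_cfe({"Law": 0.86, "Investigation": 0.71,
#             "Financial Transactions and Fraud Schemes": 0.89,
#             "Fraud Prevention and Deterrence": 0.82})  # -> False
```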
Key Takeaways
- Anthropic models (Claude 3 Opus and Claude 3.5 Sonnet) are the top-performing models, with a margin of a few points.
- The remarkable leap from GPT-3.5 to GPT-4, introduced as a major breakthrough, is striking here again: we go from a model that struggled to consistently provide correct answers to one that surpasses the scores of humans trained for the test.
- Models generally perform uniformly across the four categories.
- Open-weight models have closed the gap with GPT-4 a little more than a year after its release.
- Some “hard” questions with high failure rates across models test specific knowledge rather than hard reasoning; these are typically cases where RAG or fine-tuning on domain-specific knowledge could push a model’s performance higher.
About o1:
- o1-preview is the top-performing OpenAI model, with a small margin over chatgpt-4o-latest, but it is not the top-performing model overall.
- o1-preview leads significantly on the Financial Transactions and Fraud Schemes section, where its "enhanced reasoning capabilities" make a noticeable difference in analyzing complex, non-straightforward fraud scenarios involving multiple characters.
- On the other hand, in the Investigations section, o1-preview performs much worse than the other models, including much smaller ones. Looking at the answers, it seems that when faced with fraud investigation scenarios, o1’s chain of thought diverts from analyzing the situation, as if it were hesitant to assist fraudsters in evading investigation. This appears to be tied to the safety refusal boundaries in place, which, in this case, clearly impact the model's performance (see OpenAI’s “Learning to Reason with LLMs”: https://openai.com/index/learning-to-reason-with-llms/).
- We ran the test twice with o1-preview, noting, as mentioned by OpenAI in their release guide, that its performance can be affected by overly detailed instructions in its prompts, since it prefers to choose its own chain-of-thought path. For the second run, we adapted the prompts given to the other models to include fewer instructions (illustrated in the sketch below).
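To illustrate the difference between the two runs, here is a hedged sketch of the two prompt styles; the exact wording is hypothetical, not our production prompts.

```python
# Prompt style used for most models in the benchmark: explicit step-by-step instructions.
DETAILED_PROMPT = """You are a Certified Fraud Examiner answering an exam question.
Read the question carefully, reason about each option, eliminate the options that are
inconsistent with the facts, and finish with your final answer as a single letter.

{question}
{options}
"""

# Stripped-down variant used for the second o1-preview run: the model is left to plan
# its own chain of thought, and we only constrain the output format.
MINIMAL_PROMPT = """{question}
{options}

Answer with a single letter.
"""
```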
Open weight models comparison
Here is the comparison between the open-weight models, i.e., models whose weights can be downloaded and run on private servers to minimize the risk of sharing sensitive data. Their parameter counts range from 7 billion (Mistral 7B) to 405 billion (Llama 3.1 405B), hence the log scale on the x-axis. The best-performing open-weight model is the latest Mistral release, Mistral Large 2.
Other than that, we observe:
- Smaller LLMs (with fewer than a few dozen billion parameters) typically don’t pass the exam; the smallest open-weight model that passes is Llama 3 70B, which could run locally on a good Mac Studio or similar.
- Both Llama and Mistral provide open-weight models that pass.
Model Output Examples
Here are some example questions from our benchmark, with a selection of model answers.
This sample of questions and answers shows interesting potential uses of LLMs for assisting with risk mitigation and fraud investigations, while taking into account elements of human psychology.
Example #2 is interesting in how it frames the question to push for a 'True' response, which may seem like the obvious answer. It effectively discriminates between smaller models that fall into this trap and larger, more recent models that are less influenced by the phrasing and respond correctly.
Additional Notes and Limitations
While the benchmark results can help identify which model excels at specific exam topics, the current dataset and benchmark have limitations:
- The first issue is that some questions can be found online. Even though the CFE exam is relatively niche and unlikely to have been a focus for model developers, there is still a real risk of contamination: there is no guarantee that these questions or the exam material are absent from training datasets.
- Questions don’t discriminate between models enough, with many models giving the same answers to the same questions.
- As models improve, we will reach saturation and need more challenging benchmarks to distinguish between them.
- As mentioned earlier, passing the CFE exam doesn’t necessarily make someone a great fraud investigator, so more relevant anti-fraud tasks could provide better insights on real use cases.
About the Mistral Large 2 license: weights for the instruct model are available under the Mistral Research License / Mistral Commercial License and are also hosted on HuggingFace. “These models can be used for non-commercial purposes with no strings attached. A commercial license must be bought for commercial use cases requiring self-deployment.” (see here).
About the Meta Llama 3 Community License: similarly, the model weights are available here.
What’s Next
- A priority is ensuring confidence in this benchmark by reducing the risk of potential leakage of the questions. This could be addressed by recreating existing questions or working with questions that have never been published before.
- As some questions are “too easy” and some topics fail to discriminate between models effectively—especially as models approach perfect scores—we could filter out the easier questions, keeping only a selection of hard ones and adding additional, more challenging questions.
- We’re also developing benchmarks for different tasks, more closely related to real-world use cases.
- We are also adding more models, including fine-tuned models, models with engineered chain-of-thought, and models with access to specific knowledge bases, to see how well smaller models can catch up with additional tuning.
How We Use Generative AI at Alphaguard
At Alphaguard, we’re building AI agents powered by LLMs to empower human agents to fight fraud more efficiently, particularly in tedious and document-heavy tasks. We are taking into consideration all the requirements pertaining to fraud fighting, including explainability and keeping sensitive data secure.
As shown in the results above, using 'raw' models does not guarantee perfect accuracy, so we are fine-tuning and engineering AI agents for optimal performance. We’re ramping up to launch Alphaguard’s first product, expected at the end of November.