Benchmarking LLMs for Fraud Detection

Learn how three benchmarking tools (an AWS Bedrock tool built with AWS CDK, a LangChain Python pipeline, and a Node.js API suite) evaluate LLMs on receipt fraud detection using OCR, embeddings, and RAG.

Overview

Fraud detection on receipt images is a complex challenge that blends computer vision, OCR, embeddings, retrieval-augmented generation, and prompt engineering. In this talk, I’ll share how I built three different benchmarking frameworks to evaluate large language models (LLMs) on fraud detection tasks:

AWS Bedrock Benchmarking Tool with AWS CDK – scalable infrastructure for structured testing of foundation models.
Local LLM Benchmarking with LangChain + Python – flexible experimentation pipeline for embeddings, OCR, and RAG.
API-based LLM Benchmarking (Node.js + LLM APIs) – direct model comparisons across OpenAI, Anthropic, and others.

These tools were used to process and analyze receipt images with a mix of OCR, embeddings, RAG, ground-truth human labels, LLM image detection, and prompt engineering. I’ll walk through the design decisions, show real results comparing models, and share insights on what worked, what failed, and where LLMs shine (or struggle) in fraud detection.
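
To give a rough idea of what comparing a model against ground-truth human labels can look like, here is a minimal Python sketch. It is not the code from any of the three tools: the function name `score_predictions`, the `Scores` container, the receipt IDs, and the choice of precision, recall, and F1 as metrics are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    precision: float
    recall: float
    f1: float


def score_predictions(labels: dict[str, bool], predictions: dict[str, bool]) -> Scores:
    """Score one model's fraud verdicts against human labels.

    Both dicts map a (hypothetical) receipt ID to True for "fraud".
    Assumption: a receipt the model gave no verdict for counts as "not fraud".
    """
    tp = fp = fn = 0
    for receipt_id, is_fraud in labels.items():
        predicted = predictions.get(receipt_id, False)
        if predicted and is_fraud:
            tp += 1  # model flagged a receipt humans labeled as fraud
        elif predicted and not is_fraud:
            fp += 1  # model flagged a legitimate receipt
        elif not predicted and is_fraud:
            fn += 1  # model missed a fraudulent receipt

    # Guard against division by zero when a model flags nothing
    # or the label set contains no fraud.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return Scores(precision, recall, f1)


# Made-up example data, not results from the talk.
human_labels = {"r1": True, "r2": False, "r3": True, "r4": False}
model_output = {"r1": True, "r2": True, "r3": False, "r4": False}
print(score_predictions(human_labels, model_output))
```

Running the same scoring function over each model's output gives numbers that can be compared side by side, which is the basic shape of any of the three benchmarking setups.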
