Benchmarking LLMs for Fraud Detection
Learn how three benchmarking tools (an AWS Bedrock tool deployed with AWS CDK, a LangChain Python pipeline, and a Node.js API suite) evaluate LLMs on receipt fraud detection using OCR, embeddings, and RAG.
Fraud detection on receipt images is a complex challenge that blends computer vision, OCR, embeddings, retrieval-augmented generation, and prompt engineering. In this talk, I’ll share how I built three different benchmarking frameworks to evaluate large language models (LLMs) on fraud detection tasks:
- AWS Bedrock Benchmarking Tool with AWS CDK – scalable infrastructure for structured testing of foundation models.
- Local LLM Benchmarking with LangChain + Python – flexible experimentation pipeline for embeddings, OCR, and RAG.
- API-based LLM Benchmarking (Node.js + LLM APIs) – direct model comparisons across OpenAI, Anthropic, and others.
These tools were used to process and analyze receipt images with a mix of OCR, embeddings, RAG, ground-truth human labels, LLM image detection, and prompt engineering. I’ll walk through the design decisions, show real results comparing models, and share insights on what worked, what failed, and where LLMs shine (or struggle) in fraud detection.
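As a rough illustration of what comparing models against ground-truth human labels can look like, here is a minimal Python sketch. It is not the code from the talk: the `Scores` class, the `score_predictions` function, and the receipt IDs are hypothetical names chosen for this example, and it assumes each model outputs a simple fraud / not-fraud flag per receipt.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    precision: float
    recall: float
    f1: float
    accuracy: float


def score_predictions(truth: dict[str, bool], predicted: dict[str, bool]) -> Scores:
    """Compare one model's fraud flags against human labels.

    Both dicts map a receipt ID to True (fraud) or False (not fraud).
    Assumption: a receipt with no prediction is treated as "not fraud".
    """
    tp = fp = fn = tn = 0
    for receipt_id, is_fraud in truth.items():
        flagged = predicted.get(receipt_id, False)
        if is_fraud and flagged:
            tp += 1
        elif is_fraud and not flagged:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1

    # Guard each ratio against division by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(truth) if truth else 0.0
    return Scores(precision, recall, f1, accuracy)


# Hypothetical labels and predictions, for illustration only.
human_labels = {"r1": True, "r2": True, "r3": False, "r4": False}
model_flags = {"r1": True, "r2": False, "r3": True, "r4": False}
scores = score_predictions(human_labels, model_flags)
```

Running the same function over each model's predictions gives per-model precision and recall that can be put side by side. For fraud detection, recall on the fraud class is often worth reporting separately, since a missed fraudulent receipt and a false alarm usually carry different costs.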