Notes

Notes from the team.

Occasional pieces — open-source databases, LLMs, and the engineering choices that shape what we build. Written by the people doing the work.

Three pieces · 2022–2024

Meta unveils a QA benchmark to evaluate RAG systems.

Meta has introduced a factual question-answering (QA) benchmark for evaluating Retrieval-Augmented Generation (RAG) systems built on large language models. The benchmark contains 4,409 questions and gives a common framework for measuring the accuracy and effectiveness of QA systems at scale.

Understanding RAG systems

Retrieval-Augmented Generation is a hybrid approach to question answering. A retrieval component first gathers relevant documents or passages from a large corpus; a generation component then uses that retrieved material to construct contextually appropriate answers. The combination is meant to deliver the precision of retrieval with the fluency of generation.

The need for a comprehensive benchmark

The complexity and diversity of real-world questions have grown faster than evaluation methods have. Meta's benchmark addresses this with a diverse question set in which each item is designed to test a different facet of system capability. The questions span domains and formats, so the resulting assessment is not biased toward any single subject area.

What's in the benchmark

  • Diverse question set. 4,409 questions covering a broad spectrum of topics — the diversity is the point.
  • Mock APIs. The benchmark includes simulated APIs that mimic real information sources. Systems have to retrieve and synthesize information as they would against live sources, not just read from a static corpus.
  • Realistic challenges. Ambiguous questions, incomplete information, multi-source synthesis — the practical failure modes of production QA.

Why it matters for LLM development

By providing a standardized, challenging evaluation framework, the benchmark lets researchers identify where their systems break. That visibility is the precondition for building reliable QA systems that hold up in real-world conditions — in a credit memo workflow, a regulatory filing search, or a customer-service bot — instead of just on a leaderboard.

Chat with PDF — the "hello world" problem of LLMs.

Engaging in a conversation with a PDF is harder than it looks, and ChatGPT runs into limits very quickly. The prevailing GPT-4 wrapper category — "chat with a doc / PDF" — is the most visible consumer AI use case. Reading a dense document is tedious; asking a model to parse and summarize it is convenient. Predictably, ChatGPT falls short on PDFs beyond about ten pages: it produces generic summaries and avoids elaboration when pushed.

The reasons matter, because they're the same reasons any production system has to solve.

OCR quality

Effective Optical Character Recognition is essential, especially for tables and embedded images. Existing free and commercial OCR engines struggle here — and most business and research PDFs contain exactly the structured content the OCR layer mishandles.

Context length

We now have 128k-context LLMs in commercial use, but it's unclear what ChatGPT actually deploys per query. Running OCR on a long document and feeding the text in raw frequently triggers errors, which suggests a smaller-context model behind the scenes.

RAG implementation

A reasonable fix is a straightforward Retrieval-Augmented Generation flow: chunk the document, embed the chunks, retrieve relevant segments at query time, present them to the LLM. Most consumer chatbots don't do this. Production systems should.
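The four steps above (chunk, embed, retrieve, present) can be sketched in a few lines. This is a minimal illustration, not how any particular product does it: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the word-window sizes are arbitrary, and all function names are hypothetical. The final prompt would be sent to whichever LLM you use; that call is left out.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of words (requires max_words > overlap)."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. Assumption: a real system uses a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Present the retrieved chunks to the LLM alongside the question."""
    joined = "\n---\n".join(context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


document = (
    "Revenue grew 12 percent in 2023. "
    "The company opened two new offices in Europe. "
    "Net debt fell to 40 million after the refinancing."
)
chunks = chunk_text(document, max_words=8, overlap=2)
question = "What happened to net debt?"
prompt = build_prompt(question, retrieve(question, chunks, k=1))
```

The overlap between windows is there so that a sentence split across a chunk boundary still appears whole in at least one chunk. In a real system the embeddings would be computed once per document and stored, not recomputed per query as this sketch does.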

Highlighting source sections

An optimal Chat-with-PDF system surfaces the document sections each response comes from, so the user can verify the answer. This is the same audit-trail requirement that any regulated workflow needs — whether it's a credit memo citing a financial statement or a legal review tracing a clause.
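One way to make that possible is to carry the page number with every chunk from extraction through to the answer. The sketch below assumes the PDF has already been converted to one text string per page; the `SourceChunk` type, the word-overlap scoring, and the citation format are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceChunk:
    page: int  # 1-based page number the text came from
    text: str  # paragraph text extracted from that page


def words(text: str) -> set[str]:
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def chunks_from_pages(pages: list[str]) -> list[SourceChunk]:
    """One chunk per blank-line-separated paragraph, tagged with its page."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        for paragraph in page_text.split("\n\n"):
            if paragraph.strip():
                chunks.append(SourceChunk(page_number, paragraph.strip()))
    return chunks


def retrieve_with_sources(query: str, chunks: list[SourceChunk], k: int = 2) -> list[SourceChunk]:
    """Toy retrieval: rank by shared words, drop chunks with no overlap at all."""
    query_words = words(query)
    scored = [(len(query_words & words(c.text)), c) for c in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def format_citations(chunks: list[SourceChunk], max_chars: int = 60) -> str:
    """Render numbered citations the UI can show next to the answer."""
    return "\n".join(
        f"[{i}] p.{c.page}: {c.text[:max_chars]}"
        for i, c in enumerate(chunks, start=1)
    )


pages = [
    "Revenue grew 12 percent.\n\nTwo new offices opened.",
    "Net debt fell to 40 million.",
]
hits = retrieve_with_sources("What happened to net debt?", chunks_from_pages(pages), k=1)
citations = format_citations(hits)
```

Because each retrieved chunk still knows its page, the application can highlight the passage in the viewer or print the citation under the answer, and the user can check the claim against the original.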

An effective Chat-with-PDF application is therefore non-trivial: OCR quality, context handling, retrieval, and source citation all have to work together. The viable product is probably a small, focused operation rather than a heavily backed startup, possibly a lifestyle business done well.

Using LLMs to extract unstructured data.

Named Entity Recognition (NER) is one of the primary applications of enterprise AI. It pairs directly with transcription: once a document has been transcribed and analyzed, the key terms can be extracted from the text. Relation Extraction (RE) and Event Extraction (EE) complement NER in the same pipeline.

Before the LLM era, training a language model for these tasks, typically a transformer, was laborious. Each entity type required a few hundred painstakingly labelled examples, followed by accuracy assessment and multiple iterations to reach the desired performance for every label.

LLMs simplify this dramatically. Depending on task complexity, a zero-shot natural-language prompt can extract key entities such as names and addresses from a document. For harder problems, few-shot learning is more reliable — you show the LLM four or five examples of documents with their extracted entities, then ask it to do the same on new ones. In practice this is more straightforward than fine-tuning a dedicated NER model.
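A few-shot extraction prompt of the kind described above can be assembled mechanically from labelled examples. The sketch below is one possible shape, not a prescribed format: the field names, the JSON reply convention, and both function names are assumptions made for illustration, and the call to the model itself is left out.

```python
import json

# Hypothetical label set, chosen to match the names-and-addresses example.
FIELDS = ["name", "address"]


def build_few_shot_prompt(
    examples: list[tuple[str, dict]], document: str, fields: list[str]
) -> str:
    """Instruction, then each worked example, then the new document to complete."""
    parts = [
        "Extract the following fields from the document and reply with JSON only: "
        + ", ".join(fields)
        + "."
    ]
    for text, entities in examples:
        parts.append(
            f"Document:\n{text}\nEntities:\n{json.dumps(entities, sort_keys=True)}"
        )
    # The prompt ends where the model is expected to continue.
    parts.append(f"Document:\n{document}\nEntities:")
    return "\n\n".join(parts)


def parse_entities(response: str, fields: list[str]) -> dict:
    """Parse the model's reply; keep only expected fields, None where missing."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {field: data.get(field) for field in fields}


examples = [
    ("Jane Doe lives at 1 Main St.", {"name": "Jane Doe", "address": "1 Main St"}),
]
prompt = build_few_shot_prompt(examples, "John Roe, 5 Oak Ave.", FIELDS)
```

Parsing defensively matters as much as the prompt: a model can return malformed JSON or extra keys, and `parse_entities` turns both into a predictable record that downstream code can check for missing values.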

If the zero-shot and few-shot approaches fall short, supervised fine-tuning becomes necessary. SFT works particularly well for specific, dense extractions, and pairs naturally with code-based entity-extraction strategies and data augmentation.

A comprehensive treatment of these methods is given in the survey paper "Large Language Models for Generative Information Extraction: A Survey".

Several production NER models in use today have subpar accuracy and depend on human-in-the-loop supervision and correction, which makes for a costly, labour-intensive pipeline. Moving these workloads to LLMs is now imperative: in many cases they eliminate the need for human intervention and streamline the extraction process.