Building the Data Flywheel: How to Use Production Data to Improve Your LLM Application

Learn how to build a data flywheel for your LLM application using production data. Discover the 5-step process: from error analysis and clustering to creating golden test sets and improving prompts with Agenta's LLMOps platform. Master continuous improvement for reliable AI.

Dec 19, 2025

10 minutes

Ship reliable AI apps faster

Agenta is the open-source LLMOps platform: prompt management, evals, and LLM observability all in one place.

One of the biggest challenges in building with LLMs is moving beyond the prototype stage. Many teams find themselves stuck in a cycle of making random prompt changes based on gut feelings, or “vibes,” leading to inconsistent and unpredictable results. An improvement in one area often causes a regression in another. This happens because, unlike traditional software, LLM outputs are not deterministic. Their vast and unpredictable failure surface makes it impossible to anticipate every edge case before launch, a concept explored in this article on the data flywheel.

To escape this cycle and build truly reliable AI applications, you need a systematic, repeatable process for improvement. This is where the data flywheel comes in. The data flywheel is a powerful framework for turning the messy, real-world data from your production environment into a clean, well-oiled engine for continuous improvement. It’s a methodical process of capturing user interactions, identifying failure patterns, building targeted evaluations, and using those insights to make your application smarter with every iteration.

This guide will walk you through the practical, step-by-step process of building your own data flywheel, drawing on insights from industry leaders like Hamel Husain and Jason Liu. We will also show how an open-source LLMOps platform like Agenta can provide the necessary tooling for each step, from capturing production data to deploying the improved prompt. We will cover how to label and cluster your data, create robust test sets, build meaningful evaluations, and use those evaluations to improve your system.

Part 1: The Flywheel - Turning Production Data into Progress

For traditional software, you can often rely on a suite of unit and integration tests to ensure quality. With LLMs, this is not enough. As Hamel Husain notes in his Field Guide to Rapidly Improving AI Products, the most valuable insights come from looking at how your application actually performs in the wild. Your production data (the questions, the commands, the successful outcomes, and most importantly the failures) is gold. It is the most direct and honest feedback you will ever receive.

The data flywheel provides a structure to refine this raw gold into something you can build with. The process is a continuous loop:

  1. Capture & Label: Collect traces from your production environment and have a human expert label them.

  2. Analyze & Cluster: Identify and categorize the patterns in your failures.

  3. Build Test Sets: Turn these failure patterns into a permanent, reusable “golden dataset.”

  4. Evaluate: Systematically measure your application’s performance against this dataset.

  5. Improve: Use the evaluation results to make targeted improvements to your prompts, RAG system, or model.

  6. Deploy & Repeat: Ship the improvements and begin the cycle again, capturing new data and spinning the flywheel faster.

This loop transforms the development process from a series of disjointed fixes into a compounding system of improvement.

Part 2: A Step-by-Step Guide to Building Your Flywheel

Building a data flywheel does not require a massive investment in complex infrastructure from day one. It starts with a simple, manual process of looking at your data.

Step 1: Label It - The Power of Manual Error Analysis

The single most important and highest-ROI activity in AI development is manual error analysis. Before you can measure anything, you must first understand what is going wrong. This is a qualitative process of discovery, not a quantitative one of measurement.

The process, often called “open coding,” is straightforward. In Agenta, this starts with the Observability feature, which automatically captures every user interaction as a trace. You can then use the platform’s Trace Annotation feature to capture user feedback (thumbs up/down or implicit signals) and filter traces to find those with negative feedback.

  1. Gather a Diverse Dataset: Collect a representative sample of 50-100 recent production traces. This should not be a purely random sample. Actively look for diversity. Include long conversations, interactions with negative user feedback, and queries that seem to push the boundaries of your application’s capabilities.

  2. Appoint a “Benevolent Dictator”: Assign a single domain expert who deeply understands your users and your product to be the ultimate arbiter of quality. This avoids endless debates about what constitutes a “good” response.

  3. Annotate Failures: For each trace, the expert should provide two simple annotations: a binary Pass/Fail score and a free-text comment describing the first point of failure. Focusing on the first failure is a powerful heuristic; upstream errors like faulty retrieval often cause a cascade of downstream problems.


This initial step is about building intuition. You are not just logging bugs; you are developing a deep understanding of how users interact with your system and where it falls short.
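
To make the two annotations concrete, here is a minimal sketch of what a label record could look like in code. The field names and the annotations.jsonl file are illustrative assumptions, not an Agenta schema; in Agenta itself this labeling happens through the Trace Annotation UI.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Annotation:
    trace_id: str        # identifier of the production trace under review
    passed: bool         # binary Pass/Fail judgment from the domain expert
    first_failure: str   # free-text note on the first point of failure ("" if passed)

# Example label written during a manual review session.
label = Annotation(
    trace_id="trace_0042",
    passed=False,
    first_failure="Retriever returned the 2022 pricing page instead of the current one",
)

# Append each label to a local file so the comments can be clustered in the next step.
with open("annotations.jsonl", "a") as f:
    f.write(json.dumps(asdict(label)) + "\n")
```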

Step 2: Cluster It - From Raw Notes to a Failure Taxonomy

Once you have a collection of raw, descriptive failure comments, the next step is to bring structure to them. This process, known as “axial coding,” involves grouping similar failures into a coherent “failure taxonomy.” Agenta’s Human Annotation features allow subject matter experts to review and categorize these traces directly from the UI, providing the structured data needed for this step.

While you can do this manually, an LLM can be a powerful assistant here. Export your free-text comments and use a prompt to ask an LLM to cluster them into a small set of themes. For example, you might discover that your failures fall into categories like these:

  • Hallucination / Incorrect Information: The model provides factually incorrect answers.

  • Context Retrieval / RAG Issues: The system fails to retrieve the correct documents.

  • Irrelevant or Off-Topic Responses: The answer is unrelated to the user’s question.

  • Generic or Unhelpful Responses: The answer is too broad or does not directly address the user’s need.

  • Formatting / Presentation Issues: The response is poorly formatted (e.g., missing code blocks).

  • Interaction Style / Tone: The model’s tone is inappropriate for the context.

This taxonomy is your roadmap. It turns a long list of individual problems into a prioritized list of problem areas, showing you exactly where to focus your efforts.
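
If you want an LLM to help with this clustering pass, a minimal sketch could look like the following. It assumes the annotations.jsonl file from Step 1 and the OpenAI Python client; the model name and prompt wording are placeholders rather than a prescribed setup.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Collect the free-text failure comments gathered during open coding (Step 1).
with open("annotations.jsonl") as f:
    records = [json.loads(line) for line in f]
comments = [r["first_failure"] for r in records if not r["passed"]]

prompt = (
    "You are helping build a failure taxonomy for an LLM application.\n"
    "Group the following failure notes into 5-8 named categories. For each category, "
    "give a one-sentence definition and list the indices of the notes it covers.\n\n"
    + "\n".join(f"{i}. {c}" for i, c in enumerate(comments))
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever model you have access to
    messages=[{"role": "user", "content": prompt}],
)

# Review the proposed taxonomy by hand before adopting it.
print(response.choices[0].message.content)
```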

Step 3: Create Test Sets - Your Ground Truth for Regression

With a clear understanding of your failure modes, you can now create your most valuable asset for iteration: a “golden dataset.” This is a permanent, reusable set of test cases that represents your most critical success and failure scenarios. Agenta simplifies this process by allowing you to Create Test Sets from Traces directly within the Observability view, instantly converting real-world failures into regression tests.

Your golden dataset should include:

  • Representative Failures: Select a handful of clear examples for each category in your failure taxonomy.

  • "Happy Path" Successes: Include examples of queries where the application is performing correctly. This is crucial for preventing regressions.

  • Synthetic Data (Optional but Recommended): If you have limited production data, you can use LLMs to generate synthetic examples to cover edge cases or adversarial inputs, a technique detailed here. For example, you can ask an LLM to generate variations of a user query that are phrased in unusual ways.

This dataset becomes the bedrock of your evaluation process. It is the yardstick against which you will measure all future changes.
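
As a rough sketch of what a golden dataset can look like on disk, here is one possible layout. The column names, example rows, and CSV format are illustrative assumptions; in Agenta you can also build the same test set directly from traces in the UI, as described above.

```python
import csv

# One row per test case, tagged with its failure-taxonomy category (or "happy_path")
# so evaluation results can later be broken down per category.
golden_cases = [
    {
        "input": "What does the Pro plan cost per seat?",
        "expected_output": "The Pro plan costs $49 per seat per month.",
        "category": "context_retrieval",
    },
    {
        "input": "Summarize my last three support tickets.",
        "expected_output": "A short bullet-point summary of the three most recent tickets.",
        "category": "happy_path",
    },
]

with open("golden_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected_output", "category"])
    writer.writeheader()
    writer.writerows(golden_cases)
```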

Step 4: Create Evals - Measuring What Matters

Now that you have a test set, you can build evaluators to score your application’s performance automatically. The key is to create targeted evaluations for your specific failure modes. A dashboard full of generic, off-the-shelf metrics like “helpfulness” is often a vanity exercise that creates a false sense of progress, a point emphasized in Hamel Husain’s Field Guide. Agenta gives you a single place to run these evaluations systematically, supporting both Human Annotation and Automatic Evaluation with LLMs at scale against your newly created test sets.

Instead, for each category in your failure taxonomy, define a specific, measurable evaluation:

  • For Incorrect Information, you can use a semantic similarity check against a known-good answer or an LLM-as-a-judge to score for factual correctness.

  • For Bad Tone, an LLM-as-a-judge is highly effective. You can ask it to rate the response on a scale from “formal” to “casual.”

  • For Retrieval Issues, you can measure classic information retrieval metrics like precision and recall on the retrieved documents.

By running these targeted evaluations against your golden dataset every time you make a change, you transform your improvement process from guesswork into a scientific discipline.
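
The retrieval metrics in particular are simple enough to sketch directly. Below is a minimal, self-contained example of per-query precision and recall over retrieved document IDs; the IDs and the way you obtain them are assumptions for illustration.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall of a single retrieval call, computed over document IDs."""
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: the system retrieved three chunks, two of which were actually relevant.
retrieved_ids = ["doc_12", "doc_07", "doc_99"]
relevant_ids = {"doc_12", "doc_07", "doc_31"}

p, r = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```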

Step 5: Improve the Prompt (and Beyond)

Armed with data-backed insights from your evaluations, you can now make targeted improvements to your system. The flywheel has shown you what to fix; now you can focus on how to fix it.

  • Prompt Engineering: This should always be your first step. If your evaluations show that the model is consistently misunderstanding a certain type of query, you can refine your prompt with clearer instructions, few-shot examples, or a more defined role. Agenta’s Prompt Management and Playground allow subject matter experts to collaborate on prompt iteration, version control changes, and deploy the improved prompt to production without touching the codebase. You can even use an LLM to help you improve your prompts.

  • RAG System Tuning: If your RAG-specific evaluations show poor retrieval performance, you can focus your efforts on improving your chunking strategy, experimenting with different embedding models, or adding a reranking step.

  • Fine-Tuning: Fine-tuning a model is a powerful but expensive option. It should only be considered when you have a large dataset (1,000+ examples) of a persistent failure mode that cannot be solved through prompt engineering or RAG improvements alone, as discussed in this guide on prompt engineering vs. fine-tuning.

After making an improvement, you run your evaluations against the golden dataset again. If the scores improve and no regressions are introduced, you can deploy with confidence, and the flywheel completes another turn.
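
One way to encode the “improves with no regressions” rule from the paragraph above is a small comparison over per-category scores. The score dictionaries and tolerance value here are illustrative assumptions.

```python
def safe_to_deploy(baseline: dict[str, float], candidate: dict[str, float],
                   tolerance: float = 0.02) -> bool:
    """True if the candidate improves overall without regressing any category.

    Scores are assumed to be pass rates in [0, 1], keyed by failure-taxonomy category;
    `tolerance` allows tiny per-category drops that are likely evaluation noise.
    """
    no_regression = all(candidate[cat] >= baseline[cat] - tolerance for cat in baseline)
    overall_gain = (sum(candidate.values()) / len(candidate)
                    > sum(baseline.values()) / len(baseline))
    return no_regression and overall_gain

baseline = {"context_retrieval": 0.70, "tone": 0.90, "happy_path": 0.95}
candidate = {"context_retrieval": 0.82, "tone": 0.89, "happy_path": 0.95}

print(safe_to_deploy(baseline, candidate))  # True: retrieval improved, nothing regressed
```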

Conclusion

The data flywheel is more than just a process; it represents a fundamental shift in how we approach building reliable LLM applications. It moves teams from reactive "whack-a-mole" debugging to a proactive, data-driven system of continuous improvement. As Jason Liu puts it, in the world of AI, “experimentation speed is your only moat.” The data flywheel is the engine that drives that speed.

Don’t feel like you need to build this entire system overnight. Start small. This week, take 30 minutes to manually review 20 user conversations. The insights you gain will be the first turn of your own data flywheel, and the first step toward building an application that gets smarter with every user interaction.

References and Thanks

This guide is inspired by the content and approaches created by Jason Liu and Hamel Husain. Both offer courses on AI engineering. Although I have not taken them myself, I can vouch for the quality of the content and videos they create, so they are definitely worth checking out!

Ship reliable agents faster with Agenta

Build reliable LLM apps together with integrated prompt management, evaluation, and observability.
