
Chat Sessions in Observability

Overview

Chat sessions bring conversation-level observability to Agenta. You can now group related traces from multi-turn conversations together, making it easy to analyze complete user interactions rather than individual requests.

This feature is essential for debugging chatbots, AI assistants, and any application with multi-turn conversations. You get visibility into the entire conversation flow, including costs, latency, and intermediate steps.

Key Capabilities

  • Automatic Grouping: All traces with the same ag.session.id attribute are automatically grouped together
  • Session Analytics: Track total cost, latency, and token usage per conversation
  • Session Browser: Dedicated UI showing all sessions with first input, last output, and key metrics
  • Session Drawer: Detailed view of all traces within a session with parent-child relationships
  • Real-time Monitoring: Auto-refresh mode for monitoring active conversations

How to Use Sessions

Using the Python SDK

Add session tracking to your application with one line of code:

import agenta as ag
from openai import OpenAI

# Initialize Agenta
ag.init()

client = OpenAI()

# Store the session ID for all subsequent traces
ag.tracing.store_session(session_id="conversation_123")

# Your LLM calls are automatically tracked with this session
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
)

Using the Chat Run Endpoint

You can also instrument sessions when calling Agenta-managed prompts via the /chat/run endpoint:

import agenta as ag

# Initialize the Agenta client
agenta = ag.Agenta(api_key="your_api_key")

# Call the chat endpoint with session tracking
response = agenta.run(
    base_id="your_base_id",
    environment="production",
    inputs={
        "chat_history": [
            {"role": "user", "content": "What is the weather like?"}
        ]
    },
    # Add session metadata to group related conversations
    metadata={
        "ag.session.id": "user_456_conv_789"
    },
)

# Follow-up in the same session
follow_up = agenta.run(
    base_id="your_base_id",
    environment="production",
    inputs={
        "chat_history": [
            {"role": "user", "content": "What is the weather like?"},
            {"role": "assistant", "content": response["message"]},
            {"role": "user", "content": "What about tomorrow?"}
        ]
    },
    metadata={
        "ag.session.id": "user_456_conv_789"  # Same session ID
    },
)

Using OpenTelemetry

If you're using OpenTelemetry for instrumentation:

import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('my-app');
const span = tracer.startSpan('chat-interaction');

// Add session ID as a span attribute
span.setAttribute('ag.session.id', 'conversation_123');

// Your code here
span.end();

The UI automatically detects session IDs and groups traces together. You can use any format for session IDs: UUIDs, composite IDs like user_123_session_456, or custom formats.
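One common pattern for composite IDs is deriving them deterministically from the user and conversation, so that retried or replayed requests land in the same session. This is an illustrative sketch, not an Agenta API; the helper name is hypothetical:

```python
import uuid

def make_session_id(user_id: str, conversation_key: str) -> str:
    """Build a composite session ID like user_123_conv_ab12cd34.

    uuid5 is deterministic: the same user and conversation key always
    produce the same suffix, so the same conversation maps to one session.
    """
    suffix = uuid.uuid5(uuid.NAMESPACE_URL, f"{user_id}:{conversation_key}").hex[:8]
    return f"user_{user_id}_conv_{suffix}"
```

Pass the result wherever you set the session, for example as the ag.session.id span attribute or to ag.tracing.store_session.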

Use Cases

Debug Chatbots

See the complete conversation flow when users report issues. Instead of viewing isolated requests, you can analyze the entire conversation context and understand why a particular response was generated.

Monitor Multi-turn Agents

Track how your agent handles follow-up questions and maintains context across turns. See which turns are expensive, identify where latency spikes occur, and understand conversation patterns.

Analyze Conversation Costs

Understand which conversations are expensive and why. Session-level cost tracking helps you identify optimization opportunities and set appropriate pricing for your application.

Optimize Performance

Identify latency issues across entire conversations, not just single requests. See which conversational patterns lead to performance problems and optimize accordingly.

Getting Started

Learn more in our documentation.

What's Next

We're continuing to enhance session tracking with upcoming features like session-level annotations, session comparisons, and automated session analysis.

JSON Multi-Field Match Evaluator

The JSON Multi-Field Match evaluator lets you validate multiple fields in JSON outputs simultaneously. This makes it ideal for entity extraction tasks where you need to check if your model correctly extracted name, email, address, and other structured fields.

What is JSON Multi-Field Match?

This evaluator compares specific fields between your model's JSON output and the expected JSON values from your test set. Unlike the old JSON Field Match evaluator (which only checked one field), this evaluator handles any number of fields at once.

For each field you configure, the evaluator produces a separate score (either 1 for a match or 0 for no match). It also calculates an aggregate score showing the percentage of fields that matched correctly.

Key Features

Multiple Field Comparison

Configure as many fields as you need to validate. The evaluator checks each field independently and reports results for all of them.

If you're extracting user information, you might configure fields like name, email, phone, and address.city. Each field gets its own score, so you can see exactly which extractions succeeded and which failed.

Three Path Format Options

The evaluator supports three different ways to specify field paths:

Dot notation (recommended for most cases):

  • Simple fields: name, email
  • Nested fields: user.address.city
  • Array indices: items.0.name

JSON Path (standard JSON Path syntax):

  • Simple fields: $.name, $.email
  • Nested fields: $.user.address.city
  • Array indices: $.items[0].name

JSON Pointer (RFC 6901):

  • Simple fields: /name, /email
  • Nested fields: /user/address/city
  • Array indices: /items/0/name

All three formats work the same way. Use whichever matches your existing tooling or personal preference.
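To illustrate how the three notations address the same value, here is a minimal resolver sketch. This is not the evaluator's actual implementation, and it only handles the dotted subset of JSON Path shown above:

```python
def resolve_path(data, path: str):
    """Resolve 'user.address.city', '$.user.address.city', or
    '/user/address/city' against nested dicts and lists."""
    # Normalize all three notations into a list of keys
    if path.startswith("$."):                      # JSON Path (dotted subset)
        parts = path[2:].replace("[", ".").replace("]", "").split(".")
    elif path.startswith("/"):                     # JSON Pointer (RFC 6901)
        parts = path[1:].split("/")
    else:                                          # Dot notation
        parts = path.split(".")
    for part in parts:
        if isinstance(data, list):
            data = data[int(part)]                 # Array index
        else:
            data = data[part]                      # Dict key
    return data
```

For a document doc = {"items": [{"name": "a"}]}, the calls resolve_path(doc, "items.0.name"), resolve_path(doc, "$.items[0].name"), and resolve_path(doc, "/items/0/name") all return the same value.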

Nested Field and Array Support

Access deeply nested fields and array elements without restrictions. The evaluator handles any level of nesting.

Per-Field Scoring

See individual scores for each configured field in the evaluation results. This granular view helps you identify which specific extractions are working well and which need improvement.

Aggregate Score

The aggregate score shows the percentage of matching fields. If you configure five fields and three match, the aggregate score is 0.6 (or 60%).
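The scoring logic can be sketched as follows. This is a simplified illustration assuming dot-notation paths, not the evaluator's real implementation:

```python
def multi_field_match(output: dict, expected: dict, fields: list[str]) -> dict:
    """Score each configured field 1.0/0.0, then compute the aggregate."""
    def get(data, path):
        # Walk a dot-notation path through nested dicts and lists
        for part in path.split("."):
            data = data[int(part)] if isinstance(data, list) else data[part]
        return data

    scores = {
        field: 1.0 if get(output, field) == get(expected, field) else 0.0
        for field in fields
    }
    # Aggregate = fraction of fields that matched
    scores["aggregate_score"] = sum(scores.values()) / len(fields)
    return scores
```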

Example

Suppose you're building an entity extraction model that pulls contact information from text. Your ground truth looks like this:

{
  "name": "John Doe",
  "email": "john@example.com",
  "phone": "555-1234",
  "address": {
    "city": "New York",
    "zip": "10001"
  }
}

Your model produces this output:

{
  "name": "John Doe",
  "email": "john.doe@example.com",
  "phone": "555-1234",
  "address": {
    "city": "New York",
    "zip": "10002"
  }
}

You configure these fields: ["name", "email", "phone", "address.city", "address.zip"]

The evaluator returns:

Field              Score
name               1.0
email              0.0
phone              1.0
address.city       1.0
address.zip        0.0
aggregate_score    0.6

You can see immediately that the model got the email and zip code wrong but correctly extracted the name, phone, and city.

Auto-Detection in the UI

When you configure the evaluator in the web interface, Agenta automatically detects available fields from your test set data. Click to add or remove fields using a tag-based interface. This makes setup fast and reduces configuration errors.

Migration from JSON Field Match

The old JSON Field Match evaluator only supported checking a single field. If you're using it, consider migrating to JSON Multi-Field Match to gain:

  • Support for multiple fields in one evaluator
  • Per-field scoring for detailed analysis
  • Aggregate scoring for overall performance tracking
  • Nested field and array support

Existing JSON Field Match configurations continue to work. We recommend migrating to JSON Multi-Field Match for new evaluations.

Next Steps

Learn more about configuring and using the JSON Multi-Field Match evaluator in the Classification and Entity Extraction Evaluators documentation.

PDF Support in the Playground

The Playground now supports PDF attachments for chat applications. You can include PDF documents in your prompts to build applications that analyze documents, answer questions about content, or extract information from files.

What is PDF Support?

PDF support lets you attach PDF documents to chat messages when testing prompts in the Playground. The feature works with vision-capable models from OpenAI, Gemini, and Claude. These models can read and understand PDF content to answer questions or perform analysis.

This is useful when you're building applications that need to work with documents. Examples include invoice processing, contract analysis, document Q&A, or content extraction.

Supported Providers

PDF support works with vision-capable models that handle document inputs.

How to Attach PDFs

To attach a PDF to a chat message, click "Add attachment" in the message input. You'll see three options:

Upload a File

Select a PDF from your computer. The file is converted to base64 and sent with your prompt.
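The encoding step resembles the data-URL pattern that vision APIs use for inline files. A hedged sketch (the exact message shape the Playground sends may differ, and the helper name is hypothetical):

```python
import base64

def pdf_to_data_url(path: str) -> str:
    """Read a PDF and return a base64 data URL for an inline file content part."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:application/pdf;base64,{encoded}"
```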

Provide a URL

Paste the URL to a publicly accessible PDF. The model fetches the PDF from the URL.

Use a File ID

If you've uploaded a file through a provider's API (like the Gemini Files API), you can use the file ID instead. The model retrieves the file from the provider's storage.

Using PDFs in Evaluations

PDF attachments work in both automatic and human evaluations. You can include PDFs in your test sets and run evaluations across multiple documents.

PDFs in Observability and Tracing

When you trace requests that include PDFs, you can see the PDF attachment information in the trace data.

Example Use Cases

Invoice Processing

Create a prompt that extracts key information from invoices:

Extract the following information from this invoice:
- Invoice number
- Date
- Total amount
- Vendor name
- Line items

Return the information as structured JSON.

Attach sample invoices as PDFs. Test the prompt with different invoice formats to ensure reliable extraction across vendors.

Contract Analysis

Build a prompt that analyzes legal contracts:

Review the attached contract and identify:
- Key obligations for each party
- Important dates and deadlines
- Termination clauses
- Liability limitations

Provide a summary in plain language.

Attach contract PDFs and verify that the model identifies critical terms consistently.

Document Q&A

Create an assistant that answers questions about documents:

You are a document assistant. Answer the user's question based on the
attached PDF. Be specific and cite page numbers when possible.

Question: {{question}}

Attach various document types (reports, manuals, research papers) and test question-answering accuracy across different content.

Next Steps

Learn more about using the Playground to develop and test prompts with PDF attachments.

Agenta Documentation MCP Server

AI coding agents like Cursor, Claude Code, VS Code Copilot, and Windsurf can now access Agenta documentation directly through the Agenta MCP server.

The MCP server implements the Model Context Protocol, allowing AI assistants to search and retrieve Agenta documentation on demand. Instead of manually searching docs, your AI agent can answer questions about Agenta features, APIs, and code examples.

Read the full setup guide →

Projects within Organizations

You can now create projects within an organization. This feature helps you organize your work when you're building multiple AI products or managing different teams working on separate initiatives.

What Are Projects?

Projects provide a way to isolate and organize your AI work within an organization. Each project maintains its own scope for:

  • Prompts: All prompt templates and variants stay within the project
  • Traces: Observability data is scoped to the project that generated it
  • Evaluations: Test sets, evaluators, and evaluation results belong to specific projects

This scoping prevents clutter and makes it easy to focus on one product at a time.

Creating and Managing Projects

You can create a new project directly from the sidebar in the Agenta interface. Once created, you can switch between projects using the sidebar navigation.

Each team member can work in different projects simultaneously. The interface remembers your last active project, making it easy to pick up where you left off.

When to Use Projects

Projects work well when you need to:

  • Build multiple AI products for different use cases
  • Separate development work for different teams or departments
  • Keep client work isolated from internal tools

Next Steps

If you're managing complex AI initiatives across multiple products, projects give you the structure to keep everything organized. You can create your first project from the sidebar and start organizing your prompts and evaluations.

For questions about projects or organizational structure, check the FAQ or reach out through our support channels.

Provider Built-in Tools in the Playground

The Playground now supports provider built-in tools. You can use web search, code execution, file search, and other native provider tools directly when developing prompts.

What Are Provider Built-in Tools?

Provider built-in tools are capabilities that LLM providers offer natively. Unlike custom tools that you define with JSON schemas, these tools are managed by the provider. When the model needs them, the provider handles execution and returns results automatically.

Common built-in tools include:

  • Web search: Fetch current information from the internet
  • Code execution: Run Python or JavaScript code
  • File search: Search through uploaded documents
  • Bash scripting: Execute shell commands (Anthropic)

Supported Providers and Tools

Different providers offer different built-in tools:

OpenAI

  • Web Search: Access current information from the web
  • File Search: Search through files you upload to OpenAI

Anthropic

  • Web Search: Retrieve information from the internet
  • Bash Scripting: Execute bash commands in a sandboxed environment

Gemini

  • Web Search: Search the web for current information
  • Code Execution: Run Python code to perform calculations and data analysis

How to Use Built-in Tools

Adding Tools in the Playground

  1. Open your prompt in the Playground
  2. Click the "Add Tool" button in the configuration panel
  3. Choose the tools you want to enable for your prompt
  4. Test your prompt; the model will automatically use tools when needed

The tools are saved with your prompt configuration. When you commit changes, the tool configuration is stored with the variant.

Invoking with Tools via LLM Gateway

When you invoke prompts through Agenta as an LLM gateway, the tools are automatically included in the request. The provider handles tool execution during the call.

Your application receives the final response after all tool calls complete. You don't need to handle tool execution yourself.

Tool Definitions in the Registry

Tool definitions follow the LiteLLM format. You can view the exact tool schemas in the Prompt Registry. This helps you understand what parameters each tool accepts and how the provider will use it.
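As an illustration only, an enabled web search tool might appear in the stored configuration roughly like this; check the Prompt Registry for the exact schema your provider and LiteLLM version use:

```json
{
  "tools": [
    {
      "type": "web_search"
    }
  ]
}
```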

Example Use Cases

Research with Web Search

Create a prompt that answers questions using current information:

You are a research assistant. Answer the user's question with accurate,
current information. Use web search when you need recent data.

Question: {{question}}

Enable web search in the tool configuration. When users ask about current events or recent data, the model automatically searches the web for information.

Data Analysis with Code Execution

Build a data analysis prompt that performs calculations:

Analyze the following data and provide insights:

{{data}}

Calculate statistics and create visualizations as needed.

Enable code execution for Gemini. The model can run Python code to calculate statistics, process data, and generate visualizations.

Document Search with File Search

Create a prompt that answers questions about uploaded documents:

Answer the user's question based on the uploaded documentation.
Be specific and cite relevant sections.

Question: {{question}}

Enable file search for OpenAI. The model searches through your uploaded files to find relevant information.

Next Steps

Learn more about using the Playground to develop and test prompts with provider built-in tools.

Reasoning Effort Support in the Playground

You can now configure reasoning effort for models that support this parameter, such as OpenAI's o1 series and Google's Gemini 2.5 Pro.

Reasoning effort controls how much computational thinking the model applies before generating a response. This is particularly useful for complex reasoning tasks where you want to balance response quality with latency and cost.

The reasoning effort parameter is part of your prompt template configuration. When you fetch prompts via the SDK or invoke them through Agenta as an LLM gateway, the reasoning effort setting is included in the configuration and applied to your requests automatically.
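For example, a fetched configuration might carry the setting roughly like this (the field placement shown here is illustrative):

```json
{
  "prompt": {
    "llm_config": {
      "model": "o1",
      "reasoning_effort": "medium"
    }
  }
}
```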

This gives you fine-grained control over model behavior directly from the playground, making it easier to optimize for your specific use case.

Jinja2 Template Support in the Playground

We're excited to announce a powerful update to the Agenta playground. You can now use Jinja2 templating in your prompts.

This means you can add sophisticated logic directly into your prompt templates. Use conditional statements, apply filters to variables, and transform data on the fly.

Learn more in our blog post or check the documentation.

Example

Here's a prompt template that uses Jinja2 to adapt based on user expertise level:

You are {% if expertise_level == "beginner" %}a friendly teacher who explains concepts in simple terms{% else %}a technical expert providing detailed analysis{% endif %}.

Explain {{ topic }} {% if include_examples %}with practical examples{% endif %}.

{% if False %} {{expertise_level}} {{include_examples}} {% endif %}

Note: The {% if False %} block makes variables available to the playground without including them in the final prompt.
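You can preview how such a template renders locally with the jinja2 package. This is a local sketch of the templating behavior; the playground renders the template for you:

```python
from jinja2 import Template

template = Template(
    'You are {% if expertise_level == "beginner" %}a friendly teacher who '
    "explains concepts in simple terms{% else %}a technical expert providing "
    "detailed analysis{% endif %}.\n"
    "Explain {{ topic }} {% if include_examples %}with practical examples{% endif %}."
)

prompt = template.render(
    expertise_level="beginner", topic="recursion", include_examples=True
)
print(prompt)
```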

Using Jinja2 Prompts

When you fetch a Jinja2 prompt via the SDK, you get the template format included in the configuration:

{
  "prompt": {
    "messages": [
      {
        "role": "user",
        "content": "You are {% if expertise_level == \"beginner\" %}a friendly teacher...{% endif %}"
      }
    ],
    "llm_config": {
      "model": "gpt-4",
      "temperature": 0.7
    },
    "template_format": "jinja2"
  }
}

The template_format field tells Agenta how to process your variables. This works both when invoking prompts through Agenta as an LLM gateway and when fetching prompts programmatically via the SDK.


Agenta Core is Now Open Source

We're open sourcing the core of Agenta under the MIT license. All functional features are now available to the community.

What's Open Source

Every feature you need to build, test, and deploy LLM applications is now open source. This includes the evaluation system, prompt playground and management, observability, and all core workflows.

You can run evaluations using LLM-as-a-Judge, custom code evaluators, or any built-in evaluator. Create and manage test sets. Evaluate end-to-end workflows or specific spans in traces.

Experiment with prompts in the playground. Version and commit changes. Deploy to environments. Fetch configurations programmatically.

Trace your LLM applications with OpenTelemetry support. View detailed execution traces. Monitor costs and performance. Filter and search traces.

Building in Public Again

We've moved development back to the public repository. You can see what we're building, contribute features, and shape the product direction.

What Remains Under Commercial License

Only enterprise collaboration features stay under a separate license. This includes role-based access control (RBAC), single sign-on (SSO), and audit logs. These features support teams with specific compliance and security requirements.

Get Started

Follow the self-hosting quick start guide to deploy Agenta on your infrastructure. View the source code and contribute on GitHub. Read why we made this decision at agenta.ai/blog/commercial-open-source-is-hard-our-journey.

What This Means for You

You can run Agenta on your infrastructure with full access to evaluation, prompting, and observability features. You can modify the code to fit your needs. You can contribute back to the project.

The MIT license gives you freedom to use, modify, and distribute Agenta. We believe open source creates better products through community collaboration.

Evaluation SDK

The Evaluation SDK lets you run evaluations programmatically from code. You get full control over test data and evaluation logic. You can evaluate agents built with any framework and view results in the Agenta dashboard.

Why Programmatic Evaluation?

Complex AI agents need evaluation that goes beyond UI-based testing. The Evaluation SDK provides code-level control over test data and evaluation logic. You can test agents built with any framework. Run evaluations in your CI/CD pipeline. Debug complex workflows with full trace visibility.

Key Capabilities

Test Data Management

Create test sets directly in your code or fetch existing ones from Agenta. Test sets can include ground truth data for reference-based evaluation or work without it for evaluators that only need the output.

Built-in Evaluators

The SDK includes LLM-as-a-Judge, semantic similarity, and regex matching evaluators. You can also write custom Python evaluators for your specific requirements.

Reusable Configurations

Save evaluator configurations in Agenta to reuse them across runs. Configure an evaluator once, then reference it in multiple evaluations.

Span-Level Evaluation

Evaluate your agent end to end or test specific spans in the execution trace. Test individual components like retrieval steps or tool calls separately.

Run on Your Infrastructure

Evaluations run on your infrastructure. Results appear in the Agenta dashboard with full traces and comparison views.

Getting Started

Install the SDK:

pip install agenta

Here's a minimal example evaluating a simple agent:

import agenta as ag
from agenta.sdk.evaluations import aevaluate

# Initialize
ag.init()

# Define your application
@ag.application(slug="my_agent")
async def my_agent(question: str):
    # Your agent logic here
    answer = "..."  # replace with your agent's actual output
    return answer

# Define an evaluator
@ag.evaluator(slug="correctness_check")
async def correctness_check(expected: str, outputs: str):
    return {
        "score": 1.0 if outputs == expected else 0.0,
        "success": outputs == expected,
    }

# Create test data
testset = await ag.testsets.acreate(
    name="Agent Tests",
    data=[
        {"question": "What is 2+2?", "expected": "4"},
        {"question": "What is the capital of France?", "expected": "Paris"},
    ],
)

# Run evaluation
result = await aevaluate(
    name="Agent Correctness Test",
    testsets=[testset.id],
    applications=[my_agent],
    evaluators=[correctness_check],
)

print(f"View results: {result['dashboard_url']}")

Dashboard Integration

Every evaluation run gets a shareable dashboard link. The dashboard shows full execution traces, comparison views for different versions, aggregated metrics, and individual test case details.

Next Steps

Check out the Quick Start Guide to build your first evaluation.