LLM-as-a-Judge
LLM-as-a-Judge is an evaluator that uses an LLM to assess LLM outputs. It's particularly useful for evaluating text generation tasks or chatbots where there's no single correct answer.

The evaluator has the following parameters:
The Prompt
You can configure the prompt used for evaluation. The prompt can contain multiple messages in OpenAI format (role/content). All messages in the prompt have access to the inputs, outputs, and reference answers (any columns in the testset). To reference these in your prompts, use the following variables (inside double curly braces):
- {{inputs}}: all the inputs to the LLM application, formatted as key-value pairs
- {{outputs}}: the output of the LLM application
- {{reference}}: the column with the reference answer in the testset (optional). You can configure the name of this column under Advanced Setting in the configuration modal.
- {{correct_answer}}: alias for {{reference}} (for backward compatibility)
- {{prediction}}: alias for {{outputs}} (for backward compatibility)
- {{$input_column_name}}: the value of any input column for the given row of your testset (e.g. {{country}})
If no correct_answer (reference) column is present in your testset, the {{reference}} variable will be left blank in the prompt.
Here's the default prompt:
System prompt:
You are an expert evaluator grading model outputs. Your task is to grade the responses based on the criteria and requirements provided below.
Given the model output and inputs (and any other data you might get) assign a grade to the output.
## Grading considerations
- Evaluate the overall value provided in the model output
- Verify all claims in the output meticulously
- Differentiate between minor errors and major errors
- Evaluate the outputs based on the inputs and whether they follow the instructions in the inputs, if any
- Give the highest and lowest scores only for cases where you have complete certainty about correctness and value
## Scoring Criteria
- The score should be between 0 and 10
- A score of 10 means that the answer is perfect. This is the highest (best) score
- A score of 0 means that the answer does not meet any of the criteria. This is the lowest possible score you can give.
## Output format
ANSWER ONLY THE SCORE. DO NOT USE MARKDOWN. DO NOT PROVIDE ANYTHING OTHER THAN THE NUMBER
User prompt:
## Model inputs
{{inputs}}
## Model outputs
{{outputs}}
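As an illustration, a custom prompt that grades the output against the reference answer can be written as a list of messages in the OpenAI role/content format. The sketch below is only an example: the {{country}} variable stands in for an input column from your own testset, while {{outputs}} and {{reference}} are the built-in variables described above.

```python
# Hypothetical custom evaluation prompt in OpenAI message format.
# {{country}} is an example input column; {{outputs}} and {{reference}}
# are the built-in template variables.
custom_prompt = [
    {
        "role": "system",
        "content": (
            "You are an expert evaluator. Compare the model output to the "
            "reference answer and return a score between 0 and 10."
        ),
    },
    {
        "role": "user",
        "content": (
            "## Country\n{{country}}\n\n"
            "## Model output\n{{outputs}}\n\n"
            "## Reference answer\n{{reference}}"
        ),
    },
]
```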
The Model
You can configure the model by selecting one of the supported options (gpt-4o, gpt-5, gpt-5-mini, gpt-5-nano, claude-3-5-sonnet, claude-3-5-haiku, claude-3-5-opus). To use LLM-as-a-Judge, you'll need to set your OpenAI or Anthropic API key in the settings. The key is saved locally; it is only sent to our servers to run the evaluation and is not stored there.
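Conceptually, the evaluator pairs the prompt with one of the supported models. The snippet below is a minimal sketch of such a configuration, not the product's actual API; the field names are hypothetical.

```python
# Hypothetical evaluator configuration (field names are illustrative only).
evaluator_config = {
    "model": "gpt-4o",           # any of the supported options listed above
    "prompt": custom_prompt,     # the messages from the previous example
    "reference_column": "correct_answer",  # configurable under Advanced Setting
}
```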
Output Schema
You can configure the output schema to control what the LLM evaluator returns. This allows you to get structured feedback tailored to your evaluation needs.
Basic Configuration
The basic configuration lets you choose from common output types:
- Binary: Returns a simple pass/fail or yes/no judgment
- Multiclass: Returns a classification from a predefined set of categories
- Continuous: Returns a score between a minimum and maximum value
You can also enable Include Reasoning to have the evaluator explain its judgment. This option significantly improves the quality of evaluations by making the LLM's decision process transparent.
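To illustrate the difference between the basic output types, the hypothetical responses below show what the evaluator might return for each type with Include Reasoning enabled (the exact field names depend on your configuration):

```python
# Hypothetical evaluator responses for each basic output type.
binary_result = {
    "verdict": "pass",
    "reasoning": "The answer is factually correct and addresses the question.",
}

multiclass_result = {
    "category": "partially_correct",  # one of a predefined set of categories
    "reasoning": "The capital is right, but the population figure is outdated.",
}

continuous_result = {
    "score": 7.5,                     # between the configured minimum and maximum
    "reasoning": "Accurate overall, with one minor unsupported claim.",
}
```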
Advanced Configuration
For complete control, you can provide a custom JSON schema. This lets you define any output structure you need. For example, you could return multiple scores, confidence levels, detailed feedback categories, or any combination of fields.
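For instance, a custom schema that asks for separate accuracy and helpfulness scores plus a confidence level might look like the following sketch (written here as a Python dict; the field names are just an example):

```python
# Hypothetical custom JSON schema returning multiple scores and a confidence level.
custom_schema = {
    "type": "object",
    "properties": {
        "accuracy": {"type": "number", "minimum": 0, "maximum": 10},
        "helpfulness": {"type": "number", "minimum": 0, "maximum": 10},
        "confidence": {"type": "string", "enum": ["low", "medium", "high"]},
        "feedback": {"type": "string"},
    },
    "required": ["accuracy", "helpfulness", "confidence"],
}
```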