Golden test set

Filters

Enabling the LLM judge adds a second LLM call per record (Claude). Runs are slower, but each response gets a 1–5 quality score.

Models to compare

Eval status

What this is. Each enabled model receives the same prompts the teacher LLMs answered when the synthetic data was generated. The system prompt, user input, and reference output for every example come from the golden set you've selected, which was bootstrapped from ml[-env]/raw/ records that scored well and then filtered through human review. The model under test generates its own response to each prompt, and the dashboard captures that response for scoring against the teacher's reference output.
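The flow described above can be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual code: `GoldenExample`, `Capture`, `run_eval`, and the idea that each model is a plain callable taking a system prompt and user input are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical record shapes; the real golden-set schema is not shown in the text.
@dataclass
class GoldenExample:
    system_prompt: str
    user_input: str
    reference_output: str  # the teacher LLM's answer

@dataclass
class Capture:
    model_name: str
    example: GoldenExample
    response: str  # what the model under test produced

# Assumption: a model is any callable (system_prompt, user_input) -> response text.
ModelFn = Callable[[str, str], str]

def run_eval(models: dict[str, ModelFn], golden_set: list[GoldenExample]) -> list[Capture]:
    """Send every golden example to every enabled model and capture the responses."""
    captures: list[Capture] = []
    for name, generate in models.items():
        for example in golden_set:
            # The model sees only the prompt, never the reference output.
            response = generate(example.system_prompt, example.user_input)
            captures.append(Capture(name, example, response))
    return captures
```

Each `Capture` keeps the response next to its `GoldenExample`, so a later scoring step can compare the response against `reference_output` without re-reading the golden set.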

How grading works. Grading has two layers, both visible in the comparison table below. Deterministic metrics (no LLM call) check whether the response is well-formed JSON, has the required fields, and stays within the structural rules of the prompt. The LLM-as-judge layer (Claude Sonnet 4) reads the prompt, the expected output, and the model's response, then rates the response 1–5 on Relevance, Completeness, Accuracy, and Format Quality. Hover over any metric or score for the rubric and value range.
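As a rough sketch of the two layers, the deterministic checks might look like the first function below, and the four judge ratings might be combined as in the second. The function names, the metric keys, and the use of a plain mean to get a single 1–5 score are assumptions for illustration; the text does not say how the dashboard combines the four ratings.

```python
import json

def deterministic_metrics(response: str, required_fields: list[str]) -> dict[str, bool]:
    """Structural checks that need no LLM call (hypothetical metric names)."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        # Not parseable, so no field check is possible.
        return {"valid_json": False, "has_required_fields": False}
    # Required fields only make sense on a JSON object, not a list or scalar.
    is_object = isinstance(parsed, dict)
    has_fields = is_object and all(field in parsed for field in required_fields)
    return {"valid_json": True, "has_required_fields": has_fields}

# The four judge dimensions named in the text, as assumed dictionary keys.
JUDGE_DIMENSIONS = ("relevance", "completeness", "accuracy", "format_quality")

def overall_judge_score(ratings: dict[str, int]) -> float:
    """Average the four 1-5 ratings (assumption: unweighted mean)."""
    for dim in JUDGE_DIMENSIONS:
        value = ratings[dim]
        if not 1 <= value <= 5:
            raise ValueError(f"{dim} rating {value} is outside the 1-5 range")
    return sum(ratings[dim] for dim in JUDGE_DIMENSIONS) / len(JUDGE_DIMENSIONS)
```

Keeping the deterministic layer separate means it can run on every record, even when the judge is disabled to save the second LLM call.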

Recent runs