Golden test set
Filters
Models to compare
Eval status
What this is. Each enabled model receives the same
prompts the teacher LLMs answered when the synthetic data was
generated. The system prompt, user input, and reference output
for every example come from the golden set you've selected, which
was bootstrapped from ml[-env]/raw/ records that
scored well and then filtered by human review. The model under test
generates its own response to each prompt, and the dashboard captures
that response for scoring against the teacher's reference output.
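The flow described above can be sketched roughly as follows. This is an illustrative outline only, not the dashboard's actual code: `GoldenExample`, `collect_responses`, and the `(system_prompt, user_input) -> str` model signature are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GoldenExample:
    """One golden-set record (field names are assumptions for this sketch)."""
    system_prompt: str
    user_input: str
    reference_output: str  # the teacher LLM's answer


# Hypothetical model interface: takes (system_prompt, user_input), returns text.
ModelFn = Callable[[str, str], str]


def collect_responses(
    golden_set: List[GoldenExample],
    models: Dict[str, ModelFn],
) -> List[dict]:
    """Send every golden example to every enabled model and record the reply."""
    rows = []
    for index, example in enumerate(golden_set):
        for name, generate in models.items():
            rows.append({
                "example": index,
                "model": name,
                "response": generate(example.system_prompt, example.user_input),
                "reference": example.reference_output,  # kept for later scoring
            })
    return rows
```

Every model sees identical prompts, so any difference in the captured responses comes from the model rather than from the input.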
How grading works. There are two layers, both visible in the comparison table below. Deterministic metrics (no LLM call) check whether the response is well-formed JSON, has the required fields, and stays within the structural rules of the prompt. The LLM-as-judge, Claude Sonnet 4, reads the prompt, the expected output, and the model's response, then rates the response 1–5 on each of Relevance, Completeness, Accuracy, and Format Quality. Hover over any metric or score for the rubric and value range.
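As a rough illustration of the deterministic layer, the first two checks (well-formed JSON and required fields) might look like the sketch below. The function name, the returned keys, and the `summary` and `tags` field names are assumptions made for this example; the real required fields depend on the prompt, and the prompt-specific structural rules are not shown.

```python
import json

# Hypothetical required fields; the actual list depends on the prompt.
REQUIRED_FIELDS = ("summary", "tags")


def deterministic_metrics(response_text, required_fields=REQUIRED_FIELDS):
    """Check a model response without calling an LLM."""
    all_fields = list(required_fields)
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        # Not parseable at all: every later check fails too.
        return {"valid_json": False, "has_required_fields": False,
                "missing_fields": all_fields}

    if not isinstance(parsed, dict):
        # Valid JSON (e.g. a list or a number) but not an object with fields.
        return {"valid_json": True, "has_required_fields": False,
                "missing_fields": all_fields}

    missing = [name for name in required_fields if name not in parsed]
    return {"valid_json": True, "has_required_fields": not missing,
            "missing_fields": missing}
```

Checks like these are cheap and repeatable, which is why they can run on every response before the judge model is consulted.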