How BeLLMark Produces Trustworthy Results
A disciplined approach to LLM evaluation using blind testing, configurable judges, and transparent scoring
BeLLMark evaluates LLM responses against user-defined criteria. There is no fixed rubric. You decide what matters for your specific use case.
Each benchmark run uses criteria tailored to your needs. Example dimensions include accuracy, completeness, tone, format adherence, policy compliance, and reasoning quality. Unlike fixed benchmarks, BeLLMark adapts to what you care about.
Designing evaluation frameworks is hard. BeLLMark lowers the barrier from "design an evaluation framework" to "describe what you care about." The AI generates custom evaluation criteria from your use case description.
You describe your use case in plain language (e.g., "evaluating chatbot responses for customer support in a healthcare setting"). The LLM generates relevant criteria with descriptions and scoring guidance. You review, edit, and refine before running benchmarks.
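For illustration, generated criteria for the healthcare support example might look like the structure below. The field names and wording are assumptions for this sketch, not BeLLMark's actual schema:

```python
# Hypothetical shape of AI-generated criteria awaiting human review.
# Field names are illustrative, not BeLLMark's actual schema.
generated_criteria = [
    {
        "name": "accuracy of information",
        "description": "Medical and policy details are correct and current.",
        "scoring_guidance": "10 = fully correct; 5 = minor errors; 1 = dangerous misinformation.",
    },
    {
        "name": "empathy",
        "description": "Acknowledges the customer's situation with appropriate warmth.",
        "scoring_guidance": "10 = warm and specific; 1 = cold or dismissive.",
    },
]

# A reviewer edits these entries (rename, reword, add, remove)
# before any benchmark is run.
```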
When a benchmark runs, all model responses are collected, shuffled, and assigned blind labels (A, B, C, etc.). Judges evaluate responses without knowing which model produced which response. This prevents model-name bias.
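The shuffle-and-label step can be sketched as follows. The function and names here are illustrative, not BeLLMark's actual API:

```python
import random
import string

def assign_blind_labels(responses, seed=None):
    """Shuffle model responses and assign blind labels (A, B, C, ...).

    `responses` maps model name -> response text. Returns (labeled, key):
    `labeled` maps blind label -> response text and is all the judge sees;
    `key` maps blind label -> model name and stays hidden until judging
    is complete. A sketch, not BeLLMark's actual implementation.
    """
    rng = random.Random(seed)
    items = list(responses.items())
    rng.shuffle(items)  # break any correlation between model and position
    labeled, key = {}, {}
    for label, (model, text) in zip(string.ascii_uppercase, items):
        labeled[label] = text
        key[label] = model
    return labeled, key

labeled, key = assign_blind_labels(
    {"model-x": "Reply 1", "model-y": "Reply 2", "model-z": "Reply 3"},
    seed=42,
)
```

Judges receive only `labeled`; the `key` mapping is applied after all scores are in.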
Without blind evaluation, judges (whether human or LLM) may show preference based on brand recognition (e.g., favoring GPT-4o because of its reputation) rather than response quality. Blind labels eliminate this bias. Additionally, BeLLMark sends consistent system prompts to all models to ensure fair comparison.
BeLLMark supports single-judge and multi-judge modes, with two evaluation approaches: comparison mode and separate mode.
In comparison mode, the judge ranks all responses together (e.g., A > B > C). Blind labels are shuffled to prevent position bias. This mode is faster and works well when you want relative rankings.
In separate mode, each response is scored independently on an absolute scale (1–10 per criterion). This mode provides more granular feedback and works well when you need absolute scores, not just rankings.
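The two modes produce differently shaped judgments. A small sketch, with assumed (not actual) result structures:

```python
from statistics import mean

# Comparison mode: one ranking over blind labels, best first.
comparison_result = {"ranking": ["B", "A", "C"]}

# Separate mode: absolute 1-10 scores per criterion for each response.
separate_result = {
    "A": {"accuracy": 8, "completeness": 7},
    "B": {"accuracy": 9, "completeness": 9},
    "C": {"accuracy": 5, "completeness": 6},
}

# Separate mode yields absolute per-response scores...
overall = {label: mean(scores.values()) for label, scores in separate_result.items()}

# ...from which a ranking can also be derived, while comparison mode
# gives only the relative ordering.
derived_ranking = sorted(overall, key=overall.get, reverse=True)
```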
Use a capable model as judge (e.g., Claude Opus 4.6, GPT-4o). Ideally, the judge should not be one of the models being evaluated to avoid self-preference. Judge temperature follows the generation temperature setting unless manually overridden via advanced configuration. Multi-judge mode (2-3 judges) increases robustness and reveals disagreements, which signal borderline or ambiguous responses.
AI-generated criteria provide a strong starting point, but human review and customization are critical for domain-specific accuracy.
The AI generates a set of evaluation criteria with descriptions and scoring guidance. For example, for a customer support use case, criteria might include "empathy," "accuracy of information," "actionability of advice," and "appropriate tone."
Domain experts should review, edit, and refine generated criteria before running benchmarks; customization ensures the rubric aligns with your specific requirements and organizational priorities.
Start with AI-generated criteria. Run a small pilot benchmark. Review the results and judge reasoning. Adjust the criteria to better capture what matters. Re-run with the refined rubric. This iterative approach produces more reliable results than accepting the first draft.
How do you know your evaluation is reliable? BeLLMark provides automated calibration analysis, inter-rater reliability metrics, and bias detection to quantify confidence in results.
For critical evaluations (e.g., selecting an LLM provider for production use), use 2–3 judges and check the κ score in the Judge Calibration dashboard. A κ > 0.6 indicates substantial agreement. If judges disagree significantly, examine the bias analysis and individual judge reasoning. This hybrid approach combines automated statistical validation with human oversight for maximum confidence.
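As a sketch of the agreement statistic, here is Cohen's κ for two judges' per-question winner picks. This illustrates the math only; BeLLMark's dashboard may use a multi-rater variant such as Fleiss' κ when three judges are configured:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labeling the same questions.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance given each judge's
    label frequencies. A sketch of the statistic, not BeLLMark's code.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[lbl] * freq_b[lbl] for lbl in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Two judges picking a winning model per question:
judge1 = ["A", "A", "B", "A", "B", "B"]
judge2 = ["A", "A", "B", "B", "B", "B"]
kappa = cohens_kappa(judge1, judge2)  # agreement above chance, ~0.67
```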
Understanding how scores are computed is essential for interpreting results and defending evaluation decisions to stakeholders.
Each response is scored on a 1–10 scale per criterion. The scale is anchored to three bands that help judges assign consistent scores:
By default, all criteria are weighted equally. When you define three criteria (e.g., Accuracy, Completeness, Actionability), each contributes one-third to the overall score. You can prioritize specific criteria through your rubric design by giving judges explicit instructions to weight certain qualities more heavily.
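The arithmetic is a weighted mean with equal weights by default. The numeric weight parameters below only illustrate the math; in BeLLMark itself, prioritization happens through rubric instructions to the judge rather than explicit weights:

```python
def overall_score(criterion_scores, weights=None):
    """Weighted mean of per-criterion scores; equal weights by default."""
    if weights is None:
        weights = {c: 1.0 for c in criterion_scores}  # each criterion counts equally
    total_w = sum(weights.values())
    return sum(criterion_scores[c] * weights[c] for c in criterion_scores) / total_w

scores = {"accuracy": 9, "completeness": 6, "actionability": 6}

equal = overall_score(scores)  # each of the 3 criteria contributes one-third
accuracy_heavy = overall_score(
    scores, {"accuracy": 2, "completeness": 1, "actionability": 1}
)  # doubling accuracy's weight pulls the overall score toward it
```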
Scores are aggregated using arithmetic mean in two stages:
Averages alone can mask important disagreements. BeLLMark surfaces variance and computes confidence intervals to quantify uncertainty:
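A minimal sketch of both aggregation stages plus a normal-approximation 95% confidence interval. BeLLMark's exact interval method may differ (e.g., a t-distribution or bootstrap):

```python
from math import sqrt
from statistics import mean, stdev

def aggregate(per_question_scores):
    """Two-stage arithmetic mean with a simple 95% confidence interval.

    Stage 1: average each question's per-criterion scores into one score.
    Stage 2: average the per-question scores into the model's overall score.
    The CI uses 1.96 * standard error as a normal-approximation sketch.
    """
    question_means = [mean(c.values()) for c in per_question_scores]  # stage 1
    overall = mean(question_means)                                    # stage 2
    se = stdev(question_means) / sqrt(len(question_means))
    return overall, (overall - 1.96 * se, overall + 1.96 * se)

runs = [
    {"accuracy": 8, "completeness": 6},
    {"accuracy": 9, "completeness": 7},
    {"accuracy": 7, "completeness": 7},
]
overall, ci = aggregate(runs)  # a wide CI flags high variance across questions
```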
Every score in a BeLLMark report can be traced back to its source: the judge’s reasoning, the criterion applied, and the response evaluated. Exported reports include all statistical analysis — confidence intervals, significance tests, bias detection, and ELO ratings — so stakeholders can audit any result.
“Model A scored higher than Model B” is not the same as “Model A is significantly better than Model B.” BeLLMark applies rigorous statistical tests to distinguish real differences from noise.
When a model has no judgment for a question (failed generation, skipped, etc.), that question is excluded from statistical analysis for that model rather than treated as score=0. This prevents phantom low scores from biasing means, confidence intervals, and p-values downward.
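The exclusion rule matters numerically. A small sketch contrasting exclusion with the naive score=0 treatment:

```python
from statistics import mean

def model_mean(scores):
    """Average per-question scores, excluding missing judgments (None).

    Missing entries (failed generation, skipped question) are dropped
    rather than counted as 0, so they cannot drag the mean down.
    """
    valid = [s for s in scores if s is not None]
    return mean(valid)

scores = [8, None, 7, 9, None]

correct = model_mean(scores)  # mean over the 3 judged questions only
naive = mean(s if s is not None else 0 for s in scores)  # phantom zeros bias the mean low
```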
LLM-as-judge is powerful but not perfect. Understanding limitations helps you design better evaluations.
BeLLMark encourages disciplined, systematic evaluation. It does not replace human judgment. For high-stakes decisions, always combine automated evaluation with human review of judge reasoning and sample responses. The goal is to make better-informed decisions faster, not to eliminate human oversight entirely.
Follow these guidelines to maximize the reliability and usefulness of your LLM evaluations.
Trustworthy benchmarks must be reproducible. A stakeholder should be able to re-run your evaluation with the same inputs and get comparable results. BeLLMark supports this through structured exports and versioning.
Every benchmark run can be exported in five formats. All formats include full run metadata, statistical analysis, bias detection results, and ELO ratings:
To reproduce a benchmark, record and preserve the following. All items are included in JSON exports automatically:
The exact model version (e.g., gpt-4o-2024-11-20) is captured per generation, providing an audit trail even if provider versioning changes.
When verifying a benchmark result (your own or someone else's), check:
BeLLMark includes run metadata in every export specifically so that evaluations can be defended, audited, and repeated. When presenting results to stakeholders, share the exported report — it contains everything needed to understand and verify how scores were produced.
Understanding BeLLMark's primary metric and its limitations is critical for making sound evaluation decisions.
Question Win Rate (majority vote): the percentage of questions where a model is chosen as the winner by a majority of judges.
Ties/no-majority questions are reported explicitly. For the win-rate percentage, ties are treated as “not a win” (conservative), and the tie rate is shown alongside the win rate.
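A sketch of majority-vote win rate with the conservative tie handling described above; the data structures are illustrative:

```python
from collections import Counter

def question_win_rate(votes_per_question, model):
    """Majority-vote win rate for `model`, with ties counted as non-wins.

    `votes_per_question` is a list of per-question judge votes, e.g.
    [["A", "A", "B"], ...]. A model wins a question only if a strict
    majority of judges picked it; ties and no-majority questions are
    tallied separately and reported alongside the win rate.
    """
    wins = ties = 0
    for votes in votes_per_question:
        counts = Counter(votes)
        top, top_n = counts.most_common(1)[0]
        if top_n <= len(votes) / 2 or list(counts.values()).count(top_n) > 1:
            ties += 1  # no strict majority winner
        elif top == model:
            wins += 1
    n = len(votes_per_question)
    return wins / n, ties / n

votes = [
    ["A", "A", "B"],  # A wins by majority
    ["B", "B", "A"],  # B wins; not a win for A
    ["A", "B", "C"],  # three-way split: no majority, counted as a tie
]
win_rate_a, tie_rate = question_win_rate(votes, "A")
```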
BeLLMark records the full run configuration so you can rerun the same benchmark later. However, because LLM generation and judging can be stochastic, reruns are expected to be comparable, not identical.
Win rate is not “truth.” It is a summary of performance under your specific prompts, criteria, and judges. If you change the prompt set or the judges, outcomes can change.
For a comprehensive list of v1 limitations, see the Known Limitations document. Key items include: