Evaluation Methodology

How BeLLMark Produces Trustworthy Results

A disciplined approach to LLM evaluation using blind testing, configurable judges, and transparent scoring

1. What We Measure

BeLLMark evaluates LLM responses against user-defined criteria. There is no fixed rubric. You decide what matters for your specific use case.

Configurable Criteria Per Run

Each benchmark run uses criteria tailored to your needs. Example dimensions include accuracy, completeness, tone, format adherence, policy compliance, and reasoning quality. Unlike fixed benchmarks, BeLLMark adapts to what you care about.

  • Customer support: empathy, accuracy, actionability, appropriate tone
  • Legal documents: citation accuracy, completeness, clarity, risk identification
  • Code generation: correctness, efficiency, readability, adherence to style guides
  • Content creation: originality, SEO optimization, brand voice alignment, factual accuracy

AI Criteria Generation

Designing evaluation frameworks is hard. BeLLMark lowers the barrier from "design an evaluation framework" to "describe what you care about." The AI generates custom evaluation criteria from your use case description.

How It Works

You describe your use case in plain language (e.g., "evaluating chatbot responses for customer support in a healthcare setting"). The LLM generates relevant criteria with descriptions and scoring guidance. You review, edit, and refine before running benchmarks.

2. Blind Evaluation: Eliminating Bias

When a benchmark runs, all model responses are collected, shuffled, and assigned blind labels (A, B, C, etc.). Judges evaluate responses without knowing which model produced which response. This prevents model-name bias.

Blind Evaluation Process

GPT-4o
Claude Opus 4.6
Gemini 2.5 Pro
↓ Shuffle & Blind
Response A
Response B
Response C
↓ Evaluation
Judge evaluates blind labels only
↓ Reveal After Scoring
A = Gemini 2.5 Pro
B = GPT-4o
C = Claude Opus 4.6
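The shuffle-and-blind step can be sketched in a few lines. This is an illustration of the idea, not BeLLMark's implementation; the model outputs are placeholders:

```python
import random

# Hypothetical per-model outputs for one benchmark question
responses = {
    "GPT-4o": "draft 1",
    "Claude Opus 4.6": "draft 2",
    "Gemini 2.5 Pro": "draft 3",
}

# Shuffle the models, then hand out blind labels A, B, C, ...
models = list(responses)
random.shuffle(models)
blind = {chr(ord("A") + i): responses[m] for i, m in enumerate(models)}

# Kept secret until scoring is complete, then revealed in the report
mapping = {chr(ord("A") + i): m for i, m in enumerate(models)}
```

The judge only ever sees `blind["A"]`, `blind["B"]`, `blind["C"]`; the `mapping` is applied after scoring.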

Why Blind Evaluation Matters

Without blind evaluation, judges (whether human or LLM) may show preference based on brand recognition (e.g., favoring GPT-4o because of its reputation) rather than response quality. Blind labels eliminate this bias. Additionally, BeLLMark sends consistent system prompts to all models to ensure fair comparison.

3. Judge Design

BeLLMark supports single-judge and multi-judge modes, with two evaluation approaches: comparison mode and separate mode.

Comparison Mode

The judge ranks all responses together (e.g., A > B > C). Blind labels are shuffled to prevent position bias. This mode is faster and works well when you want relative rankings.

"Response B is most accurate,
followed by A, then C."

Separate Mode

Each response is scored independently on an absolute scale (1-10 per criterion). This mode provides more granular feedback and works well when you need absolute scores, not just rankings.

"Response A: Accuracy 8/10,
Tone 7/10, Completeness 9/10"
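The two modes yield differently shaped judge outputs. A sketch of the two result structures; the field names are illustrative, not BeLLMark's actual schema:

```python
# Comparison mode: one relative ranking across all blind labels
comparison_result = {
    "ranking": ["B", "A", "C"],  # B judged best overall
    "reasoning": "Response B is most accurate, followed by A, then C.",
}

# Separate mode: absolute 1-10 scores per criterion, one result per response
separate_result = {
    "label": "A",
    "scores": {"Accuracy": 8, "Tone": 7, "Completeness": 9},
    "reasoning": "Response A cites the correct policy but could be warmer in tone.",
}
```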

Judge Model Selection Guidance

Use a capable model as judge (e.g., Claude Opus 4.6, GPT-4o). Ideally, the judge should not be one of the models being evaluated to avoid self-preference. Judge temperature follows the generation temperature setting unless manually overridden via advanced configuration. Multi-judge mode (2-3 judges) increases robustness and reveals disagreements, which signal borderline or ambiguous responses.

4. Rubric Design and Human Review

AI-generated criteria provide a strong starting point, but human review and customization are critical for domain-specific accuracy.

What AI Criteria Generation Produces

The AI generates a set of evaluation criteria with descriptions and scoring guidance. For example, for a customer support use case, criteria might include "empathy," "accuracy of information," "actionability of advice," and "appropriate tone."

Why Human Review Matters

Domain experts should review, edit, and refine generated criteria before running benchmarks. AI-generated criteria are a starting point, not a final answer. Customization ensures the rubric aligns with your specific requirements and organizational priorities.

Best Practice: Iterative Refinement

Start with AI-generated criteria. Run a small pilot benchmark. Review the results and judge reasoning. Adjust criteria to better capture what matters. Re-run with refined rubric. This iterative approach produces more reliable results than accepting the first draft.

5. Calibration and Confidence

How do you know your evaluation is reliable? BeLLMark provides automated calibration analysis, inter-rater reliability metrics, and bias detection to quantify confidence in results.

Automated Judge Calibration

Inter-Rater Reliability (Cohen’s / Fleiss’ Kappa): BeLLMark automatically computes Cohen’s κ for 2 judges or Fleiss’ κ for 3+ judges. Results are displayed with qualitative labels: slight (<0.2), fair (0.2–0.4), moderate (0.4–0.6), substantial (0.6–0.8), or almost perfect (>0.8).
Judge Calibration Analysis: The statistical dashboard identifies judges that systematically over- or under-score relative to their peers, flagging potential scoring drift with severity ratings.
Position Bias Detection: BeLLMark measures whether judges favor responses presented first or last by analyzing label bias as a proxy, reporting bias severity and direction in the statistical dashboard.
Length Bias Detection: Spearman rank correlation between response token count and judge score identifies judges that systematically favor longer or shorter responses. A warning is surfaced when |ρ| > 0.5.
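For the two-judge case, Cohen's κ is straightforward to compute by hand. A minimal stdlib sketch (the judge verdicts are hypothetical), matching the qualitative bands above:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges' categorical verdicts (e.g. per-question winners)."""
    n = len(labels_a)
    # Observed agreement: fraction of questions where both judges agree
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the judges labeled independently at their observed rates
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

judge1 = ["A", "B", "A", "C", "A", "B"]  # winner picked per question
judge2 = ["A", "B", "A", "A", "A", "B"]
kappa = cohens_kappa(judge1, judge2)  # 0.7 -> "substantial" agreement
```

Note that κ discounts agreement expected by chance, which is why it is preferred over raw percent agreement.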

Manual Calibration Strategies

Gold-Standard Examples: Include known-good responses as calibration anchors. If the judge consistently ranks these highly, confidence increases.
Consistency Checks: Run the same benchmark multiple times. Consistent results across runs indicate stable evaluation. BeLLMark’s ELO rating system tracks model performance across runs to surface trends.

Recommendation: Multi-Judge for High-Stakes Decisions

For critical evaluations (e.g., selecting an LLM provider for production use), use 2–3 judges and check the κ score in the Judge Calibration dashboard. A κ > 0.6 indicates substantial agreement. If judges disagree significantly, examine the bias analysis and individual judge reasoning. This hybrid approach combines automated statistical validation with human oversight for maximum confidence.

6. Scoring System & Aggregation

Understanding how scores are computed is essential for interpreting results and defending evaluation decisions to stakeholders.

Scale Semantics

Each response is scored on a 1–10 scale per criterion. The scale is anchored to three bands that help judges assign consistent scores:

  • 1–3 (Poor): The response misses the criterion, contains significant errors, or fails to address the prompt meaningfully.
  • 4–6 (Acceptable): The response partially addresses the criterion. It is usable but has notable gaps, inaccuracies, or missed nuances.
  • 7–10 (Good to Excellent): The response fully or exceptionally addresses the criterion. Scores of 9–10 indicate outstanding quality with no meaningful gaps.

Criteria Weighting

By default, all criteria are weighted equally. When you define three criteria (e.g., Accuracy, Completeness, Actionability), each contributes one-third to the overall score. You can prioritize specific criteria through your rubric design by giving judges explicit instructions to weight certain qualities more heavily.
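To see what weighting changes, compare equal weighting against a hypothetical explicit weight vector. Numeric weights are shown only for illustration; per the text above, BeLLMark prioritizes criteria through judge instructions rather than numeric weights:

```python
crit_scores = {"Accuracy": 8, "Completeness": 6, "Actionability": 7}

# Default: equal weighting, each criterion contributes 1/3
equal = sum(crit_scores.values()) / len(crit_scores)  # 7.0

# Hypothetical explicit weights, showing how prioritizing Accuracy shifts the score
weights = {"Accuracy": 0.5, "Completeness": 0.25, "Actionability": 0.25}
weighted = sum(crit_scores[c] * w for c, w in weights.items())  # 7.25
```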

Aggregation Method

Scores are aggregated using arithmetic mean in two stages:

1. Per criterion: If multiple judges evaluate the same response, their scores are averaged to produce a single per-criterion score for that response.
2. Overall score: The per-criterion averages are then averaged across all criteria to produce the model's overall score for that question.
3. Benchmark score: The per-question overall scores are averaged across all questions in the benchmark to produce each model's final benchmark score.
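The staged averaging can be sketched directly; the judge scores below are hypothetical:

```python
from statistics import mean

# Raw judge scores: scores[question][criterion] -> one score per judge (1-10)
scores = {
    "q1": {"Accuracy": [8, 7], "Completeness": [9, 9]},
    "q2": {"Accuracy": [6, 7], "Completeness": [5, 6]},
}

# Stage 1: average judges within each criterion
per_criterion = {q: {c: mean(v) for c, v in crits.items()} for q, crits in scores.items()}
# Stage 2: average criteria to get one overall score per question
per_question = {q: mean(crits.values()) for q, crits in per_criterion.items()}
# Stage 3: average questions to get the final benchmark score
benchmark_score = mean(per_question.values())  # 7.125
```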

Variance & Confidence Intervals

Averages alone can mask important disagreements. BeLLMark surfaces variance and computes confidence intervals to quantify uncertainty:

Confidence Intervals

Wilson Score CI on win rates: Win rates are displayed with ± margin of error (e.g., “73% ±12%”). Wilson score is preferred over normal approximation because it remains reliable even with small sample sizes (n < 30), which is common in LLM benchmarks.
Bootstrap CI on scores: Overall model scores include bootstrap 95% confidence intervals computed from 10,000 resamples. These show the range within which the “true” score likely falls, accounting for question-to-question variance.
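The Wilson score interval has a closed form, so a minimal sketch is easy to check against reported win rates. The 11-of-15 example is hypothetical:

```python
import math

def wilson_ci(wins, n, z=1.96):
    """95% Wilson score interval for a win rate of `wins` out of `n` questions."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin

low, high = wilson_ci(11, 15)  # ~73% win rate on a small 15-question benchmark
```

Note how wide the interval is at n = 15: roughly 0.48 to 0.89, which is exactly why small benchmarks need interval reporting rather than point estimates.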

When to Investigate

Overlapping confidence intervals: If two models’ CIs overlap substantially, the difference may not be statistically meaningful. Check the pairwise significance tests in the statistical dashboard.
High judge disagreement: When multi-judge standard deviation exceeds 2 points on the 1–10 scale, this flags ambiguous or borderline responses. Review the individual judge reasoning.
Score clustering: When all models score within 1 point of each other, the benchmark may not be differentiating meaningfully. Consider adding more challenging prompts or refining criteria.
Criterion-level variance: A model scoring 9 on Accuracy but 4 on Completeness reveals specific strengths and weaknesses that an overall average would obscure. Always check criterion-level breakdowns.

Transparency Principle

Every score in a BeLLMark report can be traced back to its source: the judge’s reasoning, the criterion applied, and the response evaluated. Exported reports include all statistical analysis — confidence intervals, significance tests, bias detection, and ELO ratings — so stakeholders can audit any result.

6b. Statistical Significance & Hypothesis Testing

“Model A scored higher than Model B” is not the same as “Model A is significantly better than Model B.” BeLLMark applies rigorous statistical tests to distinguish real differences from noise.

Pairwise Significance Testing

Wilcoxon Signed-Rank Test: A non-parametric paired-sample test that compares each model pair question-by-question. Unlike a t-test, it does not assume normal score distributions — important because LLM judge scores are often bimodal or skewed.
Holm–Bonferroni Correction: When comparing multiple model pairs (e.g., 6 models = 15 pairs), the risk of false positives increases. BeLLMark applies Holm–Bonferroni correction to control the family-wise error rate, preventing overclaiming.
Cohen’s d Effect Size: Statistical significance alone is not enough — a “significant” p-value with a tiny effect is practically meaningless. BeLLMark requires both p < 0.05 (corrected) AND Cohen’s d ≥ 0.2 (small effect threshold) before reporting a comparison as significant.
Statistical Power Analysis: When sample size is insufficient to detect meaningful differences, BeLLMark flags the result as underpowered and recommends a minimum question count for reliable conclusions.
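The Holm–Bonferroni step-down procedure is simple enough to sketch in full. This is a generic implementation of the correction, not BeLLMark's code; the p-values are hypothetical:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm correction: returns which hypotheses survive at family-wise alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    rejected = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):  # threshold loosens as rank grows
            rejected[i] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return rejected

# e.g. 4 pairwise model comparisons
significant = holm_bonferroni([0.001, 0.04, 0.03, 0.20])  # [True, False, False, False]
```

Note that 0.03 and 0.04 would each pass an uncorrected 0.05 threshold; the correction is what prevents that overclaiming.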

ELO Rating System

Cross-Run Rankings: BeLLMark maintains an ELO rating system that tracks model performance across multiple benchmark runs. This enables longitudinal comparison — how does a model perform over time, across different prompt suites and evaluation criteria?
Bayesian Adaptive K-Factor: New models start with wider rating swings (higher K) that narrow as more data is collected, following Bayesian principles. This prevents early flukes from permanently distorting ratings while allowing rapid convergence for well-tested models.
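An ELO update with an experience-dependent K-factor can be sketched as follows. The K schedule here is a hypothetical stand-in for BeLLMark's Bayesian adaptive K-factor, chosen only to show the wide-early, narrow-later behavior:

```python
def elo_update(rating_a, rating_b, score_a, games_a):
    """One rating update for model A after a pairwise comparison.
    score_a: 1.0 win, 0.5 tie, 0.0 loss. K shrinks as model A accumulates games."""
    k = max(16, 64 / (1 + games_a / 10))  # illustrative schedule, not BeLLMark's
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    return rating_a + k * (score_a - expected_a)

new_rating = elo_update(1500, 1500, 1.0, games_a=0)  # 1532.0: big swing for a new model
```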

Complete-Case Analysis

When a model has no judgment for a question (failed generation, skipped, etc.), that question is excluded from statistical analysis for that model rather than treated as score=0. This prevents phantom low scores from biasing means, confidence intervals, and p-values downward.
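The effect of complete-case analysis is easy to demonstrate with hypothetical scores:

```python
from statistics import mean

# None marks a failed or skipped generation for this model
model_scores = {"q1": 8.0, "q2": None, "q3": 6.5}

# Complete-case analysis: drop missing questions instead of scoring them 0
answered = [s for s in model_scores.values() if s is not None]
mean_score = mean(answered)  # 7.25, over 2 answered questions

# The naive alternative treats failures as 0 and biases the mean downward
naive_mean = mean(0 if s is None else s for s in model_scores.values())  # ~4.83
```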

7. Known Limitations and Mitigations

LLM-as-judge is powerful but not perfect. Understanding limitations helps you design better evaluations.

Domain-Specific Nuance
LLM judges may struggle with highly specialized domains (e.g., advanced medical or legal reasoning). Mitigation: Use multi-judge with domain-specific criteria, and include human spot-checks for critical evaluations.
Subtle Factual Errors
Judges may miss subtle factual mistakes if they sound plausible. Mitigation: Include ground-truth verification criteria, and use separate mode for absolute accuracy scoring.
Very Long Outputs
Judges may struggle to evaluate very long responses (e.g., 5000+ words) consistently. Mitigation: Break long evaluations into smaller sections, or use human review for outlier-length responses.
Cultural Context
LLM judges may not capture cultural nuances in tone or appropriateness. Mitigation: Include cultural context in criteria definitions, and validate with human reviewers from target audience.
Self-Preference
Some models may prefer their own style even in blind evaluation (e.g., Claude judging Claude). Mitigation: BeLLMark enforces self-judging prevention at the backend level — a model cannot be selected as both competitor and judge. The UI displays a warning if overlap is attempted, and the API rejects the request with an explanation.
Position Bias
Judges may favor the first or last response in a list. Mitigation: BeLLMark randomizes the presentation order of responses independently of blind labels (A/B/C), so no model systematically benefits from position. Additionally, BeLLMark measures position bias after the fact via label bias proxy analysis and reports severity in the statistical dashboard.
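The self-judging rule described above amounts to an overlap check between the competitor and judge sets. A sketch of the rule, not BeLLMark's actual API; the model identifiers are placeholders:

```python
def validate_run_config(competitors, judges):
    """Reject configurations where a model would judge itself."""
    overlap = set(competitors) & set(judges)
    if overlap:
        raise ValueError(f"Models cannot be both competitor and judge: {sorted(overlap)}")

validate_run_config(["gpt-4o", "gemini-2.5-pro"], ["claude-opus"])  # passes
# validate_run_config(["gpt-4o"], ["gpt-4o"])  # would raise ValueError
```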

BeLLMark Philosophy: Disciplined Evaluation, Not Replacement for Judgment

BeLLMark encourages disciplined, systematic evaluation. It does not replace human judgment. For high-stakes decisions, always combine automated evaluation with human review of judge reasoning and sample responses. The goal is to make better-informed decisions faster, not to eliminate human oversight entirely.

8. Best Practices Checklist

Follow these guidelines to maximize the reliability and usefulness of your LLM evaluations:

  • Use blind evaluation so judges never see model names.
  • Review and refine AI-generated criteria with domain experts before running benchmarks.
  • Run a small pilot, inspect judge reasoning, and iterate on the rubric.
  • Use 2–3 judges for high-stakes decisions and check inter-rater agreement (κ > 0.6).
  • Keep the judge model out of the competitor pool to avoid self-preference.
  • Include gold-standard responses as calibration anchors.
  • Check confidence intervals and significance tests before declaring a winner.
  • Export the full report so results can be audited and reproduced.

9. Reproducibility & Verification

Trustworthy benchmarks must be reproducible. A stakeholder should be able to re-run your evaluation with the same inputs and get comparable results. BeLLMark supports this through structured exports and versioning.

Exported Artifacts

Every benchmark run can be exported in five formats. All formats include full run metadata, statistical analysis, bias detection results, and ELO ratings:

HTML Report
Stakeholder-ready document with executive summary, charts, detailed scores, judge reasoning, confidence intervals, significance tests, and blind mapping reveal.
PDF Report
Portable document format for formal distribution. Same content as the HTML report, suitable for attachment to procurement documents or audit records.
PPTX Presentation
PowerPoint slide deck for presenting results in meetings. Includes summary charts, model rankings, and key statistical findings.
JSON Data
Machine-readable export with all scores, judge outputs, statistical analysis, model configurations, and run parameters. Suitable for programmatic analysis.
CSV Summary
Tabular format for spreadsheet analysis. Contains per-question, per-model scores and overall rankings.

What to Version

To reproduce a benchmark, record and preserve the following. All items are included in JSON exports automatically:

  • The prompt suite (question IDs and content)
  • Competitor model names and versions
  • Judge model, temperature, and system prompt
  • Criteria definitions and scoring guidance
  • Run parameters and the run ID

Re-running a Benchmark

1. Import the same prompt suite (or create questions matching the original IDs) to ensure identical inputs.
2. Configure identical judge settings — same judge model, temperature, system prompt, and criteria definitions.
3. Run the benchmark and compare results by run ID. Note: LLM outputs are non-deterministic, so exact scores may vary slightly. Focus on whether rankings and relative differences are consistent.

Verification Checklist

When verifying a benchmark result (your own or someone else's), check:

a. Compare exported JSON metadata — run parameters, model versions, and judge configuration should match.
b. Verify prompt set matches — confirm question IDs and content are identical between runs.
c. Check judge prompt consistency — even minor changes to judge instructions can shift scores.
d. Compare score distributions — rankings should be stable even if absolute scores shift by ±0.5 points due to LLM non-determinism.
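A quick stability check on two runs can be sketched as follows; the scores are hypothetical:

```python
# Overall scores from two runs of the same benchmark
run1 = {"GPT-4o": 7.8, "Claude": 8.1, "Gemini": 7.2}
run2 = {"GPT-4o": 7.5, "Claude": 8.3, "Gemini": 7.0}

# Rankings should match even if absolute scores drift
rank1 = sorted(run1, key=run1.get, reverse=True)
rank2 = sorted(run2, key=run2.get, reverse=True)
rankings_stable = rank1 == rank2  # True: Claude > GPT-4o > Gemini in both runs

# Largest per-model drift, to compare against the ±0.5 tolerance
max_shift = max(abs(run1[m] - run2[m]) for m in run1)  # ~0.3
```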

Reproducibility Is a Feature, Not an Afterthought

BeLLMark includes run metadata in every export specifically so that evaluations can be defended, audited, and repeated. When presenting results to stakeholders, share the exported report — it contains everything needed to understand and verify how scores were produced.

10. Primary Metric & Interpretation

Understanding BeLLMark's primary metric and its limitations is critical for making sound evaluation decisions.

Primary Metric: Question Win Rate

Question Win Rate (majority vote): the percentage of questions where a model is chosen as the winner by a majority of judges.

Tie Handling

Ties/no-majority questions are reported explicitly. For the win-rate percentage, ties are treated as “not a win” (conservative), and the tie rate is shown alongside the win rate.
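The conservative tie handling can be sketched with hypothetical per-question verdicts:

```python
# Per-question majority-vote winner; None marks a tie / no majority
winners = ["A", "B", "A", None, "A", None, "B", "A"]

n = len(winners)
win_rate_a = winners.count("A") / n  # 0.5: ties count as "not a win" (conservative)
tie_rate = winners.count(None) / n   # 0.25, reported alongside the win rate
```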

Reproducibility Contract

BeLLMark records the full run configuration so you can rerun the same benchmark later. However, because LLM generation and judging can be stochastic, reruns are expected to be comparable, not identical.

Interpretation Guardrail

Win rate is not “truth.” It is a summary of performance under your specific prompts, criteria, and judges. If you change the prompt set or the judges, outcomes can change.

Known Limitations

For a comprehensive list of v1 limitations, see the Known Limitations document. Key items include:

  • LLM judges may miss domain-specific nuance, subtle factual errors, and cultural context
  • Very long outputs are harder to evaluate consistently
  • Reruns are comparable, not identical, because generation and judging are stochastic
  • Results summarize performance under your specific prompts, criteria, and judges, not general quality

Trustworthy LLM evaluation requires blind testing, clear criteria, statistical rigor, bias detection, and human oversight. BeLLMark provides the tools — confidence intervals, significance tests, ELO ratings, and calibration analysis. You provide the judgment.