Private, Systematic LLM Evaluation

Self-hosted benchmarking studio for comparing language models with blind A/B/C testing, configurable judges, and statistical analysis. No SaaS middleman — your prompts go directly to the providers you choose, or stay fully local with LM Studio.

Why BeLLMark

8 models. 25 prompts. 23 minutes — blind-tested, statistically ranked, report ready.

"How do we know we're selecting the right model?"

Justify the decision

You're putting an AI model into production and someone will ask why you picked it. BeLLMark gives you a blind comparison with confidence intervals, bias checks, and an exportable report you can attach to a sign-off.

Watch how it works →

"Which model is best for us?"

Find the best fit

Five models could work for your use case. Marketing benchmarks won't tell you which one handles your prompts, documents, and data best. BeLLMark runs your actual workload against all of them and ranks the results.

"Which model gives the best value for quality, speed and cost?"

Find the sweet spot

Can a free local model do 90% of what a cloud flagship does? Is the faster model good enough? BeLLMark shows you the quality-cost-speed tradeoff per model so you pick with data, not vibes.

How It Works

1

Configure Models

Add API keys for OpenAI, Anthropic, Google, or local LM Studio endpoints.

2

Define Questions

Write custom prompts or use AI to generate domain-specific test cases.

3

Run Benchmark

Models generate responses in parallel, then judges evaluate them blindly.

4

Analyze Results

View charts, ELO rankings, statistical significance tests, and bias analysis. Export to HTML, JSON, CSV, PPTX, or PDF.

Everything you need for systematic LLM evaluation

Blind A/B/C Testing

Compare up to 6 models at once. Responses are shuffled and labeled anonymously before judging, with mappings revealed only after scoring is complete. This reduces model-identity bias and supports objective evaluation.

LLM-as-Judge

Use one or more language models as judges with customizable criteria. Choose separate scoring or direct comparison modes.

AI Criteria Generation

Let an LLM design evaluation rubrics for your specific use case, or write custom scoring criteria from scratch.

9 LLM Providers

OpenAI, Anthropic, Google, Grok, DeepSeek, GLM, Kimi, Mistral, and local LM Studio models. Add your own providers easily.

Statistical Analysis

Bootstrap CI, Wilcoxon significance tests, Cohen’s d effect sizes, ELO ratings, bias detection, and judge calibration — all built into the dashboard.

Rich Exports

Export to HTML reports, JSON, CSV, consulting-grade PPTX, or PDF — all formats include statistical summaries and confidence intervals.

Evaluation methodology you can defend

Enterprise-grade statistical rigor from blind evaluation to bias detection

Blind Evaluation

Responses are shuffled and assigned blind labels (A, B, C...). Judges evaluate without knowing which model produced which response. Mapping is revealed only after all scoring is complete.

Transparent Rubrics

AI generates evaluation criteria from your use case description, or write custom rubrics from scratch. You review and approve before any benchmark runs. Your criteria, your standards.

Auditable Reasoning

Every judge score includes written reasoning. Expand any result to read exactly why a judge scored a response the way it did. Multi-judge mode provides confidence through agreement.

Pipeline: Configure Models → Shuffle & Blind → Judge Evaluation → Score Aggregation → Statistical Analysis → Export Report

Scoring System

Each response is scored 1–10 per criterion with defined anchors, then aggregated via weighted arithmetic mean.

Scores follow a three-tier scale: 1–3 (poor), 4–6 (acceptable), 7–10 (good to excellent). You define the criteria that matter for your use case — accuracy, completeness, tone, or any custom dimension. Each criterion carries a configurable weight.

Aggregation follows a three-stage pipeline: per-criterion scores are averaged across judges, then weighted by criterion importance, then averaged across questions to produce a final model score.
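
For readers who want to see the mechanics, here is a minimal pure-Python sketch of that three-stage pipeline. The function name and data shapes are illustrative, not BeLLMark's actual internals:

```python
def aggregate_model_score(scores, weights):
    """Aggregate raw judge scores into one model score.

    scores:  {question: {criterion: [scores from each judge, 1-10]}}
    weights: {criterion: weight}
    """
    question_scores = []
    for criteria in scores.values():
        # Stage 1: average each criterion's scores across judges
        per_criterion = {c: sum(v) / len(v) for c, v in criteria.items()}
        # Stage 2: weighted mean across criteria
        total_w = sum(weights[c] for c in per_criterion)
        weighted = sum(s * weights[c] for c, s in per_criterion.items()) / total_w
        question_scores.append(weighted)
    # Stage 3: average across questions
    return sum(question_scores) / len(question_scores)
```

With one question, two judges, and accuracy weighted 2:1 over tone, `aggregate_model_score({"q1": {"accuracy": [8, 7], "tone": [6, 6]}}, {"accuracy": 2.0, "tone": 1.0})` yields 7.0.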

Confidence & Significance

Statistical tests tell you whether score differences are real or could be due to chance.

Wilson Score CI — confidence intervals on win rates that work correctly even with small sample sizes (unlike normal approximation).
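
The Wilson interval is compact enough to show inline. This is a textbook implementation (with z = 1.96 for 95% coverage), not BeLLMark's code; the z²/n terms are what pull the interval toward 0.5 at small n, avoiding the normal approximation's overconfidence:

```python
import math

def wilson_ci(wins, n, z=1.96):
    """95% Wilson score interval for a win rate of `wins` out of `n`."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))
```

For 8 wins out of 10, the interval is roughly (0.49, 0.94) — wide, as it should be for so few trials.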

Bootstrap CI — 1,000-resample confidence intervals on score differences between models, giving you a credible range for "how much better is Model A than Model B."
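
Conceptually, the resampling works like this sketch (seeded for reproducibility; the percentile bookkeeping here is illustrative and may differ from BeLLMark's implementation):

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI on the mean paired score difference A - B."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If Model A beats Model B by exactly 1 point on every question, every resample has mean 1.0 and the interval collapses to (1.0, 1.0).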

Wilcoxon Signed-Rank — non-parametric pairwise significance test. Doesn't assume scores are normally distributed. Reports p-values for every model pair.

Holm–Bonferroni Correction — adjusts p-values when comparing multiple models simultaneously, preventing false positives from repeated testing.
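
The step-down adjustment itself is only a few lines; this is a standard Holm implementation shown for transparency, not BeLLMark's code:

```python
def holm_bonferroni(p_values):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, adj)  # enforce monotone non-decreasing
        adjusted[i] = running_max
    return adjusted
```

For raw p-values [0.01, 0.04, 0.03] the adjusted values are [0.03, 0.06, 0.06] — the smallest p is tripled, and the rest are floored by the step before them.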

Statistical Power Analysis — estimates whether your sample size is large enough to detect meaningful differences, so you know if "not significant" means "no difference" or "not enough data."

Effect Sizes & Comparisons

Beyond "is it significant?" — how large is the difference, and how do all models rank together?

Cohen's d — standardized effect size for every model pair. Labeled as small (0.2), medium (0.5), or large (0.8+) so you can judge practical significance, not just statistical significance.
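
Cohen's d is the mean difference divided by the pooled standard deviation; a short sketch with the thresholds from above (illustrative, not BeLLMark's internals):

```python
import math

def cohens_d(a, b):
    """Cohen's d between two score samples, using pooled standard deviation."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    pooled = math.sqrt(((len(a) - 1) * var_a + (len(b) - 1) * var_b)
                       / (len(a) + len(b) - 2))
    return (mean_a - mean_b) / pooled

def effect_label(d):
    """Conventional small/medium/large labels for |d|."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "negligible"
```

Scores [7, 8, 9] vs [5, 6, 7] give d = 2.0 — a "large" effect despite only three samples, which is exactly why effect sizes are read alongside significance tests, not instead of them.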

Pairwise Comparison Matrix — every model compared against every other model with significance indicators, effect sizes, and confidence intervals in a single table.

Friedman Test — non-parametric test for overall ranking significance across all models simultaneously.

Bias Detection

Automatically detects four types of evaluation bias that could undermine your results.

Position Bias — does the order in which responses are presented affect scores? Detected by comparing scores across presentation positions (A vs B vs C).

Length Bias — do longer responses get higher scores regardless of quality? Measured via Spearman rank correlation (ρ) between response length and score.
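
Spearman's ρ is Pearson correlation on ranks; this sketch uses the classic d² formula (no tie correction, for simplicity) to show the idea:

```python
def spearman_rho(x, y):
    """Spearman rank correlation via the d-squared formula (ignores ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

If scores rise strictly with response length, ρ = 1.0 — a red flag that the judge may be rewarding verbosity rather than quality.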

Self-Preference Bias — does a judge model favor responses from itself or its own provider? Flagged when detected.

Verbosity Bias — related to length bias but focused on token count relative to content density. LC (length-controlled) win rates correct for this by adjusting scores for response length.

Judge Calibration

Measures how consistent and reliable your judges are, both individually and as a group.

Cohen's Kappa (2 judges) / Fleiss' Kappa (3+ judges) — inter-rater agreement beyond what would be expected by chance. Values above 0.6 indicate substantial agreement; below 0.4 suggests judges are evaluating differently.
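
For two judges, Cohen's kappa compares observed agreement against the agreement expected from each judge's label frequencies alone. A minimal sketch (categorical labels, e.g. which blind response won):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two judges."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the judges' marginal label frequencies
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Two judges who agree on every verdict score κ = 1.0; judges whose agreement is no better than their label frequencies predict score κ = 0.0, even though they still match on half the items.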

ICC (Intraclass Correlation) — measures the consistency of absolute score values across judges, not just ranking agreement.

Per-Judge Reliability — individual reliability scores identify if one judge is consistently an outlier, so you can investigate or remove unreliable judges.

ELO Rating System

Track model performance across multiple benchmark runs with an adaptive rating system.

Bayesian Adaptive K-Factor — new or uncertain models have higher K-factors (ratings change more per run), while established models have lower K-factors (ratings are more stable). This means a single anomalous result won't dramatically change a well-established model's ranking.
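
The core Elo update is standard; the K schedule below is an illustrative linear shrink, not BeLLMark's actual Bayesian schedule:

```python
def elo_update(rating_a, rating_b, score_a, k_a=32, k_b=32):
    """One Elo update; score_a is 1.0 for an A win, 0.5 draw, 0.0 loss."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k_a * (score_a - expected_a)
    new_b = rating_b + k_b * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

def adaptive_k(games_played, k_max=40, k_min=10):
    """Illustrative adaptive K: new models move fast, veterans move slowly."""
    return max(k_min, k_max - 2 * games_played)
```

Two fresh models at 1500 that split a decisive result move to 1516 and 1484 with K = 32; after many runs, the same upset would shift an established model's rating far less.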

Cross-Run Tracking — ratings update automatically when any benchmark run completes. The global leaderboard reflects cumulative performance across all your evaluations.

Rating History — per-model rating history charts show how a model's ranking has evolved over time as more benchmarks are run.

Read the complete methodology document →
EU AI Act: How structured evaluation supports deployer obligations →

See BeLLMark in action

Real results from a completed benchmark — 8 models, 25 analytical reasoning questions, 3 independent judges. Explore the full analysis below.

Watch a full benchmark run

Or explore a real result yourself

Download sample exports: HTML PPTX PDF JSON CSV

Detailed results & per-criterion scoring

Full benchmark overview with model rankings, token usage, and cost breakdowns — plus granular scores by criterion with confidence intervals and win rates.

Benchmark overview
Evaluation scores

Ships with 5 research-backed benchmark suites

Ready-to-use evaluation sets covering reasoning, writing, compliance, calibration, and domain expertise. Import them with one click.

Analytical Reasoning
Multi-step logic, probability, scheduling
25 questions · 5 criteria
Instruction Compliance
Format constraints, word counts, prohibited elements
25 questions · 5 criteria
Long-form Writing
Essays, reports, audience calibration
25 questions · 5 criteria
Epistemic Calibration
Factual accuracy, hedging, confabulation
25 questions · 4 criteria
Domain Expert Communication
Technical accuracy across medicine, law, engineering
25 questions · 5 criteria

Built for teams who need to make informed AI decisions

Compliance & Legal

Evaluate LLM accuracy on legal reasoning, contract analysis, and regulatory interpretation without sending client data to external benchmarking services.

  • Test contract summarization accuracy
  • Compare legal reasoning capabilities
  • Validate compliance advisory quality
  • Air-gapped option with local models
EU AI Act evaluation requirements →

AI Consultants

Provide clients with objective, data-driven model recommendations backed by systematic benchmarking on their specific use cases.

  • Generate client-specific test cases
  • Deliver professional HTML reports
  • Compare cost vs. performance tradeoffs
  • Justify model selection decisions

Engineering Teams

Make informed decisions about which LLM to use in production by testing on real prompts before committing to API contracts.

  • Test local vs. cloud model quality
  • Validate prompt engineering changes
  • Compare reasoning model performance
  • A/B test prompt templates

Three ways to use BeLLMark

Same software. Your license depends on how you’ll use it and what evidence level you need.

Non-commercial

Free forever

€0

PolyForm Noncommercial 1.0.0

  • All current and future features
  • Self-hosted on your infrastructure
  • 9 LLM provider integrations
  • Blind A/B/C evaluation
  • Statistical analysis & ELO ratings
  • HTML / JSON / CSV / PPTX / PDF exports
  • Community support on GitHub
Clone from GitHub →

Personal, educational, research, and non-commercial use. Upgrade when you commercialize.

Enterprise

Multi-entity license

€2,999

one-time · flat rate · unlimited users

  • Everything in Commercial
  • 4+ legal entities under one license
  • Unlimited users across all entities
  • All current and future features
  • Free updates for life
  • Best-effort email support
  • 1-hour onboarding session with the creator
Buy Enterprise →

For organizations deploying BeLLMark across multiple subsidiaries, divisions, or legal entities.

Creem handles invoicing — European Merchant of Record, EU-invoice-compliant, VAT-handled in 40+ countries. 30-day money-back guarantee · [email protected]

Frequently Asked Questions

What does "per legal entity" mean?

One license covers unlimited users within a single legal entity (corporation, LLC, nonprofit, etc.). If you have multiple subsidiaries or separate legal entities, each needs its own license. Freelancers and sole proprietors need one license for their business use.

Is BeLLMark open source?

BeLLMark is source-available under the PolyForm Noncommercial 1.0.0 license. You can view, modify, and use the code for free for personal, educational, and non-commercial purposes. Commercial use requires a paid license (€799 one-time per legal entity, €499 introductory during the first 60 days). See our licensing terms for what the commercial license covers.

Do I need technical skills to use BeLLMark?

No programming knowledge required! BeLLMark has a clean web interface. If you can use a web browser and have API keys for LLM providers (like OpenAI or Anthropic), you can run benchmarks. Installation requires basic command-line familiarity (Docker or Python/Node.js).

How do I install BeLLMark?

Three installation options: (1) Docker Compose (recommended, one command), (2) Manual setup with Python backend + Node.js frontend, or (3) Production build served from a single backend process. Full instructions in the GitHub repository. Typical setup time: 5–10 minutes.

What LLM providers are supported?

BeLLMark supports OpenAI (GPT-4, GPT-5, o1), Anthropic (Claude Opus/Sonnet/Haiku), Google (Gemini 2.5/3), Grok, DeepSeek, GLM, Kimi, Mistral, and local LM Studio models. The architecture is modular — adding new providers is straightforward by implementing the OpenAI-compatible endpoint pattern.

How do updates work?

All updates are free for life — including future major versions. Pull the latest code from GitHub whenever a new version is released. No subscription fees, no forced upgrade cycles, no license keys to manage. Your commercial license covers all future features and improvements.

What about my API keys and data privacy?

BeLLMark runs entirely on your infrastructure with zero telemetry — no analytics, no phone-home, no BeLLMark-operated cloud. API keys are encrypted at rest in your local SQLite database. When using cloud LLM providers (OpenAI, Anthropic, etc.), prompts go directly from your server to the provider's API — never through BeLLMark. For fully air-gapped operation where nothing leaves your network, use local models via LM Studio. This architecture supports your compliance goals for frameworks like GDPR and HIPAA — actual certifications depend on your infrastructure setup and LLM provider agreements. Contact us for framework-specific guidance.

How does LLM-as-judge work and how do you validate it?

BeLLMark sends each model's response to a judge LLM along with your evaluation criteria. The judge scores each response on a 1–10 scale per criterion, providing written reasoning for each score. Responses are presented with blind labels (A, B, C) so the judge doesn't know which model produced which response. For validation, use multi-judge mode (multiple LLMs evaluate independently) and check agreement — high inter-judge agreement and clean calibration metrics are strong evidence that results are reliable and reproducible.

Can we use human raters alongside LLM judges?

Not yet as a built-in feature, but BeLLMark's results are fully exportable (HTML, JSON, CSV, PPTX, PDF) for human review. The recommended workflow: run LLM-as-judge for initial screening, then export the top candidates' responses for human evaluation. Native human evaluation workflows with integrated scoring are on our roadmap.

How do you handle rate limits, failures, and retries?

BeLLMark automatically retries failed API calls up to 3 times with progressive backoff (2s, 5s, 10s delays). It checkpoints before phase transitions (generation → judging) so partial progress is preserved. If an API call fails after all retries, the specific failure is logged and a manual retry button appears in the progress view. Other models and questions continue processing normally.
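
The retry pattern described above can be sketched as follows. Function name and the injectable `sleep` parameter are illustrative (the latter keeps the sketch testable without waiting); only the 2s/5s/10s schedule comes from the text:

```python
import time

def call_with_retries(fn, delays=(2, 5, 10), sleep=time.sleep):
    """Call fn(); on failure, wait the next delay and retry. Re-raise after
    the final attempt so the caller can log it and offer a manual retry."""
    last_exc = None
    for attempt in range(len(delays) + 1):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if attempt < len(delays):
                sleep(delays[attempt])
    raise last_exc
```

A call that fails twice and then succeeds returns normally after sleeping 2s and 5s; a call that fails all four attempts raises, leaving other models and questions unaffected.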

Do you support role-based access or multiple workspaces?

BeLLMark currently runs as a single-user application. For team use, we recommend deploying behind your existing authentication (VPN, reverse proxy with SSO, or network-level access control). Multi-user support with role-based access and team workspaces is on our roadmap. All benchmark data is stored in a single SQLite database that can be shared across the team.

How does BeLLMark compare to other evaluation tools?

vs. CLI tools (e.g., Promptfoo): BeLLMark provides a visual web interface with blind A/B/C evaluation, real-time progress, and consulting-grade export formats (PPTX, PDF). No YAML configuration required.

vs. Public leaderboards (e.g., Chatbot Arena): BeLLMark runs on your infrastructure with your own questions and criteria — no third-party benchmarking service ever sees your prompts.

vs. LLMOps platforms: BeLLMark is a focused evaluation studio, not a production monitoring tool. One-time purchase, no subscription, no usage limits.

Ready to evaluate LLMs the right way?