When someone asks how you chose your model, BeLLMark is your answer. Structured, reproducible evaluation with statistical rigour — the model selection documentation that deployers need.
The EU AI Act (Regulation 2024/1689) introduces legally binding obligations for organisations deploying AI systems in the European Union. For deployers of high-risk AI systems, the Act requires documented evidence of model evaluation, performance monitoring, and risk management. The full application date is 2 August 2026.
Organisations deploying AI in regulated contexts — credit scoring, HR screening, healthcare diagnostics, public services — will need documented, reproducible evidence that the models they selected were evaluated rigorously. For deployers of high-risk AI systems under Annex III, informal model comparisons and anecdotal testing may no longer suffice when an auditor asks: “How did you choose this model, and what evidence supports that decision?”
BeLLMark is an evaluation and documentation tool, not a compliance platform. The following table maps specific BeLLMark capabilities to the Act provisions they most closely support. The ratings reflect how BeLLMark’s pre-deployment evaluation workflows relate to each provision’s requirements. They do not assert that using BeLLMark satisfies any specific legal obligation.
| Act Provision | What It Requires | How BeLLMark Helps | Fit |
|---|---|---|---|
| Article 26(5) Deployer Monitoring | Monitor the operation of the high-risk AI system on the basis of the instructions for use (Art 26(5)). Art 26 also requires human oversight assignment, input data quality, log retention, and worker notification, which are outside BeLLMark’s scope. | Pre-deployment performance baselines that support monitoring workflows: blind model comparisons across versions, ELO rating history for longitudinal tracking, bias detection to surface evaluation artifacts | Partial |
| Article 9 Risk Management (provider obligation) | Providers must establish a risk management system including testing against pre-defined metrics (Art 9(6)–(8)). This is a provider obligation; deployers are not directly subject to Art 9. | BeLLMark’s evaluation methodology — weighted criteria, configurable scoring, statistical power analysis, Wilson/bootstrap CIs — is analogous to Art 9 testing practices. Useful for deployers who also fine-tune models or wish to document their own evaluation rigour. | Indirect |
| Article 15 Accuracy & Robustness (provider obligation) | Providers must design systems for appropriate accuracy and declare relevant metrics in the instructions for use (Art 15(1), (3)). Deployers receive this information; they are not required to measure accuracy under Art 15. | Deployers wishing to independently verify provider-stated accuracy can use BeLLMark’s confidence intervals, Cohen’s d effect sizes, Wilcoxon tests, and Holm–Bonferroni correction to assess performance against their specific use case. | Indirect |
| Annex IV Technical Documentation (provider obligation, Art 11) | Providers must maintain technical documentation including validation/testing procedures and metrics (Annex IV, per Art 11). Deployers receive this documentation; they do not produce it under the Act. | BeLLMark’s JSON/PDF/HTML exports include structured test data, config snapshots, bias & calibration reports, and SHA-256 integrity hashes — analogous to Annex IV validation records. Useful for internal documentation of model selection decisions. | Indirect |
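To give a concrete sense of the interval arithmetic behind a phrase like “Wilson/bootstrap CIs”, here is a minimal, self-contained sketch of a Wilson score interval for a model’s pairwise win rate. The function name and the 62/100 example counts are illustrative, not part of BeLLMark’s API:

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion, e.g. a model's win rate.

    Unlike the naive normal interval, it stays inside [0, 1] and behaves
    sensibly at small n or extreme proportions.
    """
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is maximally wide
    p = wins / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, centre - half), min(1.0, centre + half))

lo, hi = wilson_interval(wins=62, n=100)
print(f"95% CI for a 62/100 win rate: [{lo:.3f}, {hi:.3f}]")
```

A win rate of 62/100 sounds decisive, but the interval makes plain how much of that edge could be sampling noise at that run size.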
Responses are shuffled and anonymised before judging, so no model benefits from name recognition: verdicts rest purely on output quality, keeping brand bias out of model selection decisions.
Wilcoxon signed-rank tests with Holm–Bonferroni correction guard against reporting noise as signal: a difference between models is flagged as significant only when both the corrected p-value and the effect size (Cohen’s d) meet their thresholds.
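The pipeline above can be sketched with SciPy. The synthetic judge scores, thresholds, and helper names below are illustrative assumptions, not BeLLMark’s internals; only the statistical techniques named in the text are taken as given:

```python
import numpy as np
from scipy.stats import wilcoxon

def holm_bonferroni(pvals, alpha=0.05):
    """Holm step-down correction: walk p-values smallest-first,
    rejecting H0 while p <= alpha / (m - rank)."""
    m = len(pvals)
    reject = [False] * m
    for rank, idx in enumerate(np.argsort(pvals)):
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

def cohens_d_paired(a, b):
    """Cohen's d for paired samples: mean difference over SD of differences."""
    diff = np.asarray(a) - np.asarray(b)
    return float(diff.mean() / diff.std(ddof=1))

# Synthetic per-prompt judge scores for two models (illustration only).
rng = np.random.default_rng(0)
scores_a = rng.normal(7.0, 1.0, size=30)
scores_b = scores_a - rng.normal(0.5, 0.5, size=30)  # model B slightly worse

stat, p = wilcoxon(scores_a, scores_b)      # paired, non-parametric test
d = cohens_d_paired(scores_a, scores_b)
corrected = holm_bonferroni([p])            # correct across all comparisons made
```

Requiring both a corrected p-value and a minimum effect size is the key design choice: with enough prompts, a trivially small difference becomes “significant”, so the effect-size gate keeps the report focused on differences that matter in practice.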
Four automated bias tests: position bias (presentation order effects), length bias (verbosity correlation), self-preference bias (same-provider inflation), and verbosity bias (judge reasoning length correlation).
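As one example of these checks, a position-bias test can be reduced to asking whether the first-shown response wins more often than chance. This sketch assumes pairwise (A/B) verdicts and a simple normal approximation; it is an illustration of the idea, not BeLLMark’s test:

```python
from statistics import NormalDist

def position_bias(winner_positions: list[int]) -> tuple[float, float]:
    """Test whether the first-shown response wins more often than chance.

    `winner_positions` holds 0 (first slot won) or 1 (second slot won)
    per pairwise verdict. Returns (first-slot win rate, two-sided p-value
    from a normal approximation to Binomial(n, 0.5)).
    """
    n = len(winner_positions)
    wins_first = winner_positions.count(0)
    rate = wins_first / n
    z = (wins_first - n / 2) / (0.25 * n) ** 0.5  # standardised under H0
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate, p

rate, p = position_bias([0] * 70 + [1] * 30)  # first slot won 70 of 100
```

A low p-value here does not mean a model is better; it means the judge is being swayed by presentation order, which is exactly the artifact the shuffling step is meant to expose.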
Cohen’s κ (2 judges) and Fleiss’ κ (3+ judges), Intraclass Correlation Coefficient (ICC(3,1)), and per-judge reliability scoring. Quantifies inter-rater agreement for audit-ready credibility.
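The two-judge case is small enough to show in full. Here is a textbook Cohen’s κ on categorical verdicts; the verdict labels are illustrative, and this is a sketch of the statistic rather than BeLLMark’s code:

```python
from collections import Counter

def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """Cohen's kappa: agreement between two judges, corrected for chance.

    1.0 = perfect agreement, 0.0 = agreement no better than chance.
    """
    n = len(ratings_a)
    p_obs = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: probability both judges pick a category independently.
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

kappa = cohens_kappa(["win", "loss", "win", "win", "loss"],
                     ["win", "loss", "win", "loss", "loss"])
```

The chance-correction is what makes κ audit-worthy: two judges who both say “win” 90% of the time agree often by accident, and raw percent agreement would overstate their reliability.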
ELO ratings with a Bayesian adaptive K-factor track model performance across benchmark runs. This provides a longitudinal record of how model quality changes over time — essential for continuous monitoring requirements.
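The core ELO update is compact. Below is a standard update with a K-factor that decays as evidence accumulates, used as a simplified stand-in for an adaptive/Bayesian K (BeLLMark’s exact scheme is not reproduced here, and all names and constants are illustrative):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, games_a: int,
               k_max: float = 32.0, k_min: float = 8.0) -> tuple[float, float]:
    """One Elo update; score_a is 1.0 (A wins), 0.5 (tie), or 0.0 (A loses).

    K shrinks as A accumulates games, so early runs move the rating a lot
    and later runs refine it, a crude proxy for posterior uncertainty.
    """
    k = max(k_min, k_max / (1 + games_a / 10))
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0, games_a=0)
```

Because each run only nudges the rating, the rating history doubles as a time series: a model that quietly degrades after a provider-side update shows up as a downward drift rather than a single surprising benchmark.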
HTML, PDF, PPTX, JSON, and CSV exports with full run data, statistical analysis, bias reports, and integrity hashes. Exports serve as audit documentation that stakeholders can independently verify.
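The integrity-hash mechanism is standard and worth seeing end to end. This sketch shows the general pattern of hashing a canonical JSON serialisation with SHA-256; the field names and functions are illustrative, not BeLLMark’s export schema:

```python
import hashlib
import json

def export_with_hash(run_data: dict) -> dict:
    """Attach a SHA-256 digest over a canonical JSON serialisation,
    so anyone can re-hash the payload and confirm it was not altered."""
    payload = json.dumps(run_data, sort_keys=True, separators=(",", ":"))
    return {"payload": run_data,
            "sha256": hashlib.sha256(payload.encode()).hexdigest()}

def verify(export: dict) -> bool:
    """Recompute the digest from the payload and compare."""
    payload = json.dumps(export["payload"], sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest() == export["sha256"]

export = export_with_hash({"model": "model-x", "win_rate": 0.62, "n": 100})
assert verify(export)  # any edit to the payload breaks verification
```

Canonical serialisation (sorted keys, fixed separators) matters: without it, two semantically identical exports could hash differently and verification would produce false alarms.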
Transparency builds trust. BeLLMark is built for comparative model evaluation, not for full-scope regulatory compliance; obligations such as conformity assessment, risk management systems, and post-market monitoring require different tooling.
BeLLMark is one component in a broader AI governance workflow. It produces the model evaluation evidence — structured comparison reports, statistical analysis, bias detection, and exportable documentation — that deployers need when building their compliance case. It does not replace a conformity assessment, a risk management system, or legal advice.
When a regulator or auditor reviews your AI deployment, they will ask how you chose the model powering your system. BeLLMark produces the evidence to answer that question defensibly.
BeLLMark produces structured, documented, reproducible evaluation records. These support a defensible model selection narrative when reviewed by internal governance teams or in an audit context. BeLLMark does not guarantee acceptance by any regulatory authority and does not substitute for legal compliance advice. The EU AI Act is one of several contexts where documented evaluation matters — enterprise AI governance, procurement rigour, and research reproducibility all benefit from the same discipline.
Blind A/B/C testing. Statistical significance. Bias detection. Exportable documentation. Self-hosted, private, and under your control.
Get Started

Disclaimer: BeLLMark is an evaluation and documentation tool. It does not constitute legal advice, regulatory compliance certification, or conformity assessment under EU Regulation 2024/1689 (the EU AI Act) or any other regulation. Organisations should consult qualified legal counsel for compliance guidance specific to their situation. The feature-to-provision mappings on this page describe how BeLLMark’s capabilities may support certain workflows relevant to the Act’s objectives. The “Partial” and “Indirect” fit ratings are BeLLMark’s own characterisation of how its features relate to each provision’s objectives; they do not constitute a representation that use of BeLLMark satisfies, or is sufficient to satisfy, any compliance obligation.