AI Governance & Evaluation

How BeLLMark Supports AI Evaluation Under the EU AI Act

When someone asks how you chose your model, BeLLMark is your answer. Structured, reproducible evaluation with statistical rigour — the model selection documentation that deployers need.

The EU AI Act Changes the Stakes of Model Selection

The EU AI Act (Regulation 2024/1689) introduces legally binding obligations for organisations deploying AI systems in the European Union. For deployers of high-risk AI systems, the Act requires documented evidence of model evaluation, performance monitoring, and risk management. The full application date is 2 August 2026.

What This Means for AI Teams

Organisations deploying AI in regulated contexts — credit scoring, HR screening, healthcare diagnostics, public services — will need documented, reproducible evidence that the models they selected were evaluated rigorously. For deployers of high-risk AI systems under Annex III, informal model comparisons and anecdotal testing may no longer suffice when an auditor asks: “How did you choose this model, and what evidence supports that decision?”

Where BeLLMark Supports Deployer Obligations

BeLLMark is an evaluation and documentation tool, not a compliance platform. The following table maps specific BeLLMark capabilities to the Act provisions they most closely support. The ratings reflect how BeLLMark’s pre-deployment evaluation workflows relate to each provision’s requirements. They do not assert that using BeLLMark satisfies any specific legal obligation.

Article 26(5) · Deployer Monitoring · Fit: Partial
  • What it requires: Monitor the operation of the high-risk AI system on the basis of its instructions for use (Art 26(5)). Art 26 also requires human oversight assignment, input data quality, log retention, and worker notification — all outside BeLLMark’s scope.
  • How BeLLMark helps: Pre-deployment performance baselines that support monitoring workflows: blind model comparisons across versions, ELO rating history for longitudinal tracking, and bias detection to surface evaluation artefacts.

Article 9 · Risk Management (provider obligation) · Fit: Indirect
  • What it requires: Providers must establish a risk management system that includes testing against pre-defined metrics (Art 9(6)–(8)). This is a provider obligation; deployers are not directly subject to Art 9.
  • How BeLLMark helps: BeLLMark’s evaluation methodology — weighted criteria, configurable scoring, statistical power analysis, Wilson/bootstrap confidence intervals — is analogous to Art 9 testing practices. Useful for deployers who also fine-tune models or wish to document their own evaluation rigour.

Article 15 · Accuracy & Robustness (provider obligation) · Fit: Indirect
  • What it requires: Providers must design systems for appropriate accuracy and declare the relevant metrics in the instructions for use (Art 15(1), (3)). Deployers receive this information; they are not required to measure accuracy under Art 15.
  • How BeLLMark helps: Deployers wishing to independently verify provider-stated accuracy can use BeLLMark’s confidence intervals, Cohen’s d effect sizes, Wilcoxon tests, and Holm–Bonferroni correction to assess performance against their specific use case.

Annex IV · Technical Documentation (provider obligation, Art 11) · Fit: Indirect
  • What it requires: Providers must maintain technical documentation, including validation/testing procedures and metrics (Annex IV, per Art 11). Deployers receive this documentation; they do not produce it under the Act.
  • How BeLLMark helps: BeLLMark’s JSON/PDF/HTML exports include structured test data, config snapshots, bias and calibration reports, and SHA-256 integrity hashes — analogous to Annex IV validation records. Useful for internal documentation of model selection decisions.
Note: A “Partial” fit means BeLLMark supports a prerequisite workflow for the obligation but does not address its core operational requirement. An “Indirect” fit means the provision is a provider obligation under the Act; BeLLMark’s capabilities are analogous, but deployers are not directly subject to the requirement. Neither rating indicates that BeLLMark satisfies the underlying legal obligation.
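The Wilson score interval referenced in the Article 9 row is a standard technique for putting a confidence interval on a win rate. As an illustration only (the function name below is hypothetical, not BeLLMark’s API), it can be sketched in a few lines:

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win-rate proportion (95% by default)."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin, centre + margin)

# e.g. 14 wins out of 20 head-to-head comparisons
lo, hi = wilson_interval(14, 20)
```

Unlike the naive normal approximation, the Wilson interval stays within [0, 1] and remains sensible at the small sample sizes typical of head-to-head model comparisons.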

Evaluation Capabilities That Support Governance

Blind A/B/C Evaluation

Responses are shuffled and anonymised before judging. No model benefits from its identity — evaluation is based purely on output quality. This eliminates brand bias from model selection decisions.
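As a sketch of what blind anonymisation can look like in practice (illustrative code, not BeLLMark’s implementation), responses are shuffled and given neutral labels before judging, while the label-to-model key is held back for scoring:

```python
import random

def anonymise(responses, seed=None):
    """Shuffle model responses and assign neutral labels (A, B, C, ...).

    Returns the labelled responses shown to the judge, plus a private
    label -> model key that is kept out of the judge's view.
    """
    rng = random.Random(seed)
    items = list(responses.items())
    rng.shuffle(items)
    labels = [chr(ord("A") + i) for i in range(len(items))]
    blinded = [(label, text) for label, (_, text) in zip(labels, items)]
    key = {label: model for label, (model, _) in zip(labels, items)}
    return blinded, key

blinded, key = anonymise({"gpt": "...", "claude": "...", "gemini": "..."}, seed=7)
```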

Statistical Significance Testing

Wilcoxon signed-rank tests with Holm–Bonferroni correction ensure that reported differences between models are statistically real, not noise. Both p-value and effect size (Cohen’s d) must meet thresholds.
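The correction step can be sketched as follows. This is a generic Holm–Bonferroni implementation for illustration; the p-values would come from the pairwise Wilcoxon tests, and the names here are hypothetical rather than BeLLMark’s API:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction: test p-values in ascending
    order against alpha / (m - rank); stop rejecting at the first failure."""
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    rejected, still_rejecting = {}, True
    for rank, (name, p) in enumerate(ordered):
        if still_rejecting and p <= alpha / (m - rank):
            rejected[name] = True
        else:
            still_rejecting = False
            rejected[name] = False
    return rejected

result = holm_bonferroni({"A_vs_B": 0.004, "A_vs_C": 0.03, "B_vs_C": 0.2})
```

Holm’s step-down procedure controls the family-wise error rate like plain Bonferroni but rejects strictly more hypotheses, which matters when several model pairs are compared at once.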

Bias Detection Suite

Four automated bias tests: position bias (presentation order effects), length bias (verbosity correlation), self-preference bias (same-provider inflation), and verbosity bias (judge reasoning length correlation).
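As an illustration of what a position-bias test can look like (a generic sketch, not BeLLMark’s implementation), an exact two-sided binomial test asks whether the response shown first wins more often than chance would predict:

```python
from math import comb

def position_bias_p(first_position_wins, n):
    """Two-sided exact binomial test against p = 0.5: sums the
    probability of every outcome at least as extreme as the observed
    count of first-position wins."""
    observed = comb(n, first_position_wins)
    p = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed) / 2**n
    return min(p, 1.0)

# 30 first-position wins in 40 paired judgements: strong evidence of bias
p = position_bias_p(30, 40)
```

A small p-value here flags that presentation order, not output quality, is driving the judge’s verdicts.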

Judge Calibration Analysis

Cohen’s κ (two judges) and Fleiss’ κ (three or more judges), the intraclass correlation coefficient (ICC(3,1)), and per-judge reliability scoring. Quantifies inter-rater agreement for audit-ready credibility.
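For the two-judge case, Cohen’s κ is simple to compute: observed agreement corrected for the agreement expected by chance. The sketch below is illustrative, not BeLLMark’s code:

```python
from collections import Counter

def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa: (observed - expected) / (1 - expected), where
    expected agreement comes from each judge's marginal label rates."""
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    return (observed - expected) / (1 - expected)

a = ["win", "win", "tie", "loss", "win", "tie"]
b = ["win", "tie", "tie", "loss", "win", "win"]
kappa = cohens_kappa(a, b)
```

κ = 0 means agreement no better than chance; values near 1 indicate judges that can be trusted to score consistently.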

ELO Rating System

Bayesian adaptive K-factor tracks model performance across benchmark runs. Provides a longitudinal record of how model quality changes over time — essential for continuous monitoring requirements.
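The core Elo update is easy to sketch. The adaptive K below is a simple games-count heuristic chosen for illustration, not BeLLMark’s Bayesian scheme:

```python
def expected_score(r_a, r_b):
    """Probability model A beats model B under the Elo logistic model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, score_a, games_a, games_b):
    """One Elo update; score_a is 1.0 for a win, 0.5 tie, 0.0 loss.
    K shrinks as a model accumulates games, so early ratings move fast
    and settle as evidence grows (a heuristic, not BeLLMark's method)."""
    k_a = 32 / (1 + games_a / 20)
    k_b = 32 / (1 + games_b / 20)
    e_a = expected_score(r_a, r_b)
    return r_a + k_a * (score_a - e_a), r_b + k_b * ((1 - score_a) - (1 - e_a))

# two new models start at 1500; the first wins the opening comparison
r1, r2 = update(1500, 1500, 1.0, games_a=0, games_b=0)
```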

Multi-Format Export

HTML, PDF, PPTX, JSON, and CSV exports with full run data, statistical analysis, bias reports, and integrity hashes. Exports serve as audit documentation that stakeholders can independently verify.
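The hash-and-verify pattern behind the integrity hashes can be sketched as follows; this is a generic illustration of the technique, with hypothetical function names, rather than BeLLMark’s export format:

```python
import hashlib
import json

def export_with_hash(run_data):
    """Serialise a run deterministically (sorted keys, compact separators)
    and attach a SHA-256 digest of the exact payload bytes."""
    payload = json.dumps(run_data, sort_keys=True, separators=(",", ":"))
    return {"payload": payload,
            "sha256": hashlib.sha256(payload.encode()).hexdigest()}

def verify(export):
    """Re-hash the payload and compare: any alteration changes the digest."""
    return hashlib.sha256(export["payload"].encode()).hexdigest() == export["sha256"]

export = export_with_hash({"model": "m1", "score": 0.82})
```

Because the digest is computed over a deterministic serialisation, any stakeholder can re-hash the payload independently and confirm the record was not altered after export.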

What BeLLMark Does Not Cover

Transparency builds trust. BeLLMark is built for comparative model evaluation, not for full-scope regulatory compliance. The following obligations require different tooling:

Out of Scope

  • Conformity assessment or certification — BeLLMark does not perform or substitute for the conformity assessment procedures required for high-risk AI systems under the Act.
  • Adversarial testing and red teaming — The Act’s GPAI systemic risk provisions (Article 55) require adversarial testing of models with a view to identifying and mitigating systemic risks. BeLLMark evaluates quality and capability, not safety and harm.
  • GPAI provider documentation — Article 53 requires model providers to document training data, architecture, and compute resources. BeLLMark covers behavioural evaluation output only, not model provenance.
  • Production monitoring — Article 72 requires post-market monitoring of AI systems in production. BeLLMark is a pre-deployment evaluation tool, not a production observability platform.
  • Standardised evaluation protocols — The “standardised protocols and tools” referenced in Article 55 will be developed by the EU AI Office. BeLLMark uses configurable criteria and custom question sets, not regulatory-mandated benchmarks.
  • Legal or compliance advice — BeLLMark is software. It does not provide legal counsel, regulatory interpretation, or compliance certification of any kind.

Where BeLLMark Fits in a Compliance Workflow

BeLLMark is one component in a broader AI governance workflow. It produces the model evaluation evidence — structured comparison reports, statistical analysis, bias detection, and exportable documentation — that deployers need when building their compliance case. It does not replace a conformity assessment, a risk management system, or legal advice.

Documented Model Selection: What Auditors Want to See

When a regulator or auditor reviews your AI deployment, they will ask how you chose the model powering your system. BeLLMark produces the evidence to answer that question defensibly.

Structured Evaluation, Not Compliance Certification

BeLLMark produces structured, documented, reproducible evaluation records. These support a defensible model selection narrative when reviewed by internal governance teams or in an audit context. BeLLMark does not guarantee acceptance by any regulatory authority and does not substitute for legal compliance advice. The EU AI Act is one of several contexts where documented evaluation matters — enterprise AI governance, procurement rigour, and research reproducibility all benefit from the same discipline.

Evaluate Models With Rigour

Blind A/B/C testing. Statistical significance. Bias detection. Exportable documentation. Self-hosted, private, and under your control.

Get Started

Disclaimer: BeLLMark is an evaluation and documentation tool. It does not constitute legal advice, regulatory compliance certification, or conformity assessment under EU Regulation 2024/1689 (the EU AI Act) or any other regulation. Organisations should consult qualified legal counsel for compliance guidance specific to their situation. The feature-to-provision mappings on this page describe how BeLLMark’s capabilities may support certain workflows relevant to the Act’s objectives; they do not assert that using BeLLMark satisfies any specific legal obligation. The “Partial” and “Indirect” fit ratings are BeLLMark’s own characterisation of how its features relate to each provision’s objectives, and do not constitute a representation that use of BeLLMark satisfies, or is sufficient to satisfy, any compliance obligation.