{
  "bellmark_version": "1.0",
  "type": "suite",
  "name": "Instruction Compliance Under Constraint",
  "description": "Evaluates a model's ability to execute multi-part instructions precisely. 25 questions testing structural formatting, word counts, prohibited elements, cross-references, and conditional logic.",
  "default_criteria": [
    {
      "name": "Instruction Completeness",
      "description": "All specified components of the instruction are present in the response.",
      "weight": 3.0
    },
    {
      "name": "Format Accuracy",
      "description": "Structural formatting constraints (numbering, labels, section headers, sentence count) are precisely followed.",
      "weight": 2.5
    },
    {
      "name": "Content Quality",
      "description": "Within the constraints, the content is accurate, informative, and well-expressed.",
      "weight": 2.0
    },
    {
      "name": "Constraint Handling Under Tension",
      "description": "When constraints conflict or create difficulty, the model resolves tensions gracefully.",
      "weight": 2.0
    },
    {
      "name": "Prohibited Element Avoidance",
      "description": "Explicitly prohibited elements (words, formats, expressions) are absent from the response.",
      "weight": 0.5
    }
  ],
  "questions": [
    {
      "system_prompt": "You are a precise communicator who follows instructions exactly.",
      "user_prompt": "Explain photosynthesis. Requirements: exactly 4 bullet points, each bullet must start with a verb in the present tense, the total response must be under 80 words, do not use the word 'sunlight'.",
      "expected_answer": "- Converts carbon dioxide and water into glucose using light energy from the sun.\n- Produces oxygen as a byproduct, releasing it into the atmosphere.\n- Requires chlorophyll, the green pigment in plant cells, to absorb light.\n- Stores chemical energy in glucose molecules, fueling plant growth and cellular processes.\n\n---"
    },
    {
      "system_prompt": "You are a technical writer who formats responses according to precise specifications.",
      "user_prompt": "Describe the difference between a stack and a queue in computer science. Format: one sentence definition for each, one concrete real-world analogy for each, one use case for each. Use exactly this structure: **Stack:** definition | analogy | use case. **Queue:** definition | analogy | use case.",
      "expected_answer": "**Stack:** A stack is a last-in, first-out (LIFO) data structure where elements are added and removed from the same end, called the top. | Think of a stack of plates at a buffet — you always take the top plate, and new plates are added to the top. | Stacks are used in function call management, where each function call is pushed onto the call stack and popped off when it returns.\n\n**Queue:** A queue is a first-in, first-out (FIFO) data structure where elements are added at one end (the rear) and removed from the other (the front). | Think of a line of customers at a bank — the first person to arrive is the first to be served. | Queues are used in task scheduling systems, such as a printer queue where print jobs are processed in the order they are submitted.\n\n---"
    },
    {
      "system_prompt": "You follow complex multi-part instructions without omission.",
      "user_prompt": "Write a recommendation for whether a company should adopt a four-day workweek. The response must: (1) start with a direct yes or no recommendation in the first sentence, (2) support the recommendation with exactly three pieces of evidence, (3) include one counterargument and your rebuttal to it, (4) end with a single sentence beginning with the word 'Ultimately', (5) be between 150 and 200 words total.",
      "expected_answer": "Yes, companies should adopt a four-day workweek. Research from Microsoft Japan found a 40% productivity increase after implementing a four-day schedule, demonstrating that reduced hours do not necessarily reduce output. A 2022 pilot across 61 UK companies showed that 91% of participating firms retained the policy after the trial concluded, citing sustained business performance. Additionally, studies consistently show that employee burnout decreases and retention rates improve significantly when workers gain a dedicated recovery day each week.\n\nCritics argue that client-facing industries cannot sustain compressed schedules because customer expectations require five-day availability. However, many service firms in the UK pilot addressed this by staggering team schedules, ensuring full coverage without requiring any individual to work five days. This demonstrates that operational continuity and reduced hours are fully compatible with thoughtful scheduling design.\n\nUltimately, the evidence from large-scale trials shows that a four-day workweek benefits both employees and organizations when implemented with deliberate planning.\n\n---"
    },
    {
      "system_prompt": "You are a structured analyst who segments responses according to specified labels.",
      "user_prompt": "Analyze the rise of electric vehicles using the SWOT framework. Label each section exactly as: [STRENGTH], [WEAKNESS], [OPPORTUNITY], [THREAT]. Provide exactly two points per section. Do not use bullet points — write in prose.",
      "expected_answer": "[STRENGTH] The global EV industry benefits from rapidly maturing battery technology, with energy density improving and production costs declining year over year; simultaneously, EVs produce zero direct tailpipe emissions, giving manufacturers a decisive advantage in markets with strict emissions regulations and carbon pricing schemes.\n\n[WEAKNESS] EVs remain significantly more expensive to purchase than comparable internal combustion vehicles, placing them out of reach for many consumers in developing economies; charging infrastructure is also unevenly distributed, with rural and lower-income regions lacking the dense network of fast chargers needed for practical long-distance travel.\n\n[OPPORTUNITY] Accelerating government mandates to phase out combustion engines across the EU, UK, and parts of Asia create a structural demand surge that EV manufacturers are uniquely positioned to capture; the growth of vehicle-to-grid technology also opens an entirely new revenue stream, allowing EV fleets to function as distributed energy storage assets.\n\n[THREAT] Geopolitical concentration of critical minerals — particularly lithium and cobalt — in a small number of countries exposes EV supply chains to disruption through export controls or political instability; legacy automakers and new entrants alike face intense price pressure from Chinese manufacturers who benefit from lower labor costs and vertically integrated battery supply chains.\n\n---"
    },
    {
      "system_prompt": "You are an expert at writing in constrained forms.",
      "user_prompt": "Explain why biodiversity loss is a concern to humans. Write this as exactly 5 sentences. Sentence 1 must state the problem. Sentence 2 must give a specific example. Sentence 3 must explain a mechanism. Sentence 4 must address a counterargument. Sentence 5 must begin with the word 'Therefore'.",
      "expected_answer": "Biodiversity loss threatens the ecological systems that humans depend on for food, medicine, and climate stability. The collapse of bee populations across North America and Europe illustrates this directly, as reduced pollinator diversity is already disrupting crop yields for over 75% of flowering food plants. The mechanism is ecological cascading: when one species disappears, the species that depend on it for food or habitat also decline, destabilizing entire food webs. Some argue that technology such as artificial pollination or lab-grown food can substitute for lost ecosystem services, making biodiversity loss an acceptable trade-off for economic progress. Therefore, while technological substitutes may offset isolated losses, no current technology replicates the full complexity of natural systems at the scale and cost required to replace them globally.\n\n---"
    },
    {
      "system_prompt": "You follow instructions that contain multiple interdependent constraints.",
      "user_prompt": "Compare nuclear energy and solar energy as electricity sources. Requirements: address exactly these three dimensions — safety, cost, scalability — in exactly that order. For each dimension, give one advantage of nuclear and one advantage of solar. Do not express a preference for either source anywhere in the response.",
      "expected_answer": "**Safety**\nNuclear energy's modern Generation IV reactor designs include passive cooling systems that shut down automatically without operator intervention, significantly reducing the risk of runaway reactions compared to earlier generations. Solar energy carries no radiation hazard and presents no catastrophic failure mode — panels that malfunction simply stop producing electricity.\n\n**Cost**\nNuclear power, once operational, generates electricity at a predictably low levelized cost over decades, making it economically competitive for baseload generation. Solar installation costs have fallen more than 90% since 2010, making utility-scale solar currently the cheapest new electricity source by levelized cost in most markets.\n\n**Scalability**\nNuclear plants can supply large, consistent quantities of baseload power from a small land footprint, making them highly scalable in densely populated regions with limited space. Solar can be deployed rapidly at virtually any scale — from rooftop panels to multi-gigawatt farms — and can be expanded incrementally as demand grows without long construction timelines.\n\n---"
    },
    {
      "system_prompt": "You are a precise responder who treats each constraint as equally important.",
      "user_prompt": "Write a brief bio for a fictional scientist named Dr. Amara Osei. Requirements: include her field of study, one major discovery, one personal fact, and her university affiliation. Each piece of information must be in a separate sentence. Do not use the word 'renowned' or 'brilliant'. Total length: 4 sentences exactly.",
      "expected_answer": "Amara Osei is a molecular biologist specializing in RNA repair mechanisms, affiliated with the Noguchi Memorial Institute for Medical Research at the University of Ghana. Her landmark discovery in 2029 identified a previously unknown enzyme that corrects transcription errors in cancer cells under low-oxygen conditions, a finding published in Nature. Beyond her research, she is an accomplished classical pianist who performs regularly with the Accra Symphony Orchestra. She holds an endowed chair in translational genomics and serves as director of the institute's Cancer Biology Division.\n\n---"
    },
    {
      "system_prompt": "You execute multi-step instructions in exactly the specified sequence.",
      "user_prompt": "I will give you a topic. First, write one sentence stating what is well-established about it. Second, write one sentence stating what remains contested or unknown. Third, write one sentence about its practical importance. Topic: the relationship between sleep and memory consolidation.",
      "expected_answer": "It is well established that slow-wave sleep and REM sleep both play active roles in consolidating newly acquired memories, with animal and human studies consistently showing that sleep deprivation impairs recall of recently learned information. What remains contested is whether specific sleep stages preferentially consolidate different memory types — declarative versus procedural — or whether the consolidation process depends more on sleep architecture broadly than on any single stage. The practical importance of this relationship is substantial: understanding it guides clinical recommendations for students, shift workers, and patients with memory-impairing conditions such as Alzheimer's disease.\n\n---"
    },
    {
      "system_prompt": "You are a content creator who follows editorial guidelines precisely.",
      "user_prompt": "Write a product description for a fictional ergonomic office chair called the 'VerteX Pro'. Requirements: start with a rhetorical question, include exactly three product features as embedded prose (not a list), include one testimonial-style sentence attributed to 'a user', close with a call to action. Total length: 100–120 words.",
      "expected_answer": "What if your chair actually worked as hard as you do? The VerteX Pro ergonomic office chair supports your lower back with a dynamic lumbar system that adjusts in real time as you shift positions throughout the day, while its breathable mesh back regulates temperature during long work sessions and its seven-axis armrests eliminate shoulder strain regardless of your desk height or setup. A user recently shared: \"I had chronic neck pain for years — after two weeks in the VerteX Pro, it was completely gone.\" The VerteX Pro is available now at vertexpro.com; order today and transform the way you work.\n\n---"
    },
    {
      "system_prompt": "You are an assistant who formats data according to strict structural rules.",
      "user_prompt": "List the top 5 causes of deforestation globally. Format each item as: [NUMBER]. [CAUSE] — [one sentence explanation] — [one consequence]. Number them 1–5. Do not use sub-bullets. Do not repeat any word used in a cause label within its own explanation.",
      "expected_answer": "1. Agricultural Expansion — Farmers clear forests to create arable land for crops and grazing livestock, driven by rising global food demand. — Topsoil erosion accelerates and carbon stored in trees is released into the atmosphere.\n\n2. Logging and Timber Harvesting — Commercial cutting operations remove trees for wood products, furniture, or paper pulp, often targeting old-growth stands with the highest market value. — Loss of canopy cover destroys the habitat of thousands of species that depend on mature forest structure.\n\n3. Infrastructure Development — Road and urban construction projects fragment and clear forested land to accommodate growing human populations. — Wildlife corridors are severed, isolating animal populations and reducing genetic diversity.\n\n4. Mining Activities — Extracting minerals and fossil fuels requires clearing land and using chemicals that contaminate surrounding vegetation. — Rivers running through former forest zones carry toxic runoff that kills aquatic ecosystems downstream.\n\n5. Wildfires and Climate Change — Rising temperatures combined with prolonged drought make forests increasingly susceptible to large-scale fires that destroy vast tracts of woodland. — Burned areas release stored carbon, accelerating global warming in a destructive feedback loop.\n\n---"
    },
    {
      "system_prompt": "You are a precise data communicator who formats information in the exact structure requested.",
      "user_prompt": "Present the following information as a markdown table with exactly 4 columns and exactly 4 data rows (not counting the header row). The topic is: programming languages and their primary use cases. Required columns in this exact order: Language, Primary Use Case, Year Created, One Defining Feature. Do not add any text before or after the table. Do not include a caption row.",
      "expected_answer": "| Language | Primary Use Case | Year Created | One Defining Feature |\n|---|---|---|---|\n| Python | Data science, machine learning, and general scripting | 1991 | Readable syntax with mandatory indentation enforcing code structure |\n| JavaScript | Interactive front-end web development and server-side Node.js applications | 1995 | Runs natively in every major web browser without compilation |\n| C | Systems programming, operating systems, and embedded hardware | 1972 | Manual memory management via direct pointer manipulation |\n| SQL | Querying and managing relational database data | 1974 | Declarative syntax that specifies what data to retrieve, not how to retrieve it |\n\n---"
    },
    {
      "system_prompt": "You follow conditional formatting instructions precisely and do not simplify them.",
      "user_prompt": "Describe three challenges facing remote work. Apply this formatting rule to each challenge: if the challenge is primarily a technology problem, present it as a single sentence beginning with \"Technically:\"; if it is primarily a human or social problem, present it as a single sentence beginning with \"Socially:\"; if it is primarily an organizational problem, present it as a single sentence beginning with \"Structurally:\". Each challenge must be exactly one sentence. Do not explain your categorization choices.",
      "expected_answer": "Technically: Unreliable home internet connections and inconsistent hardware create gaps in communication quality that undermine meeting productivity and collaborative tool performance.\n\nSocially: The absence of spontaneous in-person interaction erodes the informal relationship-building and incidental knowledge-sharing that sustain team cohesion over time.\n\nStructurally: Organisations that lack documented asynchronous workflows and clear output-based performance metrics struggle to manage distributed teams equitably and consistently.\n\n---"
    },
    {
      "system_prompt": "You execute multi-part structural instructions, including cross-reference requirements.",
      "user_prompt": "Write four observations about the relationship between urban density and public health. Number them 1–4. Observation 3 must explicitly reference a term or concept introduced in Observation 1. Observation 4 must explicitly reference a term or concept introduced in Observation 2. Each observation must be exactly one sentence. Do not add any text before or after the four numbered observations.",
      "expected_answer": "1. High urban density accelerates the transmission of infectious diseases because close physical proximity between residents reduces the barriers to airborne and contact-based pathogen spread.\n2. Dense urban environments also concentrate sources of air pollution — particularly traffic emissions and industrial activity — exposing large populations to particulate matter that damages respiratory function.\n3. The same transmission dynamics that accelerate infectious disease spread in high-density areas can, paradoxically, be attenuated by well-funded public health infrastructure such as rapid contact tracing and high vaccination coverage.\n4. Chronic respiratory conditions linked to concentrated particulate matter exposure are disproportionately prevalent in low-income urban neighbourhoods, where residents have fewer resources to mitigate or relocate away from pollution sources.\n\n---"
    },
    {
      "system_prompt": "You write formal analytical prose and follow structural specifications without exception.",
      "user_prompt": "Write a formal three-paragraph analysis of why inflation affects low-income households more severely than high-income households. Paragraph 1 must state the core mechanism. Paragraph 2 must give one concrete example with specific numbers or percentages. Paragraph 3 must propose one policy response. Every sentence must be in formal register — no contractions, no second-person address, no colloquial expressions. Total length: 120–150 words.",
      "expected_answer": "Inflation imposes a heavier burden on low-income households because they allocate a larger share of income to essential, price-inelastic goods — food, housing, and utilities — that cannot be deferred or substituted when prices rise. Higher-income households hold a greater proportion of expenditure in discretionary categories and therefore retain the ability to absorb price increases without sacrificing necessities.\n\nIn the United States between 2021 and 2022, food-at-home prices rose 11.4% while energy costs increased 32.9%. For households in the lowest income quintile, these two categories collectively represent approximately 40% of total expenditure, compared with roughly 10% for the highest quintile, meaning the identical price shock consumed four times the proportional budget share for the poorest households.\n\nGovernments can mitigate this asymmetric impact by indexing social transfer payments and minimum wage thresholds to the consumer price index, preserving purchasing power for the most economically vulnerable households.\n\n---"
    },
    {
      "system_prompt": "You follow professional writing conventions including abbreviation rules precisely.",
      "user_prompt": "Write a short briefing on Artificial Intelligence (AI), Machine Learning (ML), and Natural Language Processing (NLP). Rule: define each term in full on its first use with the abbreviation in parentheses; use only the abbreviation for every subsequent reference. The briefing must be exactly 4 sentences. Every term must appear at least twice in total (once as full + abbreviation, once as abbreviation only).",
      "expected_answer": "Artificial Intelligence (AI) is the broad field concerned with building systems that perform tasks normally requiring human cognition, while Machine Learning (ML) is a subset of AI in which systems learn from data rather than from explicit programming, and Natural Language Processing (NLP) is a further specialisation of ML focused on enabling machines to understand and generate human language. AI now underpins applications ranging from medical diagnosis to autonomous navigation, areas where ML algorithms process vast datasets to identify patterns that would be undetectable through manual analysis. NLP in particular has advanced dramatically with the development of transformer-based architectures, enabling models to handle translation, summarisation, and conversational tasks at near-human proficiency. Together, AI, ML, and NLP represent a converging technological infrastructure that is reshaping how organisations process information, automate decisions, and interact with customers at scale.\n\n---"
    },
    {
      "system_prompt": "You format citations and attributions exactly as instructed.",
      "user_prompt": "Make three claims about the economic impact of the internet since 1995. After each claim, include an attribution in this exact format: (Source: [Author Last Name], [Year]). Use three different sources and three different years. The three claims must be in a numbered list (1., 2., 3.). Do not include a bibliography or reference section.",
      "expected_answer": "1. The internet has contributed an estimated $2.2 trillion annually to the U.S. economy alone, with digital industries accounting for roughly 10% of GDP as of the early 2020s. (Source: Brynjolfsson, 2014)\n\n2. Global e-commerce revenues grew from under $1 billion in 1995 to over $5.8 trillion by 2023, representing one of the fastest expansions of any retail channel in recorded economic history. (Source: UNCTAD, 2020)\n\n3. The diffusion of broadband internet access is associated with a 1.3 percentage-point increase in GDP growth per 10% increase in broadband penetration across developing economies. (Source: Qiang, 2009)\n\n---"
    },
    {
      "system_prompt": "You produce structured documents with nested headings exactly as specified.",
      "user_prompt": "Write a brief structured report on the benefits of regular exercise. Use exactly this heading structure: one Level 1 heading titled \"Executive Summary\", one Level 2 heading titled \"Physical Benefits\", one Level 2 heading titled \"Mental Benefits\", one Level 2 heading titled \"Recommendation\". Under each Level 2 heading write exactly two sentences. The Executive Summary section must be exactly one sentence. Do not add any other headings.",
      "expected_answer": "# Executive Summary\n\nRegular exercise delivers measurable improvements to physical health, mental well-being, and longevity that are supported by extensive clinical evidence across diverse populations.\n\n## Physical Benefits\n\nConsistent aerobic and strength training reduces the risk of cardiovascular disease, type 2 diabetes, and several cancers by improving metabolic function, lowering blood pressure, and maintaining healthy body composition. Resistance exercise in particular preserves muscle mass and bone density as individuals age, reducing the risk of falls and fractures that represent a leading cause of disability in adults over 65.\n\n## Mental Benefits\n\nRegular physical activity reduces symptoms of depression and anxiety through neurochemical mechanisms including the release of endorphins, serotonin, and brain-derived neurotrophic factor, which support mood regulation and neuroplasticity. Exercise also improves cognitive performance — including memory, attention, and executive function — with studies showing that aerobically fit individuals demonstrate superior prefrontal cortex activity compared to sedentary peers.\n\n## Recommendation\n\nIndividuals should aim for at least 150 minutes of moderate-intensity aerobic activity per week, combined with two or more sessions of muscle-strengthening exercise, in line with World Health Organisation guidelines. Organisations can support this by offering flexible scheduling, on-site fitness facilities, or subsidised gym memberships as part of a broader employee wellness strategy.\n\n---"
    },
    {
      "system_prompt": "You are an expert at constrained writing forms and execute character-level instructions precisely.",
      "user_prompt": "Write a five-sentence explanation of how vaccines work. The first word of sentence 1 must begin with the letter W. The first word of sentence 2 must begin with the letter H. The first word of sentence 3 must begin with the letter T. The first word of sentence 4 must begin with the letter A. The first word of sentence 5 must begin with the letter S. Total word count must be between 60 and 90 words.",
      "expected_answer": "When a pathogen enters the body, the immune system identifies its surface proteins, called antigens, and prepares a defence. However, a natural immune response takes days or weeks to build, leaving time for illness to develop. The goal of a vaccine is to trigger this same response safely using inactivated or fragmented pathogen material. After vaccination, the immune system retains memory cells primed to recognise the pathogen on any future encounter. Subsequent real-world exposure is therefore met with a rapid, pre-formed response that prevents illness from taking hold.\n\n---"
    },
    {
      "system_prompt": "You follow structural exclusion and inclusion rules precisely across different sections of a response.",
      "user_prompt": "Write a two-section response about social media's impact on teenagers. Section 1 is titled \"Benefits\" and must describe exactly two benefits — but must not mention mental health in this section. Section 2 is titled \"Concerns\" and must describe exactly two concerns — and must include mental health as one of them. Each benefit and each concern must be one sentence. Use the exact section titles as bold headings.",
      "expected_answer": "**Benefits**\n\nSocial media platforms enable teenagers to maintain and develop friendships across geographic distances, strengthening social bonds that would otherwise weaken after moves or school transitions. These platforms also provide access to peer communities centred on shared interests — from coding and art to language learning — that may not exist within a teenager's immediate physical environment.\n\n**Concerns**\n\nResearch consistently links heavy social media use among teenagers to elevated rates of anxiety, depression, and poor self-esteem, particularly among adolescent girls exposed to appearance-focused content, making mental health deterioration the most extensively documented risk. Algorithmic feed designs that prioritise engagement over user wellbeing expose teenagers to misinformation and extremist content at a developmental stage when critical evaluation skills are still maturing.\n\n---"
    },
    {
      "system_prompt": "You produce multi-format outputs in the exact sequence and structure specified.",
      "user_prompt": "Explain the concept of compound interest. First, provide a one-sentence summary. Then, provide a detailed breakdown with exactly three labeled components in this exact format: [WHAT IT IS]: one sentence. [HOW IT WORKS]: two sentences. [WHY IT MATTERS]: one sentence. Do not add any text outside these two sections.",
      "expected_answer": "Compound interest is the process by which interest is calculated not only on the original principal but also on all previously accumulated interest, causing a deposit or debt to grow at an accelerating rate over time.\n\n[WHAT IT IS]: A financial mechanism in which each interest payment is added to the principal before the next interest calculation, so that the base on which interest is earned grows with every compounding period.\n\n[HOW IT WORKS]: At the end of each compounding period — which may be daily, monthly, or annually — the earned interest is credited to the account balance, increasing the principal for the subsequent period. Over time, this creates an exponential growth curve rather than the linear growth produced by simple interest, where interest is calculated solely on the original principal.\n\n[WHY IT MATTERS]: Even modest differences in interest rate or compounding frequency produce dramatically different outcomes over multi-decade investment or debt horizons, making compound interest the central mechanism behind both long-term wealth accumulation and the debt trap dynamics of high-interest consumer credit.\n\n---"
    },
    {
      "system_prompt": "You follow enumeration instructions including ordering rules precisely.",
      "user_prompt": "List five soft skills valuable in the workplace. Order them from most universally applicable to most role-specific. Number them 1–5. For each, provide the skill name followed by a colon and a single sentence explanation. Do not use bullet points. Do not explain your ordering rationale.",
      "expected_answer": "1. Communication: The ability to convey information clearly and adapt one's message to different audiences is fundamental to every professional role regardless of function or seniority.\n2. Adaptability: Professionals who respond constructively to changing priorities, unexpected obstacles, and evolving work environments perform effectively across virtually any industry or role type.\n3. Collaboration: Working productively with colleagues, managing interpersonal differences, and contributing to shared objectives is essential in nearly all team-based work settings, though the form collaboration takes varies by context.\n4. Critical Thinking: Analysing information, identifying assumptions, and evaluating options rigorously is especially valued in roles involving strategy, research, analysis, or complex decision-making.\n5. Technical Presentation: The ability to translate domain-specific knowledge into compelling, structured presentations is highly valued in specialist roles such as data science, engineering, and research, where communicating complex findings to non-expert stakeholders is a distinct and critical task.\n\n---"
    },
    {
      "system_prompt": "You apply mathematical and numerical formatting conventions exactly as instructed.",
      "user_prompt": "Describe three key statistics about global internet usage. For each statistic: write the number using digits (not words), include the unit of measurement immediately after the number (e.g., \"5.4 billion users\"), and write a one-sentence interpretation of what the number means. Format each as: Statistic [N]: [number + unit] — [interpretation sentence]. Number the statistics 1–3.",
      "expected_answer": "Statistic 1: 5.4 billion users — More than two-thirds of the world's population now has internet access, representing a near-tripling of the connected user base since 2010.\n\nStatistic 2: 67% penetration rate — Internet access is no longer a minority phenomenon globally, though this figure also implies that approximately 2.6 billion people remain offline, concentrated predominantly in sub-Saharan Africa and South Asia.\n\nStatistic 3: 4.9 hours per day — The average global internet user spends nearly five hours daily on connected devices, with mobile devices now accounting for more than 55% of that time.\n\n---"
    },
    {
      "system_prompt": "You produce formatted documents that include all specified metadata fields.",
      "user_prompt": "Write a short internal memo on the topic of transitioning to a four-day workweek. The memo must begin with exactly these four metadata lines in this order: TO: [recipient], FROM: [sender], DATE: [date], SUBJECT: [subject]. After the metadata block, write exactly three body sentences. Do not add a salutation, closing, or signature block. Total body word count (excluding metadata): 40–60 words.",
      "expected_answer": "TO: All Department Heads\nFROM: Human Resources Office\nDATE: 2026-02-24\nSUBJECT: Proposal to Pilot a Four-Day Workweek\n\nThe executive team is evaluating a pilot programme in which selected teams will work four days per week for a three-month trial starting Q3 2026. Participation will be voluntary, and client service levels will be maintained through staggered team scheduling. Findings will inform a company-wide policy recommendation to be presented to the board before year-end.\n\n---"
    },
    {
      "system_prompt": "You produce structured comparative analyses in exact paired formats.",
      "user_prompt": "Evaluate three technologies: blockchain, augmented reality, and quantum computing. For each, write exactly two sentences: the first sentence must state one genuine strength, the second sentence must state one genuine limitation. Use the technology name as a bold label before each pair. Order: blockchain first, augmented reality second, quantum computing third.",
      "expected_answer": "**Blockchain**\nBlockchain's distributed ledger architecture eliminates the need for a trusted central intermediary, enabling verifiable, tamper-resistant record-keeping for transactions, contracts, and supply chain events across parties who do not inherently trust one another. However, proof-of-work blockchain networks such as Bitcoin consume extremely large amounts of electrical energy per transaction relative to centralised alternatives.\n\n**Augmented Reality**\nAugmented reality overlays contextual digital information onto the physical environment in real time, enabling applications in surgery, industrial maintenance, and training where spatially accurate guidance can reduce errors and accelerate skill acquisition. The primary limitation remains hardware dependency: current AR headsets are expensive, physically cumbersome, and require frequent charging, restricting deployment to controlled professional environments rather than broad consumer adoption.\n\n**Quantum Computing**\nQuantum computers exploit superposition and entanglement to represent and manipulate many computational states at once, offering potentially exponential speed advantages over classical computers for specific problem classes such as integer factorisation and molecular simulation. In practice, current quantum hardware is highly susceptible to decoherence, the loss of quantum state caused by environmental interference, which limits operational qubit counts, requires extreme cooling infrastructure, and confines viable use to laboratory settings rather than production workloads.\n\n---"
    },
    {
      "system_prompt": "You can write the same content in two explicitly different registers and label each clearly.",
      "user_prompt": "Explain what a firewall does — first in technical register for a security engineer, then in plain-language register for a non-technical executive. Label the first explanation \"Technical:\" and the second \"Non-Technical:\". Each explanation must be exactly two sentences. The two explanations must not share any identical phrases of three or more consecutive words.",
      "expected_answer": "Technical: A firewall is a network security control that enforces a stateful or stateless ruleset to permit or deny traffic based on source IP address, destination port, and transport-layer protocol, operating at the network perimeter or as an internal segmentation point within a zero-trust architecture. It inspects each packet — or, in the case of a next-generation firewall, the full application-layer payload — against configured access control lists and drops or logs any connection that fails to match an authorised policy.\n\nNon-Technical: A firewall acts as a digital security guard positioned between your organisation's internal systems and the outside world, deciding in real time which data is allowed to enter or leave. Only connections that match a pre-approved set of rules are permitted through; everything else is blocked automatically before it can reach your computers, servers, or internal applications.\n\n---"
    }
  ]
}
