
All Major AI Models Score Zero on Meta's Hellish Programming Benchmark

by needhelp
Tags: meta, programming, benchmark, ai-evaluation, software-engineering

On May 7, 2026, Meta AI Research dropped a bombshell on the machine learning community. Their newly released ProgramBench benchmark — a dataset designed to test genuine software engineering capability rather than toy programming puzzles — produced a result so stark it is already reshaping the conversation about AI and the future of coding: every major AI model scored zero.

Not a low score. Not a disappointing score. Absolute zero on the benchmark’s most meaningful category: architecture-level module reconstruction.

[Figure: ProgramBench Results]

What Is ProgramBench?

ProgramBench is not another LeetCode clone. Meta’s researchers deliberately designed it to measure what they call “Engineering Intelligence” — the ability to understand, refactor, and reconstruct software at the level of entire modules, not individual functions. The benchmark consists of three tiers:

  • Tier 1 — Function Completion (FC): Given a function signature and docstring, complete the body. This mirrors the kind of autocomplete tasks Copilot and ChatGPT handle daily.
  • Tier 2 — Module Reconstruction (MR): Given a partially redacted multi-file codebase (with module structure, imports, and interfaces intact), reconstruct the missing implementations. This requires understanding architectural patterns, dependency graphs, and cross-cutting concerns.
  • Tier 3 — System Design Planning (SDP): Given a high-level specification, produce a coherent module decomposition, interface definition, and dependency plan. This is architecture work.
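To make the tiers concrete, here is a minimal sketch of what a Tier 1 (Function Completion) task might look like. The function name, docstring, and body are invented for illustration; the actual ProgramBench tasks are not public in this form. The model would see only the signature and docstring and be asked to produce the body:

```python
# Hypothetical Tier 1 task: given this signature and docstring,
# complete the body. (Task content invented for illustration.)
def moving_average(values: list[float], window: int) -> list[float]:
    """Return the moving average of `values` over a sliding window.

    Raises ValueError if `window` is not positive.
    """
    # A correct completion, the kind of thing models handle well:
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

Tasks at this granularity are self-contained: everything the model needs fits in a few lines of local context, which is exactly why Tier 1 scores are high.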

Models fared passably on Tier 1. Claude Opus 4.7 achieved 78% on function completion. GPT-5.5 reached 74%. Even open-source models like DeepSeek-V3 managed respectable scores in the 60–70% range.

Tier 3 saw a sharp decline. GPT-5.5 scored 23% on system design planning. Claude Opus 4.7 managed 31%. But these numbers, while poor, were not the headline.

Tier 2 — Module Reconstruction — is where every single model scored zero.

The Zero Heard Around the World

Here is the raw truth: when presented with a partially redacted multi-file codebase and asked to fill in the missing components, no model — from GPT-5.5 to Claude Opus 4.7 to Gemini 2.5 Pro to DeepSeek-V3 — could produce a single correct answer across the entire benchmark suite.

Benchmark Tier            GPT-5.5   Claude Opus 4.7   Gemini 2.5 Pro   DeepSeek-V3   Llama 4
Function Completion         74%          78%               71%             67%          62%
Module Reconstruction        0%           0%                0%              0%           0%
System Design Planning      23%          31%               19%             14%           9%

Source: Meta AI Research, ProgramBench Technical Report (May 2026)

The module reconstruction tasks were not obscure. They involved real-world patterns: a rate-limited API client with retry logic and circuit breaking, a caching layer with multi-level invalidation, and an event-sourced domain model with compensating transactions. These are exactly the kinds of components mid-level software engineers design and implement every day.
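For a sense of scale, a stripped-down version of the first of those components, a client combining retry with a failure-count circuit breaker, might look like the sketch below. All names, thresholds, and the backoff policy are invented here; this is the shape of the component, not ProgramBench's actual task:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit breaker rejects a call outright."""

class RetryingClient:
    """Toy sketch of a retry-plus-circuit-breaker client.

    Parameters and policy (failure count threshold, exponential
    backoff) are illustrative assumptions, not ProgramBench's spec.
    """

    def __init__(self, send, max_retries=3, failure_threshold=5, reset_after=30.0):
        self._send = send                      # underlying transport callable
        self._max_retries = max_retries
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._failures = 0
        self._opened_at = None                 # time the circuit tripped

    def call(self, request):
        # Circuit breaker: fail fast while the circuit is open.
        if self._opened_at is not None:
            if time.monotonic() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            self._opened_at = None             # half-open: allow one probe
            self._failures = 0

        last_exc = None
        for attempt in range(self._max_retries):
            try:
                result = self._send(request)
                self._failures = 0             # success resets the breaker
                return result
            except CircuitOpenError:
                raise
            except Exception as exc:
                last_exc = exc
                self._failures += 1
                if self._failures >= self._failure_threshold:
                    self._opened_at = time.monotonic()
                    raise CircuitOpenError("failure threshold reached") from exc
                time.sleep(0.01 * (2 ** attempt))   # exponential backoff
        raise last_exc

# Usage: a transport that fails twice, then succeeds.
calls = {"n": 0}
def flaky_send(request):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

client = RetryingClient(flaky_send)
result = client.call("GET /users")
```

Even this toy version carries cross-cutting state (failure counts, breaker timing) that must stay consistent with the retry loop around it, the kind of coupling the benchmark's redacted codebases presumably exercise at much larger scale.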

Why Do Models Fail So Completely?

The failure mode is instructive. Models did not produce syntax errors or obviously broken code. They produced plausible-looking code that was architecturally wrong — code that compiled, ran, and appeared correct at first glance, but violated fundamental design invariants, introduced hidden coupling between decoupled components, and ignored cross-cutting concerns like error propagation, transaction boundaries, and consistency guarantees.
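A tiny invented example makes this failure mode tangible. Suppose a (hypothetical) codebase layers a process-local cache over a shared store, with the invariant that every write invalidates the cache. Code like the following runs, compiles, and returns plausible values while silently breaking that invariant:

```python
shared_store = {}     # stands in for a store shared across workers
local_cache = {}      # per-process cache layered on top

def get_user(uid):
    # Read path: populate the cache on miss, then serve from it.
    if uid not in local_cache:
        local_cache[uid] = shared_store.get(uid)
    return local_cache[uid]

def update_user(uid, data):
    # Architecturally wrong: writes through to the store but never
    # invalidates local_cache, so readers keep serving stale data.
    shared_store[uid] = data

shared_store[1] = "alice"
first_read = get_user(1)          # "alice"; cache now holds this value
update_user(1, "alicia")
second_read = get_user(1)         # still "alice": stale read, invariant broken
```

Nothing here would trip a linter or a type checker; only knowledge of the system-wide invariant reveals the bug, which is precisely the kind of knowledge the benchmark appears to test.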

This reveals a deep truth about how current LLMs work. They are pattern matchers trained on local context windows — brilliant at completing the next few lines of a function, but fundamentally incapable of reasoning about how those lines fit into a system of interconnected components. A codebase is not a sequence of tokens. It is a graph of dependencies, constraints, and invariants. Current architectures do not model that graph.
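The "graph, not token sequence" point can be made concrete with a minimal (invented) dependency graph and a helper answering the question "what might break if this module changes?", a query that is trivial over an explicit graph but is never represented explicitly in a flat token window:

```python
# Hypothetical module dependency graph: module -> modules it imports.
deps = {
    "api": ["auth", "cache"],
    "auth": ["db"],
    "cache": ["db"],
    "db": [],
}

def impacted_by(changed):
    """Return every module that transitively depends on `changed`."""
    # Invert the graph: module -> modules that import it.
    reverse = {m: [] for m in deps}
    for mod, imports in deps.items():
        for imp in imports:
            reverse[imp].append(mod)
    # Walk the reversed edges to collect all transitive dependents.
    seen, stack = set(), [changed]
    while stack:
        mod = stack.pop()
        for dependent in reverse[mod]:
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return seen
```

Here, changing "db" impacts "auth", "cache", and "api", while changing "api" impacts nothing. Module reconstruction requires reasoning over exactly this kind of structure, plus the invariants that ride along each edge.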

Meta’s researchers coined a useful distinction: models have syntactic intelligence (the ability to produce well-formed code) but lack architectural intelligence (the ability to produce a well-formed system). The gap between the two is vast.

Engineering Intelligence: The Next Frontier

The term “Engineering Intelligence” is gaining traction as the successor to “AGI” in practical discourse. It is not about whether a model can write a recursive Fibonacci function or solve a dynamic programming puzzle — every major model cleared that bar years ago. Engineering Intelligence is about whether a model can:

  • Understand why a particular abstraction exists in a codebase
  • Recognize when a change in one module will break invariants in another
  • Design systems that are maintainable, testable, and resilient under real-world constraints
  • Make trade-off decisions between performance, clarity, and correctness

ProgramBench suggests that none of today’s models have even a rudimentary form of Engineering Intelligence. They are tools for acceleration — writing boilerplate, generating test cases, explaining code — but they cannot reason about software as a system.

What This Means for Software Engineers

For the millions of developers watching the AI revolution with a mix of excitement and anxiety, ProgramBench offers a clarifying data point. AI is not coming for your job — not the part that involves thinking about architecture, making design trade-offs, and ensuring that systems are correct under all conditions. What AI is doing is compressing the bottom end of the skill distribution: the tasks that once required junior developers to type out hundreds of lines of boilerplate are now handled in seconds.

The job of a software engineer is evolving toward what it has arguably always been at its core: designing systems, not typing code. The typing was never the hard part. ProgramBench just proved it in the most rigorous way possible.

The race is now on to build the first model that can score above zero on Module Reconstruction. Whoever cracks that problem will not just have built a better autocomplete engine — they will have built a machine that can genuinely engineer software.
