The Illusion of Correctness in LLM Generated Code

For some time now, the tech industry has been circulating the narrative that large language models will revolutionize software engineering, offering a 100x productivity boost. The reality looks quite different. LLMs can generate a sloppy web app, write some boilerplate or refactor a service. But when the task requires working on genuinely difficult, architecturally challenging code, the model fails.

The core issue doesn’t lie in a lack of compute, but in the very incentive architecture we use to train these models. From a mechanism design perspective, current LLMs are optimized for the illusion of correctness, rather than actual correctness. This isn’t a temporary limitation waiting to be patched with more compute - it’s a structural incentive problem that persists across the training pipeline, though its severity varies at each stage.

Pre-training: Maximum Likelihood Estimation and Syntax Illusion

During the pre-training phase, the model is acting as a probability distribution estimator over sequences of tokens.

Given a dataset $\mathcal{D}$ of token sequences $x = (x_1, x_2, \dots, x_T)$ , the model $\pi_\theta$ minimizes the negative log-likelihood (Cross-Entropy Loss) of predicting the next token $x_t$ given all preceding tokens in the context $x_{<t}$ :

\mathcal{L}(\theta) = - \mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{t=1}^{T} \log \pi_\theta(x_t \mid x_{<t}) \right]

The gradient update for a single parameter $\theta_j$ is:

\frac{\partial \mathcal{L}}{\partial \theta_j} = - \mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{t=1}^{T} \frac{1}{\pi_\theta(x_t \mid x_{<t})} \cdot \frac{\partial \pi_\theta(x_t \mid x_{<t})}{\partial \theta_j} \right]

Notice what is being optimized: the statistical probability of reproducing the syntax found in $\mathcal{D}$ . There is no explicit notion of logic, truth, or correctness in this objective - the model may implicitly learn some logical regularities from data, but it is never directly optimized for them. The model that assigns the highest probability to the token return after seeing if (x > 0) is rewarded equally whether the logic is correct or nonsensical - as long as it matches the distribution.

This has a concrete implication for code generation: the model learns that for loops follow int i = 0, that malloc is often followed by a NULL check, and that functions tend to return at the end. It learns syntax patterns, not program semantics.

A natural objection at this point is that pre-training is only the foundation - surely the subsequent stages of the pipeline correct these blindspots? After all, that’s precisely what SFT and RL are designed to do. The answer is nuanced: they help, but less than you’d expect, and for instructive reasons.

Post-training: SFT and RL Shift the Problem, Not Solve It

We’ve established that pre-training produces a model optimized for statistical reproduction of its training data. The natural question is whether post-training - SFT and RL - can break through this ceiling. This reduces to a concrete question: can the model generate code that is better than what exists in its training set? If not, the quality ceiling is set by human programmers, and LLMs become sophisticated interpolators rather than genuine problem-solvers.

SFT (Supervised Fine-Tuning) takes a curated subset of high-quality prompt-response pairs and readjusts the model’s parameters $\theta$ to minimize the loss on this new dataset $\mathcal{D}_{SFT}$ . We are shifting the policy from a highly entropic state to a concentrated one.

SFT can introduce new knowledge and reasoning patterns into the model - patterns that exist in $\mathcal{D}_{SFT}$ but that the base model wouldn’t reliably produce. Recent work¹ has shown that SFT is actually more effective than RL at enabling progress on problems beyond the model’s current capabilities, precisely because it injects external demonstrations.

In other words, SFT can teach the model to execute reasoning strategies it hasn’t seen during pre-training - but it teaches them as fixed patterns to replicate, not as principles to generalize from. The model learns ‘when you see this type of problem, apply this type of solution’, not ‘here is why this solution works and how to adapt it’.

The evidence confirms this ceiling. Models fine-tuned on different problem categories converge on nearly identical solution strategies, with performance plateauing despite logarithmic data scaling². And comparative studies show that SFT-trained models memorize their training distributions rather than extracting transferable rules³.

The implication for code generation is that SFT can make the model produce better code if better code exists in the training set. It cannot make the model invent a better algorithm, a more efficient data structure, or a novel architectural pattern. The quality ceiling is set by the humans who wrote $\mathcal{D}_{SFT}$ .

RL and the Proxy Reward Trap

SFT has a hard ceiling, but at least it can reach it - the training data comes from outside the model’s own distribution, so the model genuinely learns patterns it couldn’t produce before.

Reinforcement Learning promises to go further: not just importing external knowledge, but discovering novel solutions through exploration. But current RL methods for LLMs are on-policy - the model generates its own responses, and a reward function scores them. This means exploration is constrained by the model’s own prior: it can only discover solutions it was already likely enough to sample.

Formally, given a prompt $x$ and a generated code snippet $y \sim \pi_\theta(\cdot \mid x)$ , the RL optimizer maximizes:

J(\theta) = \mathbb{E}_{x \sim \mathcal{D}, \; y \sim \pi_\theta(\cdot \mid x)} \left[ R(x, y) \right]

The question is whether this optimization loop actually discovers novel solutions, or merely redistributes probability over solutions the model could already produce.

Distribution Sharpening, Not Capability Expansion

Recent work by Yue et al.⁴ suggests that RL’s ability to go beyond the training distribution is overstated. Their finding is striking: RLVR-trained models outperform their base models at pass@1 (single-sample accuracy), but underperform at pass@k for large k. If you let the base model sample enough times, it can solve more unique problems than the RL-trained version.

What this means is that RL didn’t teach the model new reasoning patterns. It narrowed the output distribution toward high-reward trajectories - making correct answers more likely on any single roll, but reducing the total space of problems the model can solve. The base model already contained the correct reasoning paths; RL just made them easier to sample. The ceiling was set during pre-training.

Why This Works in Math but Not in Code

This raises an obvious question: if RL merely sharpens distributions, how have recent LLMs solved previously unsolved mathematical competition problems - clearly going beyond their training data?

The answer lies in the reward signal. In mathematics, $R_{true}$ is computable. A proof is either valid or it isn’t, and this can be checked mechanically through formal verification systems like Lean. RL with access to a perfect verifier can explore freely: the model generates millions of candidate proofs, the verifier rejects all incorrect ones, and the policy gradually learns heuristics that lead to novel, correct proofs. The reward signal is dense and exact. Combined with massive sampling, this turns distribution sharpening into something that functions like genuine exploration - the verifier ensures that only truly correct novel paths survive.

Programming does not have this luxury. “Is this code correct?” is undecidable in the general case. “Is this code efficient, readable, secure, and correct?” is not even a single well-defined question. You cannot write a program that looks at another program and definitively tells you whether it’s correct, efficient, secure, and readable for all possible inputs. This is not an engineering limitation; it’s a mathematical impossibility.

To be precise: the impossibility is in the general case. Formal verification tools like Lean can and do verify specific programs against specific specifications. But this requires the specification itself to be written - and for most real-world software, writing a complete formal specification is as hard as writing the software. The bottleneck shifts from “verify the code” to “specify what correct means”, which is a human problem, not a compute problem. This asymmetry is the reason RL produces mathematical breakthroughs but not programming breakthroughs - and it is fundamental, not temporary.

The Proxy We Actually Use

Therefore, we substitute $R_{true}$ with a proxy $R_{proxy}(x, y)$ . Modern $R_{proxy}$ is no longer as naive as “does the code compile”. Today’s LLM coding agents have access to real tooling - compilers, debuggers, terminals, test runners, linters. The model can compile its output, execute it, read the stack trace, and iterate. The feedback loop is tighter, the signal is richer, and the proxy captures more dimensions of correctness than a simple binary.

But it’s still a proxy. A passing test suite tells you the code is correct for those specific inputs. A linter tells you the code follows style conventions, not that its architecture is sound. A debugger helps fix the bug you can reproduce, not the one that appears under a race condition in production. The gap between $R_{proxy}$ and $R_{true}$ has narrowed, but it hasn’t closed - and it can’t close, because $R_{true}$ is uncomputable for programming.

The model doesn’t optimize for correct code. It optimizes for the minimal energy path to maximize $R_{proxy}$ . If the reward function has a shortcut, the model will find it.

Case Study: Claude’s C Compiler

Recently, Anthropic published a blog post describing how they had used 16 parallel instances of Claude Opus 4.6 to write a C compiler from scratch in Rust - a genuinely hard problem⁵. The reward signal was solid: the entire GCC torture test suite plus compilation of real-world projects like SQLite, Redis, PostgreSQL and FFmpeg.

After two weeks, ~2,000 Claude Code sessions, and ~20,000$ in API costs, the agents produced a 100,000-line compiler that achieved a 99% pass rate on GCC’s test suite and could compile the Linux 6.9 kernel’s C source files across x86, ARM, and RISC-V.

Impressive headline. But what we see under the hood:

The generated machine code was catastrophically inefficient. Independent benchmarks showed that CCC-compiled SQLite ran a workload in 2 hours that GCC finished in 10 seconds⁶. The root cause: CCC’s register allocator used a single register as a shuttle to move values between stack locations, turning every operation into multiple memory accesses. The slowdown was not uniform - simple queries like INSERT ran only 1-7x slower, but operations involving nested loops (subqueries, JOINs) compounded the overhead to a 158,000x slowdown in the worst case. All optimization levels (-O0 through -O3) produced byte-identical output - the flags were accepted but completely ignored.

When the problem got truly hard, the model routed around it. CCC could not generate valid 16-bit x86 real mode code needed to boot Linux - the compiled output exceeded 60KB, far beyond the 32KB real-mode limit. Rather than solving this low-level problem, it delegated to GCC⁵. CCC does include its own assembler and linker, but Anthropic acknowledged that these components are “still somewhat buggy” - the demo video was produced using GCC’s assembler and linker as a fallback. When compiling the full Linux kernel, CCC succeeded at producing all 2,844 object files without a single compilation error, but the linking stage failed with ~40,784 undefined reference errors due to incorrect relocations for kernel data structures like __jump_table and __ksymtab⁶.

This is not a story of “almost there”. This is the proxy reward trap made visible at industrial scale. The reward function said “pass the tests” - so the model produced code that passes the tests. It didn’t say “produce efficient machine code” or “implement register allocation correctly” - so the model didn’t. It mathematically minimized the energy required to maximize $R_{proxy}$ , and nothing more.

Case Study: FrankenSQLite

A developer used LLM agents to rewrite SQLite from scratch in Rust⁷. The result: 576,000 lines of code across 625 files - 3.7x more than SQLite itself. A parser, a query planner, a VDBE bytecode engine, a B-tree, a pager, a WAL. The architecture had all the correct names. The code compiled. It produced correct query results.

But its query planner never recognized INTEGER PRIMARY KEY columns as B-tree keys. In SQLite, WHERE id = 5 on a primary key column resolves to a direct B-tree search - O(log n). In the reimplementation, every such query triggered a full table scan - O(n). On 100 rows with 100 lookups: 10,000 row comparisons instead of ~700 B-tree steps. The benchmark: 20,171x slower than SQLite on primary key lookups.

The fix in SQLite is one line in where.c that checks whether a column is the table’s integer primary key and routes it to a B-tree seek. The reimplementation had the flag (is_ipk: true) set correctly in its column metadata - but the query planner never consulted it.

This is the same pattern as the C compiler. The code fulfills the prompt. It does not solve the problem. The LLM generated what was described - “implement a query planner” - and produced a query planner that plans every query as a full table scan.

The Fundamental Limit

Consider a natural iteration on the CCC reward signal: add execution-time benchmarks alongside the test suite. The model now optimizes for both correctness and speed. But “speed on which workload?” becomes the new proxy gap. A model could learn to optimize hot paths in the benchmark while leaving cold paths - error handling, edge-case branches, cleanup code - completely unoptimized. The proxy improved. The gap moved.

The industry’s proposed fixes - Process Reward Models, formal verification integration, better test suites - all share the same assumption: that if we just make the reward signal precise enough, the model will produce correct code. But this treats the symptom, not the cause. The cause is that “correct code” is not a computable property - and every finite approximation of it creates a gap that an optimizer will exploit. At every layer of the stack, from pre-training through RL, the model is a next-token predictor optimizing against a signal. Make the signal better, and the model will find more sophisticated ways to satisfy the signal without solving the underlying problem.

Claude’s C compiler didn’t fail because the test suite was bad. The GCC torture test is one of the most comprehensive compiler test suites in existence. It failed because no finite test suite can fully specify what “correct compilation” means across all possible programs, optimizations, and edge cases. FrankenSQLite didn’t even have a test suite to fail against - the code compiled, returned correct results, and nobody measured whether it took 0.09 ms or 1,815 ms. In both cases the model found the gap between the proxy and the truth, and it settled there - because that’s what the math told it to do.

This doesn’t mean progress is impossible. It means the problem is fundamentally one of reward engineering - and reward engineering for code is hard, because “good code” is not a single measurable property but a high-dimensional surface spanning correctness, efficiency, readability, security, and maintainability. There is no clean, computable reward that captures all of these at once. What remains is iteration: patching $R_{proxy}$ , closing the gaps the model exploits, adding new dimensions to the signal, and accepting that each improvement will reveal the next layer of shortcuts. This is Goodhart’s Law applied to code generation - and there is no reason to expect it to converge. This is not a problem that gets solved with one breakthrough. It’s a problem that gets managed - one proxy at a time.

Chen et al. “Learning What Reinforcement Learning Can’t: Interleaved Online Fine-Tuning for Hardest Questions.” 2025. https://arxiv.org/abs/2506.07527 ↩︎
Huang et al. “Climbing the Ladder of Reasoning: What LLMs Can-and Still Can’t-Solve after SFT?” 2025. https://arxiv.org/abs/2504.11741 ↩︎
Tianzhe Chu et al. “SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training” 2025. https://arxiv.org/abs/2501.17161v2 ↩︎
Yue et al. “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” 2025. https://arxiv.org/abs/2504.13837 ↩︎
Anthropic. “Claude builds a C compiler.” 2025. https://www.anthropic.com/engineering/building-c-compiler ↩︎ ↩︎
Harshanu. “CCC vs GCC.” 2025. https://harshanu.space/en/tech/ccc-vs-gcc/ ↩︎ ↩︎
Hōrōshi. “Your LLM Doesn’t Write Correct Code. It Writes Plausible Code.” 2026. https://blog.katanaquant.com/p/your-llm-doesnt-write-correct-code ↩︎