The power of agentic coding is that the same agent can write, build, run and test the code—and then tweak it according to that. Closed loop is what actually makes this work. Open loop sucks.
Note that this is not AI-exclusive. Human programmers also suck at producing working code without access to a compiler or an ability to test the code.
If you can’t run closed loop for some reason, then, do the cheap stupid version of it and paste the errors you get into the AI’s chat window.
The dirty little secret is that “quality assurance” on code borders on non-existent in a solid 60% of the cases—enterprise or no enterprise, AI or no AI. Features ship broken and failures surface in prod. The same mitigations that apply to human-induced faults apply to AI-induced faults—assuming anyone gives enough of a fuck to have any.
My intuition: RLVR typically follows the path of least resistance. This often results in it tweaking small behavioral knobs: improving reliability of methods, downweighting unreliable and upweighting reliable methods, encouraging adaptive behaviors like backtracking and discouraging maladaptive behaviors like error self-consistency and self-amplification. Upwards from there is old circuits being used in novel ways: i.e. existing “error detectors” rewired to feed into self-check/backtracking triggers.
But nothing prevents RLVR from burning in new behaviors from scratch. Assuming there is no way to get there without, and sufficient RLVR pressure is applied.
This is in part driven by low KL properties of RLVR setups (both explicit and downstream from RLVR being on-policy), in part driven by how ample are the “low hanging fruits” of base model behavior being prediction-optimal but task-suboptimal, and in part by how “expensive” RLVR pressure is—few setups apply enough of it to get truly novel behaviors burned in.