We handed five coding agents the same legacy repo and the same refactor brief, then diffed the results. The spread in approach — and in how much they broke — was the interesting part.
What we measured
Tests passing after, lines touched, and whether the public API stayed stable. One agent rewrote half the app; another made a surgical ten-line change. We show all five diffs.