As much as I’d like this to be true (don’t believe all the benchmarks), in reality using e.g. gpt 5.5 is still a lot less of a pain in the ass. That mostly comes down to the local model needing more reprompting (gpt is just smarter and oneshots stuff more often) and being a lot slower (on an RTX 3090, for reference).
You're not wrong, but perhaps you are not giving them adequate time to be wrong. Oneshotting is not the be-all; being right is, especially maintainably correct. I've found letting them fight over it useful.





Why would you do that?
Because you hate yourself?
Oh, right, you found a tool (self reflect) that lets you do it with zero effort. FU..