It's more like they are a sophisticated world modeling program that builds a world model (or approximate "bag of heuristics") modeling the state of the context provided and the kind of environment that produced it, and then synthesize that world model into extending the context one token at a time.
But the models have been found to be predicting further than one token at a time and have all sorts of wild internal mechanisms for how they are modeling text context, like building full board states for predicting board game moves in Othello-GPT or the number comparison helixes in Haiku 3.5.
The popular reductive "next token" rhetoric is pretty outdated at this point, and is kind of like saying that what a calculator is doing is just taking numbers correlating from button presses and displaying different numbers on a screen. While yes, technically correct, it's glossing over a lot of important complexity in between the two steps and that absence leads to an overall misleading explanation.
Actually, OAI the other month found in a paper that a lot of the blame for confabulations could be laid at the feet of how reinforcement learning is being done.
All the labs basically reward the models for getting things right. That's it.
Notably, they are not rewarded for saying "I don't know" when they don't know.
So it's like the SAT where the better strategy is always to make a guess even if you don't know.
The problem is that this is not a test process but a learning process.
So setting up the reward mechanisms like that for reinforcement learning means they produce models that are prone to bullshit when they don't know things.
TL;DR: The labs suck at RL and it's important to keep in mind there's only a handful of teams with the compute access for training SotA LLMs, with a lot of incestual team compositions, so what they do poorly tends to get done poorly across the industry as a whole until new blood goes "wait, this is dumb, why are we doing it like this?"