it doesn't train or self-improve like ML does
I think the training (or fitting) process is comparable to how a support vector machine is trained. It's not iterative like SGD in deep learning; it's closer to traditional machine learning techniques.
But I agree that this is a pretty academic discussion, and it doesn't matter much in practice.
I don't think WFC can be described as an example of a Monte Carlo method.
In a Monte Carlo experiment, you use randomness to approximate a solution, for example to solve an integral where you don't have a closed form. The more you sample, the more accurate the result.
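To make that concrete, here is a minimal sketch of a Monte Carlo integral estimate. The integrand exp(-x^2) on [0, 1] is my own choice of example (it has no elementary antiderivative), and `mc_integral` is a hypothetical helper name, not something from any library.

```python
import math
import random


def mc_integral(f, a, b, n, rng):
    """Estimate the integral of f over [a, b] from n uniform random samples."""
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    # Average function value times the interval length.
    return (b - a) * total / n


def integrand(x):
    # Example integrand (an assumption for illustration).
    return math.exp(-x * x)


# Reference value via the error function, only used to measure the error.
REFERENCE = math.sqrt(math.pi) / 2 * math.erf(1.0)

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 10_000, 1_000_000):
        estimate = mc_integral(integrand, 0.0, 1.0, n, rng)
        print(n, estimate, abs(estimate - REFERENCE))
```

The point is that `n` is a free knob: the error tends to shrink roughly like 1/sqrt(n), so you can always buy more accuracy with more samples.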
In WFC, the number of random draws is determined by your map size and can't be increased to get a more accurate result.
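For contrast, here is a heavily simplified, WFC-style sketch on a 1D strip that counts its random draws. This is not the full algorithm (no 2D grid, no learned pattern weights, no backtracking), and the tile set and the `collapse_strip` name are assumptions I made up for illustration.

```python
import random

# Hypothetical adjacency rules: "coast" may touch anything,
# while "sea" and "land" may not touch each other.
ALLOWED = {
    "sea": {"sea", "coast"},
    "coast": {"sea", "coast", "land"},
    "land": {"coast", "land"},
}


def collapse_strip(n, rng):
    """Fill a 1D strip of n cells; return (tiles, number_of_random_draws)."""
    domains = [set(ALLOWED) for _ in range(n)]
    draws = 0
    while any(len(d) > 1 for d in domains):
        # Pick the undecided cell with the fewest remaining options.
        i = min((k for k in range(n) if len(domains[k]) > 1),
                key=lambda k: len(domains[k]))
        # One random draw collapses that cell to a single tile.
        domains[i] = {rng.choice(sorted(domains[i]))}
        draws += 1
        # Propagate the adjacency constraints until nothing changes.
        changed = True
        while changed:
            changed = False
            for k in range(n):
                for j in (k - 1, k + 1):
                    if 0 <= j < n:
                        supported = set().union(*(ALLOWED[t] for t in domains[j]))
                        narrowed = domains[k] & supported
                        if narrowed != domains[k]:
                            domains[k] = narrowed
                            changed = True
    return [next(iter(d)) for d in domains], draws


if __name__ == "__main__":
    tiles, draws = collapse_strip(20, random.Random(0))
    print(draws, tiles)
```

Each draw settles at least one cell, so the number of draws can never exceed the number of cells. There is no sample count to turn up, which is the difference from the Monte Carlo case.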