OpenAI has repeatedly claimed that the "model weights haven't changed" on their models over time, which many have taken to mean "the outputs shouldn't be changing." Even if the former is true, something else is clearly happening behind the scenes:
For example, GPT-4's success rate on "is this number prime? think step by step" fell from 97.6% in March to 2.4% in June, while GPT-3.5 improved on the same task. Behavior on sensitive inputs also changed. Other tasks shifted less, but there are clearly significant changes in LLM behavior.
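For context on what a benchmark like that looks like, here's a minimal sketch of the kind of harness that produces such a number: ask the model about each integer, parse a yes/no out of the reply, and score against ground truth. `query_model` is a stand-in for whatever API wrapper you use; the paper's actual prompts and answer parsing differ.

```python
# Minimal sketch of a prime-check eval: query the model for each number,
# parse a yes/no from the reply, and compare against ground truth.
# `query_model` is a hypothetical wrapper around whatever completion API you use.

from sympy import isprime  # ground truth for primality

def parse_answer(reply: str) -> bool:
    # Crude heuristic: treat the last "yes"/"no" in the reply as the verdict.
    for token in reversed(reply.lower().split()):
        word = token.strip(".,!:;")
        if word in ("yes", "no"):
            return word == "yes"
    return False  # unparseable replies count as wrong

def prime_eval(numbers, query_model) -> float:
    correct = 0
    for n in numbers:
        reply = query_model(f"Is {n} a prime number? Think step by step.")
        if parse_answer(reply) == isprime(n):
            correct += 1
    return correct / len(numbers)
```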
Is it feedback for alignment? Is it cost-cutting through other architecture changes? It's a mystery!
Another fun pull quote, for code generation:
For GPT-4, the percentage of generations that are directly executable dropped from 52.0% in March to 10.0% in June. The drop was also large for GPT-3.5 (from 22.0% to 2.0%).
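"Directly executable" here roughly means the raw generation runs as-is, with no cleanup. A rough approximation of that check (not the paper's exact harness, and in practice you'd want a sandboxed subprocess with a timeout rather than a bare exec):

```python
# Rough take on a "directly executable" metric: does the raw generation run
# as Python without any post-processing? Here we just compile and exec it in
# a scratch namespace; a real harness would sandbox it in a subprocess.

def directly_executable(generation: str) -> bool:
    try:
        code = compile(generation, "<generation>", "exec")
        exec(code, {"__name__": "__eval__"})  # isolated namespace
        return True
    except Exception:
        return False

def executable_rate(generations) -> float:
    return sum(directly_executable(g) for g in generations) / len(generations)
```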
If you're building a product on top of a model you aren't running yourself, these sorts of (unreported) changes can wreak havoc on your operations. Even if your initial test runs worked great, two months down the line everything might unexpectedly fall apart.
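The defensive move is a pinned regression suite: a fixed set of prompts with programmatic checks, scored on a schedule, with an alert when the pass rate drops below the baseline you recorded at launch. A minimal sketch (the cases format, `query_model`, and the file-based baseline are all placeholders for your own harness):

```python
# One way to catch silent model drift: keep a fixed prompt suite with expected
# checks, score it on a schedule, and alert if the pass rate falls below the
# baseline you recorded when you shipped.

import json
import datetime

def run_suite(cases, query_model) -> float:
    """cases: list of (prompt, check_fn) pairs; returns the pass rate."""
    passed = sum(check(query_model(prompt)) for prompt, check in cases)
    return passed / len(cases)

def drift_check(cases, query_model, baseline_path="baseline.json", tolerance=0.05) -> float:
    rate = run_suite(cases, query_model)
    with open(baseline_path) as f:
        baseline = json.load(f)["pass_rate"]
    if rate < baseline - tolerance:
        print(f"[{datetime.date.today()}] drift alert: {rate:.2%} vs baseline {baseline:.2%}")
    return rate
```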
Full paper here