I've given up on all the "look at how an LLM scores on this test!!!" excitement because there's almost always something else going on, whether it's explicitly cooking the books in favor of the LLM, testing it on questions it's already seen, or (my favorite!) some sort of answer leakage.