A great piece about the pitfalls of evaluating large language models. It tackles a few reasons why evaluating LLMs as if they were people is not necessarily the right tack:
- Data contamination: the AI has already seen the answers!
- Robustness: answering one question doesn't mean the AI can answer a similar question
- Flawed benchmarks: machines exploit shortcuts that have nothing to do with the actual question
Most tests are pretty bad at actually evaluating much of anything. Cognitive scientist Michael Frank believes (in summary) that
> ...it is necessary to evaluate systems on their robustness by giving multiple variations of each test item and on their generalization abilities by giving systematic variations on the underlying concepts being assessed—much the way we might evaluate whether a child really understood what he or she had learned.
Seems reasonable to me, but it's much less fun to develop a robust test than to wave your arms around screaming about the end of the world.
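To make that concrete, here's a rough sketch in Python of what scoring a model across variations of the same item could look like. None of this comes from the article itself; `ask_model` is a placeholder for whatever LLM call you'd actually make, and the items are invented for illustration.

```python
# Toy robustness check: a model only gets credit for a concept if it
# answers every paraphrase of the question correctly, not just one.

def ask_model(question: str) -> str:
    """Placeholder for a real LLM call (e.g. an API request)."""
    raise NotImplementedError

# Several variations of the same underlying item, all sharing one answer.
ITEMS = {
    "basic arithmetic": {
        "answer": "12",
        "variants": [
            "What is 7 + 5?",
            "If you have 7 apples and get 5 more, how many do you have?",
            "Compute the sum of five and seven.",
        ],
    },
}

def evaluate(items: dict) -> dict:
    """Return, per concept, the fraction of variants answered correctly."""
    scores = {}
    for concept, item in items.items():
        correct = sum(
            item["answer"] in ask_model(q) for q in item["variants"]
        )
        scores[concept] = correct / len(item["variants"])
    return scores

# A concept counts as "understood" only if every variant passes:
# robust = {c for c, s in evaluate(ITEMS).items() if s == 1.0}
```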