It's tough to make robust tests to evaluate machines if you're used to making assumptions based on adult humankind. The paper's title – Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models is a reference to a horse than did not do math.