June 2, 2023: @melmitchell1

#papers   #evaluation   #models  

It's tough to make robust tests to evaluate machines if you're used to making assumptions based on adult humans. The paper's title, "Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models", is a reference to a horse that did not do math.