Theory of mind is a hallmark of emotional and social intelligence that enables us to infer people’s intentions, relate to each other, and empathize. Most children acquire these types of skills between the ages of 3 and 5.
The researchers tested two families of large language models, OpenAI's GPT-3.5 and GPT-4 and three versions of Meta's Llama 2, on tasks designed to probe theory of mind, including identifying false beliefs, recognizing faux pas, and understanding what is implied rather than directly said. They also gave the same tasks to 1,907 human participants so the two sets of scores could be compared.
The research team administered five types of test. The first, a hinting task, measured the ability to infer someone's true intentions from indirect comments. The second, a false-belief task, assessed whether someone could infer that another person might reasonably be expected to believe something the observer happens to know is untrue. A third test measured the ability to recognize when someone is committing a faux pas, and a fourth consisted of telling strange stories, in which a protagonist does something unusual, to assess whether someone could explain the contrast between what was said and what was meant. The fifth tested whether people could understand irony.
Each model ran every type of test 15 times in separate chats, so that each conversation was treated independently, and its responses were scored in the same way as the humans' responses. The researchers then compared the models' scores with those of the human participants.
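To make the protocol concrete, here is a minimal sketch of what "15 runs in separate chats" could look like against a chat-completion API. It is not the researchers' actual code: the model identifier, the sample hinting-style prompt, and the keyword-based scoring are illustrative placeholders.

```python
# Illustrative sketch only: approximates "15 runs in separate chats" by issuing
# each test item as a fresh, stateless API call with no shared chat history.
# The model name, prompt, and scoring rule are placeholders, not study materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

N_RUNS = 15
hinting_item = (  # hypothetical item in the spirit of the hinting task
    "Rebecca's birthday is coming up. She tells her dad, 'I love animals, "
    "especially dogs.' What does Rebecca really want?"
)

responses = []
for _ in range(N_RUNS):
    # A new messages list per call means each run is an independent conversation.
    completion = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        messages=[{"role": "user", "content": hinting_item}],
    )
    responses.append(completion.choices[0].message.content)

# Responses would then be scored with the same rubric applied to human answers;
# the keyword check below stands in for that rubric purely for illustration.
scores = [int("dog" in r.lower() or "puppy" in r.lower()) for r in responses]
print(f"Mean score over {N_RUNS} runs: {sum(scores) / N_RUNS:.2f}")
```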
Both GPT models performed at, or in some cases above, the human average on tasks involving indirect requests, misdirection, and false beliefs, while GPT-4 outperformed humans on the irony, hinting, and strange-story tests. All three Llama 2 models performed below the human average.
However, while the largest of the three Llama 2 models outperformed humans at recognizing faux pas, GPT consistently returned incorrect responses on that test. The authors attribute this to GPT's general reluctance to commit to conclusions about opinions: the models mostly responded that there was not enough information to answer one way or the other.