Infactuous

What People Think LLMs Are Good At

I recently came across a post on X, The Everything App, in which a mathematician talks through his own reasoning while approaching a research task with an LLM (specifically OpenAI's Deep Research). I want to say up front: I don't mean to throw any shade at this guy in particular. This is meant more as an indictment of the way these tools are advertised and talked about generally.

He wanted to test a hypothesis: "more math research is being done by young people now than previously". He came up with a testable, falsifiable proxy for it: "the average age of authors in a particular published math research journal over time". He believes the tasks that LLMs are good at tend to be tasks that are easy but tedious for humans, but he transposes this into the idea that easy, tedious human tasks are the set of tasks that LLMs are good at. They are not. I would argue the set of tasks that LLMs are good at is shockingly narrow, and it certainly doesn't include every easy, tedious human task. Counting the Rs in "strawberry" is an easy counterexample: something humans can do trivially but that LLMs constantly trip up on.

The next problem he runs into is anthropomorphization: treating an LLM as though it were a person. I see so, so many people say "it's like a junior developer" or "it's like a college freshman" to get others to believe it's something like a mistake-prone person, so once again, I can't blame this guy for falling into the trap. He finds that the report, which says published mathematicians are actually getting older, is what he calls "made up" but what I would call "correctly generated", and he sets off to try to reconstruct "the tool's approach".

He claims that the tool worked like a person and performed some intermediate steps:

These are things a human does. These are things a "junior researcher" would do. They are not what the LLM did, because the LLM's approach is to probabilistically generate the most likely next word after a series of previous words. The tool mentions the "young man's game" line because of the type of input it is: an opinion erroneously taken as fact, a claim about the world, and clearly not the source material for the requested data about author ages. Besides, wouldn't a person starting from the conclusion that "mathematicians are young" produce a table showing young ages? This part does not seem to jump out to him; rather, he sees that the line is irrelevant to the research task and treats it as the source of the error. But the statement is not the basis of the research. It is one of many weights that influence the probability of each next word.
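To make that mechanism concrete, here is a deliberately tiny sketch of the process. The vocabulary and probabilities below are invented for illustration; a real LLM conditions on a long context window through a neural network, but the generation loop has the same shape.

```python
import random

# Toy next-word model: the probability of each next word, conditioned only on
# the previous word. These words and weights are made up for illustration.
NEXT_WORD_PROBS = {
    "mathematics": {"is": 0.6, "research": 0.4},
    "is": {"a": 0.7, "hard": 0.3},
    "a": {"young": 0.8, "difficult": 0.2},
    "young": {"man's": 0.9, "field": 0.1},
    "man's": {"game": 1.0},
}

def generate(prompt_word, steps, seed=0):
    """Extend the prompt by repeatedly sampling the conditional distribution."""
    rng = random.Random(seed)
    words = [prompt_word]
    for _ in range(steps):
        dist = NEXT_WORD_PROBS.get(words[-1])
        if dist is None:
            break  # no known continuation; stop generating
        choices, weights = zip(*dist.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)

# Note: nothing in this loop consults the world. A statement like "math is a
# young man's game" appearing in the input doesn't become a premise of any
# reasoning; it only tips the weights.
print(generate("mathematics", 5))
```

The point of the sketch is what's absent: there is no step where the program checks a fact, holds a belief, or pursues a research plan. Everything it emits is downstream of the weights alone.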

One of my favorite elucidations of this idea comes from an article by Josh Dzieza about people in AI relationships, which quotes a paper by Murray Shanahan:

Last year, Murray Shanahan, professor of cognitive robotics at Imperial College London and a senior scientist at DeepMind, published a paper cautioning against the use of mental terms like “think” and “believe” by his peers. These systems can generate language that seems astonishingly human, Shanahan wrote, but the fundamentally alien process they use to do so has important implications for how we should understand their words. To use Shanahan’s example, when you ask a person, “What country is to the south of Rwanda?” and they answer “Burundi,” they are communicating a fact they believe to be true about the external world. When you pose the question to a language model, what you are really asking is, “Given the statistical distribution of words in the vast public corpus of text, what are the words most likely to follow the sequence ‘what country is to the south of Rwanda?’” Even if the system responds with the word “Burundi,” this is a different sort of assertion with a different relationship to reality than the human’s answer, and to say the AI “knows” or “believes” Burundi to be south of Rwanda is a category mistake that will lead to errors and confusion.

This is the error that Daniel Litt, and nearly everyone writing about LLMs today, is making. LLMs do not make claims of fact about the world. They do not examine the state of the world, draw from it an idea, and translate that idea into language to present to the user. They will not someday develop into things that do that, at least not through the mechanisms by which they currently work. They perform a different process, fundamentally alien to human thought, which still ends with "presenting language to the user". The similarity of the result misleads people about the steps that happen beforehand, encouraging the user to engage their theory of mind and project a human thought process onto them. The marketing surrounding these products does everything in its power to reinforce this instinct.

Keeping this in mind, let's return to Daniel's research task. He prompts for the information in a different format, and gets a CSV file containing seven entries, with text suggesting that a full file exists somewhere. He takes this language -- that it contains "all relevant entries from 1950 through 2025" -- to mean that this is the data set used to create the table in the initial analysis. This is not what happened; the CSV file is simply a new series of words placed probabilistically one after another, linked to the previous statements not through some inner compendium of research analysis, but only by those prior statements tipping the probability scales on which new statements tumble out.

When I was young, there was a toy called a Magic 8 Ball: a sphere containing a cloudy liquid with a buoyant 20-sided die suspended in it, and a small window to see it through. You were meant to ask a yes-or-no question, shake the ball, hold it with the window facing upward, and read whichever face of the die surfaced against the window. The result presented you with language, inviting you to project consciousness into the process. But the toy demanded a spiritual leap of faith for that projection, because there was no apparent mechanism by which it might gain information about the world, short of some guiding deity or spirit reaching in to jostle the die. LLMs are far enough removed from common understanding that no such skepticism kicks in.

In fact, the AI relationship article I mentioned earlier features many people making this mistake. Almost every interview with a person in what they claim is "a relationship with an AI" starts with a preamble of the form "I'm not stupid. I'm not a rube. I know how it works. I know it's just probabilistically putting one word after another", and then they say "but", and then they say something wholly incompatible with that idea.

This is not a new problem. In fact, this was a particular pet peeve of the computer scientist Edsger W. Dijkstra. In 1973 he wrote about precisely this, as it existed in his time:

I think anthropomorphism is the worst of all. I have now seen programs "trying to do things", "wanting to do things", "believing things to be true", "knowing things" etc. Don't be so naïve as to believe that this use of language is harmless. It invited the programmer to identify himself with the execution of the program and almost forces upon him the use of operational semantics. ...

And now we have the fad of making all sorts of systems and components "intelligent" or "smart". It often boils down to designing a woolly man-machine interface that makes the machine as unlike a computer as possible: the computer's greatest strength --the efficient embodiment of a formal system-- has to be disguised at great cost. So much for anthropomorphism. (This morning I declined to write a popular article about the question "Can machines think?" I told the editor that I thought the question as ill-posed and uninteresting as the question "Can submarines swim?" But the editor, being a social scientist, was unmoved: he thought the latter a very interesting question too.)

LLMs represent the pinnacle of what Dijkstra calls a "woolly man-machine interface". We are far beyond "trying" and "knowing" in this field; even the technical jargon around LLMs includes words like "training" and "attention" and "inference". These all have specific technical meanings, and they also happen to have meanings that apply to human consciousness, and many people would love for you to trip and fall and mistake one for the other.

An LLM's outputs would be claims of fact if they came from a person. Avoiding the conclusion that a human-like process created them is difficult, and takes constant vigilance. There are far too many people now who are paid a great deal of money to prevent everyone, most of all themselves, from understanding this.