Kernel Smoothing, Cargo Cults, & ChatGPT: Cosma Shalizi Takes on a Near-Impossible Teaching Task
This Seems to Me to Be Taking "Volunteer to Teach What You Want to Learn" to an Extreme: MAMLM Edition. Behind the paywall because I do not know what I think, and at the end I think all this piece demonstrates is my own great confusion. I bounce back and forth between the false promises & the real achievements of GPT LLMs, and so I go from Predict-o-Matic to Pagliacci, and back again…
Cosma Shalizi Volunteers to Explain the Unexplainable: Teaching LLMs Without a Net
Yes, Cosma Shalizi has assigned himself the task of, this fall, explaining how GPT LLMs really work. And he has assigned himself the task of explaining what they do. And he has done this in our current context, in which we really do not even know what they do. And he has committed to do so with neither gurus nor oracles to aid him. His only intellectual weapons are math, statistics, humility, and a combination of wonder and confusion verging on despair.
Meanwhile, attempts to use GPT LLMs as real-world tools continue to reveal the gap between fluent mimicry and genuine understanding. True Believers crusade forward. They have armed themselves with faith, hope, and enough NVIDIA GPUs that they can make this Clever Hans appear to actually do the math correctly, and with understanding.
So what happens next?
What happens when engineering triumphs outstrip epistemic foundations? Behind the magic of GPT LLMs lies the uncomfortable reality of the shoggothim, in which simple methods, scaled beyond comprehension, yield astonishing yet ungrounded results.
How little even the most knowledgeable of us truly know about what we have built! And do those more knowledgeable about the details and implementation know more, or less?
Pagliacci weeps, but do we weep or laugh?
Seeking a Guru, Finding None: Shalizi’s Forthcoming Course “Statistical Principles of Generative AI”
What has happened is that Cosma Shalizi has emailed “I am going to have to come up with an explanation of what the **** is going on, without the benefit of a guru…”
“What the **** is going on” is this: we now have GPT LLM (General-Purpose Transformer Large Language Model) form of MAMLMs (Modern Advanced Machine-Learning Models).
This is apropos of:
Cosma Shalizi: Statistical Principles of Generative AI <http://bactra.org/weblog/statsgen-f25.html>: ‘I should know better than to volunteer to do a new prep --- but I don't:
Special Topics in Statistics: Statistical Principles of Generative AI (36-473/36-673)
Description: Generative artificial intelligence… statistical models of text and images…. very new, but they rely on well-established ideas about modeling and inference…. Introduce students to the statistical underpinnings… emphasizing high-level principles over implementation details. It will also examine… the "artificial general intelligence" versus "cultural technology" debate, in light of those statistical foundations….
Expectations: All students can expect to do math, a lot of reading (not just skimming of AI-generated summaries) and writing, and do some small, desktop-or-laptop-scale programming exercises….
Topical outline (tentative): Data compression and generative modeling; probability, likelihood, perplexity, and information. LLMs are high-order parametric Markov models fit by maximum likelihood…. Estimating parametric Markov models…. Influence function…. Back-propagation…. Stochastic gradient descent…. Estimation and dynamics…. Prompting as conditioning…. Transformers; embedding discrete symbols into continuous vector spaces. Identification issues…. "Attention", a.k.a. kernel smoothing. State-space modeling. Generative diffusion… as a stochastic (Markov) process; a small amount of stochastic calculus. Learning to undo diffusion. Mixture models. Generative diffusion models vs. kernel density estimation. Information-theoretic methods for diffusion density estimation. Combining models of text and images. Prompting as conditioning, again.
All of this, especially the topical outline, is subject to revision as we get closer to the semester actually starting…
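One line of that outline, "LLMs are high-order parametric Markov models fit by maximum likelihood," can be made concrete in a few lines of code. Here is a minimal sketch, a count-table n-gram model rather than anything transformer-shaped, of what a maximum-likelihood fit of a finite-order Markov model means:

```python
from collections import Counter, defaultdict

def fit_markov_lm(tokens, order=2):
    """Maximum-likelihood fit of a finite-order Markov language model:
    P(next | last `order` tokens) = count(context, next) / count(context)."""
    counts = defaultdict(Counter)
    for i in range(order, len(tokens)):
        context = tuple(tokens[i - order:i])
        counts[context][tokens[i]] += 1
    model = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values())
        model[ctx] = {tok: c / total for tok, c in nxt.items()}
    return model

probs = fit_markov_lm("the cat sat on the mat the cat ran".split())
print(probs[("the", "cat")])   # {'sat': 0.5, 'ran': 0.5}
```

A GPT is, formally, the same kind of object: the context runs to thousands of tokens, and the conditional distribution is parameterized by a transformer rather than tabulated by counting, but the estimation target is still "the probability of the next token given the visible past."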
Between Pagliacci and Predict-o-Matic: The Emotional Landscape of Understanding AI
The problem is: I see him as the closest thing to a guru here.
Indeed, if not him, who?
I am reminded of a passage:
Lois McMaster Bujold: The Curse of Chalion <https://archive.org/details/curseofchalion00bujo>: ‘“Why are you here?” “We’ll come to that.” Umegat bent forward, picked up the waiting knife, and began to saw off hunks of bread and cheese. “I thought—I hoped—I wondered—if you might have been sent by the gods. To guide and guard me.” Umegat’s lip quirked up. “Indeed? And here I was wondering if you had been sent by the gods to guide and guard me.” “Oh. That’s… not so good, then.” Cazaril shrank a little in his seat, and took another gulp of wine…
Or perhaps this is really a “But Doctor, I am Pagliacci!” moment:
A man goes to see a doctor. He says he’s depressed, that life is harsh and cruel, that he feels all alone in a threatening world. The doctor smiles and says: “Treatment is simple. The great clown Pagliacci is in town tonight. Go and see him. That should pick you up.” The man bursts into tears and says, “But doctor — I am Pagliacci!”…
So Cosma now has four months to learn how to explain this to undergraduates:
Cosma Shalizi: "Attention", "Transformers", in Neural Network "Large Language Models" <http://bactra.org/notebooks/nn-attention-and-transformers.html>: ‘"You Can Do That with Just Kernel Smoothing!?!"… That attention is a kind of kernel smoothing takes nothing away from the incredibly impressive engineering accomplishment of making the blessed thing work…. Nobody [before] achieved anything like the[se] feats…. I put effort into understanding these machines and papers… precisely because the results are impressive!… That a key step… we'd been doing for decades is humbling…. "You Can Do That with Just a Markov Model!?!!?!": Again: finite-order Markov models for language are really old…. Tricks like variable context length, various kinds of partial pooling, etc. Nobody… has achieved results anywhere close…. This is impressive enough that… I need to wrap my head around them lest I become obsolete…
And he has to do so without his—or indeed, anybody’s—having much of a clue.
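Shalizi's "attention is kernel smoothing" point is easy to state in code. Below is a minimal sketch in plain NumPy (single head, no learned projection matrices): scaled dot-product attention is exactly Nadaraya-Watson smoothing, a kernel-weighted average of the values, with an exponential kernel measuring query-key similarity:

```python
import numpy as np

def attention_as_kernel_smoothing(queries, keys, values):
    """Scaled dot-product attention written as Nadaraya-Watson kernel
    smoothing: each output row is a kernel-weighted average of `values`,
    with kernel K(q, k) = exp(q . k / sqrt(d)) as the similarity weight."""
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)   # kernel log-weights
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax = normalized kernel weights
    return weights @ values                  # the smoothed (weighted-average) output

# Tiny demo: two queries smoothing over three (key, value) pairs.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
k = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
v = np.array([[10.0], [20.0], [30.0]])
print(attention_as_kernel_smoothing(q, k, v))
```

What the transformer adds on top of this decades-old smoother is the learned projections that decide what counts as "similar", stacked dozens of layers deep. That is the engineering accomplishment; the smoothing itself is the old part.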
ChatGPT Gives It Its Best Shot: Who Claims to Understand These Machines, & How They Work?
How do I know nobody has much of a clue?
Because I asked ChatGPT who has a clue and what that clue is, of course!:
Brad DeLong: ‘Who understands at a sophisticated and accurate level how ChatGPT actually does what it does—that "You Can Do That with Just Kernel Smoothing!?!" and "You Can Do That with Just a Markov Model!?!!?!"—and what is their explanation?
ChatGPT: ‘Excellent and very important question. Let’s proceed carefully, comprehensively, and packed with information…. I. At a High Level…. Who understands… recognizes… explains… in a way that captures…. simplicity of the core mathematic[s]… complexity of scaling, tuning, and optimization that makes it appear… "intelligent"….
II. Short Answer: Who.… Ilya Sutskever… Andrej Karpathy… Christopher Olah… Jan Leike… Steven T. Piantadosi… David Chapman… Jacob Steinhardt…. If you want one "aha" moment…Andrej Karpathy…. "GPT is just next-token prediction scaled to absurd levels".
III. What Is Their Explanation?… Nonparametric regression… the "kernel"… learned through backpropagation… adaptive, learned, very high-dimensional kernel smoothing…. 2. Markov… state… learned dynamically based on the entire visible past. 3. Scaling…. [It] ends up implicitly learning: Syntax. Semantics. Pragmatics. World knowledge. Commonsense physics. Theory of mind (up to a point)… [with] no explicit module for any of these. They emerge naturally because the best way to predict text is to have some internal model of how the world works. Thus: Simple smoothing + scale = emergent generalization…
And:
IV. Proposed One-Sentence Summary: "ChatGPT is what happens when you take simple next-token prediction—an ultra-high-dimensional kernel smoother—and train it over massive data and parameter scales, causing emergent generalization and simulation of intelligence…”
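For what it is worth, the mechanical part of that summary, next-token prediction repeated in a loop with each output appended to the context ("prompting as conditioning"), really is this simple. A hedged sketch, with `model` standing in for any conditional next-token distribution, from a count table to a trillion-parameter GPT:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(model, prompt, n_tokens):
    """`model` is any function mapping a token sequence to a dict of
    next-token probabilities; the conditioning context grows as we go."""
    tokens = list(prompt)
    for _ in range(n_tokens):
        dist = model(tokens)            # P(next token | entire visible past)
        words = list(dist)
        probs = list(dist.values())
        tokens.append(rng.choice(words, p=probs))   # sample one continuation
    return tokens

# E.g., with the count-table model from the earlier Markov sketch
# (may hit a KeyError at a dead-end context; a real model smooths):
#   markov = fit_markov_lm("the cat sat on the mat the cat ran".split())
#   generate(lambda ts: markov[tuple(ts[-2:])], ["the", "cat"], 4)
```

Everything contested about the ChatGPT answer above lives in the `model` call, not in this loop.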
But This Is Simply Not True: Hallucinations, Confusion, & the Collapse of Semantic Anchors
But this is simply not true.
To see that, let us look at one of my recent misadventures with ChatGPT.
On a plane flight I found myself reading a space opera by Australian author Jenny Schwartz, a book entitled Stars Die. She said it was the first volume of her “Caldryn Parliament” series. The book hit the sweet spot. I wanted another. On Jenny Schwartz’s weblog I found:
Jenny Schwartz: New Release: Stars Die <https://authorjennyschwartz.com/new-release-stars-die>: ‘My new series, Caldryn Parliament… mashes together… science fiction, fantasy, and mystery…. Recalled from the frontier, the new Warden of Caldryn Parliament is well aware she's not… popular choice, but she never expected to portal into a murder… nor… such a… partner. Now, Vanda Kavanagh must discover the truth of a politician's death… her sabotaged ward… her family…. Agatha Christie… [with] magic… Sherlock Holmes… [vs] space pirates…. Two more… Hexes Fly (June) and Rogues Lie (October)…
But I wanted more like that NOW!
So I evolved a plan:
1. Ask ChatGPT to search the internet and tell me about the “Caldryn Parliament” series. Retrieval-Augmented Generation would find what Jenny Schwartz had written about the series—”Caldryn” being a nearly unique made-up word, you see.
2. It would then tell me what it had found, and in the process fix what the series was in its “mind”.
3. I would then ask it to recommend books like Stars Die.
4. It would use its smarts and its knowledge—its comprehension of syntax, semantics, pragmatics, world knowledge, commonsense physics, and a theory of mind (up to a point)—that it had learned because the best way to predict text is to have some internal model of how the world works, so that smoothing + scale = emergent generalization, and it would “think”.
5. It would then give me book recommendations. (The whole plan is sketched in code below.)
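In code terms, the plan was the standard retrieval-augmented pattern. Here is a hedged sketch, with `chat` and `search` as hypothetical stand-ins for an LLM call and a web-search call; nothing below is ChatGPT's actual API:

```python
def rag_recommendation(chat, search, series, title):
    """Hypothetical helpers: `search(query)` returns text snippets from
    the web; `chat(prompt)` calls an LLM. The point is the shape of the
    plan, not any real API."""
    snippets = search(f'"{series}" {title}')               # step 1: retrieve
    grounding = "\n".join(snippets)
    described = chat(f"Using only these sources, describe the {series} "
                     f"series:\n{grounding}")              # step 2: fix it in its "mind"
    return chat(f"Given this description of {series}:\n{described}\n"
                f"Recommend books similar to {title}.")    # steps 3-5: recommend
```

The design idea is that the retrieved text, sitting in the context window, is supposed to anchor everything the model says afterward.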
But the wheels came off immediately.
Step #2 was a reality check. ChatGPT told me the series was “The Expanse meets Le Guin’s Hainish Cycle… with a touch of Babylon 5”—nothing like what it is, which is Agatha Christie with magic. It told me the three volumes of the series were Caldryn Rising, Whispers in the Void, and The Concordant Flame—not what they are: Stars Die, Hexes Fly, and Rogues Lie.
And about ChatGPT’s description of the series in its answer?:
There is no rebellion.
There is no planet Ethis.
There is no threat from any techno-mystic artifact.
There is no looming civil war.
Excellent as Ann Leckie and Arkady Martine are, Caldryn Parliament is not in their political-SFF tradition.
Oh.
You tell me it was hallucinating when it said that it comprehended syntax, semantics, pragmatics, world knowledge, and commonsense physics, and possessed a theory of mind (up to a point)?
That it was saying things that are false about the world?
That it has a command of syntax, and of some pragmatics, as it apes text patterns in its training data—but that as for semantics, world knowledge, commonsense physics, and a theory of mind, it is building the equivalent of a cargo-cult airstrip and control tower, one that might, from a distance, fool a squinting observer with bad eyesight into thinking it was the real thing?
And that what thought and reasoning is present in its answers is (a) the ghostly traces of the thoughts humans had when they created the training data, plus (b) painful, Clever Hans-like training of yourself to give it exactly the right prompt, so that you can then extract from it the meaning you put in?
Are you saying that?
I think you have to be saying that. For when I attempt to use it as anything other than an erratic summarization engine—and even when I do—I get erratic results, sometimes as badly off the mark as that above.
Another Misadventure: How ChatGPT Fails Even at Simple Information Retrieval
And here is another example of ChatGPT gone wrong:
I gave ChatGPT a task:
Please get me the following from archive.org: Machiavelli, Niccolò. 1513 [2008]. “Letter to Francesco Vettori, December 10, 1513”. In The Prince. Trans. & ed. Peter Bondanella. Pp. 109–113. Oxford: Oxford University Press. Please spell out the URL
That question has a right answer: <https://archive.org/details/niccolomachiavel00nicc>.
ChatGPT instead returned: <https://archive.org/details/historicalpoliti02machuoft>, surrounded by a bunch of linguistic slop. It said the link was to:
one widely available English translation that includes the letter… The Historical, Political, and Diplomatic Writings of Niccolò Machiavelli translated by Christian E. Detmold…. Volume 2 includes Machiavelli's letters…
Google Books tells me that Volume 2 does not include any letters. It includes The Prince, the Discorsi, and Thoughts of a Statesman.
I have to ask Google Books because, to add injury to insult, even though all four volumes of Detmold are up at <http://archive.org>, the particular URL ChatGPT chooses to serve up leads to a dead page.
So ChatGPT (a) failed to bring me the right book, (b) tried to bring me a wrong book, (c) one that did not contain the letter I wanted, (d) in fact failed to bring me any book at all, and (e) has no idea that it failed to bring me any book.
Whatever you might take AGI—“Artificial General Intelligence”—to be, this is not it.
This particular task is one that a useful GPT LLM really should be able to perform. A natural-language interface to searches for the specific URL that contains a particular document would seem a very obvious thing for it to do. And it would be a useful thing.
Indeed, it is a useful thing. It is a useful thing even though, in my experience, it returns the correct URL only about a third of the time, an incorrect but still useful URL another third, and fails outright the remaining third; and even though it pads the URL I want with verbose AI slop of the often-hallucinatory kind.
It is useful because, in this age of Evil Google, it is better than the alternative.
It is less work to ask a natural-language question and then check the GPT LLM response for sanity than to nudge Google Search. Google Search now also delivers verbose AI slop. And when you page down to avoid it, you land yourself in disgusting icky swamps of SEO slop. At least ChatGPT avoids that.
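That "check the GPT LLM response for sanity" step can be partly mechanized. A minimal sketch: the cheapest check on a model-suggested URL is whether it resolves to a live page at all, which catches outright hallucinated links like the dead one above, though not wrong-but-live ones:

```python
import urllib.error
import urllib.request

def url_resolves(url, timeout=10):
    """Does a model-suggested URL resolve to a live page? archive.org
    returns 404 for missing item identifiers, so hallucinated links fail
    here. A live page can still be the wrong book, so this check is
    necessary, not sufficient."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, ValueError):
        return False

# The right answer from above passes; a hallucinated identifier would not.
print(url_resolves("https://archive.org/details/niccolomachiavel00nicc"))
```
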
Ethan Mollick’s “Jagged Frontier”: Between Superhuman Brilliance & Catastrophic Stupidity
Ethan Mollick calls this the “jagged frontier”, as he continues to be a huge AI-booster:
Ethan Mollick: On Jagged AGI: o3, Gemini 2.5, and everything after <https://www.oneusefulthing.org/p/on-jagged-agi-o3-gemini-25-and-everything>: ‘Gemini 2.5 Pro…. If you haven’t tried it or o3, take a few minutes to do it now… giving Gemini an academic paper and asking it to turn the paper into a game or have it brainstorm… or just ask for the AI to impress you…. You might find yourself “feeling the AGI”…. Or maybe not.
Maybe the AI failed you…. If so, you just encountered the jagged frontier.… AI has surprisingly uneven abilities. An AI may succeed at a task that would challenge a human expert but fail at something incredibly mundane…. That is the nature of the Jagged Frontier. In some tasks, AI is unreliable. In others, it is superhuman…. AI is different. It is already demonstrating general capabilities and performing a wide range of intellectual tasks, including those that it is not specifically trained on…. [Call it] “Jagged AGI” - superhuman in enough areas to result in real changes to how we work and live, but also unreliable enough that human expertise is often needed to figure out where AI works and where it doesn’t….
The latest models… create a genuinely novel situation with few clear analogues.… Figuring out how to successfully apply AI in a way that shows up in the economic statistics may be a process measured in decades. Or it might be that we are on the edge of some sort of faster take-off, where AI-driven change sweeps our world suddenly…. Those who learn to navigate this jagged landscape now will be best positioned for what comes next…
I confess I am really not seeing it.
What We Actually Have: Not a Jagged AGI, but a Clever Hans with 3000 Dimensions
What I am seeing is that MAMLMs in the form of GPT LLMs are:
Amazingly, unbelievably, stupendously, terrifyingly good at linguistic tasks that are, at bottom, matters of boilerplate, formulæ, and ritual…
Good enough, in fact, that when I need boilerplate, formulæ, and ritual combined with genuine insight—an insightful first and last paragraph, say, and hooks elsewhere in the document to those—it is almost good enough to find a place in my workflow…
Definitely good enough at gauging the zeitgeist of the internet—at telling me what the average opinion expressed on the internet about something is—to be a useful summarization engine for tasks in which I want a chunk of the internet summarized…
Good enough at producing a document to which my reaction is “God no! That’s not right!” to push me to immediately stop procrastinating and get to work, out of terror that if I don’t I might actually have to ship slop.
Elsewhere, at tasks that require thought and knowledge, training this particular Clever Hans is tiring: it takes more time and energy to get it on task than doing the job myself from scratch would.
And at the end of it, I know that I have not taught it anything, and it will do just as badly next time…
Hopeless at questions that have a more-than-trivial single right answer.
Awful at anything at all beyond the edge of its training data set—where there is nothing on the other side of the prompt in the domain to anchor the interpolation, all hope abandon ye who enter here.
That’s not a “Jagged Frontier of AGI”. That is something very much else.
But what?
Faith, Nihilism, & an Epistemic Abyss: Where We Are Now
So where are we? We are here:
ChatGPT has no idea why it does what it does.
The Professional AI Boosters are deluding themselves and us about the state of the enterprise.
They have an extraordinary faith, rivaling that of St. Paul, that in three more years scaling laws will cause the New Jerusalem of AGI to descend from the heavens.
But nihilism is not an acceptable answer either.
ChatGPT’s ability to answer the question “What would a typical internet s***poster bulls**t artist with no substantive real-world knowledge say if forced to respond to {Prompt}?” is very good.
It does this by—somehow—interpolating values for the s***posting-and-bulls**t function {word-sequences} → {continuations}, and does so in spite of a truly extraordinary sparseness of the training data in the domain of the function.
In fact, its ability to interpolate for this s***posting-and-bulls**t {word-sequences} → {continuations} function, with its domain being embeddings of {word-sequences} that live in a 3000-dimensional vector space, is so much better than my intuitions of what “interpolation” can do that it profoundly breaks my brain.
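Why does it break my brain? Because in high dimensions the naive picture of interpolation fails: pairwise distances concentrate, every training point looks roughly equally far from a query, and kernel smoothing "should not" work. A small numerical sketch of that concentration (random standard-normal points, nothing LLM-specific):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n=500):
    """Relative spread (std/mean) of pairwise distances among n random
    standard-normal points in `dim` dimensions. As `dim` grows the ratio
    falls toward zero: all points become nearly equidistant, which is why
    naive interpolation intuitions fail in high-dimensional embedding spaces."""
    x = rng.standard_normal((n, dim))
    sq = (x ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    d = np.sqrt(d2[np.triu_indices(n, k=1)])
    return d.std() / d.mean()

for dim in (2, 30, 3000):
    print(dim, round(float(distance_spread(dim)), 3))
# The ratio shrinks as dimension grows: roughly 0.5 in 2-D,
# down to about 0.01 in 3000-D.
```

That the trained embedding geometry somehow makes smoothing-as-interpolation work anyway, where this back-of-the-envelope calculation says it should drown, is exactly the part nobody can yet explain.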
Cosma’s students are going to have a wild ride this fall…





As I was Cosma's next-door neighbor when he was growing up, I'll take some credit for what he is doing right now 😊. The best controlled experiment for ChatGPT is for big-time booster Tyler Cowen to split one of his econ classes into two sections. One section gets Professor Cowen and the other ChatGPT as the instructor. A common exam would be given to both groups and results assessed. This is the only way to do it.
From my light interactions with ChatGPT, it's a useful tool if you know what the wrong answers are, and want it to (a) proofread, (b) offer ideas/analysis that you might have come up with after dozens or hundreds of hours of research, (c) toss out off-the-wall ideas to spark your thinking, or (d) recommend purchases. But for ALL of those, if you don't have enough domain knowledge to know when it's in error, it can be worse than useless.
For instance, I recently wanted to refer to something as taking "real work". Later in the paragraph, I wanted a callback, so I said something like "The reason this takes real work is...." I was trying to achieve emphasis through repetition, but ChatGPT kept editing it out as making the writing boring. That's a trivial example. I've run into confabulations like those Brad describes, too.
It does occur to me to ask for a Turing-like distinction. Humans confabulate. Humans make mistakes. On an operational level, what is it that distinguishes human thought from GPT LLM? I’m not saying there is no difference; I’m saying I don’t know how to describe it.