Centuries-Old Mathematical Ideas Power Today’s LLMs
30 November 2025
Centuries-old ideas from Newton, Leibniz, Cayley and Shannon, created to model motion, change and information, now underpin the mathematical machinery driving today’s LLMs
By Yohai Schwiger
It’s tempting to think of large language models as a purely linguistic phenomenon — machines that read paragraphs, grasp context, and choose words as if they were human. But beneath this linguistic surface lies a vast mathematical engine, far more intricate than any grammar or vocabulary. What drives an LLM is not a stockpile of words, but an enormous machinery of algebraic, differential, statistical and informational ideas, assembled slowly over centuries. Leibniz and Newton in the 17th century, Cayley and Sylvester in the 19th, and Shannon in the 20th — none of them imagined a model that speaks Hebrew and generates ideas, yet each supplied a foundational brick for the mechanisms that allow LLMs to “understand” us today.
The irony is striking: tools invented to calculate the paths of planets, describe physical forces, or send a clear signal down a noisy phone line are now the nuts and bolts of the LLM brain. Concepts originally meant to measure motion, change, noise and uncertainty have — almost accidentally — become the bedrock on which today’s language models are built. This article traces the mathematical story: how centuries-old ideas became the conceptual and computational foundation of modern machine intelligence.
Vectors and Matrices — The Nuts and Bolts of the LLM Mind
Operations on vectors and matrices are the most basic computations an LLM performs. Every step — representing a word in context, interpreting a full sentence, predicting the next token or updating weights during training — ultimately reduces to multiplying them. It’s the computational language an LLM “thinks” in.
The reason is simple but deep: language is inherently high-dimensional. A word isn’t just a sound or a dictionary entry; it carries meaning, syntax, emotional tone, usage patterns — an entire semantic world stretched across hundreds or thousands of hidden dimensions. To represent such an object mathematically, you need a tool built for multi-dimensional work. A vector is the modern container for such an entity — “cat,” “run,” “curious” — and a matrix is the operator that shifts, rotates, warps or sharpens that vector within its space. Just as physicists used vectors to describe motion in physical space, LLMs use them to describe motion in semantic space.
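To make this concrete, here is a minimal sketch in NumPy: a few words as invented 4-dimensional vectors, and one weight matrix acting on all of them. The numbers are placeholders for illustration; real models learn embeddings with hundreds or thousands of dimensions, but the underlying operation is the same matrix-vector multiplication.

```python
# A toy illustration (not any real model's embeddings): word vectors and one
# linear transformation. All numbers are invented for demonstration.
import numpy as np

# Hypothetical 4-dimensional embeddings for three words.
embeddings = {
    "cat":     np.array([0.9, 0.1, 0.3, 0.0]),
    "run":     np.array([0.1, 0.8, 0.0, 0.4]),
    "curious": np.array([0.5, 0.2, 0.7, 0.1]),
}

# One weight matrix: a single linear transformation inside the model.
# Multiplying by it shifts, rotates and stretches every word vector at once.
W = np.array([
    [ 0.5, -0.2,  0.1,  0.0],
    [ 0.0,  0.7,  0.3, -0.1],
    [ 0.2,  0.1,  0.6,  0.4],
    [-0.3,  0.0,  0.2,  0.8],
])

for word, vec in embeddings.items():
    print(word, "->", np.round(W @ vec, 2))   # one matrix-vector multiplication
```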
From this foundation grew the idea that changed everything: the Attention mechanism. Attention had appeared in earlier neural translation systems, but the 2017 paper Attention Is All You Need, by researchers at Google, made it the entire engine of the Transformer architecture that underlies modern models. Attention allowed LLMs not only to represent words, but to determine which words matter to which. In a sentence like:
“The cat sat on the rug because it was tired,”
the pronoun “it” is ambiguous. To resolve it, the model multiplies each word’s vector by three learned matrices to produce three new vectors per word — Query, Key and Value — and then computes how strongly each word “attends” to every other word. If the Query of “it” lines up with the Key of “cat” (a large dot product), the model infers a semantic link and focuses attention there. Behind this linguistic phenomenon sits a dry algebraic fact: it’s all matrix–vector multiplications, followed by a softmax that turns the scores into weights.
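The computation itself fits in a few lines. The sketch below uses random projection matrices, a single attention head and no masking, purely to show the shape of the algebra; a real Transformer learns W_q, W_k and W_v during training and runs many such heads in parallel.

```python
# A minimal sketch of scaled dot-product attention in NumPy.
# Random weights and tiny dimensions are stand-ins for learned parameters.
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_head = 10, 16, 8           # ten tokens, small toy dimensions
X = rng.normal(size=(seq_len, d_model))        # one vector per token of the sentence

W_q = rng.normal(size=(d_model, d_head))       # the three projection matrices
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

Q, K, V = X @ W_q, X @ W_k, X @ W_v            # Query, Key and Value for every token

scores = Q @ K.T / np.sqrt(d_head)             # how strongly each token matches each other
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax: scores -> attention weights

output = weights @ V                           # each token becomes a weighted mix of Values
print(weights.shape, output.shape)             # (10, 10) (10, 8)
```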
Yet the foundation for this was laid in the 19th century, when mathematicians like Arthur Cayley and James Sylvester were working on algebra and geometry, with language nowhere in sight. Vectors emerged as a way to describe forces and motion — the path of a star, the velocity of a particle — and matrices appeared as algebraic descriptions of transformations: rotations, reflections, and systems of linear equations. Today, the same tools built to understand forces and fields let a model understand nuance, syntax and intent. What was once the math of physics has quietly become the math of meaning.
Calculus — How an LLM Learns From Its Mistakes
If linear algebra is how an LLM thinks, then calculus is how it learns. The problem seems simple: once the model predicts the next token, how does it know how wrong it was, and in which direction it should adjust itself to improve? This is where the Loss function (typically Cross-Entropy) comes in — a numeric measurement of the gap between prediction and reality. But knowing the size of the gap isn’t enough; the model must understand the direction in its vast weight space that reduces that gap.
To do that, the model computes derivatives — the rate of change of the error with respect to each individual parameter, obtained for all parameters at once by backpropagation, which is essentially the chain rule applied on an enormous scale. This is precisely the language Newton and Leibniz created in the 17th century to understand how things change: slopes, rates, directions. Gradient Descent — developed much later, in the 19th and 20th centuries, as part of optimization theory — takes these derivatives and updates the model’s weights in the opposite direction of the slope. Each step is tiny, but millions of them gradually walk the model down the error surface toward better predictions. This process occurs only during training; at inference time the weights are “frozen,” and no gradients are computed.
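Here is a minimal sketch of one such learning step, under toy assumptions: the “model” is a single weight matrix mapping a fixed context vector to vocabulary logits, and the gradient is worked out by hand with the chain rule rather than by an automatic-differentiation framework.

```python
# A toy training loop: cross-entropy loss, hand-derived gradient, gradient descent.
# Sizes, data and learning rate are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d_model, lr = 5, 8, 0.1

W = rng.normal(scale=0.1, size=(vocab_size, d_model))  # the model's weights
x = rng.normal(size=d_model)                           # a fixed context representation
target = 2                                             # index of the true next token

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(100):
    probs = softmax(W @ x)                  # predicted distribution over the vocabulary
    loss = -np.log(probs[target])           # cross-entropy: surprise at the true token
    if step == 0:
        print(f"initial loss: {loss:.3f}")
    grad_logits = probs.copy()
    grad_logits[target] -= 1.0              # d(loss)/d(logits) = probs - one_hot(target)
    grad_W = np.outer(grad_logits, x)       # chain rule: d(loss)/d(W)
    W -= lr * grad_W                        # gradient descent: step against the slope

print(f"loss after 100 steps: {-np.log(softmax(W @ x)[target]):.3f}")
```

Each pass through the loop is one tiny downhill step; a real training run repeats it over billions of parameters and trillions of tokens, with the gradients delivered by backpropagation.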
Another historical irony: the mechanism that allows a modern LLM to shrink its error step by step is built from two ideas born in completely different worlds. The derivative, invented to describe celestial motion and the fall of objects, meets the optimization methods designed to solve engineering and economic problems. None of these thinkers imagined models corrected by a Loss, yet the math created for understanding a falling apple now helps explain how an error “falls” in weight space. The derivative tells the model what the change looks like; Gradient Descent tells it where to move — two ancient worlds fused inside the LLM’s learning machinery.
Statistics and Probability — The Art of the Intelligent Guess
Even after representing words as vectors and learning from errors through gradients, the model still faces the most basic question: what comes next? An LLM never knows the answer. It estimates it. Every step of text generation is probabilistic: the model constructs a distribution over all possible tokens in its vocabulary.
This knowledge doesn’t emerge from thin air. It comes from the weights internalized during training. Each time the model was badly surprised by the true token, its weights shifted sharply; each time its predictions were already close to the mark, they barely moved. After billions of such iterations, the weights encode a grand map of linguistic tendencies: which words co-occur, which fit certain contexts, and which rarely appear together. Thus, when the model continues the phrase,
“The scientist entered the laboratory and…”
its internal map nudges probabilities toward words like “examined” or “activated,” not “napped.” The probabilities come directly from the model’s internal state — its weights and the relationships they encode.
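As a toy illustration, here is that final step in code: a handful of invented logit scores for a five-word vocabulary, pushed through a softmax to become a probability distribution. In a real model the logits are produced by the network’s weights acting on the entire context.

```python
# A toy next-token distribution; the words and scores are invented.
import numpy as np

vocab  = ["examined", "activated", "napped", "the", "quietly"]
logits = np.array([3.1, 2.7, -1.5, 0.4, 0.9])    # hypothetical raw scores for this context

probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax: scores -> probabilities

for word, p in sorted(zip(vocab, probs), key=lambda t: -t[1]):
    print(f"{word:>10}  {p:.3f}")                # "examined" and "activated" dominate
```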
The intellectual roots here stretch centuries back. Thomas Bayes and Pierre-Simon Laplace explored how to infer unseen truths from partial information; Ronald Fisher and Karl Pearson formalized the statistical tools for measuring variance, correlation and inference. Their ideas flow directly into today’s models: Cross-Entropy, which measures how surprised the model was by the true token, and sampling mechanisms like Temperature or Top-k all stand on the statistical foundations those mathematicians built.
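Temperature and top-k are easy to see on that same toy distribution. The sampler below is an illustrative stand-in rather than the implementation of any particular library: temperature rescales the logits before the softmax, and top-k discards everything outside the k most likely tokens.

```python
# A toy sampler showing temperature scaling and top-k filtering.
import numpy as np

rng = np.random.default_rng(7)
vocab  = ["examined", "activated", "napped", "the", "quietly"]
logits = np.array([3.1, 2.7, -1.5, 0.4, 0.9])

def sample(logits, temperature=1.0, top_k=None):
    z = logits / temperature                  # <1 sharpens the distribution, >1 flattens it
    if top_k is not None:
        cutoff = np.sort(z)[-top_k]           # keep only the k highest-scoring tokens
        z = np.where(z >= cutoff, z, -np.inf)
    p = np.exp(z - z.max())
    p /= p.sum()
    return vocab[rng.choice(len(vocab), p=p)]

print(sample(logits, temperature=0.3, top_k=2))   # conservative: usually "examined"
print(sample(logits, temperature=1.5))            # adventurous: rarer words appear more often
```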
And so another irony emerges: mathematics born from the study of dice, coins and gambling tables now drives machines built to understand nuance, intent and meaning. From the candlelit rooms of 18th-century statisticians to the GPU clusters of modern AI — probability continues to do what it does best: make smart guesses.
Information Theory — Measuring Not Just Error, but Surprise
When an LLM predicts the next word, its outcome is not simply “right or wrong.” The moment it sees the true token, it measures how surprised it is. If the true word was one of the likely options, the surprise is low; if it was something the model deemed nearly impossible, the surprise is high. This measure of surprise is at the heart of Cross-Entropy: a score that tells the model not only that it was wrong, but how far reality was from expectation — and from that, the gradients determine how to adjust the weights.
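In code, that surprise is nothing more than the negative logarithm of the probability the model assigned to the token that actually appeared. The two probabilities below are invented to show the contrast.

```python
# Surprise (self-information) of a likely versus an unlikely true token.
import numpy as np

p_expected   = 0.40    # the true word was one of the likely options
p_unexpected = 0.001   # the true word was deemed nearly impossible

print(-np.log(p_expected))    # ~0.92 nats: low surprise, weak learning signal
print(-np.log(p_unexpected))  # ~6.91 nats: high surprise, strong learning signal
```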
The idea comes from Claude Shannon’s information theory, born at Bell Labs in the 1940s out of the problem of transmitting messages through noisy phone lines. Shannon realized that a surprising message carries more information than an expected one. The same principle migrated directly into LLMs: an unexpected token teaches the model a lot; an expected one teaches little. That is the sense in which this is a theory of “information”: it quantifies how much uncertainty drops when the message (for an LLM, the true token) is revealed, which makes it the natural foundation for measuring learning in modern models.
Optimization Theory — How an LLM Finds Its Way Through a Mountain of Weights
Even after the model can measure surprise, compute gradients, and understand how each weight contributes to error, the hardest problem remains: how to navigate a space of billions of parameters where each point represents a different model. Optimization theory provides the principle that lets the model improve step by step. Optimization doesn’t ask, “Why did I err?” but “In which direction should I move to err less?”
The core idea is Gradient Descent. When the model predicts a token, computes its surprise, and derives the overall error, it uses gradients to determine how a tiny shift in each weight would change that error. Updating the weights in the direction that decreases the Loss creates a slow but steady journey across parameter space, where each small step reduces accumulated surprise. Training is nothing more than this journey; at inference time the weights are fixed.
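Stripped of language entirely, the journey looks like this: a sketch of gradient descent on a two-parameter function that stands in for the billions-of-parameter loss surface. The function, starting point and learning rate are arbitrary choices for illustration.

```python
# Gradient descent on a simple bowl-shaped function with its minimum at (3, -1).
import numpy as np

def f(w):                         # the "loss surface"
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def grad_f(w):                    # its gradient: the direction of steepest ascent
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

w = np.array([-4.0, 5.0])         # an arbitrary starting point in "weight space"
lr = 0.1                          # learning rate: the size of each small step

for step in range(100):
    w -= lr * grad_f(w)           # move against the slope, so f(w) shrinks

print(np.round(w, 3), round(f(w), 6))   # converges to roughly [3, -1], where the loss is ~0
```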
What’s remarkable is that optimization was never invented for AI. The idea of descending along a slope to find a minimum appeared in the 19th century in the study of analytical and physical systems, and later found uses in statistics, economics and control theory. The mathematicians behind these ideas — from Lagrange’s analytical methods to the Karush-Kuhn-Tucker conditions — never imagined training models with 100 billion weights. What began as a method for finding an “optimal value” in a mathematical function has become the engine of deep learning. Optimization, built originally to solve engineering problems, is now the tool that lets LLMs move through their errors, converge toward the statistical truth of language, and rebuild their understanding step by step.
