<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>ntdim's Blog on ntdim</title><link>https://blog.ppid.ru/en/</link><description>Recent content in ntdim's Blog on ntdim</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.ppid.ru/en/index.xml" rel="self" type="application/rss+xml"/><item><title>The History of LLM (Large Language Models)</title><link>https://blog.ppid.ru/en/blog/history-of-llm/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://blog.ppid.ru/en/blog/history-of-llm/</guid><description>&lt;p>The history of LLMs does not begin with the neural networks of the 2010s, but much earlier — with the fundamental idea of modeling language as a probabilistic sequence. Below, I will break it down step by step, from the very first concepts to modern LLMs, with exact dates and key figures.&lt;/p>
&lt;h2 id="1-the-idea-probabilistic-language-modeling-early-20th-century--1950s">1. The Idea: Probabilistic Language Modeling (Early 20th Century — 1950s)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>1913&lt;/strong>: Russian mathematician &lt;strong>Andrey Markov&lt;/strong> was the first to apply Markov chains to text analysis (Pushkin&amp;rsquo;s poem &amp;ldquo;Eugene Onegin&amp;rdquo;). This laid the foundation for &lt;strong>n-gram models&lt;/strong> — the idea that the probability of the next character/word depends on several preceding ones.&lt;/li>
&lt;li>&lt;strong>1948–1951&lt;/strong>: &lt;strong>Claude Shannon&lt;/strong> (founder of information theory) used n-grams to estimate the &amp;ldquo;predictability&amp;rdquo; (entropy) of the English language. He demonstrated that even simple statistical models can generate coherent text.&lt;/li>
&lt;li>&lt;strong>1950&lt;/strong>: &lt;strong>Alan Turing&lt;/strong>, in his paper &amp;ldquo;Computing Machinery and Intelligence&amp;rdquo;, posed the question of machine understanding of language (the Turing test). This is the philosophical foundation of the entire field.&lt;/li>
&lt;/ul>
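&lt;p>The n-gram idea above can be sketched in a few lines of Python: a toy bigram model that counts which word follows which and samples the next word from those counts (the corpus is invented for illustration).&lt;/p>

```python
import random
from collections import defaultdict

# Invented toy corpus; a real n-gram model would be trained on a large text.
text = "the cat sat on the mat and the cat sat on the mat"
words = text.split()

# Count how often each word follows each word.
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1

def next_word(prev):
    # P(next | prev) is the normalized count -- the essence of an n-gram model.
    options = counts[prev]
    r = random.uniform(0, sum(options.values()))
    for word, c in options.items():
        r -= c
        if r <= 0:
            return word

random.seed(0)
out = ["the"]
for _ in range(5):
    out.append(next_word(out[-1]))
print(" ".join(out))
```

&lt;p>Even this tiny model produces locally plausible word order, which is exactly what Shannon demonstrated at the character and word level.&lt;/p>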
&lt;h2 id="2-first-practical-systems-1950s1960s">2. First Practical Systems (1950s–1960s)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>1954&lt;/strong>: Researchers at &lt;strong>IBM&lt;/strong> and Georgetown University created the first machine translation system (Russian → English). This was a purely &lt;strong>rule-based&lt;/strong> approach, with no statistics involved.&lt;/li>
&lt;li>&lt;strong>1966&lt;/strong>: &lt;strong>Joseph Weizenbaum&lt;/strong> (MIT) developed &lt;strong>ELIZA&lt;/strong> — the first program simulating a conversation (a psychotherapist). It operated on simple pattern-matching templates and became the ancestor of chatbots.&lt;/li>
&lt;/ul>
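&lt;p>The template approach behind ELIZA can be sketched with a few regular-expression rules; the rules below are invented illustrations in the spirit of the DOCTOR script, not Weizenbaum&amp;rsquo;s originals.&lt;/p>

```python
import re

# Invented rules: match a pattern, echo part of the input back as a question.
rules = [
    (r"i am (.*)", "Why do you say you are {}?"),
    (r"i feel (.*)", "Why do you feel {}?"),
    (r".*", "Please tell me more."),          # catch-all fallback
]

def respond(utterance):
    text = utterance.lower().strip(".!?")
    for pattern, template in rules:
        m = re.fullmatch(pattern, text)
        if m:
            return template.format(*m.groups())

print(respond("I am tired."))   # -> Why do you say you are tired?
print(respond("Hello there."))  # -> Please tell me more.
```

&lt;p>There is no understanding here at all, only string reflection, yet users famously attributed empathy to the program.&lt;/p>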
&lt;h2 id="3-statistical-language-models-slm-and-neural-networks-1980s2000s">3. Statistical Language Models (SLM) and Neural Networks (1980s–2000s)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>1986&lt;/strong>: &lt;strong>David Rumelhart&lt;/strong>, &lt;strong>Geoffrey Hinton&lt;/strong>, and &lt;strong>Ronald Williams&lt;/strong> popularized the &lt;strong>backpropagation&lt;/strong> algorithm for training multi-layer neural networks. Without this tool, modern LLMs would be impossible.&lt;/li>
&lt;li>&lt;strong>1990&lt;/strong>: &lt;strong>Jeffrey Elman&lt;/strong> introduced the &lt;strong>Simple Recurrent Network (SRN)&lt;/strong> — an early recurrent neural network (RNN) capable of accounting for word order over time.&lt;/li>
&lt;li>&lt;strong>1997&lt;/strong>: &lt;strong>Sepp Hochreiter&lt;/strong> and &lt;strong>Jürgen Schmidhuber&lt;/strong> invented &lt;strong>LSTM&lt;/strong> (Long Short-Term Memory) — an improved RNN that solved the &amp;ldquo;vanishing gradient&amp;rdquo; problem and could retain long-range dependencies in text. LSTM became the standard for the next 20 years.&lt;/li>
&lt;li>&lt;strong>2003&lt;/strong>: &lt;strong>Yoshua Bengio&lt;/strong> (with co-authors Réjean Ducharme, Pascal Vincent, and Christian Jauvin) published the seminal paper &lt;em>&amp;ldquo;A Neural Probabilistic Language Model&amp;rdquo;&lt;/em>. This was &lt;strong>the first neural language model&lt;/strong>: it replaced rigid n-gram tables with &lt;strong>distributed word representations&lt;/strong> (word embeddings), training them jointly with the network. This is where the idea was born that words are vectors in a high-dimensional space.&lt;/li>
&lt;/ul>
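&lt;p>The core move of Bengio&amp;rsquo;s model, replacing n-gram lookup tables with learned embeddings fed through a network, can be sketched with NumPy. The weights below are random and untrained, and the toy vocabulary is invented; training would adjust the embeddings and the output layer jointly.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy vocabulary; each word gets a dense vector (an embedding).
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 4
embeddings = rng.normal(size=(len(vocab), dim))   # one row per word

# Score the next word from the concatenated embeddings of a 2-word context
# through a single linear layer (random here; learned in the real model).
W = rng.normal(size=(2 * dim, len(vocab)))

def next_word_probs(w1, w2):
    context = np.concatenate([embeddings[vocab.index(w1)],
                              embeddings[vocab.index(w2)]])
    logits = context @ W
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()                # probability over the vocabulary

p = next_word_probs("the", "cat")
print(dict(zip(vocab, p.round(3))))
```

&lt;p>Because similar words end up with similar vectors, the model generalizes to word combinations it never saw — something a rigid n-gram table cannot do.&lt;/p>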
&lt;h2 id="4-the-rise-of-embeddings-the-gpu-revolution-and-attention-20122017">4. The Rise of Embeddings, the GPU Revolution, and Attention (2012–2017)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>2012&lt;/strong>: The &lt;strong>AlexNet&lt;/strong> team (Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton) demonstrated that deep convolutional neural networks can be efficiently trained on graphics processing units (GPUs). This opened the door to large-scale training in NLP.&lt;/li>
&lt;li>&lt;strong>2013&lt;/strong>: &lt;strong>Tomas Mikolov&lt;/strong> (with the Google team: Kai Chen, Greg Corrado, Jeffrey Dean) released &lt;strong>Word2Vec&lt;/strong>. The model radically simplified and accelerated Bengio&amp;rsquo;s approach. It showed that word vectors capture deep semantics (the classic example: &lt;em>&amp;ldquo;king − man + woman ≈ queen&amp;rdquo;&lt;/em>).&lt;/li>
&lt;li>&lt;strong>2014&lt;/strong>: &lt;strong>Dzmitry Bahdanau&lt;/strong>, &lt;strong>Kyunghyun Cho&lt;/strong>, and &lt;strong>Yoshua Bengio&lt;/strong> introduced the &lt;strong>attention mechanism&lt;/strong> in seq2seq models for machine translation. Instead of compressing the entire input sentence into a single vector, the network learned to &amp;ldquo;look at&amp;rdquo; specific important words in the source text.&lt;/li>
&lt;/ul>
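&lt;p>The famous analogy is plain vector arithmetic; the sketch below uses hand-crafted 2-D toy vectors so the arithmetic is visible. Real Word2Vec vectors are learned and have hundreds of dimensions, but the operation is the same.&lt;/p>

```python
import numpy as np

# Invented 2-D toy vectors: one axis loosely encodes "royalty", one "gender".
vecs = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.9, 0.1]),
}

def closest(v, exclude):
    # Highest cosine similarity among words not in `exclude`.
    return max((w for w in vecs if w not in exclude),
               key=lambda w: v @ vecs[w]
                             / (np.linalg.norm(v) * np.linalg.norm(vecs[w])))

result = vecs["king"] - vecs["man"] + vecs["woman"]
print(closest(result, exclude={"king", "man", "woman"}))  # -> queen
```

&lt;p>Subtracting &amp;ldquo;man&amp;rdquo; removes the gender component of &amp;ldquo;king&amp;rdquo;, adding &amp;ldquo;woman&amp;rdquo; puts the opposite one back, and the nearest remaining vector is &amp;ldquo;queen&amp;rdquo;.&lt;/p>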
&lt;h2 id="5-the-birth-of-the-transformer-and-the-llm-era-2017--present">5. The Birth of the Transformer and the LLM Era (2017 — Present)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>June 12, 2017&lt;/strong>: The Google Brain team (&lt;strong>Ashish Vaswani&lt;/strong>, &lt;strong>Noam Shazeer&lt;/strong>, &lt;strong>Niki Parmar&lt;/strong>, &lt;strong>Jakob Uszkoreit&lt;/strong>, &lt;strong>Llion Jones&lt;/strong>, &lt;strong>Aidan Gomez&lt;/strong>, &lt;strong>Łukasz Kaiser&lt;/strong>, &lt;strong>Illia Polosukhin&lt;/strong>) published the paper &lt;strong>&amp;ldquo;Attention Is All You Need&amp;rdquo;&lt;/strong> on arXiv. They proposed the &lt;strong>Transformer architecture&lt;/strong> — completely abandoning slow recurrent layers in favor of a parallelizable attention mechanism. This became the absolute foundation of &lt;strong>all modern LLMs&lt;/strong>.&lt;/li>
&lt;/ul>
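&lt;p>The scaled dot-product attention at the heart of the Transformer is a remarkably compact operation; a minimal NumPy sketch on random toy inputs:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights

# 4 token positions, d_k = d_v = 8. Every position attends to every other in
# one matrix product -- no recurrence, hence fully parallelizable on a GPU.
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # each position gets a weighted mix of all value vectors
```

&lt;p>That this one operation, stacked and combined with feed-forward layers, suffices is exactly what the paper&amp;rsquo;s title claims.&lt;/p>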
&lt;p>From 2018 onward, explosive growth began:&lt;/p></description></item></channel></rss>