
The history of LLMs does not begin with the neural networks of the 2010s, but much earlier — with the fundamental idea of modeling language as a probabilistic sequence. Below, I will break it down step by step, from the very first concepts to modern LLMs, with exact dates and key figures.

1. The Idea: Probabilistic Language Modeling (Early 20th Century — 1950s)

  • 1913: Russian mathematician Andrey Markov was the first to apply Markov chains to text analysis (Pushkin’s novel in verse “Eugene Onegin”). This laid the foundation for n-gram models — the idea that the probability of the next character/word depends on several preceding ones.
  • 1948–1951: Claude Shannon (founder of information theory) used n-grams to estimate the “predictability” (entropy) of the English language. He demonstrated that even simple statistical models can generate coherent text.
  • 1950: Alan Turing, in his paper “Computing Machinery and Intelligence”, posed the question of machine understanding of language (the Turing test). This is the philosophical foundation of the entire field.
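The Markov/Shannon idea above can be sketched as a tiny bigram character model. This is a toy illustration of the principle, not any historical implementation; the corpus and function names are made up for the example:

```python
import random
from collections import defaultdict, Counter

def train_bigram(text):
    """Count, for each character, which characters follow it."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, n, seed=0):
    """Sample n characters, each conditioned only on the previous one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        following = counts.get(out[-1])
        if not following:
            break
        chars, weights = zip(*following.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

corpus = "the cat sat on the mat and the cat ran"
model = train_bigram(corpus)
print(generate(model, "t", 20, seed=1))
```

The output is locally plausible gibberish — exactly Shannon’s observation: even a memory-of-one statistical model captures some of the texture of a language.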

2. First Practical Systems (1950s–1960s)

  • 1954: Researchers at IBM and Georgetown University created the first machine translation system (Russian → English). This was a purely rule-based approach, with no statistics involved.
  • 1966: Joseph Weizenbaum (MIT) developed ELIZA — the first program simulating a conversation (a psychotherapist). It operated on simple pattern-matching templates and became the ancestor of chatbots.
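ELIZA’s template approach can be sketched in a few lines. The rules below are invented for illustration; Weizenbaum’s actual DOCTOR script was far richer, with keyword ranking and pronoun reflection:

```python
import re

# Toy (pattern, response) rules in the spirit of ELIZA's templates.
RULES = [
    (re.compile(r"i am (.*)", re.I), "Why do you say you are {}?"),
    (re.compile(r"i feel (.*)", re.I), "How long have you felt {}?"),
    (re.compile(r".*"), "Please, go on."),  # catch-all fallback
]

def respond(utterance):
    """Return the response template of the first matching rule."""
    for pattern, template in RULES:
        m = pattern.match(utterance.strip())
        if m:
            return template.format(*m.groups())
    return "Please, go on."

print(respond("I am tired"))  # Why do you say you are tired?
```

No statistics, no learning — just surface pattern matching, which is why ELIZA’s apparent “understanding” famously surprised even Weizenbaum.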

3. Statistical Language Models (SLM) and Neural Networks (1980s–2000s)

  • 1986: David Rumelhart, Geoffrey Hinton, and Ronald Williams popularized the backpropagation algorithm for training multi-layer neural networks. Without this tool, modern LLMs would be impossible.
  • 1990: Jeffrey Elman introduced the Simple Recurrent Network (SRN) — an early and influential recurrent neural network (RNN) that processed sentences word by word, carrying context forward through a hidden state.
  • 1997: Sepp Hochreiter and Jürgen Schmidhuber invented LSTM (Long Short-Term Memory) — an improved RNN that solved the “vanishing gradient” problem and could retain long-range dependencies in text. LSTM became the standard for the next 20 years.
  • 2003: Yoshua Bengio (with co-authors Réjean Ducharme, Pascal Vincent, and Christian Jauvin) published the seminal paper “A Neural Probabilistic Language Model”. This was the first neural language model: it replaced rigid n-gram tables with distributed word representations (word embeddings), training them jointly with the network. This is where the idea was born that words are vectors in a high-dimensional space.
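The core idea of Bengio et al. (2003) can be sketched as follows: each word maps to a learned vector in a shared embedding table, the context vectors are concatenated, and a network layer turns them into a probability distribution over the next word. This is a minimal illustration with random weights and toy shapes, not the paper’s exact architecture (which also had a hidden tanh layer):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, d, context = len(vocab), 8, 2           # vocab size, embedding dim, context length

C = rng.normal(size=(V, d))                # embedding table: one trainable vector per word
W = rng.normal(size=(context * d, V))      # projection from concatenated context to vocab logits
b = np.zeros(V)

def next_word_probs(context_ids):
    """Concatenate the context embeddings and apply a softmax output layer."""
    x = C[context_ids].reshape(-1)         # (context*d,) distributed representation
    logits = x @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = next_word_probs([vocab.index("the"), vocab.index("cat")])
print({w: round(float(pi), 3) for w, pi in zip(vocab, p)})
```

The key departure from n-grams: instead of a sparse count table indexed by exact word sequences, both `C` and `W` are dense parameters trained jointly by gradient descent, so similar words end up with similar vectors and share statistical strength.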

4. The Rise of Embeddings, the GPU Revolution, and Attention (2012–2017)

  • 2012: The AlexNet team (Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton) demonstrated that deep convolutional neural networks can be efficiently trained on graphics processing units (GPUs). This opened the door to large-scale training in NLP.
  • 2013: Tomas Mikolov (with the Google team: Kai Chen, Greg Corrado, Jeffrey Dean) released Word2Vec. The model radically simplified and accelerated Bengio’s approach. It showed that word vectors capture deep semantics (the classic example: “king − man + woman ≈ queen”).
  • 2014: Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio introduced the attention mechanism in seq2seq models for machine translation. Instead of compressing the entire input sentence into a single fixed-size vector, the network learned to “look at” specific important words in the source text while generating each output word.
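The famous “king − man + woman ≈ queen” arithmetic can be demonstrated with hand-crafted toy vectors. Real Word2Vec embeddings are learned and high-dimensional; here the two axes (royalty, gender) are chosen by hand purely to show the mechanism:

```python
import numpy as np

# Toy 2-D vectors on interpretable axes: (royalty, gender).
vecs = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman: swap the gender component while keeping royalty.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w != "king"), key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```

Excluding the query word itself from the candidates mirrors standard practice in analogy evaluation, since the nearest neighbor of the result is often the query word.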

5. The Birth of the Transformer and the LLM Era (2017 — Present)

  • June 12, 2017: The Google Brain team (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, Illia Polosukhin) published the paper “Attention Is All You Need” on arXiv. They proposed the Transformer architecture — completely abandoning slow recurrent layers in favor of a parallelizable attention mechanism. This became the absolute foundation of all modern LLMs.
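At the heart of the Transformer is scaled dot-product attention, defined in the paper as softmax(QKᵀ/√d_k)V. A minimal numpy sketch (random inputs, illustrative shapes) shows why it parallelizes so well — every query attends to all keys in one matrix multiplication, with no step-by-step recurrence:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the paper's Equation 1, without masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: each row sums to 1
    return weights @ V                                # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query positions, d_k = 4
K = rng.normal(size=(5, 4))   # 5 key positions
V = rng.normal(size=(5, 4))   # one value vector per key
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

The full architecture stacks this in multi-head form with feed-forward layers, residual connections, and positional encodings, but the parallel all-pairs attention above is the piece that replaced recurrence.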

From 2018 onward, explosive growth began:

Year | Model / Event     | Authors / Company            | Key Contribution                                                                          | Parameters
2018 | BERT              | Jacob Devlin et al. (Google) | Bidirectional pre-training (masked language modeling)                                     | 340M
2018 | GPT-1             | Alec Radford et al. (OpenAI) | Generative Pre-Training                                                                   | 117M
2019 | GPT-2             | OpenAI                       | Model scaling + coherent text generation                                                  | 1.5B
2020 | GPT-3             | OpenAI                       | Few-shot learning (learning from examples in the prompt)                                  | 175B
2022 | ChatGPT (GPT-3.5) | OpenAI                       | RLHF integration (Reinforcement Learning from Human Feedback) — opened access to millions | —
2023 | GPT-4             | OpenAI                       | Multimodality (text + images), rumored MoE-based architecture (Mixture of Experts)        | ~1.8T (estimated)

Next, open-weight models flooded the market: LLaMA (Meta, 2023), Mistral, Grok (xAI), and hundreds of others. By 2024, LLMs had fully evolved into multimodal systems (Gemini 1.5, Claude 3) capable of processing video and audio, maintaining millions of tokens in context, performing complex logical reasoning, and operating as autonomous agents.


Brief Summary of Evolution

  • The idea → probabilistic language model (Markov, Shannon).
  • Statistics → classical n-grams (peak in the 1980s–1990s).
  • Neural networks → distributed word representations (Bengio 2003, Word2Vec 2013).
  • Attention → Transformer (2017) — an architectural revolution at scale.
  • Scaling (data + GPU compute + parameters) + RLHF → modern LLMs.

Each step was the result of work by hundreds of researchers, but key breakthroughs always occurred when someone proposed a simpler, parallelizable architecture and found the computational resources to train it. Today, LLMs are not just “large models” — they are the direct continuation of a more than 70-year-old human dream of a machine that understands our language.
