<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>ntdim's Blog on ntdim</title><link>https://blog.ppid.ru/en/</link><description>Recent content in ntdim's Blog on ntdim</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.ppid.ru/en/index.xml" rel="self" type="application/rss+xml"/><item><title>The History of LLM (Large Language Models)</title><link>https://blog.ppid.ru/en/blog/history-of-llm/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://blog.ppid.ru/en/blog/history-of-llm/</guid><description>&lt;p>The history of LLMs does not begin with the neural networks of the 2010s, but much earlier — with the fundamental idea of modeling language as a probabilistic sequence. Below, I will break it down step by step, from the very first concepts to modern LLMs, with exact dates and key figures.&lt;/p>
&lt;h2 id="1-the-idea-probabilistic-language-modeling-early-20th-century--1950s">1. The Idea: Probabilistic Language Modeling (Early 20th Century — 1950s)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>1913&lt;/strong>: Russian mathematician &lt;strong>Andrey Markov&lt;/strong> was the first to apply Markov chains to text analysis (Pushkin&amp;rsquo;s poem &amp;ldquo;Eugene Onegin&amp;rdquo;). This laid the foundation for &lt;strong>n-gram models&lt;/strong> — the idea that the probability of the next character/word depends on several preceding ones.&lt;/li>
&lt;li>&lt;strong>1948–1951&lt;/strong>: &lt;strong>Claude Shannon&lt;/strong> (founder of information theory) used n-grams to estimate the &amp;ldquo;predictability&amp;rdquo; (entropy) of the English language. He demonstrated that even simple statistical models can generate coherent text.&lt;/li>
&lt;li>&lt;strong>1950&lt;/strong>: &lt;strong>Alan Turing&lt;/strong>, in his paper &amp;ldquo;Computing Machinery and Intelligence&amp;rdquo;, posed the question of machine understanding of language (the Turing test). This is the philosophical foundation of the entire field.&lt;/li>
&lt;/ul>
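&lt;p>The n-gram idea above can be sketched in a few lines of Python: a toy bigram model that counts which word follows which and samples the next word from those counts (the corpus is invented for illustration).&lt;/p>

```python
import random
from collections import defaultdict

# Invented toy corpus; a real n-gram model would be trained on a large text.
text = "the cat sat on the mat and the cat sat on the mat"
words = text.split()

# Count how often each word follows each word.
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1

def next_word(prev):
    # P(next | prev) is the normalized count -- the essence of an n-gram model.
    options = counts[prev]
    r = random.uniform(0, sum(options.values()))
    for word, c in options.items():
        r -= c
        if r <= 0:
            return word

random.seed(0)
out = ["the"]
for _ in range(5):
    out.append(next_word(out[-1]))
print(" ".join(out))
```

&lt;p>Even this tiny model produces locally plausible word order, which is exactly what Shannon demonstrated at the character and word level.&lt;/p>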
&lt;h2 id="2-first-practical-systems-1950s1960s">2. First Practical Systems (1950s–1960s)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>1954&lt;/strong>: Researchers at &lt;strong>IBM&lt;/strong> and Georgetown University created the first machine translation system (Russian → English). This was a purely &lt;strong>rule-based&lt;/strong> approach, with no statistics involved.&lt;/li>
&lt;li>&lt;strong>1966&lt;/strong>: &lt;strong>Joseph Weizenbaum&lt;/strong> (MIT) developed &lt;strong>ELIZA&lt;/strong> — the first program simulating a conversation (a psychotherapist). It operated on simple pattern-matching templates and became the ancestor of chatbots.&lt;/li>
&lt;/ul>
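&lt;p>The template approach behind ELIZA can be sketched with a few regular-expression rules; the rules below are invented illustrations in the spirit of the DOCTOR script, not Weizenbaum&amp;rsquo;s originals.&lt;/p>

```python
import re

# Invented rules: match a pattern, echo part of the input back as a question.
rules = [
    (r"i am (.*)", "Why do you say you are {}?"),
    (r"i feel (.*)", "Why do you feel {}?"),
    (r".*", "Please tell me more."),          # catch-all fallback
]

def respond(utterance):
    text = utterance.lower().strip(".!?")
    for pattern, template in rules:
        m = re.fullmatch(pattern, text)
        if m:
            return template.format(*m.groups())

print(respond("I am tired."))   # -> Why do you say you are tired?
print(respond("Hello there."))  # -> Please tell me more.
```

&lt;p>There is no understanding here at all, only string reflection, yet users famously attributed empathy to the program.&lt;/p>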
&lt;h2 id="3-statistical-language-models-slm-and-neural-networks-1980s2000s">3. Statistical Language Models (SLM) and Neural Networks (1980s–2000s)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>1986&lt;/strong>: &lt;strong>David Rumelhart&lt;/strong>, &lt;strong>Geoffrey Hinton&lt;/strong>, and &lt;strong>Ronald Williams&lt;/strong> popularized the &lt;strong>backpropagation&lt;/strong> algorithm for training multi-layer neural networks. Without this tool, modern LLMs would be impossible.&lt;/li>
&lt;li>&lt;strong>1990&lt;/strong>: &lt;strong>Jeffrey Elman&lt;/strong> introduced the &lt;strong>Simple Recurrent Network (SRN)&lt;/strong> — an early recurrent neural network (RNN) capable of accounting for word order over time.&lt;/li>
&lt;li>&lt;strong>1997&lt;/strong>: &lt;strong>Sepp Hochreiter&lt;/strong> and &lt;strong>Jürgen Schmidhuber&lt;/strong> invented &lt;strong>LSTM&lt;/strong> (Long Short-Term Memory) — an improved RNN that solved the &amp;ldquo;vanishing gradient&amp;rdquo; problem and could retain long-range dependencies in text. LSTM became the standard for the next 20 years.&lt;/li>
&lt;li>&lt;strong>2003&lt;/strong>: &lt;strong>Yoshua Bengio&lt;/strong> (with co-authors Réjean Ducharme, Pascal Vincent, and Christian Jauvin) published the seminal paper &lt;em>&amp;ldquo;A Neural Probabilistic Language Model&amp;rdquo;&lt;/em>. This was &lt;strong>the first neural language model&lt;/strong>: it replaced rigid n-gram tables with &lt;strong>distributed word representations&lt;/strong> (word embeddings), training them jointly with the network. This is where the idea was born that words are vectors in a high-dimensional space.&lt;/li>
&lt;/ul>
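&lt;p>The core move of Bengio&amp;rsquo;s model, replacing n-gram lookup tables with learned embeddings fed through a network, can be sketched with NumPy. The weights below are random and untrained, and the toy vocabulary is invented; training would adjust the embeddings and the output layer jointly.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy vocabulary; each word gets a dense vector (an embedding).
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 4
embeddings = rng.normal(size=(len(vocab), dim))   # one row per word

# Score the next word from the concatenated embeddings of a 2-word context
# through a single linear layer (random here; learned in the real model).
W = rng.normal(size=(2 * dim, len(vocab)))

def next_word_probs(w1, w2):
    context = np.concatenate([embeddings[vocab.index(w1)],
                              embeddings[vocab.index(w2)]])
    logits = context @ W
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()                # probability over the vocabulary

p = next_word_probs("the", "cat")
print(dict(zip(vocab, p.round(3))))
```

&lt;p>Because similar words end up with similar vectors, the model generalizes to word combinations it never saw — something a rigid n-gram table cannot do.&lt;/p>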
&lt;h2 id="4-the-rise-of-embeddings-the-gpu-revolution-and-attention-20122017">4. The Rise of Embeddings, the GPU Revolution, and Attention (2012–2017)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>2012&lt;/strong>: The &lt;strong>AlexNet&lt;/strong> team (Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton) demonstrated that deep convolutional neural networks can be efficiently trained on graphics processing units (GPUs). This opened the door to large-scale training in NLP.&lt;/li>
&lt;li>&lt;strong>2013&lt;/strong>: &lt;strong>Tomas Mikolov&lt;/strong> (with the Google team: Kai Chen, Greg Corrado, Jeffrey Dean) released &lt;strong>Word2Vec&lt;/strong>. The model radically simplified and accelerated Bengio&amp;rsquo;s approach. It showed that word vectors capture deep semantics (the classic example: &lt;em>&amp;ldquo;king − man + woman ≈ queen&amp;rdquo;&lt;/em>).&lt;/li>
&lt;li>&lt;strong>2014&lt;/strong>: &lt;strong>Dzmitry Bahdanau&lt;/strong>, &lt;strong>Kyunghyun Cho&lt;/strong>, and &lt;strong>Yoshua Bengio&lt;/strong> introduced the &lt;strong>attention mechanism&lt;/strong> in seq2seq models for machine translation. Instead of compressing the entire input sentence into a single vector, the network learned to &amp;ldquo;look at&amp;rdquo; specific important words in the source text.&lt;/li>
&lt;/ul>
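&lt;p>The famous analogy is plain vector arithmetic; the sketch below uses hand-crafted 2-D toy vectors so the arithmetic is visible. Real Word2Vec vectors are learned and have hundreds of dimensions, but the operation is the same.&lt;/p>

```python
import numpy as np

# Invented 2-D toy vectors: one axis loosely encodes "royalty", one "gender".
vecs = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.9, 0.1]),
}

def closest(v, exclude):
    # Highest cosine similarity among words not in `exclude`.
    return max((w for w in vecs if w not in exclude),
               key=lambda w: v @ vecs[w]
                             / (np.linalg.norm(v) * np.linalg.norm(vecs[w])))

result = vecs["king"] - vecs["man"] + vecs["woman"]
print(closest(result, exclude={"king", "man", "woman"}))  # -> queen
```

&lt;p>Subtracting &amp;ldquo;man&amp;rdquo; removes the gender component of &amp;ldquo;king&amp;rdquo;, adding &amp;ldquo;woman&amp;rdquo; puts the opposite one back, and the nearest remaining vector is &amp;ldquo;queen&amp;rdquo;.&lt;/p>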
&lt;h2 id="5-the-birth-of-the-transformer-and-the-llm-era-2017--present">5. The Birth of the Transformer and the LLM Era (2017 — Present)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>June 12, 2017&lt;/strong>: The Google Brain team (&lt;strong>Ashish Vaswani&lt;/strong>, &lt;strong>Noam Shazeer&lt;/strong>, &lt;strong>Niki Parmar&lt;/strong>, &lt;strong>Jakob Uszkoreit&lt;/strong>, &lt;strong>Llion Jones&lt;/strong>, &lt;strong>Aidan Gomez&lt;/strong>, &lt;strong>Łukasz Kaiser&lt;/strong>, &lt;strong>Illia Polosukhin&lt;/strong>) published the paper &lt;strong>&amp;ldquo;Attention Is All You Need&amp;rdquo;&lt;/strong> on arXiv. They proposed the &lt;strong>Transformer architecture&lt;/strong> — completely abandoning slow recurrent layers in favor of a parallelizable attention mechanism. This became the absolute foundation of &lt;strong>all modern LLMs&lt;/strong>.&lt;/li>
&lt;/ul>
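&lt;p>The scaled dot-product attention at the heart of the Transformer is a remarkably compact operation; a minimal NumPy sketch on random toy inputs:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights

# 4 token positions, d_k = d_v = 8. Every position attends to every other in
# one matrix product -- no recurrence, hence fully parallelizable on a GPU.
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # each position gets a weighted mix of all value vectors
```

&lt;p>That this one operation, stacked and combined with feed-forward layers, suffices is exactly what the paper&amp;rsquo;s title claims.&lt;/p>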
&lt;p>From 2018 onward, explosive growth began:&lt;/p></description></item></channel></rss>