(Read time: 6 minutes)
Transformers are the tech behind ChatGPT, Claude, and basically every LLM you’ve heard of.
They don’t understand language like humans, but they feel like they do, thanks to something called “attention.”
Transformers are fast, context-aware, and built to handle long text like a boss.
Last time, we explored how AI models learn. Today, we get to the juicy part.
Ever typed a weirdly specific prompt into ChatGPT and thought,
“Wait… how the hell did it get me?”
Same.
It almost feels like magic. But the answer is less Hogwarts, more math—and it all starts with Transformer models.
Before transformers, AI models relied on RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory). These were solid for their time but had a fatal flaw—terrible memory.
Imagine trying to understand a sentence but forgetting the start by the time you hit the end. That’s what older models were dealing with.
Now, think about reading a book. Instead of processing one word at a time, what if you could absorb an entire page in one glance?
That’s precisely how transformers work. They analyze all words at once, capturing relationships instantly instead of trudging through them one by one.
The secret sauce? Attention.
The idea went mainstream with a 2017 paper from Google titled “Attention Is All You Need.” (Yes, really.)
This mechanism lets models focus on the most relevant words—capturing relationships across the sentence instead of guessing.
For example, in:
“The cat sat on the mat because it was tired.”
A transformer knows that “it” refers to “the cat,” not the mat, because attention lets it weigh how strongly each word relates to every other word.
Alright, we’re getting a bit more technical here. You could skip this, but if you’ve got a few minutes, trust me, it’s worth it.
Here’s a behind-the-scenes look at how AI understands and generates text:
When you type a prompt, the model doesn’t see words. It sees tokens—bite-sized chunks like:
“The Eiffel Tower is in Paris” becomes →
“The”, “Eiffel”, “Tower”, “is”, “in”, “Paris”
Tokens help the model break down and process language like puzzle pieces.
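Want to see this for yourself? Here’s a quick sketch using OpenAI’s open-source tiktoken library (just one tokenizer among many; you’d install it with pip install tiktoken). Heads up: real tokenizers often split words into sub-word chunks, so the pieces won’t always be whole words:

```python
# pip install tiktoken
import tiktoken

# Load a tokenizer used by many OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

text = "The Eiffel Tower is in Paris"
token_ids = enc.encode(text)                 # text -> list of integer token IDs
print(token_ids)

# Map each ID back to the chunk of text it represents
print([enc.decode([t]) for t in token_ids])
```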
Each token is converted into a vector—a numerical representation of its meaning. However, the AI does not inherently “know” what words mean. Instead, these numbers capture statistical relationships between words.
"Paris" might be something like: [4,2,−3,1].
(btw, real vectors are much longer, often hundreds or thousands of dimensions)
These numbers aren’t random. They’re trained to reflect how words relate. Loosely speaking (in reality, no single number maps neatly to one concept), you can imagine:
4 = Paris is a city
2 = It’s a capital
-3 = Its relation to other major cities (London, New York, etc.)
1 = It’s a tourist hotspot
Without vectors, AI would just see “Paris” as random letters. With them? It starts to understand.
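To make that concrete, here’s a toy sketch with completely made-up 4-number vectors. The trick: similar meanings become similar directions in number-space, which we can measure with cosine similarity:

```python
import numpy as np

# Made-up toy embeddings (real ones have hundreds or thousands of numbers)
vectors = {
    "Paris":  np.array([4.0, 2.0, -3.0, 1.0]),
    "London": np.array([3.5, 2.2, -2.5, 0.9]),   # another capital city
    "banana": np.array([-1.0, 0.3, 4.0, -2.0]),  # not a city
}

def cosine_similarity(a, b):
    # Close to 1.0 = pointing the same way (similar meaning)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(vectors["Paris"], vectors["London"]))  # high
print(cosine_similarity(vectors["Paris"], vectors["banana"]))  # low
```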
Now, the AI figures out relationships between words using attention, specifically the self-attention mechanism inside transformer models.
Think of it like a matchmaker for words—minus the awkward small talk.
Example: “The Eiffel Tower is in…”
Original vector for “Eiffel Tower”: [3,7,-2]
Updated vector after processing context: [4,9,-1]
These updates reflect learned relationships:
4 → Reinforces that “Eiffel Tower” is a physical landmark
9 → Increases its connection to tourism
-1 → Lowers association with unrelated concepts (e.g., technology)
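If you want to peek under the hood, here’s a minimal sketch of scaled dot-product self-attention, the core operation from the 2017 paper, run on tiny made-up numbers:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Each token asks a question (Q), advertises what it offers (K),
    # and carries information to share (V)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how relevant is each token to each other?
    weights = softmax(scores)                # each row becomes percentages summing to 1
    return weights @ V                       # blend token info, weighted by relevance

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                  # 3 tokens, 4-dim vectors (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv))         # one updated vector per token
```

Real models run many of these “attention heads” in parallel and stack dozens of layers, but the core move is just this.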
How does AI know to make these changes? It learns through training, adjusting these numbers over time via a process called backpropagation (a fancy term for nudging the connections to shrink prediction errors, based on patterns in data).
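Here’s the spirit of it in miniature: one weight instead of billions, but the same nudge-the-number-to-shrink-the-error loop (all values made up):

```python
# Toy gradient descent on a single weight (real training does this for
# billions of weights at once, across mountains of text)
w = 0.5                        # a made-up starting weight
x, target = 2.0, 3.0           # one input and the answer we want
lr = 0.1                       # learning rate: how big each nudge is

for step in range(5):
    pred = w * x
    error = (pred - target) ** 2
    grad = 2 * (pred - target) * x   # which way (and how hard) to nudge w
    w -= lr * grad                   # nudge it to shrink the error
    print(f"step {step}: w={w:.3f}, error={error:.4f}")
```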
After adjusting the vectors, the model filters out irrelevant details using feedforward layers (which often include Multi-Layer Perceptrons, or MLPs).
It’s like having a personal assistant who says, “Keep the fact that Paris is in France, but skip the population stats—nobody asked.”
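For the curious, here’s the rough shape of one of those feedforward blocks: expand the vector, squash it through a nonlinearity, project it back down. Toy sizes, made-up weights:

```python
import numpy as np

def gelu(x):
    # Smooth "keep or suppress" gate used in most transformer MLPs
    # (tanh approximation)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feedforward(x, W1, b1, W2, b2):
    # Expand (4 -> 16), filter with the nonlinearity, project back (16 -> 4)
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 4))                     # 3 tokens, 4-dim vectors (toy)
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)
print(feedforward(x, W1, b1, W2, b2).shape)     # (3, 4)
```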
Finally, the model predicts the most likely next word based on probabilities after all these transformations.
Example: If the input is “The Eiffel Tower is in”, the AI calculates:
“Paris” → 95% probability
“London” → 2% probability
“France” → 1% probability
Since “Paris” has the highest probability (based on training data), the model selects it as the output.
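Under the hood, the model produces a raw score (a “logit”) for every token in its vocabulary, and a softmax turns those scores into the probabilities above. A toy version with made-up scores:

```python
import numpy as np

# Made-up raw scores for three candidate next tokens
candidates = ["Paris", "London", "France"]
logits = np.array([9.2, 5.4, 4.7])

probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # softmax: raw scores -> probabilities

for word, p in zip(candidates, probs):
    print(f"{word}: {p:.1%}")           # roughly 97%, 2%, 1%

print("pick:", candidates[int(np.argmax(probs))])  # greedy choice: "Paris"
```

(Chatbots usually sample from this distribution rather than always taking the top pick, which is part of why the same prompt can give different answers.)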
The model doesn’t know it’s right. It just plays the odds—learned from training on mountains of text.
But when it’s right? It feels eerily human.
If you’re hungry for a deeper dive into how generative AI and transformer models actually work, this free course on Coursera is a gem. It’s short, sharp, and you can knock it out in a few hours. Well worth your time.
So why did transformers take over? They’re extremely effective:
Parallel Processing: They analyze entire sentences at once, processing all words (or tokens) simultaneously. This makes them super fast, especially with today’s powerful computing resources.
Understanding Context: Transformers use self-attention to focus on relationships between words and understand how meaning shifts based on context.
Handling Long Texts: Transformers shine when it comes to long pieces of text. Thanks to their self-attention powers, they can connect ideas even if they’re far apart in the text, making them perfect for summarizing documents, translating languages, or answering questions based on long passages.
Next up: we’re entering the Crypto AI arena.
We’re talking AI agents, Bittensor, decentralized intelligence, and how this transformer-powered tech is reshaping everything from trading to shitposting.
Trust me, you don’t wanna miss it.
It’s going to be a fun ride. See you in the next lesson! 🎉
Cheers,
Teng Yan