(Read time: 6 minutes)
Transformers are the tech behind ChatGPT, Claude, and basically every LLM you’ve heard of.
They don’t understand language like humans, but they feel like they do, thanks to something called “attention.”
Transformers are fast, context-aware, and built to handle long text like a boss.
Last time, we explored how AI models learn. Today, we get to the juicy part.
Ever typed a weirdly specific prompt into ChatGPT and thought,
“Wait… how the hell did it get me?”
Same.
It almost feels like magic. But the answer is less Hogwarts, more math—and it all starts with Transformer models.
Before transformers, AI models relied on RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory). These were solid for their time but had a fatal flaw—terrible memory.
Imagine trying to understand a sentence but forgetting the start by the time you hit the end. That’s what older models were dealing with.
Now, think about reading a book. Instead of processing one word at a time, what if you could absorb an entire page in one glance?
That’s precisely how transformers work. They analyze all words at once, capturing relationships instantly instead of trudging through them one by one.
The secret sauce? Attention.
The idea went mainstream with a 2017 paper from Google titled “Attention Is All You Need.” (Yes, really.)
This mechanism lets models focus on the most relevant words—capturing relationships across the sentence instead of guessing.
For example, in:
“The cat sat on the mat because it was tired.”
A transformer knows that “it” refers to “the cat,” not the mat, because attention lets it weigh how strongly each word relates to every other word.
Alright, we’re getting a bit more technical here. You could skip this, but if you’ve got a few minutes, trust me, it’s worth it.
Here’s a behind-the-scenes look at how AI understands and generates text:
When you type a prompt, the model doesn’t see words. It sees tokens—bite-sized chunks like:
“The Eiffel Tower is in Paris” becomes →
“The”, “Eiffel”, “Tower”, “is”, “in”, “Paris”
Tokens help the model break down and process language like puzzle pieces.
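Want to see this for yourself? Here’s a quick sketch using OpenAI’s open-source tiktoken library (just one tokenizer among many; you’d install it with pip install tiktoken). Heads up: real tokenizers often split words into sub-word chunks, so the pieces won’t always be whole words:

```python
# pip install tiktoken
import tiktoken

# Load a tokenizer used by many OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

text = "The Eiffel Tower is in Paris"
token_ids = enc.encode(text)                 # text -> list of integer token IDs
print(token_ids)

# Map each ID back to the chunk of text it represents
print([enc.decode([t]) for t in token_ids])
```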
Each token is converted into a vector—a numerical representation of its meaning. However, the AI does not inherently “know” what words mean. Instead, these numbers capture statistical relationships between words.
"Paris" might be something like: [4,2,−3,1].
(btw, real vectors are much longer, often hundreds or thousands of dimensions)
These numbers aren’t random. They’re trained to reflect how words relate. Loosely speaking (in reality, no single number maps neatly to one concept), you can imagine:
4 = Paris is a city
2 = It’s a capital
-3 = Its relation to other major cities (London, New York, etc.)
1 = It’s a tourist hotspot
Without vectors, AI would just see “Paris” as random letters. With them? It starts to understand.
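To make that concrete, here’s a toy sketch with completely made-up 4-number vectors. The trick: similar meanings become similar directions in number-space, which we can measure with cosine similarity:

```python
import numpy as np

# Made-up toy embeddings (real ones have hundreds or thousands of numbers)
vectors = {
    "Paris":  np.array([4.0, 2.0, -3.0, 1.0]),
    "London": np.array([3.5, 2.2, -2.5, 0.9]),   # another capital city
    "banana": np.array([-1.0, 0.3, 4.0, -2.0]),  # not a city
}

def cosine_similarity(a, b):
    # Close to 1.0 = pointing the same way (similar meaning)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(vectors["Paris"], vectors["London"]))  # high
print(cosine_similarity(vectors["Paris"], vectors["banana"]))  # low
```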
Now, the AI figures out relationships between words using attention, specifically the self-attention mechanism inside transformer models.
Think of it like a matchmaker for words—minus the awkward small talk.
Example: “The Eiffel Tower is in…”
Original vector for “Eiffel Tower”: [3,7,-2]
Updated vector after processing context: [4,9,-1]
These updates reflect learned relationships:
4 → Reinforces that “Eiffel Tower” is a physical landmark
9 → Increases its connection to tourism
-1 → Lowers association with unrelated concepts (e.g., technology)
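If you want to peek under the hood, here’s a minimal sketch of scaled dot-product self-attention, the core operation from the 2017 paper, run on tiny made-up numbers:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Each token asks a question (Q), advertises what it offers (K),
    # and carries information to share (V)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how relevant is each token to each other?
    weights = softmax(scores)                # each row becomes percentages summing to 1
    return weights @ V                       # blend token info, weighted by relevance

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                  # 3 tokens, 4-dim vectors (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv))         # one updated vector per token
```

Real models run many of these “attention heads” in parallel and stack dozens of layers, but the core move is just this.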
How does AI know to make these changes? It learns through training, adjusting these numbers over time via a process called backpropagation (a fancy term for nudging the connections to shrink prediction errors, based on patterns in data).
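Here’s the spirit of it in miniature: one weight instead of billions, but the same nudge-the-number-to-shrink-the-error loop (all values made up):

```python
# Toy gradient descent on a single weight (real training does this for
# billions of weights at once, across mountains of text)
w = 0.5                        # a made-up starting weight
x, target = 2.0, 3.0           # one input and the answer we want
lr = 0.1                       # learning rate: how big each nudge is

for step in range(5):
    pred = w * x
    error = (pred - target) ** 2
    grad = 2 * (pred - target) * x   # which way (and how hard) to nudge w
    w -= lr * grad                   # nudge it to shrink the error
    print(f"step {step}: w={w:.3f}, error={error:.4f}")
```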
After adjusting the vectors, the model filters out irrelevant details using feedforward layers (which often include Multi-Layer Perceptrons, or MLPs).
It’s like having a personal assistant who says, “Keep the fact that Paris is in France, but skip the population stats—nobody asked.”
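For the curious, here’s the rough shape of one of those feedforward blocks: expand the vector, squash it through a nonlinearity, project it back down. Toy sizes, made-up weights:

```python
import numpy as np

def gelu(x):
    # Smooth "keep or suppress" gate used in most transformer MLPs
    # (tanh approximation)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feedforward(x, W1, b1, W2, b2):
    # Expand (4 -> 16), filter with the nonlinearity, project back (16 -> 4)
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 4))                     # 3 tokens, 4-dim vectors (toy)
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)
print(feedforward(x, W1, b1, W2, b2).shape)     # (3, 4)
```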
Finally, the model predicts the most likely next word based on probabilities after all these transformations.
Example: If the input is “The Eiffel Tower is in”, the AI calculates:
“Paris” → 95% probability
“London” → 2% probability
“France” → 1% probability
Since “Paris” has the highest probability (based on training data), the model selects it as the output.
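Under the hood, the model produces a raw score (a “logit”) for every token in its vocabulary, and a softmax turns those scores into the probabilities above. A toy version with made-up scores:

```python
import numpy as np

# Made-up raw scores for three candidate next tokens
candidates = ["Paris", "London", "France"]
logits = np.array([9.2, 5.4, 4.7])

probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # softmax: raw scores -> probabilities

for word, p in zip(candidates, probs):
    print(f"{word}: {p:.1%}")           # roughly 97%, 2%, 1%

print("pick:", candidates[int(np.argmax(probs))])  # greedy choice: "Paris"
```

(Chatbots usually sample from this distribution rather than always taking the top pick, which is part of why the same prompt can give different answers.)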
The model doesn’t know it’s right. It just plays the odds—learned from training on mountains of text.
But when it’s right? It feels eerily human.
If you’re hungry for a deeper dive into how generative AI and transformer models actually work, this free course on Coursera is a gem. It’s short, sharp, and you can knock it out in a few hours. Well worth your time.
So why did transformers take over? They’re extremely effective:
Parallel Processing: They analyze entire sentences at once, processing all words (or tokens) simultaneously. This makes them super fast, especially with today’s powerful computing resources.
Understanding Context: Transformers use self-attention to focus on relationships between words and understand how meaning shifts based on context.
Handling Long Texts: Transformers shine when it comes to long pieces of text. Thanks to their self-attention powers, they can connect ideas even if they’re far apart in the text, making them perfect for summarizing documents, translating languages, or answering questions based on long passages.
Next up: we’re entering the Crypto AI arena.
We’re talking AI agents, Bittensor, decentralized intelligence, and how this transformer-powered tech is reshaping everything from trading to shitposting.
Trust me, you don’t wanna miss it.
It’s going to be a fun ride. See you in the next lesson! 🎉
Cheers,
Teng Yan