TL;DR

  • DeepSeek R1 surprised the tech industry by performing as well as top Western AI models like GPT-4, while being trained at a much lower cost.

  • It works using three smart methods: breaking down its thought process step by step, learning by trial and error without needing constant human help, and creating smaller versions that run on regular computers.

  • It is free and open-source, but OpenAI accused DeepSeek of copying answers from their models to improve R1, raising concerns about fair use.

  • Alibaba’s Qwen 3 has now beaten DeepSeek in some areas, and DeepSeek is working on new training techniques to stay competitive.

Hello {{ First Name | }}

In 1957, the Soviet Union launched a beach-ball-sized satellite called Sputnik. It cost less than a Hollywood movie but sent a global shockwave.

The world realized the Soviets had real technological momentum and were entering the Space Age. In the United States, the launch triggered panic and forced a reckoning: they weren’t as far ahead as they thought.

DeepSeek R1 felt like a modern Sputnik: an unexpected breakthrough in large language models that changes the conversation.

Developed by a Chinese research team at what many now see as the leading open-source AI lab, R1 has caught the interest of the AI research community thanks to its impressive performance in areas like math, coding, and scientific reasoning.

So, What is DeepSeek?

DeepSeek has two key large language models (LLMs):

  • DeepSeek V3: A general-purpose foundation model

  • DeepSeek R1: Designed for advanced reasoning

In January 2025, the DeepSeek chatbot shot to the top of both the Apple App Store and Google Play, even surpassing ChatGPT in downloads. The surge shattered the assumption that Western firms held a comfortable lead in AI, triggering a $1 trillion sell-off that rattled giants like Nvidia and Microsoft.

So what makes it so interesting?

  • It’s cheap – Most AI models require massive data centers and powerful Nvidia graphics cards (GPUs) to function. DeepSeek R1, however, was trained without access to Nvidia’s latest chips, reportedly for less than $6 million. The newly released DeepSeek-V3 paper explains how the model reached high intelligence on only 2,048 H800 GPUs by cutting memory, math, and network costs at every step.

  • It’s open-source – Unlike some AI models that are locked behind paywalls, DeepSeek R1 is free for anyone to use and modify.

  • It’s powerful – Despite being trained with fewer resources, it competes with some of the best AI models available today.

So, how does DeepSeek work?

DeepSeek R1 is different from traditional AI models because it uses a combination of three powerful techniques:

  1. Chain of Thought (CoT) Reasoning 

  2. Reinforcement Learning (RL)

  3. Model Distillation 

1. Chain of Thought (CoT)

Chain of Thought is a method where the AI explains its thinking process out loud, just like a student showing their work on a math test. 

By writing down each step, it can catch its own mistakes and improve.

Suppose you asked: “If a bathtub holds 100 liters and drains at 5 liters per minute, how long until it’s empty?”

Instead of just saying “20 minutes,” the AI might write:

  1. Total water = 100 liters. Drain rate = 5 liters/minute.

  2. Time = Total / Rate = 100 / 5 = 20 minutes.

  3. Wait—is the tub already full? Yes. So 20 minutes is correct.

By breaking down its reasoning, the AI can spot errors (like forgetting the tub is full) and fix them. This makes its answers more accurate.
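To make this concrete, here is a minimal sketch in Python of what a chain-of-thought prompt and reply can look like. The prompt wording and the example reply are illustrative assumptions; the one detail borrowed from R1 itself is that it wraps its reasoning in <think> tags before giving the final answer.

```python
# A minimal sketch of chain-of-thought prompting (illustrative only).
# The exact wording DeepSeek uses is not reproduced here; this just shows
# the idea of asking a model to reason step by step before answering.

question = ("If a bathtub holds 100 liters and drains at 5 liters per minute, "
            "how long until it's empty?")

cot_prompt = (
    "Solve the problem below. Think step by step inside <think>...</think> tags, "
    "then give the final answer on its own line.\n\n"
    f"Problem: {question}"
)

# The shape of the reply we would expect from a reasoning model like R1,
# which wraps its intermediate steps in <think> tags:
example_reply = """<think>
Total water = 100 liters. Drain rate = 5 liters/minute.
Time = total / rate = 100 / 5 = 20 minutes.
Check: the tub starts full, so 20 minutes is correct.
</think>
Answer: 20 minutes"""

print(cot_prompt)
print(example_reply)
```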

Why This Matters

DeepSeek R1 is completely open-source, meaning anyone can see exactly how it reaches its conclusions (i.e., its chain of thought process). This transparency makes it a huge breakthrough in AI research. 

It is believed that DeepSeek’s transparency in revealing its reasoning process compelled OpenAI to launch o3-mini and o3-mini-high earlier than planned, as these models also display their reasoning (i.e., how they reach conclusions).

2. Reinforcement Learning (Without Full Supervision)

Traditional AI training methods involve giving the AI a huge dataset with answers and expecting it to learn from them (i.e., Supervised Learning).

DeepSeek R1, however, uses Reinforcement Learning (RL).

Reinforcement learning lets the AI learn by trial and error, similar to how a baby learns to walk. No one tells the baby exactly how to move its legs. Instead, the baby:

  1. Tries walking and falls

  2. Learns what worked and what didn’t

  3. Keeps adjusting until it gets better

In other words, instead of just memorizing correct answers, R1 figures things out on its own.
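Here is a toy sketch of that trial-and-error loop in Python. The “policy” is just a single number and the reward function is invented for illustration; the point is that repeated guessing plus a reward signal converges on a good answer without anyone ever supplying it directly.

```python
import random

# Toy trial-and-error learning (not DeepSeek's actual setup).
# The learner never sees the target; it only gets a reward telling it
# whether a new guess was better or worse than the old one.

target = 42  # the "right answer", never shown to the learner directly

def reward(guess: float) -> float:
    return -abs(guess - target)  # closer to the target = higher reward

guess = 0.0
for _ in range(1000):
    candidate = guess + random.uniform(-1, 1)  # try a small variation ("take a step")
    if reward(candidate) > reward(guess):      # keep it only if it scored better
        guess = candidate                      # otherwise, fall and try again

print(round(guess))  # ends up near 42 purely from reward feedback
```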

Why This Matters

Most AI models rely on static training, meaning they only improve when humans update them with new data. Due to the immense costs associated with large-scale supervised learning, only companies with billions of dollars have been able to train highly advanced AI systems. 

For example, OpenAI employed low-cost workers in Kenya to filter harmful content and refine its models—a common but ethically debated approach.

DeepSeek R1 doesn’t need humans to tell it the “right” answer. Instead, it experiments with different ways to solve a problem (like solving an equation in 2 steps vs. 10 steps). This means that given enough time, it can surpass traditional AI models in accuracy without requiring constant human intervention.

By prioritizing factual correctness over plausible-sounding responses, R1 can scale its training without relying on vast teams of human annotators.

In tests, DeepSeek R1 started with 70% accuracy and climbed to over 80% with practice, eventually outperforming some well-known models. The longer it trains, the better it gets!

But how does DeepSeek actually use RL?

You see, training large language models (LLMs) involves three stages:

  • Pre-training: The model learns general knowledge by processing vast amounts of text and code. For example, given an input like “write a bedtime ___,” the model learns to predict a fitting completion such as “story”. 

  • Supervised Fine-Tuning: In this phase, the model is trained on a dataset of instruction–response pairs. This step helps the model learn how to follow human instructions more reliably.

  • Post-training: In this phase, the model is refined using feedback, traditionally from humans or from another AI. Because pre-training is so expensive, many companies focus on post-training, i.e., refining pre-trained models like Llama to specialize in specific tasks.
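To make these stages concrete, here are illustrative, made-up examples of what the data might look like at each one (the samples below are invented for this newsletter, not taken from any real training set):

```python
# Illustrative (made-up) examples of the data used at each training stage.

# 1. Pre-training: raw text; the model learns to predict the next token.
pretraining_sample = "Once upon a time, a tired parent sat down to write a bedtime"
# target continuation the model should learn to predict: " story"

# 2. Supervised fine-tuning: instruction-response pairs.
sft_sample = {
    "instruction": "Write a one-sentence bedtime story about a sleepy robot.",
    "response": "The little robot dimmed its lights, hummed a lullaby, and powered down for the night.",
}

# 3. Post-training: feedback on model outputs, e.g. a preference between two
#    answers or a numeric reward, used to refine the already-trained model.
post_training_sample = {
    "prompt": "Explain why the sky is blue in one sentence.",
    "chosen": "Sunlight scatters off air molecules, and blue light scatters the most.",
    "rejected": "Because it just is.",
}
```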

Instead of following the normal approach, the DeepSeek R1 model starts with a pre-trained model (DeepSeek-V3-Base) and skips the supervised fine-tuning stage.

Its training relies solely on a rule-based reinforcement learning method called Group Relative Policy Optimization (GRPO).

GRPO works like a teacher grading multiple student answers to the same question, but with a set of strict rules.

For example, in a math exam, each student is asked, "What is 5 + 3?". Instead of just one student answering, several students provide different responses, each explaining their thought process. The teacher then evaluates each answer based on two things:

  • Accuracy (whether the final answer is correct) and

  • Clarity (whether the steps are well-organized and easy to follow).

If one student writes: "I thought about it this way: If I have five apples and someone gives me three more, I now have eight apples. So the answer is 8."

This answer would get points for both correctness and explanation.

But if another student simply writes "8" without any reasoning, they might lose points for not showing their work.

This method makes the evaluation process more structured and cost-effective, as there is no need for a separate critic or reward model to judge the answers.
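Here is a toy Python sketch of that group-relative scoring idea. The reward rules and weights are invented for illustration, and the real GRPO objective also includes a policy-gradient update with a KL penalty, both omitted here; only the “grade a group of answers and compare each one to the group average” step is shown.

```python
import re
from statistics import mean, pstdev

# Toy sketch of group-relative scoring (the heart of GRPO, heavily simplified).
# Several answers to the same question are graded with simple rules
# (accuracy + showing work); each answer's "advantage" is how far its score
# sits above or below the group's own average.

question, correct = "What is 5 + 3?", "8"

answers = [
    "If I have five apples and someone gives me three more, I now have eight apples. So the answer is 8.",
    "8",
    "5 + 3 = 7",
]

def rule_based_reward(answer: str) -> float:
    accuracy = 1.0 if re.search(rf"\b{correct}\b", answer) else 0.0
    clarity = 0.5 if len(answer.split()) > 3 else 0.0  # crude "showed their work" check
    return accuracy + clarity

rewards = [rule_based_reward(a) for a in answers]
mu, sigma = mean(rewards), pstdev(rewards) or 1.0

# Group-relative advantage: each reward standardized against the group itself,
# so no separate critic model is needed to provide a baseline.
for answer, r in zip(answers, rewards):
    advantage = (r - mu) / sigma
    print(f"{advantage:+.2f}  {answer}")
```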

At the same time, it busted another myth. A common assumption has been that post-training only enables models to retrieve and refine patterns learned during pre-training, limiting their ability to generate new reasoning. However, DeepSeek R1 showed that language models can acquire new knowledge through reinforcement learning (RL) even in the post-training phase.

3. Model Distillation: Making AI More Accessible

DeepSeek R1 is incredibly powerful, but there’s one problem: it’s huge. Running it at full capacity requires thousands of high-powered computer chips, which is far beyond what most people can afford.

To solve this, researchers used a technique called Model Distillation.

Think of a professor who is an expert in physics teaching students in simpler terms; over time, the students can perform nearly as well as the professor. That is the idea behind Model Distillation:

  1. A large AI model (the "teacher") is used to generate answers.

  2. A smaller AI model (the "student") is trained to mimic the teacher.

  3. The student AI can now perform almost as well as the teacher, but with far fewer resources.
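A toy sketch of that recipe might look like the code below. The Teacher and Student classes are stand-ins invented for illustration, not a real API; DeepSeek’s actual distilled models were produced by fine-tuning small open models (Qwen and Llama variants) on answers written by R1.

```python
# Toy sketch of distillation by imitation. The classes are placeholders;
# in practice the "student" is a small open model fine-tuned on the teacher's outputs.

class Teacher:
    """Stands in for a large model such as DeepSeek R1."""
    def generate(self, prompt: str) -> str:
        # In reality: a full, step-by-step answer from the big model.
        return f"<think>worked reasoning for: {prompt}</think> final answer"

class Student:
    """Stands in for a small model (e.g. a 7B Qwen or Llama variant)."""
    def __init__(self):
        self.training_data = []
    def fine_tune(self, dataset):
        # In reality: supervised fine-tuning on the teacher-written examples.
        self.training_data.extend(dataset)

teacher, student = Teacher(), Student()

prompts = [
    "What is 17 * 24? Show your reasoning.",
    "Write a Python function that reverses a string.",
]

# 1. The teacher generates answers; 2. the student is trained to mimic them.
dataset = [{"prompt": p, "completion": teacher.generate(p)} for p in prompts]
student.fine_tune(dataset)

# 3. The student now answers in the teacher's style at a fraction of the cost.
print(len(student.training_data), "teacher-written examples used to train the student")
```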

It echoes the reinforcement-learning idea above: when you teach a dog, you don’t show it every trick, you reward the right behavior. Similarly, if you can measure whether the AI is doing its job well, it can improve on its own.

The Accusation

However, OpenAI accuses DeepSeek of using model distillation on outputs from its o1 model to improve DeepSeek’s own models, allegedly violating OpenAI’s terms of service, which specifically prohibit using the API to build rival models.

OpenAI’s accusation: DeepSeek used its o1 LLM as a teacher model

While there's no definite proof, OpenAI points to things like large data extractions from the OpenAI API, potentially linked to DeepSeek, and DeepSeek sometimes identifying itself as ChatGPT.

Why This Matters

With distillation, researchers made DeepSeek R1’s knowledge available in smaller versions that require less computing power, can run on regular computers and are accessible to more people.

You can even run a distilled variant of this AI on your personal laptop (like a MacBook Pro) with a single command. A team at UC Berkeley managed to reproduce the core of DeepSeek R1’s reinforcement-learning behavior on a small task with a budget of just $30!

This particularly benefits startups and smaller teams who can now compete with tech giants.

How did R1 become so powerful?

Much of that power comes from its core model, DeepSeek-V3 (released in late 2024), a 671-billion-parameter model with the following features:

  1. Mixture of Experts (MoE): The model is split into “expert groups” focused on different skills. When you ask a question, only the relevant experts (about 5% of the model) activate, saving energy and computing power. Instead of using all 671 billion parameters for every word, it activates only a small portion (about 37 billion), cutting computation down from 2.4 TFLOPS to just 0.25 TFLOPS per word (see the sketch after this list).

  2. Multi-Token Prediction: Instead of guessing one word at a time (that’s how traditional LLMs work), V3 predicts chunks of text simultaneously—like finishing a phrase instead of a single word. This speeds up responses. For example, when given a 1,000-word email prompt, DeepSeek-V3 can generate a reply in just four seconds, using only half the computing resources older models would need.

  3. Simpler Math (FP8): By using rounded 8-bit numbers in its calculations, V3 runs faster and requires less hardware, with less than a 0.25% drop in accuracy compared to traditional 16-bit precision, while significantly reducing memory usage.

  4. Multi-head Latent Attention (MLA): This attention technique stores only about 70 KB of data per word, roughly one-seventh of what similar models like Llama-3.1 need, so everything fits comfortably in GPU memory.
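Here is the sketch referenced in the MoE point above: a toy illustration of expert routing in Python. The expert counts and the random router are made up (DeepSeek-V3’s real router is a learned layer with shared and routed experts); the point is simply that only a small fraction of the experts fires for each token.

```python
import random

# Toy Mixture-of-Experts routing. A router picks only the top few experts
# for each token, so most of the model's parameters stay idle on every step.

NUM_EXPERTS = 16      # illustrative; the real model has many more experts
ACTIVE_PER_TOKEN = 2  # only a handful fire for any given token

def route(token: str) -> list[int]:
    # A real router is a small learned network; here we just assign random scores.
    scores = {expert_id: random.random() for expert_id in range(NUM_EXPERTS)}
    return sorted(scores, key=scores.get, reverse=True)[:ACTIVE_PER_TOKEN]

for token in ["DeepSeek", "uses", "sparse", "experts"]:
    active = route(token)
    share = ACTIVE_PER_TOKEN / NUM_EXPERTS
    print(f"token {token!r} -> experts {active} ({share:.0%} of the model active)")
```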

R1 proves you don’t need massive data centers or armies of workers to build cutting-edge AI. If you can define what “correct” means for your task—whether coding, healthcare, or writing—you can train a tailored model using goal-driven methods.

What’s Next for DeepSeek?

Although DeepSeek has shown impressive results, competitors are not sitting idle.

For example, Alibaba recently released its new AI model Qwen 3, which has now beaten DeepSeek R1 on many tests, especially for coding and writing. It also gives users a new option to choose between fast or detailed answers using a “thinking budget” slider.


To stay competitive, DeepSeek is now also cooking up something new to cut down on operational costs. The company has teamed up with researchers at Tsinghua University to develop a new technique called Generative Reward Modelling—or GRM for short (DeepSeek-GRM).

So, what is GRM?

Traditional reward models often just give a score (like 7/10) for each model response. But GRM generates a text-based critique explaining why a certain response is better or worse, and from that explanation, a score is extracted. This makes the evaluation more detailed.

These critiques and scores are generated through a method called “Self-Principled Critique Tuning (SPCT)”. It teaches the model how to create its own rules for judging answers.

In other words, instead of giving it fixed rules, SPCT helps the model learn to decide what’s important, like clarity, correctness, or detail, depending on the question.

For example, if the question is “How do solar panels work?”, the model might decide that “technical accuracy” and “clear steps” are the most important. Then it uses those ideas to write a critique and give a score.
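Here is a toy Python sketch of that idea: the judge writes a critique and the numeric reward is parsed out of the text afterwards. The critique below is hard-coded for illustration; in DeepSeek-GRM it is generated by a trained model that first proposes its own judging principles (that is the SPCT step) and then applies them.

```python
import re

# Toy generative reward model: the "judge" returns a written critique,
# and the score is extracted from that text rather than predicted directly.

def generative_reward(question: str, answer: str) -> tuple[str, float]:
    # Hard-coded stand-in for a model-written critique (SPCT would generate
    # the principles and the critique itself).
    critique = (
        "Principles: technical accuracy, clear step-by-step explanation.\n"
        f"Critique: the answer to '{question}' names the photovoltaic effect and "
        "walks through the steps in order, but skips the role of the inverter.\n"
        "Score: 8/10"
    )
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)\s*/\s*10", critique)
    score = float(match.group(1)) / 10 if match else 0.0
    return critique, score

critique, score = generative_reward(
    "How do solar panels work?",
    "Solar cells absorb sunlight and convert it into electricity via the photovoltaic effect...",
)
print(critique)
print("reward =", score)
```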

What’s Next for AI?

One of the most revealing trends lies in data centers and chips.

There’s a beautiful contradiction in technology: the cheaper intelligence becomes, the more we hunger for power. Developers, now with affordable AI, will inevitably chase greater computational power—not because they need it, but because they can. It’s similar to building a faster car—once you have it, you want to drive farther.

Until recently, the assumption was simple: more powerful AI means more data centers. So tech giants like Meta plan to spend $65 billion on data centers this year alone, Microsoft and Amazon are investing similar amounts, and OpenAI is developing facilities worth around $100 billion.

Then came DeepSeek R1, which flipped that assumption. It showed that advanced AI models could run with far less computing power than expected.

This raises a broader question. What if we’re overbuilding? The history of tech bubbles suggests that infrastructure built during speculative frenzies often ends up useful, eventually. The dot-com crash cleared the way for Google, Meta, and the modern internet, but it took years for those gains to materialize.

The same may apply to this new wave of data center expansion. If efficient models like DeepSeek R1 become the norm, the benefits of today’s massive investments will take time to appear.

Still, demand for AI is rising fast. Much of the current spending is aimed at powering large, still-inefficient models. Whether efficiency gains can keep pace with growing demand will shape the next chapter in AI’s story.

Cheers,

Teng Yan & Ravi

Want more? Join over 5,000 curious minds and subscribe to the Chain of Thought research newsletter for clear, insightful takes on AI and crypto, delivered straight to your inbox.