Knowledge Distillation: Distilling DeepSeek-R1

The release of DeepSeek-R1 in early 2025 challenged conventional wisdom about scaling AI models. Rather than simply chasing raw parameter count, DeepSeek showed that knowledge distillation from a large reasoning model can transfer multi-step reasoning traces (Chain of Thought, CoT) into smaller student models such as Qwen or Llama.

Thanks to this technique, distilled open models such as DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Llama-8B post reasoning and coding benchmark scores that surpass non-distilled models several times their size.

This article details the mechanics of reasoning distillation, its mathematical formulations, and shows how to set up an automated Python pipeline to distill custom tasks using DeepSeek-R1.

1. Classification of Knowledge Distillation Methods

Knowledge distillation is a compression technique that transfers capabilities from a large, highly capable model (Teacher) to a smaller, resource-efficient model (Student). The objective is to train the Student to mimic the Teacher’s output probability distributions or behavior while minimizing representation errors.

       ┌───────────────────────┐
       │    Teacher (Large)    │
       │   e.g., DeepSeek-R1   │
       └───────────┬───────────┘
                   │
           [Transfer Knowledge]
                   │
         ┌─────────┴─────────┐
         ▼                   ▼
┌──────────────────┐  ┌──────────────────┐
│   Logits-Based   │  │  Sequence-Level  │
│   (Soft Labels)  │  │  (CoT Reasoning) │
└────────┬─────────┘  └────────┬─────────┘
         │                   │
         └─────────┬─────────┘
                   │
                   ▼
       ┌───────────────────────┐
       │    Student (Small)    │
       │  e.g., Qwen-Coder-7B  │
       └───────────────────────┘

In practice, there are three primary approaches to distillation:

1.1. Logits-Based Distillation (Response-Based)

The Student learns directly from the Teacher's next-token probability distribution, obtained by applying a softmax to its output logits. Instead of learning only from hard labels (the single correct token at each position), the Student matches the Teacher's soft labels, which capture semantic correlations between vocabulary tokens.

Loss Function: $$L_{\text{KD}} = (1 - \gamma) \cdot L_{\text{CE}}(y, p_s) + \gamma \cdot \tau^2 \cdot \text{KL}(p_t^{\tau} \,\|\, p_s^{\tau})$$ Where:
$L_{\text{CE}}$ is the standard cross-entropy loss between ground-truth labels $y$ and student predictions $p_s$.
$\text{KL}$ is the Kullback-Leibler divergence measuring how far the Student's softened distribution $p_s^{\tau}$ deviates from the Teacher's softened distribution $p_t^{\tau}$.
$\tau$ is the temperature used to soften the output logits; the $\tau^2$ factor keeps the gradient magnitude of the soft-label term comparable across temperatures.
$\gamma \in [0, 1]$ is the mixing weight between the hard-label term and the soft-label term.
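
To make the loss concrete, here is a minimal pure-Python sketch for a single token position. The function names `softmax` and `kd_loss`, the default values of `gamma` and `tau`, and the toy logits are illustrative assumptions; a real training loop would compute the same quantities on batched tensors in a deep learning framework.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, gamma=0.5, tau=2.0):
    """Combined distillation loss for one token position (illustrative defaults)."""
    # Hard-label term: cross-entropy against the ground-truth token, at temperature 1.
    p_s = softmax(student_logits)
    ce = -math.log(p_s[label])
    # Soft-label term: KL(teacher || student) on temperature-softened distributions.
    p_t_soft = softmax(teacher_logits, tau)
    p_s_soft = softmax(student_logits, tau)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t_soft, p_s_soft) if pt > 0)
    return (1 - gamma) * ce + gamma * tau ** 2 * kl

# Toy logits over a three-token vocabulary (made-up values).
teacher = [4.0, 2.0, 0.5]
student = [2.5, 2.0, 1.0]
print(kd_loss(student, teacher, label=0))
# With gamma=1.0 and identical logits, only the KL term remains and it vanishes.
print(kd_loss(teacher, teacher, label=0, gamma=1.0))
```

Raising `tau` flattens both distributions, so the Student receives more signal about the relative ranking of low-probability tokens.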

1.2. Feature-Based Distillation

The Student directly mimics the Teacher's intermediate representations, such as hidden states or attention maps. This is more complex to implement because Teacher and Student usually differ in hidden dimension, so linear projection layers are needed to align the two representation spaces.
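
As a rough sketch of the projection idea, the snippet below maps a 2-dimensional student hidden vector into a 3-dimensional teacher space and scores the match with mean squared error. The helper names, the toy vectors, and the fixed `weight` matrix are all assumptions for illustration; in practice the projection would be a trainable layer optimized jointly with the Student.

```python
def project(hidden, weight):
    """Map a student hidden vector into the teacher's dimension.

    `weight` has shape (d_teacher, d_student), stored as a list of rows.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def feature_mse(student_hidden, teacher_hidden, weight):
    """Mean squared error between the projected student state and the teacher state."""
    projected = project(student_hidden, weight)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(teacher_hidden)

# Toy values: student width 2, teacher width 3.
student_hidden = [0.5, -1.0]
teacher_hidden = [0.2, 0.1, -0.4]
weight = [[0.1, 0.0], [0.0, 0.3], [0.2, 0.5]]

loss = feature_mse(student_hidden, teacher_hidden, weight)
print(loss)
```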

1.3. Sequence-Level Distillation (CoT Distillation)

This is the approach driving DeepSeek-R1’s distilled models. Rather than extracting raw logits, we prompt the Teacher (DeepSeek-R1) to solve complex problems and capture its full Chain of Thought (CoT) reasoning path, enclosed in <think> ... </think> tags. We then use this synthetic reasoning dataset to run standard supervised fine-tuning (SFT) on the Student model.

Advantages: Simple to execute, requires zero internal model architecture intervention, and functions over standard completion APIs. The smaller student model learns to structure its thoughts, decompose tasks, and self-correct errors before yielding final answers.
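
One way to picture the resulting training data: each distilled record becomes a single user/assistant exchange whose assistant turn contains both the reasoning trace and the final answer. The sketch below assumes records with `instruction` and `output` fields and a chat-style `messages` schema; that schema is an assumption, and the exact format depends on the SFT tooling you use.

```python
import json

def to_chat_example(record):
    """Convert one distilled record into a chat-style SFT example (assumed schema)."""
    return {
        "messages": [
            {"role": "user", "content": record["instruction"]},
            # The assistant turn carries the <think> trace followed by the answer.
            {"role": "assistant", "content": record["output"]},
        ]
    }

# Hand-written toy record, not real Teacher output.
record = {
    "id": 1,
    "instruction": "Reverse a string in Python.",
    "output": (
        "<think>\nSlicing with a step of -1 walks the string backwards.\n</think>\n\n"
        "def reverse(s):\n    return s[::-1]"
    ),
}

example = to_chat_example(record)
print(json.dumps(example, indent=2))
```

Because the Student is trained on the whole assistant turn, it learns to emit the reasoning first and the answer second.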

2. Why Qwen 2.5 Coder is the Ideal Student Model

Qwen 2.5 Coder (particularly the 7B and 14B versions) provides exceptional coding and syntax comprehension due to its extensive code-centric pre-training dataset.

However, base Qwen models occasionally struggle with high-level logic or architectural design. By distilling reasoning traces from DeepSeek-R1, Qwen Coder combines precise code syntax with structured algorithmic planning.

3. Building a CoT Data Generation Pipeline

To perform Sequence-Level Distillation, we must generate a high-quality reasoning dataset. Below is a Python script using httpx and asyncio to fetch reasoning traces from the DeepSeek API:

import asyncio
import json
import httpx
from pydantic import BaseModel

# API configurations (Point to self-hosted vLLM or cloud API providers)
API_URL = "https://api.deepseek.com/v1/chat/completions"
API_KEY = "your-api-key-here"

class CodeTask(BaseModel):
    id: int
    instruction: str

async def generate_cot_data(task: CodeTask, client: httpx.AsyncClient) -> dict:
    prompt = f"""Solve the following programming task. Required constraints:
1. Document your entire analytical logic, alternative solutions, time/space complexity analysis, and potential edge-case errors.
2. Provide your final clean Python solution at the end.

Task:
{task.instruction}"""

    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": "deepseek-reasoner",  # DeepSeek-R1 reasoning model; thoughts arrive in reasoning_content
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.6,
        "max_tokens": 4096
    }

    try:
        response = await client.post(API_URL, json=payload, headers=headers, timeout=120.0)
        response.raise_for_status()  # surface 4xx/5xx responses as exceptions instead of a KeyError below
        response_json = response.json()
        
        # DeepSeek-R1 exposes thought traces in the 'reasoning_content' field
        choice = response_json["choices"][0]
        message = choice["message"]
        
        reasoning = message.get("reasoning_content", "")
        content = message.get("content", "")
        
        # Format the training pair for SFT CoT
        # Wrap the reasoning trace in <think> tags for the student model to learn structural syntax
        formatted_output = f"<think>\n{reasoning}\n</think>\n\n{content}"
        
        return {
            "instruction": task.instruction,
            "output": formatted_output,
            "id": task.id
        }
    except Exception as e:
        # Note: In production, consider using the `tenacity` library
        # to automatically retry on 429 (Rate Limit) or 500 API errors.
        print(f"Error processing task {task.id}: {str(e)}")
        return {
            "instruction": task.instruction,
            "output": None,
            "id": task.id
        }

async def main():
    raw_tasks = [
        {"id": 1, "instruction": "Write a function to detect a cycle in a singly linked list (Linked List Cycle)."},
        {"id": 2, "instruction": "Design a Least Recently Used (LRU) Cache structure with O(1) time complexity for get and put operations."},
        {"id": 3, "instruction": "Find the length of the longest substring without repeating characters."}
    ]
    
    tasks = [CodeTask(**t) for t in raw_tasks]
    
    print(f"Starting distillation of {len(tasks)} tasks...")
    limits = httpx.Limits(max_keepalive_connections=5, max_connections=10)
    async with httpx.AsyncClient(limits=limits) as client:
        # Gather completions concurrently
        jobs = [generate_cot_data(task, client) for task in tasks]
        results = await asyncio.gather(*jobs)
        
    cleaned_results = [r for r in results if r["output"] is not None]
    
    # Save output to JSONL for SFT training (as configured in Part 3)
    output_file = "deepseek_r1_distilled_cot.jsonl"
    with open(output_file, "w", encoding="utf-8") as f:
        for item in cleaned_results:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
            
    print(f"Successfully saved {len(cleaned_results)} distilled samples to {output_file}")

if __name__ == "__main__":
    asyncio.run(main())
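
The comment in the `except` branch suggests retrying failed calls. If you prefer not to add a dependency, a standard-library sketch of exponential backoff could look like the following. The helper `with_retries`, its parameters, and the `flaky` demo coroutine are hypothetical names introduced here for illustration.

```python
import asyncio
import random

async def with_retries(make_call, max_attempts=4, base_delay=1.0):
    """Await make_call(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller handle the error
            # Wait 1x, 2x, 4x ... the base delay, plus up to 10% random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay * (1 + 0.1 * random.random()))

# Demo with a hypothetical flaky coroutine that fails twice, then succeeds.
calls = {"count": 0}

async def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated 429")
    return "ok"

result = asyncio.run(with_retries(flaky, base_delay=0.01))
print(result, calls["count"])  # prints: ok 3
```

In the pipeline, the `client.post(...)` call inside `generate_cot_data` could be wrapped this way. Note that httpx only raises on HTTP error statuses when `response.raise_for_status()` is called, and that this sketch retries on every exception; a production version would likely retry only on rate limits and server errors.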

4. Why We Preserve `<think> ... </think>` Tags

When training the distilled student model, maintaining a strict boundary between the reasoning (thought) and the final answer is essential:

Separation of Concerns: It prevents representation mixing. The student model learns when to write free-form logic (exploring ideas, backtracking from errors) without corrupting the clean syntax required in the final output (such as structured JSON or Python code blocks).
Facilitating Reinforcement Learning: This format is critical for subsequent optimization phases, where reinforcement learning algorithms evaluate reasoning length and output correctness independently.
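
Keeping the tags also makes samples easy to split and validate programmatically. The sketch below separates the reasoning from the answer and rejects malformed samples; the pattern, the function name `split_reasoning`, and the rejection rules are assumptions chosen for illustration rather than a fixed standard.

```python
import re

# Expect the sample to open with a <think> block, followed by the final answer.
THINK_PATTERN = re.compile(r"^<think>\n?(.*?)\n?</think>\s*(.*)$", re.DOTALL)

def split_reasoning(output):
    """Return (reasoning, answer), or None if the sample is malformed."""
    match = THINK_PATTERN.match(output)
    if match is None:
        return None
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    # Reject an empty trace, an empty answer, or stray tags leaking into the answer.
    if not reasoning or not answer or "<think>" in answer or "</think>" in answer:
        return None
    return reasoning, answer

sample = (
    "<think>\nUse two pointers moving at different speeds.\n</think>\n\n"
    "def has_cycle(head): ..."
)
print(split_reasoning(sample))
print(split_reasoning("def has_cycle(head): ..."))  # no tags, so this prints None
```

With the two parts separated, reasoning length and answer correctness can each be measured on their own.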

Next Chapter

While fine-tuning on reasoning data teaches the student model the appearance of reasoning, it doesn’t guarantee logical correctness. Student models often mimic the reasoning tone but output logical fallacies.

In Part 5: Preference Alignment (DPO, KTO, GRPO), we explore preference-optimization and reinforcement learning algorithms that push our models toward logically correct reasoning steps.
