The SLM Playbook: Fine-Tuning & Model Distillation

Welcome to Phase 2.5 of our AI-Native architecture journey.

As Small Language Models (SLMs) such as Llama 3 8B, Phi-4 14B, and Qwen 2.5 Coder 7B approach the capabilities of much larger commercial models (frontier LLMs) in specific domains, self-hosting and fine-tuning them becomes a key lever for reducing total cost of ownership (TCO), ensuring data privacy, and retaining full control of the technology stack.

This series is designed as a Hands-On Technical Playbook, taking you from quantization math and alignment algorithms to concrete Axolotl/vLLM code and configuration templates ready for enterprise scale.

Series Contents

💡 Core Principle: This playbook is not just about AI theory. We provide runnable YAML configs, core mathematical derivations, and Python code tested on production NVIDIA A10G/H100 GPUs.

Executive Summary — The SLM Playbook

For the past two years, enterprise AI adoption has been dominated by a singular architectural pattern: API integration with massive, closed-source models (Frontier LLMs). While this API-Centric model allows for rapid prototyping, it becomes a severe liability when scaled to production workloads handling sensitive company data.

The Problem with API-Centric Architectures

Relying exclusively on commercial APIs (such as GPT-4 or Claude 3.5 Sonnet) introduces three critical bottlenecks for scale-ups and enterprises: ...

Hybrid AI Architecture & Self-Hosted vLLM | SLM Playbook

In the early phase of the AI wave (2023-2024), the default architecture for most startups and enterprises was API-Centric: routing every single request to OpenAI’s GPT-4 or Anthropic’s Claude. While highly convenient for the proof-of-concept (PoC) phase, this model rapidly falls apart under production loads, where it runs into two walls: data privacy regulations and steep operational costs. By 2026, the rise of Small Language Models (SLMs) ranging from 2B to 14B parameters has dramatically shifted the landscape. Models such as Microsoft’s Phi-4 (14B), Qwen 2.5/3.5 Coder (7B/14B), and Llama 3 8B, when properly fine-tuned, achieve performance close to, or even exceeding, that of commercial frontier models on narrow, domain-specific tasks. ...

Data Engineering SFT: NEFTune & SemDeDup | SLM Playbook

In the era of LLMs and SLMs, the classic data science proverb “Garbage In, Garbage Out” has never been more relevant. When performing Supervised Fine-Tuning (SFT) for Small Language Models (SLMs), data quality and format dictate over 90% of the model’s downstream capabilities. Feeding millions of raw, web-scraped dialogue pairs or low-quality synthetic samples directly into your model will overfit it to repetitive phrasing, restrict its reasoning capabilities, and waste thousands of GPU hours. ...
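To make the "repetitive phrasing" problem concrete, here is a minimal sketch of near-duplicate filtering before SFT. It is a simplified lexical stand-in, not SemDeDup itself: SemDeDup compares learned embeddings, whereas this sketch assumes plain bag-of-words counts so that it runs with the standard library only. The function names and the 0.9 threshold are illustrative assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word tokens (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(samples: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a sample only if it is not too similar to any sample already kept."""
    kept: list[str] = []
    kept_vectors: list[Counter] = []
    for sample in samples:
        vector = bag_of_words(sample)
        if all(cosine(vector, other) < threshold for other in kept_vectors):
            kept.append(sample)
            kept_vectors.append(vector)
    return kept


samples = [
    "How do I reset my password?",
    "how do i reset my password",
    "What is the refund policy for annual plans?",
]
deduplicated = drop_near_duplicates(samples)
print(deduplicated)
```

The pairwise comparison is quadratic in the number of kept samples, so this only illustrates the idea; at dataset scale you would cluster embeddings first and compare within clusters.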

Practical QLoRA Fine-tuning: Axolotl & Unsloth | SLM Playbook

Full-parameter fine-tuning of a large language model is a luxury. Even for an 8B model like Llama 3, updating all weights in 16-bit precision requires multi-GPU hardware beyond the reach of many mid-sized teams and startups. To lower this hardware barrier, Parameter-Efficient Fine-Tuning (PEFT) methods were developed, with LoRA and QLoRA emerging as the dominant paradigms. They allow developers to train multi-billion-parameter models on a single 24 GB GPU (such as an RTX 3090, RTX 4090, or A10G) while typically keeping performance degradation close to zero compared to full fine-tuning. ...
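A rough back-of-the-envelope calculation shows where the hardware gap comes from. The sketch below assumes the common mixed-precision Adam accounting of 16 bytes per trained parameter and 0.5 bytes per frozen 4-bit weight. It ignores activations, quantization constants, and framework overhead, and the adapter budget (224 square projections at rank 16) is a hypothetical choice for illustration, not a recommended configuration.

```python
GB = 10 ** 9  # decimal gigabytes, to keep the arithmetic easy to follow


def full_finetune_bytes(num_params: int) -> int:
    """Mixed-precision Adam: 2 B weights + 2 B gradients + 12 B for the
    fp32 master weights and the two Adam moment estimates, per parameter."""
    return num_params * 16


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two matrices, (rank x d_in) and (d_out x rank)."""
    return rank * (d_in + d_out)


def qlora_bytes(num_params: int, adapter_params: int) -> int:
    """Frozen 4-bit base weights (0.5 B each) plus fully trained adapters."""
    return num_params // 2 + adapter_params * 16


base_params = 8_000_000_000  # round figure for an 8B model
# Hypothetical adapter budget: 224 square 4096 x 4096 projections at rank 16.
adapter_params = 224 * lora_param_count(4096, 4096, rank=16)

print(f"full fine-tuning: {full_finetune_bytes(base_params) / GB:.0f} GB")
print(f"QLoRA:            {qlora_bytes(base_params, adapter_params) / GB:.2f} GB")
```

Under these assumptions, weights and optimizer state alone come to roughly 128 GB for full fine-tuning versus under 5 GB for QLoRA, which is why the latter leaves room for activations on a 24 GB card.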

Knowledge Distillation: Distilling DeepSeek-R1 | SLM Playbook

The release of DeepSeek-R1 in early 2025 disrupted conventional wisdom surrounding artificial intelligence scaling. Rather than simply chasing raw parameter size, DeepSeek demonstrated a paradigm shift: Knowledge Distillation from massive reasoning models can transfer complex multi-step reasoning traces (Chain of Thought, CoT) into smaller student models (SLMs) like Qwen or Llama. Thanks to this technique, distilled open models like DeepSeek-R1-Distill-Qwen-14B or DeepSeek-R1-Distill-Llama-8B achieve reasoning and coding scores that surpass vanilla models multiple times their size. ...
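In practice, this style of distillation amounts to supervised fine-tuning of the student on teacher-generated traces. The sketch below shows one plausible way to turn a teacher output into an SFT record, wrapping the reasoning in the `<think>` tags that DeepSeek-R1 emits. The record field names (`prompt`, `completion`) and the exact-match filter are assumptions made for illustration, not the format of any particular training framework.

```python
def keep_trace(teacher_answer: str, reference_answer: str) -> bool:
    """Simple rejection filter: keep a trace only if the teacher's final
    answer matches a trusted reference (exact match after trimming)."""
    return teacher_answer.strip() == reference_answer.strip()


def build_sft_sample(question: str, reasoning: str, answer: str) -> dict:
    """Pack the teacher's chain of thought and final answer into one target
    string so the student learns to reason before answering."""
    completion = f"<think>\n{reasoning.strip()}\n</think>\n\n{answer.strip()}"
    return {"prompt": question.strip(), "completion": completion}


# Hypothetical teacher outputs: (question, reasoning, answer, reference answer).
teacher_outputs = [
    ("What is 12 * 7?", "12 * 7 = 84.", "84", "84"),
    ("What is 9 + 8?", "9 + 8 = 18.", "18", "17"),  # wrong answer, filtered out
]

dataset = [
    build_sft_sample(question, reasoning, answer)
    for question, reasoning, answer, reference in teacher_outputs
    if keep_trace(answer, reference)
]
print(dataset)
```

Filtering on answer correctness before training keeps the student from imitating confident but wrong reasoning.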

Preference Alignment: DPO, KTO, & GRPO Algorithms | SLM Playbook

Supervised Fine-Tuning (SFT) is the stepping stone that introduces domain knowledge and tone to a model, but it does not teach the model how to handle complex preference tradeoffs: identifying safe vs. toxic generation boundaries, adhering to output formats, or self-correcting logic errors during reasoning cycles. To ensure small models align with human intent, safety guidelines, and logical correctness, we execute a Preference Alignment phase. This article details the mechanics of reinforcement learning for LLM alignment. We compare the mathematical objectives of DPO and KTO, and dissect GRPO (Group Relative Policy Optimization), the breakthrough algorithm powering DeepSeek-R1, which removes the separate critic model that PPO requires and frees up over 50% of training memory. ...
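As a minimal sketch of two of the objectives named above, the code below computes the DPO loss for a single preference pair and the group-relative advantages that GRPO uses in place of a learned critic. It assumes the sequence log-probabilities and rewards are already available as plain floats; the function names, the `beta` default, and the use of the population standard deviation are illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-margin))


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the other samples drawn for the same
    prompt, so no separate value model is needed to provide a baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(reward - mu) / (sigma + eps) for reward in rewards]


# One prompt, four sampled completions scored 1.0 (correct) or 0.0 (incorrect).
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
neutral_loss = dpo_loss(-5.0, -5.0, -5.0, -5.0)  # policy identical to reference
better_loss = dpo_loss(-4.0, -6.0, -5.0, -5.0)   # policy now prefers the chosen answer
print(advantages, neutral_loss, better_loss)
```

When the policy equals the reference, the DPO loss sits at log 2 and falls as the policy shifts probability toward the chosen response. The GRPO baseline comes from the group mean, which is where the memory saving relative to a critic-based method originates.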

Optimizing vLLM Serving: AWQ, GPTQ, & GGUF | SLM Playbook

Successfully training and aligning a Small Language Model (SLM) is only half the battle. In enterprise environments, serving a model in production requires solving three major challenges: high request concurrency, low response latency, and minimized compute cost. To achieve this, we must master model compression (quantization) and high-performance serving configurations using vLLM, the state-of-the-art serving engine for LLMs. This final article in The SLM Playbook series compares the technical attributes of the AWQ, GPTQ, and GGUF quantization formats, details how to set up Dynamic LoRA serving to conserve VRAM, and outlines a resilient enterprise-grade serving architecture. ...
