
Distillation with Reasoning: Can DeepSeek R1 Teach Better Than Humans?


- Including reasoning "chains of thought" (CoT) in a model's output considerably improves its quality, but it also increases inference cost.
- Distillation transfers reasoning knowledge from an expensive teacher model to a more cost-effective student, reducing overall inference cost.
- DeepSeek R1 can produce detailed CoT, making it an excellent teacher model.
- Synthetic data generated by DeepSeek R1 may outperform data produced by human experts.

Introduction

The recent release of DeepSeek R1 has taken the AI community by storm, offering performance on par with leading frontier models, such as OpenAI's o1, at a fraction of the cost. Still, R1 can be expensive for use cases with high traffic or low-latency requirements.

DeepSeek R1's strength lies in its explicit step-by-step reasoning. Before producing a final answer, it generates an internal "chain of thought" (CoT) to systematically reason through each problem. This process is a form of test-time computation, allowing the model to dynamically allocate more compute to complex problems. However, these extended reasoning sequences typically increase inference cost.

Distillation

Distillation is an approach for transferring knowledge from a large, more powerful teacher model to a smaller, more affordable student model. According to the DeepSeek R1 paper, R1 is highly effective in this teacher role. Its detailed CoT sequences help the student model learn to break down complex tasks into smaller, more manageable steps.

Comparing Distillation to Human-Labeled Data

Although fine-tuning with human-labeled data can produce specialized models, gathering both final answers and their corresponding reasoning steps is expensive. Distillation scales more easily: instead of relying on human annotations, the teacher model automatically generates the training data for the student.

A Side Note on Terminology

The term "distillation" can refer to different techniques:

Distribution Distillation: Aligns the student model's output distribution with the teacher's using Kullback-Leibler divergence (KL-divergence). Works best when both models share the same architecture, tokenizer, and pre-training data.
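For concreteness, here is a minimal PyTorch sketch of a KL-divergence distillation objective; the function name, temperature value, and the Hinton-style T² scaling are illustrative choices, not a prescribed recipe.

```python
# Minimal sketch of distribution (logit) distillation, assuming teacher and
# student share the same tokenizer and vocabulary. Names are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL-divergence between softened teacher and student distributions."""
    # Soften both distributions with a temperature > 1.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature**2
```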

Data Distillation: Uses the teacher model to generate completions for a set of prompts. Fine-tunes the student model using a standard cross-entropy loss on these generated outputs, skipping the KL-divergence term. Allows the teacher and student to be different model families and tokenizers (though if the teacher uses specialized tokens like __, it can be helpful for both models to recognize them).
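A minimal sketch of the data-distillation workflow follows. The checkpoint below is a small stand-in chosen so the sketch runs locally; in practice the teacher would be DeepSeek R1, typically served behind an inference API.

```python
# Minimal sketch of data distillation: the teacher writes completions, and
# the student is later fine-tuned on them with plain cross-entropy.
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in teacher for the sketch
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

def synthesize(prompt: str, max_new_tokens: int = 512) -> str:
    """Ask the teacher for a completion (CoT plus answer) to one prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

prompts = ["A bakery sells 12 muffins per tray. How many muffins in 7 trays?"]
# (prompt, completion) pairs become the student's supervised training set; no
# KL term means the student may use a different tokenizer and architecture.
dataset = [{"prompt": p, "completion": synthesize(p)} for p in prompts]
```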

In this post, we focus on data distillation because it supports a wider variety of student-teacher pairs.

Data Generation

Training data is often a bottleneck in model development. In a recent post (include link), we explored how to generate labels by combining model output with a verification function. Distillation takes a different approach, using a teacher model to synthesize missing completions.

DeepSeek R1 stands out because it not only provides final answers but also exposes its detailed chain of thought, unlike other reasoning models that keep this internal process hidden. If your dataset includes ground-truth answers, you can identify high-quality synthetic CoTs through rejection sampling, selecting only the best chains to further improve your fine-tuned model. Rejection sampling can remove incorrect data examples either by comparing the generated data against ground-truth labels or by applying a user-defined validation function. From an interface perspective, the validation function resembles the verifiable reward function used by value-model-free RL methods like those described in our recent blog post.
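As an illustration, here is a minimal sketch of that rejection-sampling step, assuming numeric final answers; extract_answer, reject_sample, and the sample_fn argument are hypothetical helpers, not part of any particular library.

```python
# Minimal sketch of rejection sampling over synthetic CoTs: sample several
# chains per problem, keep only those whose final answer matches ground truth.
import re

def extract_answer(completion: str):
    """Pull the last number out of a completion as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def reject_sample(problem: str, ground_truth: str, sample_fn, k: int = 8):
    """Return only the sampled CoTs whose answers validate against the label."""
    candidates = [sample_fn(problem) for _ in range(k)]
    return [c for c in candidates if extract_answer(c) == ground_truth]
```

Here sample_fn could simply be the synthesize helper from the earlier data-distillation sketch, and extract_answer could be swapped for any user-defined validation function.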

Case Study: GSM8K

GSM8K (Grade School Math 8K) is a dataset of 8.5K diverse grade-school math word problems. Each data point includes:

1. A problem description.
2. A human expert's chain of thought.
3. The final answer.
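For reference, each record can be inspected with the Hugging Face datasets library; in the published dataset, the answer field stores the human CoT followed by "#### <final answer>".

```python
# Minimal sketch of loading and splitting one GSM8K record.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main", split="train")
example = gsm8k[0]
question = example["question"]                # problem description
cot, final = example["answer"].split("####")  # human CoT, then final answer
print(question, cot.strip(), final.strip(), sep="\n---\n")
```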

We expanded this dataset by adding:

Synthetic R1 reasoning, i.e., the CoT generated by DeepSeek R1.

Then, we fine-tuned three variants of the model (using LoRA on llama-3.1-8B-instruct), each with a different training target (a minimal LoRA setup is sketched after the note below):

- Direct Answer Only: Generate the final answer without showing reasoning.
- Human Expert CoT: Generate the final answer along with a reasoning chain resembling the human expert's.
- Synthetic R1 CoT: Generate the final answer together with DeepSeek R1's synthetic reasoning chain.

The table below summarizes average accuracy and reasoning length:

- Note: The accuracy for the 5-shot baseline may differ from numbers reported elsewhere due to different evaluation setups. The key focus is on comparing relative performance across distillation methods, not on beating other models.
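For reference, here is a minimal sketch of the kind of LoRA setup used for each variant, via peft and transformers; the rank, alpha, dropout, and target modules are assumed hyperparameters, not the ones used in this study.

```python
# Minimal sketch of attaching LoRA adapters to a Llama-3.1-8B Instruct base
# (a gated checkpoint; access must be granted on the Hugging Face Hub).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", device_map="auto")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,          # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# Train `model` on the chosen target (direct answer, human CoT, or R1 CoT)
# with a standard causal-LM cross-entropy loss.
```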

From this study, synthetic reasoning CoTs from DeepSeek R1 appear superior to human-expert CoTs in improving performance, albeit at a higher inference cost due to their greater length.

Fireworks AI Inference and Fine-Tuning Platform

DeepSeek R1 is available on the Fireworks AI platform. An easy-to-use distillation interface will soon be part of FireOptimizer. If you need earlier access, please get in touch to explore options.

Conclusions

By incorporating reasoning-based data through distillation, organizations can significantly improve model performance without bearing the full burden of human-annotated datasets. DeepSeek R1's ability to produce long, high-quality reasoning chains makes it a powerful teacher model, showing that, in some cases, the machine might just out-teach the human.
