Opened Feb 11, 2025 by Madge Lantz@madgelantz031

Understanding DeepSeek R1


DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model in many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.

What makes DeepSeek-R1 particularly exciting is its openness. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until around GPT-4, the conventional wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
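To make the price gap concrete, here is a small back-of-the-envelope sketch using only the per-million-token prices quoted above. The workload size (2M input tokens, 1M output tokens) is a made-up example, and the upper R1 input price is used to stay conservative.

```python
# Prices in USD per million tokens, taken from the figures quoted in the text.
R1_INPUT_HIGH, R1_OUTPUT = 0.55, 2.19
O1_INPUT, O1_OUTPUT = 15.00, 60.00


def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD of a workload, given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 1M output tokens.
r1 = request_cost(2_000_000, 1_000_000, R1_INPUT_HIGH, R1_OUTPUT)
o1 = request_cost(2_000_000, 1_000_000, O1_INPUT, O1_OUTPUT)
print(f"R1: ${r1:.2f}  o1: ${o1:.2f}  ratio: {o1 / r1:.1f}x")
```

Keep in mind that reasoning models spend many output tokens on thinking, so the output price tends to dominate real bills.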

The Essentials

The DeepSeek-R1 paper presented multiple models, but the main ones are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 uses two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they work through a chain of thought before answering. For the R1 series of models, this takes the form of thinking within a <think> tag, before answering with a final summary.
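If you consume raw completions yourself, you usually want to separate the thinking from the final summary. The sketch below assumes the thinking is wrapped in `<think>...</think>` tags and that everything after the closing tag is the answer; the function name `split_reasoning` is my own, not part of any DeepSeek tooling.

```python
import re

# Assumed layout: "<think>reasoning</think>final answer".
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(completion: str) -> tuple[str, str]:
    """Return (reasoning, final_answer) from a raw completion.

    If no <think> block is found, the reasoning is empty and the whole
    completion is treated as the answer.
    """
    match = THINK_RE.search(completion)
    if match is None:
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer


raw = "<think>2 + 2 is 4, times 3 is 12.</think>\nThe answer is 12."
reasoning, answer = split_reasoning(raw)
```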

R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is interesting how some languages may express certain ideas better, which leads the model to choose the most expressive language for the task.

Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they created such strong reasoning models, and what you can expect from each phase. This includes the problems that the resulting models from each phase have, and how they solved them in the next phase.

It's interesting that their training pipeline differs from the usual one:

- The usual training strategy: Pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
- R1-Zero: Pretrained → RL
- R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages

1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.

They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
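The rejection sampling step is easy to picture in code: sample several completions per prompt from the RL checkpoint and keep only the ones that pass a check. This is a minimal sketch of the idea, not DeepSeek's actual data pipeline. `generate` and `is_correct` are hypothetical callables standing in for the checkpoint sampler and the filter, and the toy versions below exist only so the example runs without a model.

```python
import random
from typing import Callable


def rejection_sample(
    prompts_with_answers: list[tuple[str, str]],
    generate: Callable[[str], str],          # hypothetical: samples from the RL checkpoint
    is_correct: Callable[[str, str], bool],  # hypothetical: rule-based or model-based filter
    samples_per_prompt: int = 4,
) -> list[dict]:
    """Sample several completions per prompt and keep only those that pass the check."""
    kept = []
    for prompt, reference in prompts_with_answers:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)
            if is_correct(completion, reference):
                kept.append({"prompt": prompt, "completion": completion})
    return kept


# Toy stand-ins so the sketch runs without a model.
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return rng.choice(["4", "5"])


def toy_is_correct(completion: str, reference: str) -> bool:
    return completion.strip() == reference


sft_data = rejection_sample([("What is 2 + 2?", "4")], toy_generate, toy_is_correct)
```

The surviving rows become SFT examples, which is how a model that is good at reasoning but messy in its output can bootstrap cleaner training data for the next stage.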

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. The teacher is typically a larger model than the student.

Group Relative Policy Optimization (GRPO)

The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected <think>/<answer> formatting, and if the language of the answer matches that of the prompt. Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
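A rule-based reward of this kind can be a few lines of string matching. The sketch below assumes a `<think>...</think>` block followed by an `<answer>...</answer>` block and uses exact string match for correctness. The weights (0.5 for format, 1.0 for accuracy) are arbitrary illustration values, not the ones DeepSeek used, and the language-consistency check is left out because it would need a language detector.

```python
import re

# Assumed layout: <think>...</think> followed by <answer>...</answer>.
FORMAT_RE = re.compile(
    r"^\s*<think>.+?</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def rule_based_reward(completion: str, reference: str) -> float:
    """Format reward (0.5) plus accuracy reward (1.0); weights are illustrative."""
    match = FORMAT_RE.match(completion)
    if match is None:
        return 0.0  # wrong format, so the answer cannot be extracted either
    reward = 0.5
    if match.group(1).strip() == reference.strip():
        reward += 1.0
    return reward


good = rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4")
wrong = rule_based_reward("<think>2 + 2 = 5</think><answer>5</answer>", "4")
unformatted = rule_based_reward("4", "4")
```

Note that the unformatted completion scores zero even though it is correct, which is exactly the pressure that teaches the model to use the tags.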

GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates several different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes slight adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.
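Steps 3 and 4 can be sketched numerically. The first function normalizes each reward against its own group, which is what replaces the critic. The second is the PPO-style clipped term applied to one response. This is a simplified illustration under my own assumptions: it works on whole responses rather than per token, uses the population standard deviation, and leaves out the KL penalty.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate for one response: min(ratio * A, clip(ratio) * A).

    `ratio` is new_policy_prob / old_policy_prob for that response.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four responses to the same prompt, scored by a rule-based reward.
rewards = [1.5, 0.5, 0.0, 1.5]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average get a positive advantage and are reinforced; the clip keeps any single update from moving the policy too far.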

A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the <think> syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.

Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the methodologies presented in the paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!

Running DeepSeek-R1

I've used DeepSeek-R1 through the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 through Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.

671B via llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this configuration.
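A rough size estimate shows why only part of the model can sit on the GPU. The sketch below treats every weight as exactly 1.58 bits, which is a simplification (a dynamic quant keeps some layers at higher precision), and assumes the 80 GB H100 variant.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant

size = weight_size_gb(671e9, 1.58)
fits_on_gpu = size < H100_MEMORY_GB
print(f"~{size:.0f} GB of weights; fits entirely on GPU: {fits_on_gpu}")
```

Since the weights alone exceed GPU memory under these assumptions, the remaining layers run from system RAM on the CPU, and the KV-cache and activations also compete for the GPU memory that is left.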

Performance:

An r/localllama user described that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup. Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't really usable for any serious work, but it's fun to run these large models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than other models, but their usefulness is also usually higher. We need to both maximize usefulness and minimize time-to-usefulness.

70B via Ollama

70.6B params, 4-bit K_M quantized DeepSeek-R1 running via Ollama:

GPU usage shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.

Resources

- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
- DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
- The Illustrated DeepSeek-R1 - by Jay Alammar
- Explainer: What's R1 & Everything Else? - Tim Kellogg
- DeepSeek R1 Explained to your grandma - YouTube

DeepSeek

- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.

Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1 (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
