Opened Feb 11, 2025 by Adela Baine (@adelabaine0415)

Run DeepSeek R1 Locally - with all 671 Billion Parameters


Recently, I showed how to quickly run distilled versions of the DeepSeek R1 model locally. A distilled model is a compressed version of a larger language model, where knowledge from the larger model is transferred to a smaller one to reduce resource usage without losing too much performance. These models are based on the Llama and Qwen architectures and come in variants ranging from 1.5 to 70 billion parameters.

Some pointed out that this is not the REAL DeepSeek R1 and that it is impossible to run the full model locally without several hundred GB of memory. That sounded like a challenge - I thought!

First Attempt - Warming up with a 1.58 bit Quantized Version of DeepSeek R1 671b in Llama.cpp

The developers behind Unsloth dynamically quantized DeepSeek R1 so that it can run on as little as 130GB of memory while still benefiting from all 671 billion parameters.

A quantized LLM is an LLM whose parameters are stored in lower-precision formats (e.g., 8-bit or 4-bit instead of 16-bit). This significantly reduces memory use and speeds up processing, with minimal impact on quality. The full version of DeepSeek R1 uses 16-bit precision.
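The memory figures mentioned in this post can be sanity-checked with a quick back-of-envelope calculation. This is only illustrative: real GGUF files add metadata, and Unsloth's 1.58-bit quantization actually mixes several bit widths, so the numbers won't match the file sizes exactly.

```python
# Approximate weight storage for 671 billion parameters at
# different precisions (illustrative only; real model files
# include metadata and mixed-precision layers).
PARAMS = 671e9

def weight_gb(bits_per_param: float) -> float:
    """Approximate weight storage in GB for a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"16-bit  : {weight_gb(16):6.0f} GB")    # full precision
print(f"4-bit   : {weight_gb(4):6.0f} GB")     # Q4-style quantization
print(f"1.58-bit: {weight_gb(1.58):6.0f} GB")  # dynamic quantization
```

The 1.58-bit estimate lands close to the ~130GB Unsloth reports, and the 4-bit estimate is in the same ballpark as the 404GB file discussed later.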

The trade-off in accuracy is hopefully compensated for by increased speed.

I downloaded the files from this collection on Hugging Face and ran the following command with Llama.cpp.

The following table from Unsloth shows the recommended value for the n-gpu-layers parameter, which determines how much work can be offloaded to the GPU.

According to the table, I thought 7 would be the maximum, but I got it running with 12. According to Windows Task Manager my GPU has 40 GB of memory, not 24 as I thought. So then it adds up (7 / 24 * 40 ≈ 12).
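The arithmetic in the parenthesis can be checked directly, assuming (as the author does) that the recommended layer count scales linearly with available VRAM relative to the 24 GB the table is based on:

```python
# Scale the recommended n-gpu-layers value linearly from the VRAM
# assumed by the Unsloth table to the VRAM actually available.
def scaled_gpu_layers(recommended: int, table_vram_gb: float,
                      actual_vram_gb: float) -> int:
    return round(recommended / table_vram_gb * actual_vram_gb)

print(scaled_gpu_layers(7, 24, 40))  # -> 12
```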

If you prefer to run the model directly with Ollama, you must first merge the 3 GGUF files using Llama.cpp or a similar tool.

Above is some of the text that appears on screen while running the command.

Llama.cpp and the quantized model are heavy on memory and CPU, but they also use the GPU (a little) and continuously read from the disk where the model files are stored.

The model repeats the question and gets ready to begin reasoning.

The thinking, or reasoning, is quite lengthy - which also adds to the overall time it takes.

Okay, so I need to figure out Alan Turing's most important contribution to how we live our lives in 2025. Hmm. Let's start by recalling who Alan Turing was. He was a British mathematician and computer scientist, right? He's known for his work in code-breaking during World War II, especially with the Enigma machine. Oh, and he's also considered one of the fathers of theoretical computer science and artificial intelligence.

So, his contributions ... Well, the Turing Machine concept is essential in computer science. It's a theoretical model of a computer, which laid the foundation for modern computing. Then there's the Turing Test, which is about determining whether a machine can exhibit intelligent behavior indistinguishable from a human. That relates to AI. Also, his work on morphogenesis, which I think is about biological patterns and growth, like how organisms develop shapes and patterns.

Now, how do these contributions affect life in 2025? Well, we're in a world where AI is becoming more prevalent. So perhaps his work on the Turing Test is influencing how AI is developed and integrated into society. If the Turing Test is a benchmark for AI intelligence, then advances in AI that pass this test might be significant. But wait, the Turing Test is more of a philosophical benchmark. Maybe the actual algorithms or concepts he developed are more fundamental.

The Turing Machine model is the basis for the stored-program computer, which is the architecture of modern computers. All our computers today, including smartphones, servers, etc., are based on this concept. So, in 2025, even as technology advances, the fundamental principles he laid down would still be the basis for all computational devices. That's pretty huge. Without that, we wouldn't have the computers that power everything from the internet to AI algorithms.

Additionally, his work on code-breaking during WWII led to advances in cryptography. Modern encryption methods may have roots in his work, which is essential for secure communication in 2025. Everything from online banking to private messaging relies on encryption, so that's a big contribution.

Another angle is his influence on the development of algorithms. The concept of a universal computing machine allows for the creation of complex algorithms that drive AI, data analysis, machine learning, and so on. So, any AI systems in 2025 would stem from his foundational work.

Also, morphogenesis may have influenced fields like biology and genetics. Maybe in 2025, understanding biological patterns helps in areas like genetic engineering or synthetic biology. But I'm not sure how directly that ties into daily life compared to computing.

Putting it all together, the most important contribution may be the Turing Machine and the theory of computation, which underpin all modern computing technology. Without that, the digital age wouldn't exist as we know it. In 2025, even with advances, the basic framework of computation he provided would still be essential. Alternatively, his work on AI and the Turing Test might be more prominent in an AI-driven future, but the foundational computing theory seems more crucial.

Wait, but the Turing Test is more about evaluating AI rather than a contribution to technology itself. The real contribution is the theoretical basis for computing, which enabled all the digital tech we have. So, in 2025, all the smart devices, AI, cloud computing, and so on, are built on the principles he established. Therefore, his key contribution is the concept of a universal Turing machine, which is the basis for all modern computers, leading to the digital revolution that defines modern life.

And finally the answer.

Alan Turing's most significant contribution to how we live in 2025 is his foundational work in theoretical computer science, particularly the concept of the universal Turing machine. This theoretical model established the principles of modern computation, forming the basis for all digital computers. In 2025, this foundation underpins the entire digital world, enabling technologies like artificial intelligence, cloud computing, smartphones, and the internet. Without Turing's conceptual framework, the development of programmable computers and subsequent technologies that define modern life - from AI algorithms to secure communication systems - would not have been possible. His work on cryptography and the Turing Test also contributed to secure communications and AI development, but it is the universal Turing machine that remains the cornerstone of our digital existence.

In summary: the 1.58-bit quantized model generated 0.39 tokens per second. In total, it took about 37 minutes to answer the same question.
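Those two figures together give a rough idea of how long the reasoning-plus-answer output was (assuming the 37 minutes were mostly spent generating tokens):

```python
# Rough token count implied by the reported speed and duration.
tokens_per_second = 0.39
minutes = 37
total_tokens = tokens_per_second * minutes * 60
print(round(total_tokens))  # -> 866
```

So the lengthy reasoning section alone accounts for most of the wait.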

I was kind of surprised that I was able to run the model with only 32GB of RAM.
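A likely explanation (my assumption, not stated in the article) is that Llama.cpp memory-maps the GGUF file rather than reading it fully into RAM, so the OS pages weights in from disk only as they are touched. The constant disk reads observed above fit that picture. A minimal Python sketch of the same mechanism, using a made-up file name:

```python
import mmap
import os

# Create a file, then memory-map it and touch only a small slice:
# the OS pages in just the parts we access, which is how a process
# can work with a file larger than physical RAM.
path = "weights.bin"  # stand-in for a multi-GB GGUF file
with open(path, "wb") as f:
    f.seek(16 * 1024 * 1024 - 1)  # 16 MiB, mostly sparse
    f.write(b"\0")

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = mm[4096:4100]  # only this region is actually read
    print(len(chunk))      # -> 4
    mm.close()

os.remove(path)
```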

Second Attempt - DeepSeek R1 671b in Ollama

Ok, I get it, a quantized model of only 130GB isn't really the full model. Ollama's model library seems to include a full version of DeepSeek R1. It's 404GB with all 671 billion parameters - that should be real enough, right?

No, not really! The version hosted in Ollama's library is the 4-bit quantized version. See Q4_K_M in the screenshot above? It took me a while to notice!

With Ollama installed on my home PC, I just needed to clear 404GB of disk space and run the following command while grabbing a cup of coffee:

Okay, it took more than one coffee before the download was complete.

But finally, the download was done, and the excitement grew ... until this message appeared!

After a quick visit to an online store selling various kinds of memory, I concluded that my motherboard wouldn't support such large amounts of RAM anyway. But there must be alternatives?

Windows allows for virtual memory, meaning you can swap disk space for virtual (and rather slow) memory. I figured 450GB of additional virtual memory, on top of my 32GB of real RAM, should suffice.
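The sizing is easy to verify: physical RAM plus pagefile must at least cover the 404GB model, with some headroom left over for the KV cache and the operating system (the headroom figure is my calculation, not from the article):

```python
# Sanity check of the virtual-memory sizing for the 404 GB model.
model_gb = 404
ram_gb = 32
pagefile_gb = 450
headroom_gb = ram_gb + pagefile_gb - model_gb
assert headroom_gb > 0  # model fits in RAM + pagefile
print(f"Headroom: {headroom_gb} GB")  # -> Headroom: 78 GB
```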

Note: Be aware that SSDs have a limited number of write operations per memory cell before they wear out. Avoid excessive use of virtual memory if this concerns you.

A new attempt, and rising excitement ... before another error message!

This time, Ollama tried to push more of the Chinese language model into the GPU's memory than it could handle. After searching online, it seems this is a known issue, and the solution is to let the GPU rest and let the CPU do all the work.

Ollama uses a "Modelfile" containing configuration for the model and how it should be used. When using models directly from Ollama's model library, you usually don't deal with these files, as you must when downloading models from Hugging Face or similar sources.

I ran the following command to display the existing configuration for DeepSeek R1:

Then, I added the following line to the output and saved it in a new file called Modelfile:
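The exact line isn't preserved in this copy of the article, but a Modelfile that disables GPU offloading (which matches the "let the CPU do all the work" fix described above) would look roughly like this; the FROM line is illustrative, and `num_gpu` is Ollama's parameter for the number of layers sent to the GPU:

```
# Modelfile (illustrative): start from the stock configuration
# dumped by `ollama show`, then force CPU-only inference.
FROM deepseek-r1:671b
PARAMETER num_gpu 0
```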

I then created a new model configuration with the following command, where the last parameter is my name for the model, which now runs entirely without GPU usage:

Once again, the excitement grew as I nervously typed the following command:

Suddenly, it happened! No error message, but tokens started appearing! Admittedly, the speed was such that if you were quick enough, you could almost grab a coffee between each screen update - but it was working! A language model on par with the best from OpenAI, with all its 671 billion parameters, running on my three-year-old PC with 32GB of (real) RAM!

I had asked the same question to ChatGPT (4o, o1, o3-mini-high), DeepSeek R1 hosted in China, and DeepSeek R1 671b hosted locally.

This is the reasoning from DeepSeek R1 running in Ollama on my own computer, as it appeared slowly - token by token.

And finally this is the answer from DeepSeek R1 running in Ollama. The model outputs the answer in markdown format.

Alan Turing's most significant contribution to contemporary life, particularly as we look toward 2025, lies in his foundational work on computational theory and artificial intelligence, which underpin nearly every aspect of modern technology. Here's a structured breakdown:

1. Universal Turing Machine (1936):

- This conceptual model for general-purpose computing is the bedrock of all digital technology today. It established that a single machine could execute any computable task given the right program, enabling:
  - Modern computing devices (smartphones, cloud infrastructure, IoT).
  - Software development, enabling versatile applications across industries.
- In 2025, advances in quantum computing, AI, and edge computing still rely on Turing's principles of programmable logic.

2. Artificial Intelligence & the Turing Test (1950):

- His framework for assessing machine intelligence spurred decades of research into human-like AI. By 2025, this legacy is evident in:
  - Advanced chatbots, virtual assistants (e.g., GPT-4/5), and autonomous systems.
  - Ethical debates around AI decision-making in healthcare, finance, and law.

3. Cryptography & Security:

- Turing's WWII codebreaking work foreshadowed modern encryption techniques critical for:
  - Securing digital transactions, personal data, and blockchain technologies.
  - Combating cyber threats in an increasingly connected world.

Why the Universal Machine Stands Out: While AI and cryptography are transformative, Turing's theoretical model of computation is the most fundamental. It enabled the development of programmable systems that drive today's innovations - from AI algorithms to quantum computing research. Without this foundation, the digital infrastructure of 2025 simply wouldn't exist.

So, how long did it take, using the 4-bit quantized model? A long time! At 0.05 tokens per second - meaning 20 seconds per token - it took almost seven hours to get an answer to my question, including 35 minutes to load the model.
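Taking "almost seven hours" at face value, those numbers imply an answer of roughly a thousand tokens:

```python
# Implied output length: almost 7 hours total, of which 35 minutes
# were spent loading the model, at 0.05 tokens per second.
seconds_generating = 7 * 3600 - 35 * 60  # about 6h25m of generation
tokens = seconds_generating * 0.05
print(round(tokens))  # -> 1155
```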

While the model was thinking, the CPU, memory, and the disk (used as virtual memory) were close to 100% busy. The disk where the model file was stored was not busy during generation of the answer.

After some reflection, I thought maybe it's all right to wait a bit? Maybe we shouldn't ask language models about everything all the time? Perhaps we should think for ourselves first and be prepared to wait for an answer.

This may resemble how computers were used in the 1960s, when machines were large and access was very limited. You prepared your program on a stack of punch cards, which an operator loaded into the machine when it was your turn, and you could (if you were lucky) pick up the result the next day - unless there was an error in your program.

Compared with the responses from other LLMs, with and without reasoning

DeepSeek R1, hosted in China, thinks for 27 seconds before providing this answer, which is slightly shorter than my locally hosted DeepSeek R1's response.

ChatGPT answers similarly to DeepSeek but in a shorter format, with each model providing slightly different answers. The reasoning models from OpenAI spend less time thinking than DeepSeek.

That's it - it's definitely possible to run different quantized versions of DeepSeek R1 locally, with all 671 billion parameters - on a three-year-old computer with 32GB of RAM - as long as you're not in too much of a hurry!

If you really want the full, non-quantized version of DeepSeek R1, you can find it at Hugging Face. Please let me know your tokens/s (or rather seconds/token) if you get it running!