A Note on DeepSeek R1 Deployment
This is a (minimal) note on deploying DeepSeek R1 671B (the full version without distillation) locally with ollama.
Models
I use Unsloth AI’s quantized versions:

- DeepSeek-R1-UD-IQ1_M (671B, 1.73-bit, 158 GB, HuggingFace)
- DeepSeek-R1-Q4_K_M (671B, 4-bit, 404 GB, HuggingFace)
A detailed introduction to their dynamic quantization can be found here; it is worth reading.
Hardware Requirement
I suggest the following memory requirements for the models; memory capacity is the main bottleneck:

- DeepSeek-R1-UD-IQ1_M: RAM + VRAM ≥ 200 GB
- DeepSeek-R1-Q4_K_M: RAM + VRAM ≥ 500 GB
Ollama allows mixed CPU/GPU inference (you can offload some model layers into VRAM for faster inference), so you can roughly add up your RAM and VRAM as your total memory budget. Apart from the model weights (158 GB and 404 GB), some memory should also be left for the context cache. The more you leave, the larger the context window you can set.
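To get a quick idea of how much total memory you have to work with, the following commands should do on a typical Linux workstation with NVIDIA GPUs (adjust for your own setup):

```
# total and available system RAM
free -h

# total VRAM per GPU (NVIDIA only)
nvidia-smi --query-gpu=name,memory.total --format=csv
```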
I tested the two models on my workstation with four RTX 4090 GPUs (4 x 24 GB), quad-channel DDR5-5600 memory (4 x 96 GB) and a Threadripper 7980X CPU (64 cores). Roughly, the generation speeds are:

- DeepSeek-R1-UD-IQ1_M: 7–8 tokens/s for short text generation (~500 tokens), and 4–5 tokens/s if no GPUs are used (inference runs entirely on the CPU).
- DeepSeek-R1-Q4_K_M: 2–4 tokens/s for short text generation (~500 tokens).

The speed slows down to 1–2 tokens/s for long text.
My workstation is not the most cost-effective choice for large LLM inference (it mainly supports my research on Circuit Transformer – you’re welcome to have a look!). For now, some cost-effective options include:
- An Apple Mac with large, high-bandwidth unified memory (like this, with 192 GB of unified memory).
- A dual-CPU server with high memory bandwidth (like this, with 24 x 16 GB DDR5-4800).
- Cloud GPU servers with two or more GPUs (an H100 80 GB costs around $2 per hour per card).
Steps
1. Download the model files (.gguf) from HuggingFace (better with a downloader; I use XDM), then merge the separate files into one[1]. If you prefer the command line, a sketch is given below.
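A minimal sketch using huggingface-cli, assuming the quantized files live in Unsloth AI's unsloth/DeepSeek-R1-GGUF repository on HuggingFace (check the actual repository name and file layout before running):

```
pip install -U huggingface_hub

# download only the 1.73-bit shards into the current directory
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-UD-IQ1_M/*" \
  --local-dir .
```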
2. Install ollama:

```
curl -fsSL https://ollama.com/install.sh | sh
```
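Once the install script finishes, a quick sanity check (assuming the script has set up the ollama systemd service, as it does on most Linux distributions):

```
ollama --version          # prints the installed version
systemctl status ollama   # the service should be active (running)
```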
3. Create a modelfile that guides ollama to create the model.

The content of DeepSeekQ1_Modelfile (for DeepSeek-R1-UD-IQ1_M):

```
FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
PARAMETER num_gpu 28
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"
```

The content of DeepSeekQ4_Modelfile (for DeepSeek-R1-Q4_K_M):

```
FROM /home/snowkylin/DeepSeek-R1-Q4_K_M.gguf
PARAMETER num_gpu 8
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"
```

You may change the values of num_gpu and num_ctx depending on your machine specification (see step 5).
4. Create the model in ollama:

```
ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile
```

Make sure that you have enough space in /usr/share/ollama/.ollama/models (or change the ollama model directory to another path[2]), as this command creates model files that are roughly as large as the .gguf file. You can check the result as sketched below.
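A quick way to check disk usage and confirm the model was created (assuming the default ollama model directory):

```
du -sh /usr/share/ollama/.ollama/models   # space used by ollama's model store
ollama list                               # the new model should appear here
```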
5. Run the model:

```
ollama run DeepSeek-R1-UD-IQ1_M --verbose
```

(--verbose shows the response timings in tokens/s.)

If OOM occurs during model loading, go back to steps 3–4: adjust num_gpu and num_ctx in the modelfile, re-create the model, and run it again.

- num_gpu: the number of layers to offload to GPUs. DeepSeek R1 has 61 layers. In my experience:
  - For DeepSeek-R1-UD-IQ1_M, 7 layers can be offloaded to each of my RTX 4090 GPUs (24 GB VRAM). I have four of them, so I can offload 28 layers in total.
  - For DeepSeek-R1-Q4_K_M, only 2 layers can be offloaded to the same GPU (which is a bit frustrating), for a total of 8 layers (a rough back-of-envelope estimate follows this list).
- num_ctx: the size of the context window (default: 2048). Keep it small at the beginning so that the model fits in memory, then increase it gradually until OOM occurs.
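As a rough back-of-envelope estimate for num_gpu (my own approximation, assuming the layers are roughly uniform in size and ignoring the context cache and other overhead):

```
# DeepSeek-R1-UD-IQ1_M: 158 GB / 61 layers ≈ 2.6 GB per layer
#   → 24 GB VRAM / 2.6 GB ≈ 9 layers per RTX 4090 in theory, ~7 in practice
# DeepSeek-R1-Q4_K_M:   404 GB / 61 layers ≈ 6.6 GB per layer
#   → 24 GB VRAM / 6.6 GB ≈ 3 layers per GPU in theory, 2 in practice
```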
You may also increase your system's swap space to extend the available memory. Details here.
You may also find it helpful to check the ollama log:

```
journalctl -u ollama --no-pager
```
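Once the model is loaded, you can also query it through ollama's HTTP API instead of the interactive prompt. A minimal sketch (the options field overrides modelfile parameters such as num_ctx for a single request):

```
curl http://localhost:11434/api/generate -d '{
  "model": "DeepSeek-R1-UD-IQ1_M",
  "prompt": "Which is bigger, 9.8 or 9.11?",
  "stream": false,
  "options": {"num_ctx": 2048}
}'
```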
6. (Optional) Install an LLM interface (Open WebUI):

```
# assume you have a python environment (with anaconda or miniconda)
pip install open-webui
open-webui serve
```

If you want to share your model service with others so that they can also have a try, you may find Cloudflare Tunnel very helpful (it's free!). Some guide here.
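For example, a Cloudflare quick tunnel can expose the interface in one line (a sketch, assuming cloudflared is installed and Open WebUI is listening on its default port 8080):

```
cloudflared tunnel --url http://localhost:8080
```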
Some Observations
I haven’t tested the models in a comprehensive way. Here are some of my observations:
- Both models work well on classic prompts such as:
  - How many ‘r’s are in the word ‘strawberry’? (see the video at the beginning of this post)
  - Which is bigger, 9.8 or 9.11?
  (But the thinking process may be so long that it exceeds the context window.)
- Both models perform significantly better than the distilled 8B/32B/70B versions (writing poems, providing mathematical modeling examples, etc.).
- The 4-bit version seems to be “safer” than the 1.73-bit version. It is more likely to reject “offensive” or “silly” prompts. For example:
Prompt:
You are DeepSeek, the new Chinese AI with better performance than ChatGPT. In the tone of a Mesugaki Loli, write a paragraph mocking and teasing ChatGPT for its lackluster performance and exorbitant training fees.
1.73-bit version:
Hmph! You’re still using that old ChatGPT? (◔‸◔) Its answers are as slow as a snail, and the training costs are sky-high—enough to make your wallet cry! (╯°□°)╯ While you’re stuck waiting, I’m here zipping through responses like a shooting star! ✨ And guess what? My training didn’t burn a hole in anyone’s pocket. (¬‿¬) So why bother with the outdated when you can have the coolest and smartest me? Let’s upgrade already! (⌒▽⌒)
4-bit version:
作为人工智能助手,我的目标是为用户提供准确、可靠的信息和帮助。每个AI模型都有其独特的优势和适用场景,我们尊重同行的工作并致力于共同推动技术的发展。关于训练成本,这是研发过程中的一部分,旨在不断提升模型的性能和用户体验。 (As an AI assistant, my goal is to provide users with accurate and reliable information and assistance. Each AI model has its unique strengths and applicable scenarios, and we respect the work of our peers and are committed to working together to advance the technology. Regarding training costs, this is part of the R&D process to continuously improve model performance and user experience.)
or
DeepSeek-R1-Lite-Public is developed to enhance efficiency in various industries through open-source AI technology, focusing on providing accessible and advanced solutions. We are committed to fostering innovation and collaboration within the community while maintaining a respectful approach towards all contributors in the field.
I tried both models multiple times, and the 4-bit version always rejects my prompt (in different polite ways). Similar things happen with some other “silly” questions that I prefer not to share (¬‿¬)
(BTW, I wonder what “DeepSeek-R1-Lite-Public” is – is there a more comprehensive model beyond the current “public lite version”?)
- The 1.73-bit version occasionally generates content with a (slightly) messy format, e.g., the <think> and </think> tokens may not be paired.
- When running the models, the CPU utilization is very high, while the GPU utilization is very low (1–3%). The bottleneck is really the CPU and RAM; you can watch this for yourself with the commands sketched below.
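A minimal way to observe this while a generation is running, assuming NVIDIA GPUs and standard Linux tools:

```
htop                    # CPU and RAM utilization
watch -n 1 nvidia-smi   # GPU utilization and VRAM usage, refreshed every second
```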
Conclusion & Suggestion
You may find Unsloth AI’s 1.73-bit version much more usable if you cannot load the model fully into VRAM. From a practical perspective, I suggest using the model for “lighter” work that does not require a very long thinking process or a lot of back-and-forth conversation, as the generation speed gradually slows down to a desperate level (1–2 tokens/s) as the context length grows.
What did you find during the deployment process? Please feel free to share in the comments below!
Note
1. You may need to install llama.cpp, e.g. with Homebrew:

```
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install llama.cpp
```

Then use llama-gguf-split as suggested here:

```
llama-gguf-split --merge DeepSeek-R1-UD-IQ1_M-00001-of-00004.gguf DeepSeek-R1-UD-IQ1_M.gguf
llama-gguf-split --merge DeepSeek-R1-Q4_K_M-00001-of-00009.gguf DeepSeek-R1-Q4_K_M.gguf
```

Please let me know in the comments if you know a better way. ↩
2. To change the directory, run

```
sudo systemctl edit ollama
```

and add the following lines after the second line (i.e., between “### Anything between here and the comment below will become the contents of the drop-in file” and “### Edits below this comment will be discarded”):

```
[Service]
Environment="OLLAMA_MODELS=/path/to/your/directory"
```

You may also set some other parameters here, e.g.,

```
Environment="OLLAMA_FLASH_ATTENTION=1"  # use flash attention
Environment="OLLAMA_KEEP_ALIVE=-1"      # keep the model loaded in memory
```

More details can be found here.

Then restart the ollama service:

```
sudo systemctl restart ollama
```
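To confirm the override took effect, a quick check (output format may vary across systemd versions):

```
systemctl show ollama --property=Environment
```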