声称自己比llama.cpp快的ktransformers

stevessr · 2025 年2 月 10 日 16:16

wwow · 2025 年2 月 10 日 16:25

进来看看

6512345 · 2025 年2 月 10 日 16:26

感谢分享

pandamao · 2025 年2 月 10 日 21:51

感谢分享，不知道有没有试试

yqyan · 2025 年2 月 10 日 22:03

可以支持 24GB 显存加 382GB 内存跑 671B 的 4bit 量化的 Deepseek V3 / R1，有佬友试过吗

handsome · 2025 年2 月 11 日 01:41

感谢分享。

org · 2025 年2 月 11 日 01:42

百花齐放

xiaomage · 2025 年2 月 11 日 04:28

这玩意怎么用？

stevessr · 2025 年2 月 11 日 04:41

kvcache-ai/ktransformers: A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
……

总结

🚀 Quick Start

Preparation

Some preparation:

CUDA 12.1 and above, if you didn’t have it yet, you may install from here.

# Adding CUDA to PATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_PATH=/usr/local/cuda

Linux-x86_64 with gcc, g++ and cmake

sudo apt-get update
sudo apt-get install gcc g++ cmake ninja-build

We recommend using Conda to create a virtual environment with Python=3.11 to run our program.

conda create --name ktransformers python=3.11
conda activate ktransformers # you may need to run ‘conda init’ and reopen shell first

Make sure that PyTorch, packaging, ninja is installed
```
pip install torch packaging ninja cpufeature numpy
```

Installation

Use a Docker image, see documentation for Docker
You can install using Pypi (for linux):
```
pip install ktransformers --no-build-isolation
```
for windows we prepare a pre compiled whl package in ktransformers-0.1.1+cu125torch24avx2-cp311-cp311-win_amd64.whl, which require cuda-12.5, torch-2.4, python-3.11, more pre compiled package are being produced.
Or you can download source code and compile:
- init source code
```
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule init
git submodule update
```
- [Optional] If you want to run with website, please compile the website before execute bash install.sh
- Compile and install (for Linux)
```
bash install.sh
```
- Compile and install(for Windows)
```
install.bat
```
If you are developer, you can make use of the makefile to compile and format the code.
the detailed usage of makefile is here

Local Chat

We provide a simple command-line local chat Python script that you can run for testing.

Note that this is a very simple test tool only support one round chat without any memory about last input, if you want to try full ability of the model, you may go to RESTful API and Web UI. We use the DeepSeek-V2-Lite-Chat-GGUF model as an example here. But we also support other models, you can replace it with any other model that you want to test.

Run Example

# Begin from root of your cloned repo!
# Begin from root of your cloned repo!!
# Begin from root of your cloned repo!!! 

# Download mzwing/DeepSeek-V2-Lite-Chat-GGUF from huggingface
mkdir DeepSeek-V2-Lite-Chat-GGUF
cd DeepSeek-V2-Lite-Chat-GGUF

wget https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/resolve/main/DeepSeek-V2-Lite-Chat.Q4_K_M.gguf -O DeepSeek-V2-Lite-Chat.Q4_K_M.gguf

cd .. # Move to repo's root dir

# Start local chat
python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF

# If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try：
# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite
# python  ktransformers.local_chat --model_path ./DeepSeek-V2-Lite --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF

It features the following arguments:

--model_path (required): Name of the model (such as “deepseek-ai/DeepSeek-V2-Lite-Chat” which will automatically download configs from Hugging Face). Or if you already got local files you may directly use that path to initialize the model.

Note: .safetensors files are not required in the directory. We only need config files to build model and tokenizer.
--gguf_path (required): Path of a directory containing GGUF files which could that can be downloaded from Hugging Face. Note that the directory should only contains GGUF of current model, which means you need one separate directory for each model.
--optimize_rule_path (required except for Qwen2Moe and DeepSeek-V2): Path of YAML file containing optimize rules. There are two rule files pre-written in the ktransformers/optimize/optimize_rules directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
--max_new_tokens: Int (default=1000). Maximum number of new tokens to generate.
--cpu_infer: Int (default=10). The number of CPUs used for inference. Should ideally be set to the (total number of cores - 2).

Suggested Model

Model Name	Model Size	VRAM	Minimum DRAM	Recommended DRAM
DeepSeek-V2-q4_k_m	133G	11G	136G	192G
DeepSeek-V2.5-q4_k_m	133G	11G	136G	192G
DeepSeek-V2.5-IQ4_XS	117G	10G	107G	128G
Qwen2-57B-A14B-Instruct-q4_k_m	33G	8G	34G	64G
DeepSeek-V2-Lite-q4_k_m	9.7G	3G	13G	16G
Mixtral-8x7B-q4_k_m	25G	1.6G	51G	64G
Mixtral-8x22B-q4_k_m	80G	4G	86.1G	96G
InternLM2.5-7B-Chat-1M	15.5G	15.5G	8G(32K context)	150G (1M context)

More will come soon. Please let us know which models you are most interested in.

Be aware that you need to be subject to their corresponding model licenses when using DeepSeek and QWen.

Click To Show how to run other examples

Qwen2-57B

pip install flash_attn # For Qwen2

mkdir Qwen2-57B-GGUF && cd Qwen2-57B-GGUF

wget https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/resolve/main/qwen2-57b-a14b-instruct-q4_k_m.gguf?download=true -O qwen2-57b-a14b-instruct-q4_k_m.gguf

cd ..

python -m ktransformers.local_chat --model_name Qwen/Qwen2-57B-A14B-Instruct --gguf_path ./Qwen2-57B-GGUF

# If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try：
# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct
# python  ktransformers/local_chat.py --model_path ./Qwen2-57B-A14B-Instruct --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF

DeepseekV2

mkdir DeepSeek-V2-Chat-0628-GGUF && cd DeepSeek-V2-Chat-0628-GGUF
# Download weights
wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf
wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf
wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf
wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf

cd ..

python -m ktransformers.local_chat --model_name deepseek-ai/DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF

# If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try：

# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat-0628

# python -m ktransformers.local_chat --model_path ./DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF

model name	weights download link
Qwen2-57B	Qwen2-57B-A14B-gguf-Q4K-M
DeepseekV2-coder	DeepSeek-Coder-V2-Instruct-gguf-Q4K-M
DeepseekV2-chat	DeepSeek-V2-Chat-gguf-Q4K-M
DeepseekV2-lite	DeepSeek-V2-Lite-Chat-GGUF-Q4K-M

RESTful API and Web UI

Start without website:

ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002

Start with website:

ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF  --port 10002 --web True

Or you want to start server with transformers, the model_path should include safetensors

ktransformers --type transformers --model_path /mnt/data/model/Qwen2-0.5B-Instruct --port 10002 --web True

Access website with url http://localhost:10002/web/index.html#/chat :

More information about the RESTful API server can be found here. You can also find an example of integrating with Tabby here.

📃 Brief Injection Tutorial

WyInnovate · 2025 年2 月 12 日 15:12

我也想问在想要不要试试

cs328902 · 2025 年2 月 12 日 15:14

刚刚在 B 站看到有人跑，好像每秒七八个 token 吧。

WyInnovate · 2025 年2 月 12 日 15:14

vllm 是不是比 llama.cpp 和 ollama 好？

WyInnovate · 2025 年2 月 12 日 15:15

有点慢加大显存能更快吗？好奇

cs328902 · 2025 年2 月 12 日 15:26

我这里是没条件部署的，不过即便这个方案，一般人其实也真没那个条件。单显卡或许还好，那么多的内存只能用服务器机柜了。

简单看了一下，是把模型部分加载到内存和显存里，通过程序判断问题是该通过哪个回答，加大显存应该可以让更多的问题使用显卡回复从而提升速度。

WyInnovate · 2025 年2 月 12 日 15:31

感谢佬回复谢谢

yyy2024 · 2025 年2 月 12 日 15:44

模型还得是moe这种才行，蒸馏的70b都不能这样搞

snprintf · 2025 年2 月 14 日 12:02

没搞懂。不能用ollama下载下来的模型文件吗？

stevessr · 2025 年2 月 14 日 12:02

可以用gguf

melvinleee · 2025 年2 月 14 日 12:08

我就差这350G内存了

stevessr · 2025 年2 月 14 日 12:09

我也差，还有5090……

话题		回复	浏览量
Ollama跑的671B DeepSeek R1也会截断！搞七捻三 ollama , DeepSeek , 人工智能 , 纯水	33	1289	2025 年2 月 13 日
DeepSeek 新手上路 (二) 模型自部署文档共建人工智能	50	2190	2025 年2 月 19 日
今天买了台新PC，本地部署了Open WebUI+Ollama 搞七捻三 ollama , 人工智能 , OpenWebUI , 纯水	107	2427	2025 年1 月 30 日
手搓Ktransformer运行Deepseek-r1:671b_Q2_K_XS 开发调优 DeepSeek , 人工智能	48	895	2025 年2 月 26 日
本地安装部署图形化界面Deepseek模型（2060显卡可运行的），较稳定不易运行出错，目前有什么方式吗？开发调优人工智能 , 快问快答	83	790	2025 年2 月 1 日