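Model Quantization Guide

Overview

FineTuneX supports several model quantization methods that further compress a fine-tuned model, reducing GPU memory usage and inference latency.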

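Supported Quantization Methods

Method	Bits	Compression ratio	Speed	Accuracy	Use case
AWQ	4-bit	4x	Fast	High	GPU inference
GPTQ	4-bit	4x	Medium	High	GPU inference
GGUF	2-8 bit	2-8x	Medium	Medium	CPU inference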

Quick Start

1. AWQ quantization (recommended)

# Install the dependency
pip install autoawq

# Run quantization
python examples/quantize_awq.py --model_path ./outputs/qwen3.5-0.8b-finetuned

2. GPTQ quantization

# Install the dependency
pip install auto-gptq

# Run quantization
python examples/quantize_gptq.py --model_path ./outputs/qwen3.5-0.8b-finetuned

3. GGUF quantization

# Run quantization (clones llama.cpp automatically)
python examples/quantize_gguf.py --model_path ./outputs/qwen3.5-0.8b-finetuned --quant_type Q4_K_M

Detailed Usage

Using the quantization script

# AWQ quantization
python scripts/quantize_model.py \
  --model_path ./outputs/qwen3.5-0.8b-finetuned \
  --method awq \
  --bits 4

# GPTQ quantization
python scripts/quantize_model.py \
  --model_path ./outputs/qwen3.5-0.8b-finetuned \
  --method gptq \
  --bits 4 \
  --group_size 128

# GGUF quantization
python scripts/quantize_model.py \
  --model_path ./outputs/qwen3.5-0.8b-finetuned \
  --method gguf \
  --quant_type Q4_K_M

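Estimating the quantized size

# Estimate the size only; do not run quantization
python scripts/quantize_model.py \
  --model_path ./outputs/qwen3.5-0.8b-finetuned \
  --estimate_only

Sample output:

4-bit quantization:
  Original size: 3.50 GB
  Compression ratio: 4.0x
  Estimated size: 1.09 GB
  Space saved: 2.41 GB (68.8%)

Note that 4.0x is the nominal ratio (16 bits down to 4). The estimated size in this sample works out to an effective ratio of about 3.2x (3.50 / 1.09).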
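
The arithmetic behind an estimate like this can be sketched as follows. This is a minimal sketch, not FineTuneX's actual estimator: the function name `estimate_quantized_size` and the `overhead_bits` parameter are hypothetical, and the 16-bit (FP16) baseline is an assumption. Quantized files store scales and zero points alongside the packed weights, so the effective ratio tends to fall short of the nominal one.

```python
def estimate_quantized_size(original_gb: float, bits: int,
                            overhead_bits: float = 0.0,
                            source_bits: int = 16) -> dict:
    """Estimate on-disk size after quantizing from `source_bits` to `bits`.

    `overhead_bits` is a hypothetical per-weight allowance for metadata
    such as scales and zero points; it is an assumption, not a value
    taken from FineTuneX.
    """
    effective_bits = bits + overhead_bits
    estimated_gb = original_gb * effective_bits / source_bits
    saved_gb = original_gb - estimated_gb
    return {
        "nominal_ratio": source_bits / bits,
        "effective_ratio": source_bits / effective_bits,
        "estimated_gb": estimated_gb,
        "saved_gb": saved_gb,
        "saved_pct": 100.0 * saved_gb / original_gb,
    }


if __name__ == "__main__":
    # Assumed overhead of 1 bit per weight, chosen only for illustration.
    report = estimate_quantized_size(3.50, bits=4, overhead_bits=1.0)
    for key, value in report.items():
        print(f"{key}: {value:.2f}")
```

With an assumed overhead of 1 bit per weight, a 3.50 GB model comes out at about 1.09 GB with 68.75% saved, which is close to the sample output. That agreement follows from the chosen assumption and does not confirm how the script computes its estimate.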

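Comparison of Quantization Methods

AWQ (Activation-aware Weight Quantization)

Pros:

✅ Fast quantization
✅ Small accuracy loss
✅ Fast inference

Cons:

❌ Requires an extra dependency
❌ GPU only

Use case: production environments that need fast inference

Installation:

pip install autoawq

Usage:

from finetunex.quantization import quantize_to_awq

quantize_to_awq(
    model_path="./outputs/qwen3.5-0.8b-finetuned",
    output_path="./outputs/qwen3.5-0.8b-awq",
    quantization_config={
        "w_bit": 4,            # weight bit width
        "q_group_size": 128,   # weights per quantization group
    }
)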

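GPTQ

Pros:

✅ High accuracy
✅ Good compression ratio
✅ Strong community support

Cons:

❌ Slow quantization
❌ Requires calibration data

Use case: scenarios with strict accuracy requirements

Installation:

pip install auto-gptq

Usage:

from finetunex.quantization import quantize_to_gptq

quantize_to_gptq(
    model_path="./outputs/qwen3.5-0.8b-finetuned",
    output_path="./outputs/qwen3.5-0.8b-gptq",
    quantization_config={
        "bits": 4,           # weight bit width
        "group_size": 128,   # weights per quantization group
    }
)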

GGUF

Pros:

✅ Supports CPU inference
✅ Multiple quantization levels
✅ Mature ecosystem

Cons:

❌ Requires llama.cpp
❌ Limited GPU acceleration

Use case: machines without a GPU, or edge devices

Usage:

from finetunex.quantization import quantize_to_gguf

quantize_to_gguf(
    model_path="./outputs/qwen3.5-0.8b-finetuned",
    output_path="./outputs/qwen3.5-0.8b-Q4_K_M.gguf",
    quantization_type="Q4_K_M"
)

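GGUF quantization types

Type	Size	Speed	Quality	Rating
Q2_K	Smallest	Fastest	Lowest	⭐⭐
Q3_K_S	Small	Fast	Low	⭐⭐⭐
Q3_K_M	Medium-small	Fast	Medium	⭐⭐⭐⭐
Q4_K_S	Medium	Medium	Medium-high	⭐⭐⭐⭐
Q4_K_M	Medium	Medium	High	⭐⭐⭐⭐⭐
Q5_K_S	Medium-large	Medium	High	⭐⭐⭐⭐
Q5_K_M	Large	Medium-slow	Very high	⭐⭐⭐⭐
Q6_K	Large	Slow	Very high	⭐⭐⭐
Q8_0	Largest	Slowest	Highest	⭐⭐⭐

Recommendation: use Q4_K_M for a balance between quality and size.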

Using the Quantized Model

AWQ model

from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM

# Load the quantized model
model = AutoAWQForCausalLM.from_quantized(
    "./outputs/qwen3.5-0.8b-awq",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./outputs/qwen3.5-0.8b-awq")

# Inference: move the inputs to the GPU, where the AWQ model runs
prompt = "你好"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

GPTQ model

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Load the quantized model
model = AutoGPTQForCausalLM.from_quantized(
    "./outputs/qwen3.5-0.8b-gptq",
    device="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained("./outputs/qwen3.5-0.8b-gptq")

# Inference: move the inputs to the same device as the model
prompt = "你好"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

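GGUF model

# Command-line inference
# (recent llama.cpp builds name this binary llama-cli instead of main)
./llama.cpp/main -m ./outputs/qwen3.5-0.8b-Q4_K_M.gguf -p "你好" -n 512

# Python inference
from llama_cpp import Llama

llm = Llama(model_path="./outputs/qwen3.5-0.8b-Q4_K_M.gguf")
output = llm("你好", max_tokens=100)
# output is a completion dict, not a plain string
print(output)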
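
In llama-cpp-python, calling `llm(...)` returns a completion dictionary rather than a plain string, so `print(output)` shows the whole structure. A minimal sketch of pulling out just the generated text follows; the helper name `completion_text` is hypothetical, and the sample dictionary is hand-written for illustration, not real model output.

```python
from typing import Any, Dict


def completion_text(completion: Dict[str, Any]) -> str:
    """Return the generated text of the first choice in a completion dict."""
    choices = completion.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    return choices[0]["text"].strip()


if __name__ == "__main__":
    # Hand-written stand-in for the dict returned by llm("你好", max_tokens=100).
    fake_completion = {
        "object": "text_completion",
        "choices": [
            {"text": " Hello! How can I help?", "index": 0, "finish_reason": "stop"},
        ],
    }
    print(completion_text(fake_completion))
```

With a real model, the same helper would be called as `completion_text(llm("你好", max_tokens=100))`.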

End-to-End Example

1. Fine-tune the model

python examples/qwen3.5_0.8b_local_finetune.py

2. Check the model size

python scripts/quantize_model.py \
  --model_path ./outputs/qwen3.5-0.8b-finetuned \
  --show_info \
  --estimate_only

3. Quantize the model

python examples/quantize_awq.py \
  --model_path ./outputs/qwen3.5-0.8b-finetuned

4. Test the quantized model

python scripts/inference.py \
  --model_path ./outputs/qwen3.5-0.8b-awq \
  --interactive
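
The four steps above can be chained from a single script. The following is a minimal sketch under stated assumptions: the helper names `build_pipeline` and `run_pipeline` are hypothetical, the commands and flags are copied from the steps above, and the scripts are assumed to be run from the repository root. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List

MODEL_DIR = "./outputs/qwen3.5-0.8b-finetuned"
AWQ_DIR = "./outputs/qwen3.5-0.8b-awq"


def build_pipeline(model_dir: str = MODEL_DIR,
                   awq_dir: str = AWQ_DIR) -> List[List[str]]:
    """Return the four workflow commands, in order, as argument lists."""
    return [
        ["python", "examples/qwen3.5_0.8b_local_finetune.py"],
        ["python", "scripts/quantize_model.py", "--model_path", model_dir,
         "--show_info", "--estimate_only"],
        ["python", "examples/quantize_awq.py", "--model_path", model_dir],
        ["python", "scripts/inference.py", "--model_path", awq_dir,
         "--interactive"],
    ]


def run_pipeline(dry_run: bool = True) -> None:
    """Print each command; also execute it unless dry_run is set."""
    for cmd in build_pipeline():
        print("$", shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Stopping at the first failure matters here because each step consumes the output directory of the previous one.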

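Performance Comparison

Qwen3.5-0.8B example

Version	Size	GPU memory	Inference speed
Original FP16	3.5 GB	~7 GB	100%
AWQ 4-bit	1.1 GB	~3 GB	120%
GPTQ 4-bit	1.0 GB	~2.5 GB	110%
GGUF Q4_K_M	1.1 GB	CPU	80%

Inference speed is relative to the original FP16 model; higher is faster.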
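
The size column translates directly into disk savings. A minimal sketch of that arithmetic, using only the sizes listed in the table; the function name `size_reduction_pct` is hypothetical.

```python
# Sizes in GB, copied from the table above.
SIZES_GB = {
    "FP16": 3.5,
    "AWQ 4bit": 1.1,
    "GPTQ 4bit": 1.0,
    "GGUF Q4_K_M": 1.1,
}


def size_reduction_pct(baseline_gb: float, quantized_gb: float) -> float:
    """Percentage of disk space saved relative to the baseline."""
    return 100.0 * (1.0 - quantized_gb / baseline_gb)


if __name__ == "__main__":
    baseline = SIZES_GB["FP16"]
    for name, size in SIZES_GB.items():
        if name == "FP16":
            continue
        ratio = baseline / size
        print(f"{name}: {size_reduction_pct(baseline, size):.1f}% smaller, "
              f"{ratio:.2f}x effective compression")
```

For these figures the effective compression lands between roughly 3.2x and 3.5x, below the nominal 4x of 4-bit quantization.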

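FAQ

Q: Does quantization affect model performance?

A: Somewhat, but the effect is usually small. 4-bit quantization typically retains 95% or more of the original performance.

Q: Which quantization method should I choose?

With a GPU: AWQ or GPTQ
Without a GPU: GGUF
Prioritizing speed: AWQ
Prioritizing accuracy: GPTQ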
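
These rules are simple enough to encode directly. A minimal sketch, assuming a hypothetical helper named `choose_method`; the returned strings match the `--method` values accepted by `scripts/quantize_model.py` as shown earlier in this guide.

```python
def choose_method(has_gpu: bool, priority: str = "speed") -> str:
    """Pick a quantization method following the guidance above.

    priority: "speed" or "accuracy"; only consulted when a GPU is available.
    """
    if not has_gpu:
        return "gguf"
    if priority == "speed":
        return "awq"
    if priority == "accuracy":
        return "gptq"
    raise ValueError(f"unknown priority: {priority!r}")


if __name__ == "__main__":
    print(choose_method(has_gpu=True, priority="accuracy"))
    print(choose_method(has_gpu=False))
```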

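Q: Can the quantized model be loaded directly?

A: No. It must be loaded with the library that matches its quantization method; it cannot be loaded the same way as the original model.

Q: How long does quantization take?

AWQ: 5-15 minutes
GPTQ: 15-60 minutes
GGUF: 10-30 minutes

The time depends on model size and hardware.

Q: How much accuracy does quantization lose?

A: 4-bit quantization typically loses 1-5% accuracy, depending on the task and the quantization method.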
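
Whether a particular model stays inside that range can be checked by scoring the original and quantized models on the same evaluation set. A minimal sketch of the comparison follows; the function names are hypothetical, the metric is assumed to be higher-is-better, and the scores in the usage example are made up for illustration.

```python
def retention(baseline_score: float, quantized_score: float) -> float:
    """Fraction of the baseline score kept by the quantized model."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return quantized_score / baseline_score


def within_budget(baseline_score: float, quantized_score: float,
                  max_loss_pct: float = 5.0) -> bool:
    """True if the relative loss does not exceed max_loss_pct percent."""
    loss_pct = 100.0 * (1.0 - retention(baseline_score, quantized_score))
    return loss_pct <= max_loss_pct


if __name__ == "__main__":
    # Made-up scores: 0.80 before quantization, 0.77 after.
    print(f"retention: {retention(0.80, 0.77):.4f}")
    print("within 5% budget:", within_budget(0.80, 0.77))
```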

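Installing Dependencies

# AWQ
pip install autoawq

# GPTQ
pip install auto-gptq

# GGUF (llama.cpp)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
# Recent llama.cpp versions build with CMake instead:
#   cmake -B build && cmake --build build --config Release

# Python binding
pip install llama-cpp-python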
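
After installing, it can help to confirm which backends are actually importable. A minimal sketch, assuming a hypothetical helper named `available_backends`; the module names are the ones imported in the loading examples in this guide.

```python
from importlib.util import find_spec

# Import names taken from the loading examples in this guide.
BACKENDS = {
    "awq": "awq",          # installed by `pip install autoawq`
    "gptq": "auto_gptq",   # installed by `pip install auto-gptq`
    "gguf": "llama_cpp",   # installed by `pip install llama-cpp-python`
}


def available_backends() -> dict:
    """Map each method to True/False depending on whether its module is importable."""
    return {method: find_spec(module) is not None
            for method, module in BACKENDS.items()}


if __name__ == "__main__":
    for method, ok in available_backends().items():
        print(f"{method}: {'installed' if ok else 'missing'}")
```

`find_spec` only locates the module without importing it, so the check stays fast and does not touch the GPU.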

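Best Practices

Fine-tune first, then quantize: fine-tune the full-precision model, then quantize the result.
Choose an appropriate quantization level: 4-bit is usually the best trade-off.
Test the quantized model: evaluate model performance after quantization.
Keep the original model: retain it so you can try other quantization methods later.
Use calibration data: representative calibration data improves accuracy for GPTQ quantization.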
