# Model Quantization Guide

## Overview

FineTuneX supports several quantization methods for compressing a fine-tuned model further, reducing both memory usage and inference latency.

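## Supported Quantization Methods

| Method | Bits | Compression ratio | Speed | Accuracy | Use case |
|--------|------|-------------------|-------|----------|----------|
| **AWQ** | 4-bit | 4x | Fast | High | GPU inference |
| **GPTQ** | 4-bit | 4x | Medium | High | GPU inference |
| **GGUF** | 2-8 bit | 2-8x | Medium | Medium | CPU inference |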

## Quick Start

### 1. AWQ Quantization (Recommended)

```bash
# Install dependencies
pip install autoawq

# Run quantization
python examples/quantize_awq.py --model_path ./outputs/qwen3.5-0.8b-finetuned
```

### 2. GPTQ Quantization

```bash
# Install dependencies
pip install auto-gptq

# Run quantization
python examples/quantize_gptq.py --model_path ./outputs/qwen3.5-0.8b-finetuned
```

### 3. GGUF Quantization

```bash
# Run quantization (llama.cpp is cloned automatically)
python examples/quantize_gguf.py --model_path ./outputs/qwen3.5-0.8b-finetuned --quant_type Q4_K_M
```

## Detailed Usage

### Using the Quantization Script

```bash
# AWQ quantization
python scripts/quantize_model.py \
  --model_path ./outputs/qwen3.5-0.8b-finetuned \
  --method awq \
  --bits 4

# GPTQ quantization
python scripts/quantize_model.py \
  --model_path ./outputs/qwen3.5-0.8b-finetuned \
  --method gptq \
  --bits 4 \
  --group_size 128

# GGUF quantization
python scripts/quantize_model.py \
  --model_path ./outputs/qwen3.5-0.8b-finetuned \
  --method gguf \
  --quant_type Q4_K_M
```

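### Estimating the Quantized Size

```bash
# Estimate the size only, without running quantization
python scripts/quantize_model.py \
  --model_path ./outputs/qwen3.5-0.8b-finetuned \
  --estimate_only
```

Example output:
```
4-bit quantization:
  Original size: 3.50 GB
  Compression ratio: 4.0x
  Estimated size: 1.09 GB
  Space saved: 2.41 GB (68.8%)
```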
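The arithmetic behind such an estimate can be sketched as follows. This is an idealized calculation based only on the bit-width ratio; the formula used by `scripts/quantize_model.py` is not documented here, and `estimate_quantized_size` is a hypothetical helper, not part of FineTuneX. The sample output above reports a larger estimated size than this ratio alone gives, presumably because the script accounts for overhead that this sketch ignores.

```python
def estimate_quantized_size(original_gb: float,
                            source_bits: int = 16,
                            target_bits: int = 4) -> dict:
    """Idealized size estimate: scale the size by target_bits / source_bits.

    Ignores per-group scales, zero points, and any layers kept at
    higher precision, so real files are usually somewhat larger.
    """
    if original_gb <= 0 or source_bits <= 0 or target_bits <= 0:
        raise ValueError("sizes and bit widths must be positive")
    ratio = source_bits / target_bits
    estimated_gb = original_gb / ratio
    saved_gb = original_gb - estimated_gb
    return {
        "ratio": ratio,
        "estimated_gb": estimated_gb,
        "saved_gb": saved_gb,
        "saved_pct": 100.0 * saved_gb / original_gb,
    }


if __name__ == "__main__":
    result = estimate_quantized_size(3.5, source_bits=16, target_bits=4)
    print(f"Compression ratio: {result['ratio']:.1f}x")
    print(f"Estimated size:    {result['estimated_gb']:.2f} GB")
    print(f"Space saved:       {result['saved_gb']:.2f} GB ({result['saved_pct']:.1f}%)")
```

Treat the result as a lower bound when planning disk or memory budgets.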

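## Comparison of Quantization Methods

### AWQ (Activation-aware Weight Quantization)

**Pros**:
- ✅ Fast quantization
- ✅ Small accuracy loss
- ✅ Fast inference

**Cons**:
- ❌ Requires an extra dependency
- ❌ GPU only

**Use case**: Production environments that need fast inference

**Installation**:
```bash
pip install autoawq
```

**Usage**: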
```python
from finetunex.quantization import quantize_to_awq

quantize_to_awq(
    model_path="./outputs/qwen3.5-0.8b-finetuned",
    output_path="./outputs/qwen3.5-0.8b-awq",
    quantization_config={
        "w_bit": 4,
        "q_group_size": 128,
    }
)
```
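To make the `w_bit` and `q_group_size` settings above concrete, here is a minimal pure-Python sketch of group-wise round-to-nearest quantization: weights are split into groups of `q_group_size`, and each group is stored as `w_bit`-bit integer codes plus one scale and one minimum value. The function names are hypothetical and this is not the AutoAWQ implementation; in particular, AWQ also rescales important weight channels using activation statistics before this step, which the sketch omits.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = (1 << w_bit) - 1              # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax
    if scale == 0.0:                     # constant group: any scale works
        scale = 1.0
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [lo + c * scale for c in codes]


def quantize_groupwise(weights, w_bit=4, q_group_size=128):
    """Quantize a flat list of weights one group at a time."""
    return [
        quantize_group(weights[start:start + q_group_size], w_bit)
        for start in range(0, len(weights), q_group_size)
    ]


if __name__ == "__main__":
    weights = [((i * 37) % 101) / 101 - 0.5 for i in range(256)]
    groups = quantize_groupwise(weights, w_bit=4, q_group_size=128)
    codes, scale, lo = groups[0]
    restored = dequantize_group(codes, scale, lo)
    worst = max(abs(a - b) for a, b in zip(weights[:128], restored))
    print(f"groups: {len(groups)}, max error in first group: {worst:.4f}")
```

A smaller group size gives each scale fewer weights to cover, which lowers the rounding error at the cost of storing more scales.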

### GPTQ

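**Pros**:
- ✅ High accuracy
- ✅ Good compression ratio
- ✅ Strong community support

**Cons**:
- ❌ Slow quantization
- ❌ Requires calibration data

**Use case**: Scenarios with strict accuracy requirements

**Installation**:
```bash
pip install auto-gptq
```

**Usage**: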
```python
from finetunex.quantization import quantize_to_gptq

quantize_to_gptq(
    model_path="./outputs/qwen3.5-0.8b-finetuned",
    output_path="./outputs/qwen3.5-0.8b-gptq",
    quantization_config={
        "bits": 4,
        "group_size": 128,
    }
)
```

### GGUF

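**Pros**:
- ✅ Supports CPU inference
- ✅ Multiple quantization levels
- ✅ Mature ecosystem

**Cons**:
- ❌ Requires llama.cpp
- ❌ Limited GPU acceleration

**Use case**: Machines without a GPU, or edge devices

**Usage**: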
```python
from finetunex.quantization import quantize_to_gguf

quantize_to_gguf(
    model_path="./outputs/qwen3.5-0.8b-finetuned",
    output_path="./outputs/qwen3.5-0.8b-Q4_K_M.gguf",
    quantization_type="Q4_K_M"
)
```

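## GGUF Quantization Types

| Type | Size | Speed | Quality | Rating |
|------|------|-------|---------|--------|
| Q2_K | Smallest | Fastest | Lowest | ⭐⭐ |
| Q3_K_S | Small | Fast | Low | ⭐⭐⭐ |
| Q3_K_M | Small-medium | Fast | Medium | ⭐⭐⭐⭐ |
| Q4_K_S | Medium | Medium | Medium-high | ⭐⭐⭐⭐ |
| **Q4_K_M** | Medium | Medium | **High** | ⭐⭐⭐⭐⭐ |
| Q5_K_S | Medium-large | Medium | High | ⭐⭐⭐⭐ |
| Q5_K_M | Large | Medium-slow | Very high | ⭐⭐⭐⭐ |
| Q6_K | Large | Slow | Very high | ⭐⭐⭐ |
| Q8_0 | Largest | Slowest | Highest | ⭐⭐⭐ |

**Recommendation**: Use `Q4_K_M` for a good balance between quality and size.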
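The digit in each type name is its nominal bit width, which is why the table orders from `Q2_K` (smallest) to `Q8_0` (largest). The sketch below reads that digit from the name and uses it for a rough lower bound on file size. The helper names and the 0.8 billion parameter count are assumptions for illustration; real GGUF files are larger than this bound because each block also stores scales and some tensors are kept at higher precision.

```python
import re

# Matches the type names listed in the table above, e.g. Q4_K_M or Q8_0.
_QUANT_TYPE = re.compile(r"Q(\d)_(?:K(?:_[SML])?|[01])")


def nominal_bits(quant_type: str) -> int:
    """Return the nominal bits per weight encoded in a GGUF type name."""
    match = _QUANT_TYPE.fullmatch(quant_type)
    if match is None:
        raise ValueError(f"unrecognized quantization type: {quant_type!r}")
    return int(match.group(1))


def min_size_gb(n_params: float, quant_type: str) -> float:
    """Lower bound on file size in GB: parameters times nominal bits."""
    return n_params * nominal_bits(quant_type) / 8 / 1e9


def largest_type_within(n_params: float, budget_gb: float, candidates):
    """Pick the highest-bit candidate whose lower-bound size fits the budget."""
    fitting = [t for t in candidates if min_size_gb(n_params, t) <= budget_gb]
    return max(fitting, key=nominal_bits) if fitting else None


if __name__ == "__main__":
    types = ["Q2_K", "Q3_K_M", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]
    n_params = 0.8e9  # assumed parameter count for a 0.8B model
    for name in types:
        print(f"{name}: at least {min_size_gb(n_params, name):.2f} GB")
    print("fits in 0.45 GB:", largest_type_within(n_params, 0.45, types))
```

Because the bound is optimistic, leave headroom when choosing a type for a fixed memory budget.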

## Using the Quantized Model

### AWQ Model

```python
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM

# Load the quantized model
model = AutoAWQForCausalLM.from_quantized(
    "./outputs/qwen3.5-0.8b-awq",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./outputs/qwen3.5-0.8b-awq")

# Run inference; the inputs must live on the same device as the model
# (a CUDA GPU is assumed here, since AWQ is GPU only)
prompt = "你好"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### GPTQ Model

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Load the quantized model
model = AutoGPTQForCausalLM.from_quantized(
    "./outputs/qwen3.5-0.8b-gptq",
    device="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained("./outputs/qwen3.5-0.8b-gptq")

# Run inference; move the inputs to the same device as the model
prompt = "你好"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

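### GGUF Model

```bash
# Command-line inference
# (recent llama.cpp builds name this binary `llama-cli`, under build/bin/)
./llama.cpp/main -m ./outputs/qwen3.5-0.8b-Q4_K_M.gguf -p "你好" -n 512
```

```python
# Python inference
from llama_cpp import Llama

llm = Llama(model_path="./outputs/qwen3.5-0.8b-Q4_K_M.gguf")
output = llm("你好", max_tokens=100)
# The result is a completion dict; the generated text is in the first choice
print(output["choices"][0]["text"])
```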

## End-to-End Example

### 1. Fine-tune the Model

```bash
python examples/qwen3.5_0.8b_local_finetune.py
```

### 2. Check the Model Size

```bash
python scripts/quantize_model.py \
  --model_path ./outputs/qwen3.5-0.8b-finetuned \
  --show_info \
  --estimate_only
```

### 3. Quantize the Model

```bash
python examples/quantize_awq.py \
  --model_path ./outputs/qwen3.5-0.8b-finetuned
```

### 4. Test the Quantized Model

```bash
python scripts/inference.py \
  --model_path ./outputs/qwen3.5-0.8b-awq \
  --interactive
```
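The four steps above can also be chained from a single Python script. The sketch below only assembles the commands shown in this section and prints them by default; `build_pipeline` and `run_pipeline` are hypothetical helpers, and it assumes that `examples/quantize_awq.py` writes its output to `./outputs/qwen3.5-0.8b-awq`, as step 4 implies.

```python
import shlex
import subprocess


def build_pipeline(model_path: str, quantized_path: str):
    """Return the commands for fine-tuning, estimating, quantizing, and testing."""
    return [
        ["python", "examples/qwen3.5_0.8b_local_finetune.py"],
        ["python", "scripts/quantize_model.py",
         "--model_path", model_path, "--show_info", "--estimate_only"],
        ["python", "examples/quantize_awq.py", "--model_path", model_path],
        ["python", "scripts/inference.py",
         "--model_path", quantized_path, "--interactive"],
    ]


def run_pipeline(commands, dry_run: bool = True) -> None:
    """Print each command; execute it too unless dry_run is set."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    commands = build_pipeline(
        model_path="./outputs/qwen3.5-0.8b-finetuned",
        quantized_path="./outputs/qwen3.5-0.8b-awq",
    )
    run_pipeline(commands, dry_run=True)
```

Pass `dry_run=False` to actually run the steps; the last one is interactive and will wait for input.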

## Performance Comparison

### Example: Qwen3.5-0.8B

| Variant | Size | VRAM usage | Inference speed |
|---------|------|------------|-----------------|
| Original FP16 | 3.5 GB | ~7 GB | 100% |
| AWQ 4-bit | 1.1 GB | ~3 GB | 120% |
| GPTQ 4-bit | 1.0 GB | ~2.5 GB | 110% |
| GGUF Q4_K_M | 1.1 GB | N/A (CPU) | 80% |

*Inference speed is relative to the original FP16 model; higher is better.*

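## FAQ

### Q: Does quantization affect model performance?

A: Yes, but the effect is usually small. 4-bit quantization typically retains 95% or more of the original performance.

### Q: Which quantization method should I choose?

A: 
- **GPU available**: AWQ or GPTQ
- **No GPU**: GGUF
- **Speed is the priority**: AWQ
- **Accuracy is the priority**: GPTQ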
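The selection rules above can be written as a small helper. This is an illustrative sketch: `choose_method` is a hypothetical function, not part of FineTuneX, and its return values are chosen to match the `--method` values accepted by `scripts/quantize_model.py`.

```python
def choose_method(has_gpu: bool, priority: str = "speed") -> str:
    """Pick a quantization method from the FAQ's rules of thumb.

    priority is "speed" or "accuracy" and only matters when a GPU exists.
    """
    if not has_gpu:
        return "gguf"          # CPU or edge devices
    if priority == "speed":
        return "awq"
    if priority == "accuracy":
        return "gptq"
    raise ValueError(f"unknown priority: {priority!r}")


if __name__ == "__main__":
    for gpu, prio in [(True, "speed"), (True, "accuracy"), (False, "speed")]:
        print(f"has_gpu={gpu}, priority={prio} -> {choose_method(gpu, prio)}")
```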

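### Q: Can a quantized model be loaded directly?

A: No. A quantized model must be loaded with the library that matches its format; the loading code used for the original model will not work.

### Q: How long does quantization take?

A: 
- AWQ: 5-15 minutes
- GPTQ: 15-60 minutes
- GGUF: 10-30 minutes

The actual time depends on model size and hardware.

### Q: How much accuracy is lost?

A: 4-bit quantization typically costs 1-5% in accuracy, depending on the task and the quantization method.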

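## Installing Dependencies

```bash
# AWQ
pip install autoawq

# GPTQ
pip install auto-gptq

# GGUF (llama.cpp)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Recent llama.cpp versions build with CMake; older checkouts used `make`
cmake -B build
cmake --build build --config Release

# Python binding
pip install llama-cpp-python
```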

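## Best Practices

1. **Fine-tune first, then quantize**: Fine-tune the full-precision model, then quantize the result.
2. **Pick an appropriate quantization level**: 4-bit is usually the best trade-off.
3. **Test the result**: Evaluate the model after quantization.
4. **Keep the original model**: Retain the unquantized model so you can try other quantization methods.
5. **Use calibration data**: Supplying calibration data for GPTQ quantization can improve accuracy.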
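For practice 3, one common way to check a quantized model is to compare its perplexity with the original model's on the same held-out text. The guide does not prescribe a metric, so perplexity is an assumption here, and both helper functions are hypothetical. The inputs are per-token natural-log probabilities, which you would obtain by scoring the evaluation text with each model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities (lower is better)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def relative_degradation(baseline_ppl: float, quantized_ppl: float) -> float:
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


if __name__ == "__main__":
    # Made-up log-probabilities, for illustration only
    baseline = [-1.9, -2.1, -2.0, -2.2]
    quantized = [-2.0, -2.1, -2.1, -2.3]
    base_ppl = perplexity(baseline)
    quant_ppl = perplexity(quantized)
    print(f"baseline perplexity:  {base_ppl:.3f}")
    print(f"quantized perplexity: {quant_ppl:.3f}")
    print(f"relative degradation: {relative_degradation(base_ppl, quant_ppl):.1%}")
```

Perplexity measures language-modeling quality only; for a fine-tuned model, also evaluate on the downstream task it was tuned for.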

## Related Resources

- [AWQ paper](https://arxiv.org/abs/2306.00978)
- [GPTQ paper](https://arxiv.org/abs/2210.17323)
- [llama.cpp](https://github.com/ggerganov/llama.cpp)
- [AutoAWQ GitHub](https://github.com/casper-hansen/AutoAWQ)
- [AutoGPTQ GitHub](https://github.com/PanQiWei/AutoGPTQ)

---

**Last updated**: 2026-03-30
**Version**: 0.1.0