DeepSeek V3 vs. Mainstream Models: Performance Comparison
Data sources:
LiveBench benchmark: livebench.ai
Aider code editing benchmark: aider.chat/docs/leaderboards
Summary
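LiveBench overall benchmark
- DeepSeek V3 ranks 5th by overall score (Global Average 60.45), placing it in the top 5.
- Its coding score (61.77) is strong, behind only o1-2024-12-17-high (69.68), Claude 3.5 Sonnet 20241022 (67.13), and gemini-exp-1206 (63.41).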
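Aider code editing benchmark
- DeepSeek V3 completed 48.4% of tasks correctly, second only to o1-2024-12-17 (high) at 61.7%.
- It leads every other model listed, including Claude 3.5 Sonnet (45.3%), which it beats by 3.1 percentage points.
- It also used the correct edit format in 98.7% of cases.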
Detailed data comparison
LiveBench overall benchmark data
Model | Global Average | Reasoning | Coding | Mathematics | Data Analysis | Language | IF Average |
---|---|---|---|---|---|---|---|
o1-2024-12-17-high | 75.67 | 91.58 | 69.68 | 80.32 | 65.47 | 65.39 | 81.55 |
o1-preview-2024-09-12 | 65.79 | 67.42 | 50.85 | 65.49 | 67.69 | 68.72 | 74.60 |
gemini-exp-1206 | 64.09 | 57.00 | 63.41 | 72.36 | 63.16 | 51.29 | 77.34 |
gemini-2.0-flash-thinking-exp-1219 | 61.83 | 64.58 | 53.13 | 69.03 | 68.11 | 36.83 | 79.32 |
deepseek-v3 | 60.45 | 56.75 | 61.77 | 60.54 | 60.94 | 47.48 | 75.25 |
gemini-2.0-flash-exp | 59.26 | 59.08 | 54.36 | 60.39 | 61.67 | 38.22 | 81.86 |
claude-3-5-sonnet-20241022 | 59.03 | 56.67 | 67.13 | 52.28 | 55.03 | 53.76 | 69.30 |
claude-3-5-sonnet-20240620 | 58.74 | 57.17 | 60.85 | 54.32 | 58.87 | 53.21 | 68.01 |
o1-mini-2024-09-12 | 57.76 | 72.33 | 48.05 | 61.99 | 57.92 | 40.89 | 65.40 |
gemini-exp-1121 | 57.36 | 49.92 | 49.75 | 63.75 | 60.29 | 40.30 | 80.15 |
gpt-4o-2024-08-06 | 55.33 | 53.92 | 51.44 | 49.54 | 60.91 | 47.59 | 68.58 |
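The rankings quoted in the summary can be rechecked directly from the table. The sketch below copies the table's numbers into a dictionary and ranks models by any column. It also assumes that Global Average is the plain mean of the six category columns; that is not stated by the source, but the table's figures appear consistent with it to within rounding. The names `SCORES`, `rank`, and `category_mean` are illustrative only.

```python
# Scores copied from the LiveBench table above, in this column order.
COLUMNS = ("Global Average", "Reasoning", "Coding", "Mathematics",
           "Data Analysis", "Language", "IF Average")

SCORES = {
    "o1-2024-12-17-high": (75.67, 91.58, 69.68, 80.32, 65.47, 65.39, 81.55),
    "o1-preview-2024-09-12": (65.79, 67.42, 50.85, 65.49, 67.69, 68.72, 74.60),
    "gemini-exp-1206": (64.09, 57.00, 63.41, 72.36, 63.16, 51.29, 77.34),
    "gemini-2.0-flash-thinking-exp-1219": (61.83, 64.58, 53.13, 69.03, 68.11, 36.83, 79.32),
    "deepseek-v3": (60.45, 56.75, 61.77, 60.54, 60.94, 47.48, 75.25),
    "gemini-2.0-flash-exp": (59.26, 59.08, 54.36, 60.39, 61.67, 38.22, 81.86),
    "claude-3-5-sonnet-20241022": (59.03, 56.67, 67.13, 52.28, 55.03, 53.76, 69.30),
    "claude-3-5-sonnet-20240620": (58.74, 57.17, 60.85, 54.32, 58.87, 53.21, 68.01),
    "o1-mini-2024-09-12": (57.76, 72.33, 48.05, 61.99, 57.92, 40.89, 65.40),
    "gemini-exp-1121": (57.36, 49.92, 49.75, 63.75, 60.29, 40.30, 80.15),
    "gpt-4o-2024-08-06": (55.33, 53.92, 51.44, 49.54, 60.91, 47.59, 68.58),
}


def rank(model, column):
    """Return the 1-based rank of `model` when sorting by `column`, highest first."""
    idx = COLUMNS.index(column)
    ordered = sorted(SCORES, key=lambda name: SCORES[name][idx], reverse=True)
    return ordered.index(model) + 1


def category_mean(model):
    """Mean of the six category scores (every column except Global Average).

    Assumption: Global Average is an unweighted mean of these columns.
    """
    categories = SCORES[model][1:]
    return sum(categories) / len(categories)


if __name__ == "__main__":
    print("Global rank:", rank("deepseek-v3", "Global Average"))
    print("Coding rank:", rank("deepseek-v3", "Coding"))
    for name, row in SCORES.items():
        print(f"{name}: reported {row[0]:.2f}, "
              f"mean of categories {category_mean(name):.2f}")
```

Ranking by a single column this way makes it easy to see where a model's overall position differs from its position in one category, such as Coding.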
Aider code editing benchmark
Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
---|---|---|---|---|
o1-2024-12-17 (high) | 61.7% | 91.5% | aider --model openrouter/openai/o1 | diff |
DeepSeek Chat V3 Preview | 48.4% | 98.7% | aider --model deepseek/deepseek-chat | diff |
claude-3-5-sonnet-20241022 | 45.3% | 100.0% | aider --model claude-3-5-sonnet-20241022 | diff |
gemini-exp-1206 | 38.2% | 98.2% | aider --model gemini/gemini-exp-1206 | whole |
o1-mini-2024-09-12 | 32.9% | 96.9% | aider --model o1-mini | whole |
claude-3-5-haiku-20241022 | 28.0% | 91.1% | aider --model claude-3-5-haiku-20241022 | diff |
gemini-2.0-flash-exp | 22.2% | 100.0% | aider --model gemini/gemini-2.0-flash-exp | whole |
DeepSeek Chat V2.5 | 17.8% | 92.9% | aider --model deepseek/deepseek-chat | diff |
gpt-4o-2024-11-20 | 15.1% | 96.0% | aider --model gpt-4o-2024-11-20 | diff |
yi-lightning | 12.9% | 92.9% | aider --model openai/yi-lightning | whole |
Qwen2.5-Coder-32B-Instruct | 8.0% | 71.6% | aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct | diff |
gpt-4o-mini-2024-07-18 | 3.6% | 100.0% | aider --model gpt-4o-mini-2024-07-18 | whole |
Note: the tables use color highlighting to mark the two main models being compared:
- Blue: DeepSeek V3
- Green: Claude 3.5 Sonnet
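The gaps between models in the Aider results can be recomputed from the completion rates in the table. The following is a minimal sketch that copies a few rows by hand and reports differences in percentage points; the names `AIDER` and `gap` are hypothetical helpers for illustration, not part of aider.

```python
# (percent completed correctly, percent using correct edit format),
# copied from a subset of rows in the Aider table above.
AIDER = {
    "o1-2024-12-17 (high)": (61.7, 91.5),
    "DeepSeek Chat V3 Preview": (48.4, 98.7),
    "claude-3-5-sonnet-20241022": (45.3, 100.0),
    "gemini-exp-1206": (38.2, 98.2),
    "DeepSeek Chat V2.5": (17.8, 92.9),
}


def gap(model_a, model_b):
    """Completion-rate lead of model_a over model_b, in percentage points."""
    # Round to one decimal to match the table's precision and hide float noise.
    return round(AIDER[model_a][0] - AIDER[model_b][0], 1)


if __name__ == "__main__":
    v3 = "DeepSeek Chat V3 Preview"
    for other in AIDER:
        if other != v3:
            print(f"{v3} vs {other}: {gap(v3, other):+.1f} points")
```

Reporting the difference in percentage points, rather than as a relative change, keeps the comparison on the same scale as the table itself.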