Saw everyone on the forum today saying V3 is out, so I ran a round of LiveBench locally.
Results below:
model | average | reasoning | coding | math | data_analysis | language | if | company |
---|---|---|---|---|---|---|---|---|
o1-2024-12-17-high | 75.67 | 91.58 | 69.69 | 80.32 | 65.47 | 65.39 | 81.55 | OpenAI |
o1-preview-2024-09-12 | 65.79 | 67.42 | 50.85 | 65.49 | 67.69 | 68.72 | 74.60 | OpenAI |
gemini-exp-1206 | 64.09 | 57.00 | 63.41 | 72.36 | 63.16 | 51.29 | 77.34 | Google |
deepseek-v3 | 61.97 | 53.3 | 62.1 | 61.9 | 58.6 | 52.9 | 83.0 | DeepSeek |
gemini-2.0-flash-thinking-exp-1219 | 61.83 | 64.58 | 53.13 | 69.03 | 68.11 | 36.83 | 79.32 | Google |
gemini-2.0-flash-exp | 59.26 | 59.08 | 54.36 | 60.39 | 61.67 | 38.22 | 81.86 | Google |
claude-3-5-sonnet-20241022 | 59.03 | 56.67 | 67.13 | 52.28 | 55.03 | 53.76 | 69.30 | Anthropic |
claude-3-5-sonnet-20240620 | 58.74 | 57.17 | 60.85 | 54.32 | 58.87 | 53.21 | 68.01 | Anthropic |
o1-mini-2024-09-12 | 57.76 | 72.33 | 48.05 | 61.99 | 57.92 | 40.89 | 65.40 | OpenAI |
gemini-exp-1121 | 57.36 | 49.92 | 49.75 | 63.75 | 60.29 | 40.30 | 80.15 | Google |
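
For anyone wondering how the average column relates to the rest: it looks like the plain mean of the six category scores. Here is a quick sketch that checks this on three rows copied from the table. The `category_mean` helper and the `rows` dict are just names made up for this check, not part of LiveBench.

```python
# Rows copied from the table above: (reported average, six category scores
# in the order reasoning, coding, math, data_analysis, language, if).
rows = {
    "o1-2024-12-17-high": (75.67, [91.58, 69.69, 80.32, 65.47, 65.39, 81.55]),
    "gemini-exp-1206": (64.09, [57.00, 63.41, 72.36, 63.16, 51.29, 77.34]),
    "deepseek-v3": (61.97, [53.3, 62.1, 61.9, 58.6, 52.9, 83.0]),
}


def category_mean(scores):
    """Unweighted mean of the category scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


for model, (reported, scores) in rows.items():
    # Print reported vs. recomputed so any mismatch is easy to spot.
    print(f"{model}: reported={reported:.2f} recomputed={category_mean(scores):.2f}")
```

The recomputed values match the reported ones for these rows, so the deepseek-v3 average lines up with the other entries even though its category scores are only given to one decimal.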