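Hopefully the price doesn't go up too much.
I found some LiveBench numbers on Reddit. If the data is accurate:
[Image: screenshot of the original Reddit post]
LiveBench ranking (assuming the data above is real)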
| Model | Global Average | Reasoning Average | Coding Average | Mathematics Average | Data Analysis Average | Language Average | IF Average |
|---|---|---|---|---|---|---|---|
| o1-2024-12-17-high | 75.67 | 91.58 | 69.69 | 80.32 | 65.47 | 65.39 | 81.55 |
| o1-preview-2024-09-12 | 65.79 | 67.42 | 50.85 | 65.49 | 67.69 | 68.72 | 74.6 |
| gemini-exp-1206 | 64.09 | 57 | 63.41 | 72.36 | 63.16 | 51.29 | 77.34 |
| gemini-2.0-flash-thinking-exp-1219 | 61.83 | 64.58 | 53.13 | 69.03 | 68.11 | 36.83 | 79.32 |
| Deepseek-V3 | 60.4 | 50 | 63.4 | 60 | 57.7 | 50.2 | 80.9 |
| gemini-2.0-flash-exp | 59.26 | 59.08 | 54.36 | 60.39 | 61.67 | 38.22 | 81.86 |
| claude-3-5-sonnet-20241022 | 59.03 | 56.67 | 67.13 | 52.28 | 55.03 | 53.76 | 69.3 |
| claude-3-5-sonnet-20240620 | 58.74 | 57.17 | 60.85 | 54.32 | 58.87 | 53.21 | 68.01 |
| o1-mini-2024-09-12 | 57.76 | 72.33 | 48.05 | 61.99 | 57.92 | 40.89 | 65.4 |
| gemini-exp-1121 | 57.36 | 49.92 | 49.75 | 63.75 | 60.29 | 40.3 | 80.15 |
| gpt-4o-2024-08-06 | 55.33 | 53.92 | 51.44 | 49.54 | 60.91 | 47.59 | 68.58 |
| gpt-4o-2024-05-13 | 54.41 | 49.67 | 50 | 46.98 | 61.57 | 50.05 | 68.21 |
| gemini-1.5-pro-002 | 54.33 | 49.08 | 48.8 | 59.07 | 54.97 | 43.29 | 70.78 |
| grok-2-1212 | 54.3 | 54.83 | 46.44 | 54.88 | 54.45 | 45.58 | 69.63 |
| gemini-1.5-pro-exp-0827 | 53.29 | 50.92 | 41.43 | 58.5 | 53.5 | 46.15 | 69.26 |
| meta-llama-3.1-405b-instruct-turbo | 52.36 | 53.25 | 42.65 | 41.05 | 55.85 | 45.46 | 75.9 |
| gpt-4o-2024-11-20 | 52.19 | 55.75 | 46.08 | 42.87 | 56.15 | 47.37 | 64.94 |
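For anyone who wants to sanity-check where Deepseek-V3 lands, here is a minimal sketch. It assumes the Global Average is just the unweighted mean of the six category scores; that is my assumption, not something the Reddit post states, though it does reproduce the Global Average values listed for these rows. Only four rows are copied in, and the names `scores` and `global_average` are made up for illustration.

```python
# Category scores copied from the ranking (order: Reasoning, Coding,
# Mathematics, Data Analysis, Language, IF). Only a few rows are included.
scores = {
    "o1-2024-12-17-high": [91.58, 69.69, 80.32, 65.47, 65.39, 81.55],
    "gemini-exp-1206": [57, 63.41, 72.36, 63.16, 51.29, 77.34],
    "Deepseek-V3": [50, 63.4, 60, 57.7, 50.2, 80.9],
    "claude-3-5-sonnet-20241022": [56.67, 67.13, 52.28, 55.03, 53.76, 69.3],
}


def global_average(category_scores):
    """Unweighted mean of the category scores (assumed aggregation rule)."""
    return sum(category_scores) / len(category_scores)


# Recompute each model's global average, then sort from highest to lowest.
averages = {model: global_average(s) for model, s in scores.items()}
ranking = sorted(averages, key=averages.get, reverse=True)

for rank, model in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {averages[model]:.2f}")
```

Among these four rows, the recomputed order matches the ranking: Deepseek-V3 sits behind gemini-exp-1206 and ahead of claude-3-5-sonnet-20241022.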
9 likes
Dhudean
(Dhudean)
7
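Found it. Wow, this is seriously strong; it actually beats Claude.
2 likes
Looking forward to the API pricing. I wonder how well DeepSeek managed to keep costs under control this time.
1 like
You're probably talking about the official DeepSeek site. I mean SiliconFlow (https://siliconflow.cn). I still have credit there, so I usually call the API through SiliconFlow.
When will it show up on LiveBench? I'd like to see how it ranks on each capability.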
ljoker
(西楚霸王霸天虎)
18
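SiliconFlow probably won't offer it for direct use. They didn't offer Llama 405B for direct use either.
Found one on Reddit, not sure whether it's real: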
| All Groups | Score |
|---|---|
| Average | 60.4 |
| Reasoning | 50.0 |
| Coding | 63.4 |
| Mathematics | 60.0 |
| Data Analysis | 57.7 |
| Language | 50.2 |
| Instruction Following | 80.9 |
2 likes
685B? Llama topped out at 405B and Mistral-Large is only 123B. This looks very expensive to run; the whole approach is brute force through sheer scale.