The 685B DeepSeek V3 is up on Hugging Face, it beats Sonnet on the Aider leaderboard, and LiveBench numbers seem to be out too


:melting_face: Hope the price doesn't go up too much

Found some LiveBench numbers on Reddit, if the data is accurate:

[Screenshot of the original Reddit post]

LiveBench ranking (assuming the data above is accurate)

| Model | Global Average | Reasoning Average | Coding Average | Mathematics Average | Data Analysis Average | Language Average | IF Average |
|---|---|---|---|---|---|---|---|
| o1-2024-12-17-high | 75.67 | 91.58 | 69.69 | 80.32 | 65.47 | 65.39 | 81.55 |
| o1-preview-2024-09-12 | 65.79 | 67.42 | 50.85 | 65.49 | 67.69 | 68.72 | 74.6 |
| gemini-exp-1206 | 64.09 | 57 | 63.41 | 72.36 | 63.16 | 51.29 | 77.34 |
| gemini-2.0-flash-thinking-exp-1219 | 61.83 | 64.58 | 53.13 | 69.03 | 68.11 | 36.83 | 79.32 |
| Deepseek-V3 | 60.4 | 50 | 63.4 | 60 | 57.7 | 50.2 | 80.9 |
| gemini-2.0-flash-exp | 59.26 | 59.08 | 54.36 | 60.39 | 61.67 | 38.22 | 81.86 |
| claude-3-5-sonnet-20241022 | 59.03 | 56.67 | 67.13 | 52.28 | 55.03 | 53.76 | 69.3 |
| claude-3-5-sonnet-20240620 | 58.74 | 57.17 | 60.85 | 54.32 | 58.87 | 53.21 | 68.01 |
| o1-mini-2024-09-12 | 57.76 | 72.33 | 48.05 | 61.99 | 57.92 | 40.89 | 65.4 |
| gemini-exp-1121 | 57.36 | 49.92 | 49.75 | 63.75 | 60.29 | 40.3 | 80.15 |
| gpt-4o-2024-08-06 | 55.33 | 53.92 | 51.44 | 49.54 | 60.91 | 47.59 | 68.58 |
| gpt-4o-2024-05-13 | 54.41 | 49.67 | 50 | 46.98 | 61.57 | 50.05 | 68.21 |
| gemini-1.5-pro-002 | 54.33 | 49.08 | 48.8 | 59.07 | 54.97 | 43.29 | 70.78 |
| grok-2-1212 | 54.3 | 54.83 | 46.44 | 54.88 | 54.45 | 45.58 | 69.63 |
| gemini-1.5-pro-exp-0827 | 53.29 | 50.92 | 41.43 | 58.5 | 53.5 | 46.15 | 69.26 |
| meta-llama-3.1-405b-instruct-turbo | 52.36 | 53.25 | 42.65 | 41.05 | 55.85 | 45.46 | 75.9 |
| gpt-4o-2024-11-20 | 52.19 | 55.75 | 46.08 | 42.87 | 56.15 | 47.37 | 64.94 |

9 likes
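To make it easier to see where Deepseek-V3 would land in each column, here is a rough Python sketch that ranks it against the rows quoted in this thread. The Deepseek-V3 figures are the unverified Reddit numbers, and the names `ROWS`, `COLUMNS`, `parse`, and `rank_of` are just illustrative helpers, not part of LiveBench.

```python
# Rows copied from the table in this thread.
# The Deepseek-V3 row is the unverified Reddit figure.
ROWS = """\
o1-2024-12-17-high 75.67 91.58 69.69 80.32 65.47 65.39 81.55
o1-preview-2024-09-12 65.79 67.42 50.85 65.49 67.69 68.72 74.6
gemini-exp-1206 64.09 57 63.41 72.36 63.16 51.29 77.34
gemini-2.0-flash-thinking-exp-1219 61.83 64.58 53.13 69.03 68.11 36.83 79.32
Deepseek-V3 60.4 50 63.4 60 57.7 50.2 80.9
gemini-2.0-flash-exp 59.26 59.08 54.36 60.39 61.67 38.22 81.86
claude-3-5-sonnet-20241022 59.03 56.67 67.13 52.28 55.03 53.76 69.3
claude-3-5-sonnet-20240620 58.74 57.17 60.85 54.32 58.87 53.21 68.01
o1-mini-2024-09-12 57.76 72.33 48.05 61.99 57.92 40.89 65.4
gemini-exp-1121 57.36 49.92 49.75 63.75 60.29 40.3 80.15
gpt-4o-2024-08-06 55.33 53.92 51.44 49.54 60.91 47.59 68.58
gpt-4o-2024-05-13 54.41 49.67 50 46.98 61.57 50.05 68.21
gemini-1.5-pro-002 54.33 49.08 48.8 59.07 54.97 43.29 70.78
grok-2-1212 54.3 54.83 46.44 54.88 54.45 45.58 69.63
gemini-1.5-pro-exp-0827 53.29 50.92 41.43 58.5 53.5 46.15 69.26
meta-llama-3.1-405b-instruct-turbo 52.36 53.25 42.65 41.05 55.85 45.46 75.9
gpt-4o-2024-11-20 52.19 55.75 46.08 42.87 56.15 47.37 64.94
"""

COLUMNS = ["Global", "Reasoning", "Coding", "Mathematics",
           "Data Analysis", "Language", "IF"]


def parse(rows):
    """Turn the whitespace-separated rows into {model: {column: score}}."""
    table = {}
    for line in rows.strip().splitlines():
        name, *values = line.split()  # model names contain no spaces
        table[name] = dict(zip(COLUMNS, map(float, values)))
    return table


def rank_of(table, model, column):
    """1-based rank of `model` in `column`; a higher score is better."""
    target = table[model][column]
    return 1 + sum(1 for scores in table.values() if scores[column] > target)


table = parse(ROWS)
ranks = {col: rank_of(table, "Deepseek-V3", col) for col in COLUMNS}
for col, rank in ranks.items():
    print(f"{col}: #{rank} of {len(table)}")
```

If the Reddit numbers hold up, that puts it 5th of 17 overall, 4th on coding, and 3rd on instruction following among the models listed here.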

That big, huh

3 likes

That parameter count is wild. Looking forward to the eval results

3 likes

Edited the post: it's ahead of Sonnet on the aider leaderboard
Aider LLM Leaderboards | aider

1 like

I can't find V3 on aider, where is it?

2 likes

Aider LLM Leaderboards

3 likes

Found it. Damn, that's strong, it actually edges out Claude

2 likes

Waiting for the evals

Looking forward to the API pricing; I wonder how well DeepSeek managed to keep costs down this time

1 like

The API is already serving V3, and the price hasn't gone up for now

Can it really beat Sonnet? How is its coding ability?

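Thanks for sharing! Hoping SiliconFlow adds it soon

It's already live; both the web app and the API are on V3 now

You probably mean the official DeepSeek site. I meant SiliconFlow (https://siliconflow.cn); I still have credit there, so I usually call the API through SiliconFlow.

Sorry, I didn't notice you meant SiliconFlow when I replied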

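When will it be on LiveBench? I'd like to see how it ranks across the different capabilities

That strong??

SiliconFlow probably won't offer it for direct use, will it? They didn't offer Llama 405B for direct use either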

Found one on Reddit, can't vouch for its authenticity:
All Groups

Average 60.4
Reasoning 50.0
Coding 63.4
Mathematics 60.0
Data Analysis 57.7
Language 50.2
Instruction Following 80.9
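As a quick sanity check on these figures, the reported average matches the plain mean of the six category scores. The sketch below assumes the overall number is an unweighted mean of the categories, which is consistent with the numbers quoted here but is not confirmed by the Reddit post; `global_average` is a made-up helper name.

```python
# Category scores for Deepseek-V3 as quoted from Reddit (unverified).
scores = {
    "Reasoning": 50.0,
    "Coding": 63.4,
    "Mathematics": 60.0,
    "Data Analysis": 57.7,
    "Language": 50.2,
    "Instruction Following": 80.9,
}


def global_average(category_scores):
    """Unweighted mean of the category scores (assumed aggregation rule)."""
    return sum(category_scores.values()) / len(category_scores)


avg = global_average(scores)
print(round(avg, 1))  # 60.4, matching the reported Average
```

So the numbers are at least internally consistent, which says nothing about whether they are real.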

2 likes

685B? Llama topped out at 405B and Mistral-Large is only 123B. This has to be expensive to run; the whole pitch is brute-force scale