cnm (CNM) #1
LM Arena: ranked third
| Model | Overall | Overall w/ Style Control | Hard Prompts | Hard Prompts w/ Style Control | Coding | Math | Creative Writing | Instruction Following | Longer Query | Multi-Turn |
|---|---|---|---|---|---|---|---|---|---|---|
| gemini-exp-1206 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| chatgpt-4o-latest-20241120 | 1 | 1 | 3 | 4 | 1 | 5 | 1 | 2 | 1 | 1 |
| gemini-2.0-flash-exp | 3 | 3 | 2 | 2 | 3 | 1 | 2 | 2 | 1 | 1 |
| o1-preview | 4 | 3 | 2 | 1 | 1 | 1 | 4 | 2 | 3 | 3 |
| o1-mini | 5 | 7 | 3 | 4 | 1 | 1 | 16 | 5 | 4 | 5 |
| gemini-1.5-pro-002 | 5 | 6 | 6 | 7 | 7 | 5 | 4 | 5 | 5 | 7 |
| grok-2-2024-08-13 | 7 | 10 | 10 | 12 | 8 | 10 | 6 | 9 | 9 | 8 |
| yi-lightning | 7 | 11 | 6 | 9 | 7 | 6 | 6 | 8 | 6 | 5 |
| gpt-4o-2024-05-13 | 7 | 6 | 9 | 9 | 7 | 10 | 6 | 8 | 7 | 7 |
| claude-3-5-sonnet-20241022 | 7 | 5 | 6 | 2 | 5 | 5 | 6 | 6 | 5 | 5 |
Official benchmarks
| Benchmark | Gemini 1.5 Flash 002 | Gemini 1.5 Pro 002 | Gemini 2.0 Flash Experimental |
|---|---|---|---|
| MMLU-Pro | 67.3% | 75.8% | 76.4% |
| Natural2Code | 79.8% | 85.4% | 92.9% |
| Bird-SQL (Dev) | 45.6% | 54.4% | 56.9% |
| LiveCodeBench | 30.0% | 34.3% | 35.1% |
| FACTS Grounding | 82.9% | 80.0% | 83.6% |
| MATH | 77.9% | 86.5% | 89.7% |
| HiddenMath | 47.2% | 52.0% | 63.0% |
| GPQA (diamond) | 51.0% | 59.1% | 62.1% |
| MRCR (1M) | 71.9% | 82.6% | 69.2% |
| MMMU | 62.3% | 65.9% | 70.7% |
| Vibe-Eval (Reka) | 48.9% | 53.9% | 56.3% |
| CoVoST2 (21 lang) | 37.4 | 40.1 | 39.2 |
| EgoSchema (test) | 66.8% | 71.2% | 71.5% |
LiveBench: ranked sixth
| Model | Organization | Global Average | Reasoning Average | Coding Average | Mathematics Average | Data Analysis Average | Language Average | IF Average |
|---|---|---|---|---|---|---|---|---|
| o1-preview-2024-09-12 | OpenAI | 65.63 | 67.42 | 50.85 | 64.92 | 67.31 | 68.72 | 74.60 |
| gemini-exp-1206 | Google | 63.91 | 57.00 | 63.41 | 71.69 | 63.16 | 50.84 | 77.34 |
| claude-3-5-sonnet-20241022 | Anthropic | 58.99 | 56.67 | 67.13 | 52.28 | 54.78 | 53.76 | 69.30 |
| o1 | OpenAI | N/A | N/A | 61.62 | N/A | N/A | N/A | N/A |
| claude-3-5-sonnet-20240620 | Anthropic | 58.72 | 57.17 | 60.85 | 54.32 | 58.74 | 53.21 | 68.01 |
| gemini-2.0-flash-exp | Google | 57.99 | 59.08 | 54.36 | 59.39 | 59.68 | 33.55 | 81.86 |
| gemini-exp-1121 | Google | 57.41 | 49.92 | 50.36 | 63.75 | 60.29 | 40.00 | 80.15 |
| o1-mini-2024-09-12 | OpenAI | 57.38 | 72.33 | 48.05 | 60.89 | 56.73 | 40.89 | 65.40 |
| gemini-exp-1114 | Google | 56.70 | 55.67 | 52.36 | 55.59 | 60.82 | 38.69 | 77.08 |
| step-2-16k-202411 | StepFun | 55.64 | 55.50 | 46.87 | 48.88 | 58.19 | 44.52 | 79.88 |
| gpt-4o-2024-08-06 | OpenAI | 54.38 | 53.92 | 51.44 | 48.54 | 56.23 | 47.59 | 68.58 |
| gemini-1.5-pro-002 | Google | 54.22 | 49.08 | 48.80 | 58.40 | 54.97 | 43.29 | 70.78 |
This has the same feel as 3.5 Sonnet overtaking 3 Opus.
16 likes
cnm (CNM) #3
Improvement over Gemini 1.5 Flash:

| Benchmark | Gemini 1.5 Flash 002 | Gemini 2.0 Flash Experimental | Relative Improvement |
|---|---|---|---|
| MMLU-Pro | 67.3% | 76.4% | 13.5% |
| Natural2Code | 79.8% | 92.9% | 16.4% |
| Bird-SQL (Dev) | 45.6% | 56.9% | 24.8% |
| LiveCodeBench | 30.0% | 35.1% | 17.0% |
| FACTS Grounding | 82.9% | 83.6% | 0.8% |
| MATH | 77.9% | 89.7% | 15.2% |
| HiddenMath | 47.2% | 63.0% | 33.5% |
| GPQA (diamond) | 51.0% | 62.1% | 21.8% |
| MRCR (1M) | 71.9% | 69.2% | -3.7% |
| MMMU | 62.3% | 70.7% | 13.5% |
| Vibe-Eval (Reka) | 48.9% | 56.3% | 15.1% |
| CoVoST2 (21 lang) | 37.4 | 39.2 | 4.8% |
| EgoSchema (test) | 66.8% | 71.5% | 7.0% |
Improvement over Gemini 1.5 Pro (relative gains; see the sketch after this table):

| Benchmark | Gemini 1.5 Pro 002 | Gemini 2.0 Flash Experimental | Relative Improvement |
|---|---|---|---|
| MMLU-Pro | 75.8% | 76.4% | 0.8% |
| Natural2Code | 85.4% | 92.9% | 8.8% |
| Bird-SQL (Dev) | 54.4% | 56.9% | 4.6% |
| LiveCodeBench | 34.3% | 35.1% | 2.3% |
| FACTS Grounding | 80.0% | 83.6% | 4.5% |
| MATH | 86.5% | 89.7% | 3.7% |
| HiddenMath | 52.0% | 63.0% | 21.2% |
| GPQA (diamond) | 59.1% | 62.1% | 5.1% |
| MRCR (1M) | 82.6% | 69.2% | -16.2% |
| MMMU | 65.9% | 70.7% | 7.3% |
| Vibe-Eval (Reka) | 53.9% | 56.3% | 4.5% |
| CoVoST2 (21 lang) | 40.1 | 39.2 | -2.2% |
| EgoSchema (test) | 71.2% | 71.5% | 0.4% |
The biggest gains look like they're in math; the data generated by the olympiad-competition model is probably paying off.
4 likes
OK, then let's have it prove Fermat's Last Theorem, the four color theorem, and Goldbach's conjecture.
Make it write out the full proofs.
1 like
Does 2.0 Flash have the same context window as 1.5 Pro?