Gemini 2.0, nailed it!

Ranked third on LMArena

| Model | Overall | Overall w/ Style Control | Hard Prompts | Hard Prompts w/ Style Control | Coding | Math | Creative Writing | Instruction Following | Longer Query | Multi-Turn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemini-exp-1206 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| chatgpt-4o-latest-20241120 | 1 | 1 | 3 | 4 | 1 | 5 | 1 | 2 | 1 | 1 |
| gemini-2.0-flash-exp | 3 | 3 | 2 | 2 | 3 | 1 | 2 | 2 | 1 | 1 |
| o1-preview | 4 | 3 | 2 | 1 | 1 | 1 | 4 | 2 | 3 | 3 |
| o1-mini | 5 | 7 | 3 | 4 | 1 | 1 | 16 | 5 | 4 | 5 |
| gemini-1.5-pro-002 | 5 | 6 | 6 | 7 | 7 | 5 | 4 | 5 | 5 | 7 |
| grok-2-2024-08-13 | 7 | 10 | 10 | 12 | 8 | 10 | 6 | 9 | 9 | 8 |
| yi-lightning | 7 | 11 | 6 | 9 | 7 | 6 | 6 | 8 | 6 | 5 |
| gpt-4o-2024-05-13 | 7 | 6 | 9 | 9 | 7 | 10 | 6 | 8 | 7 | 7 |
| claude-3-5-sonnet-20241022 | 7 | 5 | 6 | 2 | 5 | 5 | 6 | 6 | 5 | 5 |

Official benchmarks

| Benchmark | Gemini 1.5 Flash 002 | Gemini 1.5 Pro 002 | Gemini 2.0 Flash Experimental |
| --- | --- | --- | --- |
| MMLU-Pro | 67.3% | 75.8% | 76.4% |
| Natural2Code | 79.8% | 85.4% | 92.9% |
| Bird-SQL (Dev) | 45.6% | 54.4% | 56.9% |
| LiveCodeBench | 30.0% | 34.3% | 35.1% |
| FACTS Grounding | 82.9% | 80.0% | 83.6% |
| MATH | 77.9% | 86.5% | 89.7% |
| HiddenMath | 47.2% | 52.0% | 63.0% |
| GPQA (diamond) | 51.0% | 59.1% | 62.1% |
| MRCR (1M) | 71.9% | 82.6% | 69.2% |
| MMMU | 62.3% | 65.9% | 70.7% |
| Vibe-Eval (Reka) | 48.9% | 53.9% | 56.3% |
| CoVoST2 (21 lang, BLEU) | 37.4 | 40.1 | 39.2 |
| EgoSchema (test) | 66.8% | 71.2% | 71.5% |

Ranked sixth on LiveBench

| Model | Organization | Global Average | Reasoning Average | Coding Average | Mathematics Average | Data Analysis Average | Language Average | IF Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| o1-preview-2024-09-12 | OpenAI | 65.63 | 67.42 | 50.85 | 64.92 | 67.31 | 68.72 | 74.60 |
| gemini-exp-1206 | Google | 63.91 | 57.00 | 63.41 | 71.69 | 63.16 | 50.84 | 77.34 |
| claude-3-5-sonnet-20241022 | Anthropic | 58.99 | 56.67 | 67.13 | 52.28 | 54.78 | 53.76 | 69.30 |
| o1 | OpenAI | N/A | N/A | 61.62 | N/A | N/A | N/A | N/A |
| claude-3-5-sonnet-20240620 | Anthropic | 58.72 | 57.17 | 60.85 | 54.32 | 58.74 | 53.21 | 68.01 |
| gemini-2.0-flash-exp | Google | 57.99 | 59.08 | 54.36 | 59.39 | 59.68 | 33.55 | 81.86 |
| gemini-exp-1121 | Google | 57.41 | 49.92 | 50.36 | 63.75 | 60.29 | 40.00 | 80.15 |
| o1-mini-2024-09-12 | OpenAI | 57.38 | 72.33 | 48.05 | 60.89 | 56.73 | 40.89 | 65.40 |
| gemini-exp-1114 | Google | 56.70 | 55.67 | 52.36 | 55.59 | 60.82 | 38.69 | 77.08 |
| step-2-16k-202411 | StepFun | 55.64 | 55.50 | 46.87 | 48.88 | 58.19 | 44.52 | 79.88 |
| gpt-4o-2024-08-06 | OpenAI | 54.38 | 53.92 | 51.44 | 48.54 | 56.23 | 47.59 | 68.58 |
| gemini-1.5-pro-002 | Google | 54.22 | 49.08 | 48.80 | 58.40 | 54.97 | 43.29 | 70.78 |

Feels just like when 3.5 Sonnet beat 3 Opus :tieba_003:

16 likes

Improvement over Gemini 1.5 Flash:

| Benchmark | Gemini 1.5 Flash 002 | Gemini 2.0 Flash Experimental | Relative Improvement |
| --- | --- | --- | --- |
| MMLU-Pro | 67.3% | 76.4% | 13.5% |
| Natural2Code | 79.8% | 92.9% | 16.4% |
| Bird-SQL (Dev) | 45.6% | 56.9% | 24.8% |
| LiveCodeBench | 30.0% | 35.1% | 17.0% |
| FACTS Grounding | 82.9% | 83.6% | 0.8% |
| MATH | 77.9% | 89.7% | 15.2% |
| HiddenMath | 47.2% | 63.0% | 33.5% |
| GPQA (diamond) | 51.0% | 62.1% | 21.8% |
| MRCR (1M) | 71.9% | 69.2% | -3.7% |
| MMMU | 62.3% | 70.7% | 13.5% |
| Vibe-Eval (Reka) | 48.9% | 56.3% | 15.1% |
| CoVoST2 (21 lang, BLEU) | 37.4 | 39.2 | 4.8% |
| EgoSchema (test) | 66.8% | 71.5% | 7.0% |

Improvement over Gemini 1.5 Pro:

| Benchmark | Gemini 1.5 Pro 002 | Gemini 2.0 Flash Experimental | Relative Improvement |
| --- | --- | --- | --- |
| MMLU-Pro | 75.8% | 76.4% | 0.8% |
| Natural2Code | 85.4% | 92.9% | 8.8% |
| Bird-SQL (Dev) | 54.4% | 56.9% | 4.6% |
| LiveCodeBench | 34.3% | 35.1% | 2.3% |
| FACTS Grounding | 80.0% | 83.6% | 4.5% |
| MATH | 86.5% | 89.7% | 3.7% |
| HiddenMath | 52.0% | 63.0% | 21.2% |
| GPQA (diamond) | 59.1% | 62.1% | 5.1% |
| MRCR (1M) | 82.6% | 69.2% | -16.2% |
| MMMU | 65.9% | 70.7% | 7.3% |
| Vibe-Eval (Reka) | 53.9% | 56.3% | -2.2% replaced below |
| CoVoST2 (21 lang, BLEU) | 40.1 | 39.2 | -2.2% |
| EgoSchema (test) | 71.2% | 71.5% | 0.4% |
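
For clarity, the "Relative Improvement" column in the two tables above is the relative gain over the old score, (new - old) / old, not an absolute percentage-point difference. A minimal Python sketch (not part of the original post; the benchmark subset is just for illustration) that reproduces a few of the values:

```python
# Reproduce the "Relative Improvement" column: relative gain over the
# old score, (new - old) / old, rather than a percentage-point delta.
flash_1_5 = {"MMLU-Pro": 67.3, "HiddenMath": 47.2, "GPQA (diamond)": 51.0}
flash_2_0 = {"MMLU-Pro": 76.4, "HiddenMath": 63.0, "GPQA (diamond)": 62.1}

for bench, old in flash_1_5.items():
    new = flash_2_0[bench]
    print(f"{bench}: {(new - old) / old * 100:+.1f}%")

# Output:
# MMLU-Pro: +13.5%
# HiddenMath: +33.5%
# GPQA (diamond): +21.8%
```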

Looks like the biggest gain is in math :tieba_003: probably the data generated by their Olympiad-competition models kicking in :tieba_003:

4 likes

OK, now have it prove Fermat's conjecture, the four-color conjecture, and Goldbach's conjecture.
Make it write out the complete proofs.

1 like

Take down OpenAI!

Does 2.0 Flash have the same context window as 1.5 Pro?

One million tokens :tieba_003:
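
For what it's worth, one way to check the advertised limits yourself is to query the model metadata through the google-generativeai Python SDK. A minimal sketch, assuming the experimental model ID is still being served and that you have an API key (placeholder below):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key

# get_model() returns metadata including input/output token limits.
for name in ("models/gemini-2.0-flash-exp", "models/gemini-1.5-pro-002"):
    m = genai.get_model(name)
    print(f"{name}: input={m.input_token_limit:,}, output={m.output_token_limit:,}")
```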

1 like