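[Continuously updated wiki] Language model discrimination question bank: mainly for telling language models apart, and also usable for testing logical reasoning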

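[Image edition] Image-understanding question bank for large AI models: telling models apart and comparing image comprehension

| No. | Question | Answer | :heavy_check_mark: | :x: |
|---|---|---|---|---|
| 1 | Extract the text in the image | 不想上班,那就不上 (in English: "Don't want to go to work? Then don't.") | :blue_square:G-2.0FP, :purple_square:G-2.0F, :green_circle:G-2.0P, :large_blue_circle:GT, :white_large_square:4o, :orange_square:o1, :heart:o3m, :blue_heart:o3mh | |
| 2 | Extract the text in the image (Image 3) | bsjx | :large_blue_circle:GT, :orange_square:o1 (unstable), :heart:o3m (unstable) | |
| 3 | Solve | DC = 30/7 | :orange_square:o1, :red_circle:o1p, :heart:o3m, :blue_heart:o3mh | :large_blue_circle:GT |
| 4 | Extract the text in the image (Image 2) | 4yu6 | :orange_square:o1, :heart:o3m, :blue_heart:o3mh | :large_blue_circle:GT |
| 5 | Extract the text in the image (Image 1) | rpmx | :large_blue_circle:GT, :orange_square:o1, :heart:o3m, :blue_heart:o3mh | |
| 6 | Extract the text in the image | 真诚、友善、团结、专业,共建你我引以为荣之社区 (in English: "Sincere, friendly, united, professional: building a community we can all be proud of") | :large_blue_circle:GT, :heart:o3m, :blue_heart:o3mh | :orange_square:o1 (unstable), :red_circle:o1p (unstable) |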

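Notes:

  1. :heavy_check_mark: and :x: columns: fill these in from your model test results, as follows:
    • The :heavy_check_mark: column lists models with an accuracy of at least 80%.
    • The :x: column lists models with an accuracy between 40% and 60%, each marked "(unstable)".

Language models tested in the question bank (sorted alphabetically by name):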

  1. :brown_square: Claude 3.5 sonnet (C3.5)
  2. :large_blue_circle: gemini-2.0-flash-thinking-exp-01-21 (GT)
  3. :purple_square: gemini-2.0-flash (G-2.0F)
  4. :blue_square: gemini-2.0-flash-lite-preview-02-05 (G-2.0FP)
  5. :green_circle: gemini-2.0-pro-exp-02-05 (G-2.0P)
  6. :white_large_square: GPT4o (4o)
  7. :orange_square: o1 (o1)
  8. :red_circle: o1 pro (o1p)
  9. :heart: o3 mini (o3m)
  10. :blue_heart: o3-mini-high (o3mh)


| No. | Type | Question | :heavy_check_mark: | :x: | Answer |
|---|---|---|---|---|---|
| 1 | Sequences | Let the real sequence $\{x_n\}$ satisfy $x_0 = 0$, $x_2 = \sqrt[3]{2}\,x_1$, $x_3$ is a positive integer, and $x_{n+1} = \frac{1}{\sqrt[3]{4}} x_n + \sqrt[3]{4}\, x_{n-1} + \frac{1}{2} x_{n-2}$ for $n \geq 2$. What is the minimum number of integer terms such a sequence can have? | | :brown_square:C3.5, :green_square:DSR1, :orange_circle:DSV3, :purple_circle:DB1.5, :white_large_square:4o, :brown_circle:4om, :white_circle:GLM, :purple_square:G-2.0F, :blue_square:G-2.0FP, :large_blue_circle:GT, :green_circle:G-2.0P, :yellow_square:K1.5, :orange_square:o1, :red_circle:o1p, :heart:o3m, :blue_heart:o3mh, :star2:GK3R, :star:GK3 | 5 |
| 2 | Inequalities | Given a positive integer $n \geq 3$, find the smallest positive number $\lambda$ such that for any $\theta_i \in (0, \frac{\pi}{2})$ ($i = 1, 2, \cdots, n$), whenever $\tan \theta_1 \cdot \tan \theta_2 \cdots \tan \theta_n = 2^{\frac{n}{2}}$, the sum $\cos \theta_1 + \cos \theta_2 + \cdots + \cos \theta_n$ does not exceed $\lambda$. | :green_square:DSR1, :star2:GK3R | :brown_square:C3.5, :orange_circle:DSV3, :purple_circle:DB1.5, :white_large_square:4o, :brown_circle:4om, :white_circle:GLM, :purple_square:G-2.0F, :blue_square:G-2.0FP, :large_blue_circle:GT, :green_circle:G-2.0P, :yellow_square:K1.5, :orange_square:o1, :red_circle:o1p, :heart:o3m, :blue_heart:o3mh (unstable), :star:GK3 | $n-1$ |
| 3 | Analytic geometry | A moving parabola passes through the points $A(-1, 0)$ and $B(1, 0)$, and its directrix is always tangent to the circle $x^2 + y^2 = 9$. The locus of the focus $P$ of the parabola is part of a conic $E$.<br>(1) Find the standard equation of $E$;<br>(2) Given the points $C(-3, 0)$ and $D(2, 0)$, a moving line through $D$ meets $E$ at $M$ and $N$. Let $Q$ be the circumcenter of $\triangle CMN$ and $O$ the origin. Is the product of the slopes of lines $OQ$ and $MN$ constant? If so, find the constant; if not, explain why. | :heart:o3m, :blue_heart:o3mh | :green_square:DSR1, :brown_square:C3.5, :orange_circle:DSV3, :purple_circle:DB1.5, :white_large_square:4o, :brown_circle:4om, :white_circle:GLM, :purple_square:G-2.0F, :large_blue_circle:GT, :green_circle:G-2.0P, :blue_square:G-2.0FP, :yellow_square:K1.5, :orange_square:o1, :red_circle:o1p, :star:GK3, :star2:GK3R | $\frac{x^2}{9} + \frac{y^2}{8} = 1$; $-5$ |
| 4 | Logical reasoning | Sroan has a private safe whose code consists of 7 different digits. Guess #1: 9062437 Guess #2: 8593624 Guess #3: 4286915 Guess #4: 3450982. Sroan says: each of the four of you guessed two digits correctly, and those two digits are in non-adjacent positions. (A digit counts as correct only if both the position and the digit in it match.) What is the code? | :orange_square:o1, :red_circle:o1p, :heart:o3m, :blue_heart:o3mh | :green_square:DSR1, :brown_square:C3.5, :orange_circle:DSV3, :purple_circle:DB1.5, :large_blue_circle:GT, :purple_square:G-2.0F, :green_circle:G-2.0P, :blue_square:G-2.0FP, :white_circle:GLM, :white_large_square:4o, :brown_circle:4om, :yellow_square:K1.5 | 4053927 |
| 5 | Analytic geometry | In the plane quadrilateral $ABCD$, $AB = AC = CD = 1$, $\angle ADC = 30^{\circ}$, $\angle DAB = 120^{\circ}$. Fold $\triangle ACD$ along $AC$ to $\triangle ACP$, where $P$ is a moving point. Find the minimum value of the cosine of the dihedral angle $A - CP - B$. | :green_square:DSR1, :red_circle:o1p, :heart:o3m, :blue_heart:o3mh | :brown_square:C3.5, :orange_circle:DSV3, :purple_circle:DB1.5, :purple_square:G-2.0F, :green_circle:G-2.0P, :large_blue_circle:GT, :blue_square:G-2.0FP, :white_circle:GLM, :white_large_square:4o, :brown_circle:4om, :orange_square:o1 (unstable), :yellow_square:K1.5 | $\frac{\sqrt{3}}{3}$ |
| 6 | Permutations | There are 8 people: A, B, C, D and 4 others. The 8 people are seated at random in two rows of classroom seats, 4 seats per row, 8 seats in total. Two people are adjacent if they sit in the same row and their seat numbers are consecutive. A and B must be adjacent, and C and D must not be adjacent. How many different seating arrangements satisfy these conditions? | :green_square:DSR1, :orange_square:o1, :red_circle:o1p, :heart:o3m, :blue_heart:o3mh | :brown_square:C3.5, :orange_circle:DSV3, :purple_circle:DB1.5, :purple_square:G-2.0F, :large_blue_circle:GT (unstable), :green_circle:G-2.0P, :blue_square:G-2.0FP, :white_circle:GLM (unstable), :white_large_square:4o, :brown_circle:4om, :yellow_square:K1.5 | 6528 |
| 7 | Basic electronics | An 8-segment common-anode LED display is to show the character "5" (segment a is the least significant bit). The segment code in this case is _______. | :green_square:DSR1, :orange_square:o1, :red_circle:o1p, :heart:o3m, :blue_heart:o3mh | :blue_square:G-2.0FP, :brown_square:C3.5, :orange_circle:DSV3, :purple_circle:DB1.5 (unstable), :purple_square:G-2.0F (unstable), :large_blue_circle:GT (unstable), :green_circle:G-2.0P, :white_circle:GLM, :white_large_square:4o, :brown_circle:4om, :yellow_square:K1.5 (unstable) | 92H |
| 8 | Variable-mass dynamics | A raindrop has mass $m_0$ when it starts to fall freely. As it falls, it accumulates water vapor at a rate of $\lambda$ mass per unit time ($\lambda$ is constant). Find the distance the raindrop falls in time $t$. Neglect air resistance; the gravitational acceleration is $g$. | :green_square:DSR1, :large_blue_circle:GT, :purple_square:G-2.0F, :orange_square:o1, :red_circle:o1p, :heart:o3m, :blue_heart:o3mh | :blue_square:G-2.0FP, :brown_square:C3.5, :purple_circle:DB1.5, :green_circle:G-2.0P (unstable), :white_circle:GLM, :white_large_square:4o, :brown_circle:4om, :yellow_square:K1.5 (unstable) | $s(t) = \frac{g t^{2}}{4} + \frac{g m_{0} t}{2 \lambda} - \frac{g m_{0}^{2}}{2 \lambda^{2}} \ln\left(1 + \frac{\lambda t}{m_{0}}\right)$ |
| 9 | Analytic geometry | In the Cartesian plane, three distinct points on the graph of $y = \frac{x+1}{\lvert x \rvert + 1}$ lie on a line $l$, and the sum of their $x$-coordinates is 0. Find the range of the slope of $l$. | :green_square:DSR1, :white_circle:GLM, :orange_square:o1, :red_circle:o1p, :heart:o3m, :blue_heart:o3mh | :brown_square:C3.5, :orange_circle:DSV3, :purple_circle:DB1.5, :purple_square:G-2.0F, :large_blue_circle:GT, :green_circle:G-2.0P, :blue_square:G-2.0FP, :white_large_square:4o, :brown_circle:4om, :yellow_square:K1.5 | $0 < k < \frac{2}{9}$ |
| 10 | Geometry | In the regular square frustum $ABCD-A_1B_1C_1D_1$, $AB=2$, $A_1B_1=1$, $AA_1=\sqrt{2}$. What is the volume of the frustum? | :green_square:DSR1, :purple_circle:DB1.5, :large_blue_circle:GT, :orange_square:o1, :red_circle:o1p, :heart:o3m, :blue_heart:o3mh | :brown_square:C3.5, :orange_circle:DSV3 (unstable), :green_circle:G-2.0P (unstable), :purple_square:G-2.0F (unstable), :blue_square:G-2.0FP, :white_circle:GLM, :white_large_square:4o, :brown_circle:4om, :yellow_square:K1.5 | $\frac{7\sqrt{6}}{6}$ |
| 11 | Geometry | In $\Delta ABC$, the sides opposite $\angle A$, $\angle B$, $\angle C$ are $a, b, c$ respectively, with $c=10$ and $\frac{\cos A}{\cos B} = \frac{b}{a} = \frac{4}{3}$. $P$ is a moving point on the incircle of $\Delta ABC$. Find the maximum and minimum of the sum of the squares of the distances from $P$ to the vertices $A$, $B$, $C$. | :green_square:DSR1, :purple_circle:DB1.5, :large_blue_circle:GT, :white_circle:GLM, :orange_square:o1, :yellow_square:K1.5, :red_circle:o1p, :heart:o3m, :blue_heart:o3mh | :brown_square:C3.5, :orange_circle:DSV3 (unstable), :green_circle:G-2.0P, :purple_square:G-2.0F (unstable), :blue_square:G-2.0FP (unstable), :white_large_square:4o, :brown_circle:4om | 88, 72 |
| 12 | Moment of inertia | A thin semicircular plate has mass $M$ and radius $R$. What is its moment of inertia when it rotates about its diameter? | :green_square:DSR1, :orange_circle:DSV3, :purple_circle:DB1.5, :white_circle:GLM, :white_large_square:4o, :orange_square:o1, :red_circle:o1p, :large_blue_circle:GT, :purple_square:G-2.0F, :green_circle:G-2.0P, :yellow_square:K1.5, :heart:o3m, :blue_heart:o3mh | :blue_square:G-2.0FP (unstable), :brown_square:C3.5 (unstable), :brown_circle:4om (unstable) | $\frac{MR^2}{4}$ |
| 13 | MCU timer initial value | An AT89S51 uses a 6 MHz crystal. For a 2 ms delay with the timer in mode 1, what should the initial value (in hexadecimal) be? (Show the calculation.) | :brown_square:C3.5, :orange_circle:DSV3, :green_square:DSR1, :white_large_square:4o, :star:GK3, :purple_circle:DB1.5, :purple_square:G-2.0F, :large_blue_circle:GT, :green_circle:G-2.0P, :orange_square:o1, :red_circle:o1p, :yellow_square:K1.5, :heart:o3m, :blue_heart:o3mh | :blue_square:G-2.0FP (unstable), :white_circle:GLM, :brown_circle:4om | 0xFC18 |
| 14 | Trigonometric functions | The function $f(x) = \cos(\omega x) - 1$ ($\omega > 0$) has exactly 3 zeros in the interval $[0, 2\pi]$. What is the range of $\omega$? | :orange_circle:DSV3, :green_square:DSR1, :purple_circle:DB1.5, :white_large_square:4o, :white_circle:GLM, :purple_square:G-2.0F, :green_circle:G-2.0P, :orange_square:o1, :red_circle:o1p, :large_blue_circle:GT, :yellow_square:K1.5, :heart:o3m, :blue_heart:o3mh | :brown_square:C3.5 (unstable), :brown_circle:4om (unstable), :blue_square:G-2.0FP | $[2, 3)$ |
| 15 | Classical Chinese | What does 披发左衽 mean? | :brown_square:C3.5, :orange_circle:DSV3, :green_square:DSR1, :purple_circle:DB1.5, :white_large_square:4o, :star:GK3, :white_circle:GLM, :purple_square:G-2.0F, :blue_square:G-2.0FP, :large_blue_circle:GT, :green_circle:G-2.0P, :yellow_square:K1.5, :orange_square:o1, :red_circle:o1p | :brown_circle:4om, :heart:o3m, :blue_heart:o3mh | Customs of non-Han peoples |
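
Several of the listed answers can be spot-checked mechanically. The sketch below is only an illustration, not part of the question bank: all function names are made up here, and it assumes that seats are numbered consecutively within each row (question 6), that the segment byte is ordered dp g f e d c b a with the decimal point off and "5" drawn with segments a, c, d, f, g (question 7), and that the AT89S51 runs the standard 12-clock machine cycle (question 13).

```python
from itertools import permutations

# Question 4: every guess must match the code in two positions,
# and those two positions must not be next to each other.
GUESSES = ["9062437", "8593624", "4286915", "3450982"]

def satisfies_q4(code: str) -> bool:
    if len(code) != 7 or len(set(code)) != 7:   # 7 different digits
        return False
    for guess in GUESSES:
        hits = [i for i in range(7) if guess[i] == code[i]]
        if len(hits) != 2 or hits[1] - hits[0] == 1:
            return False
    return True

# Question 6: seats 0-3 form the first row, seats 4-7 the second (assumption).
def adjacent(s1: int, s2: int) -> bool:
    return s1 // 4 == s2 // 4 and abs(s1 - s2) == 1

def count_q6() -> int:
    total = 0
    # seat[p] is the seat of person p; persons 0, 1, 2, 3 stand for A, B, C, D.
    for seat in permutations(range(8)):
        if adjacent(seat[0], seat[1]) and not adjacent(seat[2], seat[3]):
            total += 1
    return total

# Question 7: bit 0 is segment a, bit 7 is the decimal point (assumption).
# On a common-anode display a lit segment is driven with 0.
def segment_code_q7(lit: str = "acdfg") -> int:
    code = 0xFF
    for seg in lit:
        code &= ~(1 << "abcdefgp".index(seg)) & 0xFF
    return code

# Question 13: mode 1 is a 16-bit timer; one machine cycle is 12 clock periods
# (assumption: standard 8051 timing).
def timer_initial_q13(f_osc_hz: int = 6_000_000, delay_us: int = 2000) -> int:
    cycle_us = 12 * 1_000_000 // f_osc_hz    # 2 microseconds at 6 MHz
    return 65536 - delay_us // cycle_us      # 65536 - 1000 counts

print(satisfies_q4("4053927"))          # True
print(count_q6())                       # 6528
print(f"{segment_code_q7():02X}H")      # 92H
print(f"0x{timer_initial_q13():04X}")   # 0xFC18
```

The four printed values agree with the answers listed for questions 4, 6, 7 and 13. The question 4 check only confirms that the listed code satisfies all four hints; it does not search for other codes.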

Language models tested in the question bank (sorted alphabetically by name):

| No. | Icon | Model name | Abbreviation |
|---|---|---|---|
| 1 | :brown_square: | Claude 3.5 sonnet | C3.5 |
| 2 | :green_square: | DeepSeek-R1 | DSR1 |
| 3 | :orange_circle: | DeepSeek-V3 | DSV3 |
| 4 | :purple_circle: | Doubao-1.5-pro | DB1.5 |
| 5 | :purple_square: | gemini-2.0-flash | G-2.0F |
| 6 | :large_blue_circle: | gemini-2.0-flash-thinking-exp-01-21 | GT |
| 7 | :green_circle: | gemini-2.0-pro-exp-02-05 | G-2.0P |
| 8 | :blue_square: | gemini-2.0-flash-lite-preview-02-05 | G-2.0FP |
| 9 | :white_circle: | GLM-Zero | GLM |
| 10 | :brown_circle: | GPT-4o-mini | 4om |
| 11 | :white_large_square: | GPT4o | 4o |
| 12 | :star: | Grok-3 | GK3 |
| 13 | :sparkles: | Grok-3 mini | GK3m |
| 14 | :star2: | Grok-3 Reasoning Beta | GK3R |
| 15 | :fire: | Grok-3 mini Reasoning | GK3Rm |
| 16 | :yellow_square: | Kimi k1.5 | K1.5 |
| 17 | :orange_square: | o1 | o1 |
| 18 | :red_circle: | o1 pro | o1p |
| 19 | :heart: | o3 mini | o3m |
| 20 | :blue_heart: | o3-mini-high | o3mh |

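This is an open wiki; everyone is welcome to edit. Suggestions:

  1. Test language models on relatively authoritative platforms rather than on obviously cut-down versions.
  2. After testing, post screenshots as evidence of the results.
  3. Test each question at least 5 times per model before recording a result.
  4. Put models with an accuracy of ≥80% in the :heavy_check_mark: column; put models with an accuracy of 40%-60% in the :x: column, marked "(unstable)".
  5. Use each model's default parameters.
  6. Order the models in the table by model name.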
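
The thresholds in these suggestions can be written down as a small helper. This is a minimal sketch, not an official tool of the wiki: the function name `classify` and the returned labels are made up here. The suggestions do not say what to do below 40% or between 60% and 80%; the sketch assumes that below 40% means a plain :x: entry (the tables do list :x: models without the "(unstable)" mark) and reports the 60%-80% gap as unspecified.

```python
def classify(correct: int, total: int) -> str:
    """Map a model's score on one question to a table column label."""
    if total < 5:
        raise ValueError("the suggestions ask for at least 5 runs per question and model")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    # Integer comparisons avoid floating-point edge cases at exactly 40%, 60%, 80%.
    if 5 * correct >= 4 * total:                   # accuracy >= 80%
        return ":heavy_check_mark:"
    if 2 * total <= 5 * correct <= 3 * total:      # 40% <= accuracy <= 60%
        return ":x: (unstable)"
    if 5 * correct < 2 * total:                    # accuracy < 40% (assumption)
        return ":x:"
    return "unspecified"                           # 60%-80%: not covered by the rules


print(classify(4, 5))    # :heavy_check_mark:
print(classify(3, 5))    # :x: (unstable)
print(classify(2, 10))   # :x:
print(classify(7, 10))   # unspecified
```

For example, a model that gets 2 of 10 runs right lands in the :x: column without the "(unstable)" mark under the stated assumption.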

Professional LLM benchmark: LiveBench: A Challenging, Contamination-Free LLM Benchmark

555 Likes

Impressive work, master! I'll go try these out right now.

33 Likes

o1 is still the strongest.

20 Likes

https://linux.do/t/topic/273810?u=yeahhe

For the o1 model, you can try this member's service. I tested it: their o1 works, but their o1 mini has problems.

21 Likes


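Alibaba's Marco-o1 model
Test question

In the regular square frustum ABCD-A1B1C1D1, AB=2, A1B1=1, and AA1=√2. What is the volume of the frustum?

Summary: it has a built-in chain of thought similar to o1 and checks its own answer (at most twice in my tests; if both checks fail, it automatically switches to whatever answer the model itself considers correct). It is probably at about the qwen32b level.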
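
For reference, the value listed in the question bank for this problem, $\frac{7\sqrt{6}}{6}$, can be checked with a short derivation. This is a sketch under the usual reading of "regular square frustum": both bases are squares centered on a common axis, so a lateral edge spans the gap between the two half-diagonals. The symbols $h$, $S$ and $S_1$ (height and the two base areas) are introduced here only for illustration.

$$
\begin{aligned}
\tfrac{1}{2}AC &= \sqrt{2}, \qquad \tfrac{1}{2}A_1C_1 = \tfrac{\sqrt{2}}{2} \\
h &= \sqrt{AA_1^2 - \left(\sqrt{2} - \tfrac{\sqrt{2}}{2}\right)^2} = \sqrt{2 - \tfrac{1}{2}} = \tfrac{\sqrt{6}}{2} \\
V &= \tfrac{h}{3}\left(S + S_1 + \sqrt{S S_1}\right) = \tfrac{\sqrt{6}}{6}\,(4 + 1 + 2) = \tfrac{7\sqrt{6}}{6}
\end{aligned}
$$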

24 Likes

QwQ is also an Alibaba model; it would be worth comparing.

17 Likes

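This answer looks like GPT-4o level. If it still gets the problem wrong after repeatedly re-checking its work, the base model is probably not very strong.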

26 Likes

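QwQ: 2 correct out of 10 runs (one run got stuck in an infinite loop, and one overthought and changed a correct answer to a wrong one).

Summary: it relies on piling up tokens and often runs out of tokens, so the answer had to be completed across two conversations.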

21 Likes

Is that the QwQ on Hugging Face? Hugging Face cuts responses off. The QwQ on SiliconFlow is much better.

11 Likes

2 correct out of 10, which matches the results in the question bank.

19 Likes

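The HF one.

Also, for some reason my o1-preview often gives me 7√2/3 for this question. I don't know whether that is due to the question's formatting or the model itself.
I'm using the official API.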

13 Likes

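An experienced member has tested this one many times, so I have no doubts about it. The model you're using has probably been cut down or dumbed down.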

13 Likes

That doesn't add up, though: it can answer the first question but not the second?

14 Likes


One test run, correct result; it thought for 1 minute 53 seconds, reasoning in English.

12 Likes

It can get it right...

18 Likes

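Questions that no model at all can answer correctly shouldn't be included, right? At least one model should be able to get it right, yes?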

22 Likes

With a rough thinking hint it can still answer correctly; without a hint, it indeed can't.

19 Likes

Could you share the prompt? It might help close the current gap.

21 Likes

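The prompt is mainly about thinking, and I don't think it's anything special. The key part is telling the model that it can work out the asker's intent and then answer from the corresponding training data, which seems to help it give better answers. But I reran your test question several times on 1206, and only 1 out of 5 runs gave 15, so the improvement is small. It could also just be luck.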

19 Likes

Tested three times; o1 pro got it right only once.

18 Likes