v2.0 · Premium · 10 questions · 85 marks. Deliberately weighted to Hard and out-of-the-box problems for the 6→7 student: combined/pooled statistics, the two regression lines, Simpson's paradox, and "same summary stats, different data." Two Medium warm-ups only. D1 is SL content; no Paper 3, but Part IV pushes into exploration territory. Companion Solutions file in Solutions/.
PART I · PAPER 1 SECTION A
No calculator · short response · 20 marks
Section A: Short Response
Short does not mean easy. Each item rewards a clean argument over a memorised formula. Marks are awarded for correct method as well as final answers. No calculator permitted.
Q1 · Paper 1A · [6 marks]
A dataset $\{x_i\}$ is transformed by $y_i = 3x_i - 7$. The transformed data have mean $\bar{y} = 50$ and standard deviation $\sigma_y = 9$.
(a) Find the mean $\bar{x}$ and standard deviation $\sigma_x$ of the original data. [3]
(b) One extra reading, exactly equal to $\bar{x}$, is appended to the original dataset. State the new mean, and state, with a reason, whether the standard deviation increases, decreases, or stays the same. [3]
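Once the question has been attempted by hand, the effect of a linear transformation and of an appended reading can be checked numerically. The sketch below is only an illustration: the sample `xs` is made up (the question does not give raw data), and `pstdev` from Python's standard `statistics` module is assumed to be the intended population standard deviation.

```python
from statistics import fmean, pstdev

# Hypothetical sample for illustration only; the question's raw data are not given.
xs = [2, 5, 5, 8, 10]

# Apply the same transformation as in the question: y = 3x - 7.
ys = [3 * x - 7 for x in xs]

mean_x, sd_x = fmean(xs), pstdev(xs)
mean_y, sd_y = fmean(ys), pstdev(ys)

# Append one extra reading exactly equal to the current mean of xs.
xs_extended = xs + [mean_x]
mean_ext, sd_ext = fmean(xs_extended), pstdev(xs_extended)

print(f"x: mean={mean_x:.3f}, sd={sd_x:.3f}")
print(f"y: mean={mean_y:.3f}, sd={sd_y:.3f}")
print(f"x with extra reading: mean={mean_ext:.3f}, sd={sd_ext:.3f}")
```

Comparing the three printed lines shows how each statistic responds; the written answer still needs the algebraic reason.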
Q2 · HARD · Paper 1A · 4.3 Combined Mean · [7 marks]
Group A consists of $12$ values with mean $15$. Group B consists of $8$ values with mean $m$. When the two groups are pooled, the combined mean of all $20$ values is $18$.
(a) Find $m$. [3]
(b) After pooling, one value originally in Group A is found to have been recorded as $9$ but should have been $39$. Find the corrected combined mean of all $20$ values. [2]
(c) Explain why the combined mean in (a) is not the simple average of $15$ and $m$. [2]
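The idea behind a combined mean can be sketched in a few lines of Python. The helper name `combined_mean` and the group sizes and means below are hypothetical, chosen to differ from the question's numbers; the point is only that the pooled mean weights each group mean by its size.

```python
def combined_mean(sizes, means):
    """Mean of the pooled data, given each group's size and mean."""
    total = sum(n * m for n, m in zip(sizes, means))  # sum of every value
    return total / sum(sizes)

# Hypothetical groups, not the ones in the question.
sizes = [10, 30]
means = [4.0, 8.0]

pooled = combined_mean(sizes, means)   # weights each mean by group size
simple = sum(means) / len(means)       # ignores group sizes

print(pooled, simple)  # 7.0 6.0
```

The two results agree only when the groups are the same size (or have the same mean).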
Q3 · Paper 1A · [7 marks]
A dataset consists of five positive integers. It has:
median $= 3$,
a single mode equal to $3$ (the mode is unique),
mean $= 4$, and range $= 8$.
(a) Find all datasets satisfying every condition. [5]
(b) Justify that your list is complete (i.e. no other dataset works). [2]
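A brute-force search is one way to check a hand-built list after the fact. It does not replace the written justification in (b). The function name `matching_datasets` and the search bound `max_value` are assumptions of this sketch; any bound of at least $16$ is safe, because five positive integers summing to $20$ cannot include a value above $16$.

```python
from itertools import combinations_with_replacement
from statistics import multimode

def matching_datasets(max_value=20):
    """Enumerate sorted 5-tuples of positive integers meeting every condition."""
    found = []
    for data in combinations_with_replacement(range(1, max_value + 1), 5):
        if data[2] != 3:                 # median of five sorted values
            continue
        if sum(data) != 20:              # mean 4 means the sum is 20
            continue
        if data[-1] - data[0] != 8:      # range
            continue
        if multimode(data) != [3]:       # 3 must be the only mode
            continue
        found.append(data)
    return found

results = matching_datasets()
print(len(results), results)
```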
PART II · PAPER 1 SECTION B
No calculator · extended response · 20 marks
Section B: Extended Response
Set out a clean, ordered argument. Where a part says "show that," reproduce every key step. Marks are awarded for method, accuracy, and reasoning. No calculator permitted.
Q4 · HARD · Paper 1B · 4.4 Two Regression Lines · [12 marks]
Five paired observations $(x, y)$ are recorded:
(a) Show that $\bar{x} = 3$ and $\bar{y} = 4$. [2]
(b) In a single tidy table, find $S_{xx}$, $S_{yy}$, and $S_{xy}$. [4]
(c) Find the equation of the regression line of $y$ on $x$, and of the regression line of $x$ on $y$. State the slope and intercept of each as exact fractions or decimals. [3]
(d) Show that the product of the two gradients equals $r^2$, and hence write down $r$. [2]
(e) The two lines are different, yet they intersect. State the point of intersection and explain in one sentence why the lines differ. [1]
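The relationship between the two regression lines can be explored numerically. The sketch below uses made-up data, labelled as hypothetical in the code, rather than the question's observations, and the function name `regression_summary` is likewise an assumption. It computes $S_{xx}$, $S_{yy}$, $S_{xy}$, both gradients, and $r$ from their definitions.

```python
from math import sqrt

def regression_summary(xs, ys):
    """Return the means, both regression gradients, and r for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return {
        "x_bar": x_bar,
        "y_bar": y_bar,
        "b_yx": s_xy / s_xx,             # gradient of y on x (minimises vertical errors)
        "b_xy": s_xy / s_yy,             # gradient of x on y (minimises horizontal errors)
        "r": s_xy / sqrt(s_xx * s_yy),
    }

# Hypothetical data for illustration; not the table from the question.
xs = [1, 2, 4, 5, 8]
ys = [3, 3, 6, 5, 8]

summary = regression_summary(xs, ys)
product = summary["b_yx"] * summary["b_xy"]
print(product, summary["r"] ** 2)
```

Both lines are built from deviations about $(\bar{x}, \bar{y})$, so substituting the means into either equation is a quick consistency check.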
Q5 · Paper 1B · [8 marks]
A dataset $D$ of $n = 20$ values has mean $50$, standard deviation $10$, median $48$, and interquartile range $16$. Each part below applies separately to the original $D$. For each, state the new value of every one of the four statistics (mean, median, SD, IQR), or write "cannot be determined" with a one-line reason if a value is not fixed by the information given.
(a) Add $5$ to every value. [2]
(b) Multiply every value by $2$. [2]
(c) Append one new value equal to $50$ (so $n$ becomes $21$). [2]
(d) A single data-entry error inflates exactly one value by $100$ (no other value changes; $n$ stays $20$). [2]
PART III · PAPER 2
Calculator (GDC) required · 26 marks
Section C: Paper 2 (Calculator)
A graphic display calculator is required. Use GDC features (1-Var Stats, LinReg) where helpful, but always state the values you read off; never leave a bare "GDC" with no number behind it. Round to 3 s.f. unless told otherwise.
Q6 · MEDIUM · Paper 2 · 4.4 Regression & Residual · [7 marks]
A researcher records weekly study hours $x$ and a diagnostic score $y$ for $6$ students:

| $x$ | 1 | 3 | 4 | 6 | 7 | 9 |
|---|---|---|---|---|---|---|
| $y$ | 8 | 11 | 11 | 16 | 18 | 20 |

(a) Find the PMCC $r$ and the regression line $y = a + bx$. [3]
(b) Interpret the gradient $b$ in the context of this study, in one sentence. [2]
(c) Find the residual (observed $-$ predicted) for the student with $x = 6$, and say what its sign tells you. [2]
Q7 · HARD · Paper 2 · 4.3 Pooled SD & z-scores · [10 marks]
Two cohorts sit the same examination. Cohort A: $n_A = 30$ students, mean $62$, standard deviation $9$. Cohort B: $n_B = 20$ students, mean $71$, standard deviation $6$.
(a) Find the mean of all $50$ students combined. [2]
(b) Using the identity $\sum x^2 = n\big(\sigma^2 + \bar{x}^2\big)$ for each cohort, find $\sum x^2$ for the combined group and hence the standard deviation of all $50$ scores. [5]
(c) A student in Cohort A and a student in Cohort B each scored $80$. Using each cohort's own mean and SD, find both $z$-scores and state who performed more exceptionally relative to their own cohort. [3]
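The route suggested in (b) can be sketched as code: recover $\sum x$ and $\sum x^2$ for each cohort from its summary statistics, add them, and rebuild the combined mean and SD. The function names `combine` and `z_score` are hypothetical, and the sketch assumes the given standard deviations are population values (divisor $n$), as the identity in (b) implies.

```python
from math import sqrt

def combine(groups):
    """groups: list of (n, mean, population SD) triples; returns (mean, SD) of the pooled data."""
    n_total = sum(n for n, _, _ in groups)
    sum_x = sum(n * mean for n, mean, _ in groups)
    # Identity from part (b): sum of squares = n * (sd^2 + mean^2)
    sum_x2 = sum(n * (sd ** 2 + mean ** 2) for n, mean, sd in groups)
    mean_all = sum_x / n_total
    var_all = sum_x2 / n_total - mean_all ** 2
    return mean_all, sqrt(var_all)

def z_score(x, mean, sd):
    """Number of standard deviations x lies above the mean."""
    return (x - mean) / sd

cohorts = [(30, 62, 9), (20, 71, 6)]   # (n, mean, SD) for Cohort A and Cohort B
mean_all, sd_all = combine(cohorts)
z_a = z_score(80, 62, 9)
z_b = z_score(80, 71, 6)

print(f"combined mean = {mean_all:.3g}, combined SD = {sd_all:.3g}")
print(f"z_A = {z_a:.3g}, z_B = {z_b:.3g}")
```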
Q8 · HARD · Paper 2 · 4.4 Simpson's Paradox · [9 marks]
A study of a new fertilizer collects $(x, y)$ data, where $x$ is dose and $y$ is yield, from two soil types. The six observations are:

| Soil type | Observations $(x, y)$ |
|---|---|
| Soil 1 | $(1,8)$, $(2,7)$, $(3,6)$ |
| Soil 2 | $(6,12)$, $(7,11)$, $(8,10)$ |

(a) Treating each soil type separately, state the value of $r$ for Soil 1 and for Soil 2. (Each set is exactly collinear, so you may state $r$ by inspection.) [2]
(b) Now pool all six points and use the GDC to find $r$ for the combined data. [3]
(c) The sign of the correlation flips between (a) and (b). Name this phenomenon and identify the lurking variable. [2]
(d) A scientist wants to advise farmers on dosing. Which analysis, the pooled line or the within-soil lines, should guide the advice, and why? [2]
PART IV · OUT-OF-THE-BOX
Exploration · reasoning & proof · 19 marks
Section D: Stretch & Exploration
These problems are not harder arithmetic; they ask you to think about what the statistics actually mean. This is where a 7 is built. A calculator is allowed but rarely the point.
Q9 · HARD · OUT-OF-THE-BOX · 4.3 Same Stats, Different Data · [10 marks]
Consider the two datasets, each of size $5$:

| $A$ | 2 | 4 | 6 | 8 | 10 |
|---|---|---|---|---|---|
| $B$ | 3 | 4 | 5 | 7 | 11 |

(a) Show that $A$ and $B$ have the same mean and the same variance. [4]
(b) Find the median of each. Use the relationship between mean and median to describe the shape (symmetry / skew) of each dataset. [3]
(c) A report describes a dataset using only its mean and standard deviation. Explain what this example shows about the limits of that practice, and name one extra thing a reader should ask for. [3]
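A short script can confirm the summary statistics side by side once the hand calculation is done. This is a sketch using Python's standard `statistics` module, with `pvariance` assumed to be the intended population variance (divisor $n$).

```python
from statistics import fmean, median, pvariance

A = [2, 4, 6, 8, 10]
B = [3, 4, 5, 7, 11]

# (mean, population variance, median) for each dataset
summary = {
    name: (fmean(data), pvariance(data), median(data))
    for name, data in (("A", A), ("B", B))
}

for name, (mean, var, med) in summary.items():
    print(f"{name}: mean={mean}, variance={var}, median={med}")
```

Adding the median to the printout already separates the two datasets, which is the kind of extra information part (c) is probing.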
Q10 · OUT-OF-THE-BOX · [9 marks]
This question is about the computational form of the variance.
(a) Prove the identity $\displaystyle \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$, stating clearly where you use $\sum x_i = n\bar{x}$. [4]
(b) A dataset of $n = 8$ values has $\sum x = 96$ and $\sum x^2 = 1256$. Find the mean, the variance, and the standard deviation. [3]
(c) A ninth value, equal to the current mean, is now added. Using your identity (not a fresh calculation from raw data), find the new variance, and confirm it is consistent with your answer to Q1(b). [2]
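The computational form lends itself to updating a variance from running totals alone. The sketch below assumes the population variance (divisor $n$) and uses two hypothetical helpers, `variance_from_sums` and `append_value`, which work only from $n$, $\sum x$, and $\sum x^2$, never from raw data.

```python
def variance_from_sums(n, sum_x, sum_x2):
    """Population variance via the identity: (sum of squares)/n minus mean squared."""
    mean = sum_x / n
    return sum_x2 / n - mean ** 2

def append_value(n, sum_x, sum_x2, value):
    """Update the three running totals when one value is added."""
    return n + 1, sum_x + value, sum_x2 + value ** 2

# Totals given in part (b).
n, sum_x, sum_x2 = 8, 96, 1256
mean_before = sum_x / n
var_before = variance_from_sums(n, sum_x, sum_x2)

# Part (c): add a ninth value equal to the current mean.
n9, sum_x9, sum_x2_9 = append_value(n, sum_x, sum_x2, mean_before)
var_after = variance_from_sums(n9, sum_x9, sum_x2_9)

print(f"before: mean={mean_before}, variance={var_before}")
print(f"after:  mean={sum_x9 / n9}, variance={var_after:.4g}")
```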