IB MATH AA HL
Unit D1 · Solutions · Premium

Univariate Data: Solutions

Companion to the 7-Chaser Practice Set


Syllabus 4.1 – 4.4 · AA HL

v2.0 · companion to Unit_D1_Univariate_Data_Practice.html v2.0 · 10 Qs · 85 marks · IB-style mark-by-mark with M1 / A1 / R1 callouts

PART I  ·  PAPER 1 SECTION A: SOLUTIONS  ·  No calculator · 20 marks

Section A: Worked Solutions

Q1 · MEDIUM · Paper 1A · 4.3 Reverse Transformation [6 marks]

$y_i = 3x_i - 7$ gives $\bar{y} = 50$, $\sigma_y = 9$. (a) Find $\bar{x}$, $\sigma_x$. (b) Append a value $= \bar{x}$ to the original data; find the new mean, and state whether the SD rises, falls, or stays the same.

Answers:  (a) $\bar{x} = 19$, $\sigma_x = 3$  ·  (b) mean stays $19$; SD decreases

(a) Invert the transformation M1·A1·A1

For $y = ax + b$ we have $\bar{y} = a\bar{x} + b$ and $\sigma_y = |a|\,\sigma_x$. With $a = 3$, $b = -7$: $$ 50 = 3\bar{x} - 7 \;\Rightarrow\; \bar{x} = \frac{57}{3} = \boxed{19}, \qquad \sigma_x = \frac{\sigma_y}{|a|} = \frac{9}{3} = \boxed{3}. $$

(b) Append a point equal to the mean A1·R1·R1

The new value is $\bar{x} = 19$. Adding a value equal to the mean does not move the mean, so the new mean is still $19$.

For the SD, write the (population) variance as $\sigma^2 = \dfrac{\sum (x_i - \bar{x})^2}{n}$. The appended point has deviation $19 - 19 = 0$, so the numerator $\sum (x_i - \bar{x})^2$ is unchanged, while the denominator grows from $n$ to $n+1$. A fixed positive numerator over a larger denominator gives a smaller variance, so the standard deviation decreases.

The mean is the "balance point." Adding data exactly at the mean shrinks the spread without shifting the centre (provided the values are not all identical). This is the same mechanism behind Q10(c); keep it in your pocket for "effect on SD" questions where you are given no raw data.
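As a quick numerical check of both parts, here is a minimal Python sketch. The four-value list `xs` is a hypothetical dataset (the question gives no raw data), chosen only so that its mean is 19 and its population SD is 3; the helper name `undo_linear` is also just illustrative.

```python
from statistics import mean, pstdev

def undo_linear(mean_y, sd_y, a, b):
    """Recover (mean_x, sd_x) from the summary statistics of y = a*x + b."""
    return (mean_y - b) / a, sd_y / abs(a)

# Part (a): invert y = 3x - 7.
mean_x, sd_x = undo_linear(50, 9, a=3, b=-7)
print(mean_x, sd_x)  # 19.0 3.0

# Hypothetical raw data with mean 19 and population SD 3 (not from the question).
xs = [16, 16, 22, 22]
ys = [3 * x - 7 for x in xs]
print(mean(ys), pstdev(ys))          # should reproduce mean 50 and SD 9

# Part (b): append one value equal to the mean and compare the SD.
xs_plus = xs + [mean(xs)]
print(mean(xs_plus), pstdev(xs_plus) < pstdev(xs))
```

Any other dataset with the same mean and SD would show the same behaviour in part (b): the mean is unchanged and the population SD drops.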

Q2 · HARD · Paper 1A · 4.3 Combined Mean [7 marks]

A: $12$ values, mean $15$. B: $8$ values, mean $m$. Combined mean $= 18$. (a) Find $m$. (b) A "9" in A should be "39"; find the corrected combined mean. (c) Explain why the combined mean is not the average of $15$ and $m$.

Answers:  (a) $m = 22.5$  ·  (b) $19.5$  ·  (c) unequal group sizes → weighted mean

(a) Use total $=$ mean $\times$ count M1·M1·A1

The combined mean uses the sum of all values over the total count: $$ \frac{12(15) + 8m}{20} = 18 \;\Rightarrow\; 180 + 8m = 360 \;\Rightarrow\; 8m = 180 \;\Rightarrow\; m = \boxed{22.5}. $$

(b) Correct one value M1·A1

Replacing a $9$ by a $39$ adds $39 - 9 = 30$ to the grand total. Original grand total $= 18 \times 20 = 360$, so corrected total $= 390$ and $$ \bar{x}_{\text{corrected}} = \frac{390}{20} = \boxed{19.5}. $$

(c) Why not the simple average R1

The simple average $\tfrac{1}{2}(15 + 22.5) = 18.75$ would be correct only if the two groups were the same size. Here Group A has more values ($12 > 8$), so its mean of $15$ is weighted more heavily, pulling the combined mean below $18.75$ to exactly $18$. The combined mean is a size-weighted average: $\bar{x} = \dfrac{n_A\bar{x}_A + n_B\bar{x}_B}{n_A + n_B}$.
Weighted-mean trap. Examiners love offering the unweighted average as a distractor. Whenever you merge groups of different sizes, weight by count — never average the means directly.
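The three parts can be re-checked in a few lines of Python. This is only a sketch of the arithmetic above; the function name `combined_mean` is an illustrative choice.

```python
def combined_mean(groups):
    """Size-weighted mean of (count, mean) pairs."""
    total = sum(n * m for n, m in groups)
    count = sum(n for n, _ in groups)
    return total / count

# (a) Solve 12*15 + 8*m = 18*20 for m.
m = (18 * 20 - 12 * 15) / 8
# (b) Replacing a 9 by a 39 adds 30 to the grand total.
corrected = (18 * 20 + (39 - 9)) / 20
# (c) Weighted mean versus the naive average of the two group means.
weighted = combined_mean([(12, 15), (8, m)])
naive = (15 + m) / 2
print(m, corrected, weighted, naive)  # 22.5 19.5 18.0 18.75
```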

Q3 · HARD · Paper 1A · 4.2 / 4.3 Constraint Puzzle [7 marks]

Five positive integers: median $3$, unique mode $3$, mean $4$, range $8$. (a) Find all such datasets. (b) Justify completeness.

Answers:  (a) the only dataset is $\{1, 3, 3, 4, 9\}$  ·  (b) exhaustive case-check on the smallest value

Set up the sorted list M1

Write the five integers in non-decreasing order $a \le b \le c \le d \le e$. The conditions give:
  • median $= c = 3$ (the middle of five),
  • mean $4 \Rightarrow a + b + c + d + e = 20$, so $a + b + d + e = 17$,
  • range $8 \Rightarrow e = a + 8$,
  • unique mode $3 \Rightarrow$ the value $3$ occurs at least twice, and strictly more often than any other value.
Substituting $e = a + 8$ into $a + b + d + e = 17$ gives $\;2a + b + d = 9\;$ with $a \le b \le 3 \le d \le e$.

(a) Search over the smallest value M1·A1·A1

Since $a \ge 1$ and $2a \le 2a + b + d = 9$, we have $a \le 4$; and $a \le c = 3$. Test $a = 1, 2, 3$:
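$$ \begin{array}{c|c|l|l}
a & b + d & \text{candidates } (b \le 3 \le d) & \text{valid?} \\ \hline
1 & 7 & (1,6),\ (2,5),\ (3,4) & \text{only } (3,4) \\
2 & 5 & (2,3) \quad [\text{need } b \ge 2,\ d \ge 3] & \text{none} \\
3 & 3 & b \ge 3,\ d \ge 3 \Rightarrow b + d \ge 6 > 3 & \text{none}
\end{array} $$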

For $a=1$: with $(b,d)=(3,4)$, the set is $\{1,3,3,4,9\}$ — value $3$ appears twice, every other value once, so mode $3$ is unique. ✓ The pairs $(1,6)$ and $(2,5)$ give $\{1,1,3,6,9\}$ and $\{1,2,3,5,9\}$, in which $3$ appears only once, so $3$ is not the mode. ✗ For $a=2$, the only split $(b,d)=(2,3)$ gives $\{2,2,3,3,10\}$, where $2$ and $3$ each appear twice — mode not unique. ✗

So the unique answer is $\boxed{\{1, 3, 3, 4, 9\}}$  (check: sum $=20$, mean $4$; sorted middle $=3$; mode $3$; range $9-1=8$). ✓

(b) Completeness R1

Every dataset must have $a \in \{1, 2, 3\}$ (positive integer, $a \le 3$, and $2a \le 9$). The table tests all three, and only $a=1$ survives with exactly one admissible $(b,d)$. A triple-$3$ solution is impossible: $\{a,3,3,3,e\}$ forces $a + e = 11$ with $e = a+8$, giving $a = 1.5 \notin \mathbb{Z}$. Hence the list is complete.
Constraint puzzles. Sort first, translate each clue into an equation/inequality, then bound one variable (here $a$) to a tiny finite range and exhaust it. "Find all" questions are really "bound, then check" questions.
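The "bound, then check" idea can also be confirmed by brute force. The sketch below is not part of the mark scheme; it enumerates every sorted 5-tuple of positive integers up to 16 (the largest value possible when the sum is 20) and keeps those meeting all four conditions. The helper name `is_valid` is an illustrative choice.

```python
from collections import Counter
from itertools import combinations_with_replacement

def is_valid(data):
    """Check median 3, unique mode 3, mean 4 and range 8 for a sorted 5-tuple."""
    counts = Counter(data).most_common()
    top_value, top_count = counts[0]
    # The mode is unique only if no other value ties the highest count.
    unique_mode = top_value == 3 and (len(counts) == 1 or counts[1][1] < top_count)
    return (data[2] == 3 and sum(data) == 20
            and data[-1] - data[0] == 8 and unique_mode)

# Mean 4 forces the sum to be 20, so no positive entry can exceed 16.
solutions = [d for d in combinations_with_replacement(range(1, 17), 5) if is_valid(d)]
print(solutions)  # [(1, 3, 3, 4, 9)]
```

An exhaustive search is no substitute for the written argument in an exam, but it is a useful way to check that no case was missed.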

PART II  ·  PAPER 1 SECTION B: SOLUTIONS  ·  No calculator · 20 marks

Section B: Extended Solutions

Q4 · HARD · Paper 1B · 4.4 Two Regression Lines [12 marks]

$x = 1,2,3,4,5$; $y = 2,3,5,4,6$. (a) Show $\bar{x}=3,\bar{y}=4$. (b) Find $S_{xx},S_{yy},S_{xy}$. (c) Find both regression lines. (d) Show that the product of the gradients $= r^2$; find $r$. (e) Find the intersection and explain why the lines differ.

Answers:  (b) $S_{xx}=10,\,S_{yy}=10,\,S_{xy}=9$  ·  (c) $y=0.9x+1.3$; $\;x=0.9y-0.6$  ·  (d) $0.81=r^2,\;r=0.9$  ·  (e) meet at $(3,4)$

(a) Means A1·A1

$$ \bar{x} = \frac{1+2+3+4+5}{5} = \frac{15}{5} = 3, \qquad \bar{y} = \frac{2+3+5+4+6}{5} = \frac{20}{5} = 4. \quad \blacksquare $$

(b) Sums of squares and products M1·A1·A1·A1

$$ \begin{array}{cc|cc|ccc}
x_i & y_i & x_i-\bar{x} & y_i-\bar{y} & (x-\bar x)^2 & (y-\bar y)^2 & (x-\bar x)(y-\bar y) \\ \hline
1 & 2 & -2 & -2 & 4 & 4 & 4 \\
2 & 3 & -1 & -1 & 1 & 1 & 1 \\
3 & 5 & 0 & 1 & 0 & 1 & 0 \\
4 & 4 & 1 & 0 & 1 & 0 & 0 \\
5 & 6 & 2 & 2 & 4 & 4 & 4 \\ \hline
\text{Totals} & & & & S_{xx}=10 & S_{yy}=10 & S_{xy}=9
\end{array} $$

(c) The two regression lines M1·A1·A1

$y$ on $x$ minimises vertical errors, gradient $b_{y|x} = \dfrac{S_{xy}}{S_{xx}} = \dfrac{9}{10} = 0.9$, through $(3,4)$: $$ y - 4 = 0.9(x - 3) \;\Rightarrow\; \boxed{y = 0.9x + 1.3}. $$ $x$ on $y$ minimises horizontal errors, gradient $b_{x|y} = \dfrac{S_{xy}}{S_{yy}} = \dfrac{9}{10} = 0.9$, through $(3,4)$: $$ x - 3 = 0.9(y - 4) \;\Rightarrow\; \boxed{x = 0.9y - 0.6}. $$

(d) Product of gradients M1·A1

$$ b_{y|x}\cdot b_{x|y} = \frac{S_{xy}}{S_{xx}}\cdot\frac{S_{xy}}{S_{yy}} = \frac{S_{xy}^2}{S_{xx}S_{yy}} = r^2. $$ Numerically $0.9 \times 0.9 = 0.81 = r^2$, and since $S_{xy} = 9 > 0$ the correlation is positive, so $r = +\sqrt{0.81} = \boxed{0.9}$.

(e) Intersection A1

Both lines pass through $(\bar{x}, \bar{y}) = \boxed{(3, 4)}$. They differ because they minimise different sums of squared errors (vertical deviations for $y$ on $x$, horizontal deviations for $x$ on $y$), so they coincide only in the perfect-correlation case $|r| = 1$.
Why two lines exist. "The" regression line depends on which variable you treat as the response. The closer the two lines, the stronger the linear relationship; their gradients pinch together exactly as $r^2 \to 1$. The IB formula booklet gives only $b_{y|x}$, but the geometry of both is fair game on Paper 3-style reasoning.
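Parts (b) to (d) can be reproduced directly from the definitions. The following is a minimal sketch in plain Python (no statistics library); the function and variable names are illustrative choices.

```python
def sums_of_squares(xs, ys):
    """Return (x_bar, y_bar, Sxx, Syy, Sxy) for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return x_bar, y_bar, sxx, syy, sxy

xs, ys = [1, 2, 3, 4, 5], [2, 3, 5, 4, 6]
x_bar, y_bar, sxx, syy, sxy = sums_of_squares(xs, ys)

b_yx = sxy / sxx                 # gradient of y on x
b_xy = sxy / syy                 # gradient of x on y
a_yx = y_bar - b_yx * x_bar      # intercept of y on x (line passes through the mean point)
a_xy = x_bar - b_xy * y_bar      # intercept of x on y
r = sxy / (sxx * syy) ** 0.5

print(sxx, syy, sxy)
print(round(b_yx, 4), round(a_yx, 4), round(b_xy, 4), round(a_xy, 4))
print(round(b_yx * b_xy, 4), round(r ** 2, 4))   # the two should agree
```

Because $S_{xx} = S_{yy}$ here, the two gradients happen to be equal; with other data they generally differ, but their product is still $r^2$.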

Q5 · HARD · Paper 1B · 4.2 / 4.3 Resistance & Robustness [8 marks]

$D$: $n=20$, mean $50$, SD $10$, median $48$, IQR $16$. For each separate change, give the new mean/median/SD/IQR (or "cannot be determined"). (a) $+5$ to all. (b) $\times 2$ all. (c) append $50$. (d) one value $+100$.

Answers:  (a) $55,\,53,\,10,\,16$  ·  (b) $100,\,96,\,20,\,32$  ·  (c) mean $50$, SD $\downarrow\approx 9.76$, median/IQR cannot be determined  ·  (d) mean $55$, SD $\uparrow$, median/IQR cannot be determined (resistant)

(a) Shift by $+5$ A1·A1

A pure translation $x \mapsto x + 5$ moves every location measure by $+5$ and leaves every spread measure unchanged: $$ \text{mean } 55,\quad \text{median } 53,\quad \text{SD } 10,\quad \text{IQR } 16. $$

(b) Scale by $\times 2$ A1·A1

Multiplying by $a = 2$ scales locations by $2$ and spreads by $|a| = 2$: $$ \text{mean } 100,\quad \text{median } 96,\quad \text{SD } 20,\quad \text{IQR } 32. $$

(c) Append one value $= 50$ A1·R1

  • Mean: still $50$ (the new point sits exactly at the mean).
  • SD: decreases. Original $\sum(x-\bar x)^2 = n\sigma^2 = 20(100) = 2000$; the new point adds deviation $0$, so the new variance is $\tfrac{2000}{21} \approx 95.24$, giving SD $\approx \sqrt{95.24} \approx 9.76$.
  • Median & IQR: cannot be determined — they depend on the raw ordering of the 20 values, which we do not have.

(d) One value inflated by $+100$ A1·R1

  • Mean: the total rises by $100$, so the mean rises by $\tfrac{100}{20} = 5$, to $55$.
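  • SD: increases. The new variance works out to $75 + 10x_k$, where $x_k$ is the value that was changed; since $(x_k - 50)^2 \le \sum(x-\bar x)^2 = 2000$ forces $x_k > 5$, this always exceeds the original $100$. The exact amount depends on which value was changed, so no single number can be quoted.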
  • Median & IQR: cannot be determined; they are resistant to a single extreme change and will usually barely move, but the exact values need the raw data.
Resistant vs. sensitive. Median and IQR are resistant (one wild value barely shifts them); mean and SD are sensitive (one wild value drags them). Part (d) is exactly why analysts report the median for skewed data such as incomes and house prices.
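Parts (c) and (d) need only the two totals $\sum x = 1000$ and $\sum x^2 = n(\sigma^2 + \bar{x}^2) = 52\,000$. The sketch below updates those totals for each change; the function names and the trial values of `x_k` are hypothetical, chosen for illustration because the question does not say which value was changed.

```python
N, MEAN, SD = 20, 50, 10
SUM_X = N * MEAN                      # 1000
SUM_X2 = N * (SD ** 2 + MEAN ** 2)    # 52000, from sum x^2 = n(sigma^2 + mean^2)

def variance_after_append(value):
    """Population variance after one extra value is appended (part c)."""
    n = N + 1
    return (SUM_X2 + value ** 2) / n - ((SUM_X + value) / n) ** 2

def variance_after_bump(x_k, delta=100):
    """Population variance after one value x_k is replaced by x_k + delta (part d)."""
    new_sum = SUM_X + delta
    new_sum_sq = SUM_X2 - x_k ** 2 + (x_k + delta) ** 2
    return new_sum_sq / N - (new_sum / N) ** 2

print(round(variance_after_append(50) ** 0.5, 2))   # 9.76
for x_k in (30, 50, 70):                             # hypothetical choices of the changed value
    print(x_k, variance_after_bump(x_k))
```

The loop shows why no single SD can be quoted in (d): the result depends on `x_k`, although every admissible choice gives a variance above the original $100$.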

PART III  ·  PAPER 2: SOLUTIONS  ·  Calculator · 26 marks

Section C: Paper 2 Solutions

Q6 · MEDIUM · Paper 2 · 4.4 Regression & Residual [7 marks]

$x=1,3,4,6,7,9$; $y=8,11,11,16,18,20$. (a) Find $r$ and the regression line. (b) Interpret $b$. (c) Find the residual at $x=6$.

Answers:  (a) $r \approx 0.986$, $y \approx 6.02 + 1.60x$  ·  (b) $\approx 1.60$ score points per study hour  ·  (c) residual $\approx +0.405$

(a) GDC linear regression M1·A1·A1

Enter the six pairs into the GDC's two-list editor and run LinReg($a + bx$). (By hand $\bar x = 5$, $\bar y = 14$, $S_{xx} = 42$, $S_{xy} = 67$, $S_{yy} = 110$, so $b = \tfrac{67}{42}$, $a = 14 - \tfrac{67}{42}(5)$, $r = \tfrac{67}{\sqrt{42\cdot 110}}$.) $$ r \approx 0.986, \qquad b \approx 1.60, \qquad a \approx 6.02, $$ so the regression line is $\boxed{y \approx 6.02 + 1.60x}$ (3 s.f.). The high $r$ confirms a strong positive linear association.

(b) Interpret the gradient A1·A1

Each additional hour of weekly study is associated with an increase of about $1.60$ points on the diagnostic score, on average across this sample.

(c) Residual at $x = 6$ M1·A1

Predicted (full GDC precision $a = 6.0238$, $b = 1.5952$): $\hat{y} = 6.0238 + 1.5952(6) = 15.595$. $$ \text{residual} = y - \hat{y} = 16 - 15.595 = \boxed{+0.405}. $$ The residual is positive, so this student scored above what the model predicts for $6$ study hours.
Keep precision, round at the end. Computing the residual from the 3 s.f. line gives $16 - (6.02 + 1.60 \times 6) = 0.38$; from full GDC values it is $0.405$. Even where both are accepted, carrying unrounded $a,b$ avoids losing a final-answer mark on borderline cases.
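For readers without a GDC to hand, the same regression can be reproduced from the hand formulas quoted in part (a). This is a minimal sketch in plain Python; the variable names are illustrative.

```python
xs = [1, 3, 4, 6, 7, 9]
ys = [8, 11, 11, 16, 18, 20]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sxy / sxx                      # gradient
a = y_bar - b * x_bar              # intercept
r = sxy / (sxx * syy) ** 0.5       # correlation coefficient

# Residual at x = 6 with full precision, then with the 3 s.f. line.
residual_full = 16 - (a + b * 6)
residual_rounded = 16 - (6.02 + 1.60 * 6)
print(round(r, 3), round(a, 4), round(b, 4))                 # 0.986 6.0238 1.5952
print(round(residual_full, 3), round(residual_rounded, 2))   # 0.405 0.38
```

The last line makes the rounding effect visible: premature rounding of $a$ and $b$ shifts the residual in the second decimal place.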

Q7 · HARD · Paper 2 · 4.3 Pooled SD & z-scores [10 marks]

A: $n=30$, mean $62$, SD $9$. B: $n=20$, mean $71$, SD $6$. (a) Find the combined mean. (b) Find the combined $\sum x^2$ and the combined SD via $\sum x^2 = n(\sigma^2+\bar x^2)$. (c) Two students, one in each cohort, score $80$; find their $z$-scores and decide who is more exceptional.

Answers:  (a) $65.6$  ·  (b) $\sum x^2 = 219\,290$, SD $\approx 9.08$  ·  (c) $z_A = 2.00,\;z_B = 1.50$; A more exceptional

(a) Combined mean M1·A1

$$ \bar{x} = \frac{30(62) + 20(71)}{50} = \frac{1860 + 1420}{50} = \frac{3280}{50} = \boxed{65.6}. $$

(b) Pooled $\sum x^2$ and SD M1·M1·A1·A1·A1

Apply $\sum x^2 = n(\sigma^2 + \bar{x}^2)$ to each cohort: $$ \textstyle\sum x_A^2 = 30(9^2 + 62^2) = 30(81 + 3844) = 30(3925) = 117\,750, $$ $$ \textstyle\sum x_B^2 = 20(6^2 + 71^2) = 20(36 + 5041) = 20(5077) = 101\,540. $$ Combined $\sum x^2 = 117\,750 + 101\,540 = 219\,290$. Then for all $50$ scores $$ \sigma^2 = \frac{\sum x^2}{N} - \bar{x}^2 = \frac{219\,290}{50} - 65.6^2 = 4385.8 - 4303.36 = 82.44, $$ $$ \sigma = \sqrt{82.44} \approx \boxed{9.08}. $$

(c) z-scores M1·A1·A1

$$ z_A = \frac{80 - 62}{9} = \frac{18}{9} = 2.00, \qquad z_B = \frac{80 - 71}{6} = \frac{9}{6} = 1.50. $$ Both scored $80$, but relative to their own cohort the Cohort A student is $2.00$ SD above the mean versus $1.50$ for Cohort B, so the Cohort A student performed more exceptionally.
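Pooled SD $\ne$ average of SDs. The two SDs are $9$ and $6$ (average $7.5$), yet the pooled SD is $9.08$, larger than either cohort's own SD, because the gap between the cohort means ($62$ vs $71$) injects extra spread. In variance terms, $82.44 = 63 + 19.44$: the size-weighted average of the two variances plus the size-weighted spread of the cohort means about $65.6$. Merging groups with different means therefore always pushes the combined variance above that weighted average.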
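The combination step works from summaries alone, which the following sketch illustrates. It also splits the combined variance into a within-cohort part and a between-cohort part; the function name `combine` and the variable names are illustrative choices, not notation from the question.

```python
def combine(groups):
    """Combine (n, mean, sd) summaries into an overall (mean, variance)."""
    n_total = sum(n for n, _, _ in groups)
    sum_x = sum(n * m for n, m, _ in groups)
    sum_x2 = sum(n * (s ** 2 + m ** 2) for n, m, s in groups)
    mean = sum_x / n_total
    return mean, sum_x2 / n_total - mean ** 2

cohorts = [(30, 62, 9), (20, 71, 6)]
mean, var = combine(cohorts)

# Within-cohort part: size-weighted average of the two variances.
within = sum(n * s ** 2 for n, _, s in cohorts) / 50
# Between-cohort part: size-weighted spread of the cohort means about the combined mean.
between = sum(n * (m - mean) ** 2 for n, m, _ in cohorts) / 50

print(round(mean, 2), round(var, 2), round(var ** 0.5, 2))  # 65.6 82.44 9.08
print(round(within, 2), round(between, 2))                   # 63.0 19.44
```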

Q8 · HARD · Paper 2 · 4.4 Simpson's Paradox [9 marks]

Soil 1: $(1,8),(2,7),(3,6)$. Soil 2: $(6,12),(7,11),(8,10)$. (a) Find $r$ for each soil. (b) Find $r$ for the pooled data. (c) Name the phenomenon and the lurking variable. (d) Which analysis should guide advice?

Answers:  (a) $r = -1$ for each soil  ·  (b) $r \approx +0.763$  ·  (c) Simpson's paradox; lurking variable = soil type  ·  (d) within-soil lines

(a) Within each soil A1·A1

Soil 1 points $(1,8),(2,7),(3,6)$ lie exactly on $y = 9 - x$; Soil 2 points $(6,12),(7,11),(8,10)$ lie exactly on $y = 18 - x$. Each set is perfectly, negatively collinear, so $r = \boxed{-1}$ for both.

(b) Pooled correlation M1·A1·A1

Enter all six points and run the GDC. (Check by hand: $\bar{x} = 4.5$, $\bar{y} = 9$, $S_{xy} = 26$, $S_{xx} = 41.5$, $S_{yy} = 28$, so $r = \dfrac{26}{\sqrt{41.5 \times 28}} = \dfrac{26}{\sqrt{1162}} \approx \boxed{+0.763}$.) The pooled correlation is strongly positive.

(c) Name the phenomenon A1·A1

This sign reversal on aggregation is Simpson's paradox. The lurking (confounding) variable is the soil type: Soil 2 happens to have both higher doses and higher yields, so pooling makes dose and yield look positively linked even though, holding soil fixed, more dose lowers yield.

(d) Which analysis to trust R1·R1

The advice should follow the within-soil lines. A farmer applies fertilizer on a given field of a fixed soil type, so the relevant relationship is the one with soil held constant — and there, extra dose reduces yield. The pooled positive trend is an artefact of comparing two different soils, not a causal dose effect.
Disaggregate before you advise. Simpson's paradox is why "controlling for confounders" matters. Whenever pooled data mixes groups that differ in both variables, check each group separately before drawing a causal conclusion — the aggregate can point the opposite way.
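The sign reversal is easy to see numerically. Below is a minimal sketch that computes Pearson's $r$ from its definition for each soil and for the pooled data; `pearson_r` is an illustrative helper, not a GDC function.

```python
def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    return sxy / (sxx * syy) ** 0.5

soil_1 = [(1, 8), (2, 7), (3, 6)]
soil_2 = [(6, 12), (7, 11), (8, 10)]

print(round(pearson_r(soil_1), 3), round(pearson_r(soil_2), 3))  # -1.0 -1.0
print(round(pearson_r(soil_1 + soil_2), 3))                       # 0.763
```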

PART IV  ·  OUT-OF-THE-BOX: SOLUTIONS  ·  Exploration · 19 marks

Section D: Stretch Solutions

Q9 · HARD · OUT-OF-THE-BOX · 4.3 Same Stats, Different Data [10 marks]

$A=\{2,4,6,8,10\}$, $B=\{3,4,5,7,11\}$. (a) Show they have the same mean and variance. (b) Find the medians and describe the shapes. (c) Discuss the limits of reporting "mean + SD only."

Answers:  (a) both mean $6$, variance $8$  ·  (b) $A$ median $6$ (symmetric); $B$ median $5$ (right-skew)  ·  (c) summary stats hide shape

(a) Equal mean and variance M1·A1·A1·A1

Means: $\bar{A} = \dfrac{2+4+6+8+10}{5} = \dfrac{30}{5} = 6$, $\;\bar{B} = \dfrac{3+4+5+7+11}{5} = \dfrac{30}{5} = 6.$
$$ \begin{array}{c|l|l}
\text{set} & \text{deviations from } 6 & \sum(x-6)^2 \\ \hline
A & -4,-2,0,2,4 & 16+4+0+4+16 = 40 \\
B & -3,-2,-1,1,5 & 9+4+1+1+25 = 40
\end{array} $$
Both have variance $\sigma^2 = \dfrac{40}{5} = \boxed{8}$ (and hence the same SD $\sqrt{8} = 2\sqrt{2}$).

(b) Medians and shape A1·A1·R1

Both are already sorted; the median is the $3$rd value. $A$: median $= 6 = $ mean, and the values sit in mirror-image pairs about $6$ ($2$ and $10$, $4$ and $8$), so $A$ is symmetric. $B$: median $= 5 < 6 = $ mean, so the mean is pulled above the median by the long upper tail ($11$): $B$ is right-skewed (positively skewed).

(c) The limits of summary statistics R1·R1·R1

Mean and SD are identical for $A$ and $B$, yet the datasets have visibly different shapes — one symmetric, one skewed with an outlier-like high value. So reporting only mean and SD cannot distinguish these distributions: shape, skew, and extreme values are invisible. A reader should also ask for the median (or the full five-number summary / a box plot / the raw data or a histogram).
Anscombe's lesson. Anscombe's quartet (1973) is four datasets with nearly identical means, variances, correlation, and regression line but wildly different scatter plots. The moral, baked into the IB course: always plot the data, because numbers alone can agree while the stories differ completely.
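The comparison of $A$ and $B$ takes only a few lines with Python's standard `statistics` module. This is a small sketch for checking the figures above; `pvariance` is the population variance (divide by $n$), matching the formula used in this solution.

```python
from statistics import mean, median, pvariance

A = [2, 4, 6, 8, 10]
B = [3, 4, 5, 7, 11]

for name, data in (("A", A), ("B", B)):
    # Same mean and variance, but the mean-median gap differs.
    print(name, mean(data), pvariance(data), median(data), mean(data) - median(data))
```

The last column is zero for $A$ and positive for $B$, which is the mean-above-median signature of right skew used in part (b).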

Q10 · HARD · OUT-OF-THE-BOX · 4.3 Variance Identity [9 marks]

(a) Prove $\sum(x_i-\bar x)^2 = \sum x_i^2 - n\bar x^2$. (b) $n=8$, $\sum x=96$, $\sum x^2=1256$: find the mean, variance, and SD. (c) Add a 9th value $=$ mean; find the new variance via the identity and check consistency with Q1(b).

Answers:  (b) mean $12$, variance $13$, SD $\approx 3.61$  ·  (c) new variance $\approx 11.6$ (decreases)

(a) Prove the identity M1·A1·A1·A1

Expand the square and split the sum: $$ \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}\big(x_i^2 - 2\bar{x}x_i + \bar{x}^2\big) = \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2. $$ Now use $\sum x_i = n\bar{x}$ in the middle term: $$ = \sum x_i^2 - 2\bar{x}(n\bar{x}) + n\bar{x}^2 = \sum x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 = \sum x_i^2 - n\bar{x}^2. \quad \blacksquare $$ (The factor $\bar{x}$ is constant, so it comes out of $\sum 2\bar{x}x_i = 2\bar{x}\sum x_i$, and $\sum \bar{x}^2 = n\bar{x}^2$.)

(b) Apply it M1·A1·A1

$$ \bar{x} = \frac{\sum x}{n} = \frac{96}{8} = 12, \qquad \sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2 = \frac{1256}{8} - 12^2 = 157 - 144 = 13, $$ $$ \sigma = \sqrt{13} \approx \boxed{3.61}. $$

(c) Add a value equal to the mean M1·A1

The new value is $12$. Update the two running totals: $n \to 9$, $\sum x \to 96 + 12 = 108$ (so $\bar{x}$ stays $108/9 = 12$), and $\sum x^2 \to 1256 + 12^2 = 1256 + 144 = 1400$. Then $$ \sigma_{\text{new}}^2 = \frac{1400}{9} - 12^2 = 155.\overline{5} - 144 = 11.\overline{5} \approx \boxed{11.6}. $$ The variance fell from $13$ to $\approx 11.6$ — exactly the "adding a point at the mean lowers spread" effect predicted in Q1(b), now with a number attached. ✓
Why this identity is worth memorising. $\sum x^2 - n\bar{x}^2$ lets you update mean and variance by tracking just two running sums, $\sum x$ and $\sum x^2$, alongside $n$. A GDC's 1-Var Stats screen reports both sums, and simple streaming-data algorithms rest on the same idea, with no need to revisit the raw values.
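The running-sums idea can be sketched as a tiny class. This is an illustration under stated assumptions, not how any particular calculator is implemented; the class name `RunningStats` and its attributes are hypothetical.

```python
class RunningStats:
    """Track n, sum x and sum x^2; report mean and population variance."""

    def __init__(self, n=0, sum_x=0.0, sum_x2=0.0):
        self.n, self.sum_x, self.sum_x2 = n, sum_x, sum_x2

    def add(self, value):
        """Fold one new value into the three running totals."""
        self.n += 1
        self.sum_x += value
        self.sum_x2 += value ** 2

    @property
    def mean(self):
        return self.sum_x / self.n

    @property
    def variance(self):
        # sum (x - mean)^2 = sum x^2 - n * mean^2, then divide by n
        return self.sum_x2 / self.n - self.mean ** 2

stats = RunningStats(n=8, sum_x=96, sum_x2=1256)
print(stats.mean, stats.variance)             # 12.0 13.0
stats.add(stats.mean)                         # append a 9th value equal to the mean
print(stats.mean, round(stats.variance, 2))   # 12.0 11.56
```

For exam-sized numbers this form is exact enough. With very large or nearly constant data the subtraction can lose precision, which is why numerically careful code often uses Welford's update instead.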
