IB MATH AA HL
Unit D1 · Statistics & Probability · Premium Set

Univariate Data

IB-Style Practice: 7-Chaser Edition

MEDIUM · HARD · Paper 1A · Paper 1B · Paper 2 · OUT-OF-THE-BOX

Syllabus 4.1 – 4.4 (descriptive · grouped · bivariate linear): depth over recall · AA HL



Name:        Date:

v2.0 · Premium · 10 questions · 85 marks. Deliberately weighted toward Hard and out-of-the-box problems for the 6→7 student: combined/pooled statistics, the two regression lines, Simpson's paradox, and "same summary stats, different data." Only two Medium warm-ups. D1 is SL content, so there is no Paper 3, but Part IV pushes into exploration territory. A companion Solutions file is in Solutions/.

PART I  ·  PAPER 1 SECTION A
No calculator · short response · 20 marks

Section A: Short Response

Short does not mean easy. Each item rewards a clean argument over a memorised formula. Marks are awarded for correct method as well as final answers. No calculator permitted.

Q1 · MEDIUM · Paper 1A · 4.3 · Reverse Transformation [6 marks]

A dataset $\{x_i\}$ is transformed by $y_i = 3x_i - 7$. The transformed data have mean $\bar{y} = 50$ and standard deviation $\sigma_y = 9$.

(a) Find the mean $\bar{x}$ and standard deviation $\sigma_x$ of the original data. [3]
(b) One extra reading, exactly equal to $\bar{x}$, is appended to the original dataset. State the new mean, and state, with a reason, whether the standard deviation increases, decreases, or stays the same. [3]
Q2 · HARD · Paper 1A · 4.3 · Combined Mean [7 marks]

Group A consists of $12$ values with mean $15$. Group B consists of $8$ values with mean $m$. When the two groups are pooled, the combined mean of all $20$ values is $18$.

(a) Find $m$. [3]
(b) After pooling, one value originally in Group A is found to have been recorded as $9$ when it should have been $39$. Find the corrected combined mean of all $20$ values. [2]
(c) Explain why the combined mean in (a) is not the simple average of $15$ and $m$. [2]
Q3 · HARD · Paper 1A · 4.2 / 4.3 · Constraint Puzzle [7 marks]

A dataset consists of five positive integers. It has:

(a) Find all datasets satisfying every condition. [5]
(b) Justify that your list is complete (i.e. no other dataset works). [2]
PART II  ·  PAPER 1 SECTION B
No calculator · extended response · 20 marks

Section B: Extended Response

Set out a clean, ordered argument. Where a part says "show that," reproduce every key step. Marks are awarded for method, accuracy, and reasoning. No calculator permitted.

Q4 · HARD · Paper 1B · 4.4 · Two Regression Lines [12 marks]

Five paired observations $(x, y)$ are recorded:

$x$ | 1 | 2 | 3 | 4 | 5
$y$ | 2 | 3 | 5 | 4 | 6

Throughout, use $S_{xx} = \sum (x_i - \bar{x})^2$, $S_{yy} = \sum (y_i - \bar{y})^2$, and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$.

(a) Show that $\bar{x} = 3$ and $\bar{y} = 4$. [2]
(b) In a single tidy table, find $S_{xx}$, $S_{yy}$, and $S_{xy}$. [4]
(c) Find the equations of the regression line of $y$ on $x$ and of the regression line of $x$ on $y$. Give each as (slope, intercept), using exact fractions or decimals. [3]
(d) Show that the product of the two gradients equals $r^2$, and hence write down $r$. [2]
(e) The two lines are different, yet they intersect. State the point of intersection and explain in one sentence why the lines differ. [1]
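
As a study aid for the ideas in Q4, here is a minimal Python sketch of the $S_{xx}$, $S_{yy}$, $S_{xy}$ bookkeeping and the two regression gradients. The data are made up for illustration (they are not the Q4 table), and the helper name `s_values` is hypothetical.

```python
def s_values(xs, ys):
    """Return (x_bar, y_bar, S_xx, S_yy, S_xy) for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return x_bar, y_bar, s_xx, s_yy, s_xy


# Hypothetical data, chosen only to illustrate the method.
xs = [0, 2, 4, 6]
ys = [1, 2, 2, 5]

x_bar, y_bar, s_xx, s_yy, s_xy = s_values(xs, ys)

b_yx = s_xy / s_xx                      # gradient of the y-on-x line
b_xy = s_xy / s_yy                      # gradient of the x-on-y line (x = c + d*y)
r_squared = s_xy ** 2 / (s_xx * s_yy)   # square of the PMCC

print(b_yx, b_xy)
print(b_yx * b_xy, r_squared)
```

On any dataset where neither variable is constant, the two numbers on the last printed line should agree up to rounding, which is the relationship part (d) asks you to show algebraically.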
Q5 · HARD · Paper 1B · 4.2 / 4.3 · Resistance & Robustness [8 marks]

A dataset $D$ of $n = 20$ values has mean $50$, standard deviation $10$, median $48$, and interquartile range $16$. Each part below applies separately to the original $D$. For each, state the new value of every one of the four statistics (mean, median, SD, IQR), or write "cannot be determined" with a one-line reason if a value is not fixed by the information given.

(a) Add $5$ to every value. [2]
(b) Multiply every value by $2$. [2]
(c) Append one new value equal to $50$ (so $n$ becomes $21$). [2]
(d) A single data-entry error inflates exactly one value by $100$ (no other value changes; $n$ stays $20$). [2]
PART III  ·  PAPER 2
Calculator (GDC) required · 26 marks

Section C: Paper 2 (Calculator)

A graphic display calculator is required. Use GDC features (1-Var Stats, LinReg) where helpful, but always state the values you read off; never leave a bare "GDC" with no number behind it. Round to 3 s.f. unless told otherwise.

Q6 · MEDIUM · Paper 2 · 4.4 · Regression & Residual [7 marks]

A researcher records weekly study hours $x$ and a diagnostic score $y$ for $6$ students:

$x$ | 1 | 3 | 4 | 6 | 7 | 9
$y$ | 8 | 11 | 11 | 16 | 18 | 20
(a) Find the PMCC $r$ and the regression line $y = a + bx$. [3]
(b) Interpret the gradient $b$ in the context of this study, in one sentence. [2]
(c) Find the residual (observed $-$ predicted) for the student with $x = 6$, and say what its sign tells you. [2]
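
For readers who want to see what a LinReg-style fit and a residual look like outside the GDC, the following is a minimal Python sketch. It assumes made-up data (not the Q6 table), and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, chosen only to illustrate the method.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 9]

a, b = fit_line(xs, ys)

# Residual = observed value minus the value predicted by the line.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(a, b)
print(residuals)
```

A negative residual means the observed point lies below the fitted line, and a positive one means it lies above. For a least-squares line with an intercept, the residuals sum to zero up to rounding.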
Q7 · HARD · Paper 2 · 4.3 · Pooled SD & z-scores [10 marks]

Two cohorts sit the same examination. Cohort A: $n_A = 30$ students, mean $62$, standard deviation $9$. Cohort B: $n_B = 20$ students, mean $71$, standard deviation $6$.

(a) Find the mean of all $50$ students combined. [2]
(b) Using the identity $\sum x^2 = n\big(\sigma^2 + \bar{x}^2\big)$ for each cohort, find $\sum x^2$ for the combined group and hence the standard deviation of all $50$ scores. [5]
(c) A student in Cohort A and a student in Cohort B each scored $80$. Using each cohort's own mean and SD, find both $z$-scores and state who performed more exceptionally relative to their own cohort. [3]
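
The identity in part (b) can be sanity-checked numerically. The sketch below assumes two small made-up groups (not the Q7 cohorts) and a hypothetical helper `combine`; it rebuilds the combined mean and population SD from summary statistics only, then compares them with a direct calculation on the pooled raw data.

```python
import math
import statistics


def combine(n1, mean1, sd1, n2, mean2, sd2):
    """Combined mean and population SD from each group's n, mean and SD."""
    n = n1 + n2
    total = n1 * mean1 + n2 * mean2
    # Sum of squares for each group: n * (sigma^2 + mean^2).
    sum_sq = n1 * (sd1 ** 2 + mean1 ** 2) + n2 * (sd2 ** 2 + mean2 ** 2)
    mean = total / n
    variance = sum_sq / n - mean ** 2
    return mean, math.sqrt(variance)


# Hypothetical raw data, used only so the result can be checked directly.
g1 = [2, 4, 6, 8]
g2 = [10, 13]

mean, sd = combine(
    len(g1), statistics.fmean(g1), statistics.pstdev(g1),
    len(g2), statistics.fmean(g2), statistics.pstdev(g2),
)

direct_mean = statistics.fmean(g1 + g2)
direct_sd = statistics.pstdev(g1 + g2)

print(mean, direct_mean)
print(sd, direct_sd)
```

Each printed pair should agree up to rounding. Note that the combined SD is not a weighted average of the two group SDs, because the gap between the group means also contributes spread.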
Q8 · HARD · Paper 2 · 4.4 · Simpson's Paradox [9 marks]

A study of a new fertilizer collects $(x, y)$ data, where $x$ is dose and $y$ is yield, from two soil types. The six observations are:

Soil 1 | $(1,8)$ | $(2,7)$ | $(3,6)$
Soil 2 | $(6,12)$ | $(7,11)$ | $(8,10)$
(a) Treating each soil type separately, state the value of $r$ for Soil 1 and for Soil 2. (Each set is exactly collinear, so you may state $r$ by inspection.) [2]
(b) Now pool all six points and use the GDC to find $r$ for the combined data. [3]
(c) The sign of the correlation flips between (a) and (b). Name this phenomenon and identify the lurking variable. [2]
(d) A scientist wants to advise farmers on dosing. Which analysis (the pooled line or the within-soil lines) should guide the advice, and why? [2]
PART IV  ·  OUT-OF-THE-BOX
Exploration · reasoning & proof · 19 marks

Section D: Stretch & Exploration

These problems are not harder arithmetic; they ask you to think about what the statistics actually mean. This is where a 7 is built. A calculator is allowed but rarely the point.

Q9 · HARD · OUT-OF-THE-BOX · 4.3 · Same Stats, Different Data [10 marks]

Consider the two datasets, each of size $5$:

$A$ | 2 | 4 | 6 | 8 | 10
$B$ | 3 | 4 | 5 | 7 | 11
(a) Show that $A$ and $B$ have the same mean and the same variance. [4]
(b) Find the median of each. Use the relationship between mean and median to describe the shape (symmetry / skew) of each dataset. [3]
(c) A report describes a dataset using only its mean and standard deviation. Explain what this example shows about the limits of that practice, and name one extra thing a reader should ask for. [3]
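
The same idea can be reproduced with other numbers. The sketch below assumes a different, made-up pair of datasets (not $A$ and $B$) that also share a mean and a population variance, and uses Python's standard `statistics` module to print the summary values side by side.

```python
import statistics

# Hypothetical datasets, chosen so that mean and variance coincide.
symmetric = [7, 9, 10, 11, 13]
skewed = [9, 9, 9, 9, 14]

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(
        name,
        statistics.fmean(data),      # mean
        statistics.pvariance(data),  # population variance
        statistics.median(data),     # median
    )
```

The mean and variance columns match while the median column does not, so the two summary numbers alone cannot tell these datasets apart.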
Q10 · HARD · OUT-OF-THE-BOX · 4.3 · Variance Identity [9 marks]

This question is about the computational form of the variance.

(a) Prove the identity $\displaystyle \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$, stating clearly where you use $\sum x_i = n\bar{x}$. [4]
(b) A dataset of $n = 8$ values has $\sum x = 96$ and $\sum x^2 = 1256$. Find the mean, the variance, and the standard deviation. [3]
(c) A ninth value, equal to the current mean, is now added. Using your identity (not a fresh calculation from raw data), find the new variance, and confirm it is consistent with your answer to Q1(b). [2]
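
To see the computational form at work, here is a minimal Python sketch that updates a variance from running sums when one value equal to the mean is appended. The data are made up (they are not the Q10 sums), and the helper name `variance_from_sums` is hypothetical.

```python
def variance_from_sums(n, sum_x, sum_x2):
    """Population variance from n, the sum of x, and the sum of x squared."""
    mean = sum_x / n
    return sum_x2 / n - mean ** 2


# Hypothetical data, chosen only to illustrate the method.
data = [3, 5, 7, 9]
n = len(data)
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
mean = sum_x / n

var_before = variance_from_sums(n, sum_x, sum_x2)

# Appending one value equal to the mean only changes n and the two sums.
var_after = variance_from_sums(n + 1, sum_x + mean, sum_x2 + mean ** 2)

print(var_before, var_after)  # prints 5.0 4.0
```

Only the three running totals are updated; the raw data are never revisited, which is the approach part (c) asks for.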