IB AA HL · Topic 4 — Statistics & ProbabilityIB AA HL · 主题 4 — 统计与概率

Univariate Data单变量数据

Sub-topics 4.1 – 4.4 of IB AA HL Topic 4. Sampling, presentation of data, central tendency & dispersion, linear correlation & regression.

IB AA HL 主题 4 的子主题 4.1–4.4:抽样(sampling)、数据呈现、集中趋势与离散程度、线性相关(linear correlation)与回归(regression)。

IB AA HL · SL 4.1–4.4 Papers 1 · 2试卷 1 · 2 7 Concepts7 个核心概念 GDC-friendly适配图形计算器

How to use this guide使用指南

If you're cramming如果你在临阵抱佛脚

Read the Cram cheat-sheet next, then skim the dashed-gold "Cram-Mode Cheat" box at the top of each section, plus the formula boxes. One sentence to leave with: mean and SD describe the centre and spread; median and IQR describe them robustly; PMCC $r$ measures linear strength on $[-1, 1]$ and the regression line $y = ax + b$ goes through $(\bar{x}, \bar{y})$. Do one worked example per section. Then take the practice quiz.

先看考前速查表,再扫一眼每节顶端金色虚线框里的"考前速查"方框,以及公式框。一句话带走:均值(mean)与标准差(standard deviation)描述中心与离散程度;中位数(median)与四分位距(interquartile range)做稳健描述;皮尔逊积矩相关系数(Pearson's product-moment correlation coefficient)$r$ 在 $[-1, 1]$ 上衡量线性强弱;回归直线(regression line)$y = ax + b$ 恒过 $(\bar{x}, \bar{y})$。每节做一道例题,最后做练习测验。

If you're going for a 7如果你要冲 7 分

Open every ▸ Going deeper. Univariate data is the topic where IB rewards interpretation over arithmetic. Owning why the mean shifts under $y = ax + b$ but the SD only scales by $|a|$, and why $r$ is unchanged by linear transformations of either variable (apart from a sign flip when a scale factor is negative), is what separates a 5 from a 7 on the data-handling paper.

展开每一个 ▸ Going deeper(深入探究)小节。单变量数据是 IB 更看重解释而非计算的主题。真正搞懂:为什么在 $y = ax + b$ 下均值随之平移但标准差只按 $|a|$ 缩放;为什么相关系数(correlation coefficient)$r$ 在任一变量做线性变换(linear transformation)时保持不变(缩放系数为负时仅改变符号)。这就是数据处理卷上 5 分与 7 分的分水岭。

SL only — no HL extension仅 SL — 无 HL 拓展 Every section in this unit is core SL content (sub-topics SL 4.1–4.4). The HL-only sub-topics on conditional probability, Bayes, and continuous random variables live in Unit D2 & D3. No HL chips appear here.本单元每一节都是 SL 核心内容(子主题 SL 4.1–4.4)。HL 独有的条件概率(conditional probability)、贝叶斯(Bayes)和连续随机变量等子主题在 Unit D2 与 D3。此处不出现 HL 标签。
GDC convention图形计算器(GDC)约定 On Paper 2, your graphing calculator is the primary tool. Whenever you see "find the mean", "find $r$", or "find the regression line", the IB markscheme expects you to enter the data once and read off the answer. We don't model specific TI / Casio keystrokes; just know the menu names: 1-Var Stats, 2-Var Stats, LinReg(a + bx). Paper 1 stats questions, by contrast, use clean numbers that work by hand.在试卷 2(Paper 2)上,图形计算器(GDC)是主要工具。凡是题面写"求均值"、"求 $r$"或"求回归直线",IB 评分标准期望你录入数据一次后直接读出答案。我们不演示具体 TI / Casio 按键,你只需要记住菜单名:1-Var Stats、2-Var Stats、LinReg(a + bx)。试卷 1(Paper 1)的统计题则使用干净的数字,可手算解决。

Cram Cheat-Sheet考前速查表

Sampling & data types抽样与数据类型 SL 4.1

  • Population = every individual of interest; sample = a subset actually measured.
  • Random sample ⇒ every individual has equal chance of selection. Other methods: systematic, stratified, quota, convenience (biased).
  • Discrete data = counts (integers). Continuous = measurements on a continuum.
  • 总体(population)= 所关注的全体个体;样本(sample)= 实际被测量的子集。
  • 随机样本(random sample)⇒ 每个个体被抽到的概率相等。其他方法:系统抽样(systematic)、分层抽样(stratified)、配额抽样(quota)、便利抽样(convenience,有偏)。
  • 离散(discrete)数据 = 计数(整数)。连续(continuous)数据 = 在连续区间上的测量值。

Presenting data数据呈现 SL 4.2

  • Frequency table → histogram (bars touch; area $\propto$ frequency for continuous data).
  • Cumulative frequency curve → read off median, $Q_1$, $Q_3$, percentiles.
  • Box-and-whisker: five-number summary $\{\min, Q_1, \text{med}, Q_3, \max\}$.
  • Outlier rule: $x < Q_1 - 1.5 \cdot \text{IQR}$ or $x > Q_3 + 1.5 \cdot \text{IQR}$.
  • 频率分布表(frequency table)→ 直方图(histogram)(连续数据时柱子相连;面积 $\propto$ 频数)。
  • 累积频率曲线(cumulative frequency curve)→ 读出中位数、$Q_1$、$Q_3$ 与百分位数(percentiles)。
  • 箱形图(box plot):五数概括(five-number summary)$\{\min, Q_1, \text{med}, Q_3, \max\}$。
  • 异常值规则(outlier rule):$x < Q_1 - 1.5 \cdot \text{IQR}$ 或 $x > Q_3 + 1.5 \cdot \text{IQR}$。

Central tendency集中趋势 SL 4.3

  • Mean $\bar{x} = \dfrac{\sum x_i}{n}$ or $\dfrac{\sum f_i x_i}{\sum f_i}$ for grouped data (use the midpoint of each class).
  • Median = middle value when ordered. Modal class = class with greatest frequency density.
  • If a single big outlier exists, prefer median + IQR over mean + SD.
  • 均值(mean)$\bar{x} = \dfrac{\sum x_i}{n}$;分组数据用 $\dfrac{\sum f_i x_i}{\sum f_i}$($x_i$ 取每个组的组中值)。
  • 中位数(median) = 排序后处于中间的值。众数类(modal class)= 频率密度(frequency density)最大的那个类。
  • 如果出现单个大异常值,优先选用 中位数 + IQR,而不是 均值 + 标准差。

Dispersion离散程度 SL 4.3

  • Range = $\max - \min$. IQR $= Q_3 - Q_1$.
  • IB defaults to the population SD (divide by $n$, not $n-1$): $$ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}, \qquad \sigma^2 = \frac{\sum (x_i - \mu)^2}{n}. $$
  • For grouped data: $\sigma = \sqrt{\dfrac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$.
  • 极差(range) = $\max - \min$。四分位距(interquartile range,IQR)$= Q_3 - Q_1$。
  • IB 默认使用总体标准差(population standard deviation)(除以 $n$,而非 $n-1$): $$ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}, \qquad \sigma^2 = \frac{\sum (x_i - \mu)^2}{n}. $$
  • 分组数据:$\sigma = \sqrt{\dfrac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$。

Linear transformations线性变换 SL 4.3

  • If $y_i = a x_i + b$, then $\bar{y} = a \bar{x} + b$ and $\sigma_y = |a|\, \sigma_x$, $\sigma_y^2 = a^2 \sigma_x^2$.
  • Adding a constant shifts the mean, leaves SD alone. Scaling by $a$ multiplies both mean and SD (SD by $|a|$).
  • 若 $y_i = a x_i + b$,则 $\bar{y} = a \bar{x} + b$,$\sigma_y = |a|\, \sigma_x$,$\sigma_y^2 = a^2 \sigma_x^2$。
  • 加一个常数平移均值,标准差不变。乘以 $a$ 同时缩放均值与标准差(标准差按 $|a|$ 缩放)。
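The two transformation rules can be checked numerically. The sketch below is only an illustration: the data and the values of `a` and `b` are made up, and `pstdev` from Python's standard library is used because it divides by $n$, matching the IB convention.

```python
from statistics import fmean, pstdev

x = [4, 6, 7, 7, 9, 12, 15]   # illustrative data, not from an exam
a, b = -2.0, 5.0              # hypothetical scale factor and shift

y = [a * xi + b for xi in x]  # transformed data y_i = a*x_i + b

mean_x, sd_x = fmean(x), pstdev(x)   # pstdev divides by n (population SD)
mean_y, sd_y = fmean(y), pstdev(y)

print(mean_y, a * mean_x + b)   # the two means agree
print(sd_y, abs(a) * sd_x)      # SD scales by |a|; the shift b has no effect
```

Choosing a negative `a` shows why the absolute value is needed: the mean changes sign with `a`, but a standard deviation can never be negative.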

Bivariate & regression双变量数据与回归 SL 4.4

  • PMCC: $r = \dfrac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} \in [-1, 1]$. $|r|$ near $1$ ⇒ strong linear; near $0$ ⇒ weak.
  • Regression line of $y$ on $x$: $y - \bar{y} = a(x - \bar{x})$ with slope $a = S_{xy}/S_{xx}$. Equivalently $y = ax + b$ with intercept $b = \bar{y} - a\bar{x}$. Always passes through $(\bar{x}, \bar{y})$.
  • Only predict inside the data range (interpolation). Extrapolation is unreliable.
  • 皮尔逊积矩相关系数(Pearson's product-moment correlation coefficient,PMCC):$r = \dfrac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} \in [-1, 1]$。$|r|$ 接近 $1$ ⇒ 线性关系强;接近 $0$ ⇒ 弱。
  • $y$ 关于 $x$ 的回归直线(regression line):$y - \bar{y} = a(x - \bar{x})$,其中斜率 $a = S_{xy}/S_{xx}$。等价写为 $y = ax + b$,截距 $b = \bar{y} - a\bar{x}$。恒过 $(\bar{x}, \bar{y})$。
  • 只在数据范围做预测(内插,interpolation)。外推(extrapolation)不可靠。
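A minimal by-hand sketch of the PMCC and regression formulas, using the $y = ax + b$ convention (slope `a`, intercept `b`). The function name `pmcc_and_line` and the five data points are hypothetical, chosen only so the sums are easy to follow; on Paper 2 the GDC does this for you.

```python
from math import sqrt

def pmcc_and_line(xs, ys):
    """Return (r, a, b) for the regression line y = a*x + b of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    r = s_xy / sqrt(s_xx * s_yy)
    a = s_xy / s_xx          # slope
    b = y_bar - a * x_bar    # intercept; forces the line through (x_bar, y_bar)
    return r, a, b

xs = [1, 2, 3, 4, 5]   # made-up data
ys = [2, 4, 5, 4, 5]
r, a, b = pmcc_and_line(xs, ys)
print(round(r, 3), round(a, 3), round(b, 3))
```

Here $S_{xy} = 6$, $S_{xx} = 10$, $S_{yy} = 6$, so the slope is $0.6$ and substituting $\bar{x} = 3$ into the line returns $\bar{y} = 4$.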

1.1 Populations and Samples1.1 总体与样本 SL 4.1

A population is the full set of individuals of interest; a sample is the subset you actually measure. 总体(population)是所关注的全体个体;样本(sample)是你实际测量的子集。 A random sample gives every individual an equal chance of being chosen, which is what makes the sample mean an unbiased estimator of the population mean. Numerical data is either discrete (countable, integer-valued like number of siblings) or continuous (measurable on a continuum like height). Bias arises from how you sample, what people choose to report, and which observations get excluded.随机样本(random sample)让每个个体被选中的概率相等,这正是样本均值成为总体均值无偏估计的原因。数值数据要么是离散的(可数、取整数值,如兄弟姐妹数),要么是连续的(在连续区间上可测,如身高)。抽样偏差(sampling bias)来源于抽样方式、受访者选择性报告与哪些观测被剔除。
Notation: parameter vs. statistic记号:参数 vs. 统计量
$$ \mu, \sigma \;\;\text{(population, fixed)} \qquad \bar{x}, s \;\;\text{(sample, computed)} $$

The IB uses $\mu$ for the population mean and $\sigma$ for the population SD. The sample mean is $\bar{x}$. Because IB defaults to the population formula, the sample SD it reports also divides by $n$ — some textbooks call this $\sigma_n$ on the calculator.

IB 用 $\mu$ 表示总体均值(population mean),$\sigma$ 表示总体标准差(population standard deviation)。样本均值(sample mean)为 $\bar{x}$。由于 IB 默认采用总体公式,它报告的样本标准差也是除以 $n$ — 部分教材在计算器上称之为 $\sigma_n$。

Sampling methods you should be able to name必须能够说出名称的抽样方法
Simple random sample. Every individual has equal probability of being selected, e.g. picking names from a hat or using a random-number generator.
Systematic sample. Choose every $k$-th individual from an ordered list. Cheap but biased if the list has periodicity.
Stratified sample. Split the population into groups (strata: e.g. Year 1, Year 2, Year 3), then sample randomly within each in proportion to its size. Best when strata differ.
Quota sample. Fill pre-set numbers from each group, but selection within each is non-random (e.g. street interviewer takes the first 10 women they see).
Convenience sample. Whoever is easiest to reach. Almost always biased and the IB will call this out.
简单随机抽样(simple random sample)。每个个体被选中的概率相同,例如从帽子里抽名字或用随机数发生器。
系统抽样(systematic sample)。在有序名单上每隔 $k$ 个抽 1 个。成本低,但若名单存在周期性,则会产生偏差。
分层抽样(stratified sample)。将总体按特征划分为若干层(如 1 年级、2 年级、3 年级),再在每层内按其规模成比例地随机抽样。层间差异显著时最优。
配额抽样(quota sample)。各组配额事先确定,但组内选择并非随机(如街头访问员遇到前 10 名女性即停止)。
便利抽样(convenience sample)。谁最容易接触就抽谁。几乎一定有偏,IB 也一定会指出。
Sources of bias to flag in a 1- or 2-mark "comment on this sample" question在 1–2 分"评论此抽样"题中应指出的偏差来源
Selection bias. The sample isn't representative (e.g. a 6am gym survey skewed to early risers).
Non-response bias. Those who do not respond differ systematically from those who do.
Self-reported data. People misreport (income up, weight down).
Volunteer bias. Volunteers tend to be more motivated — results don't generalize.
Measurement bias. The instrument or wording skews answers (leading questions, faulty scale).
选择偏差(selection bias)。样本不具代表性(如清晨 6 点健身房调查偏向早起者)。
无回应偏差(non-response bias)。不回应者与回应者存在系统性差异。
自报数据(self-reported data)。人们会失真报告(收入虚报偏高、体重虚报偏低)。
志愿者偏差(volunteer bias)。志愿者通常更具积极性 — 结论难以推广。
测量偏差(measurement bias)。仪器或措辞使回答偏向(诱导性问题、刻度有误)。

Worked Example — Identifying the right method例题 — 选对方法

Problem: A school has $400$ Year 12 students, $350$ Year 13 students. The principal wants a sample of $30$ for a survey on study habits. Suggest a sampling method and explain.

Best choice: stratified random sampling. The two year groups likely differ in study habits, so we want the sample to reflect both in proportion.

Proportion: $\frac{400}{750} \approx 0.533$, $\frac{350}{750} \approx 0.467$. Take $30 \times \frac{400}{750} = 16$ from Year 12 and $30 \times \frac{350}{750} = 14$ from Year 13; both allocations are whole numbers here, so no rounding is needed. Within each stratum, choose randomly (random-number generator).

Convenience sampling (just asking 30 friends) would be biased. A simple random sample across all $750$ would also work but might under-represent the smaller stratum by chance.

题目:某校 12 年级有 $400$ 人,13 年级有 $350$ 人。校长想抽取 $30$ 人调查学习习惯。建议一种抽样方法并说明理由。

最佳选择:分层随机抽样(stratified random sampling)。两个年级的学习习惯很可能不同,因此样本应按比例反映两层。

比例:$\frac{400}{750} \approx 0.533$,$\frac{350}{750} \approx 0.467$。从 12 年级取 $30 \times \frac{400}{750} = 16$ 人,从 13 年级取 $30 \times \frac{350}{750} = 14$ 人;此处两个配额恰为整数,无需取整。每层内用随机数发生器随机抽取。

便利抽样(只问 30 位朋友)会产生偏差。在全部 $750$ 人中做简单随机抽样也可行,但可能偶然地使较小的那层代表不足。
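The proportional split in this example can be sketched in a few lines. The helper `proportional_allocation` is a hypothetical name, and handing leftover places to the largest remainders is one common rounding choice, not something the IB prescribes.

```python
import random

def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes."""
    total = sum(strata_sizes.values())
    # Integer division gives each stratum its guaranteed whole places.
    alloc = {k: sample_size * v // total for k, v in strata_sizes.items()}
    # Remainders decide who receives any places still unassigned.
    remainders = {k: sample_size * v % total for k, v in strata_sizes.items()}
    leftover = sample_size - sum(alloc.values())
    for k in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        alloc[k] += 1
    return alloc

school = {"Year 12": 400, "Year 13": 350}
alloc = proportional_allocation(school, 30)
print(alloc)   # {'Year 12': 16, 'Year 13': 14}

# Within each stratum, pick student indices at random without repeats.
chosen = {year: random.sample(range(school[year]), k) for year, k in alloc.items()}
```

With $400$ and $350$ students the remainders are zero, so the rounding step does nothing; it only matters when the proportions do not divide evenly.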

Worked Example — Discrete vs. continuous例题 — 离散与连续之分

Problem: Classify the following as discrete or continuous:

(a) number of cars passing a sensor per hour, (b) time taken to run 100 m, (c) shoe size, (d) blood pressure.

(a) Discrete — you can count $0, 1, 2, \ldots$ cars; no fractional cars.

(b) Continuous — time can be any positive real number (resolution limited only by the clock).

(c) Discrete — shoe sizes are quantized ($7, 7.5, 8, \ldots$). Half-sizes still make the sample space discrete, not continuous.

(d) Continuous — blood pressure is measured on a continuum, though instruments report integers (mmHg). Treat as continuous unless the problem says otherwise.

题目:把下列变量分类为离散数据(discrete data)或连续数据(continuous data):

(a) 某传感器每小时通过的汽车数;(b) 跑 100 m 所用时间;(c) 鞋码;(d) 血压。

(a) 离散 — 汽车数只能是 $0, 1, 2, \ldots$,没有分数辆汽车。

(b) 连续 — 时间可以是任意正实数(分辨率仅受计时器限制)。

(c) 离散 — 鞋码是量化的($7, 7.5, 8, \ldots$)。即使存在半码,样本空间仍是离散的,不是连续的。

(d) 连续 — 血压在连续区间上测量,尽管仪器只报告整数(mmHg)。除非题目特别说明,否则按连续处理。

▸ Going deeper — Why "random" matters mathematically▸ 深入探究 — 为何"随机"在数学上至关重要

The mean $\bar{X}$ of a simple random sample of size $n$ from a population with mean $\mu$ and SD $\sigma$ satisfies:

从均值为 $\mu$、标准差为 $\sigma$ 的总体中抽取大小为 $n$ 的简单随机样本,其样本均值 $\bar{X}$ 满足:

$$ \mathbb{E}[\bar{X}] = \mu, \qquad \text{Var}(\bar{X}) = \frac{\sigma^2}{n}. $$

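The first equation is the unbiasedness property: the sample mean has no systematic error. The second is the square-root law: the standard deviation of $\bar{X}$ is $\sigma/\sqrt{n}$, so precision improves like $\sqrt{n}$ (quadrupling $n$ halves the spread). The first follows from linearity of expectation; the second also needs independence of the draws.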

第一个等式即无偏性:样本均值没有系统误差。第二个为平方根定律:$\bar{X}$ 的标准差为 $\sigma/\sqrt{n}$,因此精度按 $\sqrt{n}$ 改善($n$ 变为 4 倍,离散程度减半)。第一条来自期望的线性性;第二条还需要各次抽取相互独立。

If your sample is not random — say, a convenience sample — both properties can fail. $\mathbb{E}[\bar{X}]$ can differ from $\mu$ (bias), and the apparent precision $\sigma / \sqrt{n}$ over-states reality. The IB markscheme does not ask for this derivation in D1, but it explains why bias is fatal: more data does not fix it.

若样本不是随机的(例如便利抽样),两条性质都可能失效:$\mathbb{E}[\bar{X}]$ 与 $\mu$ 出现偏差,名义精度 $\sigma / \sqrt{n}$ 也高估了真实情况。IB 评分标准在 D1 不要求做这段推导,但它解释了为什么偏差是致命的 — 单靠加大数据量解决不了。
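Both identities can be verified exactly on a toy case by listing every possible sample. The sketch below assumes a hypothetical four-value population and samples of size $2$ drawn with replacement (so the draws are independent); exact fractions avoid any rounding.

```python
from fractions import Fraction
from itertools import product

population = [2, 4, 9, 13]   # hypothetical tiny population
n = 2                        # sample size, drawn with replacement

N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((Fraction(x) - mu) ** 2 for x in population) / N

# Under simple random sampling every ordered sample is equally likely.
sample_means = [Fraction(sum(s), n) for s in product(population, repeat=n)]
mean_of_means = sum(sample_means) / len(sample_means)
var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means)

print(mean_of_means == mu)          # unbiasedness: E[X-bar] = mu
print(var_of_means == sigma2 / n)   # Var(X-bar) = sigma^2 / n
```

If the list of samples were restricted to a non-random subset (say, only samples containing the value $13$), the first check would fail, which is the bias described above.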

A market researcher polls people leaving a gym at 6am about exercise habits. The sample is best described as:市场调研员在清晨 6 点访问离开健身房的人,问及锻炼习惯。该样本最准确的描述是:
SL 4.1
Simple random — everyone in the gym had equal chance.简单随机 — 健身房里每个人机会均等。
Stratified — the gym splits people into age groups.分层 — 健身房按年龄段分组。
Convenience — only morning gym-goers, who are systematically more active.便利 — 只触达清晨健身的人,他们系统性地更活跃。
Systematic — every $k$-th gym member.系统 — 每第 $k$ 位健身房会员。
Correct! The selection rule reaches only a particular sub-group (early-morning gym attendees), who are not representative of the general population. That's a convenience sample with selection bias.正确!选择规则只触及一个特定子群体(清晨健身者),并不代表一般总体。这是带有选择偏差的便利抽样。
It's a convenience sample. The researcher only reaches people who choose to be at the gym at 6am — almost certainly more active and earlier-rising than average.这是便利抽样。研究者只触达自愿在清晨 6 点出现于健身房的人 — 他们几乎一定比平均水平更活跃、更早起。
Which variable is continuous?下列哪个变量是连续的?
SL 4.1
The number of goals scored in a football match.一场足球比赛的进球数。
The mass of a randomly chosen apple.随机抽取一个苹果的质量。
The shoe size of a customer.某顾客的鞋码。
The number of children in a household.某户的子女数。
Correct! Mass can take any positive real value (within instrument resolution). The other three are countable / quantized — discrete.正确!质量可取任意正实数(仅受仪器分辨率限制)。其余三项可计数或量化 — 都是离散的。
Continuous data are measurements on a continuum — mass, length, time. Counts (goals, children) are discrete; quantized scales (shoe size) are discrete too.连续数据是在连续区间上的测量值 — 质量、长度、时间。计数(进球、子女数)是离散的;量化刻度(鞋码)也是离散的。

1.2 Frequency Distributions & Histograms1.2 频率分布与直方图 SL 4.2

Group raw data into classes, report frequencies in a table, then draw a histogram. 将原始数据分组,在频率分布表(frequency distribution)里给出频数(frequency),再画直方图(histogram)。 For a discrete variable, bars sit on integer values and the height is the frequency. For a continuous variable, bars touch and the area of each bar is proportional to its frequency. Relative frequency $=$ frequency $/$ total, which lets you compare two datasets of different sizes. 对离散变量,柱子立在整数刻度上,高度即频数。对连续变量,柱子相连,每个柱子的面积与其频数成正比。相对频率(relative frequency)$=$ 频数 $/$ 总数,用于比较规模不同的两组数据。
Frequency, relative frequency, frequency density频数、相对频率、频率密度
$$ \text{relative freq.} = \frac{f_i}{\sum f_i} \qquad \text{freq. density} = \frac{f_i}{\text{class width}} $$

When classes have unequal widths, plot frequency density on the $y$-axis so that bar area still represents frequency. IB rarely uses unequal classes — but when it does, this is the only safe way to read the histogram.

当各类组距不等时,在 $y$ 轴上画频率密度(frequency density),这样柱子的面积仍然代表频数。IB 很少出现不等组距 — 但一旦出现,这是读取直方图的唯一安全做法。

Frequency-table conventions频率分布表的约定
For continuous data, IB uses inclusive-lower / exclusive-upper class bounds, e.g. $10 \le x < 20$. The class midpoint $x_i = \tfrac{1}{2}(\text{lower} + \text{upper})$ is what you use to estimate the mean of grouped data; we'll see this in 1.4.
A few classes (5–7) give a smooth shape; too many give a noisy one. The IB will tell you the classes.
对连续数据,IB 采用左闭右开的组界,如 $10 \le x < 20$。组中值 $x_i = \tfrac{1}{2}(\text{lower} + \text{upper})$ 即(下界 + 上界)的一半,用于估计分组数据的均值 — 详见 1.4。
类数较少(5–7 组)时形状平滑;过多则噪声大。IB 通常会直接给出类。

Worked Example — Building a histogram例题 — 画一张直方图

Problem: The masses (kg) of 40 students are grouped:

Mass (kg) | $40 \le m < 50$ | $50 \le m < 60$ | $60 \le m < 70$ | $70 \le m < 80$ | $80 \le m < 90$
Frequency | 4 | 11 | 14 | 8 | 3

All classes have equal width (10 kg), so we plot frequency on the $y$-axis directly:

The shape is roughly symmetric and unimodal, peaking in $60 \le m < 70$. The modal class is $60 \le m < 70$.

题目:40 名学生的体重(kg)已分组如下:

体重 (kg) | $40 \le m < 50$ | $50 \le m < 60$ | $60 \le m < 70$ | $70 \le m < 80$ | $80 \le m < 90$
频数 | 4 | 11 | 14 | 8 | 3

各类组距相等(10 kg),所以可直接在 $y$ 轴画频数:

形状大致对称、单峰,峰值在 $60 \le m < 70$。众数类(modal class)为 $60 \le m < 70$。

Worked Example — Relative frequency to compare例题 — 用相对频率做比较

Problem: Class A (25 students) and Class B (40 students) both have $8$ students scoring in $[70, 80)$. Which class has a higher proportion in that band?

Compare relative frequencies:

$$ \text{A: } \frac{8}{25} = 0.32 = 32\%, \qquad \text{B: } \frac{8}{40} = 0.20 = 20\%. $$

Class A has the higher proportion. Raw frequencies are identical but say nothing on their own when class sizes differ.

题目:A 班 25 人、B 班 40 人,两班都各有 $8$ 人得分在 $[70, 80)$。哪个班该区间占比更高?

比较相对频率(relative frequency):

$$ \text{A: } \frac{8}{25} = 0.32 = 32\%, \qquad \text{B: } \frac{8}{40} = 0.20 = 20\%. $$

A 班占比更高。两班的原始频数相同,但在样本量不同的情况下单看频数毫无意义。

Shape vocabulary IB rewardsIB 看重的形状描述词
Describe a histogram in two beats: (1) shape (symmetric / skewed left / skewed right / bimodal / uniform), and (2) centre & spread (where it peaks, how wide).
Right-skewed $\Leftrightarrow$ long tail to the right; typically mean $>$ median.
Left-skewed $\Leftrightarrow$ long tail to the left; typically mean $<$ median.
Symmetric $\Rightarrow$ mean $\approx$ median.
描述直方图分两步:(1) 形状(对称 / 左偏 / 右偏 / 双峰 / 均匀),(2) 中心与离散程度(峰值在哪、宽度多大)。
右偏(right-skewed)$\Leftrightarrow$ 右尾长;通常均值 $>$ 中位数。
左偏(left-skewed)$\Leftrightarrow$ 左尾长;通常均值 $<$ 中位数。
对称(symmetric)$\Rightarrow$ 均值 $\approx$ 中位数。
▸ Going deeper — Why area, not height, encodes frequency▸ 深入探究 — 为什么是面积而非高度表示频数

For continuous data the histogram is a discrete approximation to an unknown probability density. The defining property of a density is that area under it on an interval equals the probability of landing in that interval. So a histogram bar's area should equal the frequency (or relative frequency) of its interval, not its height.

对于连续数据,直方图是某个未知概率密度(probability density)的离散近似。密度的本质性质是:在某区间上其曲线下面积等于落入该区间的概率。所以直方图每根柱子的面积应当等于其区间的频数(或相对频率),而非高度。

With equal-width classes, height and area are proportional, and IB lets you plot either — just say "frequency" on the $y$-axis. With unequal classes, plotting height alone visually exaggerates wide classes and hides narrow ones. Frequency density restores the proportionality:

组距相等时,高度与面积成正比,IB 允许直接画频数 — 只需在 $y$ 轴标注"frequency"。组距不等时,单画高度会在视觉上夸大宽类、压缩窄类。频率密度恢复了比例关系:

$$ \text{area of bar} = \text{freq. density} \times \text{width} = \frac{f_i}{\text{width}} \times \text{width} = f_i. $$

This is also why the limiting curve of relative-frequency histograms (as bins shrink, $n$ grows) is the probability density function — the link to Topics D2–D3.

这也是为什么相对频率直方图在组距趋于零、$n$ 趋于无穷时的极限曲线就是概率密度函数(probability density function),即与主题 D2–D3 的连接点。
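The identity area $=$ density $\times$ width $=$ frequency can be seen on a small table with unequal widths. The classes below are hypothetical, invented for illustration only.

```python
# Hypothetical unequal-width classes: (lower bound, upper bound, frequency)
classes = [(0, 10, 6), (10, 15, 9), (15, 20, 12), (20, 40, 8)]

densities = []
areas = []
for lower, upper, freq in classes:
    width = upper - lower
    density = freq / width    # bar height on the frequency-density axis
    densities.append(density)
    areas.append(density * width)   # bar area recovers the frequency
    print(f"{lower}-{upper}: density {density:.2f}, area {density * width:.0f}")
```

The widest class ($20$ to $40$) has the second-smallest frequency but would look far more prominent than it deserves if raw frequency were used as the height; on the density axis it is the lowest bar.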

A continuous-data histogram has unequal class widths. To make bar area equal to frequency, the $y$-axis should plot:某连续数据的直方图组距不等。要使柱面积等于频数,$y$ 轴应画:
SL 4.2
Cumulative frequency.累积频率(cumulative frequency)。
Frequency density $=$ frequency $/$ class width.频率密度 $=$ 频数 $/$ 组距。
Raw frequency.原始频数。
Percentage frequency.百分比频率。
Correct! Frequency density makes area $=$ width $\times$ density $=$ frequency, regardless of how wide each class is. Plotting raw frequency would visually exaggerate the wider classes.正确!频率密度使面积 $=$ 组距 $\times$ 密度 $=$ 频数,与各类宽度无关。直接画原始频数会在视觉上夸大较宽的类。
Frequency density $=$ frequency $/$ class width. Then bar area $=$ width $\times$ density $=$ frequency, even with unequal widths.频率密度 $=$ 频数 $/$ 组距。此时柱面积 $=$ 组距 $\times$ 密度 $=$ 频数,即使组距不等也成立。

1.3 Cumulative Frequency & Box Plots1.3 累积频率与箱形图 SL 4.2

A cumulative-frequency graph lets you read percentiles by eye; a box plot is the five-number summary in one picture. 累积频率图(cumulative frequency curve)可以让你目测百分位数;箱形图(box plot)则把五数概括(five-number summary)一图呈现。 Plot cumulative frequency vs. upper class boundary, join the points smoothly. Read: 以累积频率为纵轴、各类上界为横轴作图,平滑连线后读取:
  • median at $0.5 n$ on the $y$-axis;
  • $Q_1$ at $0.25 n$, $Q_3$ at $0.75 n$;
  • IQR $= Q_3 - Q_1$.
  • 中位数对应 $y$ 轴上的 $0.5 n$;
  • $Q_1$ 对应 $0.25 n$,$Q_3$ 对应 $0.75 n$;
  • IQR $= Q_3 - Q_1$。
Outlier rule: 异常值规则(outlier): $x < Q_1 - 1.5 \cdot \text{IQR}$ or $x > Q_3 + 1.5 \cdot \text{IQR}$.
Five-number summary & box plot五数概括与箱形图
$$ \{\text{min}, \; Q_1, \; \text{median}, \; Q_3, \; \text{max}\} $$

Box runs from $Q_1$ to $Q_3$, with the median marked inside. Whiskers extend from the box to the smallest non-outlier and the largest non-outlier. Outliers are drawn as individual points outside the whiskers.

箱子由 $Q_1$ 延至 $Q_3$,中位数(median)在箱内标出。须(whiskers)从箱子延伸到最小的非异常值与最大的非异常值。异常值(outliers)单独以点的形式画在须外。

Discrete quartiles — the position rule离散数据的四分位数 — 位置法则 For a sorted dataset of size $n$: 对已排序、大小为 $n$ 的数据集:
  • Median $= $ value at position $\frac{n+1}{2}$ (interpolate if fractional).
  • $Q_1$ $= $ median of the lower half; $Q_3$ $= $ median of the upper half.
  • 中位数 $= $ 位置 $\frac{n+1}{2}$ 处的值(位置为分数时做线性插值)。
  • $Q_1$(下四分位数,quartile)$= $ 数据下半部分的中位数;$Q_3$(上四分位数)$= $ 上半部分的中位数。
Different textbooks define $Q_1, Q_3$ slightly differently — the IB sticks with the "split-then-median" definition above. Calculators may give slightly different values; trust the by-hand version on Paper 1 and the GDC's value on Paper 2. 不同教材对 $Q_1, Q_3$ 的定义略有出入 — IB 始终采用上述"先分半、再取中位数"的定义。不同计算器给出的值可能略有差异;试卷 1 信手算结果,试卷 2 信图形计算器(GDC)。
Outlier rule — the $1.5 \cdot \text{IQR}$ fences异常值规则 — $1.5 \cdot \text{IQR}$ 围栏 Compute:计算: $$ L = Q_1 - 1.5 \cdot \text{IQR}, \qquad U = Q_3 + 1.5 \cdot \text{IQR}. $$ Any observation strictly below $L$ or above $U$ is an outlier and must be drawn as a separate dot on the box plot. Whiskers extend to the most extreme non-outlier, not to the outlier itself. 任何严格小于 $L$ 或严格大于 $U$ 的观测都是异常值(outlier),在箱形图上必须单独画成一个点。须延伸到最极端的非异常值,而不是延伸到异常值本身。

Worked Example — Five-number summary by hand例题 — 手算五数概括

Problem: Find the five-number summary of $\{3, 5, 7, 8, 9, 11, 12, 14, 16, 18, 25\}$, then identify any outliers.

$n = 11$, already sorted.

Min $= 3$, max $= 25$. Median at position $\frac{11+1}{2} = 6$: the 6th value is $11$.

Lower half (excluding the median): $\{3, 5, 7, 8, 9\}$. $Q_1$ $=$ median of this $=$ $7$.

Upper half (excluding the median): $\{12, 14, 16, 18, 25\}$. $Q_3$ $=$ median of this $=$ $16$.

IQR $= 16 - 7 = 9$. Fences: $L = 7 - 1.5(9) = -6.5$, $U = 16 + 1.5(9) = 29.5$.

No data falls outside $[-6.5, 29.5]$, so no outliers. (The value $25$ is large but lies inside $U$.)

Five-number summary: $\{3, 7, 11, 16, 25\}$.

题目:求 $\{3, 5, 7, 8, 9, 11, 12, 14, 16, 18, 25\}$ 的五数概括,并指出所有异常值。

$n = 11$,已排序。

最小值 $= 3$,最大值 $= 25$。中位数对应位置 $\frac{11+1}{2} = 6$:第 6 个值为 $11$。

下半部分(不含中位数):$\{3, 5, 7, 8, 9\}$。$Q_1$ $=$ 其中位数 $=$ $7$。

上半部分(不含中位数):$\{12, 14, 16, 18, 25\}$。$Q_3$ $=$ 其中位数 $=$ $16$。

IQR $= 16 - 7 = 9$。围栏:$L = 7 - 1.5(9) = -6.5$,$U = 16 + 1.5(9) = 29.5$。

没有数据落在 $[-6.5, 29.5]$ 之外,故无异常值。($25$ 虽大,但仍在 $U$ 之内。)

五数概括:$\{3, 7, 11, 16, 25\}$。
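The "split-then-median" rule and the $1.5 \cdot \text{IQR}$ fences used in this example can be sketched as follows. The function names are hypothetical; note that library quartile routines (and some calculators) may use a different quartile definition and give slightly different values.

```python
from statistics import median

def five_number_summary(data):
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]          # values below the median position
    upper = xs[(n + 1) // 2 :]    # values above it (median excluded when n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def outliers(data):
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

data = [3, 5, 7, 8, 9, 11, 12, 14, 16, 18, 25]
print(five_number_summary(data))   # (3, 7, 11, 16, 25)
print(outliers(data))              # []
```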

Worked Example — Outlier check例题 — 异常值检验

Problem: Test scores have $Q_1 = 52$, median $= 64$, $Q_3 = 73$, and one student scored $100$. Is $100$ an outlier?

$\text{IQR} = 73 - 52 = 21$. Upper fence $U = 73 + 1.5(21) = 73 + 31.5 = 104.5$.

Since $100 < 104.5$, the score is not an outlier under the IQR rule — though it is the maximum.

题目:某次考试 $Q_1 = 52$,中位数 $= 64$,$Q_3 = 73$,有一位学生得了 $100$ 分。问 $100$ 是否为异常值?

$\text{IQR} = 73 - 52 = 21$。上围栏 $U = 73 + 1.5(21) = 73 + 31.5 = 104.5$。

由于 $100 < 104.5$,按 IQR 规则它不是异常值 — 但它确实是最大值。

Worked Example — Reading the box plot例题 — 看懂箱形图

The five-number summary $\{3, 7, 11, 16, 25\}$ above becomes:

[Box plot: whiskers from 3 to 7 and from 16 to 25; box from 7 to 16; median line at 11.]

The right whisker is longer than the left — suggests a mild right skew (mean would be slightly above the median of $11$). The IQR-box itself is centred near the median, which is the typical look.

上述五数概括 $\{3, 7, 11, 16, 25\}$ 对应的箱形图为:

[箱形图:须从 3 到 7、从 16 到 25;箱子从 7 到 16;中位数线在 11。]

右须比左须长 — 提示存在轻度右偏(均值略高于中位数 $11$)。IQR 箱子本身大致以中位数为中心,是常见形态。

▸ Going deeper — Why $1.5 \cdot \text{IQR}$?▸ 深入探究 — 为何选择 $1.5 \cdot \text{IQR}$?

The cutoff is conventional, not derived from a deep theorem, but it isn't arbitrary. John Tukey, who introduced the box plot, chose $1.5$ as a compromise: large enough that normal data rarely produces "outliers" (about $0.7\%$ of values in total, roughly $0.35\%$ in each tail), small enough that genuine outliers in skewed data still get flagged.

这是约定俗成的阈值,并非源自某条深奥定理,但也并非随意。提出箱形图的 John Tukey 选取 $1.5$ 作为折中:足够大,使得正态数据极少产生"异常值"(合计约 $0.7\%$,每侧约 $0.35\%$);又足够小,能保证偏态数据中的真异常值仍被标出。

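Some texts also use a $3 \cdot \text{IQR}$ "extreme outlier" fence. The IB does not require it, just the $1.5 \cdot \text{IQR}$ rule above. When you flag an outlier in a written answer, name the rule and the fence explicitly. For instance, with the quartiles $Q_1 = 7$, $Q_3 = 16$ used earlier, a hypothetical value of $32$ would be written up as: "$x = 32$ lies above $Q_3 + 1.5\, \text{IQR} = 29.5$, so it is an outlier." (The actual maximum there, $25$, does not qualify.)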

部分教材另外采用 $3 \cdot \text{IQR}$ 的"极端异常值"围栏。IB 不要求,只需上述 $1.5 \cdot \text{IQR}$ 规则。在书面答题时标出异常值,要明确写出规则与围栏。例如沿用前面的 $Q_1 = 7$、$Q_3 = 16$,若数据中出现假设值 $32$,应写:"$x = 32$ 高于 $Q_3 + 1.5\, \text{IQR} = 29.5$,故为异常值。"(该数据集的实际最大值 $25$ 并不是异常值。)

A dataset has $Q_1 = 18$, $Q_3 = 30$. Which value is an outlier under the $1.5 \cdot \text{IQR}$ rule?某数据集 $Q_1 = 18$,$Q_3 = 30$。按 $1.5 \cdot \text{IQR}$ 规则,哪一个值是异常值?
SL 4.2
$49$
$45$
$12$
$5$
Correct! $\text{IQR} = 12$, $U = 30 + 1.5 \cdot 12 = 48$. $49 > 48$, so $49$ is an outlier. $45$ sits inside the fence. $L = 18 - 18 = 0$, so $12$ and $5$ are both above $L$ and thus not outliers either.正确!$\text{IQR} = 12$,$U = 30 + 1.5 \cdot 12 = 48$。$49 > 48$,故 $49$ 为异常值。$45$ 在围栏内。$L = 18 - 18 = 0$,因此 $12$ 与 $5$ 都在 $L$ 之上,亦非异常值。
$\text{IQR} = 30 - 18 = 12$. $U = 30 + 1.5 \cdot 12 = 48$; $L = 18 - 1.5 \cdot 12 = 0$. Only $49$ exceeds $U$.$\text{IQR} = 30 - 18 = 12$。$U = 30 + 1.5 \cdot 12 = 48$;$L = 18 - 1.5 \cdot 12 = 0$。只有 $49$ 超出 $U$。
From a cumulative-frequency graph with $n = 80$, the median is read at the $y$-value:从 $n = 80$ 的累积频率图上读中位数,应取的 $y$ 值是:
SL 4.2
$20$
$80$
$40$
$60$
Correct! Median sits at $0.5 n = 40$. Then $Q_1$ at $0.25 n = 20$ and $Q_3$ at $0.75 n = 60$.正确!中位数对应 $0.5 n = 40$。同样 $Q_1$ 对应 $0.25 n = 20$,$Q_3$ 对应 $0.75 n = 60$。
The median is at $0.5 n = 0.5 \cdot 80 = 40$ on the cumulative-frequency $y$-axis.中位数在累积频率图的 $y$ 轴上对应 $0.5 n = 0.5 \cdot 80 = 40$。
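Reading a value off a cumulative-frequency graph can be imitated in code. The sketch below reuses the masses table from 1.2 and assumes the plotted points are joined by straight lines, which only approximates the smooth curve you would draw by hand; `read_cumulative` is a hypothetical helper name.

```python
def read_cumulative(boundaries, freqs, target):
    """Estimate the x-value at which cumulative frequency reaches target,
    joining the plotted points with straight lines (an assumption)."""
    cum = 0
    for lower, upper, f in zip(boundaries, boundaries[1:], freqs):
        if cum + f >= target:
            # Move the right fraction of the way across this class.
            return lower + (target - cum) / f * (upper - lower)
        cum += f
    raise ValueError("target exceeds total frequency")

boundaries = [40, 50, 60, 70, 80, 90]   # class boundaries from the 1.2 masses table
freqs = [4, 11, 14, 8, 3]
n = sum(freqs)

q1, med, q3 = (read_cumulative(boundaries, freqs, p * n) for p in (0.25, 0.5, 0.75))
print(round(q1, 1), round(med, 1), round(q3, 1))
```

For the median the target is $0.5 \times 40 = 20$, which falls in $60 \le m < 70$ (cumulative frequency rises from $15$ to $29$ there), giving $60 + \tfrac{5}{14} \times 10 \approx 63.6$ kg.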

1.4 Measures of Central Tendency1.4 集中趋势的度量 SL 4.3

Three summaries of "where the data is": mean, median, mode. "数据在哪里"的三种概括:均值(mean)、中位数(median)、众数(mode)。 The mean $\bar{x} = \tfrac{1}{n}\sum x_i$ is the centre of mass and is sensitive to outliers. The median is the middle value when ordered; it is robust to outliers and the right choice for skewed data. The mode is the most frequent value (the modal class for grouped data). On grouped data the mean is estimated using class midpoints. 均值 $\bar{x} = \tfrac{1}{n}\sum x_i$ 是数据的质心,对异常值敏感。中位数是排序后的中间值,对异常值稳健,是偏态数据的恰当选择。众数是最常出现的值(分组数据则报告众数类,modal class)。分组数据的均值用组中值估计。
Mean — raw vs. grouped data均值 — 原始数据 vs. 分组数据
$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \bar{x} \approx \frac{\sum_i f_i \, x_i}{\sum_i f_i} $$

Left: the exact sample mean for raw data. Right: an estimate for grouped data, where $x_i$ is the midpoint of class $i$ and $f_i$ its frequency. The estimate is only as good as the assumption that within-class values cluster near the midpoint — on Paper 1 the IB expects you to write this approximation explicitly.

左:原始数据的精确样本均值。右:分组数据的估计值,其中 $x_i$ 为第 $i$ 类的组中值,$f_i$ 为对应频数。该估计仅在"类内数据聚集于组中值附近"的假设下才好 — 在试卷 1 中,IB 期望你显式写出这是近似而非等号。

When to use which何时选哪一个
Mean. Default for symmetric distributions; numerically convenient (every later formula uses it).
Median. Default for skewed distributions or any dataset with notable outliers (e.g. incomes, house prices).
Mode (or modal class). Used when "most common" is the question (e.g. shoe-size stock to keep). For continuous data, report the modal class, not a single mode.
均值(mean)。对称分布的默认选择;在数值上方便(后续每一个公式都用到它)。
中位数(median)。偏态分布或存在明显异常值的数据集的默认选择(如收入、房价)。
众数(mode,或众数类 modal class)。当问题问"最常见"时用(如该备货哪一种鞋码)。连续数据应报告众数类,而不是单一众数。

Worked Example — Raw data mean and median例题 — 原始数据的均值与中位数

Problem: $\{4, 6, 7, 7, 9, 12, 15\}$. Find the mean, median, and mode.

$n = 7$. Sum $= 4 + 6 + 7 + 7 + 9 + 12 + 15 = 60$.

$$ \bar{x} = \frac{60}{7} \approx 8.57. $$

Median (already sorted) at position $\tfrac{7+1}{2} = 4$: the 4th value is $7$.

Mode: $7$ appears twice; all others once. Mode $= 7$.

题目:$\{4, 6, 7, 7, 9, 12, 15\}$。求均值、中位数与众数。

$n = 7$。和 $= 4 + 6 + 7 + 7 + 9 + 12 + 15 = 60$。

$$ \bar{x} = \frac{60}{7} \approx 8.57. $$

中位数(数据已排序)位于位置 $\tfrac{7+1}{2} = 4$:第 4 个值为 $7$。

众数:$7$ 出现两次,其余均只出现一次。众数 $= 7$。

Worked Example — Estimating the mean from grouped data例题 — 由分组数据估计均值

Problem: Using the masses table from 1.2:

Class | $40 \le m < 50$ | $50 \le m < 60$ | $60 \le m < 70$ | $70 \le m < 80$ | $80 \le m < 90$
Midpoint $x_i$ | 45 | 55 | 65 | 75 | 85
Frequency $f_i$ | 4 | 11 | 14 | 8 | 3
$\sum f_i = 40$. Compute $\sum f_i x_i$:

$$ 4(45) + 11(55) + 14(65) + 8(75) + 3(85) = 180 + 605 + 910 + 600 + 255 = 2550. $$ $$ \bar{x} \approx \frac{2550}{40} = 63.75 \text{ kg}. $$

Modal class: $60 \le m < 70$ (highest frequency, $14$). Since the data is continuous, we report the class, not a single value.

题目:使用 1.2 节的体重分组表:

类 | $40 \le m < 50$ | $50 \le m < 60$ | $60 \le m < 70$ | $70 \le m < 80$ | $80 \le m < 90$
组中值 $x_i$ | 45 | 55 | 65 | 75 | 85
频数 $f_i$ | 4 | 11 | 14 | 8 | 3

$\sum f_i = 40$。计算 $\sum f_i x_i$:

$$ 4(45) + 11(55) + 14(65) + 8(75) + 3(85) = 180 + 605 + 910 + 600 + 255 = 2550. $$ $$ \bar{x} \approx \frac{2550}{40} = 63.75 \text{ kg}. $$

众数类(modal class):$60 \le m < 70$(频数最高,$14$)。数据连续,因此报告的是类,而不是单一值。
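The grouped-mean estimate is a frequency-weighted average of the midpoints. A minimal sketch, with `grouped_mean` as a hypothetical helper name:

```python
def grouped_mean(midpoints, freqs):
    """Estimate the mean of grouped data: sum(f_i * x_i) / sum(f_i)."""
    return sum(f * x for x, f in zip(midpoints, freqs)) / sum(freqs)

midpoints = [45, 55, 65, 75, 85]   # class midpoints from the masses table
freqs = [4, 11, 14, 8, 3]

print(grouped_mean(midpoints, freqs))   # 63.75
```

Averaging the midpoints without weights would give $65$, which ignores how many students sit in each class.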

Worked Example — Mean is fragile, median is not例题 — 均值脆弱,中位数稳健

Problem: Salaries (in thousands): $\{32, 34, 35, 36, 38, 40, 250\}$. Find the mean and median, and comment.

$\bar{x} = \frac{32 + 34 + 35 + 36 + 38 + 40 + 250}{7} = \frac{465}{7} \approx 66.4$ thousand.

Median (sorted, $n = 7$, position 4): $36$ thousand.

Comment. The mean ($\approx 66.4$k) is dragged upward by the single high earner ($250$k). The median ($36$k) represents the typical employee much better. With outliers present, IB usually rewards naming the median as the "more appropriate measure".

题目:工资(单位:千元):$\{32, 34, 35, 36, 38, 40, 250\}$。求均值与中位数,并加以评论。

$\bar{x} = \frac{32 + 34 + 35 + 36 + 38 + 40 + 250}{7} = \frac{465}{7} \approx 66.4$ 千元。

中位数(排序后 $n = 7$,第 4 位):$36$ 千元。

评论。均值($\approx 66.4$ 千元)被那位高收入者($250$ 千元)单方面拉高。中位数($36$ 千元)更能代表典型员工。当存在异常值(outlier)时,IB 通常奖励将中位数判为"更合适的度量"。
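A quick way to see the fragility is to compute both summaries with and without the top salary. This sketch uses Python's standard `statistics` module on the data from the example.

```python
from statistics import fmean, median

salaries = [32, 34, 35, 36, 38, 40, 250]   # thousands, from the example
without_top = salaries[:-1]                # same data with the 250 removed

print(round(fmean(salaries), 1), median(salaries))         # 66.4 36
print(round(fmean(without_top), 1), median(without_top))   # 35.8 35.5
```

Removing one value moves the mean by about $30$ thousand but the median by only $0.5$ thousand.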

▸ Going deeper — The mean as a minimizer▸ 深入探究 — 均值作为最小化解

The mean is the unique value $c$ that minimizes the sum of squared deviations:

均值是使偏差平方和最小的唯一常数 $c$:

$$ \sum_{i=1}^{n} (x_i - c)^2 \;\;\text{is minimised at}\;\; c = \bar{x}. $$

Quick proof: differentiate with respect to $c$, set to zero:

简证:对 $c$ 求导,并令导数为零:

$$ \frac{d}{dc}\sum (x_i - c)^2 = -2 \sum (x_i - c) = 0 \;\;\Longrightarrow\;\; c = \frac{\sum x_i}{n} = \bar{x}. $$

The median, by contrast, minimises the sum of absolute deviations $\sum |x_i - c|$. Squaring penalises a single big deviation a lot more than absolute value does — that's exactly why the mean is sensitive to outliers and the median isn't. This is the deep reason behind variance / SD using squares: they pair naturally with the mean.

相对地,中位数(median)使绝对偏差之和 $\sum |x_i - c|$ 最小。平方对单个大偏差的惩罚远高于绝对值 — 这正是均值对异常值敏感、而中位数稳健的根源。这也是方差(variance)/ 标准差(standard deviation)必须用平方的深层理由:它们与均值天然配对。
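The two minimisation claims can be checked numerically with a coarse grid search. This is only an illustrative sketch: the grid step of $0.1$ is an arbitrary choice, and the data is the salary list from the earlier worked example.

```python
from statistics import fmean, median

data = [32, 34, 35, 36, 38, 40, 250]

def sum_sq(c):
    return sum((x - c) ** 2 for x in data)    # sum of squared deviations from c

def sum_abs(c):
    return sum(abs(x - c) for x in data)      # sum of absolute deviations from c

candidates = [c / 10 for c in range(300, 2501)]   # grid 30.0, 30.1, ..., 250.0
best_sq = min(candidates, key=sum_sq)
best_abs = min(candidates, key=sum_abs)

print(best_sq, fmean(data))     # grid minimiser sits next to the mean
print(best_abs, median(data))   # grid minimiser is the median
```

The squared criterion is pulled towards the $250$ because that one deviation is squared; the absolute criterion only counts how many points lie on each side.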

The grouped-data table has midpoints $\{5, 15, 25\}$ with frequencies $\{2, 5, 3\}$. The estimated mean is:某分组数据表的组中值为 $\{5, 15, 25\}$,频数为 $\{2, 5, 3\}$。估计均值为:
SL 4.3
$15$
$45$
$16$
$22.5$
Correct! $\sum f_i x_i = 2(5) + 5(15) + 3(25) = 10 + 75 + 75 = 160$. $\sum f_i = 10$. Mean $\approx 160/10 = 16$.正确!$\sum f_i x_i = 2(5) + 5(15) + 3(25) = 10 + 75 + 75 = 160$。$\sum f_i = 10$。均值 $\approx 160/10 = 16$。
Use $\bar{x} \approx \frac{\sum f_i x_i}{\sum f_i} = \frac{160}{10} = 16$. Don't average the midpoints alone — weight by frequency.用 $\bar{x} \approx \frac{\sum f_i x_i}{\sum f_i} = \frac{160}{10} = 16$。不要只对组中值取平均 — 必须按频数加权。
A dataset is heavily right-skewed with one large outlier. Which summary is most appropriate?某数据集严重右偏,含一个大异常值。哪种概括最合适?
SL 4.3
Mean & SD.均值与标准差。
Median & IQR.中位数与四分位距。
Mode only.仅众数。
Range only.仅极差。
Correct! Both median and IQR are robust to outliers and to skew. Mean and SD are sensitive to the long tail; mode and range alone don't summarise enough.正确!中位数与 IQR 对异常值和偏态都很稳健。均值与标准差对长尾敏感;仅给出众数或极差信息量不足。
With an outlier and skew, use median + IQR. Mean + SD are distorted; mode and range alone don't summarise spread well.存在异常值与偏态时,请用 中位数 + IQR。均值 + 标准差被扭曲;仅众数或极差不足以描述离散程度。

1.5 Measures of Dispersion — Variance & Standard Deviation1.5 离散程度的度量 — 方差与标准差 SL 4.3

Dispersion measures how spread out the data are around the centre. 离散程度衡量数据围绕中心的分散程度。 Range $= \max - \min$ (very fragile: a single extreme value changes it). IQR $= Q_3 - Q_1$ (robust). Variance $\sigma^2$ averages squared deviations from the mean; SD $\sigma$ is its square root and is in the same units as the data.
IB defaults to dividing by $n$, not $n - 1$. Always use $\sigma$, $\sigma^2$ (calculator displays this as $\sigma_n$ or $\sigma_x$ — not $s_x$).
极差(range)$= \max - \min$(非常脆弱)。四分位距(interquartile range,IQR)$= Q_3 - Q_1$(稳健)。方差(variance)$\sigma^2$ 是与均值偏差平方的平均;标准差(standard deviation,SD)$\sigma$ 是方差的平方根,单位与原始数据相同。
IB 默认除以 $n$,而非 $n - 1$。始终使用 $\sigma$、$\sigma^2$(计算器显示为 $\sigma_n$ 或 $\sigma_x$ — 不是 $s_x$)。
Population standard deviation & variance (IB convention)总体标准差与方差(IB 约定)
$$ \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}. $$

For grouped data with class midpoints $x_i$, frequencies $f_i$, and $n = \sum f_i$:

对分组数据:组中值为 $x_i$,频数为 $f_i$,$n = \sum f_i$:

$$ \sigma = \sqrt{\frac{\sum_i f_i\,(x_i - \bar{x})^2}{\sum_i f_i}}. $$
Computational form (no need to compute deviations first)计算式(无需先逐个算偏差)
$$ \sigma^2 = \overline{x^2} - \bar{x}^2 = \frac{\sum x_i^2}{n} - \bar{x}^2. $$

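This is the by-hand-friendly version: compute $\sum x_i$ and $\sum x_i^2$ once, divide the second by $n$, and subtract $\bar{x}^2$. It avoids carrying a list of rounded deviations through the working when $\bar{x}$ is not a whole number.

这是适合手算的形式:只算一次 $\sum x_i$ 与 $\sum x_i^2$,后者除以 $n$ 再减去 $\bar{x}^2$。当 $\bar{x}$ 不是整数时,可避免在演算中带着一串四舍五入后的偏差。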

Why squared?为什么要平方? Deviations $x_i - \bar{x}$ sum to zero by construction: 由构造可知,偏差 $x_i - \bar{x}$ 之和必为零: $$ \sum (x_i - \bar{x}) = \sum x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0. $$ So you can't just average raw deviations. Squaring removes the sign; the SD then "un-squares" to get back to the original units. (Averaging $|x_i - \bar{x}|$ also works mathematically, but the squared version pairs with the mean — see the "Going deeper" in 1.4.) 所以不能直接对原始偏差取平均。平方去掉符号;标准差再开方回到原始单位。(数学上对 $|x_i - \bar{x}|$ 取平均也可行,但平方版本与均值天然配对 — 详见 1.4 的"深入探究"。)
Two-statistic rule"两个统计量"原则 Always report dispersion alongside centre. A dataset is best summarized by either (mean, SD) for symmetric data or (median, IQR) for skewed data. Reporting just the centre throws away half the picture; reporting just the spread leaves no anchor. 报告中心时务必同时报告离散程度。对称数据最好用 (均值, 标准差);偏态数据最好用 (中位数, IQR)。只报告中心丢掉一半信息;只报告离散程度则没有定位点。

Worked Example — SD from raw data例题 — 由原始数据求标准差

Problem: Find the standard deviation of $\{4, 6, 7, 9, 14\}$.

$n = 5$. $\sum x_i = 40$, so $\bar{x} = 8$.

Deviations and their squares:

$$ \begin{aligned} (4 - 8)^2 &= 16 \\ (6 - 8)^2 &= 4 \\ (7 - 8)^2 &= 1 \\ (9 - 8)^2 &= 1 \\ (14 - 8)^2 &= 36 \end{aligned} \qquad \sum = 58. $$ $$ \sigma^2 = \frac{58}{5} = 11.6, \qquad \sigma = \sqrt{11.6} \approx 3.41. $$

题目:求 $\{4, 6, 7, 9, 14\}$ 的标准差。

$n = 5$。$\sum x_i = 40$,所以 $\bar{x} = 8$。

偏差及其平方:

$$ \begin{aligned} (4 - 8)^2 &= 16 \\ (6 - 8)^2 &= 4 \\ (7 - 8)^2 &= 1 \\ (9 - 8)^2 &= 1 \\ (14 - 8)^2 &= 36 \end{aligned} \qquad \sum = 58. $$ $$ \sigma^2 = \frac{58}{5} = 11.6, \qquad \sigma = \sqrt{11.6} \approx 3.41. $$

Worked Example — SD by the computational form例题 — 用计算式求标准差

Problem: Same data $\{4, 6, 7, 9, 14\}$, using $\sigma^2 = \overline{x^2} - \bar{x}^2$.

$\sum x_i^2 = 16 + 36 + 49 + 81 + 196 = 378$. $\overline{x^2} = 378/5 = 75.6$. $\bar{x}^2 = 64$.

$$ \sigma^2 = 75.6 - 64 = 11.6, \qquad \sigma \approx 3.41. $$

Same answer, faster — especially useful on Paper 1 when the data is small but the deviations would be messy.

题目:同一数据 $\{4, 6, 7, 9, 14\}$,改用 $\sigma^2 = \overline{x^2} - \bar{x}^2$。

$\sum x_i^2 = 16 + 36 + 49 + 81 + 196 = 378$。$\overline{x^2} = 378/5 = 75.6$。$\bar{x}^2 = 64$。

$$ \sigma^2 = 75.6 - 64 = 11.6, \qquad \sigma \approx 3.41. $$

结果相同,速度更快 — 在试卷 1(Paper 1)数据量小但偏差不整齐时尤为好用。
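As a cross-check of the two worked examples above, the following sketch computes the variance of the same five values both ways in plain Python. The variable names are illustrative only.

```python
from math import sqrt

data = [4, 6, 7, 9, 14]
n = len(data)
mean = sum(data) / n

# Definition: average squared deviation from the mean (divide by n, IB convention).
var_deviation = sum((x - mean) ** 2 for x in data) / n

# Computational form: mean of the squares minus the square of the mean.
var_computational = sum(x * x for x in data) / n - mean ** 2

sd = sqrt(var_deviation)
print(var_deviation, var_computational, round(sd, 2))
```

Both variances agree up to floating-point rounding, and the square root reproduces the hand-computed standard deviation.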

Worked Example — Grouped-data SD例题 — 分组数据标准差

Problem: For the masses table (1.4), with $\bar{x} = 63.75$, find $\sigma$.

Deviations from $\bar{x}$, weighted:

$$ \begin{aligned} 4(45 - 63.75)^2 &= 4(351.5625) = 1406.25 \\ 11(55 - 63.75)^2 &= 11(76.5625) = 842.1875 \\ 14(65 - 63.75)^2 &= 14(1.5625) = 21.875 \\ 8(75 - 63.75)^2 &= 8(126.5625) = 1012.5 \\ 3(85 - 63.75)^2 &= 3(451.5625) = 1354.6875 \end{aligned} $$ $$ \sum f_i (x_i - \bar{x})^2 = 4637.5, \qquad \sigma = \sqrt{\frac{4637.5}{40}} = \sqrt{115.9375} \approx 10.77 \text{ kg}. $$

(On Paper 2 you would put midpoints in L1, frequencies in L2, and read $\sigma_x$ directly from 1-Var Stats.)

题目:用 1.4 节的体重分组表,已知 $\bar{x} = 63.75$,求 $\sigma$。

对 $\bar{x}$ 的加权偏差:

$$ \begin{aligned} 4(45 - 63.75)^2 &= 4(351.5625) = 1406.25 \\ 11(55 - 63.75)^2 &= 11(76.5625) = 842.1875 \\ 14(65 - 63.75)^2 &= 14(1.5625) = 21.875 \\ 8(75 - 63.75)^2 &= 8(126.5625) = 1012.5 \\ 3(85 - 63.75)^2 &= 3(451.5625) = 1354.6875 \end{aligned} $$ $$ \sum f_i (x_i - \bar{x})^2 = 4637.5, \qquad \sigma = \sqrt{\frac{4637.5}{40}} = \sqrt{115.9375} \approx 10.77 \text{ kg}. $$

(在试卷 2 上你会把组中值放入 L1、频数放入 L2,然后从 1-Var Stats 直接读出 $\sigma_x$。)
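The grouped calculation above amounts to a frequency-weighted mean and variance. Here is a minimal sketch, with the midpoints and frequencies taken from the weighted terms in the working; the variable names are illustrative.

```python
from math import sqrt

midpoints = [45, 55, 65, 75, 85]  # class midpoints (kg)
freqs = [4, 11, 14, 8, 3]         # frequencies, as used in the working above

n = sum(freqs)

# Weighted mean: each midpoint counts f times.
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Weighted population variance: divide by n = sum of frequencies.
variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / n
sd = sqrt(variance)

print(n, mean, round(sd, 2))
```

This mirrors what the GDC does when midpoints go in one list and frequencies in the other.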

▸ Going deeper — Equivalence of the two SD formulas▸ 深入探究 — 两个标准差公式等价

Both forms compute the same quantity. Expand $(x_i - \bar{x})^2$:

两种形式计算同一个量。展开 $(x_i - \bar{x})^2$:

$$ \sum (x_i - \bar{x})^2 = \sum x_i^2 - 2\bar{x}\sum x_i + n \bar{x}^2 = \sum x_i^2 - 2\bar{x} (n\bar{x}) + n\bar{x}^2 = \sum x_i^2 - n\bar{x}^2. $$

Divide by $n$:

两边除以 $n$:

$$ \sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n} = \frac{\sum x_i^2}{n} - \bar{x}^2 = \overline{x^2} - \bar{x}^2. $$

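The computational form is preferred on Paper 1 because it needs only the two totals $\sum x_i$ and $\sum x_i^2$, so you never square awkward decimal deviations by hand. One caution for calculator or computer work: when the mean is large compared with the spread, $\overline{x^2}$ and $\bar{x}^2$ are nearly equal, and subtracting them in limited precision loses significant digits (catastrophic cancellation). In that situation the deviation form is the numerically safer one.

试卷 1 优先用计算式:只需 $\sum x_i$ 与 $\sum x_i^2$ 两个总和,不必手算一堆带小数的偏差平方。但在计算器或计算机上要注意:当均值远大于离散程度时,$\overline{x^2}$ 与 $\bar{x}^2$ 数值非常接近,在有限精度下相减会损失有效数字(灾难性抵消);此时用偏差形式在数值上更稳妥。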

▸ Going deeper — Why the IB uses $n$, not $n - 1$▸ 深入探究 — 为何 IB 用 $n$ 而非 $n - 1$

For a sample of size $n$ drawn from a larger population, the unbiased estimator of the population variance divides by $n - 1$ (the "Bessel correction"). The $-1$ accounts for using $\bar{x}$ in place of the unknown $\mu$ — one degree of freedom is "spent" estimating the mean.

从更大总体(population)中抽取容量为 $n$ 的样本(sample)时,总体方差的无偏估计应除以 $n - 1$("贝塞尔校正")。这个 $-1$ 用来补偿用 $\bar{x}$ 代替未知 $\mu$:一个自由度"花"在了估计均值上。

However, the IB syllabus treats the data given in the question as the entire population of interest, so $\mu$ is known (it equals $\bar{x}$ when you have everyone). Dividing by $n$ then gives the population variance exactly: nothing is being estimated, so no correction is needed.

不过,IB 教学大纲把题目给定的数据当作所关注的整个总体,因此 $\mu$ 已知(涵盖全部个体时它就等于 $\bar{x}$)。此时除以 $n$ 得到的就是总体方差本身:没有任何量需要估计,因此无需校正。

Practically: use $\sigma_x$ on your calculator, never $s_x$. If your final answer uses $s_x$ where the markscheme used $\sigma_x$, you'll lose accuracy marks. For $n$ large, the two differ negligibly — but on Paper 1 with tiny $n$, the gap is noticeable.

实务上:计算器上始终用 $\sigma_x$,绝不用 $s_x$。评分标准用 $\sigma_x$ 而你用 $s_x$,会丢失精确度分。$n$ 很大时两者差异可忽略 — 但试卷 1 上 $n$ 很小,差距很明显。
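To see the size of the gap between the two conventions on a small dataset, this sketch uses Python's standard `statistics` module, where `pstdev` divides by $n$ and `stdev` divides by $n - 1$. The data are the five values from the earlier worked example.

```python
from statistics import pstdev, stdev

data = [4, 6, 7, 9, 14]

sigma = pstdev(data)  # population SD: divides by n (the IB convention)
s = stdev(data)       # sample SD: divides by n - 1 (Bessel's correction)

print(round(sigma, 2), round(s, 2))
```

With only five values the two results differ in the first decimal place, which is enough to lose an accuracy mark.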

For data $\{3, 5, 5, 7\}$ with $\bar{x} = 5$, the population SD is:数据 $\{3, 5, 5, 7\}$ 且 $\bar{x} = 5$,总体标准差为:
SL 4.3
$\sqrt{2} \approx 1.41$
$\sqrt{8/3} \approx 1.63$
$2$
$\sqrt{8} \approx 2.83$
Correct! Squared deviations: $(3-5)^2 = 4$, $(5-5)^2 = 0$, $(5-5)^2 = 0$, $(7-5)^2 = 4$, sum $= 8$. $\sigma^2 = 8/4 = 2$, $\sigma = \sqrt{2}$. (Option B would use $n - 1 = 3$, the sample formula — not the IB convention.)正确!偏差平方:$(3-5)^2 = 4$、$(5-5)^2 = 0$、$(5-5)^2 = 0$、$(7-5)^2 = 4$,和 $= 8$。$\sigma^2 = 8/4 = 2$,$\sigma = \sqrt{2}$。(选项 B 使用 $n - 1 = 3$,是样本公式 — 非 IB 约定。)
$\sum (x_i - \bar{x})^2 = 4 + 0 + 0 + 4 = 8$. $\sigma^2 = 8/n = 8/4 = 2$, so $\sigma = \sqrt{2}$. Use the population formula (divide by $n$, not $n - 1$).$\sum (x_i - \bar{x})^2 = 4 + 0 + 0 + 4 = 8$。$\sigma^2 = 8/n = 8/4 = 2$,所以 $\sigma = \sqrt{2}$。使用总体公式(除以 $n$,而非 $n - 1$)。
Which of these is most sensitive to a single extreme outlier?下列哪项对单个极端异常值最敏感?
SL 4.3
Range.极差。
IQR.四分位距。
Median.中位数。
Mode.众数。
Correct! The range uses both extremes directly, so one large outlier inflates it immediately. SD is also sensitive (it's the runner-up). IQR, median, and mode are robust.正确!极差直接使用两端的极值,因此单个大异常值会立刻把它推高。标准差也敏感(仅次于极差)。IQR、中位数、众数都是稳健的。
Range $= \max - \min$ — a single extreme can shift it dramatically. IQR, median, mode are robust because they ignore the tails.极差 $= \max - \min$ — 一个极值就能让它剧烈变化。IQR、中位数、众数则因为忽略尾部而稳健。

1.6 Effect of Linear Transformations on Data1.6 数据线性变换的影响 SL 4.3

Transform every observation by $y_i = a x_i + b$. Then: 对每个观测做线性变换(linear transformation)$y_i = a x_i + b$。则: $$ \bar{y} = a \bar{x} + b, \qquad \sigma_y = |a|\, \sigma_x, \qquad \sigma_y^2 = a^2 \sigma_x^2. $$
  • Adding $b$ shifts mean by $b$; SD unchanged.
  • Multiplying by $a$ multiplies mean by $a$; multiplies SD by $|a|$.
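  • Median, $Q_1$, $Q_3$ transform like the mean: each value $q$ becomes $aq + b$ (if $a < 0$ the order reverses, so the old $Q_3$ becomes the new $Q_1$).
  • IQR and range transform like the SD: multiplied by $|a|$, unaffected by $b$.
  • 加 $b$使均值平移 $b$;标准差不变。
  • 乘以 $a$使均值乘以 $a$;标准差乘以 $|a|$。
  • 中位数、$Q_1$、$Q_3$ 与均值同样变换:每个值 $q$ 变为 $aq + b$(若 $a < 0$ 则顺序颠倒,原来的 $Q_3$ 变成新的 $Q_1$)。
  • IQR 与极差同标准差一样变换:乘以 $|a|$,不受 $b$ 影响。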
Centre and spread under $y = ax + b$$y = ax + b$ 下的中心与离散
$$ \bar{y} = a\bar{x} + b \qquad \sigma_y = |a|\sigma_x \qquad \text{Var}(y) = a^2\,\text{Var}(x) $$
Why these rules规则从何而来 Mean. $\bar{y} = \tfrac{1}{n}\sum (a x_i + b) = a \tfrac{1}{n}\sum x_i + b = a\bar{x} + b$.
Variance. $y_i - \bar{y} = (a x_i + b) - (a\bar{x} + b) = a(x_i - \bar{x})$. Square, average:
均值。$\bar{y} = \tfrac{1}{n}\sum (a x_i + b) = a \tfrac{1}{n}\sum x_i + b = a\bar{x} + b$。
方差。$y_i - \bar{y} = (a x_i + b) - (a\bar{x} + b) = a(x_i - \bar{x})$。平方再求平均:
$$ \sigma_y^2 = \frac{1}{n}\sum a^2 (x_i - \bar{x})^2 = a^2 \sigma_x^2 \quad\Longrightarrow\quad \sigma_y = |a|\sigma_x. $$ The shift $b$ vanishes from variance — spread is translation-invariant. The scale $a$ comes out squared in variance and as $|a|$ in SD. 平移 $b$ 在方差中消失 — 离散程度对平移不变。比例因子 $a$ 在方差中以平方出现,在标准差中以 $|a|$ 出现。
Use case — unit conversion应用场景 — 单位换算 Temperatures in Celsius: mean $20{}^\circ$C, SD $3{}^\circ$C. Convert to Fahrenheit using $F = \tfrac{9}{5} C + 32$. 摄氏温度:均值 $20{}^\circ$C,标准差 $3{}^\circ$C。用 $F = \tfrac{9}{5} C + 32$ 换算为华氏。 $$ \bar{F} = \tfrac{9}{5}(20) + 32 = 68\,{}^\circ\text{F}, \qquad \sigma_F = \tfrac{9}{5}(3) = 5.4\,{}^\circ\text{F}. $$ The $+32$ doesn't affect SD — only the scale factor does. This is a classic IB hook. $+32$ 不影响标准差 — 只有比例因子影响。这是 IB 的经典考点。
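The rules for $y = ax + b$ can be checked on any small dataset. The sketch below uses illustrative data (not from an exam question) and a negative scale factor, so that the absolute value in the SD rule actually matters; `transform` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def transform(data, a, b):
    """Apply y = a*x + b to every observation."""
    return [a * x + b for x in data]

x = [4, 6, 7, 9, 14]  # illustrative data
a, b = -2, 5          # a negative scale factor exercises the |a| in the SD rule

y = transform(x, a, b)

print(mean(y), a * mean(x) + b)       # new mean vs. a * (old mean) + b
print(pstdev(y), abs(a) * pstdev(x))  # new SD vs. |a| * (old SD)
```

Each printed pair agrees: the mean follows the full transformation, while the SD picks up only the size of the scale factor.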

Worked Example — Marks scaled and shifted例题 — 分数的伸缩与平移

Problem: A class's raw marks have $\bar{x} = 56$, $\sigma_x = 12$. The teacher scales each mark by $y = 1.2 x + 10$ to produce final marks. Find $\bar{y}$ and $\sigma_y$.

$$ \bar{y} = 1.2(56) + 10 = 67.2 + 10 = 77.2. $$ $$ \sigma_y = |1.2|(12) = 14.4. $$

The shift by $10$ moves the mean only; the scale factor $1.2$ inflates both the mean and the SD. Markschemes typically award a mark for each of the two transformed values.

题目:某班原始分数有 $\bar{x} = 56$,$\sigma_x = 12$。老师按 $y = 1.2 x + 10$ 调整每个分数生成最终成绩。求 $\bar{y}$ 与 $\sigma_y$。

$$ \bar{y} = 1.2(56) + 10 = 67.2 + 10 = 77.2. $$ $$ \sigma_y = |1.2|(12) = 14.4. $$

$+10$ 仅平移均值;$\times 1.2$ 让均值与标准差都放大。评分标准通常对两个变换后的值各给 1 分。

Worked Example — Reverse direction例题 — 反向

Problem: Three measurements have variance $25$. Each is then halved. What is the new variance and SD?

Transformation: $y = 0.5 x$ (no shift). Variance multiplies by $a^2 = 0.25$:

$$ \sigma_y^2 = (0.5)^2 (25) = 6.25, \qquad \sigma_y = \sqrt{6.25} = 2.5. $$

Equivalently $\sigma_y = 0.5 \cdot \sigma_x = 0.5 \cdot 5 = 2.5$.

题目:三个观测的方差为 $25$。然后每个都减半。新的方差与标准差是多少?

变换:$y = 0.5 x$(无平移)。方差乘以 $a^2 = 0.25$:

$$ \sigma_y^2 = (0.5)^2 (25) = 6.25, \qquad \sigma_y = \sqrt{6.25} = 2.5. $$

等价地,$\sigma_y = 0.5 \cdot \sigma_x = 0.5 \cdot 5 = 2.5$。

▸ Going deeper — Standardization and the $z$-score▸ 深入探究 — 标准化与 $z$ 分数

The transformation $z = \dfrac{x - \mu}{\sigma}$ centres and scales the data so that $\bar{z} = 0$ and $\sigma_z = 1$.

变换 $z = \dfrac{x - \mu}{\sigma}$ 把数据居中、归一化为 $\bar{z} = 0$、$\sigma_z = 1$。

Plug into the linear-transformation rules with $a = 1/\sigma$ and $b = -\mu/\sigma$:

在线性变换规则中代入 $a = 1/\sigma$、$b = -\mu/\sigma$:

$$ \bar{z} = \frac{1}{\sigma}\,\mu - \frac{\mu}{\sigma} = 0, \qquad \sigma_z = \left|\frac{1}{\sigma}\right|\,\sigma = 1. $$

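The $z$-score answers "how many standard deviations is this observation from the mean?" It is the foundation for the normal-distribution work in D3 and for outlier-flagging in real-world statistics. It's also the cleanest demonstration that linear transformations are information-preserving: with a positive scale factor (here $a = 1/\sigma > 0$), the shape of the distribution, its skewness, and the PMCC are all unchanged. A negative scale factor mirrors the distribution and flips the sign of the skewness and of $r$.

$z$ 分数回答"这个观测距均值有多少个标准差";它是 D3 正态分布工作的基础,也是现实统计中标记异常值的工具。这也是线性变换信息守恒最简洁的演示:比例因子为正时(此处 $a = 1/\sigma > 0$),分布形状、偏态与 PMCC 都保持不变;比例因子为负时,分布左右翻转,偏态与 $r$ 的符号随之改变。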
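Standardisation is easy to verify numerically. This sketch, using illustrative data, converts each observation to a $z$-score and confirms that the resulting list has mean 0 and SD 1 up to floating-point rounding.

```python
from statistics import mean, pstdev

data = [4, 6, 7, 9, 14]  # illustrative data
mu, sigma = mean(data), pstdev(data)

# z = (x - mu) / sigma for each observation
z = [(x - mu) / sigma for x in data]

print([round(v, 2) for v in z])
print(round(mean(z), 10), round(pstdev(z), 10))
```

The largest value, 14, sits well over one standard deviation above the mean, which is the kind of statement a $z$-score is designed to make.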

A dataset has mean $10$ and SD $4$. After applying $y = -2x + 5$, the new mean and SD are:某数据集均值为 $10$,标准差为 $4$。应用 $y = -2x + 5$ 后,新的均值与标准差为:
SL 4.3
$\bar{y} = -15$, $\sigma_y = -8$.
$\bar{y} = -25$, $\sigma_y = -8$.
$\bar{y} = -15$, $\sigma_y = 8$.
$\bar{y} = -25$, $\sigma_y = 8$.
Correct! $\bar{y} = -2(10) + 5 = -15$. $\sigma_y = |-2| \cdot 4 = 8$. SD is always non-negative — the absolute value is the safety net for the $a < 0$ case.正确!$\bar{y} = -2(10) + 5 = -15$。$\sigma_y = |-2| \cdot 4 = 8$。标准差恒非负 — 当 $a < 0$ 时绝对值就是安全网。
$\bar{y} = a\bar{x} + b = -2(10) + 5 = -15$. $\sigma_y = |a|\sigma_x = 2 \cdot 4 = 8$ (SD is non-negative).$\bar{y} = a\bar{x} + b = -2(10) + 5 = -15$。$\sigma_y = |a|\sigma_x = 2 \cdot 4 = 8$(标准差非负)。

1.7 Bivariate Data: Scatter, Correlation, Regression1.7 双变量数据:散点图、相关与回归 SL 4.4

Bivariate data $(x_i, y_i)$ is shown on a scatter plot. Three questions follow. 双变量数据(bivariate data)$(x_i, y_i)$ 用散点图(scatter diagram)展示。随之有三个问题。
  1. Direction & form. Positive / negative / no association. Linear / curved.
  2. Strength. Pearson's PMCC $r = \dfrac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} \in [-1, 1]$. Closer to $\pm 1$ ⇒ stronger linear association.
  3. Best-fit line. The regression line of $y$ on $x$, $y = ax + b$, always passes through $(\bar{x}, \bar{y})$. Use it to predict $y$ from $x$ inside the data range.
  1. 方向与形态。正相关 / 负相关 / 无相关。线性 / 非线性。
  2. 强弱。皮尔逊积矩相关系数(Pearson's product-moment correlation coefficient,PMCC)$r = \dfrac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} \in [-1, 1]$。越接近 $\pm 1$ ⇒ 线性关联越强。
  3. 最佳拟合直线。$y$ 关于 $x$ 的回归直线(regression line)$y = ax + b$ 恒过 $(\bar{x}, \bar{y})$。用它在数据范围内由 $x$ 预测 $y$。
Sample sums (define everything else)样本和(用以定义其余量)
$$ S_{xx} = \sum (x_i - \bar{x})^2, \qquad S_{yy} = \sum (y_i - \bar{y})^2, \qquad S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}). $$

Computational form: $S_{xy} = \sum x_i y_i - n\bar{x}\bar{y}$, and similarly $S_{xx} = \sum x_i^2 - n\bar{x}^2$.

计算式:$S_{xy} = \sum x_i y_i - n\bar{x}\bar{y}$,类似地 $S_{xx} = \sum x_i^2 - n\bar{x}^2$。

Pearson's product-moment correlation coefficient皮尔逊积矩相关系数
$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \qquad -1 \le r \le 1. $$
Least-squares regression line of $y$ on $x$$y$ 关于 $x$ 的最小二乘回归直线
$$ y - \bar{y} = b(x - \bar{x}), \qquad b = \frac{S_{xy}}{S_{xx}}. $$

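Equivalently $y = a + bx$ (slope $= b$, intercept $= a$), which matches the GDC's LinReg(a + bx) output. Other parts of this guide write the same line as $y = ax + b$, with the two letters swapping roles, so always check which letter is the slope before quoting it. The line always passes through $(\bar{x}, \bar{y})$.

等价写为 $y = a + bx$(斜率 $= b$,截距 $= a$),与图形计算器(GDC)的 LinReg(a + bx) 输出一致。本指南其他地方把同一条直线写作 $y = ax + b$,两个字母的角色互换,引用前务必确认哪个字母是斜率。直线恒过 $(\bar{x}, \bar{y})$。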

Interpreting $r$如何解读 $r$ Sign. $r > 0$: $y$ tends to increase with $x$. $r < 0$: $y$ tends to decrease with $x$.
Strength. Rough thresholds (vary by source — the IB accepts a sensible verbal description):
符号。$r > 0$:$y$ 随 $x$ 倾向增加。$r < 0$:$y$ 随 $x$ 倾向减少。
强弱。常见阈值(不同教材略有出入 — IB 接受合理的文字描述):
  • $|r| \ge 0.9$ very strong
  • $0.7 \le |r| < 0.9$ strong
  • $0.5 \le |r| < 0.7$ moderate
  • $0.3 \le |r| < 0.5$ weak
  • $|r| < 0.3$ very weak / no linear association
  • $|r| \ge 0.9$ 极强
  • $0.7 \le |r| < 0.9$ 强
  • $0.5 \le |r| < 0.7$ 中等
  • $0.3 \le |r| < 0.5$ 弱
  • $|r| < 0.3$ 极弱 / 无线性关联
Three non-negotiable cautions三条不可商量的告诫 Correlation is not causation. A strong $r$ says only that the variables move together, not that one causes the other. Lurking variables and reverse causation routinely explain strong $r$.
$r$ measures linear association. Data that follows $y = x^2$ on $[-1, 1]$, with $x$-values spread symmetrically about $0$, has $r = 0$ but a perfect non-linear pattern. Always look at the scatter plot before quoting $r$.
Don't extrapolate. The regression line is only reliable inside the $x$-range of the data. Predictions outside (extrapolation) assume the linear pattern continues — usually it doesn't.
相关(correlation)不等于因果。$r$ 很强只说明两变量同步变化,并不证明一个引起另一个。潜在变量与反向因果都能造成强 $r$。
$r$ 只衡量线性关联。数据若在 $[-1, 1]$ 上满足 $y = x^2$,且 $x$ 值关于 $0$ 对称分布,则 $r = 0$,但却是完美的非线性模式。引用 $r$ 之前务必看散点图。
不要外推(extrapolation)。回归直线只在数据的 $x$ 区间内可靠。区间外的预测假设线性模式继续延伸 — 通常并不成立。
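The second caution can be made concrete. In the sketch below, the $x$-values are an illustrative choice placed symmetrically about $0$ on $[-1, 1]$, every point lies exactly on $y = x^2$, and the PMCC still comes out as zero. The helper `pmcc` is a hand-rolled implementation of the $S_{xy} / \sqrt{S_{xx} S_{yy}}$ formula, not a library call.

```python
from math import sqrt

def pmcc(xs, ys):
    """Pearson's r computed from the sums S_xy, S_xx and S_yy."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [-1, -0.5, 0, 0.5, 1]  # symmetric about 0 (illustrative choice)
ys = [x ** 2 for x in xs]   # exact quadratic relationship

print(pmcc(xs, ys))
```

The positive and negative halves of the parabola cancel in $S_{xy}$, so $r$ reports no linear association even though $y$ is completely determined by $x$.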

Worked Example — PMCC and regression line by hand例题 — 手算 PMCC 与回归直线

Problem: Five $(x, y)$ pairs:

$x$ | 2 | 3 | 5 | 7 | 8
$y$ | 4 | 5 | 7 | 10 | 12

$n = 5$. $\bar{x} = 25/5 = 5$, $\bar{y} = 38/5 = 7.6$.

Sums:

$$ \sum x_i^2 = 4 + 9 + 25 + 49 + 64 = 151, \qquad \sum y_i^2 = 16 + 25 + 49 + 100 + 144 = 334, $$ $$ \sum x_i y_i = 2\cdot 4 + 3\cdot 5 + 5\cdot 7 + 7\cdot 10 + 8\cdot 12 = 8 + 15 + 35 + 70 + 96 = 224. $$

$S_{xx} = 151 - 5(5)^2 = 151 - 125 = 26$. $S_{yy} = 334 - 5(7.6)^2 = 334 - 288.8 = 45.2$. $S_{xy} = 224 - 5(5)(7.6) = 224 - 190 = 34$.

PMCC.

$$ r = \frac{34}{\sqrt{26 \cdot 45.2}} = \frac{34}{\sqrt{1175.2}} \approx \frac{34}{34.281} \approx 0.992. $$

Very strong positive linear association.

Regression line. $b = 34/26 \approx 1.308$. Passing through $(\bar{x}, \bar{y}) = (5, 7.6)$:

$$ y - 7.6 = 1.308(x - 5) \;\Longrightarrow\; y \approx 1.31 x + 1.06. $$

Read this off the GDC on Paper 2 (LinReg(a + bx)); the by-hand version above is what you'd write on Paper 1.

题目:5 组 $(x, y)$:

$x$ | 2 | 3 | 5 | 7 | 8
$y$ | 4 | 5 | 7 | 10 | 12

$n = 5$。$\bar{x} = 25/5 = 5$,$\bar{y} = 38/5 = 7.6$。

各项求和:

$$ \sum x_i^2 = 4 + 9 + 25 + 49 + 64 = 151, \qquad \sum y_i^2 = 16 + 25 + 49 + 100 + 144 = 334, $$ $$ \sum x_i y_i = 2\cdot 4 + 3\cdot 5 + 5\cdot 7 + 7\cdot 10 + 8\cdot 12 = 8 + 15 + 35 + 70 + 96 = 224. $$

$S_{xx} = 151 - 5(5)^2 = 151 - 125 = 26$。$S_{yy} = 334 - 5(7.6)^2 = 334 - 288.8 = 45.2$。$S_{xy} = 224 - 5(5)(7.6) = 224 - 190 = 34$。

PMCC。

$$ r = \frac{34}{\sqrt{26 \cdot 45.2}} = \frac{34}{\sqrt{1175.2}} \approx \frac{34}{34.281} \approx 0.992. $$

极强的正向线性关联。

回归直线。$b = 34/26 \approx 1.308$。过 $(\bar{x}, \bar{y}) = (5, 7.6)$:

$$ y - 7.6 = 1.308(x - 5) \;\Longrightarrow\; y \approx 1.31 x + 1.06. $$

在试卷 2 直接从 GDC(LinReg(a + bx))读出;上面这种手算写法是试卷 1 上要写的。
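The hand calculation above can be reproduced step by step. This is a sketch in plain Python using the computational forms of $S_{xx}$, $S_{yy}$, $S_{xy}$; the variable names are illustrative, and a GDC will carry more decimal places than the rounded working.

```python
from math import sqrt

xs = [2, 3, 5, 7, 8]
ys = [4, 5, 7, 10, 12]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Computational forms of the three sums.
s_xx = sum(x * x for x in xs) - n * x_bar ** 2
s_yy = sum(y * y for y in ys) - n * y_bar ** 2
s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

r = s_xy / sqrt(s_xx * s_yy)
slope = s_xy / s_xx                # b in y - y_bar = b(x - x_bar)
intercept = y_bar - slope * x_bar  # forces the line through (x_bar, y_bar)

print(round(r, 3), round(slope, 3), round(intercept, 2))
```

Computing the intercept as $\bar{y} - b\bar{x}$ is the same step as expanding the point-slope form by hand.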

Worked Example — Using the line for prediction例题 — 用回归直线做预测

Problem: For the data above, predict $y$ when $x = 6$. Comment on the reliability.

$$ \hat{y} = 1.308(6) + 1.06 \approx 8.91. $$

$x = 6$ lies inside the observed range $[2, 8]$, so the prediction is interpolation — reliable given the very strong $r \approx 0.99$.

By contrast, predicting at $x = 20$ would be extrapolation: the linear pattern is not guaranteed beyond the data.

题目:用上题数据预测 $x = 6$ 时的 $y$。评估其可靠性。

$$ \hat{y} = 1.308(6) + 1.06 \approx 8.91. $$

$x = 6$ 位于观测范围 $[2, 8]$ 内,预测属内插(interpolation);鉴于 $r \approx 0.99$ 极强,结果可靠。

反之,预测 $x = 20$ 即外推(extrapolation):数据范围外不能保证线性模式仍然成立。
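The interpolation-versus-extrapolation check is mechanical enough to write down. In this sketch, `predict` is a hypothetical helper that returns the predicted value together with a label saying whether the input lies inside the observed $x$-range; the slope and intercept are the unrounded values from the example above.

```python
def predict(x, slope, intercept, x_min, x_max):
    """Return (y_hat, kind), where kind is 'interpolation' or 'extrapolation'."""
    y_hat = slope * x + intercept
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind

slope = 34 / 26                # S_xy / S_xx from the worked example
intercept = 7.6 - 5 * slope    # y_bar - slope * x_bar
x_min, x_max = 2, 8            # observed range of x

print(predict(6, slope, intercept, x_min, x_max))
print(predict(20, slope, intercept, x_min, x_max))
```

The arithmetic is identical in both calls; only the label tells you whether the answer deserves to be trusted.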

Worked Example — Interpreting parameters例题 — 解释回归参数

Problem: A regression of weekly sales $y$ (in thousands of dollars) on advertising spend $x$ (in thousands of dollars) gives $y = 4.2 x + 15$. Interpret the slope and intercept.

Slope $= 4.2$. For every $\$1{,}000$ extra spent on advertising, weekly sales increase by $\$4{,}200$ on average.

Intercept $= 15$. When advertising spend is $\$0$, the model predicts weekly sales of $\$15{,}000$. Caveat: $x = 0$ may lie outside the observed range, in which case the intercept's interpretation is shaky; flag this in a written answer.

题目:用每周销售额 $y$(单位:千美元)对广告投入 $x$(单位:千美元)作回归,得到 $y = 4.2 x + 15$。请解释斜率与截距。

斜率 $= 4.2$。广告每多投入 $\$1{,}000$,每周销售额平均增加 $\$4{,}200$。

截距 $= 15$。广告投入为 $\$0$ 时,模型预测每周销售额为 $\$15{,}000$。注意:$x = 0$ 可能在观测范围之外,此时截距的解释并不可靠;在书面答题中要标注这一点。

Visualizing the fit将拟合可视化

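[Figure: scatter plot of $y$ against $x$ with a dashed best-fit line through the highlighted mean point $(\bar{x}, \bar{y})$.][图:$y$ 对 $x$ 的散点图,虚线为最佳拟合直线,过高亮的均值点 $(\bar{x}, \bar{y})$。]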

The dashed line is the best fit. It must pass through the highlighted mean point $(\bar{x}, \bar{y})$ — that's the easy way to sketch the line "by eye" if asked.

虚线即最佳拟合直线。它必须过高亮的均值点 $(\bar{x}, \bar{y})$ — 这也是题目要求"目测"作图时最方便的画法。

▸ Going deeper — Why the regression line passes through $(\bar{x}, \bar{y})$▸ 深入探究 — 为何回归直线恒过 $(\bar{x}, \bar{y})$

Take the point-slope form $y - \bar{y} = b(x - \bar{x})$ and substitute $x = \bar{x}$:

取点斜式方程 $y - \bar{y} = b(x - \bar{x})$,代入 $x = \bar{x}$:

$$ y - \bar{y} = b(\bar{x} - \bar{x}) = 0 \;\;\Longrightarrow\;\; y = \bar{y}. $$

So the point $(\bar{x}, \bar{y})$ lies on the line by construction. The reason this happens is that the least-squares method minimises $\sum (y_i - (a + b x_i))^2$ over the intercept $a$ and the slope $b$; differentiating with respect to $a$ and setting the derivative to zero gives $\sum (y_i - a - b x_i) = 0$, equivalently $\bar{y} = a + b\bar{x}$.

因此 $(\bar{x}, \bar{y})$ 按构造就落在直线上。背后原因是:最小二乘(least squares)方法对截距 $a$ 与斜率 $b$ 极小化 $\sum (y_i - (a + b x_i))^2$;对 $a$ 求导并令导数为零,得到 $\sum (y_i - a - b x_i) = 0$,等价于 $\bar{y} = a + b\bar{x}$。

Practical use: in a Paper 1 question with no calculator, if you're given $\bar{x}, \bar{y}$ and the slope $b$, write the line in point-slope form immediately; no extra arithmetic is needed.

实用法:试卷 1 不允许使用计算器时,若题目给出 $\bar{x}, \bar{y}$ 与斜率 $b$,直接用点斜式写出方程 — 无需额外运算。

▸ Going deeper — Why PMCC is invariant under linear transformations▸ 深入探究 — 为何 PMCC 在线性变换下不变

Apply $u = a x + b$, $v = c y + d$ with $a, c \ne 0$. Then $\bar{u} = a\bar{x} + b$, $\bar{v} = c\bar{y} + d$, and

应用 $u = a x + b$,$v = c y + d$,其中 $a, c \ne 0$。则 $\bar{u} = a\bar{x} + b$,$\bar{v} = c\bar{y} + d$,且

$$ S_{uv} = \sum (u_i - \bar{u})(v_i - \bar{v}) = \sum a(x_i - \bar{x}) \cdot c(y_i - \bar{y}) = a c \, S_{xy}. $$

Similarly $S_{uu} = a^2 S_{xx}$, $S_{vv} = c^2 S_{yy}$. So:

类似地 $S_{uu} = a^2 S_{xx}$,$S_{vv} = c^2 S_{yy}$。于是:

$$ r_{uv} = \frac{ac\, S_{xy}}{\sqrt{a^2 S_{xx} \cdot c^2 S_{yy}}} = \frac{ac}{|ac|} \cdot \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \text{sign}(ac) \cdot r_{xy}. $$

If $ac > 0$, $r$ is unchanged. If $ac < 0$ (i.e. exactly one of the variables is reflected), the sign flips but the magnitude stays the same. This makes $r$ a true measure of association: it doesn't depend on the units chosen.

若 $ac > 0$,$r$ 不变;若 $ac < 0$(即恰有一个变量被反向),符号翻转而大小不变。这使 $r$ 成为真正的关联度度量:它不依赖于所选单位。
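The $\text{sign}(ac)$ result can be checked on the five-point dataset from the worked example. The transformations below are arbitrary illustrative choices, and `pmcc` is a hand-rolled helper implementing $S_{xy} / \sqrt{S_{xx} S_{yy}}$.

```python
from math import sqrt

def pmcc(xs, ys):
    """Pearson's r computed from the sums S_xy, S_xx and S_yy."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [2, 3, 5, 7, 8]
ys = [4, 5, 7, 10, 12]
r_xy = pmcc(xs, ys)

# ac > 0: rescale and shift both variables (like a change of units).
us = [1.8 * x + 32 for x in xs]
vs = [0.5 * y - 3 for y in ys]

# ac < 0: reflect x only, leave y alone.
ws = [-2 * x + 1 for x in xs]

print(round(r_xy, 4), round(pmcc(us, vs), 4), round(pmcc(ws, ys), 4))
```

The first two values match, and the third has the same magnitude with the opposite sign, as the derivation predicts.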

Bivariate data give $r = -0.95$. The correct interpretation:双变量数据给出 $r = -0.95$。正确的解读是:
SL 4.4
$x$ causes $y$ to decrease.$x$ 导致 $y$ 减少。
There is a strong positive linear association.存在强正向线性关联。
There is a strong negative linear association; $y$ tends to decrease as $x$ increases.存在强负向线性关联;$y$ 随 $x$ 增加而倾向减少。
The data lie exactly on a line of negative slope.数据精确落在一条负斜率的直线上。
Correct! $|r| = 0.95$ is strong (close to $1$). Negative sign means $y$ falls as $x$ rises. "Causes" overstates — correlation alone never proves causation. "Exactly on a line" would require $r = -1$.正确!$|r| = 0.95$ 很强(接近 $1$)。负号表示 $y$ 随 $x$ 上升而下降。"导致"用力过猛 — 仅凭相关绝不能证明因果。"精确落在一条直线上"则要求 $r = -1$。
$r = -0.95$ ⇒ strong negative linear association ($y$ falls as $x$ rises). Correlation is not causation, and "exactly on a line" requires $r = \pm 1$ exactly.$r = -0.95$ ⇒ 强负向线性关联($y$ 随 $x$ 上升而下降)。相关不等于因果,"精确落在一条直线上"严格要求 $r = \pm 1$。
A regression line $y = 2.4 x + 7$ comes from data with $\bar{x} = 5$. What must $\bar{y}$ be?某回归直线 $y = 2.4 x + 7$ 来自 $\bar{x} = 5$ 的数据。$\bar{y}$ 必为:
SL 4.4
$19$
$7$
$5$
Cannot be determined.无法确定。
Correct! The line passes through $(\bar{x}, \bar{y})$, so $\bar{y} = 2.4 (\bar{x}) + 7 = 2.4(5) + 7 = 19$.正确!直线过 $(\bar{x}, \bar{y})$,所以 $\bar{y} = 2.4 (\bar{x}) + 7 = 2.4(5) + 7 = 19$。
The line always passes through $(\bar{x}, \bar{y})$. Plug $\bar{x} = 5$ into the line: $\bar{y} = 2.4(5) + 7 = 19$.直线恒过 $(\bar{x}, \bar{y})$。把 $\bar{x} = 5$ 代入:$\bar{y} = 2.4(5) + 7 = 19$。

Exam Strategy & Common Pitfalls考试策略与常见陷阱

Memorize需要背
  • $\bar{x} = \dfrac{\sum f_i x_i}{\sum f_i}$ for grouped data SL 4.3
  • $\sigma = \sqrt{\dfrac{\sum(x_i - \mu)^2}{n}}$ (population) SL 4.3
  • $\sigma^2 = \overline{x^2} - \bar{x}^2$ computational form SL 4.3
  • Linear-transformation rule: $\bar{y} = a\bar{x} + b$, $\sigma_y = |a|\sigma_x$ SL 4.3
  • Outlier rule: $x < Q_1 - 1.5\,\text{IQR}$ or $x > Q_3 + 1.5\,\text{IQR}$ SL 4.2
  • PMCC: $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, $r \in [-1, 1]$ SL 4.4
  • Regression line passes through $(\bar{x}, \bar{y})$ SL 4.4
  • 分组数据均值 $\bar{x} = \dfrac{\sum f_i x_i}{\sum f_i}$ SL 4.3
  • 总体标准差 $\sigma = \sqrt{\dfrac{\sum(x_i - \mu)^2}{n}} $ SL 4.3
  • 计算式 $\sigma^2 = \overline{x^2} - \bar{x}^2$ SL 4.3
  • 线性变换规则:$\bar{y} = a\bar{x} + b$,$\sigma_y = |a|\sigma_x$ SL 4.3
  • 异常值规则:$x < Q_1 - 1.5\,\text{IQR}$ 或 $x > Q_3 + 1.5\,\text{IQR}$ SL 4.2
  • PMCC:$r = S_{xy} / \sqrt{S_{xx} S_{yy}}$,$r \in [-1, 1]$ SL 4.4
  • 回归直线恒过 $(\bar{x}, \bar{y})$ SL 4.4
Understand需要理解
  • Why a random sample matters — bias is fatal even with large $n$.
  • Why bar area (not height) encodes frequency on a continuous histogram.
  • Why mean is sensitive to outliers and median isn't (squared vs. absolute).
  • Why SD divides by $n$ in the IB, not $n - 1$ (population convention).
  • Why $\sigma_y = |a|\sigma_x$: shift vanishes, scale comes out squared then rooted.
  • Why $r$ is invariant under linear transformations of $x$ or $y$ (up to a sign flip when exactly one scale factor is negative).
  • Why the regression line passes through $(\bar{x}, \bar{y})$ (least-squares condition).
  • 为什么随机样本(random sample)至关重要 — 抽样偏差(sampling bias)无法通过加大 $n$ 修复。
  • 为什么连续直方图(histogram)用柱子的面积而非高度表示频数。
  • 为什么均值对异常值敏感、中位数不敏感(平方 vs. 绝对值)。
  • 为什么 IB 的标准差除以 $n$ 而非 $n - 1$(总体约定)。
  • 为什么 $\sigma_y = |a|\sigma_x$:平移消失,比例因子先平方再开方。
  • 为什么 PMCC $r$ 在 $x$ 或 $y$ 的线性变换下保持不变(恰有一个比例因子为负时仅符号翻转)。
  • 为什么回归直线恒过 $(\bar{x}, \bar{y})$(最小二乘条件的副产物)。

Common Pitfalls常见陷阱

Top student errors学生最常见的错误 1. Using $s_x$ (sample SD, divide by $n - 1$) instead of $\sigma_x$ (divide by $n$). The IB defaults to the latter.
2. Saying "$r = 0.6$ means $x$ causes $y$." Correlation $\ne$ causation; always hedge in writing.
3. Computing PMCC and quoting it without ever looking at the scatter plot — $r$ doesn't detect curvature.
4. Extrapolating: using the regression line to predict $y$ for $x$ far outside the data range.
5. Forgetting that the regression line passes through $(\bar{x}, \bar{y})$. This is the fastest way to recover the intercept on Paper 1.
6. On grouped-data mean, averaging midpoints without weighting by frequency.
7. Outlier rule: forgetting the factor $1.5$, or applying the rule to $\bar{x} \pm 1.5\sigma$ instead of $Q_1, Q_3 \pm 1.5\,\text{IQR}$.
8. Linear transformations: adding $b$ to the SD (it doesn't change), or forgetting the absolute value on $|a|$ when $a < 0$.
9. Reporting just one of (centre, spread) instead of both.
10. Reading the median off a cumulative-frequency graph at $0.5$ on the $y$-axis instead of $0.5 n$.
1. 用 $s_x$(样本标准差,除以 $n - 1$)代替 $\sigma_x$(除以 $n$)。IB 默认后者。
2. 说"$r = 0.6$ 意味着 $x$ 导致 $y$"。相关 $\ne$ 因果;书面回答必须留有余地。
3. 算出 PMCC 就直接引用,根本没看散点图 — $r$ 检测不到曲率。
4. 外推:用回归直线在数据范围之外的 $x$ 处预测 $y$。
5. 忘记回归直线恒过 $(\bar{x}, \bar{y})$。这在试卷 1 上是最快恢复截距的方法。
6. 分组数据求均值时,对组中值取算术平均,忘了按频数加权。
7. 异常值规则:漏掉系数 $1.5$,或错用为 $\bar{x} \pm 1.5\sigma$ 而非 $Q_1, Q_3 \pm 1.5\,\text{IQR}$。
8. 线性变换:把 $b$ 加到标准差上(标准差不变),或当 $a < 0$ 时忘了取 $|a|$。
9. 只报告中心或离散程度其中之一,而非两者一起报。
10. 在累积频率图上把中位数读到 $y$ 轴的 $0.5$ 而不是 $0.5 n$。
Paper-specific notes各试卷专题提示 Paper 1 (no calc). Small datasets with clean numbers. Expect by-hand mean, median, quartiles, SD via the computational form $\overline{x^2} - \bar{x}^2$, and a regression line where they give you $\bar{x}, \bar{y}$, and one of $b$ or $r$. Outlier-rule applications. Interpreting a given $r$ in words.
Paper 2 (calc). Larger datasets entered as L1 / L2 lists. Read 1-Var Stats for mean and SD; 2-Var Stats and LinReg for $r$, regression line, and predictions. Histograms and box plots already drawn — you describe shape and read off summary statistics. Often a final-mark question on whether a prediction is reliable (inside vs. outside data range).
试卷 1(不可使用计算器)。数据量小、数字干净。可能要求手算均值、中位数、四分位数(quartile);用计算式 $\overline{x^2} - \bar{x}^2$ 算标准差;给出 $\bar{x}, \bar{y}$ 以及 $b$ 或 $r$ 之一,要求写出回归直线。考异常值规则的应用。要用文字解读给定的 $r$。
试卷 2(可使用计算器)。数据量较大,作为 L1 / L2 列表录入。用 1-Var Stats 读出均值与标准差;用 2-Var Stats 与 LinReg 读出 $r$、回归直线与预测值。直方图与箱形图通常已经画好 — 你只需描述形状并读出概括统计量。结尾一道分数题常问预测是否可靠(在数据范围内还是外)。

Flashcards速记卡

Population vs. sample?总体与样本?
Population = all individuals.
Sample = subset measured.
总体 = 全体个体。
样本 = 实际测量的子集。
Outlier rule (IQR)?异常值规则(IQR)?
$$x < Q_1 - 1.5\,\text{IQR}\;\;\text{or}\;\;x > Q_3 + 1.5\,\text{IQR}$$
Mean of grouped data?分组数据的均值?
$$\bar{x} \approx \frac{\sum f_i x_i}{\sum f_i}$$
Population variance (IB convention)?总体方差(IB 约定)?
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$$
Computational form of $\sigma^2$?$\sigma^2$ 的计算式?
$$\sigma^2 = \overline{x^2} - \bar{x}^2$$
Grouped-data SD?分组数据的标准差?
$$\sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$$
Under $y = ax + b$, new mean?$y = ax + b$ 下,新的均值?
$$\bar{y} = a\bar{x} + b$$
Under $y = ax + b$, new SD?$y = ax + b$ 下,新的标准差?
$$\sigma_y = |a|\,\sigma_x$$
PMCC formula?PMCC 公式?
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$
Range of $r$?$r$ 的取值范围?
$$-1 \le r \le 1$$
Regression line of $y$ on $x$?$y$ 关于 $x$ 的回归直线?
$$y - \bar{y} = b(x - \bar{x}),\quad b = \frac{S_{xy}}{S_{xx}}$$
Regression line always passes through?回归直线恒过的点?
$$(\bar{x},\,\bar{y})$$
$z$-score?$z$ 分数?
$$z = \frac{x - \mu}{\sigma}$$
Five-number summary?五数概括?
$$\{\min,\,Q_1,\,\text{med},\,Q_3,\,\max\}$$

Unit D1 — Practice QuizUnit D1 — 练习测验

Ten mixed-difficulty items. Aim for 8/10 before exam day.

10 道难度混合的题目。考试前争取达到 8/10。

1. A school surveys students by emailing every 10th student on the alphabetical roll. This is best described as:1. 某校以姓名字母顺序的名单上每隔 10 人发邮件做学生问卷调查。这种抽样最合适的描述是:
Q1 · SL 4.1
Stratified random.分层随机抽样。
Simple random.简单随机抽样。
Systematic.系统抽样。
Convenience.便利抽样。
Correct! "Every $k$-th from an ordered list" is the systematic sampling definition. It is not simple random because once the first student is chosen, the rest are determined.正确!"在有序名单上每 $k$ 人取一个"正是系统抽样(systematic sample)的定义。一旦第一个学生确定,其余都由间隔决定,所以不是简单随机。
Every 10th student from an ordered list is the defining shape of a systematic sample."在有序名单上每隔 10 人取一个"正是系统抽样的标准定义。
2. Find the median of $\{2, 5, 5, 8, 11, 11, 14, 19\}$.2. 求 $\{2, 5, 5, 8, 11, 11, 14, 19\}$ 的中位数。
Q2 · SL 4.3
$8$
$9.5$
$11$
$10$
Correct! $n = 8$, so the median is the mean of the 4th and 5th values: $(8 + 11)/2 = 9.5$.正确!$n = 8$,中位数为第 4 与第 5 个值的平均:$(8 + 11)/2 = 9.5$。
$n = 8$ — even — median is the mean of positions 4 and 5: $(8 + 11)/2 = 9.5$.$n = 8$(偶数),中位数为第 4、5 位的均值:$(8 + 11)/2 = 9.5$。
3. A box plot shows $Q_1 = 24$, $Q_3 = 38$. Which value is an outlier?3. 某箱形图显示 $Q_1 = 24$,$Q_3 = 38$。哪个值是异常值?
Q3 · SL 4.2
$2$
$45$
$10$
$50$
Correct! IQR $= 14$. $L = 24 - 21 = 3$; $U = 38 + 21 = 59$. Only $2 < 3$ is an outlier. $10$ is above $L$, and $45, 50$ are below $U$.正确!IQR $= 14$。$L = 24 - 21 = 3$;$U = 38 + 21 = 59$。只有 $2 < 3$ 是异常值。$10$ 在 $L$ 之上,$45, 50$ 在 $U$ 之下。
$\text{IQR} = 38 - 24 = 14$. Fences $L = 24 - 1.5(14) = 3$ and $U = 38 + 1.5(14) = 59$. Only the value $2$ is outside $[3, 59]$.$\text{IQR} = 38 - 24 = 14$。围栏 $L = 24 - 1.5(14) = 3$、$U = 38 + 1.5(14) = 59$。只有 $2$ 落在 $[3, 59]$ 之外。
4. The (population) standard deviation of $\{4, 6, 8, 10\}$ is:4. $\{4, 6, 8, 10\}$ 的(总体)标准差为:
Q4 · SL 4.3
$\sqrt{6} \approx 2.45$
$\sqrt{20/3} \approx 2.58$
$\sqrt{5} \approx 2.24$
$2$
Correct! $\bar{x} = 7$. Squared deviations: $9, 1, 1, 9$, sum $= 20$. $\sigma^2 = 20/4 = 5$, $\sigma = \sqrt{5} \approx 2.24$. (Option B uses $n - 1 = 3$ — not IB convention.)正确!$\bar{x} = 7$。偏差平方:$9, 1, 1, 9$,和 $= 20$。$\sigma^2 = 20/4 = 5$,$\sigma = \sqrt{5} \approx 2.24$。(选项 B 用 $n - 1 = 3$ — 非 IB 约定。)
IB uses $\sigma^2 = \sum (x_i - \bar{x})^2 / n$. Here sum $= 20$, $n = 4$, so $\sigma^2 = 5$ and $\sigma = \sqrt{5}$.IB 用 $\sigma^2 = \sum (x_i - \bar{x})^2 / n$。这里和 $= 20$,$n = 4$,所以 $\sigma^2 = 5$,$\sigma = \sqrt{5}$。
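To see the difference between dividing by $n$ and by $n - 1$, the sketch below computes the population SD by hand and compares it with the standard-library functions `statistics.pstdev` (divides by $n$) and `statistics.stdev` (divides by $n - 1$).

```python
from math import isclose, sqrt
from statistics import pstdev, stdev

data = [4, 6, 8, 10]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean: 9 + 1 + 1 + 9.
sum_sq = sum((x - mean) ** 2 for x in data)

sigma = sqrt(sum_sq / n)  # population SD, the convention used in this guide

print(sigma, pstdev(data))  # both divide by n
print(stdev(data))          # divides by n - 1, giving the larger distractor value
```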
5. A grouped table has midpoints $\{10, 20, 30, 40\}$ with frequencies $\{2, 8, 6, 4\}$. The estimated mean is:5. 某分组表组中值为 $\{10, 20, 30, 40\}$,频数为 $\{2, 8, 6, 4\}$。估计均值为:
Q5 · SL 4.3
$26$
$25$
$22$
$30$
Correct! $\sum f_i x_i = 2(10) + 8(20) + 6(30) + 4(40) = 20 + 160 + 180 + 160 = 520$. $\sum f_i = 20$. Mean $= 520/20 = 26$.正确!$\sum f_i x_i = 2(10) + 8(20) + 6(30) + 4(40) = 20 + 160 + 180 + 160 = 520$。$\sum f_i = 20$。均值 $= 520/20 = 26$。
Weighted sum: $20 + 160 + 180 + 160 = 520$. Total frequency $= 20$. Mean $= 26$.加权和:$20 + 160 + 180 + 160 = 520$。总频数 $= 20$。均值 $= 26$。
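The estimated mean $\sum f_i x_i / \sum f_i$ translates directly into code. The variable names below are illustrative only.

```python
midpoints = [10, 20, 30, 40]
frequencies = [2, 8, 6, 4]

# Each midpoint stands in for every value in its class, weighted by the class frequency.
weighted_sum = sum(f * x for f, x in zip(frequencies, midpoints))
total_frequency = sum(frequencies)
estimated_mean = weighted_sum / total_frequency

print(weighted_sum, total_frequency, estimated_mean)
```

The result is an estimate because the midpoints replace the unknown individual values in each class.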
6. Heights have mean $170$ cm and SD $8$ cm. After converting to metres ($y = 0.01 x$), the new mean and SD are:6. 身高均值 $170$ cm、标准差 $8$ cm。换算为米($y = 0.01 x$)后,新的均值与标准差为:
Q6 · SL 4.3
$\bar{y} = 1.70$, $\sigma_y = 8$
$\bar{y} = 170$, $\sigma_y = 0.08$
$\bar{y} = 1.70$, $\sigma_y = 0.80$
$\bar{y} = 1.70$, $\sigma_y = 0.08$
Correct! $\bar{y} = 0.01 \cdot 170 = 1.70$ m. $\sigma_y = |0.01| \cdot 8 = 0.08$ m. Scale factor applies to both.正确!$\bar{y} = 0.01 \cdot 170 = 1.70$ m。$\sigma_y = |0.01| \cdot 8 = 0.08$ m。比例因子同时作用于两者。
Under $y = 0.01 x$: $\bar{y} = 0.01 \cdot 170 = 1.70$, $\sigma_y = 0.01 \cdot 8 = 0.08$ — same factor scales both.在 $y = 0.01 x$ 下:$\bar{y} = 0.01 \cdot 170 = 1.70$,$\sigma_y = 0.01 \cdot 8 = 0.08$ — 同一因子同时缩放两者。
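The scaling rule can be checked numerically. The five heights below are hypothetical, chosen so that their mean is $170$ cm and their population SD is $8$ cm as in the question. The second transformation, $y = -2x + 5$, is an extra illustration (not part of the question) showing that the SD scales by $|a|$ and ignores $b$.

```python
from math import isclose
from statistics import fmean, pstdev

def summary(values):
    """Return (mean, population SD) of a list of numbers."""
    return fmean(values), pstdev(values)

# Hypothetical heights in cm with mean 170 and population SD 8.
heights_cm = [158, 166, 170, 174, 182]
mean_x, sd_x = summary(heights_cm)

# Unit change y = 0.01 x (cm to m): mean and SD are both multiplied by 0.01.
mean_m, sd_m = summary([0.01 * x for x in heights_cm])

# Illustration with y = -2x + 5: the mean follows the full formula, the SD scales by |-2| = 2.
mean_t, sd_t = summary([-2 * x + 5 for x in heights_cm])

print(mean_x, sd_x)
print(mean_m, sd_m)
print(mean_t, sd_t)
```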
7. Bivariate data has PMCC $r = -0.12$. The correct one-line interpretation is:7. 某双变量数据 PMCC 为 $r = -0.12$。正确的一行解读是:
Q7 · SL 4.4
Strong negative linear association.强负向线性关联。
Strong positive linear association.强正向线性关联。
Very weak / no linear association.极弱 / 无线性关联。
Perfect negative correlation.完美负相关。
Correct! $|r| = 0.12$ is well below $0.3$, so almost no linear association. The negative sign is too weak to read into.正确!$|r| = 0.12$ 远低于 $0.3$,几乎没有线性关联。这么弱的负号也不宜过度解读。
$|r| = 0.12 < 0.3$ ⇒ very weak / negligible linear association. A scatter plot might still show a non-linear pattern.$|r| = 0.12 < 0.3$ ⇒ 极弱 / 可忽略的线性关联。散点图也许仍能显示非线性模式。
8. The regression line of $y$ on $x$ for a dataset is $y = 1.5 x + 4$. If $\bar{x} = 6$, then $\bar{y}$ equals:8. 某数据集 $y$ 关于 $x$ 的回归直线为 $y = 1.5 x + 4$。若 $\bar{x} = 6$,则 $\bar{y}$ 为:
Q8 · SL 4.4
$10$
$13$
$4$
Cannot be determined.无法确定。
Correct! The line passes through $(\bar{x}, \bar{y})$, so $\bar{y} = 1.5(6) + 4 = 13$.正确!直线过 $(\bar{x}, \bar{y})$,所以 $\bar{y} = 1.5(6) + 4 = 13$。
Use the fact that the line passes through $(\bar{x}, \bar{y})$: $\bar{y} = 1.5\bar{x} + 4 = 1.5(6) + 4 = 13$.利用直线过 $(\bar{x}, \bar{y})$:$\bar{y} = 1.5\bar{x} + 4 = 1.5(6) + 4 = 13$。
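The reason the line passes through $(\bar{x}, \bar{y})$ is visible in the standard least-squares formulas: the intercept is defined as $b = \bar{y} - a\bar{x}$. The sketch below uses a hypothetical dataset, constructed so that the fitted line happens to be $y = 1.5x + 4$ with $\bar{x} = 6$; the data are not from the question.

```python
from math import isclose

def least_squares_line(xs, ys):
    """Fit y = a x + b by least squares; also return the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx
    b = mean_y - a * mean_x  # this line forces the fit through (mean_x, mean_y)
    return a, b, mean_x, mean_y

# Hypothetical data (not from the question).
xs = [2, 4, 6, 8, 10]
ys = [6, 12, 13, 14, 20]
a, b, mean_x, mean_y = least_squares_line(xs, ys)

print(a, b, mean_x, mean_y)
print(a * mean_x + b)  # evaluating the line at mean_x recovers mean_y
```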
9. A scatter plot shows points lying on a curve $y = x^2$ for $x \in [-1, 1]$. The PMCC $r$ is:9. 散点图上各点在 $x \in [-1, 1]$ 区间内落在曲线 $y = x^2$ 上。则 PMCC $r$:
Q9 · SL 4.4
Close to $+1$.接近 $+1$。
Close to $-1$.接近 $-1$。
Exactly $+1$.恰为 $+1$。
Close to $0$.接近 $0$。
Correct! $y = x^2$ is symmetric in $x$ on $[-1, 1]$: for every point $(x_0, x_0^2)$, the mirror point $(-x_0, x_0^2)$ has the same $y$. Provided the $x$-values are spread evenly on both sides of $0$, the positive and negative contributions to the covariance $S_{xy}$ cancel, so $r \approx 0$. The relationship is strong, just not linear.正确!$y = x^2$ 在 $[-1, 1]$ 上关于 $x$ 对称:每个点 $(x_0, x_0^2)$ 都有镜像点 $(-x_0, x_0^2)$,$y$ 相同。只要 $x$ 值在 $0$ 两侧均匀分布,协方差 $S_{xy}$ 中的正负贡献相互抵消,因此 $r \approx 0$。关系本身很强,只是不是线性。
PMCC measures linear association only. A symmetric $y = x^2$ has $r \approx 0$ — lesson: always look at the scatter plot, not just $r$.PMCC 只衡量线性关联。对称的 $y = x^2$ 有 $r \approx 0$ — 教训:永远要看散点图,而不是只看 $r$。
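A short numerical check makes the point concrete. The helper `pmcc` below is a hand-written sketch of the usual formula $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, and both sets of $x$-values are hypothetical. The one-sided set is an added contrast: on $[0, 1]$ alone the same curve gives a high $r$, so the near-zero value depends on the symmetry about $0$.

```python
from math import sqrt

def pmcc(xs, ys):
    """Pearson's r computed from S_xy, S_xx and S_yy."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical x-values: one set symmetric about 0, one set on [0, 1] only.
symmetric_x = [-1, -0.5, 0, 0.5, 1]
one_sided_x = [0, 0.25, 0.5, 0.75, 1]

r_symmetric = pmcc(symmetric_x, [x ** 2 for x in symmetric_x])
r_one_sided = pmcc(one_sided_x, [x ** 2 for x in one_sided_x])

print(r_symmetric, r_one_sided)
```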
10. A regression of ice-cream sales $y$ on temperature $x$ ($^\circ$C), based on data with $10 \le x \le 28$, gives $y = 4.2 x - 18$. Using the line to predict sales at $x = 5\,^\circ$C is:10. 冰淇淋销售额 $y$ 对气温 $x$($^\circ$C)的回归基于 $10 \le x \le 28$ 的数据,得到 $y = 4.2 x - 18$。用此直线在 $x = 5\,^\circ$C 预测销售额:
Q10 · SL 4.4
Reliable — the line was computed from the data.可靠 — 直线由数据算得。
Reliable — PMCC is presumably high.可靠 — PMCC 想必很高。
Unreliable — $x = 5$ is outside the observed data range (extrapolation).不可靠 — $x = 5$ 在观测数据范围之外(外推)。
Unreliable — the slope is positive.不可靠 — 斜率为正。
Correct! $x = 5$ lies below the minimum observed $x = 10$. Any prediction outside the data range relies on extrapolation: we have no evidence the linear pattern continues there.正确!$x = 5$ 低于观测的最小值 $x = 10$。在数据范围之外的预测必然是外推:没有证据表明线性模式在那里仍然成立。
$x = 5$ is outside the observed range $[10, 28]$ — that's extrapolation. The regression line is only safe to use for interpolation.$x = 5$ 在观测范围 $[10, 28]$ 之外 — 这就是外推。回归直线只在内插(interpolation)时安全可用。
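One way to make the interpolation check explicit is to return the prediction together with a flag saying whether $x$ lies inside the observed range. The function name and return shape are illustrative only; the coefficients and the range $[10, 28]$ come from the question.

```python
from math import isclose

def predict_sales(x, a=4.2, b=-18.0, x_min=10.0, x_max=28.0):
    """Return (predicted y, True if x is inside the observed data range)."""
    y_hat = a * x + b
    is_interpolation = x_min <= x <= x_max
    return y_hat, is_interpolation

cold_day = predict_sales(5)   # x below the observed minimum: extrapolation
mild_day = predict_sales(20)  # x inside [10, 28]: interpolation

print(cold_day, mild_day)
```

The line still returns a number at $x = 5$; the flag is a reminder that the data give no evidence for trusting it there.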

Readiness Checklist备考清单

Click each item you've mastered. Aim for 100% before exam day. The IB sub-topic reference is tagged on each item.

勾选你已掌握的每一项。考试前争取 100%。每项都标注了对应的 IB 子主题。


IB Paper-Style PracticeIB 试卷风格练习

IB exam-style questions across Paper 1A (short response, no calculator), Paper 1B (extended response, no calculator), and Paper 2 (calculator allowed). D1 is SL-only, so there is no Paper 3. Questions are a mix of Easy, Medium, and Hard (EMH) difficulty. Mark-by-mark solutions live in the separate solutions file. Use this after the in-page quiz and flashcards.

IB 考试风格题,涵盖 Paper 1A(短答,无计算器)、Paper 1B(长答,无计算器)、Paper 2(可用计算器)。D1 仅 SL,无 Paper 3。难度按易、中、难(EMH)分级。逐分解答见独立的解答文档。建议在做完本页测验与闪卡后再来。
