Univariate Data单变量数据
Sub-topics 4.1 – 4.4 of IB AA HL Topic 4. Sampling, presentation of data, central tendency & dispersion, linear correlation & regression.
IB AA HL 主题 4 的子主题 4.1–4.4:抽样(sampling)、数据呈现、集中趋势与离散程度、线性相关(linear correlation)与回归(regression)。
How to use this guide使用指南
Read the Cram cheat-sheet next, then skim the dashed-gold "Cram-Mode Cheat" box at the top of each section, plus the formula boxes. One sentence to leave with: mean and SD describe the centre and spread; median and IQR describe them robustly; PMCC $r$ measures linear strength on $[-1, 1]$ and the regression line $y = ax + b$ goes through $(\bar{x}, \bar{y})$. Do one worked example per section. Then take the practice quiz.
先看考前速查表,再扫一眼每节顶端金色虚线框里的"考前速查"方框,以及公式框。一句话带走:均值(mean)与标准差(standard deviation)描述中心与离散程度;中位数(median)与四分位距(interquartile range)做稳健描述;皮尔逊积矩相关系数(Pearson's product-moment correlation coefficient)$r$ 在 $[-1, 1]$ 上衡量线性强弱;回归直线(regression line)$y = ax + b$ 恒过 $(\bar{x}, \bar{y})$。每节做一道例题,最后做练习测验。
Open every ▸ Going deeper. Univariate data is the topic where IB rewards interpretation over arithmetic. Owning why the mean shifts under $y = ax + b$ but the SD only scales by $|a|$, and why $r$ is unchanged by a linear transformation of either variable (apart from a sign flip when the scale factor is negative), is what separates a 5 from a 7 on the data-handling paper.
展开每一个 ▸ Going deeper(深入探究)小节。单变量数据是 IB 更看重解释而非计算的主题。真正搞懂:为什么在 $y = ax + b$ 下均值随之平移但标准差只按 $|a|$ 缩放;为什么相关系数(correlation coefficient)$r$ 在任一变量做线性变换(linear transformation)时保持不变(比例因子为负时仅改变符号)。这就是数据处理卷上 5 分与 7 分的分水岭。
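No HL chips appear here. Every section in this unit is SL core content (sub-topics SL 4.1–4.4). HL-only sub-topics such as conditional probability, Bayes' theorem and continuous random variables are in Units D2 and D3.
本单元每一节都是 SL 核心内容(子主题 SL 4.1–4.4)。HL 独有的条件概率(conditional probability)、贝叶斯(Bayes)和连续随机变量等子主题在 Unit D2 与 D3。此处不出现 HL 标签。
On Paper 2, the graphic display calculator (GDC) is the main tool. Whenever a question says "find the mean", "find $r$" or "find the regression line", the IB markscheme expects you to enter the data once and read off the answer. We do not show specific TI / Casio keystrokes; you only need to remember the menu names: 1-Var Stats, 2-Var Stats, LinReg(a + bx). Paper 1 statistics questions use clean numbers that can be handled by hand.
在试卷 2(Paper 2)上,图形计算器(GDC)是主要工具。凡是题面写"求均值"、"求 $r$"或"求回归直线",IB 评分标准期望你录入数据一次后直接读出答案。我们不演示具体 TI / Casio 按键;你只需要记住菜单名:1-Var Stats、2-Var Stats、LinReg(a + bx)。试卷 1(Paper 1)的统计题则使用干净的数字,可手算解决。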
Cram Cheat-Sheet考前速查表
Sampling & data types抽样与数据类型 SL 4.1
- Population = every individual of interest; sample = a subset actually measured.
- Random sample ⇒ every individual has equal chance of selection. Other methods: systematic, stratified, quota, convenience (biased).
- Discrete data = counts (integers). Continuous = measurements on a continuum.
- 总体(population) = 所关注的全体个体;样本(sample) = 实际被测量的子集。
- 随机样本(random sample) ⇒ 每个个体被抽到的概率相等。其他方法:系统抽样(systematic)、分层抽样(stratified)、配额抽样(quota)、便利抽样(convenience,有偏)。
- 离散(discrete)数据 = 计数(整数)。连续(continuous) = 在连续区间上的测量值。
Presenting data数据呈现 SL 4.2
- Frequency table → histogram (bars touch; area $\propto$ frequency for continuous data).
- Cumulative frequency curve → read off median, $Q_1$, $Q_3$, percentiles.
- Box-and-whisker: five-number summary $\{\min, Q_1, \text{med}, Q_3, \max\}$.
- Outlier rule: $x < Q_1 - 1.5 \cdot \text{IQR}$ or $x > Q_3 + 1.5 \cdot \text{IQR}$.
- 频率分布表(frequency table) → 直方图(histogram)(连续数据时柱子相连;面积 $\propto$ 频数)。
- 累积频率曲线(cumulative frequency curve) → 读出中位数、$Q_1$、$Q_3$ 与百分位数(percentiles)。
- 箱形图(box plot):五数概括(five-number summary)$\{\min, Q_1, \text{med}, Q_3, \max\}$。
- 异常值规则(outlier rule):$x < Q_1 - 1.5 \cdot \text{IQR}$ 或 $x > Q_3 + 1.5 \cdot \text{IQR}$。
Central tendency集中趋势 SL 4.3
- Mean $\bar{x} = \dfrac{\sum x_i}{n}$ or $\dfrac{\sum f_i x_i}{\sum f_i}$ for grouped data (use the midpoint of each class).
- Median = middle value when ordered. Modal class = class with greatest frequency density.
- If a single big outlier exists, prefer median + IQR over mean + SD.
- 均值(mean)$\bar{x} = \dfrac{\sum x_i}{n}$;分组数据用 $\dfrac{\sum f_i x_i}{\sum f_i}$($x_i$ 取每个组的组中值)。
- 中位数(median) = 排序后处于中间的值。众数类(modal class) = 频率密度(frequency density)最大的那个类。
- 如果出现单个大异常值,优先选用中位数 + IQR,而不是均值 + 标准差。
Dispersion离散程度 SL 4.3
- Range = $\max - \min$. IQR $= Q_3 - Q_1$.
- IB defaults to the population SD (divide by $n$, not $n-1$): $$ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}, \qquad \sigma^2 = \frac{\sum (x_i - \mu)^2}{n}. $$
- For grouped data: $\sigma = \sqrt{\dfrac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$.
- 极差(range) = $\max - \min$。四分位距(interquartile range,IQR)$= Q_3 - Q_1$。
- IB 默认使用总体标准差(population standard deviation)(除以 $n$,而非 $n-1$): $$ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}, \qquad \sigma^2 = \frac{\sum (x_i - \mu)^2}{n}. $$
- 分组数据:$\sigma = \sqrt{\dfrac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$。
Linear transformations线性变换 SL 4.3
- If $y_i = a x_i + b$, then $\bar{y} = a \bar{x} + b$ and $\sigma_y = |a|\, \sigma_x$, $\sigma_y^2 = a^2 \sigma_x^2$.
- Adding a constant shifts the mean, leaves SD alone. Scaling by $a$ multiplies both mean and SD (SD by $|a|$).
- 若 $y_i = a x_i + b$,则 $\bar{y} = a \bar{x} + b$,$\sigma_y = |a|\, \sigma_x$,$\sigma_y^2 = a^2 \sigma_x^2$。
- 加一个常数平移均值,标准差不变。乘以 $a$ 同时缩放均值与标准差(标准差按 $|a|$ 缩放)。
Bivariate & regression双变量数据与回归 SL 4.4
- PMCC: $r = \dfrac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} \in [-1, 1]$. $|r|$ near $1$ ⇒ strong linear; near $0$ ⇒ weak.
- Regression line of $y$ on $x$: $y - \bar{y} = \dfrac{S_{xy}}{S_{xx}}(x - \bar{x})$. Written as $y = ax + b$, the slope is $a = S_{xy}/S_{xx}$ and the intercept is $b = \bar{y} - a\bar{x}$. The line always passes through $(\bar{x}, \bar{y})$.
- Only predict inside the data range (interpolation). Extrapolation is unreliable.
- 皮尔逊积矩相关系数(Pearson's product-moment correlation coefficient,PMCC):$r = \dfrac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} \in [-1, 1]$。$|r|$ 接近 $1$ ⇒ 线性关系强;接近 $0$ ⇒ 弱。
- $y$ 关于 $x$ 的回归直线(regression line):$y - \bar{y} = \dfrac{S_{xy}}{S_{xx}}(x - \bar{x})$。写成 $y = ax + b$ 时,斜率 $a = S_{xy}/S_{xx}$,截距 $b = \bar{y} - a\bar{x}$。恒过 $(\bar{x}, \bar{y})$。
- 只在数据范围内做预测(内插,interpolation)。外推(extrapolation)不可靠。
1.1 Populations and Samples1.1 总体与样本 SL 4.1
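The population is the full set of individuals of interest; the sample is the subset you actually measure.
总体(population)是所关注的全体个体;样本(sample)是你实际测量的子集。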
A random sample gives every individual an equal chance of being chosen, which is what makes its sample mean an unbiased estimator of the population mean. Numerical data is either discrete (countable, integer-valued, like number of siblings) or continuous (measurable on a continuum, like height). Bias arises from how you sample, what people choose to report, and which observations get excluded.
随机样本(random sample)让每个个体被选中的概率相等,这正是样本均值成为总体均值无偏估计的原因。数值数据要么是离散(可数、取整数值,如兄弟姐妹数),要么是连续(在连续区间上可测,如身高)。抽样偏差(sampling bias)来源于抽样方式、受访者选择性报告与哪些观测被剔除。
The IB uses $\mu$ for the population mean and $\sigma$ for the population SD. The sample mean is $\bar{x}$. Because IB defaults to the population formula, the sample SD it reports also divides by $n$ — some textbooks call this $\sigma_n$ on the calculator.
IB 用 $\mu$ 表示总体均值(population mean),$\sigma$ 表示总体标准差(population standard deviation)。样本均值(sample mean)为 $\bar{x}$。由于 IB 默认采用总体公式,它报告的样本标准差也是除以 $n$ — 部分教材在计算器上称之为 $\sigma_n$。
Simple random sample. Every individual has the same chance of being chosen, e.g. drawing names from a hat or using a random-number generator.
Systematic sample. Choose every $k$-th individual from an ordered list. Cheap but biased if the list has periodicity.
Stratified sample. Split the population into groups (strata: e.g. Year 1, Year 2, Year 3), then sample randomly within each in proportion to its size. Best when strata differ.
Quota sample. Fill pre-set numbers from each group, but selection within each is non-random (e.g. a street interviewer takes the first 10 women they see).
Convenience sample. Whoever is easiest to reach. Almost always biased, and the IB will call this out.
简单随机抽样(simple random sample)。每个个体被选中的概率相同,例如从帽子里抽名字或用随机数发生器。
系统抽样(systematic sample)。在有序名单上每隔 $k$ 个抽 1 个。成本低,但若名单存在周期性,则会产生偏差。
分层抽样(stratified sample)。将总体按特征划分为若干层(如 1 年级、2 年级、3 年级),再在每层内按其规模成比例地随机抽样。层间差异显著时最优。
配额抽样(quota sample)。各组配额事先确定,但组内选择并非随机(如街头访问员遇到前 10 名女性即停止)。
便利抽样(convenience sample)。谁最容易接触就抽谁。几乎一定有偏,IB 也一定会指出。
Selection bias. The sample is not representative (e.g. a 6 a.m. gym survey over-represents early risers).
Non-response bias. Those who do not respond differ systematically from those who do.
Self-reported data. People misreport (income up, weight down).
Volunteer bias. Volunteers tend to be more motivated, so results don't generalize.
Measurement bias. The instrument or wording skews answers (leading questions, faulty scale).
选择偏差(selection bias)。样本不具代表性(如清晨 6 点健身房调查偏向早起者)。
无回应偏差(non-response bias)。不回应者与回应者存在系统性差异。
自报数据(self-reported data)。人们会失真报告(收入虚报偏高、体重虚报偏低)。
志愿者偏差(volunteer bias)。志愿者通常更具积极性,结论难以推广。
测量偏差(measurement bias)。仪器或措辞使回答偏向(诱导性问题、刻度有误)。
Worked Example — Identifying the right method例题 — 选对方法
Problem: A school has $400$ Year 12 students, $350$ Year 13 students. The principal wants a sample of $30$ for a survey on study habits. Suggest a sampling method and explain.
Best choice: stratified random sampling. The two year groups likely differ in study habits, so we want the sample to reflect both in proportion.
Proportion: $\frac{400}{750} \approx 0.533$, $\frac{350}{750} \approx 0.467$. The shares come out as whole numbers, so no rounding is needed: take $30 \times \frac{400}{750} = 16$ from Year 12 and $30 \times \frac{350}{750} = 14$ from Year 13. Within each stratum, choose randomly (random-number generator).
Convenience sampling (just asking 30 friends) would be biased. A simple random sample across all $750$ would also work but might under-represent the smaller stratum by chance.
题目:某校 12 年级有 $400$ 人,13 年级有 $350$ 人。校长想抽取 $30$ 人调查学习习惯。建议一种抽样方法并说明理由。
最佳选择:分层随机抽样(stratified random sampling)。两个年级的学习习惯很可能不同,因此样本应按比例反映两层。
比例:$\frac{400}{750} \approx 0.533$,$\frac{350}{750} \approx 0.467$。两层的份额恰为整数,无需取整:从 12 年级取 $30 \times \frac{400}{750} = 16$ 人,从 13 年级取 $30 \times \frac{350}{750} = 14$ 人。每层内用随机数发生器随机抽取。
便利抽样(只问 30 位朋友)会产生偏差。在全部 $750$ 人中做简单随机抽样也可行,但可能偶然地使较小的那层代表不足。
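The proportional split in this example can be sketched in a few lines of Python. This is only an illustration: the function name `proportional_allocation` and the largest-remainder rounding rule are assumptions of this sketch, not something the IB prescribes. The rounding rule matters only when the exact shares are not whole numbers.

```python
from fractions import Fraction

def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes.

    Largest-remainder rounding keeps the parts adding up to sample_size.
    """
    total = sum(stratum_sizes.values())
    exact = {name: Fraction(size * sample_size, total)
             for name, size in stratum_sizes.items()}
    # Start from the whole-number part of each exact share.
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = sample_size - sum(allocation.values())
    # Give any remaining places to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation

allocation = proportional_allocation({"Year 12": 400, "Year 13": 350}, 30)
print(allocation)   # {'Year 12': 16, 'Year 13': 14}
```

Within each stratum the individuals would then be picked at random, for instance with `random.sample` on the list of names.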
Worked Example — Discrete vs. continuous例题 — 离散与连续之分
Problem: Classify the following as discrete or continuous:
(a) number of cars passing a sensor per hour, (b) time taken to run 100 m, (c) shoe size, (d) blood pressure.
(a) Discrete — you can count $0, 1, 2, \ldots$ cars; no fractional cars.
(b) Continuous — time can be any positive real number (resolution limited only by the clock).
(c) Discrete — shoe sizes are quantized ($7, 7.5, 8, \ldots$). Half-sizes still make the sample space discrete, not continuous.
(d) Continuous — blood pressure is measured on a continuum, though instruments report integers (mmHg). Treat as continuous unless the problem says otherwise.
题目:把下列变量分类为离散数据(discrete data)或连续数据(continuous data):
(a) 某传感器每小时通过的汽车数;(b) 跑 100 m 所用时间;(c) 鞋码;(d) 血压。
(a) 离散 — 汽车数只能是 $0, 1, 2, \ldots$,没有分数辆汽车。
(b) 连续 — 时间可以是任意正实数(分辨率仅受计时器限制)。
(c) 离散 — 鞋码是量化的($7, 7.5, 8, \ldots$)。即使存在半码,样本空间仍是离散的,不是连续的。
(d) 连续 — 血压在连续区间上测量,尽管仪器只报告整数(mmHg)。除非题目特别说明,否则按连续处理。
▸ Going deeper — Why "random" matters mathematically▸ 深入探究 — 为何"随机"在数学上至关重要
The mean $\bar{X}$ of a simple random sample of size $n$ from a population with mean $\mu$ and SD $\sigma$ satisfies:
从均值为 $\mu$、标准差为 $\sigma$ 的总体中抽取大小为 $n$ 的简单随机样本,其样本均值 $\bar{X}$ 满足:
$$ \mathbb{E}[\bar{X}] = \mu, \qquad \text{Var}(\bar{X}) = \frac{\sigma^2}{n}. $$The first equation is the unbiasedness property: the sample mean has no systematic error. The second is the square-root law: precision improves like $\sqrt{n}$. Both follow from linearity of expectation and independence of the draws.
第一个等式即无偏性:样本均值没有系统误差。第二个为平方根定律:精度按 $\sqrt{n}$ 改善。两者都来自期望的线性性与抽样的独立性。
If your sample is not random — say, a convenience sample — both properties can fail. $\mathbb{E}[\bar{X}]$ can differ from $\mu$ (bias), and the apparent precision $\sigma / \sqrt{n}$ over-states reality. The IB markscheme does not ask for this derivation in D1, but it explains why bias is fatal: more data does not fix it.
若样本不是随机的(例如便利抽样),两条性质都可能失效:$\mathbb{E}[\bar{X}]$ 与 $\mu$ 出现偏差,名义精度 $\sigma / \sqrt{n}$ 也高估了真实情况。IB 评分标准在 D1 不要求做这段推导,但它解释了为什么偏差是致命的 — 单靠加大数据量解决不了。
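The two properties above can be checked exactly on a toy case by listing every possible sample. The sketch below assumes a hypothetical three-value population and samples of size 2 drawn independently (with replacement); it uses exact fractions so no rounding is involved.

```python
from fractions import Fraction
from itertools import product

population = [2, 4, 9]   # hypothetical toy population
n = 2                    # sample size, drawn independently with replacement

N = len(population)
mu = Fraction(sum(population), N)
sigma_sq = sum((x - mu) ** 2 for x in population) / N

# Under independent draws every ordered sample of size n is equally likely.
sample_means = [Fraction(sum(s), n) for s in product(population, repeat=n)]

expected_mean = sum(sample_means) / len(sample_means)
variance_of_mean = (sum((m - expected_mean) ** 2 for m in sample_means)
                    / len(sample_means))

print(expected_mean == mu)                # True
print(variance_of_mean == sigma_sq / n)   # True
```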
1.2 Frequency Distributions & Histograms1.2 频率分布与直方图 SL 4.2
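Organize the data in a frequency distribution that gives the frequency of each value or class, then draw a histogram.
在频率分布(frequency distribution)里给出频数(frequency),再画直方图(histogram)。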
For a discrete variable, bars sit on integer values and the height is the frequency. For a continuous variable, bars touch and the area of each bar is proportional to its frequency. Relative frequency $=$ frequency $/$ total — lets you compare two datasets of different sizes.
对离散变量,柱子立在整数刻度上,高度即频数。对连续变量,柱子相连,每个柱子的面积与其频数成正比。相对频率(relative frequency)$=$ 频数 $/$ 总数 — 用于比较规模不同的两组数据。
When classes have unequal widths, plot frequency density on the $y$-axis so that bar area still represents frequency. IB rarely uses unequal classes — but when it does, this is the only safe way to read the histogram.
当各类组距不等时,在 $y$ 轴上画频率密度(frequency density),这样柱子的面积仍然代表频数。IB 很少出现不等组距 — 但一旦出现,这是读取直方图的唯一安全做法。
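For continuous data, IB uses class boundaries that are closed on the left and open on the right, e.g. $10 \le x < 20$. The class midpoint $x_i = \tfrac{1}{2}(\text{lower} + \text{upper})$ is used to estimate the mean of grouped data; see 1.4.
A few classes (5–7) give a smooth shape; too many give a noisy one. The IB will usually tell you the classes.
对连续数据,IB 采用左闭右开的组界,如 $10 \le x < 20$。组中值 $x_i = \tfrac{1}{2}(\text{lower} + \text{upper})$ 即(下界 + 上界)的一半,用于估计分组数据的均值,详见 1.4。
类数较少(5–7 组)时形状平滑;过多则噪声大。IB 通常会直接给出类。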
Worked Example — Building a histogram例题 — 画一张直方图
Problem: The masses (kg) of 40 students are grouped:
| Mass (kg) | $40 \le m < 50$ | $50 \le m < 60$ | $60 \le m < 70$ | $70 \le m < 80$ | $80 \le m < 90$ |
|---|---|---|---|---|---|
| Frequency | 4 | 11 | 14 | 8 | 3 |
All classes have equal width (10 kg), so we plot frequency on the $y$-axis directly:
The shape is roughly symmetric and unimodal, peaking in $60 \le m < 70$. The modal class is $60 \le m < 70$.
题目:40 名学生的体重(kg)已分组如下:
| 体重 (kg) | $40 \le m < 50$ | $50 \le m < 60$ | $60 \le m < 70$ | $70 \le m < 80$ | $80 \le m < 90$ |
|---|---|---|---|---|---|
| 频数 | 4 | 11 | 14 | 8 | 3 |
各类组距相等(10 kg),所以可直接在 $y$ 轴画频数:
形状大致对称、单峰,峰值在 $60 \le m < 70$。众数类(modal class)为 $60 \le m < 70$。
Worked Example — Relative frequency to compare例题 — 用相对频率做比较
Problem: Class A (25 students) and Class B (40 students) both have $8$ students scoring in $[70, 80)$. Which class has a higher proportion in that band?
Compare relative frequencies:
$$ \text{A: } \frac{8}{25} = 0.32 = 32\%, \qquad \text{B: } \frac{8}{40} = 0.20 = 20\%. $$Class A has the higher proportion. Raw frequencies are identical but say nothing on their own when class sizes differ.
题目:A 班 25 人、B 班 40 人,两班都各有 $8$ 人得分在 $[70, 80)$。哪个班该区间占比更高?
比较相对频率(relative frequency):
A 班占比更高。两班的原始频数相同,但在样本量不同的情况下单看频数毫无意义。
Describe a histogram in two steps: (1) shape (symmetric / left-skewed / right-skewed / bimodal / uniform), (2) centre and spread (where the peak is, how wide the distribution is).
Right-skewed: long tail to the right; typically mean $>$ median.
Left-skewed: long tail to the left; typically mean $<$ median.
Symmetric: mean $\approx$ median.
描述直方图分两步:(1) 形状(对称 / 左偏 / 右偏 / 双峰 / 均匀),(2) 中心与离散程度(峰值在哪、宽度多大)。
右偏(right-skewed):右尾长;通常均值 $>$ 中位数。
左偏(left-skewed):左尾长;通常均值 $<$ 中位数。
对称(symmetric):均值 $\approx$ 中位数。
▸ Going deeper — Why area, not height, encodes frequency▸ 深入探究 — 为什么是面积而非高度表示频数
For continuous data the histogram is a discrete approximation to an unknown probability density. The defining property of a density is that area under it on an interval equals the probability of landing in that interval. So a histogram bar's area should equal the frequency (or relative frequency) of its interval, not its height.
对于连续数据,直方图是某个未知概率密度(probability density)的离散近似。密度的本质性质是:在某区间上其曲线下面积等于落入该区间的概率。所以直方图每根柱子的面积应当等于其区间的频数(或相对频率),而非高度。
With equal-width classes, height and area are proportional, and IB lets you plot either — just say "frequency" on the $y$-axis. With unequal classes, plotting height alone visually exaggerates wide classes and hides narrow ones. Frequency density restores the proportionality:
组距相等时,高度与面积成正比,IB 允许直接画频数 — 只需在 $y$ 轴标注"frequency"。组距不等时,单画高度会在视觉上夸大宽类、压缩窄类。频率密度恢复了比例关系:
$$ \text{area of bar} = \text{freq. density} \times \text{width} = \frac{f_i}{\text{width}} \times \text{width} = f_i. $$This is also why the limiting curve of relative-frequency histograms (as bins shrink, $n$ grows) is the probability density function — the link to Topics D2–D3.
这也是为什么相对频率直方图在组距趋于零、$n$ 趋于无穷时的极限曲线就是概率密度函数(probability density function)— 即与主题 D2–D3 的连接点。
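A short sketch of the frequency-density calculation for unequal class widths. The class boundaries and frequencies below are hypothetical, chosen only to show that bar area recovers the frequency.

```python
# Hypothetical unequal-width classes: (lower bound, upper bound, frequency).
classes = [(0, 10, 6), (10, 15, 9), (15, 20, 12), (20, 40, 8)]

for lower, upper, freq in classes:
    width = upper - lower
    density = freq / width    # bar height to plot on the y-axis
    area = density * width    # bar area, which recovers the frequency
    print(f"{lower}-{upper}: density = {density:.2f}, area = {area:.0f}")

densities = [freq / (upper - lower) for lower, upper, freq in classes]
```

The widest class (20 to 40) has the second-highest frequency of the first and last classes but the lowest bar, which is exactly the correction that plotting raw frequency would miss.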
1.3 Cumulative Frequency & Box Plots1.3 累积频率与箱形图 SL 4.2
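A cumulative frequency curve lets you read off percentiles by eye; a box plot shows the five-number summary in a single picture.
累积频率曲线(cumulative frequency curve)可以让你目测百分位数;箱形图(box plot)则把五数概括(five-number summary)一图呈现。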
Plot cumulative frequency vs. upper class boundary, join the points smoothly. Read:
以累积频率为纵轴、各类上界为横轴作图,平滑连线后读取:
- median at $0.5 n$ on the $y$-axis;
- $Q_1$ at $0.25 n$, $Q_3$ at $0.75 n$;
- IQR $= Q_3 - Q_1$.
- 中位数对应 $y$ 轴上的 $0.5 n$;
- $Q_1$ 对应 $0.25 n$,$Q_3$ 对应 $0.75 n$;
- IQR $= Q_3 - Q_1$。
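A value $x$ is an outlier if $x < Q_1 - 1.5 \cdot \text{IQR}$ or $x > Q_3 + 1.5 \cdot \text{IQR}$.
若 $x < Q_1 - 1.5 \cdot \text{IQR}$ 或 $x > Q_3 + 1.5 \cdot \text{IQR}$,则 $x$ 为异常值(outlier)。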
Box runs from $Q_1$ to $Q_3$, with the median marked inside. Whiskers extend from the box to the smallest non-outlier and the largest non-outlier. Outliers are drawn as individual points outside the whiskers.
箱子由 $Q_1$ 延至 $Q_3$,中位数(median)在箱内标出。须(whiskers)从箱子延伸到最小的非异常值与最大的非异常值。异常值(outliers)单独以点的形式画在须外。
- Median $= $ value at position $\frac{n+1}{2}$ (interpolate if fractional).
- $Q_1$ $= $ median of the lower half; $Q_3$ $= $ median of the upper half.
- 中位数 $= $ 位置 $\frac{n+1}{2}$ 处的值(位置为分数时做线性插值)。
- $Q_1$(下四分位数,quartile)$= $ 数据下半部分的中位数;$Q_3$(上四分位数)$= $ 上半部分的中位数。
GDC)。
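If a value is an outlier, it must be drawn as a separate point on the box plot. The whiskers extend to the most extreme non-outliers, not to the outliers themselves.
若某个值是异常值(outlier),在箱形图上必须单独画成一个点。须延伸到最极端的非异常值,而不是延伸到异常值本身。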
Worked Example — Five-number summary by hand例题 — 手算五数概括
Problem: Find the five-number summary of $\{3, 5, 7, 8, 9, 11, 12, 14, 16, 18, 25\}$, then identify any outliers.
$n = 11$, already sorted.
Min $= 3$, max $= 25$. Median at position $\frac{11+1}{2} = 6$: the 6th value is $11$.
Lower half (excluding the median): $\{3, 5, 7, 8, 9\}$. $Q_1$ $=$ median of this $=$ $7$.
Upper half (excluding the median): $\{12, 14, 16, 18, 25\}$. $Q_3$ $=$ median of this $=$ $16$.
IQR $= 16 - 7 = 9$. Fences: $L = 7 - 1.5(9) = -6.5$, $U = 16 + 1.5(9) = 29.5$.
No data falls outside $[-6.5, 29.5]$, so no outliers. (The value $25$ is large but lies inside $U$.)
Five-number summary: $\{3, 7, 11, 16, 25\}$.
题目:求 $\{3, 5, 7, 8, 9, 11, 12, 14, 16, 18, 25\}$ 的五数概括,并指出所有异常值。
$n = 11$,已排序。
最小值 $= 3$,最大值 $= 25$。中位数对应位置 $\frac{11+1}{2} = 6$:第 6 个值为 $11$。
下半部分(不含中位数):$\{3, 5, 7, 8, 9\}$。$Q_1$ $=$ 其中位数 $=$ $7$。
上半部分(不含中位数):$\{12, 14, 16, 18, 25\}$。$Q_3$ $=$ 其中位数 $=$ $16$。
IQR $= 16 - 7 = 9$。围栏:$L = 7 - 1.5(9) = -6.5$,$U = 16 + 1.5(9) = 29.5$。
没有数据落在 $[-6.5, 29.5]$ 之外,故无异常值。($25$ 虽大,但仍在 $U$ 之内。)
五数概括:$\{3, 7, 11, 16, 25\}$。
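The by-hand procedure above can be sketched in Python as follows. The helper names `five_number_summary` and `outliers` are illustrative. The quartile rule coded here is the "median of each half, leaving out the middle value when $n$ is odd" rule used in this example; other software may use a different quartile convention and give slightly different $Q_1$ and $Q_3$.

```python
from statistics import median

def five_number_summary(data):
    """Five-number summary using the 'median of each half' quartile rule.

    For odd n the median itself is left out of both halves.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]
    upper_half = values[half + 1:] if n % 2 else values[half:]
    return (values[0], median(lower_half), median(values),
            median(upper_half), values[-1])

def outliers(data):
    """Values lying outside the 1.5 * IQR fences."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

data = [3, 5, 7, 8, 9, 11, 12, 14, 16, 18, 25]
print(five_number_summary(data))   # (3, 7, 11, 16, 25)
print(outliers(data))              # []
print(outliers(data + [40]))       # [40]
```

The last line adds a hypothetical extra value of 40 to show what a flagged outlier looks like.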
Worked Example — Outlier check例题 — 异常值检验
Problem: Test scores have $Q_1 = 52$, median $= 64$, $Q_3 = 73$, and one student scored $100$. Is $100$ an outlier?
$\text{IQR} = 73 - 52 = 21$. Upper fence $U = 73 + 1.5(21) = 73 + 31.5 = 104.5$.
Since $100 < 104.5$, the score is not an outlier under the IQR rule — though it is the maximum.
题目:某次考试 $Q_1 = 52$,中位数 $= 64$,$Q_3 = 73$,有一位学生得了 $100$ 分。问 $100$ 是否为异常值?
$\text{IQR} = 73 - 52 = 21$。上围栏 $U = 73 + 1.5(21) = 73 + 31.5 = 104.5$。
由于 $100 < 104.5$,按 IQR 规则它不是异常值 — 但它确实是最大值。
Worked Example — Reading the box plot例题 — 看懂箱形图
The five-number summary $\{3, 7, 11, 16, 25\}$ above becomes:
The right whisker is longer than the left — suggests a mild right skew (mean would be slightly above the median of $11$). The IQR-box itself is centred near the median, which is the typical look.
上述五数概括 $\{3, 7, 11, 16, 25\}$ 对应的箱形图为:
右须比左须长 — 提示存在轻度右偏(均值略高于中位数 $11$)。IQR 箱子本身大致以中位数为中心,是常见形态。
▸ Going deeper — Why $1.5 \cdot \text{IQR}$?▸ 深入探究 — 为何选择 $1.5 \cdot \text{IQR}$?
The cutoff is conventional, not derived from a deep theorem, but it isn't arbitrary. John Tukey, who introduced the box plot, chose $1.5$ as a compromise: large enough that normal data rarely produces "outliers" (about $0.35\%$ in each tail, $0.7\%$ in total), small enough that genuine outliers in skewed data still get flagged.
这是约定俗成的阈值,并非源自某条深奥定理,但也并非随意。提出箱形图的 John Tukey 选取 $1.5$ 作为折中:足够大,使得正态数据极少产生"异常值"(每侧约 $0.35\%$,合计约 $0.7\%$);又足够小,能保证偏态数据中的真异常值仍被标出。
Some texts also use a $3 \cdot \text{IQR}$ "extreme outlier" fence. The IB does not require it, only the $1.5 \cdot \text{IQR}$ rule above. When you flag an outlier in a written answer, name the rule explicitly. For a hypothetical extra value of $31$ in the data set of the five-number-summary example: "$x = 31$ lies above $Q_3 + 1.5\, \text{IQR} = 29.5$, so it is an outlier."
部分教材另外采用 $3 \cdot \text{IQR}$ 的"极端异常值"围栏。IB 不要求,只需上述 $1.5 \cdot \text{IQR}$ 规则。在书面答题时标出异常值,要明确写出规则。例如,假设五数概括例题的数据中多出一个值 $31$:"$x = 31$ 高于 $Q_3 + 1.5\, \text{IQR} = 29.5$,故为异常值。"
1.4 Measures of Central Tendency1.4 集中趋势的度量 SL 4.3
The three measures of central tendency are the mean, the median and the mode.
集中趋势的三种度量:均值、中位数、众数(mean、median、mode)。
The mean $\bar{x} = \tfrac{1}{n}\sum x_i$ is the centre of mass — sensitive to outliers. The median is the middle value when ordered — robust to outliers, the right choice for skewed data. The mode is the most frequent value (the modal class for grouped data). On grouped data the mean is estimated using class midpoints.
均值 $\bar{x} = \tfrac{1}{n}\sum x_i$ 是数据的质心 — 对异常值敏感。中位数是排序后的中间值 — 对异常值稳健,是偏态数据的恰当选择。众数是最常出现的值(分组数据则报告众数类(modal class))。分组数据的均值用组中值估计。
$$ \bar{x} = \frac{\sum x_i}{n}, \qquad \bar{x} \approx \frac{\sum f_i x_i}{\sum f_i}. $$
Left: the exact sample mean for raw data. Right: an estimate for grouped data, where $x_i$ is the midpoint of class $i$ and $f_i$ its frequency. The estimate is only as good as the assumption that within-class values cluster near the midpoint. On Paper 1 the IB expects you to write this approximation explicitly.
左:原始数据的精确样本均值。右:分组数据的估计值,其中 $x_i$ 为第 $i$ 类的组中值,$f_i$ 为对应频数。该估计仅在"类内数据聚集于组中值附近"的假设下才好。在试卷 1 中,IB 期望你显式写出这是近似而非等号。
Mean. Default for symmetric distributions; numerically convenient (every later formula uses it).
Median. Default for skewed distributions or any dataset with notable outliers (e.g. incomes, house prices).
Mode (or modal class). Used when "most common" is the question (e.g. shoe-size stock to keep). For continuous data, report the modal class, not a single mode.
均值(mean)。对称分布的默认选择;在数值上方便(后续每一个公式都用到它)。
中位数(median)。偏态分布或存在明显异常值的数据集的默认选择(如收入、房价)。
众数(mode,或众数类 modal class)。当问题问"最常见"时用(如该备货哪一种鞋码)。连续数据应报告众数类,而不是单一众数。
Worked Example — Raw data mean and median例题 — 原始数据的均值与中位数
Problem: $\{4, 6, 7, 7, 9, 12, 15\}$. Find the mean, median, and mode.
$n = 7$. Sum $= 4 + 6 + 7 + 7 + 9 + 12 + 15 = 60$.
$$ \bar{x} = \frac{60}{7} \approx 8.57. $$Median (already sorted) at position $\tfrac{7+1}{2} = 4$: the 4th value is $7$.
Mode: $7$ appears twice; all others once. Mode $= 7$.
题目:$\{4, 6, 7, 7, 9, 12, 15\}$。求均值、中位数与众数。
$n = 7$。和 $= 4 + 6 + 7 + 7 + 9 + 12 + 15 = 60$。
$$ \bar{x} = \frac{60}{7} \approx 8.57. $$中位数(数据已排序)位于位置 $\tfrac{7+1}{2} = 4$:第 4 个值为 $7$。
众数:$7$ 出现两次,其余均只出现一次。众数 $= 7$。
Worked Example — Estimating the mean from grouped data例题 — 由分组数据估计均值
Problem: Using the masses table from 1.2:
| Class | $40 \le m < 50$ | $50 \le m < 60$ | $60 \le m < 70$ | $70 \le m < 80$ | $80 \le m < 90$ |
|---|---|---|---|---|---|
| Midpoint $x_i$ | 45 | 55 | 65 | 75 | 85 |
| Frequency $f_i$ | 4 | 11 | 14 | 8 | 3 |
$\sum f_i = 40$. Compute $\sum f_i x_i$:
$$ 4(45) + 11(55) + 14(65) + 8(75) + 3(85) = 180 + 605 + 910 + 600 + 255 = 2550. $$ $$ \bar{x} \approx \frac{2550}{40} = 63.75 \text{ kg}. $$Modal class: $60 \le m < 70$ (highest frequency, $14$). Since the data is continuous, we report the class, not a single value.
题目:使用 1.2 节的体重分组表:
| 类 | $40 \le m < 50$ | $50 \le m < 60$ | $60 \le m < 70$ | $70 \le m < 80$ | $80 \le m < 90$ |
|---|---|---|---|---|---|
| 组中值 $x_i$ | 45 | 55 | 65 | 75 | 85 |
| 频数 $f_i$ | 4 | 11 | 14 | 8 | 3 |
$\sum f_i = 40$。计算 $\sum f_i x_i$:
$$ 4(45) + 11(55) + 14(65) + 8(75) + 3(85) = 180 + 605 + 910 + 600 + 255 = 2550. $$ $$ \bar{x} \approx \frac{2550}{40} = 63.75 \text{ kg}. $$众数类(modal class):$60 \le m < 70$(频数最高,$14$)。数据连续,因此报告的是类,而不是单一值。
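The grouped-mean estimate is a weighted average of the midpoints, which takes only a few lines to check. The variable names below are illustrative.

```python
midpoints = [45, 55, 65, 75, 85]    # class midpoints x_i (kg)
frequencies = [4, 11, 14, 8, 3]     # class frequencies f_i

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
grouped_mean = sum_fx / n           # an estimate, since midpoints stand in for the data

print(n, sum_fx, grouped_mean)      # 40 2550 63.75
```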
Worked Example — Mean is fragile, median is not例题 — 均值脆弱,中位数稳健
Problem: Salaries (in thousands): $\{32, 34, 35, 36, 38, 40, 250\}$. Find the mean and median, and comment.
$\bar{x} = \frac{32 + 34 + 35 + 36 + 38 + 40 + 250}{7} = \frac{465}{7} \approx 66.4$ thousand.
Median (sorted, $n = 7$, position 4): $36$ thousand.
Comment. The mean ($\approx 66.4$k) is dragged upward by the single high earner ($250$k). The median ($36$k) represents the typical employee much better. With outliers present, IB usually rewards naming the median as the "more appropriate measure".
题目:工资(单位:千元):$\{32, 34, 35, 36, 38, 40, 250\}$。求均值与中位数,并加以评论。
$\bar{x} = \frac{32 + 34 + 35 + 36 + 38 + 40 + 250}{7} = \frac{465}{7} \approx 66.4$ 千元。
中位数(排序后 $n = 7$,第 4 位):$36$ 千元。
评论。均值($\approx 66.4$ 千元)被那位高收入者($250$ 千元)单方面拉高。中位数($36$ 千元)更能代表典型员工。当存在异常值(outlier)时,IB 通常奖励将中位数判为"更合适的度量"。
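To see how much of the mean is due to the single high earner, one can recompute both measures with that value removed. This is a small sketch using Python's standard `statistics` module.

```python
from statistics import mean, median

salaries = [32, 34, 35, 36, 38, 40, 250]   # thousands
without_top = salaries[:-1]                # same data with the 250 removed

print(round(mean(salaries), 1), median(salaries))         # 66.4 36
print(round(mean(without_top), 1), median(without_top))   # 35.8 35.5
```

Removing one value moves the mean by about 30 thousand but the median by only 0.5 thousand.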
▸ Going deeper — The mean as a minimizer▸ 深入探究 — 均值作为最小化解
The mean is the unique value $c$ that minimizes the sum of squared deviations:
均值是使偏差平方和最小的唯一常数 $c$:
$$ \sum_{i=1}^{n} (x_i - c)^2 \;\;\text{is minimised at}\;\; c = \bar{x}. $$Quick proof: differentiate with respect to $c$, set to zero:
简证:对 $c$ 求导,并令导数为零:
$$ \frac{d}{dc}\sum (x_i - c)^2 = -2 \sum (x_i - c) = 0 \;\;\Longrightarrow\;\; c = \frac{\sum x_i}{n} = \bar{x}. $$The median, by contrast, minimises the sum of absolute deviations $\sum |x_i - c|$. Squaring penalises a single big deviation a lot more than absolute value does — that's exactly why the mean is sensitive to outliers and the median isn't. This is the deep reason behind variance / SD using squares: they pair naturally with the mean.
相对地,中位数(median)使绝对偏差之和 $\sum |x_i - c|$ 最小。平方对单个大偏差的惩罚远高于绝对值 — 这正是均值对异常值敏感、而中位数稳健的根源。这也是方差(variance)/ 标准差(standard deviation)必须用平方的深层理由:它们与均值天然配对。
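The two minimization claims can be checked numerically with a brute-force search over candidate centres. The sketch below reuses the salary data from 1.4; the grid of candidates in steps of 0.1 is an arbitrary choice for illustration, so the squared-deviation minimizer is only found to the nearest 0.1.

```python
from statistics import mean, median

data = [32, 34, 35, 36, 38, 40, 250]

def sum_sq(c):
    """Sum of squared deviations from c."""
    return sum((x - c) ** 2 for x in data)

def sum_abs(c):
    """Sum of absolute deviations from c."""
    return sum(abs(x - c) for x in data)

candidates = [c / 10 for c in range(300, 2501)]   # 30.0, 30.1, ..., 250.0

best_sq = min(candidates, key=sum_sq)
best_abs = min(candidates, key=sum_abs)

print(best_sq, round(mean(data), 1))   # 66.4 66.4
print(best_abs, median(data))          # 36.0 36
```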
1.5 Measures of Dispersion — Variance & Standard Deviation1.5 离散程度的度量 — 方差与标准差 SL 4.3
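Range $= \max - \min$ (very fragile). Interquartile range IQR $= Q_3 - Q_1$ (robust). The variance $\sigma^2$ is the average squared deviation from the mean; the standard deviation (SD) $\sigma$ is the square root of the variance and has the same units as the data. IB defaults to dividing by $n$, not $n - 1$. Always use $\sigma$, $\sigma^2$ (the calculator displays this as $\sigma_n$ or $\sigma_x$, not $s_x$).
极差(range) $= \max - \min$(非常脆弱)。四分位距(interquartile range,IQR) $= Q_3 - Q_1$(稳健)。方差(variance) $\sigma^2$ 是与均值偏差平方的平均;标准差(standard deviation,SD) $\sigma$ 是方差的平方根,单位与原始数据相同。IB 默认除以 $n$,而非 $n - 1$。始终使用 $\sigma$、$\sigma^2$(计算器显示为 $\sigma_n$ 或 $\sigma_x$,不是 $s_x$)。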
For grouped data with class midpoints $x_i$, frequencies $f_i$, and $n = \sum f_i$:
对分组数据:组中值为 $x_i$,频数为 $f_i$,$n = \sum f_i$:
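$$ \sigma = \sqrt{\frac{\sum_i f_i\,(x_i - \bar{x})^2}{\sum_i f_i}}. $$
For raw data the variance can also be written in computational form:
$$ \sigma^2 = \frac{\sum x_i^2}{n} - \bar{x}^2 = \overline{x^2} - \bar{x}^2. $$
This is the by-hand-friendly version: compute $\sum x_i$ and $\sum x_i^2$ once, then subtract. It avoids squaring many awkward non-integer deviations when $\bar{x}$ is not a whole number.
对原始数据,方差还可以写成计算式 $\sigma^2 = \overline{x^2} - \bar{x}^2$。这是适合手算的形式:只算一次 $\sum x_i$ 与 $\sum x_i^2$,再做减法。当 $\bar{x}$ 不是整数时,可避免逐个计算并平方许多不整齐的偏差。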
Worked Example — SD from raw data例题 — 由原始数据求标准差
Problem: Find the standard deviation of $\{4, 6, 7, 9, 14\}$.
$n = 5$. $\sum x_i = 40$, so $\bar{x} = 8$.
Deviations and their squares:
$$ \begin{aligned} (4 - 8)^2 &= 16 \\ (6 - 8)^2 &= 4 \\ (7 - 8)^2 &= 1 \\ (9 - 8)^2 &= 1 \\ (14 - 8)^2 &= 36 \end{aligned} \qquad \sum = 58. $$ $$ \sigma^2 = \frac{58}{5} = 11.6, \qquad \sigma = \sqrt{11.6} \approx 3.41. $$
题目:求 $\{4, 6, 7, 9, 14\}$ 的标准差。
$n = 5$。$\sum x_i = 40$,所以 $\bar{x} = 8$。
偏差及其平方:
$$ \begin{aligned} (4 - 8)^2 &= 16 \\ (6 - 8)^2 &= 4 \\ (7 - 8)^2 &= 1 \\ (9 - 8)^2 &= 1 \\ (14 - 8)^2 &= 36 \end{aligned} \qquad \sum = 58. $$ $$ \sigma^2 = \frac{58}{5} = 11.6, \qquad \sigma = \sqrt{11.6} \approx 3.41. $$
Worked Example — SD by the computational form例题 — 用计算式求标准差
Problem: Same data $\{4, 6, 7, 9, 14\}$, using $\sigma^2 = \overline{x^2} - \bar{x}^2$.
$\sum x_i^2 = 16 + 36 + 49 + 81 + 196 = 378$. $\overline{x^2} = 378/5 = 75.6$. $\bar{x}^2 = 64$.
$$ \sigma^2 = 75.6 - 64 = 11.6, \qquad \sigma \approx 3.41. $$Same answer, faster — especially useful on Paper 1 when the data is small but the deviations would be messy.
题目:同一数据 $\{4, 6, 7, 9, 14\}$,改用 $\sigma^2 = \overline{x^2} - \bar{x}^2$。
$\sum x_i^2 = 16 + 36 + 49 + 81 + 196 = 378$。$\overline{x^2} = 378/5 = 75.6$。$\bar{x}^2 = 64$。
$$ \sigma^2 = 75.6 - 64 = 11.6, \qquad \sigma \approx 3.41. $$结果相同,速度更快 — 在试卷 1(Paper 1)数据量小但偏差不整齐时尤为好用。
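The agreement between the definition and the computational form is easy to confirm numerically. A minimal sketch (variable names are illustrative):

```python
from math import sqrt

data = [4, 6, 7, 9, 14]
n = len(data)
mean_x = sum(data) / n

# Definition: average squared deviation from the mean.
var_definition = sum((x - mean_x) ** 2 for x in data) / n
# Computational form: mean of the squares minus the square of the mean.
var_computational = sum(x * x for x in data) / n - mean_x ** 2

print(round(var_definition, 4), round(var_computational, 4))   # 11.6 11.6
print(round(sqrt(var_definition), 2))                          # 3.41
```

The values are rounded before printing because the two forms can differ in the last floating-point digit.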
Worked Example — Grouped-data SD例题 — 分组数据标准差
Problem: For the masses table (1.4), with $\bar{x} = 63.75$, find $\sigma$.
Deviations from $\bar{x}$, weighted:
$$ \begin{aligned} 4(45 - 63.75)^2 &= 4(351.5625) = 1406.25 \\ 11(55 - 63.75)^2 &= 11(76.5625) = 842.1875 \\ 14(65 - 63.75)^2 &= 14(1.5625) = 21.875 \\ 8(75 - 63.75)^2 &= 8(126.5625) = 1012.5 \\ 3(85 - 63.75)^2 &= 3(451.5625) = 1354.6875 \end{aligned} $$ $$ \sum f_i (x_i - \bar{x})^2 = 4637.5, \qquad \sigma = \sqrt{\frac{4637.5}{40}} = \sqrt{115.9375} \approx 10.77 \text{ kg}. $$(On Paper 2 you would put midpoints in L1, frequencies in L2, and read $\sigma_x$ directly from 1-Var Stats.)
题目:用 1.4 节的体重分组表,已知 $\bar{x} = 63.75$,求 $\sigma$。
对 $\bar{x}$ 的加权偏差:
$$ \begin{aligned} 4(45 - 63.75)^2 &= 4(351.5625) = 1406.25 \\ 11(55 - 63.75)^2 &= 11(76.5625) = 842.1875 \\ 14(65 - 63.75)^2 &= 14(1.5625) = 21.875 \\ 8(75 - 63.75)^2 &= 8(126.5625) = 1012.5 \\ 3(85 - 63.75)^2 &= 3(451.5625) = 1354.6875 \end{aligned} $$ $$ \sum f_i (x_i - \bar{x})^2 = 4637.5, \qquad \sigma = \sqrt{\frac{4637.5}{40}} = \sqrt{115.9375} \approx 10.77 \text{ kg}. $$(在试卷 2 上你会把组中值放入 L1、频数放入 L2,然后从 1-Var Stats 直接读出 $\sigma_x$。)
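The same grouped-data calculation, written out in Python as a check on the arithmetic. It mirrors the weighted-deviation formula rather than any calculator routine.

```python
from math import sqrt

midpoints = [45, 55, 65, 75, 85]    # class midpoints x_i (kg)
frequencies = [4, 11, 14, 8, 3]     # class frequencies f_i

n = sum(frequencies)
mean_x = sum(f * x for f, x in zip(frequencies, midpoints)) / n
weighted_sq_dev = sum(f * (x - mean_x) ** 2
                      for f, x in zip(frequencies, midpoints))
sigma = sqrt(weighted_sq_dev / n)   # population form: divide by n

print(weighted_sq_dev, round(sigma, 2))   # 4637.5 10.77
```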
▸ Going deeper — Equivalence of the two SD formulas▸ 深入探究 — 两个标准差公式等价
Both forms compute the same quantity. Expand $(x_i - \bar{x})^2$:
两种形式计算同一个量。展开 $(x_i - \bar{x})^2$:
$$ \sum (x_i - \bar{x})^2 = \sum x_i^2 - 2\bar{x}\sum x_i + n \bar{x}^2 = \sum x_i^2 - 2\bar{x} (n\bar{x}) + n\bar{x}^2 = \sum x_i^2 - n\bar{x}^2. $$Divide by $n$:
两边除以 $n$:
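$$ \sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n} = \frac{\sum x_i^2}{n} - \bar{x}^2 = \overline{x^2} - \bar{x}^2. $$
The computational form is preferred on Paper 1 because it needs only $\sum x_i$ and $\sum x_i^2$, so you never have to square awkward non-integer deviations. In floating-point software the caution runs the other way: subtracting two large, nearly equal numbers ($\overline{x^2}$ and $\bar{x}^2$) can cause catastrophic cancellation, so the deviation form is the numerically safer one there.
试卷 1 优先用计算式:它只需要 $\sum x_i$ 与 $\sum x_i^2$,不必对不整齐的非整数偏差逐个平方。在浮点运算的软件中情况正好相反:两个数值很大且相近的量($\overline{x^2}$ 与 $\bar{x}^2$)相减可能造成灾难性的有效数字抵消,因此那里用偏差形式在数值上更安全。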
▸ Going deeper — Why the IB uses $n$, not $n - 1$▸ 深入探究 — 为何 IB 用 $n$ 而非 $n - 1$
For a sample of size $n$ drawn from a larger population, the unbiased estimator of the population variance divides by $n - 1$ (the "Bessel correction"). The $-1$ accounts for using $\bar{x}$ in place of the unknown $\mu$ — one degree of freedom is "spent" estimating the mean.
从更大总体(population)中抽取容量为 $n$ 的样本(sample)时,总体方差的无偏估计应除以 $n - 1$("贝塞尔校正")。这个 $-1$ 用来补偿用 $\bar{x}$ 代替未知 $\mu$ — 一个自由度"花"在了估计均值上。
However, the IB syllabus treats the data given in the question as the entire population of interest, so $\mu$ is known (it equals $\bar{x}$ when you have everyone). With the whole population in hand, dividing by $n$ gives the population variance exactly; nothing is being estimated, so no correction is needed.
不过,IB 教学大纲把题目给定的数据当作所关注的整个总体,因此 $\mu$ 已知(涵盖全部个体时它就等于 $\bar{x}$)。既然手中就是整个总体,除以 $n$ 得到的正是总体方差本身;这里并没有在做估计,因此无需校正。
Practically: use $\sigma_x$ on your calculator, never $s_x$. If your final answer uses $s_x$ where the markscheme used $\sigma_x$, you'll lose accuracy marks. For $n$ large, the two differ negligibly — but on Paper 1 with tiny $n$, the gap is noticeable.
实务上:计算器上始终用 $\sigma_x$,绝不用 $s_x$。评分标准用 $\sigma_x$ 而你用 $s_x$,会丢失精确度分。$n$ 很大时两者差异可忽略 — 但试卷 1 上 $n$ 很小,差距很明显。
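Python's standard `statistics` module exposes both conventions, which makes the size of the gap for small $n$ easy to see. `pstdev` divides by $n$ and `stdev` divides by $n - 1$; the mapping to the calculator labels $\sigma_x$ and $s_x$ in the comments follows the description above.

```python
from statistics import pstdev, stdev

data = [4, 6, 7, 9, 14]

sigma = pstdev(data)   # divides by n      (corresponds to sigma_x)
s = stdev(data)        # divides by n - 1  (corresponds to s_x)

print(round(sigma, 2), round(s, 2))   # 3.41 3.81
```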
1.6 Effect of Linear Transformations on Data1.6 数据线性变换的影响 SL 4.3
Apply a linear transformation $y_i = a x_i + b$ to every data value. Then:
对每个数据做线性变换(linear transformation)$y_i = a x_i + b$。则:
$$ \bar{y} = a \bar{x} + b, \qquad \sigma_y = |a|\, \sigma_x, \qquad \sigma_y^2 = a^2 \sigma_x^2. $$
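- Adding $b$ shifts the mean by $b$; SD unchanged.
- Multiplying by $a$ multiplies the mean by $a$; multiplies the SD by $|a|$.
- Median, $Q_1$, $Q_3$ transform the same way as the mean (if $a < 0$ the order reverses, so the old $Q_3$ becomes the new $Q_1$ and vice versa).
- IQR and range transform the same way as the SD: multiplied by $|a|$, unaffected by $b$.
- 加 $b$ 使均值平移 $b$;标准差不变。
- 乘以 $a$ 使均值乘以 $a$;标准差乘以 $|a|$。
- 中位数、$Q_1$、$Q_3$ 按均值同样的方式变换(若 $a < 0$,顺序反转,原来的 $Q_3$ 变为新的 $Q_1$,反之亦然)。
- IQR 与极差按标准差同样的方式变换:乘以 $|a|$,不受 $b$ 影响。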
Mean. $\bar{y} = \tfrac{1}{n}\sum (a x_i + b) = a \tfrac{1}{n}\sum x_i + b = a\bar{x} + b$.
Variance. $y_i - \bar{y} = (a x_i + b) - (a\bar{x} + b) = a(x_i - \bar{x})$. Square, average:
$$ \sigma_y^2 = \frac{1}{n}\sum a^2 (x_i - \bar{x})^2 = a^2 \sigma_x^2 \quad\Longrightarrow\quad \sigma_y = |a|\sigma_x. $$
The shift $b$ vanishes from the variance: spread is translation-invariant. The scale $a$ comes out squared in the variance and as $|a|$ in the SD.
均值。$\bar{y} = \tfrac{1}{n}\sum (a x_i + b) = a \tfrac{1}{n}\sum x_i + b = a\bar{x} + b$。
方差。$y_i - \bar{y} = (a x_i + b) - (a\bar{x} + b) = a(x_i - \bar{x})$。平方再求平均:
$$ \sigma_y^2 = \frac{1}{n}\sum a^2 (x_i - \bar{x})^2 = a^2 \sigma_x^2 \quad\Longrightarrow\quad \sigma_y = |a|\sigma_x. $$
平移 $b$ 在方差中消失:离散程度对平移不变。比例因子 $a$ 在方差中以平方出现,在标准差中以 $|a|$ 出现。
Worked Example — Marks scaled and shifted例题 — 分数的伸缩与平移
Problem: A class's raw marks have $\bar{x} = 56$, $\sigma_x = 12$. The teacher scales each mark by $y = 1.2 x + 10$ to produce final marks. Find $\bar{y}$ and $\sigma_y$.
$$ \bar{y} = 1.2(56) + 10 = 67.2 + 10 = 77.2. $$ $$ \sigma_y = |1.2|(12) = 14.4. $$The shift by $10$ moves the mean only. The scale by $1.2$ inflates both. The IB markscheme awards one mark for each of the two transformed values.
题目:某班原始分数有 $\bar{x} = 56$,$\sigma_x = 12$。老师按 $y = 1.2 x + 10$ 调整每个分数生成最终成绩。求 $\bar{y}$ 与 $\sigma_y$。
$$ \bar{y} = 1.2(56) + 10 = 67.2 + 10 = 77.2. $$ $$ \sigma_y = |1.2|(12) = 14.4. $$$+10$ 仅平移均值;$\times 1.2$ 让两者都放大。IB 评分标准对两个变换后的值各给 1 分。
Worked Example — Reverse direction例题 — 反向
Problem: Three measurements have variance $25$. Each is then halved. What is the new variance and SD?
Transformation: $y = 0.5 x$ (no shift). Variance multiplies by $a^2 = 0.25$:
$$ \sigma_y^2 = (0.5)^2 (25) = 6.25, \qquad \sigma_y = \sqrt{6.25} = 2.5. $$Equivalently $\sigma_y = 0.5 \cdot \sigma_x = 0.5 \cdot 5 = 2.5$.
题目:三个观测的方差为 $25$。然后每个都减半。新的方差与标准差是多少?
变换:$y = 0.5 x$(无平移)。方差乘以 $a^2 = 0.25$:
$$ \sigma_y^2 = (0.5)^2 (25) = 6.25, \qquad \sigma_y = \sqrt{6.25} = 2.5. $$等价地,$\sigma_y = 0.5 \cdot \sigma_x = 0.5 \cdot 5 = 2.5$。
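The transformation rules can be verified on any data set. The raw marks below are hypothetical (the worked examples give only summary statistics), and the second transformation uses a negative scale factor to show why the SD rule needs $|a|$.

```python
from statistics import mean, pstdev

raw = [40, 52, 56, 61, 71]   # hypothetical raw marks
a, b = 1.2, 10

scaled = [a * x + b for x in raw]
mean_ok = abs(mean(scaled) - (a * mean(raw) + b)) < 1e-9
sd_ok = abs(pstdev(scaled) - abs(a) * pstdev(raw)) < 1e-9
print(mean_ok, sd_ok)        # True True

# A negative scale factor flips the data but cannot make the SD negative.
flipped = [-2 * x + 3 for x in raw]
flipped_sd_ok = abs(pstdev(flipped) - 2 * pstdev(raw)) < 1e-9
print(flipped_sd_ok)         # True
```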
▸ Going deeper — Standardization and the $z$-score▸ 深入探究 — 标准化与 $z$ 分数
The transformation $z = \dfrac{x - \mu}{\sigma}$ centres and scales the data so that $\bar{z} = 0$ and $\sigma_z = 1$.
变换 $z = \dfrac{x - \mu}{\sigma}$ 把数据居中、归一化为 $\bar{z} = 0$、$\sigma_z = 1$。
Plug into the linear-transformation rules with $a = 1/\sigma$ and $b = -\mu/\sigma$:
在线性变换规则中代入 $a = 1/\sigma$、$b = -\mu/\sigma$:
$$ \bar{z} = \frac{1}{\sigma}\,\mu - \frac{\mu}{\sigma} = 0, \qquad \sigma_z = \left|\frac{1}{\sigma}\right|\,\sigma = 1. $$
The $z$-score answers "how many standard deviations is this observation from the mean?" It is the foundation for the normal-distribution work in D3 and for outlier-flagging in real-world statistics. It is also the cleanest demonstration that linear transformations with a positive scale factor are information-preserving: the shape of the distribution, its skewness, and its PMCC with any other variable are unchanged. A negative scale factor mirrors the shape and flips the signs of skewness and $r$.
$z$ 分数回答"这个观测距均值有多少个标准差"。它是 D3 正态分布工作的基础,也是现实统计中标记异常值的工具。这也是比例因子为正的线性变换保持信息最简洁的演示:分布形状、偏态以及与其他变量的 PMCC 都保持不变。比例因子为负时,形状左右翻转,偏态与 $r$ 变号。
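As a quick check of $\bar{z} = 0$ and $\sigma_z = 1$, here is a minimal sketch. The function name `z_scores` and the marks are hypothetical, and the population SD (division by $n$) is assumed throughout.

```python
from statistics import fmean, pstdev


def z_scores(xs):
    """Standardise xs using the population mean and SD (division by n)."""
    mu = fmean(xs)
    sigma = pstdev(xs)
    if sigma == 0:
        raise ValueError("all observations are equal; z-scores are undefined")
    return [(x - mu) / sigma for x in xs]


marks = [42, 50, 56, 61, 71]  # hypothetical raw marks
zs = z_scores(marks)

print(fmean(zs))   # approximately 0, up to rounding error
print(pstdev(zs))  # approximately 1, up to rounding error
```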
1.7 Bivariate Data: Scatter, Correlation, Regression1.7 双变量数据:散点图、相关与回归 SL 4.4
Bivariate data $(x_i, y_i)$ are displayed on a scatter diagram. Three questions follow.
双变量数据(bivariate data)$(x_i, y_i)$ 用散点图(scatter diagram)展示。随之有三个问题。
- Direction & form. Positive / negative / no association. Linear / curved.
- Strength. Pearson's PMCC $r = \dfrac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} \in [-1, 1]$. Closer to $\pm 1$ ⇒ stronger linear association.
- Best-fit line. The regression line of $y$ on $x$, $y = ax + b$, always passes through $(\bar{x}, \bar{y})$. Use it to predict $y$ from $x$ inside the data range.
- 方向与形态。正相关 / 负相关 / 无相关。线性 / 非线性。
- 强弱。皮尔逊积矩相关系数(Pearson's product-moment correlation coefficient,PMCC)$r = \dfrac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} \in [-1, 1]$。越接近 $\pm 1$ ⇒ 线性关联越强。
- 最佳拟合直线。$y$ 关于 $x$ 的回归直线(regression line)$y = ax + b$ 恒过 $(\bar{x}, \bar{y})$。用它在数据范围内由 $x$ 预测 $y$。
Computational form: $S_{xy} = \sum x_i y_i - n\bar{x}\bar{y}$, and similarly $S_{xx} = \sum x_i^2 - n\bar{x}^2$.
计算式:$S_{xy} = \sum x_i y_i - n\bar{x}\bar{y}$,类似地 $S_{xx} = \sum x_i^2 - n\bar{x}^2$。
Equivalently $y = a + bx$ in the IB's notation (slope $= b$, intercept $= a$) — this matches the GDC's LinReg(a + bx) output. The line always passes through $(\bar{x}, \bar{y})$.
在 IB 记号下等价写为 $y = a + bx$(斜率 $= b$,截距 $= a$)— 与图形计算器(GDC)的 LinReg(a + bx) 输出一致。直线恒过 $(\bar{x}, \bar{y})$。
Sign. $r > 0$: $y$ tends to increase with $x$. $r < 0$: $y$ tends to decrease with $x$.
符号。$r > 0$:$y$ 随 $x$ 倾向增加。$r < 0$:$y$ 随 $x$ 倾向减少。
Strength. Rough thresholds (these vary by source; the IB accepts a sensible verbal description):
强弱。常见阈值(不同教材略有出入 — IB 接受合理的文字描述):
- $|r| \ge 0.9$ very strong
- $0.7 \le |r| < 0.9$ strong
- $0.5 \le |r| < 0.7$ moderate
- $0.3 \le |r| < 0.5$ weak
- $|r| < 0.3$ very weak / no linear association
- $|r| \ge 0.9$ 极强
- $0.7 \le |r| < 0.9$ 强
- $0.5 \le |r| < 0.7$ 中等
- $0.3 \le |r| < 0.5$ 弱
- $|r| < 0.3$ 极弱 / 无线性关联
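The thresholds above can be turned into a small lookup. This is a sketch only: the function name `describe_r` is hypothetical, and the cut-offs are the rough ones listed here, not an official IB scale.

```python
def describe_r(r):
    """Return a verbal description of r using the rough thresholds above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    size = abs(r)
    if size < 0.3:
        return "very weak or no linear association"
    if size >= 0.9:
        strength = "very strong"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear association"


print(describe_r(0.992))
print(describe_r(-0.6))
```

The description always names strength, direction, and the word "linear", which is the wording a written answer should contain.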
Correlation is not causation. A strong $r$ only shows that the two variables move together; it does not prove that one causes the other. Lurking variables and reverse causation can both produce a strong $r$.
$r$ measures linear association only. Data that follow $y = x^2$ with $x$-values spread symmetrically over $[-1, 1]$ have $r = 0$ despite a perfect non-linear pattern. Always look at the scatter plot before quoting $r$.
Don't extrapolate. The regression line is only reliable inside the $x$-range of the data. Predictions outside it (extrapolation) assume the linear pattern continues, and usually it doesn't.
相关(correlation)不等于因果。$r$ 很强只说明两变量同步变化,并不证明一个引起另一个。潜在变量与反向因果都能造成强 $r$。
$r$ 只衡量线性关联。数据若满足 $y = x^2$ 且 $x$ 值在 $[-1, 1]$ 上关于 $0$ 对称分布,则 $r = 0$,但却是完美的非线性模式。引用 $r$ 之前务必看散点图。
不要外推(extrapolation)。回归直线只在数据的 $x$ 区间内可靠。区间外的预测假设线性模式继续延伸,而这通常并不成立。
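The $y = x^2$ warning is easy to reproduce. In this sketch the helper `pmcc` is a hypothetical name implementing $r = S_{xy}/\sqrt{S_{xx} S_{yy}}$, and the five $x$-values are an assumed sample placed symmetrically about $0$.

```python
from math import sqrt


def pmcc(xs, ys):
    """Pearson's r computed as S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# x-values symmetric about 0; y is an exact function of x.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [x ** 2 for x in xs]

r = pmcc(xs, ys)
print(r)  # zero: no linear association, although y is fully determined by x
```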
Worked Example — PMCC and regression line by hand例题 — 手算 PMCC 与回归直线
Problem: Five $(x, y)$ pairs:
| $x$ | 2 | 3 | 5 | 7 | 8 |
|---|---|---|---|---|---|
| $y$ | 4 | 5 | 7 | 10 | 12 |
$n = 5$. $\bar{x} = 25/5 = 5$, $\bar{y} = 38/5 = 7.6$.
Sums:
$$ \sum x_i^2 = 4 + 9 + 25 + 49 + 64 = 151, \qquad \sum y_i^2 = 16 + 25 + 49 + 100 + 144 = 334, $$ $$ \sum x_i y_i = 2\cdot 4 + 3\cdot 5 + 5\cdot 7 + 7\cdot 10 + 8\cdot 12 = 8 + 15 + 35 + 70 + 96 = 224. $$$S_{xx} = 151 - 5(5)^2 = 151 - 125 = 26$. $S_{yy} = 334 - 5(7.6)^2 = 334 - 288.8 = 45.2$. $S_{xy} = 224 - 5(5)(7.6) = 224 - 190 = 34$.
PMCC.
$$ r = \frac{34}{\sqrt{26 \cdot 45.2}} = \frac{34}{\sqrt{1175.2}} \approx \frac{34}{34.281} \approx 0.992. $$Very strong positive linear association.
Regression line. Writing the slope as $b$ (the $y = a + bx$ convention), $b = S_{xy}/S_{xx} = 34/26 \approx 1.308$. Passing through $(\bar{x}, \bar{y}) = (5, 7.6)$:
$$ y - 7.6 = 1.308(x - 5) \;\Longrightarrow\; y \approx 1.31 x + 1.06. $$Read this off the GDC on Paper 2 (LinReg(a + bx)); the by-hand version above is what you'd write on Paper 1.
题目:5 组 $(x, y)$:
| $x$ | 2 | 3 | 5 | 7 | 8 |
|---|---|---|---|---|---|
| $y$ | 4 | 5 | 7 | 10 | 12 |
$n = 5$。$\bar{x} = 25/5 = 5$,$\bar{y} = 38/5 = 7.6$。
各项求和:
$$ \sum x_i^2 = 4 + 9 + 25 + 49 + 64 = 151, \qquad \sum y_i^2 = 16 + 25 + 49 + 100 + 144 = 334, $$ $$ \sum x_i y_i = 2\cdot 4 + 3\cdot 5 + 5\cdot 7 + 7\cdot 10 + 8\cdot 12 = 8 + 15 + 35 + 70 + 96 = 224. $$$S_{xx} = 151 - 5(5)^2 = 151 - 125 = 26$。$S_{yy} = 334 - 5(7.6)^2 = 334 - 288.8 = 45.2$。$S_{xy} = 224 - 5(5)(7.6) = 224 - 190 = 34$。
PMCC。
$$ r = \frac{34}{\sqrt{26 \cdot 45.2}} = \frac{34}{\sqrt{1175.2}} \approx \frac{34}{34.281} \approx 0.992. $$极强的正向线性关联。
回归直线。按 $y = a + bx$ 的记号,斜率 $b = S_{xy}/S_{xx} = 34/26 \approx 1.308$。过 $(\bar{x}, \bar{y}) = (5, 7.6)$:
$$ y - 7.6 = 1.308(x - 5) \;\Longrightarrow\; y \approx 1.31 x + 1.06. $$在试卷 2 直接从 GDC(LinReg(a + bx))读出;上面这种手算写法是试卷 1 上要写的。
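The by-hand working can be cross-checked with a few lines of code. This sketch uses the same five data pairs and the computational forms of $S_{xx}$, $S_{yy}$, $S_{xy}$; the variable names are illustrative choices, not GDC output labels.

```python
from math import sqrt

xs = [2, 3, 5, 7, 8]
ys = [4, 5, 7, 10, 12]
n = len(xs)

x_bar = sum(xs) / n  # 5.0
y_bar = sum(ys) / n  # 7.6

# Computational forms: S_xx = sum(x^2) - n * x_bar^2, and so on.
s_xx = sum(x * x for x in xs) - n * x_bar ** 2
s_yy = sum(y * y for y in ys) - n * y_bar ** 2
s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

r = s_xy / sqrt(s_xx * s_yy)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar  # forces the line through (x_bar, y_bar)

print(round(r, 3), round(slope, 3), round(intercept, 3))  # 0.992 1.308 1.062
```

The intercept prints as 1.062 because it uses the unrounded slope; the 1.06 in the worked example is the same value to three significant figures.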
Worked Example — Using the line for prediction例题 — 用回归直线做预测
Problem: For the data above, predict $y$ when $x = 6$. Comment on the reliability.
$$ \hat{y} = 1.308(6) + 1.06 \approx 8.91. $$$x = 6$ lies inside the observed range $[2, 8]$, so the prediction is interpolation — reliable given the very strong $r \approx 0.99$.
By contrast, predicting at $x = 20$ would be extrapolation: the linear pattern is not guaranteed beyond the data.
题目:用上题数据预测 $x = 6$ 时的 $y$。评估其可靠性。
$$ \hat{y} = 1.308(6) + 1.06 \approx 8.91. $$$x = 6$ 位于观测范围 $[2, 8]$ 内,预测属内插(interpolation) — 鉴于 $r \approx 0.99$ 极强,结果可靠。
反之,预测 $x = 20$ 即外推(extrapolation):数据范围外不能保证线性模式仍然成立。
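The interpolation test is just a range check on $x$. The sketch below refits the line from the five data pairs and labels each prediction; the function names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line of y on x; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def predict(xs, ys, x_new):
    """Predict y at x_new and say whether x_new is inside the data range."""
    intercept, slope = fit_line(xs, ys)
    inside = min(xs) <= x_new <= max(xs)
    kind = "interpolation" if inside else "extrapolation"
    return intercept + slope * x_new, kind


xs = [2, 3, 5, 7, 8]
ys = [4, 5, 7, 10, 12]

y6, kind6 = predict(xs, ys, 6)
print(round(y6, 2), kind6)       # 8.91 interpolation
print(predict(xs, ys, 20)[1])    # extrapolation
```

The range check says nothing about how strong the fit is, so a written answer should still quote $r$ alongside it.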
Worked Example — Interpreting parameters例题 — 解释回归参数
Problem: A regression of weekly sales $y$ (in thousands of dollars) on advertising spend $x$ (in thousands of dollars) gives $y = 4.2 x + 15$. Interpret the slope and intercept.
Slope $a = 4.2$. For every $\$1{,}000$ extra spent on advertising, weekly sales increase by $\$4{,}200$ on average.
Intercept $b = 15$. When advertising spend is $\$0$, the model predicts weekly sales of $\$15{,}000$. Caveat: $x = 0$ may lie outside the observed range, in which case the intercept's interpretation is shaky — flag this in a written answer.
题目:用每周销售额 $y$(单位:千美元)对广告投入 $x$(单位:千美元)作回归,得到 $y = 4.2 x + 15$。请解释斜率与截距。
斜率 $a = 4.2$。广告每多投入 $\$1{,}000$,每周销售额平均增加 $\$4{,}200$。
截距 $b = 15$。广告投入为 $\$0$ 时,模型预测每周销售额为 $\$15{,}000$。注意:$x = 0$ 可能在观测范围之外,此时截距的解释并不可靠 — 在书面答题中要标注这一点。
Visualizing the fit将拟合可视化
The dashed line is the best fit. It must pass through the highlighted mean point $(\bar{x}, \bar{y})$ — that's the easy way to sketch the line "by eye" if asked.
虚线即最佳拟合直线。它必须过高亮的均值点 $(\bar{x}, \bar{y})$ — 这也是题目要求"目测"作图时最方便的画法。
▸ Going deeper — Why the regression line passes through $(\bar{x}, \bar{y})$▸ 深入探究 — 为何回归直线恒过 $(\bar{x}, \bar{y})$
Take the point-slope form $y - \bar{y} = b(x - \bar{x})$, with $b$ the slope, and substitute $x = \bar{x}$:
取点斜式方程 $y - \bar{y} = b(x - \bar{x})$,代入 $x = \bar{x}$:
$$ y - \bar{y} = b(\bar{x} - \bar{x}) = 0 \;\;\Longrightarrow\;\; y = \bar{y}. $$
So the point $(\bar{x}, \bar{y})$ lies on the line by construction. The underlying reason is that the least-squares method chooses the intercept $a$ and slope $b$ to minimise $\sum (y_i - (a + b x_i))^2$; differentiating with respect to $a$ and setting the derivative to zero gives $\sum (y_i - a - b x_i) = 0$, equivalently $\bar{y} = a + b\bar{x}$.
因此 $(\bar{x}, \bar{y})$ 按构造就落在直线上。背后原因是:最小二乘(least squares)方法选取截距 $a$ 与斜率 $b$ 以极小化 $\sum (y_i - (a + b x_i))^2$;对 $a$ 求导并令导数为零,得到 $\sum (y_i - a - b x_i) = 0$,等价于 $\bar{y} = a + b\bar{x}$。
Practical use: in a Paper 1 question with no calculator, if you're given $\bar{x}, \bar{y}$ and the slope $b$, write the line in point-slope form immediately; no extra arithmetic is needed.
实用法:试卷 1 不允许使用计算器时,若题目给出 $\bar{x}, \bar{y}$ 与斜率 $b$,直接用点斜式写出方程 — 无需额外运算。
▸ Going deeper — Why PMCC is invariant under linear transformations▸ 深入探究 — 为何 PMCC 在线性变换下不变
Apply $u = a x + b$, $v = c y + d$ with $a, c \ne 0$. Then $\bar{u} = a\bar{x} + b$, $\bar{v} = c\bar{y} + d$, and
应用 $u = a x + b$,$v = c y + d$,其中 $a, c \ne 0$。则 $\bar{u} = a\bar{x} + b$,$\bar{v} = c\bar{y} + d$,且
$$ S_{uv} = \sum (u_i - \bar{u})(v_i - \bar{v}) = \sum a(x_i - \bar{x}) \cdot c(y_i - \bar{y}) = a c \, S_{xy}. $$Similarly $S_{uu} = a^2 S_{xx}$, $S_{vv} = c^2 S_{yy}$. So:
类似地 $S_{uu} = a^2 S_{xx}$,$S_{vv} = c^2 S_{yy}$。于是:
$$ r_{uv} = \frac{ac\, S_{xy}}{\sqrt{a^2 S_{xx} \cdot c^2 S_{yy}}} = \frac{ac}{|ac|} \cdot \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \text{sign}(ac) \cdot r_{xy}. $$If $ac > 0$, $r$ is unchanged. If $ac < 0$ (i.e. exactly one of the variables is reflected), the sign flips but the magnitude stays the same. This makes $r$ a true measure of association: it doesn't depend on the units chosen.
若 $ac > 0$,$r$ 不变;若 $ac < 0$(即恰有一个变量被反向),符号翻转而大小不变。这使 $r$ 成为真正的关联度度量:它不依赖于所选单位。
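The $\text{sign}(ac)$ result can be confirmed numerically. In this sketch the helper `pmcc` and the particular rescalings are hypothetical; any non-zero $a$ and $c$ would do.

```python
from math import sqrt, isclose


def pmcc(xs, ys):
    """Pearson's r computed as S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


xs = [2, 3, 5, 7, 8]
ys = [4, 5, 7, 10, 12]
r_xy = pmcc(xs, ys)

# Hypothetical changes of units: u = a*x + b and v = c*y + d.
us = [2.54 * x + 1 for x in xs]        # a = 2.54
vs_same = [10 * y - 4 for y in ys]     # c = 10, so ac > 0
vs_flip = [-0.5 * y + 3 for y in ys]   # c = -0.5, so ac < 0

r_same = pmcc(us, vs_same)
r_flip = pmcc(us, vs_flip)

print(isclose(r_same, r_xy))    # True: r unchanged when ac > 0
print(isclose(r_flip, -r_xy))   # True: sign flips when ac < 0
```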
Exam Strategy & Common Pitfalls考试策略与常见陷阱
- $\bar{x} = \dfrac{\sum f_i x_i}{\sum f_i}$ for grouped data SL 4.3
- $\sigma = \sqrt{\dfrac{\sum(x_i - \mu)^2}{n}}$ (population) SL 4.3
- $\sigma^2 = \overline{x^2} - \bar{x}^2$ computational form SL 4.3
- Linear-transformation rule: $\bar{y} = a\bar{x} + b$, $\sigma_y = |a|\sigma_x$ SL 4.3
- Outlier rule: $x < Q_1 - 1.5\,\text{IQR}$ or $x > Q_3 + 1.5\,\text{IQR}$ SL 4.2
- PMCC: $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, $r \in [-1, 1]$ SL 4.4
- Regression line passes through $(\bar{x}, \bar{y})$ SL 4.4
- 分组数据均值 $\bar{x} = \dfrac{\sum f_i x_i}{\sum f_i}$ SL 4.3
- 总体标准差 $\sigma = \sqrt{\dfrac{\sum(x_i - \mu)^2}{n}} $ SL 4.3
- 计算式 $\sigma^2 = \overline{x^2} - \bar{x}^2$ SL 4.3
- 线性变换规则:$\bar{y} = a\bar{x} + b$,$\sigma_y = |a|\sigma_x$ SL 4.3
- 异常值规则:$x < Q_1 - 1.5\,\text{IQR}$ 或 $x > Q_3 + 1.5\,\text{IQR}$ SL 4.2
- PMCC:$r = S_{xy} / \sqrt{S_{xx} S_{yy}}$,$r \in [-1, 1]$ SL 4.4
- 回归直线恒过 $(\bar{x}, \bar{y})$ SL 4.4
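Three of the formulas above (grouped mean, computational variance, outlier fences) are sketched here on a hypothetical frequency table. The function names and numbers are illustrative only. The quartiles are passed in directly because quartile conventions differ between textbooks and calculators.

```python
from math import sqrt


def grouped_mean_sd(midpoints, freqs):
    """Estimate the mean and population SD from class midpoints and frequencies."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    mean_sq = sum(f * x * x for x, f in zip(midpoints, freqs)) / n
    variance = max(mean_sq - mean ** 2, 0.0)  # guard against rounding below zero
    return mean, sqrt(variance)


def outlier_fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical grouped data: classes 0-10, 10-20, 20-30, 30-40.
midpoints = [5, 15, 25, 35]
freqs = [2, 5, 8, 5]

mean, sd = grouped_mean_sd(midpoints, freqs)
print(mean, round(sd, 2))          # 23.0 9.27

low, high = outlier_fences(10, 20)
print(low, high)                   # -5.0 35.0
```

Any observation below `low` or above `high` would be flagged as an outlier under the $1.5\,\text{IQR}$ rule.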
- Why a random sample matters — bias is fatal even with large $n$.
- Why bar area (not height) encodes frequency on a continuous histogram.
- Why mean is sensitive to outliers and median isn't (squared vs. absolute).
- Why SD divides by $n$ in the IB, not $n - 1$ (population convention).
- Why $\sigma_y = |a|\sigma_x$: shift vanishes, scale comes out squared then rooted.
- Why $r$ is invariant under linear transformations of $x$ or $y$.
- Why the regression line passes through $(\bar{x}, \bar{y})$ (least-squares condition).
- 为什么随机样本(random sample)至关重要:抽样偏差(sampling bias)无法通过加大 $n$ 修复。
- 为什么连续直方图(histogram)用柱子的面积而非高度表示频数。
- 为什么均值对异常值敏感、中位数不敏感(平方 vs. 绝对值)。
- 为什么 IB 的标准差除以 $n$ 而非 $n - 1$(总体约定)。
- 为什么 $\sigma_y = |a|\sigma_x$:平移消失,比例因子先平方再开方。
- 为什么 PMCC $r$ 在 $x$ 或 $y$ 的线性变换下保持不变。
- 为什么回归直线恒过 $(\bar{x}, \bar{y})$(最小二乘条件的副产物)。
Common Pitfalls常见陷阱
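1. Using $s_x$ (the sample SD, dividing by $n - 1$) instead of $\sigma_x$ (dividing by $n$). The IB defaults to the latter.
2. Saying "$r = 0.6$ means $x$ causes $y$." Correlation $\ne$ causation; always hedge in writing.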
3. Computing PMCC and quoting it without ever looking at the scatter plot — $r$ doesn't detect curvature.
4. Extrapolating: using the regression line to predict $y$ for $x$ far outside the data range.
5. Forgetting that the regression line passes through $(\bar{x}, \bar{y})$. This is the fastest way to recover the intercept on Paper 1.
6. On grouped-data mean, averaging midpoints without weighting by frequency.
7. Outlier rule: forgetting the factor $1.5$, or using $\bar{x} \pm 1.5\sigma$ as the fences instead of $Q_1 - 1.5\,\text{IQR}$ and $Q_3 + 1.5\,\text{IQR}$.
8. Linear transformations: adding $b$ to the SD (it doesn't change), or forgetting the absolute value on $|a|$ when $a < 0$.
9. Reporting just one of (centre, spread) instead of both.
10. Reading the median off a cumulative-frequency graph at $0.5$ on the $y$-axis instead of $0.5 n$.
1. 用 $s_x$(样本标准差,除以 $n - 1$)代替 $\sigma_x$(除以 $n$)。IB 默认后者。
2. 说"$r = 0.6$ 意味着 $x$ 导致 $y$"。相关 $\ne$ 因果;书面回答必须留有余地。
3. 算出 PMCC 就直接引用,根本没看散点图 — $r$ 检测不到曲率。
4. 外推:用回归直线在数据范围之外的 $x$ 处预测 $y$。
5. 忘记回归直线恒过 $(\bar{x}, \bar{y})$。这在试卷 1 上是最快恢复截距的方法。
6. 分组数据求均值时,对组中值取算术平均,忘了按频数加权。
7. 异常值规则:漏掉系数 $1.5$,或错把 $\bar{x} \pm 1.5\sigma$ 当作界限,而非 $Q_1 - 1.5\,\text{IQR}$ 与 $Q_3 + 1.5\,\text{IQR}$。
8. 线性变换:把 $b$ 加到标准差上(标准差不变),或当 $a < 0$ 时忘了取 $|a|$。
9. 只报告中心或离散程度其中之一,而非两者一起报。
10. 在累积频率图上把中位数读到 $y$ 轴的 $0.5$ 而不是 $0.5 n$。
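Paper 1 (no calc). Small datasets with clean numbers. Expect to find the mean, median, and quartiles by hand; to compute the SD with the computational form $\overline{x^2} - \bar{x}^2$; and to write down the regression line given $\bar{x}, \bar{y}$ and one of $b$ or $r$. The outlier rule is tested in application, and a given $r$ must be interpreted in words.
Paper 2 (calc). Larger datasets entered as L1 / L2 lists. Read 1-Var Stats for mean and SD; 2-Var Stats and LinReg for $r$, regression line, and predictions. Histograms and box plots are usually already drawn; you describe shape and read off summary statistics. The final part often asks whether a prediction is reliable (inside vs. outside the data range).
试卷 1(不可使用计算器)。数据量小、数字干净。可能要求手算均值、中位数、四分位数(quartile);用计算式 $\overline{x^2} - \bar{x}^2$ 算标准差;给出 $\bar{x}, \bar{y}$ 以及 $b$ 或 $r$ 之一,要求写出回归直线。考异常值规则的应用。要用文字解读给定的 $r$。
试卷 2(可使用计算器)。数据量较大,作为 L1 / L2 列表录入。用 1-Var Stats 读出均值与标准差;用 2-Var Stats 与 LinReg 读出 $r$、回归直线与预测值。直方图与箱形图通常已经画好,你只需描述形状并读出概括统计量。结尾一道分数题常问预测是否可靠(在数据范围内还是外)。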
Flashcards速记卡
Population = every individual. Sample = subset measured.
总体 = 全体个体。样本 = 实际测量的子集。
Unit D1 — Practice QuizUnit D1 — 练习测验
Ten mixed-difficulty items. Your score updates in real time at the top of the page. Aim for 8/10 before exam day.
10 道难度混合的题目。得分会在页面顶部实时更新。考试前争取达到 8/10。
Readiness Checklist备考清单
Click each item you've mastered. Aim for 100% before exam day. The IB sub-topic reference is tagged on each item.
勾选你已掌握的每一项。考试前争取 100%。每项都标注了对应的 IB 子主题。
- SL 4.1 I can distinguish population from sample, and name the major sampling methods (simple random, systematic, stratified, quota, convenience).我能区分总体(population)与样本(sample),并说出主要抽样方法(简单随机、系统、分层、配额、便利)。
- SL 4.1 I can identify likely sources of bias (selection, non-response, self-report, volunteer) in a sampling design.对一个抽样设计,我能指出可能的偏差来源(选择、无回应、自报、志愿者)。
- SL 4.1 I can classify a variable as discrete or continuous and pick the right summary.我能判断一个变量是离散(discrete)还是连续(continuous),并选择合适的概括方式。
- SL 4.2 I can build a frequency table and histogram, describe shape (skew / symmetry / modality), and read off the modal class.我能构建频率分布表与直方图,描述形状(偏态 / 对称 / 峰数),并读出众数类(modal class)。
- SL 4.2 I can read median and quartiles from a cumulative-frequency graph and draw the corresponding box plot.我能从累积频率图上读出中位数与四分位数,并画出对应的箱形图(box plot)。
- SL 4.2 I can apply the $1.5 \cdot \text{IQR}$ outlier rule and report outliers on a box plot.我能应用 $1.5 \cdot \text{IQR}$ 异常值规则,并在箱形图上标注异常值。
- SL 4.3 I can compute mean, median, and mode of raw data and estimate the mean from grouped data via $\sum f_i x_i / \sum f_i$.我能计算原始数据的均值、中位数与众数,并用 $\sum f_i x_i / \sum f_i$ 估计分组数据的均值。
- SL 4.3 I can compute the population variance and SD by both the deviations form and $\overline{x^2} - \bar{x}^2$.我能分别用偏差形式与 $\overline{x^2} - \bar{x}^2$ 计算总体方差与标准差。
- SL 4.3 I know the IB uses $n$ (not $n - 1$) in the SD and can read $\sigma_x$ on the GDC.我知道 IB 的标准差除以 $n$(而非 $n - 1$),并能在图形计算器(GDC)上读出 $\sigma_x$。
- SL 4.3 I can apply the linear transformation rules $\bar{y} = a\bar{x} + b$ and $\sigma_y = |a|\sigma_x$, including the unit-conversion use case.我能应用线性变换规则 $\bar{y} = a\bar{x} + b$、$\sigma_y = |a|\sigma_x$,并能处理单位换算这种典型情形。
- SL 4.4 I can interpret the sign and magnitude of $r$, and avoid the correlation-vs-causation trap in writing.我能解读 $r$ 的符号与大小,并在书面回答中避开"相关即因果"的陷阱。
- SL 4.4 I can fit a regression line by hand using $(\bar{x}, \bar{y})$ and $b = S_{xy}/S_{xx}$, or read it from the GDC.我能用 $(\bar{x}, \bar{y})$ 与 $b = S_{xy}/S_{xx}$ 手算回归直线,或直接从 GDC 读出。
- SL 4.4 I can use the regression line to predict, and decide when a prediction is interpolation vs. extrapolation.我能用回归直线做预测,并判断这是内插还是外推(extrapolation)。
- SL 4.4 I can interpret the slope and intercept of a regression line in context, with appropriate caveats.我能结合背景解释回归直线的斜率与截距,并给出适当的限制条件。
IB Paper-Style PracticeIB 试卷风格练习
IB exam-style questions across Paper 1A (short response, no calc), Paper 1B (extended response, no calc), and Paper 2 (calculator). D1 is SL-only — no Paper 3. EMH difficulty mix. Mark-by-mark solutions live in the separate solutions file. Use this after the in-page quiz and flashcards.
IB 考试风格题,涵盖 Paper 1A(短答,无计算器)、Paper 1B(长答,无计算器)、Paper 2(可用计算器)。D1 仅 SL,无 Paper 3。难度按 EMH 分级。逐分解答见独立的解答文档。建议在做完本页测验与闪卡后再来。