PAC learning¶
Basic definitions:¶
- Input space $\mathcal{X}$ and output space $\mathcal{Y}=\{0,1\}$
- Concept $c:\mathcal{X}\rightarrow\mathcal{Y}$
- Concept class $\mathcal{C}\ni c$: the set of concepts to be learned
- Hypothesis set $\mathcal{H}$: a fixed set of possible concepts (not necessarily coinciding with $\mathcal{C}$); a toy instantiation of these objects is sketched below
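A minimal code sketch of these objects (the finite domain, the particular target concept and the threshold-style hypothesis set are assumptions made purely for illustration):
```python
# Toy instantiation of the basic PAC-learning objects (illustrative only).
X = list(range(10))                         # input space: {0, ..., 9}
Y = {0, 1}                                  # output space

def c(x):                                   # one target concept: "x >= 6"
    return 1 if x >= 6 else 0

# Finite hypothesis set H: all threshold classifiers "x >= t", t = 0..10
H = [(lambda x, t=t: 1 if x >= t else 0) for t in range(11)]
```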
Definition Generalization error: Given hypothesis $h\in\mathcal{H}$, target concept $c\in\mathcal{C}$ and underlying distribution $\mathcal{D}$, the generalization error (risk) of $h$ is defined as
$$R(h)=\underset{x\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq c(x)\right]=\underset{x\sim\mathcal{D}}{\mathbb{E}}\left[\mathbb{I}_{h(x)\neq c(x)}\right],$$where $\mathbb{I}_A(x)$ denotes the indicator function of the set $A$:
$$\mathbb{I}_A(x)=\begin{cases}1 & x\in A\\0 & x\notin A\end{cases}$$
Definition Empirical error: Given hypothesis $h\in\mathcal{H}$, target concept $c\in\mathcal{C}$ and sample $\mathcal{S}=(x_1,...,x_n)$, the empirical error (empirical risk) of $h$ is defined as
$$\hat{R}_\mathcal{S}(h)=\frac{1}{n}\sum_{i=1}^n\mathbb{I}_{h(x_i)\neq c(x_i)}$$
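A short self-contained sketch of both error notions (the uniform distribution $\mathcal{D}$, the target concept and the hypothesis are assumptions chosen for the example; $R(h)$ is approximated by a large Monte Carlo sample rather than computed in closed form):
```python
import random

def c(x):                  # assumed target concept: "x >= 6"
    return 1 if x >= 6 else 0

def h(x):                  # a deliberately imperfect hypothesis: "x >= 4"
    return 1 if x >= 4 else 0

def empirical_risk(h, sample):
    """Hat-R_S(h): fraction of sample points on which h disagrees with c."""
    return sum(1 for x in sample if h(x) != c(x)) / len(sample)

def draw():                # assumed D: uniform on {0, ..., 9}
    return random.randrange(10)

S = [draw() for _ in range(20)]              # small i.i.d. sample
print("empirical risk on S:", empirical_risk(h, S))

# Monte Carlo approximation of R(h); the exact value here is 0.2,
# since h and c disagree exactly on {4, 5}.
big_sample = [draw() for _ in range(100_000)]
print("approx. generalization error:", empirical_risk(h, big_sample))
```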
Theorems¶
Learning bound - finite $\mathcal{H}$¶
Proposition Let $\mathcal{H}$ be a finite set of functions $\mathcal{X}\rightarrow\mathcal{Y}$. Let $\mathcal{A}$ be an algorithm that, for any target concept $c\in\mathcal{H}$ and any i.i.d. sample $\mathcal{S}$, returns a consistent hypothesis $h_\mathcal{S}$, i.e. $\hat{R}_\mathcal{S}(h_\mathcal{S})=0$. Then for any $\epsilon,\delta>0$, if
$$n\geq\frac{1}{\epsilon}\left(\log|\mathcal{H}|+\log\frac{1}{\delta}\right),$$then
$$\mathbb{P}_{\mathcal{S}\sim\mathcal{D}^n}\left[R(h_\mathcal{S})\leq\epsilon\right]\geq1-\delta$$
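A small numeric sketch of this sample-complexity bound (the values of $|\mathcal{H}|$, $\epsilon$ and $\delta$ are arbitrary choices for illustration):
```python
from math import ceil, log

H_size, eps, delta = 1000, 0.05, 0.01       # assumed illustrative values

n = ceil((log(H_size) + log(1 / delta)) / eps)
print(n)   # 231: a sample of this size suffices for these |H|, eps, delta
```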
Proof Define, for any $\epsilon>0$, $\mathcal{H}_\epsilon=\{h\in\mathcal{H}:R(h)>\epsilon\}$. For a fixed $h_\epsilon\in\mathcal{H}_\epsilon$, each of the $n$ i.i.d. sample points is labeled correctly by $h_\epsilon$ with probability at most $1-\epsilon$, hence
$$\mathbb{P}\left[\hat{R}_\mathcal{S}(h_\epsilon)=0\right]\leq(1-\epsilon)^n$$Now, by the union bound,
$$\begin{aligned}\mathbb{P}\left[\exists h\in\mathcal{H}_\epsilon:\hat{R}_\mathcal{S}(h)=0\right]&\leq\sum_{h\in\mathcal{H}_\epsilon}\mathbb{P}\left[\hat{R}_\mathcal{S}(h)=0\right]\\&\leq\sum_{h\in\mathcal{H}_\epsilon}(1-\epsilon)^n\leq|\mathcal{H}|(1-\epsilon)^n\leq|\mathcal{H}|e^{-n\epsilon}\leq\delta,\end{aligned}$$where $(1-\epsilon)^n\leq e^{-n\epsilon}$ follows from $1-\epsilon\leq e^{-\epsilon}$, and the last inequality is valid because, by the assumption on $n$,
$$|\mathcal{H}|e^{-n\epsilon}\leq\delta\;\Leftrightarrow\;\log|\mathcal{H}|-n\epsilon\leq\log\delta\;\Leftrightarrow\;n\geq\frac{1}{\epsilon}\left(\log|\mathcal{H}|+\log\frac{1}{\delta}\right),$$as required.
Since $\mathcal{A}$ always returns a consistent hypothesis, $\hat{R}_\mathcal{S}(h_\mathcal{S})=0$ holds with probability $1$; hence on the event $R(h_\mathcal{S})>\epsilon$ the hypothesis $h_\mathcal{S}$ is itself an element of $\mathcal{H}_\epsilon$ with zero empirical error, so that $\exists h\in\mathcal{H}_\epsilon:\hat{R}_\mathcal{S}(h)=0$. Therefore
$$\mathbb{P}\left[R(h_\mathcal{S})>\epsilon\right]\leq\mathbb{P}\left[\exists h\in\mathcal{H}_\epsilon:\hat{R}_\mathcal{S}(h)=0\right]\leq\delta$$and finally
$$\mathbb{P}\left[R(h_\mathcal{S})\leq\epsilon\right]\geq1-\delta\quad\square$$
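A quick simulation in the spirit of the proof's main estimate (the domain, distribution, target concept and hypothesis set are assumptions; it compares the observed frequency of the event $\exists h\in\mathcal{H}_\epsilon:\hat{R}_\mathcal{S}(h)=0$ with the union-bound quantity $|\mathcal{H}|e^{-n\epsilon}$):
```python
import math
import random

# Assumed toy setup: X = {0,...,19}, uniform D, target c(x) = [x >= 10],
# H = threshold classifiers "x >= t" for t = 0..20, so |H| = 21.
def c(x):
    return 1 if x >= 10 else 0

H = [(lambda x, t=t: 1 if x >= t else 0) for t in range(21)]

def gen_error(h):
    """Exact R(h) under the uniform distribution on {0,...,19}."""
    return sum(h(x) != c(x) for x in range(20)) / 20

eps, n, trials = 0.1, 60, 2000
H_eps = [h for h in H if gen_error(h) > eps]      # the eps-bad hypotheses

hits = 0
for _ in range(trials):
    S = [random.randrange(20) for _ in range(n)]
    # Does some eps-bad hypothesis look consistent on this sample?
    if any(all(h(x) == c(x) for x in S) for h in H_eps):
        hits += 1

print("observed P[exists bad consistent h]:", hits / trials)
print("union-bound value |H|*exp(-n*eps)  :", len(H) * math.exp(-n * eps))
```
With these settings the observed frequency should come out essentially zero, comfortably below the union-bound value of roughly $0.05$.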
Single hypothesis - generalization and empirical error¶
Corollary For any hypothesis $h:\mathcal{X}\rightarrow\{0,1\}$ and fixed $\epsilon>0$, we have
$$\mathbb{P}_{\mathcal{S}\sim\mathcal{D}^n}\left[\hat{R}_\mathcal{S}(h)-R(h)\geq\epsilon\right]\leq\exp(-2n\epsilon^2)$$ $$\mathbb{P}_{\mathcal{S}\sim\mathcal{D}^n}\left[\hat{R}_\mathcal{S}(h)-R(h)\leq-\epsilon\right]\leq\exp(-2n\epsilon^2)$$
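A small simulation sketch of the corollary (the generalization error $R(h)=0.3$, the sample size and $\epsilon$ are assumed values; each empirical risk is simulated as a mean of i.i.d. Bernoulli error indicators and the observed deviation frequency is compared with the Hoeffding bound $\exp(-2n\epsilon^2)$):
```python
import math
import random

R, n, eps, trials = 0.3, 200, 0.1, 5000      # assumed values for illustration

exceed = 0
for _ in range(trials):
    # empirical risk = mean of n i.i.d. Bernoulli(R) error indicators
    R_hat = sum(random.random() < R for _ in range(n)) / n
    if R_hat - R >= eps:
        exceed += 1

print("observed P[R_hat - R >= eps]   :", exceed / trials)
print("Hoeffding bound exp(-2*n*eps^2):", math.exp(-2 * n * eps ** 2))
```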
Sources¶
- Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.