PAC Learning

Basic definitions:

  • Input space $\mathcal{X}$ and output space $\mathcal{Y}=\{0,1\}$
  • Concept $c:\mathcal{X}\rightarrow\mathcal{Y}$
  • Concept Class $\mathcal{C}\ni c$ as set of concepts to learn
  • Hypothesis set $\mathcal{H}$ = fixed set of possible concepts (not necessarily coinciding with $\mathcal{C}$)
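A standard example (cf. Mohri et al.): $\mathcal{X}=\{0,1\}^d$, $\mathcal{Y}=\{0,1\}$, and $\mathcal{C}$ the class of conjunctions of at most $d$ Boolean literals; the learner may work with a larger hypothesis set $\mathcal{H}$, for instance all Boolean functions on $\mathcal{X}$.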




Definition Generalization error: Given hypothesis $h\in\mathcal{H}$, target concept $c\in\mathcal{C}$ and underlying distribution $\mathcal{D}$, the generalization error (risk) of $h$ is defined as

$$R(h)=\underset{x\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq c(x)\right]=\underset{x\sim\mathcal{D}}{\mathbb{E}}\left[\mathbb{I}_{h(x)\neq c(x)}\right]$$

where $\mathbb{I}_A(x)$ denotes the indicator function for set $A$

$$\mathbb{I}_A(x)=\begin{cases} 1 & x\in A \\ 0 & x\notin A \end{cases}$$




Definition Empirical error: Given hypothesis $h\in\mathcal{H}$, target concept $c\in\mathcal{C}$ and sample $\mathcal{S}=(x_1,...,x_n)$, the empirical error (empirical risk) of $h$ is defined as

$$\hat{R}_\mathcal{S}(h)=\frac{1}{n}\sum_{i=1}^n\mathbb{I}_{h(x_i)\neq c(x_i)}$$
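The two definitions differ only in the measure over which the indicator is averaged: the unknown distribution $\mathcal{D}$ versus the drawn sample. The following minimal sketch computes both quantities for a toy setup; the choices of $\mathcal{X}=\{0,1\}^3$, the concept $c$, the hypothesis $h$ and the sample size are illustrative assumptions, not taken from the source.

```python
import numpy as np

# Minimal sketch: X = {0,1}^3 under the uniform distribution D,
# target concept c(x) = x1 AND x2, hypothesis h(x) = x1.
# All of these choices are illustrative, not from the source.
rng = np.random.default_rng(0)

def c(x):             # target concept
    return x[0] & x[1]

def h(x):             # hypothesis under evaluation
    return x[0]

# Generalization error R(h): exact expectation of the indicator under D
# (D is uniform on the 8 points of {0,1}^3, so we can enumerate them).
points = [(a, b, d) for a in (0, 1) for b in (0, 1) for d in (0, 1)]
R = np.mean([h(x) != c(x) for x in points])

# Empirical error R_hat_S(h): average of the indicator over an i.i.d. sample S.
n = 1000
S = rng.integers(0, 2, size=(n, 3))
R_hat = np.mean([h(x) != c(x) for x in S])

print(f"R(h)       = {R:.3f}")      # 0.250: h errs exactly when x1=1, x2=0
print(f"R_hat_S(h) = {R_hat:.3f}")  # close to 0.250 for moderately large n
```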

Theorems

Learning bound - finite $\mathcal{H}$

Proposition Let $\mathcal{H}$ be a finite set of functions $\mathcal{X}\rightarrow\mathcal{Y}$. Let $\mathcal{A}$ be an algorithm that, for any target concept $c\in\mathcal{H}$ and any i.i.d. sample $\mathcal{S}$, returns a consistent hypothesis $h_\mathcal{S}$, i.e. one with $\hat{R}_\mathcal{S}(h_\mathcal{S})=0$. Then, for any $\epsilon,\delta>0$, if

$$n\geq\frac{1}{\epsilon}\left(\log|\mathcal{H}|+\log\frac{1}{\delta}\right),$$

we have

$$\mathbb{P}_{\mathcal{S}\sim\mathcal{D}^n}\left[R(h_\mathcal{S})\leq\epsilon\right]\geq1-\delta$$
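As a quick numeric illustration of the bound, a minimal sketch in Python; the values of $|\mathcal{H}|$, $\epsilon$ and $\delta$ are arbitrary choices, not from the source.

```python
import math

# Minimal sketch: sample size prescribed by the proposition so that a
# consistent hypothesis satisfies R(h_S) <= epsilon with probability
# at least 1 - delta.  All values below are illustrative.
H_size  = 1024    # |H|
epsilon = 0.1
delta   = 0.05

n_required = math.ceil((math.log(H_size) + math.log(1 / delta)) / epsilon)
print(n_required)   # (log 1024 + log 20) / 0.1 ~= 99.3  ->  100
```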




Proof Define for any $\epsilon>0$ the set $\mathcal{H}_\epsilon=\{h\in\mathcal{H}:R(h)>\epsilon\}$. Any fixed $h_\epsilon\in\mathcal{H}_\epsilon$ misclassifies a point drawn from $\mathcal{D}$ with probability greater than $\epsilon$, so the probability that it is correct on all $n$ i.i.d. sample points is at most $(1-\epsilon)^n$:

$$\mathbb{P}\left[\hat{R}_\mathcal{S}(h_\epsilon)=0\right]\leq(1-\epsilon)^n$$

Now, by the union bound,

$$\mathbb{P}\left[\exists h\in\mathcal{H}_\epsilon:\hat{R}_\mathcal{S}(h)=0\right]$$

$$\leq\sum_{h\in\mathcal{H}_\epsilon}\mathbb{P}\left[\hat{R}_\mathcal{S}(h)=0\right]$$

$$\leq\sum_{h\in\mathcal{H}_\epsilon}(1-\epsilon)^n$$

$$\leq|\mathcal{H}|(1-\epsilon)^n\leq|\mathcal{H}|e^{-n\epsilon}\leq\delta$$

where $|\mathcal{H}|(1-\epsilon)^n\leq|\mathcal{H}|e^{-n\epsilon}$ uses $1-\epsilon\leq e^{-\epsilon}$, and the final inequality holds by the assumed lower bound on $n$, since

$$|\mathcal{H}|e^{-n\epsilon}\leq\delta$$

$$\Leftrightarrow \log|\mathcal{H}|-n\epsilon\leq \log\delta$$

$$\Leftrightarrow n\geq\frac{1}{\epsilon}\left(\log|\mathcal{H}|+\log\frac{1}{\delta}\right)$$

as required.

Since $\mathcal{A}$ returns a consistent hypothesis, $\hat{R}_\mathcal{S}(h_\mathcal{S})=0$ holds for every sample. Hence, whenever $R(h_\mathcal{S})>\epsilon$, the hypothesis $h_\mathcal{S}$ itself lies in $\mathcal{H}_\epsilon$ and has zero empirical error, so

$$\mathbb{P}\left[R(h_\mathcal{S})>\epsilon\right]\leq\mathbb{P}\left[\exists h\in\mathcal{H}_\epsilon:\hat{R}_\mathcal{S}(h)=0\right]\leq\delta$$

and finally

$$\mathbb{P}\left[R(h_\mathcal{S})\leq\epsilon\right]\geq1-\delta\quad\square$$

Single hypothesis - generalization and empirical error

Corollary For any fixed hypothesis $h:\mathcal{X}\rightarrow\{0,1\}$ and any $\epsilon>0$, we have

$$\mathbb{P}_{\mathcal{S}\sim\mathcal{D}^n}\left[\hat{R}_\mathcal{S}(h)-R(h)\geq\epsilon\right]\leq\exp(-2n\epsilon^2)$$

$$\mathbb{P}_{\mathcal{S}\sim\mathcal{D}^n}\left[\hat{R}_\mathcal{S}(h)-R(h)\leq-\epsilon\right]\leq\exp(-2n\epsilon^2)$$
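These are Hoeffding's inequality applied to the i.i.d. indicator variables $\mathbb{I}_{h(x_i)\neq c(x_i)}$, whose mean is $R(h)$. A minimal simulation sketch of the first inequality follows; the error probability $p=R(h)$, the sample size $n$, $\epsilon$ and the trial count are illustrative assumptions, not from the source.

```python
import numpy as np

# Minimal simulation of the first inequality: the indicator of an error on
# each sample point is a Bernoulli(R(h)) variable, so R_hat_S(h) is the mean
# of n i.i.d. Bernoulli draws.  p, n, eps and trials are illustrative.
rng = np.random.default_rng(0)

p, n, eps, trials = 0.3, 200, 0.05, 20_000   # p plays the role of R(h)

# Each row is one sample S of size n; the row mean is R_hat_S(h).
R_hat = (rng.random((trials, n)) < p).mean(axis=1)

observed = np.mean(R_hat - p >= eps)   # estimate of P[R_hat - R >= eps]
bound = np.exp(-2 * n * eps**2)        # Hoeffding upper bound

print(f"observed frequency: {observed:.4f}")
print(f"Hoeffding bound:    {bound:.4f}")   # observed should not exceed bound
```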

Sources

  • Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.