# Class materials

The following textbooks will be useful for this course:

# Lectures

• 10-03-2016: We proved Markov's and Chebyshev's inequalities, and used these to conclude that the Maximum Likelihood Estimator $\dpi{300}\inline \hat p_n$, applied to $\dpi{300}\inline \mathrm{Bernoulli}(p)$ random variables $\dpi{300}\inline X_1,X_2,\ldots,X_n$, converges in probability to the true value $\dpi{300}\inline p$. (This is essentially the Weak Law of Large Numbers.) Later, we stated (but did not yet prove) the Chernoff-Hoeffding inequality, and showed how it implies the almost-sure convergence of $\dpi{300}\inline \hat p_n$ to $\dpi{300}\inline p$ via the Borel-Cantelli lemma. In our proof of the Borel-Cantelli lemma, we used two facts. Fact 1: For non-negative $\dpi{300}\inline x_1,x_2,\ldots$ such that
$\dpi{300}\displaystyle \sum_{k=1}^\infty x_k<\infty,$

we have
$\dpi{300}\displaystyle \lim_{n\to\infty}\sum_{k=n}^\infty x_k=0.$

Fact 2:
$\dpi{300}\displaystyle \sum_{n=1}^\infty e^{-2n^{1/3}}<\infty.$

You should prove both for yourself.
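As a numerical sanity check (not part of the lecture), the following sketch simulates the Bernoulli MLE and compares the empirical deviation probability against the Chernoff-Hoeffding tail bound $2e^{-2n\varepsilon^2}$; the parameter values are arbitrary choices for illustration.

```python
import math
import random

# Sketch: the MLE p_hat_n = (1/n) * sum(X_i) for Bernoulli(p) samples
# concentrates around p, and the empirical frequency of large deviations
# stays below the Chernoff-Hoeffding bound 2 * exp(-2 * n * eps^2).
random.seed(0)
p, n, eps, trials = 0.3, 500, 0.05, 2000

deviations = 0
for _ in range(trials):
    p_hat = sum(random.random() < p for _ in range(n)) / n
    if abs(p_hat - p) >= eps:
        deviations += 1

empirical = deviations / trials
hoeffding = 2 * math.exp(-2 * n * eps ** 2)
print(empirical, hoeffding)  # the bound holds with room to spare
```

Here the bound evaluates to about 0.16, while the observed deviation frequency is an order of magnitude smaller; Hoeffding-type bounds are typically loose but have the right exponential shape in $\dpi{300}\inline n$.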
• 17-03-2016: We proved Hoeffding's Inequality, which relies on Hoeffding's Lemma (which we also proved). Then we proved McDiarmid's inequality and used it to prove that
$\dpi{300}\displaystyle \sup_{j\in{\mathbb{N}}}{\left| \hat p_j-p_j \right|}{\mathop{\longrightarrow}_{n\to\infty}} 0$

almost surely.

Next, we defined Uniform Glivenko-Cantelli and Universal UGC (UUGC) function classes. Suppose $\dpi{300}\inline ({\mathcal{X}},P)$ is a probability space and $\dpi{300}\inline {\mathcal{F}}\subseteq{\left\{ 0,1 \right\}}^{\mathcal{X}}$. Let $\dpi{300}\inline (X_1,\ldots,X_n)\sim P^n$ and define $\dpi{300}\inline P_nf=n^{-1}\sum_{i=1}^nf(X_i)$ and $\dpi{300}\inline Pf={\mathbb{E}}[f(X_1)]$. Define also

$\dpi{300}\displaystyle \Delta_n:=\sup_{f\in{\mathcal{F}}}{\left| Pf-P_nf \right|}.$

We defined $\dpi{300}\inline {\mathcal{F}}$ to be UGC with respect to $\dpi{300}\inline P$ if
$\dpi{300}\displaystyle {\mathbb{E}}[\Delta_n]{\mathop{\longrightarrow}_{n\to\infty}} 0$

and to be UUGC if
$\dpi{300}\displaystyle \lim_{n\to\infty} \sup_P {\mathbb{E}}_{P^n}[\Delta_n] = 0.$

(In both cases, almost-sure convergence, and hence convergence in probability, also follows.)

We observed that any finite $\dpi{300}\inline {\mathcal{F}}$ is UUGC and also exhibited an infinite UUGC class over $\dpi{300}\inline {\mathcal{X}}={\mathbb{N}}$: the class of indicator functions.
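To make $\dpi{300}\inline \Delta_n$ concrete, here is a small simulation (my own illustration, assuming the class of singleton indicators $\dpi{300}\inline f_j=\mathbf{1}\{x=j\}$ over $\dpi{300}\inline {\mathbb{N}}$ and a geometric distribution $\dpi{300}\inline P$, neither of which was fixed in the notes above):

```python
import random
from collections import Counter

# Sketch: Delta_n = sup_j |P_n f_j - P f_j| = max_j |empirical freq of j - P(j)|
# for the singleton-indicator class over the natural numbers, with P geometric.
random.seed(1)

def geometric(q=0.5):
    # Samples k >= 1 with P(k) = q * (1 - q)^(k - 1).
    k = 1
    while random.random() >= q:
        k += 1
    return k

def delta_n(n, q=0.5):
    counts = Counter(geometric(q) for _ in range(n))
    # Beyond j = 60 both the empirical and the true mass are negligible.
    support = set(counts) | set(range(1, 60))
    return max(abs(counts.get(j, 0) / n - q * (1 - q) ** (j - 1)) for j in support)

results = {n: delta_n(n) for n in (100, 1000, 10000)}
for n, d in results.items():
    print(n, d)  # Delta_n shrinks as n grows
```

The printed values decay roughly like $\dpi{300}\inline n^{-1/2}$, consistent with the uniform convergence the UUGC property asserts.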

• 07-04-2016: We showed how to control the expected uniform deviation $\dpi{300}\inline {\mathbb{E}}\sup_{f\in{\mathcal{F}}}(Pf-P_nf)$ in terms of Rademacher averages. The lecture material is outlined here, with references to sources; see also these lecture notes. Along the way, we proved that $\dpi{300}\inline {\mathbb{E}}\max_{1\le i\le N}Y_i\le\gamma\sqrt{2\log N}$, where $\dpi{300}\inline Y_1,\ldots,Y_N$ are independent $\dpi{300}\inline \gamma$-subgaussian random variables. This proof also shows that $\dpi{300}\inline {\mathbb{E}}\max_{1\le i\le N}|Y_i|\le\gamma\sqrt{2\log(2N)}$; can you see why? [The answer is in the linked note.] It is also true that for independent Gaussian random variables $\dpi{300}\inline Y_i$ with variance $\dpi{300}\inline \gamma^2$, $\dpi{300}\inline {\mathbb{E}}\max_{1\le i\le N}Y_i\ge C\gamma\sqrt{\log N}$ for some universal constant $\dpi{300}\inline C>0$; one may take $\dpi{300}\inline C=1/\sqrt{\pi\log 2}$.
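A quick Monte Carlo check of the two-sided sandwich on $\dpi{300}\inline {\mathbb{E}}\max_{1\le i\le N}Y_i$ in the standard Gaussian case $\dpi{300}\inline \gamma=1$ (an illustration, not from the lecture):

```python
import math
import random

# Sketch: for N independent standard Gaussians (gamma = 1), the expected maximum
# should lie between sqrt(log N) / sqrt(pi * log 2) and sqrt(2 * log N).
random.seed(2)

def mean_max(N, trials=3000):
    # Monte Carlo estimate of E[max_{1 <= i <= N} Y_i].
    return sum(max(random.gauss(0.0, 1.0) for _ in range(N))
               for _ in range(trials)) / trials

for N in (10, 100, 1000):
    m = mean_max(N)
    upper = math.sqrt(2 * math.log(N))
    lower = math.sqrt(math.log(N)) / math.sqrt(math.pi * math.log(2))
    print(N, lower, m, upper)
```

For each $\dpi{300}\inline N$ the estimate sits strictly between the two bounds, and the $\dpi{300}\inline \sqrt{\log N}$ growth is clearly visible.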

• 14-04-2016: We proved the Perles-Sauer-Shelah-VC lemma. Our proof was by induction, along the same lines as here, or here (don't have the Kearns-Vazirani book?! Write to me and I'll provide a scan of the relevant section). In fact, there are at least four distinct proofs of this fundamental combinatorial fact; three are presented here and my favorite is via Pajor's lemma (which I unfortunately didn't have time to present in class). It is in general not easy to compute the VC-dimension of a hypothesis class exactly, but various compositional properties of the VC-dimension are also known.

In the second half of the class, I explained the connection between Universal Glivenko-Cantelli classes and PAC learnability. The material is summarized in these notes; please read them carefully.
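The Perles-Sauer-Shelah-VC lemma from the first half of the lecture can be checked on a concrete class. Here is a small sketch (my own example) for intervals $\dpi{300}\inline [a,b]$ on the line, which have VC-dimension 2 and in fact meet the bound $\dpi{300}\inline \Phi_d(n)=\sum_{i\le d}\binom{n}{i}$ with equality:

```python
from itertools import combinations
from math import comb

# Sketch: count the subsets of n collinear points picked out by intervals
# and compare with the Sauer-Shelah bound Phi_d(n) for d = 2.

def interval_labelings(n):
    points = list(range(n))
    picked = {frozenset()}  # the empty interval picks out the empty set
    for i, j in combinations(range(n + 1), 2):
        picked.add(frozenset(points[i:j]))  # contiguous runs of points
    return len(picked)

def sauer_bound(n, d):
    return sum(comb(n, i) for i in range(d + 1))

for n in (1, 2, 5, 10):
    g, bound = interval_labelings(n), sauer_bound(n, 2)
    print(n, g, bound)  # equal for intervals: the bound is tight here
```

Classes meeting the bound with equality are called maximum classes; most classes realize strictly fewer labelings.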

• 05-05-2016: We showed that UUGC implies agnostic PAC learnability; this is proved in Theorem 4.1 here. Let $\dpi{300}\inline \hat h_S\in{\mathcal{H}}$ be the Empirical Risk Minimizer of the sample $\dpi{300}\inline S\sim{\mathcal{D}}^n$ and define the Excess Risk

$\dpi{300}\displaystyle R = {\operatorname{err}}(\hat h_S,{\mathcal{D}})-\inf_{h\in{\mathcal{H}}}{\operatorname{err}}(h,{\mathcal{D}}).$

We showed that, for fixed confidence parameter $\dpi{300}\inline \delta$, we have
$\dpi{300}\displaystyle R = O{\left( \sqrt{\frac{d}{n}\log\frac{n}{d}} \right)},$

and claimed that the $\dpi{300}\inline \log\frac{n}{d}$ factor can be removed with (much) additional effort. For now, however, we are focusing on a matching lower bound:
$\dpi{300}\displaystyle R = \Omega{\left( \sqrt{\frac{d}{n}} \right)},$

meaning that for any learner and any sample size, we can construct an adversarial distribution that forces an excess risk proportional to $\dpi{300}\inline \sqrt{\frac{d}{n}}$. We started in this direction by analyzing the optimal predictor of a random coin's bias and proving Pinsker's inequality.
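Pinsker's inequality can be verified numerically in the Bernoulli case, where the total variation distance is simply $\dpi{300}\inline |p-q|$ (a sanity-check sketch, not part of the lecture):

```python
import math

# Sketch: Pinsker's inequality TV(P, Q) <= sqrt(KL(P || Q) / 2),
# checked for pairs of Bernoulli distributions on a grid of biases.

def tv(p, q):
    # Total variation distance between Bernoulli(p) and Bernoulli(q).
    return abs(p - q)

def kl(p, q):
    # KL divergence KL(Bernoulli(p) || Bernoulli(q)) in nats.
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for p in (0.1, 0.3, 0.5, 0.7):
    for q in (0.2, 0.4, 0.6, 0.8):
        assert tv(p, q) <= math.sqrt(kl(p, q) / 2) + 1e-12
print("Pinsker holds on all tested pairs")
```

The slack in the inequality is smallest when the two biases are close, which is exactly the regime exploited in the lower-bound construction: distributions the learner cannot reliably distinguish.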
• 26-05-2016: We proved the $\dpi{300}\inline \Omega{\left( \sqrt{\frac{d}{n}} \right)}$ agnostic PAC lower bound (actually, a lower bound on the minimax expected excess risk, which you're asked to convert to a sample complexity lower bound in one of the exercises).
• 09-06-2016: We proved the $\dpi{300}\inline O(\sqrt{d/n})$ agnostic PAC upper bound using the Dudley chaining integral and the Dudley-Pollard packing bound.