Class materials

The following textbooks will be useful for this course:

Lectures

  • 10-03-2016: We proved Markov's and Chebyshev's inequalities, and used these to conclude that the Maximum Likelihood Estimator \hat p_n, computed from \mathrm{Bernoulli}(p) random variables X_1,X_2,\ldots,X_n, converges in probability to the true value p. (This is essentially the Weak Law of Large Numbers; the Chebyshev step is spelled out at the end of this item.) Later, we stated (but did not yet prove) the Chernoff-Hoeffding inequality, and showed how it implies the almost-sure convergence of \hat p_n to p via the Borel-Cantelli lemma. In our proof of the Borel-Cantelli lemma, we used two facts. Fact 1: For non-negative x_1,x_2,\ldots such that
    \sum_{k=1}^\infty x_k<\infty,

    we have
    \lim_{n\to\infty}\sum_{k=n}^\infty x_k=0.

    Fact 2:
     \sum_{n=1}^\infty e^{-2n^{1/3}}<\infty.

    You should prove both for yourself.
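
    For reference, here is the Chebyshev step behind the convergence in probability (a routine computation): since {\mathbb{E}}[\hat p_n]=p and \mathrm{Var}(\hat p_n)=\frac{p(1-p)}{n}\le\frac{1}{4n}, for every \varepsilon>0 we have
     {\mathbb{P}}{\left[ {\left| \hat p_n-p \right|}>\varepsilon \right]}\le\frac{\mathrm{Var}(\hat p_n)}{\varepsilon^2}\le\frac{1}{4n\varepsilon^2}{\mathop{\longrightarrow}_{n\to\infty}}0.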
  • 17-03-2016: We proved Hoeffding's inequality, which relies on Hoeffding's lemma (also proved in class). Then we proved McDiarmid's inequality and used it to prove that
    \sup_{j\in{\mathbb{N}}}{\left| \hat p_{n,j}-p_j \right|}{\mathop{\longrightarrow}_{n\to\infty}} 0

    almost surely.

    Next, we defined Uniform Glivenko-Cantelli and Universal UGC (UUGC) function classes. Suppose ({\mathcal{X}},P) is a probability space and {\mathcal{F}}\subseteq{\left\{ 0,1 \right\}}^{\mathcal{X}}. Let (X_1,\ldots,X_n)\sim P^n and define P_nf=n^{-1}\sum_{i=1}^nf(X_i) and Pf={\mathbb{E}}[f(X_1)]. Define also

    \Delta_n:=\sup_{f\in{\mathcal{F}}}{\left| Pf-P_nf \right|}.

    We defined {\mathcal{F}} to be UGC with respect to P if
     {\mathbb{E}}[\Delta_n]{\mathop{\longrightarrow}_{n\to\infty}} 0

    and to be UUGC if
     \lim_{n\to\infty} \sup_P {\mathbb{E}}_{P^n}[\Delta_n] = 0.

    (In both cases, convergence almost surely – and hence in probability – also follows.)

    We observed that any finite {\mathcal{F}} is UUGC and also exhibited an infinite UUGC class over {\mathcal{X}}={\mathbb{N}}: the class of singleton indicator functions {\left\{ {\mathbf{1}}_{\{j\}}:j\in{\mathbb{N}} \right\}}.
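
    To see the finite case quantitatively (a quick sketch using only Hoeffding's inequality and a union bound): since each f takes values in {\left\{ 0,1 \right\}}, for every \varepsilon>0,
     {\mathbb{P}}{\left[ \Delta_n>\varepsilon \right]}\le 2{\left| {\mathcal{F}} \right|}e^{-2n\varepsilon^2},

    and, because \Delta_n\le 1, choosing \varepsilon=\sqrt{\log(2n{\left| {\mathcal{F}} \right|})/(2n)} gives
     {\mathbb{E}}[\Delta_n]\le\varepsilon+{\mathbb{P}}{\left[ \Delta_n>\varepsilon \right]}\le\sqrt{\frac{\log(2n{\left| {\mathcal{F}} \right|})}{2n}}+\frac{1}{n},

    a bound independent of P, so finite classes are indeed UUGC.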

  • 07-04-2016: We showed how to control the expected uniform deviation {\mathbb{E}}\sup_{f\in{\mathcal{F}}}(Pf-P_nf) in terms of Rademacher averages. The lecture material is outlined here, with references to sources; see also these lecture notes. Along the way, we proved that {\mathbb{E}}\max_{1\le i\le N}Y_i\le\gamma\sqrt{2\log N}, where Y_1,\ldots,Y_N are independent \gamma-sub-Gaussian random variables. This proof also shows that {\mathbb{E}}\max_{1\le i\le N}|Y_i|\le\gamma\sqrt{2\log(2N)}; can you see why? [The answer is in the linked note.] It is also true that for independent Gaussian random variables Y_i with variance \gamma^2, {\mathbb{E}}\max_{1\le i\le N}Y_i\ge C\gamma\sqrt{\log N} for some universal constant C>0; one may take C=1/\sqrt{\pi\log 2}.
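
    A compact way to see the maximal inequality (the standard exponential-moment argument, sketched here for reference): for any \lambda>0,
     {\mathbb{E}}\max_{1\le i\le N}Y_i\le\frac{1}{\lambda}\log{\mathbb{E}}{\left[ e^{\lambda\max_{1\le i\le N}Y_i} \right]}\le\frac{1}{\lambda}\log\sum_{i=1}^N{\mathbb{E}}{\left[ e^{\lambda Y_i} \right]}\le\frac{\log N}{\lambda}+\frac{\lambda\gamma^2}{2},

    where the first step is Jensen's inequality and the last uses \gamma-sub-Gaussianity; optimizing over \lambda (take \lambda=\sqrt{2\log N}/\gamma) yields \gamma\sqrt{2\log N}.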

  • 14-04-2016: We proved the Perles-Sauer-Shelah-VC lemma (its statement is recalled at the end of this item). Our proof was by induction, along the same lines as here, or here (don't have the Kearns-Vazirani book?! Write to me and I'll provide a scan of the relevant section). In fact, there are at least 4 distinct proofs of this fundamental combinatorial fact; three are presented here and my favorite is via Pajor's lemma (which I unfortunately didn't have time to present in class). It is in general not easy to compute the VC-dimension of a hypothesis class exactly, but various compositional properties of the VC-dimension are also known.

    In the second half of the class, I explained the connection between Universal Glivenko-Cantelli classes and PAC learnability. The material is summarized in these notes; please read them carefully.
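
    For reference, the lemma in one standard form: if {\mathcal{H}} has VC-dimension d, then for any n\ge d points x_1,\ldots,x_n,
     {\left| {\left\{ (h(x_1),\ldots,h(x_n)):h\in{\mathcal{H}} \right\}} \right|}\le\sum_{i=0}^{d}\binom{n}{i}\le{\left( \frac{en}{d} \right)}^d.

    A small worked example: the threshold class {\left\{ x\mapsto{\mathbf{1}}[x\le t]:t\in{\mathbb{R}} \right\}} on {\mathcal{X}}={\mathbb{R}} shatters any single point but no pair x_1<x_2, since the labeling (0,1) is unrealizable; hence its VC-dimension is 1, and the lemma caps the number of labelings of n points at n+1.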

  • 05-05-2016: We showed that UUGC implies agnostic PAC learnability; this is proved in Theorem 4.1 here, and the risk decomposition at the heart of the reduction is recalled at the end of this item. Let \hat h_S\in{\mathcal{H}} be the Empirical Risk Minimizer on the sample S\sim{\mathcal{D}}^n and define the Excess Risk

    R = {\operatorname{err}}(\hat h_S,{\mathcal{D}})-\inf_{h\in{\mathcal{H}}}{\operatorname{err}}(h,{\mathcal{D}}).

    We showed that, for fixed confidence parameter \delta, we have
     R = O{\left(  \sqrt{\frac{d}{n}\log\frac{n}{d}} \right)},

    and claimed that the \log\frac{n}{d} factor can be removed with (much) additional effort. For now, however, we focus on a matching lower bound:
    R = \Omega{\left(  \sqrt{\frac{d}{n}} \right)},

    meaning that for any learner and any sample size, we can construct an adversarial distribution that forces an excess risk proportional to \sqrt{\frac{d}{n}}. We started in this direction by analyzing the optimal predictor of a random coin's bias and proving Pinsker's inequality.
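
    To recall the key step in the reduction (writing {\operatorname{err}}(h,S) for the empirical error of h on the sample S): for any h\in{\mathcal{H}}, since \hat h_S minimizes the empirical error,
     {\operatorname{err}}(\hat h_S,{\mathcal{D}})-{\operatorname{err}}(h,{\mathcal{D}})\le{\left[ {\operatorname{err}}(\hat h_S,{\mathcal{D}})-{\operatorname{err}}(\hat h_S,S) \right]}+{\left[ {\operatorname{err}}(h,S)-{\operatorname{err}}(h,{\mathcal{D}}) \right]}\le 2\sup_{h'\in{\mathcal{H}}}{\left| {\operatorname{err}}(h',{\mathcal{D}})-{\operatorname{err}}(h',S) \right|},

    and since this holds for every h\in{\mathcal{H}}, it follows that R\le 2\sup_{h\in{\mathcal{H}}}{\left| {\operatorname{err}}(h,{\mathcal{D}})-{\operatorname{err}}(h,S) \right|}, which is exactly the uniform deviation that the UUGC property (applied to the loss class) drives to 0.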
  • 26-05-2016: We proved the \Omega{\left(  \sqrt{\frac{d}{n}} \right)} agnostic PAC lower bound (actually, a lower bound on the minimax expected excess risk – which you're asked to convert into a sample complexity lower bound in one of the exercises).
  • 09-06-2016: We proved the O(\sqrt{d/n}) agnostic PAC upper bound (removing the logarithmic factor) using the Dudley chaining integral and the Dudley-Pollard packing bound.
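
    Schematically (up to universal constants), the two ingredients combine as follows for a {\left\{ 0,1 \right\}}-valued class {\mathcal{F}} of VC-dimension d: the Dudley-Pollard bound gives empirical covering numbers N(\varepsilon,{\mathcal{F}},L_2(P_n))\le(C/\varepsilon)^{Kd} for universal constants C,K, and plugging this into the chaining integral yields
     {\mathbb{E}}\sup_{f\in{\mathcal{F}}}(Pf-P_nf)\lesssim{\mathbb{E}}{\left[ \frac{1}{\sqrt n}\int_0^1\sqrt{\log N{\left( \varepsilon,{\mathcal{F}},L_2(P_n) \right)}}\,d\varepsilon \right]}\le\frac{1}{\sqrt n}\int_0^1\sqrt{Kd\log(C/\varepsilon)}\,d\varepsilon\lesssim\sqrt{\frac{d}{n}},

    since the last integral is finite; here \lesssim hides universal constants.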