Class materials
The following textbooks will be useful for this course:
 Foundations of Machine Learning by Mohri, Rostamizadeh and Talwalkar
 Understanding Machine Learning by Shai Shalev-Shwartz and Shai Ben-David
 Neural Network Learning: Theoretical Foundations by Martin Anthony and Peter L. Bartlett
 A Probabilistic Theory of Pattern Recognition by Luc Devroye, László Györfi and Gábor Lugosi
 A Theory of Learning and Generalization by M. Vidyasagar
Lectures

10-03-2016: We proved Markov's and Chebyshev's inequalities, and used these to conclude that the Maximum Likelihood Estimator $\hat{\theta}_n$, applied to i.i.d. random variables $X_1,\dots,X_n$, converges in probability to the true value $\theta$. (This is essentially the Weak Law of Large Numbers.) Later, we stated (but did not yet prove) the Chernoff-Hoeffding inequality, and showed how it implies the almost-sure convergence of $\hat{\theta}_n$ to $\theta$ via the Borel-Cantelli lemma. In our proof of the Borel-Cantelli lemma, we used two facts. Fact 1: for nonnegative $a_n$ such that
$$\sum_{n=1}^{\infty} a_n < \infty,$$
we have
$$\lim_{N\to\infty} \sum_{n \ge N} a_n = 0.$$
Fact 2: for any sequence of events $A_n$,
$$\mathbb{P}\Big(\bigcup_{n} A_n\Big) \le \sum_{n} \mathbb{P}(A_n).$$
You should prove both for yourself.
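The gap between Chebyshev-type and Hoeffding-type tail bounds is easy to see numerically. Below is a small sketch (the Bernoulli coin setting and all parameter values are my own illustrative choices, not from the lecture): we estimate the deviation probability of the sample mean by Monte Carlo and compare it with both bounds.

```python
import math
import random

def deviation_prob(n, theta, eps, trials=5000):
    """Monte Carlo estimate of P(|theta_hat_n - theta| >= eps), where
    theta_hat_n is the sample mean (the MLE) of n Bernoulli(theta) coins."""
    hits = 0
    for _ in range(trials):
        mean = sum(random.random() < theta for _ in range(n)) / n
        if abs(mean - theta) >= eps:
            hits += 1
    return hits / trials

random.seed(0)
n, theta, eps = 200, 0.3, 0.1
emp = deviation_prob(n, theta, eps)
chebyshev = theta * (1 - theta) / (n * eps ** 2)  # Var(X) / (n * eps^2)
hoeffding = 2 * math.exp(-2 * n * eps ** 2)       # two-sided Hoeffding bound
print(f"empirical={emp:.4f}  chebyshev={chebyshev:.4f}  hoeffding={hoeffding:.4f}")
```

On these parameters Hoeffding is markedly tighter than Chebyshev, and both are loose relative to the true deviation probability.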
17-03-2016: We proved Hoeffding's Inequality, which relies on Hoeffding's Lemma (which we also proved). Then we proved McDiarmid's inequality and used it to prove that $\hat{\theta}_n \to \theta$ almost surely. Next, we defined Uniform Glivenko-Cantelli (UGC) and Universal UGC (UUGC) function classes. Suppose $(\mathcal{X}, \mu)$ is a probability space and $\mathcal{F} \subseteq [0,1]^{\mathcal{X}}$. Let $X_1,\dots,X_n$ be drawn i.i.d. from $\mu$ and define $\mu(f) := \mathbb{E}_{X\sim\mu} f(X)$ and $\mu_n(f) := \frac{1}{n}\sum_{i=1}^{n} f(X_i)$. Define also
$$\Delta_n(\mathcal{F}) := \sup_{f\in\mathcal{F}} |\mu_n(f) - \mu(f)|.$$
We defined $\mathcal{F}$ to be UGC with respect to $\mu$ if
$$\lim_{n\to\infty} \mathbb{E}\,\Delta_n(\mathcal{F}) = 0,$$
and to be UUGC if
$$\lim_{n\to\infty} \sup_{\mu} \mathbb{E}\,\Delta_n(\mathcal{F}) = 0.$$
(In both cases, convergence almost surely – and hence in probability – also follows.) We observed that any finite $\mathcal{F}$ is UUGC and also exhibited an infinite UUGC class over $\mathbb{R}$: the class of threshold indicator functions $\{\mathbf{1}_{(-\infty,t]} : t \in \mathbb{R}\}$.
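For the class of threshold indicators $\{\mathbf{1}_{(-\infty,t]} : t \in \mathbb{R}\}$ (my assumed reading of the indicator class above), the uniform deviation $\sup_t |F_n(t) - F(t)|$ is exactly the Kolmogorov-Smirnov statistic, and it can be computed exactly from a sorted sample since the supremum is attained at sample points. A small sketch, with a Uniform[0,1] example of my own choosing:

```python
import random

def ks_statistic(sample, cdf):
    """Uniform deviation sup_t |F_n(t) - F(t)| over the threshold class:
    the supremum is attained at sample points, where F_n jumps."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # the empirical CDF jumps from i/n to (i+1)/n at x
        d = max(d, abs(cdf(x) - i / n), abs(cdf(x) - (i + 1) / n))
    return d

random.seed(0)
for n in (100, 1000, 10000):
    sample = [random.random() for _ in range(n)]  # Uniform[0,1]: cdf(x) = x
    print(n, round(ks_statistic(sample, lambda x: x), 4))
```

The printed deviations shrink at roughly the $1/\sqrt{n}$ rate that the concentration machinery predicts.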

07-04-2016: We showed how to control the expected uniform deviation $\mathbb{E}\,\Delta_n(\mathcal{F})$ in terms of Rademacher averages. The lecture material is outlined here, with references to sources; see also these lecture notes. Along the way, we proved that $\mathbb{E}\max_{i\le n} X_i \le \sigma\sqrt{2\log n}$, where the $X_i$ are independent $\sigma$-subgaussian random variables. This proof also shows that $\mathbb{E}\max_{i\le n} |X_i| \le \sigma\sqrt{2\log(2n)}$; can you see why? [The answer is in the linked note.] It is also true that
$$\mathbb{E}\max_{i\le n} X_i \ge c\,\sigma\sqrt{\log n}$$
for independent Gaussian random variables with variance $\sigma^2$ and some universal constant $c > 0$.
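A quick Monte Carlo sanity check of the $\sigma\sqrt{2\log n}$ bound on the expected maximum, in the Gaussian case (the sample sizes and trial counts are arbitrary choices of mine):

```python
import math
import random

def expected_max_gaussian(n, sigma=1.0, trials=2000):
    """Monte Carlo estimate of E[max_{i<=n} X_i] for i.i.d. N(0, sigma^2)."""
    total = 0.0
    for _ in range(trials):
        total += max(random.gauss(0.0, sigma) for _ in range(n))
    return total / trials

random.seed(1)
for n in (10, 100, 1000):
    est = expected_max_gaussian(n)
    bound = math.sqrt(2 * math.log(n))  # sigma * sqrt(2 log n) with sigma = 1
    print(n, round(est, 3), "<=", round(bound, 3))
```

The estimates grow like $\sqrt{\log n}$, sitting a constant factor below the upper bound, which is consistent with the matching lower bound above.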

14-04-2016:
We proved the Perles-Sauer-Shelah-VC lemma. Our proof was by induction, along the same lines as here, or here (don't have the Kearns-Vazirani book?! Write to me and I'll provide a scan of the relevant section). In fact, there are at least 4 distinct proofs of this fundamental combinatorial fact; three are presented here, and my favorite is via Pajor's lemma (which I unfortunately didn't have time to present in class). It is in general not easy to compute the VC-dimension of a hypothesis class exactly, but various compositional properties of the VC-dimension are known.
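The lemma is easy to check by brute force on a small example. The sketch below (the interval class and point set are my own choices) counts the labelings realized on $n$ points by closed intervals on the line, a class of VC-dimension 2, and compares the count with the Sauer-Shelah bound $\Phi_d(n) = \sum_{i=0}^{d}\binom{n}{i}$; intervals in fact meet the bound with equality.

```python
from math import comb

def growth_function(points, hypotheses):
    """Number of distinct labelings of `points` realized by `hypotheses`."""
    return len({tuple(h(x) for x in points) for h in hypotheses})

def sauer_bound(n, d):
    """Phi_d(n) = sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

def interval(a, b):
    """Indicator of the closed interval [a, b]."""
    return lambda x: int(a <= x <= b)

n = 8
points = list(range(n))
# enough integer-endpoint intervals to realize every achievable labeling,
# including the all-zeros labeling (via the interval [-1, -1])
hyps = [interval(a, b) for a in range(-1, n + 1) for b in range(a, n + 1)]
print(growth_function(points, hyps), sauer_bound(n, 2))  # both equal 37 here
```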
In the second half of the class, I explained the connection between Universal Glivenko-Cantelli classes and PAC learnability. The material is summarized in these notes; please read them carefully.
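To make the connection concrete in the simplest possible case, here is a toy ERM sketch (thresholds on [0,1] with realizable labels; the setting is my own choice, not from the notes). Under Uniform[0,1] data, the true risk of a threshold $t$ against the target $t^*$ is just $|t - t^*|$, so we can watch it shrink with $n$:

```python
import random

def erm_threshold(sample):
    """ERM over thresholds h_t(x) = 1[x >= t]: return a t minimizing the
    empirical 0-1 loss; it suffices to search over the sample points."""
    candidates = [0.0] + [x for x, _ in sample] + [1.0]
    def emp_risk(t):
        return sum((x >= t) != y for x, y in sample) / len(sample)
    return min(candidates, key=emp_risk)

random.seed(2)
true_t = 0.6  # target threshold (arbitrary choice)

def draw(n):
    """n points from Uniform[0,1], labeled by the target threshold."""
    return [(x, x >= true_t) for x in (random.random() for _ in range(n))]

for n in (10, 100, 1000):
    t_hat = erm_threshold(draw(n))
    # under Uniform[0,1], the true risk of h_{t_hat} equals |t_hat - true_t|
    print(n, round(abs(t_hat - true_t), 4))
```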

05-05-2016: We showed that UUGC implies agnostic PAC learnability; this is proved in Theorem 4.1 here. Let $\hat{f}_n$ be the Empirical Risk Minimizer of the sample and define the Excess Risk
$$\mathrm{ER}(\hat{f}_n) := R(\hat{f}_n) - \inf_{f\in\mathcal{F}} R(f).$$
We showed that, for fixed confidence parameter $\delta$, with probability at least $1-\delta$ we have
$$\mathrm{ER}(\hat{f}_n) \le O\!\left(\sqrt{\frac{d\log n + \log(1/\delta)}{n}}\right),$$
where $d$ is the VC-dimension of $\mathcal{F}$, and claimed that the $\log n$ factor can be removed with (much) additional effort. For now, however, we are focusing on a matching lower bound:
$$\inf_{\hat{f}} \sup_{P} \mathbb{E}\,\mathrm{ER}(\hat{f}_n) \ge c\sqrt{\frac{d}{n}},$$
meaning that for any learner and any sample size, we can construct an adversarial distribution that forces an excess risk proportional to $\sqrt{d/n}$. We started in this direction by analyzing the optimal predictor of a random coin's bias and proving Pinsker's inequality.

26-05-2016: We proved the agnostic PAC lower bound (actually, a lower bound on the minimax expected excess risk – which you're asked to convert to a sample complexity lower bound in one of the exercises).
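Pinsker's inequality, $\mathrm{TV}(P,Q) \le \sqrt{\mathrm{KL}(P\|Q)/2}$, is easy to sanity-check numerically in the Bernoulli case relevant to the coin-bias argument (the parameter grid below is an arbitrary choice of mine):

```python
import math

def kl_bernoulli(p, q):
    """KL(Ber(p) || Ber(q)) in nats; 0 < q < 1 assumed."""
    kl = 0.0
    if p > 0:
        kl += p * math.log(p / q)
    if p < 1:
        kl += (1 - p) * math.log((1 - p) / (1 - q))
    return kl

def tv_bernoulli(p, q):
    """Total variation distance between Ber(p) and Ber(q)."""
    return abs(p - q)

# Pinsker: TV(P, Q) <= sqrt(KL(P||Q) / 2); check on a small grid
for p in (0.0, 0.1, 0.3, 0.5, 0.9, 1.0):
    for q in (0.2, 0.5, 0.8):
        assert tv_bernoulli(p, q) <= math.sqrt(kl_bernoulli(p, q) / 2) + 1e-12
print("Pinsker holds on the grid")
```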
09-06-2016: We proved the agnostic PAC upper bound using the Dudley chaining integral and the Dudley-Pollard packing bound.
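For reference, one standard form of the chaining bound (the constants differ across sources; this is a common textbook version, not necessarily the exact statement from class):

```latex
% Dudley's entropy-integral bound on the empirical Rademacher average of a
% class F, via covering numbers in the empirical L2(mu_n) metric:
\widehat{\mathfrak{R}}_n(\mathcal{F})
  \;\le\; \inf_{\alpha \ge 0} \left( 4\alpha
  \;+\; \frac{12}{\sqrt{n}} \int_{\alpha}^{\infty}
        \sqrt{\log \mathcal{N}\bigl(\mathcal{F}, \varepsilon, L_2(\mu_n)\bigr)}
        \, d\varepsilon \right)
```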