Class materials
The following textbooks will be useful for this course:
- Foundations of Machine Learning by Mohri, Rostamizadeh and Talwalkar
- Understanding Machine Learning by Shai Shalev-Shwartz and Shai Ben-David
- Neural Network Learning: Theoretical Foundations by Martin Anthony and Peter L. Bartlett
- A Probabilistic Theory of Pattern Recognition by Luc Devroye, László Györfi and Gábor Lugosi
- A Theory of Learning and Generalization by M. Vidyasagar
Lectures
- 10-03-2016: We proved Markov's and Chebyshev's inequalities, and used these to conclude that the Maximum Likelihood Estimator $\hat{p}_n = \frac{1}{n}\sum_{i=1}^n X_i$, applied to $\mathrm{Bernoulli}(p)$ random variables $X_1,\dots,X_n$, converges in probability to the true value $p$. (This is essentially the Weak Law of Large Numbers.) Later, we stated (but did not yet prove) the Chernoff-Hoeffding inequality, and showed how it implies the almost-sure convergence of $\hat{p}_n$ to $p$ via the Borel-Cantelli lemma. In our proof of the Borel-Cantelli lemma, we used two facts. Fact 1: for non-negative $a_1, a_2, \dots$ with $\sum_n a_n < \infty$, the tail sums satisfy $\sum_{n \ge N} a_n \to 0$ as $N \to \infty$. Fact 2: $\mathbb{P}\big(\bigcup_n A_n\big) \le \sum_n \mathbb{P}(A_n)$ for any sequence of events $A_1, A_2, \dots$
You should prove both for yourself. (A small simulation of these tail bounds appears after the lecture list.)
- 17-03-2016: We proved Hoeffding's Inequality, which relies on Hoeffding's Lemma (which we also proved). Then we proved McDiarmid's inequality and used it to prove that the uniform deviation $\Delta_n(\mathcal{F})$ defined below converges to its expectation almost surely. Next, we defined Uniform Glivenko-Cantelli (UGC) and Universal UGC (UUGC) function classes. Suppose $(\mathcal{X}, \mathcal{A}, P)$ is a probability space and $\mathcal{F}$ is a class of functions from $\mathcal{X}$ to $[0,1]$. Let $X_1,\dots,X_n$ be drawn i.i.d. from $P$ and define $Pf = \mathbb{E}_{X \sim P}[f(X)]$ and $P_n f = \frac{1}{n}\sum_{i=1}^n f(X_i)$. Define also $\Delta_n(\mathcal{F}) = \sup_{f \in \mathcal{F}} |P_n f - Pf|$. We defined $\mathcal{F}$ to be UGC with respect to $P$ if $\mathbb{E}[\Delta_n(\mathcal{F})] \to 0$ as $n \to \infty$, and to be UUGC if $\sup_P \mathbb{E}[\Delta_n(\mathcal{F})] \to 0$. (In both cases, convergence almost surely – and hence in probability – also follows.) We observed that any finite $\mathcal{F}$ is UUGC and also exhibited an infinite UUGC class over $\mathbb{R}$: the class of indicator functions. (An empirical look at $\Delta_n$ for such a class also appears after the lecture list.)
- 07-04-2016: We showed how to control the expected uniform deviation $\mathbb{E}[\Delta_n(\mathcal{F})]$ in terms of Rademacher averages. The lecture material is outlined here, with references to sources; see also these lecture notes. Along the way, we proved that $\mathbb{E}\big[\max_{1 \le i \le n} X_i\big] \le \sigma\sqrt{2\ln n}$, where $X_1,\dots,X_n$ are independent $\sigma$-subgaussian random variables. This proof also shows that $\mathbb{E}\big[\max_{1 \le i \le n} |X_i|\big] \le \sigma\sqrt{2\ln(2n)}$; can you see why? [The answer is in the linked note.] It is also true that for independent Gaussian random variables $X_1,\dots,X_n$ with variance $\sigma^2$, we have $\mathbb{E}\big[\max_{1 \le i \le n} X_i\big] \ge c\,\sigma\sqrt{\ln n}$ for some explicit universal constant $c > 0$. (A quick numerical check of the maximal inequality appears after the lecture list.)
- 14-04-2016: We proved the Perles-Sauer-Shelah-VC lemma. Our proof was by induction, along the same lines as here or here (don't have the Kearns-Vazirani book? Write to me and I'll provide a scan of the relevant section). In fact, there are at least four distinct proofs of this fundamental combinatorial fact; three are presented here, and my favorite is via Pajor's lemma (which I unfortunately didn't have time to present in class). It is in general not easy to compute the VC-dimension of a hypothesis class exactly, but various compositional properties of the VC-dimension are also known. (A small enumeration checking the bound on an example class appears after the lecture list.)
In the second half of the class, I explained the connection between Universal Glivenko-Cantelli classes and PAC learnability. The material is summarized in these notes; please read them carefully.
- 05-05-2016: We showed that UUGC implies agnostic PAC learnability; this is proved in Theorem 4.1 here. Let $\hat{f}_n$ be the Empirical Risk Minimizer over $\mathcal{F}$ computed from the sample $(X_1,Y_1),\dots,(X_n,Y_n)$, and define the Excess Risk of $\hat{f}_n$ as $R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f)$, where $R(f)$ is the risk of $f$ under the underlying distribution. We showed that, for a fixed confidence parameter and a class of VC-dimension $d$, the excess risk is at most of order $\sqrt{d \ln n / n}$, and claimed that the logarithmic factor can be removed with (much) additional effort. For now, however, we are focusing on a matching lower bound of order $\sqrt{d/n}$, meaning that for any learner and any sample size, we can construct an adversarial distribution that forces an excess risk proportional to $\sqrt{d/n}$. We started in this direction by analyzing the optimal predictor of a random coin's bias and proving Pinsker's inequality. (The decomposition behind the upper bound is sketched after the lecture list.)
- 26-05-2016: We proved the $\Omega\big(\sqrt{d/n}\big)$ agnostic PAC lower bound (actually, a lower bound on the minimax expected excess risk – which you're asked to convert to a sample complexity lower bound in one of the exercises). (A sketch of the two-coin calculation appears after the lecture list.)
- 09-06-2016: We proved the $O\big(\sqrt{d/n}\big)$ agnostic PAC upper bound using the Dudley chaining integral and the Dudley-Pollard packing bound. (The chaining calculation is sketched after the lecture list.)
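Supplementary sketches

For the 10-03 lecture: a minimal simulation sketch (my addition, not part of the course materials), assuming the Bernoulli coin-bias setting that recurs later in the course. It estimates the deviation probability $\mathbb{P}(|\hat{p}_n - p| \ge t)$ empirically and compares it with the Chebyshev and Chernoff-Hoeffding tail bounds; the parameter values are arbitrary.

```python
# Sketch (assumed Bernoulli setting): compare the empirical deviation
# probability of the MLE p_hat_n with the Chebyshev and Hoeffding bounds.
import numpy as np

rng = np.random.default_rng(0)
p, t, trials = 0.3, 0.05, 20_000

for n in (100, 400, 1600):
    # MLE of a Bernoulli(p) parameter: the sample mean of n coin flips.
    p_hat = rng.binomial(n, p, size=trials) / n
    empirical = np.mean(np.abs(p_hat - p) >= t)
    chebyshev = p * (1 - p) / (n * t ** 2)   # Var(p_hat) / t^2
    hoeffding = 2 * np.exp(-2 * n * t ** 2)  # Chernoff-Hoeffding bound
    print(f"n={n:5d}  empirical={empirical:.4f}  "
          f"Chebyshev<={min(chebyshev, 1):.4f}  Hoeffding<={min(hoeffding, 1):.4f}")
```

Note how the Hoeffding bound decays exponentially in $n$ while the Chebyshev bound decays only like $1/n$; the exponential decay is what makes $\sum_n \mathbb{P}(|\hat{p}_n - p| \ge t)$ finite, so that Borel-Cantelli (via Facts 1 and 2) upgrades convergence in probability to almost-sure convergence.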
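For the 17-03 lecture: the infinite UUGC example is recorded above only as "the class of indicator functions"; the sketch below assumes the classical choice of half-line indicators $x \mapsto \mathbf{1}[x \le t]$, $t \in \mathbb{R}$, for which $\Delta_n(\mathcal{F}) = \sup_t |F_n(t) - F(t)|$ is the Kolmogorov-Smirnov statistic and can be computed exactly from the sorted sample.

```python
# Sketch: estimate E[Delta_n] for the (assumed) class of half-line indicators
# F = { x -> 1[x <= t] : t in R }.  Here Delta_n = sup_t |F_n(t) - F(t)|,
# which only needs to be checked at the jump points of F_n.
import numpy as np

rng = np.random.default_rng(1)
trials = 2000

def delta_n(n):
    # P = Uniform[0,1], so the true CDF is F(t) = t.
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

for n in (10, 100, 1000, 10_000):
    est = np.mean([delta_n(n) for _ in range(trials)])
    print(f"n={n:6d}  E[Delta_n] ~ {est:.4f}   sqrt(n)*E[Delta_n] ~ {est * np.sqrt(n):.3f}")
```

The product $\sqrt{n}\,\mathbb{E}[\Delta_n]$ stabilizes, so $\mathbb{E}[\Delta_n] \to 0$ at rate $1/\sqrt{n}$. Using Uniform$[0,1]$ is not a real restriction: by the quantile transform, the uniform case stochastically dominates $\Delta_n$ under any distribution on $\mathbb{R}$, which is why this class is UUGC and not merely UGC with respect to one fixed $P$.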
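For the 07-04 lecture: a quick numerical check of the maximal inequality in the special case of independent standard Gaussians, which are $1$-subgaussian.

```python
# Sketch: compare E[max_{i<=n} X_i] for i.i.d. N(0,1) variables (1-subgaussian)
# with the maximal-inequality bound sqrt(2 ln n).
import numpy as np

rng = np.random.default_rng(2)
trials = 3000

for n in (10, 100, 1000, 10_000):
    est = np.mean([rng.standard_normal(n).max() for _ in range(trials)])
    print(f"n={n:6d}  E[max] ~ {est:.3f}   sqrt(2 ln n) = {np.sqrt(2 * np.log(n)):.3f}")
```

The two columns grow together, consistent with the Gaussian lower bound of order $\sqrt{\ln n}$ mentioned above.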
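For the 14-04 lecture: a small enumeration making the Perles-Sauer-Shelah-VC lemma concrete on an example of my choosing, the class of indicators of real intervals, which has VC-dimension $2$. The script counts the labelings the class induces on $n$ points and compares the count with the Sauer bound $\sum_{i \le d} \binom{n}{i}$.

```python
# Sketch: count the labelings that interval indicators h_{a,b}(x) = 1[a <= x <= b]
# induce on n points, and compare with the Sauer-Shelah bound for d = VCdim = 2.
from math import comb

def num_labelings(points):
    points = sorted(points)
    n = len(points)
    # Candidate endpoints below, between, and above the points are enough to
    # realize every labeling an interval can induce on this point set.
    cand = [points[0] - 1.0, points[-1] + 1.0]
    cand += [(points[i] + points[i + 1]) / 2 for i in range(n - 1)]
    cand += points
    labelings = {tuple(int(a <= x <= b) for x in points)
                 for a in cand for b in cand if a <= b}
    return len(labelings)

d = 2  # VC-dimension of the class of real intervals
for n in range(1, 9):
    pts = list(range(n))  # any n distinct reals give the same count
    sauer = sum(comb(n, i) for i in range(d + 1))
    print(f"n={n}  labelings={num_labelings(pts)}  Sauer bound={sauer}")
```

For intervals the Sauer bound happens to hold with equality; in general it only upper-bounds the growth function, and its polynomial growth in $n$ (as opposed to $2^n$) is what drives the $\sqrt{d \ln n / n}$ bound of 05-05.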
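For the 05-05 lecture: the upper bound rests on a standard decomposition of the excess risk of ERM in terms of the uniform deviation. Writing $R_n$ for the empirical risk on the sample and taking $\Delta_n$ over the induced loss class, we have, for every $f \in \mathcal{F}$,

$$R(\hat{f}_n) - R(f) = \big(R(\hat{f}_n) - R_n(\hat{f}_n)\big) + \big(R_n(\hat{f}_n) - R_n(f)\big) + \big(R_n(f) - R(f)\big) \le \Delta_n + 0 + \Delta_n,$$

because ERM makes the middle term non-positive. Taking the infimum over $f \in \mathcal{F}$ gives $R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \le 2\Delta_n$; so if the loss class is UUGC, the expected excess risk of ERM tends to zero uniformly over distributions, which is the "UUGC implies agnostic PAC learnability" direction, with the quantitative rate coming from the bounds on $\mathbb{E}[\Delta_n]$.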
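For the 26-05 lecture: the coin-bias analysis enters lower bounds of this type through a calculation of roughly the following shape (a sketch with my own constants; the lecture's route may differ). Let $P_{\pm} = \mathrm{Bernoulli}\big(\tfrac{1}{2} \pm \varepsilon\big)$. Using $\ln(1+x) \le x$,

$$\mathrm{KL}\big(P_+^{\otimes n} \,\big\|\, P_-^{\otimes n}\big) = n\,\mathrm{KL}(P_+ \| P_-) = 2n\varepsilon \ln\frac{1/2+\varepsilon}{1/2-\varepsilon} \le \frac{8n\varepsilon^2}{1 - 2\varepsilon},$$

and Pinsker's inequality gives

$$\big\|P_+^{\otimes n} - P_-^{\otimes n}\big\|_{\mathrm{TV}} \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(P_+^{\otimes n} \big\| P_-^{\otimes n}\big)} \le 2\varepsilon\sqrt{\frac{n}{1-2\varepsilon}}.$$

So when $\varepsilon$ is a small multiple of $1/\sqrt{n}$, no test (hence no learner) can reliably tell which coin produced the sample, yet predicting as if the wrong coin were true costs $2\varepsilon$ in risk; this forces an expected excess risk of order $1/\sqrt{n}$ in the one-coin ($d = 1$) case, and, roughly speaking, $d$ independent copies of this construction give the general $\Omega\big(\sqrt{d/n}\big)$ statement.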
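For the 09-06 lecture: the shape of the chaining calculation (constants depend on the exact versions of the Dudley integral and of the packing bound used in class, so treat this as a sketch). For a class of VC-dimension $d$, the Dudley-Pollard packing bound controls the empirical covering numbers as $\mathcal{N}\big(\varepsilon, \mathcal{F}, L_2(P_n)\big) \le (C/\varepsilon)^{Kd}$ for universal constants $C, K$, and symmetrization plus the Dudley entropy integral then give

$$\mathbb{E}[\Delta_n(\mathcal{F})] \;\lesssim\; \mathbb{E}\big[\widehat{\mathfrak{R}}_n(\mathcal{F})\big] \;\lesssim\; \frac{1}{\sqrt{n}} \int_0^1 \sqrt{\ln \mathcal{N}\big(\varepsilon, \mathcal{F}, L_2(P_n)\big)}\; d\varepsilon \;\lesssim\; \sqrt{\frac{d}{n}} \int_0^1 \sqrt{\ln(C/\varepsilon)}\; d\varepsilon \;\lesssim\; \sqrt{\frac{d}{n}},$$

since the last integral converges. The point is that the multi-scale integral absorbs the logarithm that a single-scale union-bound argument leaves behind, which is exactly the logarithmic factor in the 05-05 bound.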