Data Mining of the Web

This course is offered by Jeff Ullman. The schedule has changed; the first class is 16:30 May 14 in the Saal Auditorium.

More information about office hours, schedule, etc. will be posted before the first class. See also the Course Poster.

Scheduling News: the missing class will be made up on Thursday May 22, 4:30PM-8PM in the Saal Auditorium.

Instructor Email: ullman (at) gmail (dot) com

Gradiance News: Because we did not finish all I planned, I will move the deadline for the first two assignments back to May 21. But please do try to work out what you can now, especially the harder problems in the "Algorithms" set.

Regarding the discussion of the best number of hash functions to use when there are 8 times as many bits as there are members of the set S: I'm going to let you work this out. However, if we use k hash functions, the expected fraction of bits that will be turned to 1 is 1 - e^(-k/8). You want to maximize the probability that a given member of file F that is not in S will hash at least once to a bit that is not 1. What is this probability as a function of k? What value of k maximizes the probability?
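If you want to check your algebra numerically, the following is a minimal Python sketch, not part of the course materials. It assumes the k hash functions behave independently, so that each probe by a non-member lands on a 1 with probability equal to the fraction of 1 bits given above. The names BITS_PER_MEMBER, fraction_of_ones, rejection_probability, and best_k are made up for this illustration.

```python
import math

# From the text: the bit array has 8 times as many bits as S has members.
BITS_PER_MEMBER = 8


def fraction_of_ones(k: int) -> float:
    """Expected fraction of bits set to 1 when k hash functions are used."""
    return 1.0 - math.exp(-k / BITS_PER_MEMBER)


def rejection_probability(k: int) -> float:
    """Probability that a non-member hashes at least once to a 0 bit.

    Assumption (not stated in the text): the k probes are independent,
    so all k land on 1 bits with probability fraction_of_ones(k) ** k.
    """
    return 1.0 - fraction_of_ones(k) ** k


def best_k(candidates=range(1, 17)) -> int:
    """Integer k among the candidates with the highest rejection probability."""
    return max(candidates, key=rejection_probability)


if __name__ == "__main__":
    # Tabulate the probability so the shape of the curve is visible.
    for k in range(1, 13):
        print(k, round(rejection_probability(k), 5))
    print("best integer k:", best_k())
```

Running the script prints the probability for k = 1 through 12, which you can compare against the closed form you derive by hand.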

Slides

Topic                        Class               Slides  PDF Versions
Introduction                 May 14              PPT     PDF
A-Priori Algorithm           May 14              PPT     PDF
Hash-Based Improvements      May 14 and May 21   PPT     PDF
PageRank and Related Topics  May 21              PPT     PDF
Shingling-Minhashing-LSH     May 22              PPT     PDF
Applications of LSH          May 22              PPT     PDF
Map-Reduce                   May 28              PPT     PDF
Stream-Mining 1              May 28              PPT     PDF
Stream-Mining 2              May 28              PPT     PDF

Gradiance Homeworks

Go to The Gradiance Home Page and create an account. Then, sign up for course FA335CA1. Automated homeworks will be posted weekly. You should work the problems and then answer random short-answer questions about them. If you get a question wrong, you are given a hint and allowed to try again.

Note: if you get "assignment temporarily closed," wait 10 minutes. The system prevents rapid guessing.

Assignments:

Assignment                     Due at Sundown on:
Frequent Itemsets - Basics     May 28
Frequent Itemsets - Algorithms May 28
PageRank                       May 28
Minhash-LSH                    May 29
Map/Reduce-Streams             June 4 (appears May 28)