CS426: Mining Massive Datasets

Spring 2013



Announcements

Jun 13: Solution for Homework 3 posted.
Jun 06: Project 2 posted. Due on July 02, 23:59.
May 31: Homework 3 posted. Due on June 09.
May 30: Solution for Homework 2 posted.
May 20: Solution for Homework 1 posted.
May 17: Homework 2 posted. Due on May 27.
May 02: Homework 1 posted. Due on May 15.
May 02: Project 1 posted. Due on May 08, 23:59.
Apr 25: Deadline for group formation is Apr 28.
Apr 21: Course website launched.


Course Description

This course introduces basic and advanced techniques for processing massive datasets. Topics include data mining basics, cloud computing platforms, programming models and MapReduce, large-scale machine learning and data mining algorithms, and data-intensive applications. The goal of the course is to help students understand and exploit the techniques of a new computing paradigm called data-intensive scalable computing (DISC).

 

Instructor

Wu-Jun Li ([email protected]; //www.cs.sjtu.edu.cn/~liwujun; Rm 3-537, SEIEE Building; 34206661)
Office Hours: TBD

 

Teaching Assistant

Zhi-Qin Yu ([email protected]; Rm 3-503, SEIEE Building)
Office Hours: TBD

 

Lecture Time and Venue

Mon 14:00 - 15:40
Wed 10:00 - 11:40
Fri 08:00 - 09:40
Rm 105, Dong Shang Yuan (东上院 105)

 

Textbook

[MMDS]: Anand Rajaraman and Jeffrey D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2011.
(You can download it from the book website.)

 

Reference Books

[DM]: Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, Second Edition, 2006.
The English reprint edition (英文影印版) can be bought through China-Pub.

[PRML]: Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[HA]: Chuck Lam. Hadoop in Action. Manning Publications, First Edition, 2010.

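[Aliyun]: 周憬宇, 李武军, 过敏意. 《飞天开放平台编程指南-阿里云计算的实践》 (Apsara Open Platform Programming Guide: Aliyun Cloud Computing in Practice; in Chinese). 电子工业出版社 (Publishing House of Electronics Industry), March 2013. [China-Pub]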

 

Course Topics and Schedule (tentative)

(I thank Prof. Jeffrey D. Ullman and Dr. Jure Leskovec for allowing me to use their slides and to modify them where necessary.)

Date | Topics | Slides | Readings
22/04/2013 | Introduction: Data-Intensive Scalable Computing (DISC); Data Mining | | DM Ch.1; MMDS Ch.1; DISC (ref1, ref2, ref3)
24/04/2013 | MapReduce and Hadoop | | HA Ch.1-3; MMDS Ch.2
26/04/2013, 27/04/2013 | Frequent Itemsets and Association Rules | | MMDS Ch.6; DM Ch.5
28/04/2013 | Form groups and send the group information to the TA. Deadline: 23:59 | |
03/05/2013, 06/05/2013 | Dimensionality Reduction and Matrix Factorization | | MMDS Ch.11; ref1; Math Basics
08/05/2013 | Project 1 due. Deadline: 23:59 [Project 1] | |
08/05/2013, 10/05/2013 | Recommender Systems | | MMDS Ch.9
13/05/2013, 15/05/2013 | Search Engines | |
17/05/2013, 20/05/2013 | Link Analysis: PageRank, Hubs and Authorities, Spam Detection | | MMDS Ch.5
22/05/2013, 24/05/2013 | Unsupervised Learning: Clustering | | MMDS Ch.7
27/05/2013, 29/05/2013, 31/05/2013 | Supervised Learning: Perceptron, Naive Bayes, kNN, SVM | |
03/06/2013, 05/06/2013 | Finding Similar Items: Minhashing and Locality-Sensitive Hashing | | MMDS Ch.3
07/06/2013, 09/06/2013 | Mining Data Streams | | MMDS Ch.4
14/06/2013 | Course Review | |
02/07/2013 | Project 2 due. Deadline: 23:59 [Project 2] | |

 

Prerequisites

Data structures, design and analysis of algorithms, linear algebra, and probability theory.

 

Grading Scheme

1. Class attendance (10%)

2. Homework (20%)

Homework 1
Homework 2
Homework 3

3. Final exam (40%)

4. Project (30%)

Project 1
Project 2

 

Late Assignments

Assignments turned in late will be penalized 20% per late day.

 

Academic Honor Code

Honesty and integrity are central to academic work. All submitted assignments must be entirely your own (or your own group's). Any student found cheating or plagiarizing will receive a final score of zero for this course.

 



 
