Table of contents
 Story
 Slides
 Spotfire Dashboard
 Research Notes
 CS246: Mining Massive Datasets Slides
 Chapter 1 Introduction
 Slide 1 Mining of Massive Datasets: Course Introduction
 Slide 2 What is Data Mining? Knowledge discovery from data
 Slide 3 Some Massive Data Statistics
 Slide 4 Data contains value and knowledge
 Slide 5 Data Mining
 Slide 6 Good news: Demand for Data Mining
 Slide 7 What is Data Mining?
 Slide 8 Data Mining Tasks
 Slide 9 Meaningfulness of Analytic Answers 1
 Slide 10 Meaningfulness of Analytic Answers 2
 Slide 11 What matters when dealing with data?
 Slide 12 Data Mining: Cultures
 Slide 13 This Class: CS246
 Slide 14 What will we learn? 1
 Slide 15 What will we learn? 2
 Slide 16 How It All Fits Together
 Slide 17 How do you want that data?
 Slide 18 About the Course
 Slide 19 2014 CS246 Course Staff
 Slide 20 Course Logistics
 Slide 21 Logistics: Communication
 Slide 22 Work for the Course 1
 Slide 23 Work for the Course 2
 Slide 24 Course Calendar
 Slide 25 Prerequisites
 Slide 26 Recitation Sessions
 Slide 27 What's after the class
 Slide 28 3 Todo items
 Chapter 1 Introduction
 Hadoop Tutorial
 General Instructions
 1 Setting up a virtual machine
 2 Running Hadoop jobs
 2.1 Creating a Hadoop project in Eclipse
 2.2 Running Hadoop jobs in standalone mode
 2.3 Running Hadoop in pseudo-distributed mode
 2.4 Debugging Hadoop jobs
 2.5 Example project
 Figure 15: Create a Hadoop Project
 Figure 16: Create a Hadoop Project
 Figure 17: Create a Hadoop Project
 Figure 18: Create a Hadoop Project
 Figure 19: Create a Hadoop Project
 Figure 20: Create a Hadoop Project
 Figure 21: Create a Hadoop Project
 Figure 22: Create a java file
 Figure 23: Create a java file
 Figure 24: Create WordCount.java
 Figure 25: Create WordCount.java
 Figure 26: Create WordCount.java
 Figure 27: Run WordCount.java
 Figure 28: Run WordCount.java
 Figure 29: Run WordCount.java
 Figure 30: Run WordCount.java
 Figure 31: Run WordCount.java
 Figure 32: Export a hadoop project
 Figure 33: Run WordCount.java
 Figure 34: Export a Hadoop project
 Figure 35: Export a Hadoop project
 Figure 36: Run WordCount job
 Figure 37: Run WordCount job
 Figure 38: Run WordCount job
 Figure 39: View WordCount job logs
 Figure 40: View WordCount job logs
 Figure 41: View WordCount job logs
 Figure 42: View WordCount job logs
 2.6 Using your local machine for development
 Further Hadoop tutorials
 Further Eclipse tutorials
 3 Task: Write your own Hadoop Job
 CS246: Mining Massive Datasets Winter 2015
 Mining of Massive Datasets
 Mining of Massive Datasets
 Preface
 Contents
 1 Data Mining
 1.1 What is Data Mining?
 1.2 Statistical Limits on Data Mining
 1.3 Things Useful to Know
 1.4 Outline of the Book
 1.5 Summary of Chapter 1
 1.6 References for Chapter 1
 1.7 Footnotes for Chapter 1
 2 MapReduce and the New Software Stack
 2.1 Distributed File Systems
 2.2 MapReduce
 2.3 Algorithms Using MapReduce
 2.3.1 Matrix-Vector Multiplication by MapReduce
 2.3.2 If the Vector v Cannot Fit in Main Memory
 2.3.3 Relational-Algebra Operations
 2.3.4 Computing Selections by MapReduce
 2.3.5 Computing Projections by MapReduce
 2.3.6 Union, Intersection, and Difference by MapReduce
 2.3.7 Computing Natural Join by MapReduce
 2.3.8 Grouping and Aggregation by MapReduce
 2.3.9 Matrix Multiplication
 2.3.10 Matrix Multiplication with One MapReduce Step
 2.3.11 Exercises for Section 2.3
 2.4 Extensions to MapReduce
 2.5 The Communication Cost Model
 2.6 Complexity Theory for MapReduce
 2.7 Summary of Chapter 2
 2.8 References for Chapter 2
 3 Finding Similar Items
 3.1 Applications of Near-Neighbor Search
 3.2 Shingling of Documents
 3.3 Similarity-Preserving Summaries of Sets
 3.4 Locality-Sensitive Hashing for Documents
 3.5 Distance Measures
 3.6 The Theory of Locality-Sensitive Functions
 3.7 LSH Families for Other Distance Measures
 3.8 Applications of Locality-Sensitive Hashing
 3.9 Methods for High Degrees of Similarity
 3.10 Summary of Chapter 3
 3.11 References for Chapter 3
 4 Mining Data Streams
 4.1 The Stream Data Model
 4.2 Sampling Data in a Stream
 4.3 Filtering Streams
 4.4 Counting Distinct Elements in a Stream
 4.5 Estimating Moments
 4.6 Counting Ones in a Window
 4.6.1 The Cost of Exact Counts
 4.6.2 The Datar-Gionis-Indyk-Motwani Algorithm
 4.6.3 Storage Requirements for the DGIM Algorithm
 4.6.4 Query Answering in the DGIM Algorithm
 4.6.5 Maintaining the DGIM Conditions
 4.6.6 Reducing the Error
 4.6.7 Extensions to the Counting of Ones
 4.6.8 Exercises for Section 4.6
 4.7 Decaying Windows
 4.8 Summary of Chapter 4
 4.9 References for Chapter 4
 5 Link Analysis
 6 Frequent Itemsets
 7 Clustering
 8 Advertising on the Web
 8.1 Issues in On-Line Advertising
 8.2 On-Line Algorithms
 8.3 The Matching Problem
 8.4 The Adwords Problem
 8.4.1 History of Search Advertising
 8.4.2 Definition of the Adwords Problem
 8.4.3 The Greedy Approach to the Adwords Problem
 8.4.4 The Balance Algorithm
 8.4.5 A Lower Bound on Competitive Ratio for Balance
 8.4.6 The Balance Algorithm with Many Bidders
 8.4.7 The Generalized Balance Algorithm
 8.4.8 Final Observations About the Adwords Problem
 8.4.9 Exercises for Section 8.4
 8.5 Adwords Implementation
 8.6 Summary of Chapter 8
 8.7 References for Chapter 8
 9 Recommendation Systems
 10 Mining Social-Network Graphs
 10.1 Social Networks as Graphs
 10.2 Clustering of Social-Network Graphs
 10.3 Direct Discovery of Communities
 10.4 Partitioning of Graphs
 10.5 Finding Overlapping Communities
 10.6 Simrank
 10.7 Counting Triangles
 10.8 Neighborhood Properties of Graphs
 10.8.1 Directed Graphs and Neighborhoods
 10.8.2 The Diameter of a Graph
 10.8.3 Transitive Closure and Reachability
 10.8.4 Transitive Closure Via MapReduce
 10.8.5 Smart Transitive Closure
 10.8.6 Transitive Closure by Graph Reduction
 10.8.7 Approximating the Sizes of Neighborhoods
 10.8.8 Exercises for Section 10.8
 10.9 Summary of Chapter 10
 10.10 References for Chapter 10
 11 Dimensionality Reduction
 12 Large-Scale Machine Learning
 12.1 The Machine-Learning Model
 12.2 Perceptrons
 12.2.1 Training a Perceptron with Zero Threshold
 12.2.2 Convergence of Perceptrons
 12.2.3 The Winnow Algorithm
 12.2.4 Allowing the Threshold to Vary
 12.2.5 Multiclass Perceptrons
 12.2.6 Transforming the Training Set
 12.2.7 Problems With Perceptrons
 12.2.8 Parallel Implementation of Perceptrons
 12.2.9 Exercises for Section 12.2
 12.3 Support-Vector Machines
 12.4 Learning from Nearest Neighbors
 12.5 Comparison of Learning Methods
 12.6 Summary of Chapter 12
 12.7 References for Chapter 12
Story
Slides
Spotfire Dashboard
Research Notes
CS246: Mining Massive Datasets Slides
Chapter 1 Introduction
Hadoop Tutorial
Source: http://web.stanford.edu/class/cs246/...s/tutorial.pdf (PDF)
General Instructions
The purpose of this tutorial is (1) to get you started with Hadoop and (2) to get you acquainted with the code and homework submission system. Completing the tutorial is optional, but students who hand in the results on time will earn 5 points. This tutorial is to be completed individually.
Here you will learn how to write, compile, debug, and execute a simple Hadoop program. The first part of the assignment serves as a tutorial, and the second part asks you to write your own Hadoop program.
Section 1 describes the virtual machine environment. Instead of the virtual machine, you are welcome to set up your own pseudo-distributed or fully distributed cluster if you prefer. Any version of Hadoop that is at least 1.0 will suffice. (For an easy way to set up a cluster, try Cloudera Manager: http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin.) If you choose to set up your own cluster, you are responsible for making sure the cluster is working properly. The TAs will be unable to help you debug configuration issues in your own cluster.
Section 2 explains how to use the Eclipse environment in the virtual machine, including how to create a project, how to run jobs, and how to debug jobs. Section 2.5 gives an end-to-end example of creating a project, adding code, building, running, and debugging it.
Section 3 is the actual homework assignment. There are no deliverables for Sections 1 and 2. In Section 3, you are asked to write and submit your own MapReduce job.
This assignment requires you to upload the code and hand in the output for Section 3.
All students should submit the output via Scoryst and upload the code:
Scoryst: create account at https://scoryst.com/enroll/suyVLAjbns
Upload the code: http://snap.stanford.edu/submit/
1 Setting up a virtual machine
 Download and install VirtualBox on your machine: http://virtualbox.org/wiki/Downloads
 Download the Cloudera Quickstart VM at http://www.cloudera.com/content/clou...cdh53x.html
 Uncompress the VM archive. It is compressed with 7-zip. If needed, you can download a tool to uncompress the archive at http://www.7-zip.org/.
 Start VirtualBox and click Import Appliance in the File drop-down menu. Click the folder icon beside the location field. Browse to the uncompressed archive folder, select the .ovf file, and click the Open button. Click the Continue button. Click the Import button.
 Your virtual machine should now appear in the left column. Select it and click on Start to launch it.
 To verify that the VM is running and you can access it, open a browser to the URL: http://localhost:8088. You should see the resource manager UI. The VM uses port forwarding for the common Hadoop ports, so when the VM is running, those ports on localhost will redirect to the VM.
 Optional: Open the VirtualBox preferences (File > Preferences > Network) and select the Adapter 2 tab. Click the Enable Network Adapter checkbox. Select Host-only Adapter. If the list of networks is empty, add a new network. Click OK. If you do this step, you will be able to connect to the running virtual machine via SSH from the host OS at 192.168.56.101. The username and password are 'cloudera'.
The virtual machine includes the following software:
 CentOS 6.4
 JDK 7 (1.7.0_67)
 Hadoop 2.5.0
 Eclipse 4.2.6 (Juno)
The virtual machine runs best with 4096MB of RAM, but has been tested to function with 1024MB. Note that at 1024MB, while it did technically function, it was very slow to start up.
2 Running Hadoop jobs
Generally, Hadoop can be run in three modes:
1. Standalone (or local) mode: There are no daemons used in this mode. Hadoop uses the local file system as a substitute for the HDFS file system. Jobs run as if there were a single mapper and a single reducer.
2. Pseudo-distributed mode: All the daemons run on a single machine, and this setting mimics the behavior of a cluster. All the daemons run on your machine locally using the HDFS protocol. There can be multiple mappers and reducers.
3. Fully-distributed mode: This is how Hadoop runs on a real cluster.
In this homework we will show you how to run Hadoop jobs in standalone mode (very useful for developing and debugging) and in pseudo-distributed mode (to mimic the behavior of a cluster environment).
2.1 Creating a Hadoop project in Eclipse
(There is a plugin for Eclipse that makes it simple to create a new Hadoop project and execute Hadoop jobs, but the plugin is only well maintained for Hadoop 1.0.4, which is a rather old version of Hadoop. There is a project at https://github.com/winghc/hadoop2x-eclipse-plugin that is working to update the plugin for Hadoop 2.0. You can try it out if you like, but your mileage may vary.)
To create a project:
1. Open Eclipse. If you just launched the VM, you may have to close the Firefox window to find the Eclipse icon on the desktop.
2. Right-click on the training node in the Package Explorer and select Copy. See Figure 1.
Figure 1: Create a Hadoop Project
3. Right-click on the training node in the Package Explorer and select Paste. See Figure 2.
Figure 2: Create a Hadoop Project
4. In the popup dialog, enter the new project name in the Project Name field and click OK. See Figure 3.
2.2 Running Hadoop jobs in standalone mode
Once you've created your project and written the source code, do the following to run the project in standalone mode:
1. Right-click on the project and select Run As > Run Configurations. See Figure 4.
Figure 4: Run a Hadoop Project
2. In the popup dialog, select the Java Application node and click the New launch configuration button in the upper left corner. See Figure 5.
Figure 5: Run a Hadoop Project
3. Enter a name in the Name field and the name of the main class in the Main class field. See Figure 6.
Figure 6: Run a Hadoop Project
4. Switch to the Arguments tab and input the required arguments. Click Apply. See Figure 7. To run the job immediately, click on the Run button. Otherwise click Close and complete the following step.
Figure 7: Run a Hadoop Project
5. Right-click on the project and select Run As > Java Application. See Figure 8.
Figure 8: Run a Hadoop Project
6. In the popup dialog select the main class from the selection list and click OK. See Figure 9.
2.3 Running Hadoop in pseudodistributed mode
Once you've created your project and written the source code, do the following to run the project in pseudo-distributed mode:
1. Right-click on the project and select Export. See Figure 10.
Figure 10: Run a Hadoop Project
2. In the popup dialog, expand the Java node and select JAR file. See Figure 11. Click Next >
Figure 11: Run a Hadoop Project
3. Enter a path in the JAR file field and click Finish. See Figure 12.
2.4 Debugging Hadoop jobs
To debug an issue with a job, the easiest approach is to run the job in standalone mode and use a debugger. To debug your job, do the following steps:
1. Right-click on the project and select Debug As > Java Application. See Figure 13.
Figure 13: Debug a Hadoop project
2. In the popup dialog select the main class from the selection list and click OK. See Figure 14.
Figure 14: Run a Hadoop Project
You can use the Eclipse debugging features to debug your job execution. See the additional Eclipse tutorials at the end of section 2.6 for help using the Eclipse debugger.
When running your job in pseudo-distributed mode, the output from the job is logged in the task tracker's log files, which can be accessed most easily by pointing a web browser to port 8088 of the server, which will be localhost. From the job tracker web page, you can drill down into the failing job, the failing task, the failed attempt, and finally the log files. Note that the logs for stdout and stderr are separated, which can be useful when trying to isolate specific debugging print statements.
2.5 Example project
In this section you will create a new Eclipse Hadoop project, compile it, and execute it. The program will count the frequency of all the words in a given large text file. In your virtual machine, Hadoop, the Java environment, and Eclipse have already been preinstalled.
Open Eclipse. If you just launched the VM, you may have to close the Firefox window to find the Eclipse icon on the desktop.
Right-click on the training node in the Package Explorer and select Copy. See Figure 15.
Figure 15: Create a Hadoop Project
Right-click on the training node in the Package Explorer and select Paste. See Figure 16.
Figure 16: Create a Hadoop Project
In the popup dialog, enter the new project name in the Project Name field and click OK. See Figure 17.
Figure 17: Create a Hadoop Project
Create a new package called edu.stanford.cs246.wordcount by right-clicking on the src node and selecting New > Package. See Figure 18.
Figure 18: Create a Hadoop Project
Enter edu.stanford.cs246.wordcount in the Name field and click Finish. See Figure 19.
Figure 19: Create a Hadoop Project
Create a new class in that package called WordCount by right-clicking on the edu.stanford.cs246.wordcount node and selecting New > Class. See Figure 20.
Figure 20: Create a Hadoop Project
In the popup dialog, enter WordCount as the Name. See Figure 21.
Figure 21: Create a Hadoop Project
In the Superclass field, enter Configured and click the Browse button. From the popup window select Configured - org.apache.hadoop.conf and click the OK button. See Figure 22.
Figure 22: Create a java file
In the Interfaces section, click the Add button. From the popup window select Tool - org.apache.hadoop.util and click the OK button. See Figure 23.
Figure 23: Create a java file
Check the boxes for public static void main(String[] args) and Inherited abstract methods and click the Finish button. See Figure 24.
Figure 24: Create WordCount.java
You will now have a rough skeleton of a Java file, as in Figure 25. You can now add code to this class to implement your Hadoop job.
Figure 25: Create WordCount.java
Rather than implement a job from scratch, copy the contents from http://snap.stanford.edu/class/cs246...WordCount.java and paste it into the WordCount.java file. See Figure 26. The code in WordCount.java calculates the frequency of each word in a given dataset.
Figure 26: Create WordCount.java
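Before pasting in the full file, it may help to see what the job computes. The following is a minimal, Hadoop-free Java sketch of the same logic (the class and method names are illustrative, not part of the course's WordCount.java): the map phase tokenizes each line and emits (word, 1) pairs, and the reduce phase sums the counts per word.

```java
import java.util.HashMap;
import java.util.Map;

// Hadoop-free sketch of the WordCount logic: conceptually, the mapper
// emits (word, 1) for every whitespace-separated token, and the reducer
// sums those 1s per word. Here both phases are collapsed into one method.
public class WordCountSketch {
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("\\s+")) {   // "map": one token at a time
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);   // "reduce": sum per word
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("to be or not to be");
        System.out.println(counts.get("to") + " " + counts.get("be")); // prints "2 2"
    }
}
```

In the real Hadoop job, the map and reduce steps live in separate classes, and Hadoop shuffles the (word, 1) pairs between them, grouping by key.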
Download the Complete Works of William Shakespeare from Project Gutenberg at http://www.gutenberg.org/cache/epub/100/pg100.txt. You can do this simply with cURL, but you also have to be aware of the byte order mark (BOM). You can download the file and remove the BOM in one line by opening a terminal, changing to the ~/workspace/WordCount directory, and running the following command:
curl http://www.gutenberg.org/cache/epub/100/pg100.txt | perl -pe 's/^\xEF\xBB\xBF//' > pg100.txt
If you copy the above command, beware of the quotes, as copy/paste will likely mistranslate them.
Right-click on the project and select Run As > Run Configurations. See Figure 27.
Figure 27: Run WordCount.java
In the popup dialog, select the Java Application node and click the New launch configuration button in the upper left corner. See Figure 28.
Figure 28: Run WordCount.java
Enter a name in the Name field and WordCount in the Main class field. See Figure 29.
Figure 29: Run WordCount.java
Switch to the Arguments tab and put pg100.txt output in the Program arguments field. See Figure 30. Click Apply and Close.
Figure 30: Run WordCount.java
Right-click on the project and select Run As > Java Application. See Figure 31.
Figure 31: Run WordCount.java
In the popup dialog select WordCount - edu.stanford.cs246.wordcount from the selection list and click OK. See Figure 32.
Figure 32: Export a hadoop project
You will see the command output in the console window, and if the job succeeds, you'll find the results in the ~/workspace/WordCount/output directory. If the job fails complaining that it cannot find the input file, make sure that the pg100.txt file is located in the ~/workspace/WordCount directory.
Right-click on the project and select Export. See Figure 33.
Figure 33: Run WordCount.java
In the popup dialog, expand the Java node and select JAR file. See Figure 34. Click Next >
Figure 34: Export a Hadoop project
Enter /home/cloudera/wordcount.jar in the JAR file field and click Finish. See Figure 35.
Figure 35: Export a Hadoop project
If you see an error dialog warning that the project compiled with warnings, you can simply click OK.
Open a terminal and run the following commands:
hadoop fs -put workspace/WordCount/pg100.txt
hadoop jar wordcount.jar edu.stanford.cs246.wordcount.WordCount pg100.txt output
Run the command: hadoop fs -ls output
You should see an output file for each reducer. Since there was only one reducer for this job, you should only see one part-* file. Note that sometimes the files will be called part-NNNNN, and sometimes they'll be called part-r-NNNNN. See Figure 36.
Figure 36: Run WordCount job
Run the command:
hadoop fs -cat output/part* | head
You should see the same output as when you ran the job locally, as shown in Figure 37.
Figure 37: Run WordCount job
To view the job's logs, open the browser in the VM and point it to http://localhost:8088, as in Figure 38.
2.6 Using your local machine for development
If you'd rather use your own development environment instead of working in the IDE, follow these steps:
1. Make sure that you have an entry for localhost.localdomain in your /etc/hosts file, e.g. 127.0.0.1 localhost localhost.localdomain
2. Install a copy of Hadoop locally. The easiest way to do that is to simply download the archive from http://archive.cloudera.com/cdh5/cdh...latest.tar.gz and unpack it.
3. In the unpacked archive, you'll find an etc/hadoop directory. In that directory, open the core-site.xml file and modify it as follows:
4. Next, open the yarn-site.xml file in the same directory and modify it as follows:
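As a sketch of the kind of edits intended: both files need to point the local client at the daemons running inside the VM. The property values below are assumptions based on the Cloudera QuickStart defaults (NameNode on localhost:8020, ResourceManager on localhost), so verify them against the VM's own etc/hadoop files:

```xml
<!-- core-site.xml: point the HDFS client at the VM
     (port 8020 is the assumed Cloudera QuickStart default) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

<!-- yarn-site.xml: point job submission at the VM's ResourceManager -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
</configuration>
```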
You can now run the Hadoop binaries located in the bin directory in the archive, and they will connect to the cluster running in your virtual machine.
Further Hadoop tutorials
Yahoo! Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/
Cloudera Hadoop Tutorial: http://www.cloudera.com/content/clou...Tutorial.html
How to Debug MapReduce Programs: http://wiki.apache.org/hadoop/HowToD...ReducePrograms
Further Eclipse tutorials
General Eclipse tutorial: http://www.vogella.com/articles/Eclipse/article.html.
Tutorial on how to use the Eclipse debugger: http://www.vogella.com/articles/Ecli...g/article.html.
3 Task: Write your own Hadoop Job
Now you will write your first MapReduce job to accomplish the following task:
 Write a Hadoop MapReduce program which outputs the number of words that start with each letter. This means that for every letter we want to count the total number of words that start with that letter. In your implementation, ignore the letter case, i.e., consider all words as lower case. You can ignore all non-alphabetic characters.
 Run your program over the same input data as above.
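To make the per-word logic concrete (this is only an illustrative, Hadoop-free sketch, not a complete solution; the class and method names are made up): for each word, lower-case it, strip non-alphabetic characters, and count its first letter. In the MapReduce version, the mapper would emit (letter, 1) pairs and the reducer would sum them.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the letter-count logic (not a Hadoop job): for each
// word, lower-case it, drop non-alphabetic characters, and count its first
// letter. A mapper would emit (letter, 1) pairs; a reducer would sum them.
public class LetterCountSketch {
    public static Map<Character, Integer> countFirstLetters(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            String word = token.replaceAll("[^a-z]", "");  // ignore non-alphabetic chars
            if (word.isEmpty()) continue;                   // e.g. pure numbers/punctuation
            counts.merge(word.charAt(0), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countFirstLetters("The three thin foxes ran Fast"));
        // prints "{f=2, r=1, t=3}"
    }
}
```

Your actual submission still has to express this as a Hadoop job (Mapper, Reducer, and driver), run over the same pg100.txt input as above.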
What to hand in: Hand in the printout of the output file and upload the source code.
CS246: Mining Massive Datasets Winter 2015
Source: http://web.stanford.edu/class/cs246/intro_handout.pdf (PDF)
Course Information
Instructor: Jure Leskovec
Office Hours: Wednesdays 9-10am, Gates 418
Lectures: 9:30AM - 10:45AM Tuesday and Thursday in NVidia Auditorium, Huang Engineering Center
Course website: http://cs246.stanford.edu
Contact:
Email us at cs246win1415staff@lists.stanford.edu
Use Piazza to post questions: http://piazza.com/stanford/winter2015/cs246
SCPD students can attend office hours remotely via a Google Hangout; the link will be posted on Piazza just before the office hours start.
TAs and office hours: See the course website for times and locations.
Topics
 MapReduce and Hadoop
 Frequent itemsets and Association rules
 Near Neighbor Search in High Dimensions
 Locality Sensitive Hashing (LSH)
 Dimensionality reduction: SVD and CUR
 Recommender systems
 Clustering
 Analysis of massive graphs
 Link Analysis: PageRank, HITS
 Web spam and TrustRank
 Proximity search on graphs
 Large-scale supervised machine learning
 Mining data streams
 Learning through experimentation
 Web advertising
 Optimizing submodular functions
Assignments and grading
 Four problem sets requiring coding and theory (40%)
 Final exam (40%)
 Gradiance quizzes (20%)
 Piazza and course participation (up to 1% extra credit)
Homework policy
Questions We try very hard to make questions unambiguous, but some ambiguities may remain. Ask (i.e., post a question on Piazza) if confused or state your assumptions explicitly. Reasonable assumptions will be accepted in case of ambiguous questions.
Honor code We take the honor code extremely seriously (http://stanford.io/1F3TWNO). The standard penalty includes a one-quarter suspension from the University and 40 hours of community service. We strongly encourage students to form study groups. Students may discuss and work on homework problems in groups. However, each student must write down the solutions and the code independently. In addition, each student should write down the set of people with whom s/he interacted.
Late assignments Each student will have a total of 2 late periods to use for homeworks. A late period lasts until 5pm on the day of the next class. (If the assignment is due on Thursday at 5pm, then the late period expires the next Tuesday at 5pm.) No assignment will be accepted more than one late period after its due date.
Assignment submission All students (SCPD and non-SCPD) submit their homeworks via Scoryst (http://www.scoryst.com). Students can typeset or scan their homeworks. Students also need to upload their code at http://snap.stanford.edu/submit. Put all the code for a single question into a single file and upload it. Refer to the course FAQ for more info.
Regrade requests We take great care to ensure that grading is fair and consistent. Since we will always use the same grading procedure, any grades you receive are unlikely to change significantly. However, if you feel that your work deserves a regrade, submit a written request via Scoryst within a week of receiving your grade. Note that we reserve the right to regrade the entire assignment. Moreover, if a regrade request is unjustified and thus not honored, then every future unsuccessful regrade request will be penalized 5 points.
Gradiance Quizzes are posted on Friday afternoon and due exactly a week later (hard deadline Friday 11:59pm Pacific time). Once the deadline has passed, students will not be able to submit the quiz.
Prerequisites
Students are expected to have the following background (recitation sessions will refresh these topics):
 The ability to write very non-trivial computer programs (at a minimum, at the level of CS107). Good knowledge of Java will be extremely helpful, since most assignments will require the use of Hadoop/Java.
 Familiarity with basic probability theory is essential (at a minimum, at the level of CS109 or Stat116).
 Familiarity with writing rigorous proofs (at a minimum at the level of CS 103).
 Familiarity with basic linear algebra (e.g., any of Math 51, Math 103, Math 113, CS 205, or EE 263).
 Familiarity with algorithmic analysis (e.g., CS 161).
Materials
Notes and reading assignments will be posted on the course web site. Reading for the class will be from: Mining Massive Datasets by J. Leskovec, A. Rajaraman, J. Ullman (PDFs at http://mmds.org).
Important dates
Assignment       Out Date  Due Date (all 5pm)
Hadoop tutorial  now       Jan 13
Assignment 1     Jan 8     Jan 22
Assignment 2     Jan 22    Feb 5
Assignment 3     Feb 5     Feb 19
Assignment 4     Feb 19    Mar 5
Alternate final  Mar 16, 7-10pm
Final exam       Mar 20, 12:15-3:15pm
We will also hold two review sessions in the first two weeks of the course (sessions will be video recorded):
 Review of basic probability. Friday, January 9, at 4:15-5:30pm in Gates B01.
 Review of basic linear algebra. Friday, January 16, at 4:15-5:30pm in Gates B01.
Next steps for students
 Register for Piazza: http://piazza.com/stanford/winter2015/cs246
 Register for Gradiance: http://www.newgradiance.com/services (class token B343F7F0)
 Register for Scoryst: https://scoryst.com/enroll/suyVLAjbns
 Download Hadoop VM, start the tutorial: http://cs246.stanford.edu/homeworks/tutorial.pdf
Mining of Massive Datasets
Source: http://www.mmds.org/#courses
The book
The book is based on Stanford Computer Science course CS246: Mining Massive Datasets (and CS345A: Data Mining).
The book, like the course, is designed at the undergraduate computer science level with no formal prerequisites. To support deeper explorations, most of the chapters are supplemented with further reading references.
The Mining of Massive Datasets book has been published by Cambridge University Press. You can get a 20% discount here.
By agreement with the publisher, you can download the book for free from this page. Cambridge University Press does, however, retain copyright on the work, and we expect that you will obtain their permission and acknowledge our authorship if you republish parts or all of it.
We welcome your feedback on the manuscript.
The MOOC (Massive Open Online Course)
We are launching an online course based on the Mining Massive Datasets book:
The course starts September 29, 2014 and will run for 9 weeks with 7 weeks of lectures. Additional information and registration.
The 2nd edition of the book (v2.1)
The following is the second edition of the book. There are three new chapters, on mining large graphs, dimensionality reduction, and machine learning. There is also a revised Chapter 2 that treats MapReduce programming in a manner closer to how it is used in practice.
Together with each chapter there is also a set of lecture slides that we use for teaching the Stanford CS246: Mining Massive Datasets course. Note that the slides do not necessarily cover all the material covered in the corresponding chapters.
Chapter     Title                                              Slides
Preface and Table of Contents
Chapter 1   Data Mining                                        PPT
Chapter 2   MapReduce and the New Software Stack               PPT
Chapter 3   Finding Similar Items                              PPT
Chapter 4   Mining Data Streams (Part 1, Part 2)               PDF, PPT
Chapter 5   Link Analysis (Part 1, Part 2)                     PDF, PPT
Chapter 6   Frequent Itemsets                                  PPT
Chapter 7   Clustering                                         PPT
Chapter 8   Advertising on the Web                             PPT
Chapter 9   Recommendation Systems (Part 1, Part 2)            PDF, PPT
Chapter 10  Mining Social-Network Graphs (Part 1, Part 2)      PDF, PPT
Chapter 11  Dimensionality Reduction                           PPT
Chapter 12  Large-Scale Machine Learning (Part 1, Part 2)      PDF, PPT
Index
Errata      HTML
Download the latest version of the book as a single big PDF file (511 pages, 3 MB).
Download the full version of the book with a hyperlinked table of contents that makes it easy to jump around: PDF file (513 pages, 3.69 MB).
The Errata for the second edition of the book: HTML.
Note to the users of provided slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org/.
Comments and corrections are most welcome. Please let us know if you are using these materials in your course and we will list and link to your course.
Stanford big data courses
CS246
CS246: Mining Massive Datasets is a graduate-level course that discusses data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis is on MapReduce as a tool for creating parallel algorithms that can process very large amounts of data.
CS341
CS341 Project in Mining Massive Data Sets is an advanced project-based course. Students work on data mining and machine learning algorithms for analyzing very large amounts of data. Both interesting big datasets as well as computational infrastructure (a large MapReduce cluster) are provided by course staff. Generally, students first take CS246 followed by CS341.
CS341 is generously supported by Amazon, which gives us access to its EC2 platform.
CS224W
CS224W: Social and Information Networks is a graduate-level course that covers recent research on the structure and analysis of large social and information networks, and on models and algorithms that abstract their basic properties. The class explores how to practically analyze large-scale network data and how to reason about it through models for network structure and evolution.
You can take Stanford courses!
If you are not a Stanford student, you can still take CS246 as well as CS224W or earn a Stanford Mining Massive Datasets graduate certificate by completing a sequence of four Stanford Computer Science courses. A graduate certificate is a great way to keep the skills and knowledge in your field current. More information is available at the Stanford Center for Professional Development (SCPD).
Supporting materials
If you are an instructor interested in using the Gradiance Automated Homework System with this book, start by creating an account for yourself here. Then, email your chosen login and the request to become an instructor for the MMDS book to support@gradiance.com. You will then be able to create a class using these materials. Manuals explaining the use of the system are available here.
Students who want to use the Gradiance Automated Homework System for self-study can register here. Then, use the class token 1EDD8A1D to join the "omnibus class" for the MMDS book. See The Student Guide for more information.
Previous versions of the book
Version 1.0
The following materials are equivalent to the published book, with errata corrected to July 4, 2012.
Chapter  Title  Book 

Preface and Table of Contents  
Chapter 1  Data Mining  
Chapter 2  Large-Scale File Systems and MapReduce 
Chapter 3  Finding Similar Items  
Chapter 4  Mining Data Streams  
Chapter 5  Link Analysis  
Chapter 6  Frequent Itemsets  
Chapter 7  Clustering  
Chapter 8  Advertising on the Web  
Chapter 9  Recommendation Systems  
Index  
Errata  HTML 
Download the book as published here (340 pages, 2 MB).
Mining of Massive Datasets
Source: http://infolab.stanford.edu/~ullman/mmds/bookL.pdf (PDF)
Jure Leskovec
Stanford Univ.
Anand Rajaraman
Milliway Labs
Jeffrey D. Ullman
Stanford Univ.
Copyright © 2010, 2011, 2012, 2013, 2014 Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman
Preface
This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course CS345A, titled "Web Mining," was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates. When Jure Leskovec joined the Stanford faculty, we reorganized the material considerably. He introduced a new course CS224W on network analysis and added material to CS345A, which was renumbered CS246. The three authors also introduced a large-scale data-mining project course, CS341. The book now contains material taught in all three courses.
What the Book Is About
At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to "train" a machine-learning engine of some sort. The principal topics covered are:
1. Distributed file systems and MapReduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
2. Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
4. The technology of search engines, including Google's PageRank, link-spam detection, and the hubs-and-authorities approach.
5. Frequent-itemset mining, including association rules, market baskets, the A-Priori Algorithm and its improvements.
6. Algorithms for clustering very large, high-dimensional datasets.
7. Two key problems for Web applications: managing advertising and recommendation systems.
8. Algorithms for analyzing and mining the structure of very large graphs, especially social-network graphs.
9. Techniques for obtaining the important properties of a large dataset by dimensionality reduction, including singular-value decomposition and latent semantic indexing.
10. Machine-learning algorithms that can be applied to very large data, such as perceptrons, support-vector machines, and gradient descent.
Prerequisites
To appreciate fully the material in this book, we recommend the following prerequisites:
1. An introduction to database systems, covering SQL and related programming systems.
2. A sophomore-level course in data structures, algorithms, and discrete math.
3. A sophomore-level course in software systems, software engineering, and programming languages.
Exercises
The book contains extensive exercises, with some for almost every section. We indicate harder exercises or parts of exercises with an exclamation point. The hardest exercises have a double exclamation point.
Support on the Web
Go to http://www.mmds.org for slides, homework assignments, project requirements, and exams from courses related to this book.
Gradiance Automated Homework
There are automated exercises based on this book, using the Gradiance rootquestion technology, available at www.gradiance.com/services. Students may enter a public class by creating an account at that site and entering the class with code 1EDD8A1D. Instructors may use the site by making an account there and then emailing support at gradiance dot com with their login name, the name of their school, and a request to use the MMDS materials.
Acknowledgements
Cover art is by Scott Ullman.
We would like to thank Foto Afrati, Arun Marathe, and Rok Sosic for critical readings of a draft of this manuscript.
Errors were also reported by Rajiv Abraham, Apoorv Agarwal, Aris Anagnostopoulos, Atilla Soner Balkir, Arnaud Belletoile, Robin Bennett, Susan Biancani, Amitabh Chaudhary, Leland Chen, Anastasios Gounaris, Shrey Gupta, Waleed Hameid, Saman Haratizadeh, Lachlan Kang, Ed Knorr, Haewoon Kwak, Ellis Lau, Greg Lee, Ethan Lozano, Yunan Luo, Michael Mahoney, Justin Meyer, Bryant Moscon, Brad Penoff, Philips Kokoh Prasetyo, Qi Ge, Rich Seiter, Hitesh Shetty, Angad Singh, Sandeep Sripada, Dennis Sidharta, Krzysztof Stencel, Mark Storus, Roshan Sumbaly, Zack Taylor, Tim Triche Jr., Wang Bin, Weng ZhenBin, Robert West, Oscar Wu, Xie Ke, Nicolas Zhao, and Zhou Jingbo. The remaining errors are ours, of course.
J. L.
A. R.
J. D. U.
Palo Alto, CA
March, 2014
Contents
1 Data Mining
In this introductory chapter we begin with the essence of data mining and a discussion of how data mining is treated by the various disciplines that contribute to this field. We cover "Bonferroni's Principle," which is really a warning about overusing the ability to mine data. This chapter is also the place where we summarize a few useful ideas that are not data mining but are useful in understanding some important data-mining concepts. These include the TF.IDF measure of word importance, behavior of hash functions and indexes, and identities involving e, the base of natural logarithms. Finally, we give an outline of the topics covered in the balance of the book.
1.1 What is Data Mining?
The most commonly accepted definition of “data mining” is the discovery of “models” for data. A “model,” however, can be one of several things. We mention below the most important directions in modeling.
1.1.1 Statistical Modeling
Statisticians were the first to use the term “data mining.” Originally, “data mining” or “data dredging” was a derogatory term referring to attempts to extract information that was not supported by the data. Section 1.2 illustrates the sort of errors one can make by trying to extract what really isn’t in the data. Today, “data mining” has taken on a positive meaning. Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.
Example 1.1
Suppose our data is a set of numbers. This data is much simpler than data that would be data-mined, but it will serve as an example. A statistician might decide that the data comes from a Gaussian distribution and use a formula to compute the most likely parameters of this Gaussian. The mean and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.
1.1.2 Machine Learning
There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.
There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain moviegoers like or dislike it. Thus, in answering the "Netflix challenge" to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.
On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people's resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.
1.1.3 Computational Approaches to Modeling
More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either
1. Summarizing the data succinctly and approximately, or
2. Extracting the most prominent features of the data and ignoring the rest.
We shall explore these two approaches in the following sections.
1.1.4 Summarization
One of the most interesting forms of summarization is the PageRank idea, which made Google successful and which we shall cover in Chapter 5. In this form of Web mining, the entire complex structure of the Web is summarized by a single number for each page. This number, the “PageRank” of the page, is (oversimplifying somewhat) the probability that a random walker on the graph would be at that page at any given time. The remarkable property this ranking has is that it reflects very well the “importance” of the page – the degree to which typical searchers would like that page returned as an answer to their search query.
Another important form of summary – clustering – will be covered in Chapter 7. Here, data is viewed as points in a multidimensional space. Points that are “close” in this space are assigned to the same cluster. The clusters themselves are summarized, perhaps by giving the centroid of the cluster and the average distance from the centroid of points in the cluster. These cluster summaries become the summary of the entire data set.
Example 1.2
A famous instance of clustering to solve a problem took place long ago in London, and it was done entirely without computers. The physician John Snow, dealing with a Cholera outbreak, plotted the cases on a map of the city. A small illustration suggesting the process is shown in Fig. 1.1.
Figure 1.1: Plotting cholera cases on a map of London
The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of Cholera would not have been discovered.
1.1.5 Feature Extraction
The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:
1. Frequent Itemsets. This model makes sense for data that consists of "baskets" of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these "frequent itemsets" are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.
2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an online store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for "similar" customers and recommend something many of these customers have bought. This process is called "collaborative filtering." If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.
1.2 Statistical Limits on Data Mining
A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including "Bonferroni's Principle," a warning against overzealous use of data mining.
1.2.1 Total Information Awareness
In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information, in order to track terrorist activity. This idea naturally caused great concern among privacy advocates, and the project, called TIA, or Total Information Awareness, was eventually killed by Congress, although it is unclear whether the project in fact exists under another name. It is not the purpose of this book to discuss the difficult issue of the privacy-security tradeoff. However, the prospect of TIA or a system like it does raise technical questions about its feasibility and the realism of its assumptions.
The concern raised by many is that if you look at so much data, and you try to find within it activities that look like terrorist behavior, are you not going to find many innocent activities – or even illicit activities that are not terrorism – that will result in visits from the police and maybe worse than just a visit? The answer is that it all depends on how narrowly you define the activities that you look for. Statisticians have seen this problem in many guises and have a theory, which we introduce in the next section.
1.2.2 Bonferroni’s Principle
Suppose you have a certain amount of data, and you look for events of a certain type within that data. You can expect events of this type to occur, even if the data is completely random, and the number of occurrences of these events will grow as the size of the data grows. These occurrences are "bogus," in the sense that they have no cause other than that random data will always have some number of unusual features that look significant but aren't. A theorem of statistics, known as the Bonferroni correction, gives a statistically sound way to avoid most of these bogus positive responses to a search through the data. Without going into the statistical details, we offer an informal version, Bonferroni's principle, that helps us avoid treating random occurrences as if they were real: calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for. This observation is the informal statement of Bonferroni's principle.
In a situation like searching for terrorists, where we expect that there are few terrorists operating at any one time, Bonferroni’s principle says that we may only detect terrorists by looking for events that are so rare that they are unlikely to occur in random data. We shall give an extended example in the next section.
1.2.3 An Example of Bonferroni’s Principle
Suppose there are believed to be some “evildoers” out there, and we want to detect them. Suppose further that we have reason to believe that periodically, evildoers gather at a hotel to plot their evil. Let us make the following assumptions about the size of the problem:
1. There are one billion people who might be evildoers.
2. Everyone goes to a hotel one day in 100.
3. A hotel holds 100 people. Hence, there are 100,000 hotels – enough to hold the 1% of a billion people who visit a hotel on any given day.
4. We shall examine hotel records for 1000 days.
To find evildoers in this data, we shall look for people who, on two different days, were both at the same hotel. Suppose, however, that there really are no evildoers. That is, everyone behaves at random, deciding with probability 0.01 to visit a hotel on any given day, and if so, choosing one of the 10^{5} hotels at random. Would we find any pairs of people who appear to be evildoers?
We can do a simple approximate calculation as follows. The probability of any two people both deciding to visit a hotel on any given day is 0.0001. The chance that they will visit the same hotel is this probability divided by 10^{5}, the number of hotels. Thus, the chance that they will visit the same hotel on one given day is 10^{−9}. The chance that they will visit the same hotel on two different given days is the square of this number, 10^{−18}. Note that the hotels can be different on the two days.
Now, we must consider how many events will indicate evildoing. An "event" in this sense is a pair of people and a pair of days, such that the two people were at the same hotel on each of the two days. To simplify the arithmetic, note that for large n, (n choose 2) is about n^{2}/2. We shall use this approximation in what follows. Thus, the number of pairs of people is (10^{9} choose 2) = 5 × 10^{17}. The number of pairs of days is (1000 choose 2) = 5 × 10^{5}. The expected number of events that look like evildoing is the product of the number of pairs of people, the number of pairs of days, and the probability that any one pair of people and pair of days is an instance of the behavior we are looking for. That number is
5 × 10^{17} × 5 × 10^{5} × 10^{−18} = 250,000
That is, there will be a quarter of a million pairs of people who look like evildoers, even though they are not.
Now, suppose there really are 10 pairs of evildoers out there. The police will need to investigate a quarter of a million other pairs in order to find the real evildoers. In addition to the intrusion on the lives of half a million innocent people, the work involved is sufficiently great that this approach to finding evildoers is probably not feasible.
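The arithmetic in this example is easy to check programmatically. The following Python sketch (ours, not the book's) reproduces the estimate, using the same n^{2}/2 approximation for the number of pairs and exact rational arithmetic so no floating-point noise creeps in:

```python
from fractions import Fraction

def approx_pairs(n):
    # For large n, (n choose 2) is approximately n^2 / 2.
    return Fraction(n * n, 2)

people = 10**9                  # candidate evildoers
days = 1000                     # days of observation
hotels = 10**5                  # number of hotels
p_hotel = Fraction(1, 100)      # chance a person visits some hotel on a given day

# Chance two given people are at the same hotel on one given day:
p_same_day = p_hotel * p_hotel / hotels     # 10^-9
# Chance they coincide on two given days (the hotels may differ):
p_two_days = p_same_day ** 2                # 10^-18

expected = approx_pairs(people) * approx_pairs(days) * p_two_days
print(expected)   # 250000
```

The product of three huge and tiny numbers collapses to a quarter of a million, matching the text.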
1.2.4 Exercises for Section 1.2
Exercise 1.2.1
Using the information from Section 1.2.3, what would be the number of suspected pairs if the following changes were made to the data (and all other numbers remained as they were in that section)?
(a) The number of days of observation was raised to 2000.
(b) The number of people observed was raised to 2 billion (and there were therefore 200,000 hotels).
(c) We only reported a pair as suspect if they were at the same hotel at the same time on three different days.
Exercise 1.2.2
Suppose we have information about the supermarket purchases of 100 million people. Each person goes to the supermarket 100 times in a year and buys 10 of the 1000 items that the supermarket sells. We believe that a pair of terrorists will buy exactly the same set of 10 items (perhaps the ingredients for a bomb?) at some time during the year. If we search for pairs of people who have bought the same set of items, would we expect that any such people found were truly terrorists?
1.3 Things Useful to Know
In this section, we offer brief introductions to subjects that you may or may not have seen in your study of other courses. Each will be useful in the study of data mining. They include:
1. The TF.IDF measure of word importance.
2. Hash functions and their use.
3. Secondary storage (disk) and its effect on running time of algorithms.
4. The base e of natural logarithms and identities involving that constant.
5. Power laws.
1.3.1 Importance of Words in Documents
In several applications of data mining, we shall be faced with the problem of categorizing documents (sequences of words) by their topic. Typically, topics are identified by finding the special words that characterize documents about that topic. For instance, articles about baseball would tend to have many occurrences of words like "ball," "bat," "pitch," "run," and so on. Once we have classified documents to determine they are about baseball, it is not hard to notice that words such as these appear unusually frequently. However, until we have made the classification, it is not possible to identify these words as characteristic.
Thus, classification often starts by looking at documents, and finding the significant words in those documents. Our first guess might be that the words appearing most frequently in a document are the most significant. However, that intuition is exactly opposite of the truth. The most frequent words will most surely be the common words such as “the” or “and,” which help build ideas but do not carry any significance themselves. In fact, the several hundred most common words in English (called stop words) are often removed from documents before any attempt to classify them.
In fact, the indicators of the topic are relatively rare words. However, not all rare words are equally useful as indicators. There are certain words, for example “notwithstanding” or “albeit,” that appear rarely in a collection of documents, yet do not tell us anything useful. On the other hand, a word like “chukker” is probably equally rare, but tips us off that the document is about the sport of polo. The difference between rare words that tell us something and those that do not has to do with the concentration of the useful words in just a few documents. That is, the presence of a word like “albeit” in a document does not make it terribly more likely that it will appear multiple times. However, if an article mentions “chukker” once, it is likely to tell us what happened in the “first chukker,” then the “second chukker,” and so on. That is, the word is likely to be repeated if it appears at all.
The formal measure of how concentrated into relatively few documents are the occurrences of a given word is called TF.IDF (Term Frequency times Inverse Document Frequency). It is normally computed as follows. Suppose we have a collection of N documents. Define f_{ij} to be the frequency (number of occurrences) of term (word) i in document j. Then, define the term frequency TF_{ij} to be:
TF_{ij} = f_{ij} / max_{k} f_{kj}
That is, the term frequency of term i in document j is f_{ij} normalized by dividing it by the maximum number of occurrences of any term (perhaps excluding stop words) in the same document. Thus, the most frequent term in document j gets a TF of 1, and other terms get fractions as their term frequency for this document.
The IDF for a term is defined as follows. Suppose term i appears in n_{i} of the N documents in the collection. Then IDF_{i} = log_{2}(N/n_{i}). The TF.IDF score for term i in document j is then defined to be TF_{ij} × IDF_{i}. The terms with the highest TF.IDF score are often the terms that best characterize the topic of the document.
Example 1.3
Suppose our repository consists of 2^{20} = 1,048,576 documents. Suppose word w appears in 2^{10} = 1024 of these documents. Then IDF_{w} = log_{2}(2^{20}/2^{10}) = log_{2}(2^{10}) = 10. Consider a document j in which w appears 20 times, and that is the maximum number of times in which any word appears (perhaps after eliminating stop words). Then TF_{wj} = 1, and the TF.IDF score for w in document j is 10.
Suppose that in document k, word w appears once, while the maximum number of occurrences of any word in this document is 20. Then TF_{wk} = 1/20, and the TF.IDF score for w in document k is 1/2.
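As a quick check of Example 1.3, here is a small Python sketch (our own illustration; the function name tf_idf and its parameters are our choices, not the book's) that computes the score exactly as defined above:

```python
from math import log2

def tf_idf(f_ij, max_f_kj, n_i, N):
    """TF.IDF of term i in document j.

    f_ij     -- occurrences of term i in document j
    max_f_kj -- occurrences of the most frequent term in document j
    n_i      -- number of documents that contain term i
    N        -- total number of documents in the collection
    """
    tf = f_ij / max_f_kj          # TF_ij = f_ij / max_k f_kj
    idf = log2(N / n_i)           # IDF_i = log2(N / n_i)
    return tf * idf

N, n_w = 2**20, 2**10             # 1,048,576 documents; w appears in 1024 of them
print(tf_idf(20, 20, n_w, N))     # document j: TF = 1, score 10.0
print(tf_idf(1, 20, n_w, N))      # document k: TF = 1/20, score 0.5
```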
1.3.2 Hash Functions
The reader has probably heard of hash tables, and perhaps used them in Java classes or similar packages. The hash functions that make hash tables feasible are also essential components in a number of datamining algorithms, where the hash table takes an unfamiliar form. We shall review the basics here.
First, a hash function h takes a hash-key value as an argument and produces a bucket number as a result. The bucket number is an integer, normally in the range 0 to B − 1, where B is the number of buckets. Hash-keys can be of any type. There is an intuitive property of hash functions that they "randomize" hash-keys. To be precise, if hash-keys are drawn randomly from a reasonable population of possible hash-keys, then h will send approximately equal numbers of hash-keys to each of the B buckets. It would be impossible to do so if, for example, the population of possible hash-keys were smaller than B. Such a population would not be "reasonable." However, there can be more subtle reasons why a hash function fails to achieve an approximately uniform distribution into buckets.
Example 1.4
Suppose hash-keys are positive integers. A common and simple hash function is to pick h(x) = x mod B, that is, the remainder when x is divided by B. That choice works fine if our population of hash-keys is all positive integers; 1/Bth of the integers will be assigned to each of the buckets. However, suppose our population is the even integers, and B = 10. Then only buckets 0, 2, 4, 6, and 8 can be the value of h(x), and the hash function is distinctly nonrandom in its behavior. On the other hand, if we picked B = 11, then we would find that 1/11th of the even integers get sent to each of the 11 buckets, so the hash function would work very well.
The generalization of Example 1.4 is that when hash-keys are integers, choosing B so that it has a common factor with all (or even most of) the possible hash-keys will result in a nonrandom distribution into buckets. Thus, it is normally preferred that we choose B to be a prime. That choice reduces the chance of nonrandom behavior, although we still have to consider the possibility that all hash-keys have B as a factor. Of course there are many other types of hash functions not based on modular arithmetic. We shall not try to summarize the options here, but some sources of information will be mentioned in the bibliographic notes.
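The effect described in Example 1.4 is easy to verify empirically. This sketch (the function name is ours) hashes the even integers below 10,000 with B = 10 and with the prime B = 11.

```python
from collections import Counter

def bucket_counts(keys, B):
    """Count how many hash-keys h(x) = x mod B sends to each bucket."""
    return Counter(x % B for x in keys)

evens = range(0, 10000, 2)  # the population of even integers

# B = 10 shares the factor 2 with every key: only buckets 0,2,4,6,8 are ever hit.
print(sorted(bucket_counts(evens, 10)))  # [0, 2, 4, 6, 8]

# B = 11 is prime: all 11 buckets receive roughly 1/11 of the keys.
print(sorted(bucket_counts(evens, 11)))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```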
What if hash-keys are not integers? In a sense, all data types have values that are composed of bits, and sequences of bits can always be interpreted as integers. However, there are some simple rules that enable us to convert common types to integers. For example, if hash-keys are strings, convert each character to its ASCII or Unicode equivalent, which can be interpreted as a small integer. Sum the integers before dividing by B. As long as B is smaller than the typical sum of character codes for the population of strings, the distribution into buckets will be relatively uniform. If B is larger, then we can partition the characters of a string into groups of several characters each. Treat the concatenation of the codes for the characters of a group as a single integer. Sum the integers associated with all the groups of a string, and divide by B as before. For instance, if B is around a billion, or 2^{30}, then grouping characters four at a time will give us 32-bit integers. The sum of several of these will distribute fairly evenly into a billion buckets.
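A sketch of this grouping scheme, assuming ASCII text so that each character code fits in 8 bits; the function name and the group size of four are illustrative choices.

```python
def hash_string(s, B, group=4):
    """Hash a string by grouping characters, treating the concatenated
    character codes of each group as one integer, summing those integers,
    and taking the remainder mod B."""
    total = 0
    for i in range(0, len(s), group):
        code = 0
        for ch in s[i:i + group]:
            # Concatenate 8-bit character codes; assumes ASCII input.
            code = code * 256 + ord(ch)
        total += code
    return total % B

# Grouping four characters at a time yields 32-bit integers per group.
print(hash_string("hello, world", 1000003))
```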
For more complex data types, we can extend the idea used for converting strings to integers, recursively:
For a type that is a record, each of whose components has its own type, recursively convert the value of each component to an integer, using the algorithm appropriate for the type of that component. Sum the integers for the components, and convert the integer sum to buckets by dividing by B.
For a type that is an array, set, or bag of elements of some one type, convert the values of the elements’ type to integers, sum the integers, and divide by B.
1.3.3 Indexes
An index is a data structure that makes it efficient to retrieve objects given the value of one or more elements of those objects. The most common situation is one where the objects are records, and the index is on one of the fields of that record. Given a value v for that field, the index lets us retrieve all the records with value v in that field. For example, we could have a file of (name, address, phone) triples, and an index on the phone field. Given a phone number, the index allows us to find quickly the record or records with that phone number.
There are many ways to implement indexes, and we shall not attempt to survey the matter here. The bibliographic notes give suggestions for further reading. However, a hash table is one simple way to build an index. The field or fields on which the index is based form the hash-key for a hash function. Records have the hash function applied to the value of the hash-key, and the record itself is placed in the bucket whose number is determined by the hash function. The bucket could be a list of records in main memory, or a disk block, for example.
Then, given a hash-key value, we can hash it, find the bucket, and need to search only that bucket to find the records with that value for the hash-key. If we choose the number of buckets B to be comparable to the number of records in the file, then there will be relatively few records in any bucket, and the search of a bucket takes little time.
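A minimal in-memory version of such an index might look like the following sketch, which uses Python's built-in hash in place of a purpose-built hash function; the record layout mirrors the (name, address, phone) triples mentioned above.

```python
def make_index(records, key_field, B):
    """Build a hash-table index: an array of B buckets, each a list of records."""
    buckets = [[] for _ in range(B)]
    for rec in records:
        buckets[hash(rec[key_field]) % B].append(rec)
    return buckets

def lookup(buckets, key_field, value):
    """Hash the key value, then search only the one bucket it lands in."""
    bucket = buckets[hash(value) % len(buckets)]
    return [rec for rec in bucket if rec[key_field] == value]

records = [
    {"name": "Sue", "address": "100 Main St.", "phone": "800-555-1212"},
    {"name": "Sally", "address": "200 Oak Ave.", "phone": "800-555-1000"},
]
index = make_index(records, "phone", 17)
print(lookup(index, "phone", "800-555-1212"))  # Sue's record
```

Note that a lookup must still compare the key inside the bucket: other records hashing to the same bucket need not share the phone number, just as Example 1.5 observes.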
Figure 1.2: A hash table used as an index; phone numbers are hashed to buckets, and the entire record is placed in the bucket whose number is the hash value of the phone
Example 1.5
Figure 1.2 suggests what a main-memory index of records with name, address, and phone fields might look like. Here, the index is on the phone field, and buckets are linked lists. We show the phone number 800-555-1212 hashed to bucket number 17. There is an array of bucket headers, whose ith element is the head of a linked list for the bucket numbered i. We show one element of the linked list expanded. It contains a record with name, address, and phone fields. This record is in fact one with the phone number 800-555-1212. Other records in that bucket may or may not have this phone number. We only know that whatever phone number they have is one that hashes to 17.
1.3.4 Secondary Storage
It is important, when dealing with large-scale data, that we have a good understanding of the difference in time taken to perform computations when the data is initially on disk, as opposed to the time needed if the data is initially in main memory. The physical characteristics of disks are another subject on which we could say much, but shall say only a little and leave the interested reader to follow the bibliographic notes.
Disks are organized into blocks, which are the minimum units that the operating system uses to move data between main memory and disk. For example, the Windows operating system uses blocks of 64K bytes (i.e., 2^{16} = 65,536 bytes to be exact). It takes approximately ten milliseconds to access (move the disk head to the track of the block and wait for the block to rotate under the head) and read a disk block. That delay is at least five orders of magnitude (a factor of 10^{5}) slower than the time taken to read a word from main memory, so if all we want to do is access a few bytes, there is an overwhelming benefit to having data in main memory. In fact, if we want to do something simple to every byte of a disk block, e.g., treat the block as a bucket of a hash table and search for a particular value of the hash-key among all the records in that bucket, then the time taken to move the block from disk to main memory will be far larger than the time taken to do the computation.
By organizing our data so that related data is on a single cylinder (the collection of blocks reachable at a fixed radius from the center of the disk, and therefore accessible without moving the disk head), we can read all the blocks on the cylinder into main memory in considerably less than 10 milliseconds per block. You can assume that a disk cannot transfer data to main memory at more than a hundred million bytes per second, no matter how that data is organized. That is not a problem when your dataset is a megabyte. But a dataset of a hundred gigabytes or a terabyte presents problems just accessing it, let alone doing anything useful with it.
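A back-of-the-envelope check of these transfer times, at the assumed ceiling of a hundred million bytes per second, makes the point concrete.

```python
rate = 100 * 10**6  # bytes per second: the assumed disk-transfer ceiling

print(10**6 / rate)   # 0.01 second for a megabyte: no problem
print(10**11 / rate)  # 1000 seconds (about 17 minutes) just to read 100 GB
print(10**12 / rate)  # 10000 seconds (nearly 3 hours) to read a terabyte
```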
1.3.5 The Base of Natural Logarithms
The constant e = 2.7182818 · · · has a number of useful special properties. In particular, e is the limit of (1 + 1/x)^{x} as x goes to infinity. The values of this expression for x = 1, 2, 3, 4 are approximately 2, 2.25, 2.37, 2.44, so you should find it easy to believe that the limit of this sequence is around 2.72.
Some algebra lets us obtain approximations to many seemingly complex expressions. Consider (1+a)^{b}, where a is small. We can rewrite the expression as (1+a)^{(1/a)(ab)}. Then substitute a = 1/x and 1/a = x, so we have (1+ 1/x )^{x(ab)}, which is
((1 + 1/x)^{x})^{ab}
Since a is assumed small, x is large, so the subexpression (1 + 1/x)^{x} will be close to the limiting value of e. We can thus approximate (1 + a)^{b} as e^{ab}.
Similar identities hold when a is negative. That is, the limit as x goes to infinity of (1 − 1/x)^{x} is 1/e. It follows that the approximation (1 + a)^{b} = e^{ab} holds even when a is a small negative number. Put another way, (1 − a)^{b} is approximately e^{−ab} when a is small and b is large.
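A quick numeric check of the approximation (1 + a)^{b} ≈ e^{ab} for a few small values of a, positive and negative:

```python
import math

# Compare (1 + a)^b with its approximation e^(ab) when a is small and b is large.
for a, b in [(0.001, 1000), (-0.001, 1000), (0.01, 500)]:
    exact = (1 + a) ** b
    approx = math.exp(a * b)
    print(f"(1 + {a})^{b} = {exact:.5f}, e^(ab) = {approx:.5f}")
```

The agreement is close for a = ±0.001, and noticeably looser for the larger value a = 0.01, as the derivation above would suggest.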
Some other useful approximations follow from the Taylor expansion of e^{x}. That is, e^{x} = Σ_{i=0}^{∞} x^{i}/i!, or e^{x} = 1 + x + x^{2}/2 + x^{3}/6 + x^{4}/24 + · · · . When x is large, the above series converges slowly, although it does converge, because n! grows faster than x^{n} for any constant x. However, when x is small, either positive or negative, the series converges rapidly, and only a few terms are necessary to get a good approximation.
Example 1.6
Let x = 1/2. Then
e^{1/2} = 1 + 1/2 + 1/8 + 1/48 + 1/384 + · · ·
or approximately e^{1/2} = 1.64844
Let x = −1. Then
e^{−1} = 1 − 1 + 1/2 − 1/6 + 1/24 − 1/120 + 1/720 − 1/5040 + · · ·
or approximately e^{−1} = 0.36786
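The series is easy to evaluate numerically. Truncating it at the same number of terms shown in Example 1.6 above reproduces the approximate values given there; the function name and the term counts are our own choices.

```python
import math

def exp_taylor(x, terms=20):
    """Approximate e^x by summing the first `terms` terms of sum x^i / i!."""
    return sum(x ** i / math.factorial(i) for i in range(terms))

# Five terms for x = 1/2 and eight terms for x = -1, as in Example 1.6.
print(round(exp_taylor(0.5, terms=5), 5))  # 1.64844
print(round(exp_taylor(-1, terms=8), 5))   # 0.36786
```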
1.3.6 Power Laws
There are many phenomena that relate two variables by a power law, that is, a linear relationship between the logarithms of the variables. Figure 1.3 suggests such a relationship. If x is the horizontal axis and y is the vertical axis, then the relationship is log_{10} y = 6 − 2 log_{10} x.
Figure 1.3: A power law with a slope of −2
Example 1.7
We might examine book sales at Amazon.com, and let x represent the rank of books by sales. Then y is the number of sales of the xth best-selling book over some period. The implication of the graph of Fig. 1.3 would be that the best-selling book sold 1,000,000 copies, the 10th best-selling book sold 10,000 copies, the 100th best-selling book sold 100 copies, and so on for all ranks between these numbers and beyond. The implication that above rank 1000 the sales are a fraction of a book is too extreme, and we would in fact expect the line to flatten out for ranks much higher than 1000.
The Matthew Effect
Often, the existence of power laws with values of the exponent higher than 1 is explained by the Matthew effect. In the biblical Book of Matthew, there is a verse about “the rich get richer.” Many phenomena exhibit this behavior, where getting a high value of some property causes that very property to increase. For example, if a Web page has many links in, then people are more likely to find the page and may choose to link to it from one of their pages as well. As another example, if a book is selling well on Amazon, then it is likely to be advertised when customers go to the Amazon site. Some of these people will choose to buy the book as well, thus increasing the sales of this book.
The general form of a power law relating x and y is log y = b + a log x. If we raise the base of the logarithm (which doesn’t actually matter), say e, to the values on both sides of this equation, we get y = e^{b}e^{a log x} = e^{b}x^{a}. Since e^{b} is just “some constant,” let us replace it by the constant c. Thus, a power law can be written as y = cx^{a} for some constants a and c.
Example 1.8
In Fig. 1.3 we see that when x = 1, y = 10^{6}, and when x = 1000, y = 1. Making the first substitution, we see 10^{6} = c. The second substitution gives us 1 = c(1000)^{a}. Since we now know c = 10^{6}, the second equation gives us 1 = 10^{6}(1000)^{a}, from which we see a = −2. That is, the law expressed by Fig. 1.3 is y = 10^{6}x^{−2}, or y = 10^{6}/x^{2}.
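The arithmetic of Example 1.8 can be checked directly: on log-log axes a power law is a straight line, so a is the slope between the two points and c follows from either point.

```python
import math

# Two points read off Fig. 1.3: (x = 1, y = 10^6) and (x = 1000, y = 1).
x1, y1 = 1, 10**6
x2, y2 = 1000, 1

# Fit y = c * x^a: a is the slope between the points on log-log axes.
a = (math.log10(y2) - math.log10(y1)) / (math.log10(x2) - math.log10(x1))
c = y1 / x1 ** a

print(a, c)  # -2.0 1000000.0
```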
We shall meet in this book many ways that power laws govern phenomena. Here are some examples:
1. Node Degrees in the Web Graph: Order all pages by the number of in-links to that page. Let x be the position of a page in this ordering, and let y be the number of in-links to the xth page. Then y as a function of x looks very much like Fig. 1.3. The exponent a is slightly larger in magnitude than the −2 shown there; it has been found closer to 2.1.
2. Sales of Products: Order products, say books at Amazon.com, by their sales over the past year. Let y be the number of sales of the xth most popular book. Again, the function y(x) will look something like Fig. 1.3. We shall discuss the consequences of this distribution of sales in Section 9.1.2, where we take up the matter of the “long tail.”
3. Sizes of Web Sites: Count the number of pages at Web sites, and order sites by the number of their pages. Let y be the number of pages at the xth site. Again, the function y(x) follows a power law.
4. Zipf’s Law: This power law originally referred to the frequency of words in a collection of documents. If you order words by frequency, and let y be the number of times the xth word in the order appears, then you get a power law, although with a much shallower slope than that of Fig. 1.3. Zipf’s observation was that y = cx^{−1/2}. Interestingly, a number of other kinds of data follow this particular power law. For example, if we order states in the US by population and let y be the population of the xth most populous state, then x and y obey Zipf’s law approximately.
1.3.7 Exercises for Section 1.3
Exercise 1.3.1
Suppose there is a repository of ten million documents. What (to the nearest integer) is the IDF for a word that appears in (a) 40 documents (b) 10,000 documents?
Exercise 1.3.2
Suppose there is a repository of ten million documents, and word w appears in 320 of them. In a particular document d, the maximum number of occurrences of a word is 15. Approximately what is the TF.IDF score for w if that word appears (a) once (b) five times?
Exercise 1.3.3
Suppose hash-keys are drawn from the population of all nonnegative integers that are multiples of some constant c, and hash function h(x) is x mod 15. For what values of c will h be a suitable hash function, i.e., a large random choice of hash-keys will be divided roughly equally into buckets?
Exercise 1.3.4
In terms of e, give approximations to
(a) (1.01)^{500} (b) (1.05)^{1000} (c) (0.9)^{40}
Exercise 1.3.5
Use the Taylor expansion of e^{x} to compute, to three decimal places: (a) e^{1/10} (b) e^{−1/10} (c) e^{2}.
1.4 Outline of the Book
This section gives brief summaries of the remaining chapters of the book.
Chapter 2 is not about data mining per se. Rather, it introduces us to the MapReduce methodology for exploiting parallelism in computing clouds (racks of interconnected processors). There is reason to believe that cloud computing, and MapReduce in particular, will become the normal way to compute when analysis of very large amounts of data is involved. A pervasive issue in later chapters will be the exploitation of the MapReduce methodology to implement the algorithms we cover.
Chapter 3 is about finding similar items. Our starting point is that items can be represented by sets of elements, and similar sets are those that have a large fraction of their elements in common. The key techniques of minhashing and locality-sensitive hashing are explained. These techniques have numerous applications and often give surprisingly efficient solutions to problems that appear impossible for massive data sets.
In Chapter 4, we consider data in the form of a stream. The difference between a stream and a database is that the data in a stream is lost if you do not do something about it immediately. Important examples of streams are the streams of search queries at a search engine or clicks at a popular Web site. In this chapter, we see several of the surprising applications of hashing that make management of stream data feasible.
Chapter 5 is devoted to a single application: the computation of PageRank. This computation is the idea that made Google stand out from other search engines, and it is still an essential part of how search engines know what pages the user is likely to want to see. Extensions of PageRank are also essential in the fight against spam (euphemistically called “search engine optimization”), and we shall examine the latest extensions of the idea for the purpose of combating spam.
Then, Chapter 6 introduces the market-basket model of data, and its canonical problems of association rules and finding frequent itemsets. In the market-basket model, data consists of a large collection of baskets, each of which contains a small set of items. We give a sequence of algorithms capable of finding all frequent pairs of items, that is, pairs of items that appear together in many baskets. Another sequence of algorithms is useful for finding most of the frequent itemsets larger than pairs, with high efficiency.
Chapter 7 examines the problem of clustering. We assume a set of items with a distance measure defining how close or far one item is from another. The goal is to examine a large amount of data and partition it into subsets (clusters), each cluster consisting of items that are all close to one another, yet far from items in the other clusters.
Chapter 8 is devoted to online advertising and the computational problems it engenders. We introduce the notion of an online algorithm – one where a good response must be given immediately, rather than waiting until we have seen the entire dataset. The idea of competitive ratio is another important concept covered in this chapter; it is the ratio of the guaranteed performance of an online algorithm compared with the performance of the optimal algorithm that is allowed to see all the data before making any decisions. These ideas are used to give good algorithms that match bids by advertisers for the right to display their ad in response to a query against the search queries arriving at a search engine.
Chapter 9 is devoted to recommendation systems. Many Web applications involve advising users on what they might like. The Netflix challenge is one example, where it is desired to predict what movies a user would like, or Amazon’s problem of pitching a product to a customer based on information about what they might be interested in buying. There are two basic approaches to recommendation. We can characterize items by features, e.g., the stars of a movie, and recommend items with the same features as those the user is known to like. Or, we can look at other users with preferences similar to that of the user in question, and see what they liked (a technique known as collaborative filtering).
In Chapter 10, we study social networks and algorithms for their analysis. The canonical example of a social network is the graph of Facebook friends, where the nodes are people, and edges connect two people if they are friends. Directed graphs, such as followers on Twitter, can also be viewed as social networks. A common example of a problem to be addressed is identifying “communities,” that is, small sets of nodes with an unusually large number of edges among them. Other questions about social networks are general questions about graphs, such as computing the transitive closure or diameter of a graph, but are made more difficult by the size of typical networks.
Chapter 11 looks at dimensionality reduction. We are given a very large matrix, typically sparse. Think of the matrix as representing a relationship between two kinds of entities, e.g., ratings of movies by viewers. Intuitively, there are a small number of concepts, many fewer concepts than there are movies or viewers, that explain why certain viewers like certain movies. We offer several algorithms that simplify matrices by decomposing them into a product of matrices that are much smaller in one of the two dimensions. One matrix relates entities of one kind to the small number of concepts and another relates the concepts to the other kind of entity. If done correctly, the product of the smaller matrices will be very close to the original matrix.
Finally, Chapter 12 discusses algorithms for machine learning from very large datasets. Techniques covered include perceptrons, supportvector machines, finding models by gradient descent, nearestneighbor models, and decision trees.
1.5 Summary of Chapter 1
 Data Mining: This term refers to the process of extracting useful models of data. Sometimes, a model can be a summary of the data, or it can be the set of most extreme features of the data.
 Bonferroni’s Principle: If we are willing to view as an interesting feature of data something of which many instances can be expected to exist in random data, then we cannot rely on such features being significant. This observation limits our ability to mine data for features that are not sufficiently rare in practice.
 TF.IDF: The measure called TF.IDF lets us identify words in a collection of documents that are useful for determining the topic of each document. A word has high TF.IDF score in a document if it appears in relatively few documents, but appears in this one, and when it appears in a document it tends to appear many times.
 Hash Functions: A hash function maps hash-keys of some data type to integer bucket numbers. A good hash function distributes the possible hash-key values approximately evenly among buckets. Any data type can be the domain of a hash function.
 Indexes: An index is a data structure that allows us to store and retrieve data records efficiently, given the value in one or more of the fields of the record. Hashing is one way to build an index.
 Storage on Disk: When data must be stored on disk (secondary memory), it takes very much more time to access a desired data item than if the same data were stored in main memory. When data is large, it is important that algorithms strive to keep needed data in main memory.
 Power Laws: Many phenomena obey a law that can be expressed as y = cx^{a} for some power a, often around −2. Such phenomena include the sales of the xth most popular book, or the number of in-links to the xth most popular page.
1.6 References for Chapter 1
[7] is a clear introduction to the basics of data mining. [2] covers data mining principally from the point of view of machine learning and statistics.
For construction of hash functions and hash tables, see [4]. Details of the TF.IDF measure and other matters regarding document processing can be found in [5]. See [3] for more on managing indexes, hash tables, and data on disk.
Power laws pertaining to the Web were explored by [1]. The Matthew effect was first observed in [6].
1. A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, “Graph structure in the web,” Computer Networks 33:1–6, pp. 309–320, 2000.
2. M.M. Gaber, Scientific Data Mining and Knowledge Discovery — Principles and Foundations, Springer, New York, 2010.
3. H. Garcia-Molina, J.D. Ullman, and J. Widom, Database Systems: The Complete Book, Second Edition, Prentice-Hall, Upper Saddle River, NJ, 2009.
4. D.E. Knuth, The Art of Computer Programming Vol. 3 (Sorting and Searching), Second Edition, Addison-Wesley, Upper Saddle River, NJ, 1998.
5. C.D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge Univ. Press, 2008.
6. R.K. Merton, “The Matthew effect in science,” Science 159:3810, pp. 56–63, Jan. 5, 1968.
7. P.N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, Upper Saddle River, NJ, 2005.
1.7 Footnotes for Chapter 1
1
This startup attempted to use machine learning to mine largescale data, and hired many of the top machinelearning people to do so. Unfortunately, it was not able to survive.
3
That is, assume our hypothesis that terrorists will surely buy a set of 10 items in common at some time during the year. We don’t want to address the matter of whether or not terrorists would necessarily do so.
2 MapReduce and the New Software Stack
Modern data-mining applications, often called “big-data” analysis, require us to manage immense amounts of data quickly. In many of these applications, the data is extremely regular, and there is ample opportunity to exploit parallelism. Important examples are:
1. The ranking of Web pages by importance, which involves an iterated matrix-vector multiplication where the dimension is many billions.
2. Searches in “friends” networks at social-networking sites, which involve graphs with hundreds of millions of nodes and many billions of edges.
To deal with applications such as these, a new software stack has evolved. These programming systems are designed to get their parallelism not from a “supercomputer,” but from “computing clusters” – large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches. The software stack begins with a new form of file system, called a “distributed file system,” which features much larger units than the disk blocks in a conventional operating system. Distributed file systems also provide replication of data or redundancy to protect against the frequent media failures that occur when data is distributed over thousands of low-cost compute nodes.
On top of these file systems, many different higher-level programming systems have been developed. Central to the new software stack is a programming system called MapReduce. Implementations of MapReduce enable many of the most common calculations on large-scale data to be performed on computing clusters efficiently and in a way that is tolerant of hardware failures during the computation.
MapReduce systems are evolving and extending rapidly. Today, it is common for MapReduce programs to be created from still higher-level programming systems, often an implementation of SQL. Further, MapReduce turns out to be a useful, but simple, case of more general and powerful ideas. We include in this chapter a discussion of generalizations of MapReduce, first to systems that support acyclic workflows and then to systems that implement recursive algorithms.
Our last topic for this chapter is the design of good MapReduce algorithms, a subject that often differs significantly from the matter of designing good parallel algorithms to be run on a supercomputer. When designing MapReduce algorithms, we often find that the greatest cost is in the communication. We thus investigate communication cost and what it tells us about the most efficient MapReduce algorithms. For several common applications of MapReduce we are able to give families of algorithms that optimally trade the communication cost against the degree of parallelism.
2.1 Distributed File Systems
2.1.1 Physical Organization of Compute Nodes
2.1.2 Large-Scale File-System Organization
2.2 MapReduce
2.2.1 The Map Tasks
2.2.2 Grouping by Key
2.2.3 The Reduce Tasks
2.2.4 Combiners
2.2.5 Details of MapReduce Execution
2.2.6 Coping With Node Failures
2.2.7 Exercises for Section 2.2
2.3 Algorithms Using MapReduce
2.3.1 Matrix-Vector Multiplication by MapReduce
2.3.2 If the Vector v Cannot Fit in Main Memory
2.3.3 Relational-Algebra Operations
2.3.4 Computing Selections by MapReduce
2.3.5 Computing Projections by MapReduce
2.3.6 Union, Intersection, and Difference by MapReduce
2.3.7 Computing Natural Join by MapReduce
2.3.8 Grouping and Aggregation by MapReduce
2.3.9 Matrix Multiplication
2.3.10 Matrix Multiplication with One MapReduce Step
2.3.11 Exercises for Section 2.3
2.4 Extensions to MapReduce
2.4.1 Workflow Systems
2.4.2 Recursive Extensions to MapReduce
2.4.3 Pregel
2.4.4 Exercises for Section 2.4
2.5 The Communication Cost Model
2.5.1 Communication Cost for Task Networks
2.5.2 Wall-Clock Time
2.5.3 Multiway Joins
2.5.4 Exercises for Section 2.5
2.6 Complexity Theory for MapReduce
2.6.1 Reducer Size and Replication Rate
2.6.2 An Example: Similarity Joins
2.6.3 A Graph Model for MapReduce Problems
2.6.4 Mapping Schemas
2.6.5 When Not All Inputs Are Present
2.6.6 Lower Bounds on Replication Rate
2.6.7 Case Study: Matrix Multiplication
2.6.8 Exercises for Section 2.6
2.7 Summary of Chapter 2
 Cluster Computing: A common architecture for very large-scale applications is a cluster of compute nodes (processor chip, main memory, and disk). Compute nodes are mounted in racks, and the nodes on a rack are connected, typically by gigabit Ethernet. Racks are also connected by a high-speed network or switch.
 Distributed File Systems: An architecture for very large-scale file systems has developed recently. Files are composed of chunks of about 64 megabytes, and each chunk is replicated several times, on different compute nodes or racks.
 MapReduce: This programming system allows one to exploit parallelism inherent in cluster computing, and manages the hardware failures that can occur during a long computation on many nodes. Many Map tasks and many Reduce tasks are managed by a Master process. Tasks on a failed compute node are rerun by the Master.
 The Map Function: This function is written by the user. It takes a collection of input objects and turns each into zero or more key-value pairs. Keys are not necessarily unique.
 The Reduce Function: A MapReduce programming system sorts all the key-value pairs produced by all the Map tasks, forms all the values associated with a given key into a list, and distributes key-list pairs to Reduce tasks. Each Reduce task combines the elements on each list, by applying the function written by the user. The results produced by all the Reduce tasks form the output of the MapReduce process.
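The two functions can be illustrated with the canonical word-count example. This sketch simulates the sort-and-group step of a MapReduce system in ordinary Python rather than on a cluster; the function names are our own.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    """Map: turn one input object into zero or more key-value pairs."""
    return [(word, 1) for word in document.split()]

def reduce_fn(key, values):
    """Reduce: combine the list of values associated with one key."""
    return (key, sum(values))

documents = ["the cat sat", "the dog sat"]

# Simulate the MapReduce system: apply Map, sort by key, group each key's
# values into a list, and apply Reduce to each key-list pair.
pairs = sorted(p for doc in documents for p in map_fn(doc))
result = [reduce_fn(key, [v for _, v in group])
          for key, group in groupby(pairs, key=itemgetter(0))]
print(result)  # [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]
```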
 Reducers: It is often convenient to refer to the application of the Reduce function to a single key and its associated value list as a “reducer.”
 Hadoop: This programming system is an opensource implementation of a distributed file system (HDFS, the Hadoop Distributed File System) and MapReduce (Hadoop itself). It is available through the Apache Foundation.
 Managing Compute-Node Failures: MapReduce systems support restart of tasks that fail because their compute node, or the rack containing that node, fails. Because Map and Reduce tasks deliver their output only after they finish, it is possible to restart a failed task without concern for possible repetition of the effects of that task. It is necessary to restart the entire job only if the node at which the Master executes fails.
 Applications of MapReduce: While not all parallel algorithms are suitable for implementation in the MapReduce framework, there are simple implementations of matrix-vector and matrix-matrix multiplication. Also, the principal operators of relational algebra are easily implemented in MapReduce.
 Workflow Systems: MapReduce has been generalized to systems that support any acyclic collection of functions, each of which can be instantiated by any number of tasks, each responsible for executing that function on a portion of the data.
 Recursive Workflows: When implementing a recursive collection of functions, it is not always possible to preserve the ability to restart any failed task, because recursive tasks may have produced output that was consumed by another task before the failure. A number of schemes for checkpointing parts of the computation to allow restart of single tasks, or restart of all tasks from a recent point, have been proposed.
 Communication Cost: Many applications of MapReduce or similar systems do very simple things for each task. Then, the dominant cost is usually the cost of transporting data from where it is created to where it is used. In these cases, efficiency of a MapReduce algorithm can be estimated by calculating the sum of the sizes of the inputs to all the tasks.
 Multiway Joins: It is sometimes more efficient to replicate tuples of the relations involved in a join and have the join of three or more relations computed as a single MapReduce job. The technique of Lagrangean multipliers can be used to optimize the degree of replication for each of the participating relations.
 Star Joins: Analytic queries often involve a very large fact table joined with smaller dimension tables. These joins can always be done efficiently by the multiway-join technique. An alternative is to distribute the fact table and replicate the dimension tables permanently, using the same strategy as would be used if we were taking the multiway join of the fact table and every dimension table.
 Replication Rate and Reducer Size: It is often convenient to measure communication by the replication rate, which is the communication per input. Also, the reducer size is the maximum number of inputs associated with any reducer. For many problems, it is possible to derive a lower bound on replication rate as a function of the reducer size.
 Representing Problems as Graphs: It is possible to represent many problems that are amenable to MapReduce computation by a graph in which nodes represent inputs and outputs. An output is connected to all the inputs that are needed to compute that output.
 Mapping Schemas: Given the graph of a problem, and given a reducer size, a mapping schema is an assignment of the inputs to one or more reducers so that no reducer is assigned more inputs than the reducer size permits, and yet for every output there is some reducer that gets all the inputs needed to compute that output. The requirement that there be a mapping schema for any MapReduce algorithm is a good expression of what makes MapReduce algorithms different from general parallel computations.
 Matrix Multiplication by MapReduce: There is a family of one-pass MapReduce algorithms that performs multiplication of n × n matrices with the minimum possible replication rate r = 2n^2/q, where q is the reducer size. On the other hand, a two-pass MapReduce algorithm for the same problem with the same reducer size can use up to a factor of n less communication.
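The one-pass matrix-multiplication scheme summarized above can be sketched by simulating the map and reduce phases in memory. This is a toy Python sketch, not a distributed implementation: each element of M is replicated to the n reducers (i, k) that need it, giving reducer size q = 2n and replication rate n = 2n^2/q, matching the bound above. Matrix names and the assumption of square matrices are illustrative.

```python
from collections import defaultdict

def mapreduce_matmul(M, N):
    """One-pass MapReduce matrix multiplication, simulated in memory.

    Map: element M[i][j] is sent to every reducer (i, k);
         element N[j][k] is sent to every reducer (i, k).
    Reduce: reducer (i, k) joins the two lists on j and sums products.
    Assumes M and N are n x n lists of lists.
    """
    n = len(M)
    buckets = defaultdict(list)
    # Map phase: replicate each input element to the n reducers needing it.
    for i in range(n):
        for j in range(n):
            for k in range(n):
                buckets[(i, k)].append(('M', j, M[i][j]))
                buckets[(i, k)].append(('N', j, N[j][k]))
    # Reduce phase: reducer (i, k) computes sum over j of M[i][j] * N[j][k].
    P = [[0] * n for _ in range(n)]
    for (i, k), values in buckets.items():
        m_vals = {j: v for tag, j, v in values if tag == 'M'}
        n_vals = {j: v for tag, j, v in values if tag == 'N'}
        P[i][k] = sum(m_vals[j] * n_vals[j] for j in m_vals)
    return P
```

Note the map loop is written so that the N replication indexes the right entries; in a real cluster the buckets would be the shuffle phase, not an in-memory dict.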
2.8 References for Chapter 2
3 Finding Similar Items
A fundamental data-mining problem is to examine data for “similar” items. We shall take up applications in Section 3.1, but an example would be looking at a collection of Web pages and finding near-duplicate pages. These pages could be plagiarisms, for example, or they could be mirrors that have almost the same content but differ in information about the host and about other mirrors.
We begin by phrasing the problem of similarity as one of finding sets with a relatively large intersection. We show how the problem of finding textually similar documents can be turned into such a set problem by the technique known as “shingling.” Then, we introduce a technique called “minhashing,” which compresses large sets in such a way that we can still deduce the similarity of the underlying sets from their compressed versions. Other techniques that work when the required degree of similarity is very high are covered in Section 3.9.
Another important problem that arises when we search for similar items of any kind is that there may be far too many pairs of items to test each pair for their degree of similarity, even if computing the similarity of any one pair can be made very easy. That concern motivates a technique called “locality-sensitive hashing,” for focusing our search on pairs that are most likely to be similar.
Finally, we explore notions of “similarity” that are not expressible as intersection of sets. This study leads us to consider the theory of distance measures in arbitrary spaces. It also motivates a general framework for localitysensitive hashing that applies for other definitions of “similarity.”
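As a concrete preview of the set-based view of similarity described above, k-shingling and Jaccard similarity (Sections 3.1–3.2) can be sketched in a few lines of Python; the two sample "documents" are made-up strings:

```python
def shingles(text, k=4):
    """The set of all k-character shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# Two near-duplicate documents differ only in the last word.
d1 = shingles("the quick brown fox")
d2 = shingles("the quick brown cat")
sim = jaccard(d1, d2)  # most 4-shingles are shared, so sim is near 1
```

In practice the shingles would be hashed to shorter bit strings (Section 3.2.3) rather than stored as raw substrings.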
3.1 Applications of Near-Neighbor Search
3.1.1 Jaccard Similarity of Sets
3.1.2 Similarity of Documents
3.1.3 Collaborative Filtering as a Similar-Sets Problem
3.1.4 Exercises for Section 3.1
3.2 Shingling of Documents
3.2.1 k-Shingles
3.2.2 Choosing the Shingle Size
3.2.3 Hashing Shingles
3.2.4 Shingles Built from Words
3.2.5 Exercises for Section 3.2
3.3 Similarity-Preserving Summaries of Sets
3.3.1 Matrix Representation of Sets
3.3.2 Minhashing
3.3.3 Minhashing and Jaccard Similarity
3.3.4 Minhash Signatures
3.3.5 Computing Minhash Signatures
3.3.6 Exercises for Section 3.3
3.4 Locality-Sensitive Hashing for Documents
3.4.1 LSH for Minhash Signatures
3.4.2 Analysis of the Banding Technique
3.4.3 Combining the Techniques
3.4.4 Exercises for Section 3.4
3.5 Distance Measures
3.5.1 Definition of a Distance Measure
3.5.2 Euclidean Distances
3.5.3 Jaccard Distance
3.5.4 Cosine Distance
3.5.5 Edit Distance
3.5.6 Hamming Distance
3.5.7 Exercises for Section 3.5
3.6 The Theory of Locality-Sensitive Functions
3.6.1 Locality-Sensitive Functions
3.6.2 Locality-Sensitive Families for Jaccard Distance
3.6.3 Amplifying a Locality-Sensitive Family
3.6.4 Exercises for Section 3.6
3.7 LSH Families for Other Distance Measures
3.7.1 LSH Families for Hamming Distance
3.7.2 Random Hyperplanes and the Cosine Distance
3.7.3 Sketches
3.7.4 LSH Families for Euclidean Distance
3.7.5 More LSH Families for Euclidean Spaces
3.7.6 Exercises for Section 3.7
3.8 Applications of Locality-Sensitive Hashing
3.8.1 Entity Resolution
3.8.2 An Entity-Resolution Example
3.8.3 Validating Record Matches
3.8.4 Matching Fingerprints
3.8.5 A LSH Family for Fingerprint Matching
3.8.6 Similar News Articles
3.8.7 Exercises for Section 3.8
3.9 Methods for High Degrees of Similarity
3.9.1 Finding Identical Items
3.9.2 Representing Sets as Strings
3.9.3 Length-Based Filtering
3.9.4 Prefix Indexing
3.9.5 Using Position Information
3.9.6 Using Position and Length in Indexes
3.9.7 Exercises for Section 3.9
3.10 Summary of Chapter 3
 Jaccard Similarity: The Jaccard similarity of sets is the ratio of the size of the intersection of the sets to the size of the union. This measure of similarity is suitable for many applications, including textual similarity of documents and similarity of buying habits of customers.
 Shingling: A k-shingle is any k characters that appear consecutively in a document. If we represent a document by its set of k-shingles, then the Jaccard similarity of the shingle sets measures the textual similarity of documents. Sometimes, it is useful to hash shingles to bit strings of shorter length, and use sets of hash values to represent documents.
 Minhashing: A minhash function on sets is based on a permutation of the universal set. Given any such permutation, the minhash value for a set is that element of the set that appears first in the permuted order.
 Minhash Signatures: We may represent sets by picking some list of permutations and computing for each set its minhash signature, which is the sequence of minhash values obtained by applying each permutation on the list to that set. Given two sets, the expected fraction of the permutations that will yield the same minhash value is exactly the Jaccard similarity of the sets.
 Efficient Minhashing: Since it is not really possible to generate random permutations, it is normal to simulate a permutation by picking a random hash function and taking the minhash value for a set to be the least hash value of any of the set’s members.
 Locality-Sensitive Hashing for Signatures: This technique allows us to avoid computing the similarity of every pair of sets or their minhash signatures. If we are given signatures for the sets, we may divide them into bands, and only measure the similarity of a pair of sets if they are identical in at least one band. By choosing the size of bands appropriately, we can eliminate from consideration most of the pairs that do not meet our threshold of similarity.
 Distance Measures: A distance measure is a function on pairs of points in a space that satisfies certain axioms. The distance between two points is 0 if the points are the same, but greater than 0 if the points are different. The distance is symmetric; it does not matter in which order we consider the two points. A distance measure must satisfy the triangle inequality: the distance between two points is never more than the sum of the distances between those points and some third point.
 Euclidean Distance: The most common notion of distance is the Euclidean distance in an n-dimensional space. This distance, sometimes called the L2-norm, is the square root of the sum of the squares of the differences between the points in each dimension. Another distance suitable for Euclidean spaces, called Manhattan distance or the L1-norm, is the sum of the magnitudes of the differences between the points in each dimension.
 Jaccard Distance: One minus the Jaccard similarity is a distance measure, called the Jaccard distance.
 Cosine Distance: The angle between vectors in a vector space is the cosine distance measure. We can compute the cosine of that angle by taking the dot product of the vectors and dividing by the lengths of the vectors.
 Edit Distance: This distance measure applies to a space of strings, and is the number of insertions and/or deletions needed to convert one string into the other. The edit distance can also be computed as the sum of the lengths of the strings minus twice the length of the longest common subsequence of the strings.
 Hamming Distance: This distance measure applies to a space of vectors. The Hamming distance between two vectors is the number of positions in which the vectors differ.
 Generalized Locality-Sensitive Hashing: We may start with any collection of functions, such as the minhash functions, that can render a decision as to whether or not a pair of items should be candidates for similarity checking. The only constraint on these functions is that they provide a lower bound on the probability of saying “yes” if the distance (according to some distance measure) is below a given limit, and an upper bound on the probability of saying “yes” if the distance is above another given limit. We can then increase the probability of saying “yes” for nearby items and at the same time decrease the probability of saying “yes” for distant items to as great an extent as we wish, by applying an AND construction and an OR construction.
 Random Hyperplanes and LSH for Cosine Distance: We can get a set of basis functions to start a generalized LSH for the cosine distance measure by identifying each function with a list of randomly chosen vectors. We apply a function to a given vector v by taking the dot product of v with each vector on the list. The result is a sketch consisting of the signs (+1 or −1) of the dot products. The fraction of positions in which the sketches of two vectors disagree, multiplied by 180, is an estimate of the angle between the two vectors.
 LSH For Euclidean Distance: A set of basis functions to start LSH for Euclidean distance can be obtained by choosing random lines and projecting points onto those lines. Each line is broken into fixed-length intervals, and the function answers “yes” to a pair of points that fall into the same interval.
 High-Similarity Detection by String Comparison: An alternative approach to finding similar items, when the threshold of Jaccard similarity is close to 1, avoids using minhashing and LSH. Rather, the universal set is ordered, and sets are represented by strings, consisting of their elements in that order. The simplest way to avoid comparing all pairs of sets or their strings is to note that highly similar sets will have strings of approximately the same length. If we sort the strings, we can compare each string with only a small number of the immediately following strings.
 Character Indexes: If we represent sets by strings, and the similarity threshold is close to 1, we can index all strings by their first few characters. The length of the prefix whose characters must be indexed is approximately the length of the string times the maximum Jaccard distance (1 minus the minimum Jaccard similarity).
 Position Indexes: We can index strings not only on the characters in their prefixes, but on the position of that character within the prefix. We reduce the number of pairs of strings that must be compared, because if two strings share a character that is not in the first position in both strings, then we know that either there are some preceding characters that are in the union but not the intersection, or there is an earlier symbol that appears in both strings.
 Suffix Indexes: We can also index strings based not only on the characters in their prefixes and the positions of those characters, but on the length of the character’s suffix – the number of positions that follow it in the string. This structure further reduces the number of pairs that must be compared, because a common symbol with different suffix lengths implies additional characters that must be in the union but not in the intersection.
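The minhashing bullets above can be made concrete with a short sketch. This is an illustrative Python implementation, not the book's code: random permutations are simulated, as the Efficient Minhashing bullet suggests, by a family of hash functions of the form (a·crc32(x) + b) mod p, and the fraction of agreeing signature positions estimates the Jaccard similarity. Set contents, the prime p, and the number of hash functions are made-up choices.

```python
import random
import zlib

P = 2_147_483_647  # a large prime; an illustrative choice of modulus

def h_factory(a, b):
    """A simple hash of the form x -> (a * crc32(x) + b) mod P,
    standing in for a random permutation of the universal set."""
    return lambda x: (a * zlib.crc32(str(x).encode()) + b) % P

random.seed(0)
hash_funcs = [h_factory(random.randrange(1, P), random.randrange(P))
              for _ in range(200)]

def minhash_signature(s, funcs):
    """For each hash function, the least hash value over the set's members."""
    return [min(h(x) for x in s) for h in funcs]

def estimated_jaccard(sig1, sig2):
    """Fraction of positions where two signatures agree: the estimate
    of the Jaccard similarity of the underlying sets."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

A = {"apple", "banana", "cherry", "date"}
B = {"apple", "banana", "cherry", "fig"}
est = estimated_jaccard(minhash_signature(A, hash_funcs),
                        minhash_signature(B, hash_funcs))
# True Jaccard similarity is |A ∩ B| / |A ∪ B| = 3/5; est should be close.
```

With 200 hash functions the standard error of the estimate is about 0.035, which is why LSH (the banding bullet above) works on signatures of modest length.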
3.11 References for Chapter 3
4 Mining Data Streams
Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make another assumption: data arrives in a stream or streams, and if it is not processed immediately or stored, then it is lost forever. Moreover, we shall assume that the data arrives so rapidly that it is not feasible to store it all in active storage (i.e., in a conventional database), and then interact with it at the time of our choosing.
The algorithms for processing streams each involve summarization of the stream in some way. We shall start by considering how to make a useful sample of a stream and how to filter a stream to eliminate most of the “undesirable” elements. We then show how to estimate the number of different elements in a stream using much less storage than would be required if we listed all the elements we have seen.
Another approach to summarizing a stream is to look at only a fixedlength “window” consisting of the last n elements for some (typically large) n. We then query the window as if it were a relation in a database. If there are many streams and/or n is large, we may not be able to store the entire window for every stream, so we need to summarize even the windows. We address the fundamental problem of maintaining an approximate count on the number of 1’s in the window of a bit stream, while using much less space than would be needed to store the entire window itself. This technique generalizes to approximating various kinds of sums.
4.1 The Stream Data Model
4.1.1 A Data-Stream-Management System
4.1.2 Examples of Stream Sources
4.1.3 Stream Queries
4.1.4 Issues in Stream Processing
4.2 Sampling Data in a Stream
4.2.1 A Motivating Example
4.2.2 Obtaining a Representative Sample
4.2.3 The General Sampling Problem
4.2.4 Varying the Sample Size
4.2.5 Exercises for Section 4.2
4.3 Filtering Streams
4.3.1 A Motivating Example
4.3.2 The Bloom Filter
4.3.3 Analysis of Bloom Filtering
4.3.4 Exercises for Section 4.3
4.4 Counting Distinct Elements in a Stream
4.4.1 The Count-Distinct Problem
4.4.2 The Flajolet-Martin Algorithm
4.4.3 Combining Estimates
4.4.4 Space Requirements
4.4.5 Exercises for Section 4.4
4.5 Estimating Moments
4.5.1 Definition of Moments
4.5.2 The Alon-Matias-Szegedy Algorithm for Second Moments
4.5.3 Why the Alon-Matias-Szegedy Algorithm Works
4.5.4 HigherOrder Moments
4.5.5 Dealing With Infinite Streams
4.5.6 Exercises for Section 4.5
4.6 Counting Ones in a Window
4.6.1 The Cost of Exact Counts
4.6.2 The Datar-Gionis-Indyk-Motwani Algorithm
4.6.3 Storage Requirements for the DGIM Algorithm
4.6.4 Query Answering in the DGIM Algorithm
4.6.5 Maintaining the DGIM Conditions
4.6.6 Reducing the Error
4.6.7 Extensions to the Counting of Ones
4.6.8 Exercises for Section 4.6
4.7 Decaying Windows
4.7.1 The Problem of Most-Common Elements
4.7.2 Definition of the Decaying Window
4.7.3 Finding the Most Popular Elements
4.8 Summary of Chapter 4
 The Stream Data Model: This model assumes data arrives at a processing engine at a rate that makes it infeasible to store everything in active storage. One strategy for dealing with streams is to maintain summaries of the streams, sufficient to answer the expected queries about the data. A second approach is to maintain a sliding window of the most recently arrived data.
 Sampling of Streams: To create a sample of a stream that is usable for a class of queries, we identify a set of key attributes for the stream. By hashing the key of any arriving stream element, we can use the hash value to decide consistently whether all or none of the elements with that key will become part of the sample.
 Bloom Filters: This technique allows us to filter streams so elements that belong to a particular set are allowed through, while most nonmembers are deleted. We use a large bit array, and several hash functions. Members of the selected set are hashed to buckets, which are bits in the array, and those bits are set to 1. To test a stream element for membership, we hash the element to a set of bits using each of the hash functions, and only accept the element if all these bits are 1.
 Counting Distinct Elements: To estimate the number of different elements appearing in a stream, we can hash elements to integers, interpreted as binary numbers. Two raised to the power that is the length of the longest tail of 0’s seen in the hash value of any stream element is an estimate of the number of different elements. By using many hash functions and combining these estimates, first by taking averages within groups, and then taking the median of the averages, we get a reliable estimate.
 Moments of Streams: The kth moment of a stream is the sum of the kth powers of the counts of each element that appears at least once in the stream. The 0th moment is the number of distinct elements, and the 1st moment is the length of the stream.
 Estimating Second Moments: A good estimate for the second moment, or surprise number, is obtained by choosing a random position in the stream, taking twice the number of times this element appears in the stream from that position onward, subtracting 1, and multiplying by the length of the stream. Many random variables of this type can be combined like the estimates for counting the number of distinct elements, to produce a reliable estimate of the second moment.
 Estimating Higher Moments: The technique for second moments works for kth moments as well, as long as we replace the formula 2x − 1 (where x is the number of times the element appears at or after the selected position) by x^k − (x − 1)^k.
 Estimating the Number of 1’s in a Window: We can estimate the number of 1’s in a window of 0’s and 1’s by grouping the 1’s into buckets. Each bucket has a number of 1’s that is a power of 2; there are one or two buckets of each size, and sizes never decrease as we go back in time. If we record only the position and size of the buckets, we can represent the contents of a window of size N with O(log^2 N) space.
 Answering Queries About Numbers of 1’s: If we want to know the approximate numbers of 1’s in the most recent k elements of a binary stream, we find the earliest bucket B that is at least partially within the last k positions of the window and estimate the number of 1’s to be the sum of the sizes of each of the more recent buckets plus half the size of B. This estimate can never be off by more than 50% of the true count of 1’s.
 Closer Approximations to the Number of 1’s: By changing the rule for how many buckets of a given size can exist in the representation of a binary window, so that either r or r − 1 buckets of a given size may exist, we can assure that the approximation to the true number of 1’s is never off by more than 1/r.
 Exponentially Decaying Windows: Rather than fixing a window size, we can imagine that the window consists of all the elements that ever arrived in the stream, but with the element that arrived t time units ago weighted by e^(−ct) for some time constant c. Doing so allows us to maintain certain summaries of an exponentially decaying window easily. For instance, the weighted sum of elements can be recomputed, when a new element arrives, by multiplying the old sum by 1 − c and then adding the new element.
 Maintaining Frequent Elements in an Exponentially Decaying Window: We can imagine that each item is represented by a binary stream, where 0 means the item was not the element arriving at a given time, and 1 means that it was. We can find the items whose weighted binary stream sums to at least 1/2. When a new element arrives, multiply all recorded sums by 1 minus the time constant, add 1 to the count of the item that just arrived, and delete from the record any item whose sum has fallen below 1/2.
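Of the stream summaries above, the Bloom filter is the easiest to see in code. The sketch below is illustrative Python, not the book's implementation: the bit array is modeled as a bytearray for clarity, and the several hash functions are derived from SHA-256 with different salts (an arbitrary but workable choice). The array size and number of hash functions are made-up parameters.

```python
import hashlib

class BloomFilter:
    """Bloom filter (Section 4.3): a bit array plus several hash functions.
    Members always pass the test; non-members pass only with small probability."""

    def __init__(self, n_bits=1024, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        # Set to 1 every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # Accept only if all of the item's bits are 1.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for member in ("alice", "bob", "carol"):
    bf.add(member)
```

With 3 members, 1024 bits, and 4 hash functions, the false-positive rate from the analysis in Section 4.3.3 is roughly (1 − e^(−12/1024))^4, i.e., far below 1%.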
4.9 References for Chapter 4
5 Link Analysis
One of the biggest changes in our lives in the decade following the turn of the century was the availability of efficient and accurate Web search, through search engines such as Google. While Google was not the first search engine, it was the first able to defeat the spammers who had made search almost useless. Moreover, the innovation provided by Google was a nontrivial technological advance, called “PageRank.” We shall begin the chapter by explaining what PageRank is and how it is computed efficiently.
Yet the war between those who want to make the Web useful and those who would exploit it for their own purposes is never over. When PageRank was established as an essential technique for a search engine, spammers invented ways to manipulate the PageRank of a Web page, often called link spam. That development led to the response of TrustRank and other techniques for preventing spammers from attacking PageRank. We shall discuss TrustRank and other approaches to detecting link spam.
Finally, this chapter also covers some variations on PageRank. These techniques include topic-sensitive PageRank (which can also be adapted for combating link spam) and the HITS, or “hubs and authorities,” approach to evaluating pages on the Web.
5.1 PageRank
5.1.1 Early Search Engines and Term Spam
5.1.2 Definition of PageRank
5.1.3 Structure of the Web
5.1.4 Avoiding Dead Ends
5.1.5 Spider Traps and Taxation
5.1.6 Using PageRank in a Search Engine
5.1.7 Exercises for Section 5.1
5.2 Efficient Computation of PageRank
5.2.1 Representing Transition Matrices
5.2.2 PageRank Iteration Using MapReduce
5.2.3 Use of Combiners to Consolidate the Result Vector
5.2.4 Representing Blocks of the Transition Matrix
5.2.5 Other Efficient Approaches to PageRank Iteration
5.2.6 Exercises for Section 5.2
5.3 Topic-Sensitive PageRank
5.3.1 Motivation for Topic-Sensitive PageRank
5.3.2 Biased Random Walks
5.3.3 Using Topic-Sensitive PageRank
5.3.4 Inferring Topics from Words
5.3.5 Exercises for Section 5.3
5.4 Link Spam
5.4.1 Architecture of a Spam Farm
5.4.2 Analysis of a Spam Farm
5.4.3 Combating Link Spam
5.4.4 TrustRank
5.4.5 Spam Mass
5.4.6 Exercises for Section 5.4
5.5 Hubs and Authorities
5.5.1 The Intuition Behind HITS
5.5.2 Formalizing Hubbiness and Authority
5.5.3 Exercises for Section 5.5
5.6 Summary of Chapter 5
 Term Spam: Early search engines were unable to deliver relevant results because they were vulnerable to term spam – the introduction into Web pages of words that misrepresented what the page was about.
 The Google Solution to Term Spam: Google was able to counteract term spam by two techniques. First was the PageRank algorithm for determining the relative importance of pages on the Web. The second was a strategy of believing what other pages said about a given page, in or near their links to that page, rather than believing only what the page said about itself.
 PageRank: PageRank is an algorithm that assigns a real number, called its PageRank, to each page on the Web. The PageRank of a page is a measure of how important the page is, or how likely it is to be a good response to a search query. In its simplest form, PageRank is a solution to the recursive equation “a page is important if important pages link to it.”
 Transition Matrix of the Web: We represent links in the Web by a matrix whose ith row and ith column represent the ith page of the Web. If there are one or more links from page j to page i, then the entry in row i and column j is 1/k, where k is the number of pages to which page j links. Other entries of the transition matrix are 0.
 Computing PageRank on Strongly Connected Web Graphs: For strongly connected Web graphs (those where any node can reach any other node), PageRank is the principal eigenvector of the transition matrix. We can compute PageRank by starting with any nonzero vector and repeatedly multiplying the current vector by the transition matrix, to get a better estimate. After about 50 iterations, the estimate will be very close to the limit, which is the true PageRank.
 The Random Surfer Model: Calculation of PageRank can be thought of as simulating the behavior of many random surfers, who each start at a random page and at any step move, at random, to one of the pages to which their current page links. The limiting probability of a surfer being at a given page is the PageRank of that page. The intuition is that people tend to create links to the pages they think are useful, so random surfers will tend to be at a useful page.
 Dead Ends: A dead end is a Web page with no links out. The presence of dead ends will cause the PageRank of some or all of the pages to go to 0 in the iterative computation, including pages that are not dead ends. We can eliminate all dead ends before undertaking a PageRank calculation by recursively dropping nodes with no arcs out. Note that dropping one node can cause another, which linked only to it, to become a dead end, so the process must be recursive.
 Spider Traps: A spider trap is a set of nodes that, while they may link to each other, have no links out to other nodes. In an iterative calculation of PageRank, the presence of spider traps causes all the PageRank to be captured within that set of nodes.
 Taxation Schemes: To counter the effect of spider traps (and of dead ends, if we do not eliminate them), PageRank is normally computed in a way that modifies the simple iterative multiplication by the transition matrix. A parameter β is chosen, typically around 0.85. Given an estimate of the PageRank, the next estimate is computed by multiplying the estimate by β times the transition matrix, and then adding (1−β)/n to the estimate for each page, where n is the total number of pages.
 Taxation and Random Surfers: The calculation of PageRank using taxation parameter β can be thought of as giving each random surfer a probability 1 − β of leaving the Web, and introducing an equivalent number of surfers randomly throughout the Web.
 Efficient Representation of Transition Matrices: Since a transition matrix is very sparse (almost all entries are 0), it saves both time and space to represent it by listing its nonzero entries. However, in addition to being sparse, the nonzero entries have a special property: they are all the same in any given column; the value of each nonzero entry is the inverse of the number of nonzero entries in that column. Thus, the preferred representation is columnbycolumn, where the representation of a column is the number of nonzero entries, followed by a list of the rows where those entries occur.
 Very Large-Scale Matrix–Vector Multiplication: For Web-sized graphs, it may not be feasible to store the entire PageRank estimate vector in the main memory of one machine. Thus, we can break the vector into k segments and break the transition matrix into k^2 squares, called blocks, assigning each square to one machine. The vector segments are each sent to k machines, so there is a small additional cost in replicating the vector.
 Representing Blocks of a Transition Matrix: When we divide a transition matrix into square blocks, the columns are divided into k segments. To represent a segment of a column, nothing is needed if there are no nonzero entries in that segment. However, if there are one or more nonzero entries, then we need to represent the segment of the column by the total number of nonzero entries in the column (so we can tell what value the nonzero entries have) followed by a list of the rows with nonzero entries.
 Topic-Sensitive PageRank: If we know the queryer is interested in a certain topic, then it makes sense to bias the PageRank in favor of pages on that topic. To compute this form of PageRank, we identify a set of pages known to be on that topic, and we use it as a “teleport set.” The PageRank calculation is modified so that only the pages in the teleport set are given a share of the tax, rather than distributing the tax among all pages on the Web.
 Creating Teleport Sets: For topic-sensitive PageRank to work, we need to identify pages that are very likely to be about a given topic. One approach is to start with the pages that the open directory (DMOZ) identifies with that topic. Another is to identify words known to be associated with the topic, and select for the teleport set those pages that have an unusually high number of occurrences of such words.
 Link Spam: To fool the PageRank algorithm, unscrupulous actors have created spam farms. These are collections of pages whose purpose is to concentrate high PageRank on a particular target page.
 Structure of a Spam Farm: Typically, a spam farm consists of a target page and very many supporting pages. The target page links to all the supporting pages, and the supporting pages link only to the target page. In addition, it is essential that some links from outside the spam farm be created. For example, the spammer might introduce links to their target page by writing comments in other people’s blogs or discussion groups.
 TrustRank: One way to ameliorate the effect of link spam is to compute a topic-sensitive PageRank called TrustRank, where the teleport set is a collection of trusted pages. For example, the home pages of universities could serve as the trusted set. This technique avoids sharing the tax in the PageRank calculation with the large numbers of supporting pages in spam farms and thus preferentially reduces their PageRank.
 Spam Mass: To identify spam farms, we can compute both the conventional PageRank and the TrustRank for all pages. Those pages that have much lower TrustRank than PageRank are likely to be part of a spam farm.
 Hubs and Authorities: While PageRank gives a onedimensional view of the importance of pages, an algorithm called HITS tries to measure two different aspects of importance. Authorities are those pages that contain valuable information. Hubs are pages that, while they do not themselves contain the information, link to places where the information can be found.
 Recursive Formulation of the HITS Algorithm: Calculation of the hubs and authorities scores for pages depends on solving the recursive equations: “a hub links to many authorities, and an authority is linked to by many hubs.” The solution to these equations is essentially an iterated matrix–vector multiplication, just like PageRank’s. However, the existence of dead ends or spider traps does not affect the solution to the HITS equations in the way they do for PageRank, so no taxation scheme is necessary.
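The taxation scheme in the bullets above can be sketched directly from its formula: each step computes v' = βMv + (1 − β)/n. The following is an illustrative Python sketch on a made-up three-page web, not a Web-scale implementation; the sparse matrix M is represented implicitly by an adjacency dict, as the efficient-representation bullet recommends.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Iterative PageRank with taxation parameter beta.

    links maps each page to the list of pages it links to.
    Each step computes v' = beta * M v + (1 - beta)/n, where M is the
    column-stochastic transition matrix of the graph.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts with its share of the tax, (1 - beta)/n.
        nxt = {p: (1 - beta) / n for p in pages}
        for p, outs in links.items():
            if outs:  # a dead end would simply leak its rank in this sketch
                share = beta * rank[p] / len(outs)
                for q in outs:
                    nxt[q] += share
        rank = nxt
    return rank

# A tiny made-up web: a and b link to each other and to c; c links back to a.
web = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}
pr = pagerank(web)
```

Since this graph has no dead ends, the ranks sum to 1 at every iteration; page a, which receives links from both b and c, ends up with the highest PageRank.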
5.7 References for Chapter 5
6 Frequent Itemsets
We turn in this chapter to one of the major families of techniques for characterizing data: the discovery of frequent itemsets. This problem is often viewed as the discovery of “association rules,” although the latter is a more complex characterization of data, whose discovery depends fundamentally on the discovery of frequent itemsets.
To begin, we introduce the “market-basket” model of data, which is essentially a many-many relationship between two kinds of elements, called “items” and “baskets,” but with some assumptions about the shape of the data. The frequent-itemsets problem is that of finding sets of items that appear in (are related to) many of the same baskets.
The problem of finding frequent itemsets differs from the similarity search discussed in Chapter 3. Here we are interested in the absolute number of baskets that contain a particular set of items. In Chapter 3 we wanted items that have a large fraction of their baskets in common, even if the absolute number of baskets is small.
The difference leads to a new class of algorithms for finding frequent itemsets. We begin with the A-Priori Algorithm, which works by eliminating most large sets as candidates by looking first at smaller sets and recognizing that a large set cannot be frequent unless all its subsets are. We then consider various improvements to the basic A-Priori idea, concentrating on very large data sets that stress the available main memory.
Next, we consider approximate algorithms that work faster but are not guaranteed to find all frequent itemsets. Also in this class of algorithms are those that exploit parallelism, including the parallelism we can obtain through a MapReduce formulation. Finally, we discuss briefly how to find frequent itemsets in a data stream.
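The monotonicity idea behind A-Priori, that a pair can be frequent only if both of its items are, translates directly into a two-pass sketch. This is illustrative Python restricted to frequent pairs (the common case discussed in Section 6.2); the baskets and the support threshold are made up:

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, support):
    """A-Priori for frequent pairs: count pairs only when both items
    survived the first pass (monotonicity of itemsets)."""
    # Pass 1: count individual items; keep those meeting the threshold.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    # Pass 2: count only candidate pairs of two frequent items.
    pair_counts = Counter()
    for basket in baskets:
        candidates = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(candidates, 2))
    return {p: c for p, c in pair_counts.items() if c >= support}

baskets = [
    {"milk", "bread", "beer"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "bread", "diapers"},
]
frequent = apriori_pairs(baskets, support=3)
```

Here "beer" and "diapers" fall below the support threshold of 3 on the first pass, so the second pass never counts any pair containing them, which is exactly the memory saving the chapter is about.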
6.1 The Market-Basket Model
6.1.1 Definition of Frequent Itemsets
6.1.2 Applications of Frequent Itemsets
6.1.3 Association Rules
6.1.4 Finding Association Rules with High Confidence
6.1.5 Exercises for Section 6.1
6.2 Market Baskets and the A-Priori Algorithm
6.2.1 Representation of Market-Basket Data
6.2.2 Use of Main Memory for Itemset Counting
6.2.3 Monotonicity of Itemsets
6.2.4 Tyranny of Counting Pairs
6.2.5 The A-Priori Algorithm
6.2.6 A-Priori for All Frequent Itemsets
6.2.7 Exercises for Section 6.2
6.3 Handling Larger Datasets in Main Memory
6.3.1 The Algorithm of Park, Chen, and Yu
6.3.2 The Multistage Algorithm
6.3.3 The Multihash Algorithm
6.3.4 Exercises for Section 6.3
6.4 Limited-Pass Algorithms
6.4.1 The Simple, Randomized Algorithm
6.4.2 Avoiding Errors in Sampling Algorithms
6.4.3 The Algorithm of Savasere, Omiecinski, and Navathe
6.4.4 The SON Algorithm and MapReduce
6.4.5 Toivonen’s Algorithm
6.4.6 Why Toivonen’s Algorithm Works
6.4.7 Exercises for Section 6.4
6.5 Counting Frequent Items in a Stream
6.5.1 Sampling Methods for Streams
6.5.2 Frequent Itemsets in Decaying Windows
6.5.3 Hybrid Methods
6.5.4 Exercises for Section 6.5
6.6 Summary of Chapter 6
 Market-Basket Data: This model of data assumes there are two kinds of entities: items and baskets. There is a many–many relationship between items and baskets. Typically, baskets are related to small sets of items, while items may be related to many baskets.
 Frequent Itemsets: The support for a set of items is the number of baskets containing all those items. Itemsets with support that is at least some threshold are called frequent itemsets.
 Association Rules: These are implications that if a basket contains a certain set of items I, then it is likely to contain another particular item j as well. The probability that j is also in a basket containing I is called the confidence of the rule. The interest of the rule is the amount by which the confidence deviates from the fraction of all baskets that contain j.
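To make the definitions of support, confidence, and interest concrete, here is a small sketch in Python; the toy baskets, and the helpers `support`, `confidence`, and `interest`, are illustrative names not taken from the text.

```python
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "jam"},
]

def support(itemset):
    """Number of baskets containing every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(I, j):
    """Fraction of baskets containing I that also contain item j."""
    return support(I | {j}) / support(I)

def interest(I, j):
    """Confidence minus the fraction of all baskets containing j."""
    return confidence(I, j) - support({j}) / len(baskets)

I = {"milk", "bread"}
print(confidence(I, "butter"))   # 2 of the 3 milk-and-bread baskets
print(interest(I, "butter"))     # 2/3 - 3/4, slightly negative
```

Here the rule {milk, bread} → butter has confidence 2/3 but interest −1/12: butter is so common overall that knowing a basket holds milk and bread barely changes its likelihood.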
 The Pair-Counting Bottleneck: To find frequent itemsets, we need to examine all baskets and count the number of occurrences of sets of a certain size. For typical data, with a goal of producing a small number of itemsets that are the most frequent of all, the part that often takes the most main memory is the counting of pairs of items. Thus, methods for finding frequent itemsets typically concentrate on how to minimize the main memory needed to count pairs.
 Triangular Matrices: While one could use a two-dimensional array to count pairs, doing so wastes half the space, because there is no need to count pair {i, j} in both the ij and ji array elements. By arranging the pairs (i, j) for which i < j in lexicographic order, we can store only the needed counts in a one-dimensional array with no wasted space, and yet be able to access the count for any pair efficiently.
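The lexicographic layout can be sketched as an index formula (0-indexed items; the function name `pair_index` is ours, not from the text):

```python
def pair_index(i, j, n):
    """Slot for the count of pair {i, j} in a one-dimensional array,
    with 0 <= i < j < n and pairs laid out in lexicographic order."""
    assert 0 <= i < j < n
    return i * (2 * n - i - 1) // 2 + (j - i - 1)

n = 4
counts = [0] * (n * (n - 1) // 2)   # exactly one slot per unordered pair
print([pair_index(i, j, n) for i in range(n) for j in range(i + 1, n)])
# -> [0, 1, 2, 3, 4, 5]
```

The formula skips the i rows of pairs that precede row i, then offsets by j within the row, so the n(n−1)/2 pairs map onto slots 0 through n(n−1)/2 − 1 with no gaps.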
 Storage of Pair Counts as Triples: If fewer than 1/3 of the possible pairs actually occur in baskets, then it is more space-efficient to store counts of pairs as triples (i, j, c), where c is the count of the pair {i, j}, and i < j. An index structure such as a hash table allows us to find the triple for (i, j) efficiently.
 Monotonicity of Frequent Itemsets: An important property of itemsets is that if a set of items is frequent, then so are all its subsets. We exploit this property to eliminate the need to count certain itemsets by using its contrapositive: if an itemset is not frequent, then neither are its supersets.
 The A-Priori Algorithm for Pairs: We can find all frequent pairs by making two passes over the baskets. On the first pass, we count the items themselves, and then determine which items are frequent. On the second pass, we count only those pairs whose two items were both found frequent on the first pass. Monotonicity justifies our ignoring other pairs.
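The two-pass scheme can be sketched as follows; the function `apriori_pairs` and the toy baskets are illustrative, not from the text.

```python
from itertools import combinations
from collections import Counter

def apriori_pairs(baskets, s):
    """Two-pass A-Priori: all pairs appearing in at least s baskets."""
    # Pass 1: count individual items; keep those meeting the threshold.
    item_counts = Counter(item for b in baskets for item in b)
    frequent = {i for i, c in item_counts.items() if c >= s}
    # Pass 2: count only pairs of items that are both frequent.
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(frequent & b), 2):
            pair_counts[pair] += 1
    return {p for p, c in pair_counts.items() if c >= s}

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
print(apriori_pairs(baskets, 2))   # -> {('a', 'b'), ('a', 'c')}
```

Item d occurs only once, so on pass 2 no pair involving d is ever counted, which is the memory saving A-Priori is after.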
 Finding Larger Frequent Itemsets: A-Priori and many other algorithms allow us to find frequent itemsets larger than pairs, if we make one pass over the baskets for each itemset size, up to some limit. To find the frequent itemsets of size k, monotonicity lets us restrict our attention to only those itemsets such that all their subsets of size k − 1 have already been found frequent.
 The PCY Algorithm: This algorithm improves on A-Priori by creating a hash table on the first pass, using all main-memory space that is not needed to count the items. Pairs of items are hashed, and the hash-table buckets are used as integer counts of the number of times a pair has hashed to that bucket. Then, on the second pass, we only have to count pairs of frequent items that hashed to a frequent bucket (one whose count is at least the support threshold) on the first pass.
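A minimal sketch of the PCY idea, using Python's built-in `hash` as a stand-in for the bucket hash function and a tiny bucket count; the names and data are illustrative. Note that a truly frequent pair always survives the bucket filter, since its bucket's count is at least the pair's own count.

```python
from itertools import combinations
from collections import Counter

def pcy_pairs(baskets, s, n_buckets=7):
    """PCY: pass 1 counts items and hashes each pair to a bucket;
    pass 2 counts a pair only if both its items are frequent and its
    bucket was frequent on pass 1."""
    def bucket(pair):
        return hash(pair) % n_buckets      # stand-in hash function
    item_counts, bucket_counts = Counter(), Counter()
    for b in baskets:
        item_counts.update(b)
        for pair in combinations(sorted(b), 2):
            bucket_counts[bucket(pair)] += 1
    freq_items = {i for i, c in item_counts.items() if c >= s}
    freq_buckets = {h for h, c in bucket_counts.items() if c >= s}
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(freq_items & b), 2):
            if bucket(pair) in freq_buckets:   # candidate pairs only
                pair_counts[pair] += 1
    return {p for p, c in pair_counts.items() if c >= s}

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
print(pcy_pairs(baskets, 2))   # the frequent pairs ('a','b') and ('a','c')
```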
 The Multistage Algorithm: We can insert additional passes between the first and second pass of the PCY Algorithm to hash pairs to other, independent hash tables. At each intermediate pass, we only have to hash pairs of frequent items that have hashed to frequent buckets on all previous passes.
 The Multihash Algorithm: We can modify the first pass of the PCY Algorithm to divide available main memory into several hash tables. On the second pass, we only have to count a pair of frequent items if they hashed to frequent buckets in all hash tables.
 Randomized Algorithms: Instead of making passes through all the data, we may choose a random sample of the baskets, small enough that it is possible to store both the sample and the needed counts of itemsets in main memory. The support threshold must be scaled down in proportion. We can then find the frequent itemsets for the sample, and hope that it is a good representation of the data as a whole. While this method uses at most one pass through the whole dataset, it is subject to false positives (itemsets that are frequent in the sample but not the whole) and false negatives (itemsets that are frequent in the whole but not the sample).
 The SON Algorithm: An improvement on the simple randomized algorithm is to divide the entire file of baskets into segments small enough that all frequent itemsets for the segment can be found in main memory. Candidate itemsets are those found frequent for at least one segment. A second pass allows us to count all the candidates and find the exact collection of frequent itemsets. This algorithm is especially appropriate in a MapReduce setting.
 Toivonen’s Algorithm: This algorithm starts by finding frequent itemsets in a sample, but with the threshold lowered so there is little chance of missing an itemset that is frequent in the whole. Next, we examine the entire file of baskets, counting not only the itemsets that are frequent in the sample, but also the negative border – itemsets that have not been found frequent, but all of whose immediate subsets are. If no member of the negative border is found frequent in the whole, then the answer is exact. But if a member of the negative border is found frequent, then the whole process has to repeat with another sample.
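The negative border itself is easy to compute from the sample's frequent itemsets; a sketch (the helper name `negative_border` is ours, and the input is assumed closed under taking subsets):

```python
from itertools import combinations

def negative_border(frequent):
    """Itemsets not frequent themselves, but all of whose immediate
    subsets are frequent. `frequent` is a set of frozensets that is
    closed under taking subsets."""
    items = set().union(*frequent) if frequent else set()
    # Every border set of size >= 2 is a frequent set plus one item;
    # also check singletons of items seen in frequent sets.
    candidates = {fs | {i} for fs in frequent for i in items if i not in fs}
    candidates |= {frozenset({i}) for i in items}
    def subsets_frequent(c):
        # The empty subset of a singleton counts as frequent.
        return all(frozenset(t) in frequent or len(t) == 0
                   for t in combinations(c, len(c) - 1))
    return {c for c in candidates
            if c not in frequent and subsets_frequent(c)}

frequent = {frozenset(s) for s in ({"a"}, {"b"}, {"c"}, {"a", "b"})}
print(negative_border(frequent))   # {a,c} and {b,c}, but not {a,b,c}
```

{a, b, c} is excluded because its immediate subset {a, c} is not frequent; only {a, c} and {b, c} sit on the border.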
 Frequent Itemsets in Streams: If we use a decaying window with constant c, then we can start counting an item whenever we see it in a basket. We start counting an itemset if we see it contained within the current basket, and all its immediate proper subsets already are being counted. As the window is decaying, we multiply all counts by 1 − c and eliminate those that are less than 1/2.
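One stream step of this scheme can be sketched as follows, assuming "already being counted" means counted before the current basket arrives; the function `decay_step` and its arguments are illustrative names.

```python
from itertools import combinations

def decay_step(counts, basket, c=0.01, threshold=0.5):
    """One basket arrival in the decaying-window scheme.
    `counts` maps frozenset -> decayed count."""
    # Multiply every current count by 1 - c; drop counts below 1/2.
    for s in list(counts):
        counts[s] *= 1 - c
        if counts[s] < threshold:
            del counts[s]
    basket = tuple(sorted(basket))
    new = [frozenset({i}) for i in basket]   # singletons always counted
    # A larger itemset is counted only if it already was, or if every
    # immediate proper subset was being counted before this basket.
    for k in range(2, len(basket) + 1):
        for t in combinations(basket, k):
            s = frozenset(t)
            subs_ok = all(frozenset(t[:m] + t[m + 1:]) in counts
                          for m in range(k))
            if s in counts or subs_ok:
                new.append(s)
    for s in new:
        counts[s] = counts.get(s, 0.0) + 1.0

counts = {}
decay_step(counts, {"a", "b"})   # only the singletons start counting
decay_step(counts, {"a", "b"})   # now {a, b} can start as well
```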
6.7 References for Chapter 6
7 Clustering
Clustering is the process of examining a collection of “points,” and grouping the points into “clusters” according to some distance measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another. A suggestion of what clusters might look like was seen in Fig. 1.1. However, there the intent was that there were three clusters around three different road intersections, but two of the clusters blended into one another because they were not sufficiently separated.
Our goal in this chapter is to offer methods for discovering clusters in data. We are particularly interested in situations where the data is very large, and/or where the space is either high-dimensional or not Euclidean at all. We shall therefore discuss several algorithms that assume the data does not fit in main memory. However, we begin with the basics: the two general approaches to clustering and the methods for dealing with clusters in a non-Euclidean space.
7.1 Introduction to Clustering Techniques
7.1.1 Points, Spaces, and Distances
7.1.2 Clustering Strategies
7.1.3 The Curse of Dimensionality
7.1.4 Exercises for Section 7.1
7.2 Hierarchical Clustering
7.2.1 Hierarchical Clustering in a Euclidean Space
7.2.2 Efficiency of Hierarchical Clustering
7.2.3 Alternative Rules for Controlling Hierarchical Clustering
7.2.4 Hierarchical Clustering in Non-Euclidean Spaces
7.2.5 Exercises for Section 7.2
7.3 K-Means Algorithms
7.3.1 K-Means Basics
7.3.2 Initializing Clusters for K-Means
7.3.3 Picking the Right Value of k
7.3.4 The Algorithm of Bradley, Fayyad, and Reina
7.3.5 Processing Data in the BFR Algorithm
7.3.6 Exercises for Section 7.3
7.4 The CURE Algorithm
7.4.1 Initialization in CURE
7.4.2 Completion of the CURE Algorithm
7.4.3 Exercises for Section 7.4
7.5 Clustering in Non-Euclidean Spaces
7.5.1 Representing Clusters in the GRGPF Algorithm
7.5.2 Initializing the Cluster Tree
7.5.3 Adding Points in the GRGPF Algorithm
7.5.4 Splitting and Merging Clusters
7.5.5 Exercises for Section 7.5
7.6 Clustering for Streams and Parallelism
7.6.1 The Stream-Computing Model
7.6.2 A Stream-Clustering Algorithm
7.6.3 Initializing Buckets
7.6.4 Merging Buckets
7.6.5 Answering Queries
7.6.6 Clustering in a Parallel Environment
7.6.7 Exercises for Section 7.6
7.7 Summary of Chapter 7
 Clustering: Clusters are often a useful summary of data that is in the form of points in some space. To cluster points, we need a distance measure on that space. Ideally, points in the same cluster have small distances between them, while points in different clusters have large distances between them.
 Clustering Algorithms: Clustering algorithms generally have one of two forms. Hierarchical clustering algorithms begin with all points in a cluster of their own, and nearby clusters are merged iteratively. Point-assignment clustering algorithms consider points in turn and assign them to the cluster in which they best fit.
 The Curse of Dimensionality: Points in high-dimensional Euclidean spaces, as well as points in non-Euclidean spaces, often behave unintuitively. Two unexpected properties of these spaces are that random points are almost always at about the same distance from one another, and random vectors are almost always orthogonal.
 Centroids and Clustroids: In a Euclidean space, the members of a cluster can be averaged, and this average is called the centroid. In non-Euclidean spaces, there is no guarantee that points have an “average,” so we are forced to use one of the members of the cluster as a representative or typical element of the cluster. That representative is called the clustroid.
 Choosing the Clustroid: There are many ways we can define a typical point of a cluster in a non-Euclidean space. For example, we could choose the point with the smallest sum of distances to the other points, the smallest sum of the squares of those distances, or the smallest maximum distance to any other point in the cluster.
 Radius and Diameter: Whether or not the space is Euclidean, we can define the radius of a cluster to be the maximum distance from the centroid or clustroid to any point in that cluster. We can define the diameter of the cluster to be the maximum distance between any two points in the cluster. Alternative definitions, especially of the radius, are also known, for example, average distance from the centroid to the other points.
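These two definitions can be sketched directly for a small Euclidean cluster (the helper names are ours):

```python
from itertools import combinations
from math import dist

def radius(center, cluster):
    """Maximum distance from the centroid (or clustroid) to any point."""
    return max(dist(center, p) for p in cluster)

def diameter(cluster):
    """Maximum distance between any two points of the cluster."""
    return max(dist(p, q) for p, q in combinations(cluster, 2))

pts = [(0, 0), (4, 0), (0, 3)]
centroid = tuple(sum(c) / len(pts) for c in zip(*pts))
print(radius(centroid, pts), diameter(pts))   # diameter is 5.0 here
```

Note the diameter here (5, between (4, 0) and (0, 3)) is not twice the radius; in general the two measures are related only loosely.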
 Hierarchical Clustering: This family of algorithms has many variations, which differ primarily in two areas. First, we may choose in various ways which two clusters to merge next. Second, we may decide when to stop the merge process in various ways.
 Picking Clusters to Merge: One strategy for deciding on the best pair of clusters to merge in a hierarchical clustering is to pick the clusters with the closest centroids or clustroids. Another approach is to pick the pair of clusters with the closest points, one from each cluster. A third approach is to use the average distance between points from the two clusters.
 Stopping the Merger Process: A hierarchical clustering can proceed until there are a fixed number of clusters left. Alternatively, we could merge until it is impossible to find a pair of clusters whose merger is sufficiently compact, e.g., the merged cluster has a radius or diameter below some threshold. Another approach involves merging as long as the resulting cluster has a sufficiently high “density,” which can be defined in various ways, but is the number of points divided by some measure of the size of the cluster, e.g., the radius.
 K-Means Algorithms: This family of algorithms is of the point-assignment type and assumes a Euclidean space. It is assumed that there are exactly k clusters for some known k. After picking k initial cluster centroids, the points are considered one at a time and assigned to the closest centroid. The centroid of a cluster can migrate during point assignment, and an optional last step is to reassign all the points, while holding the centroids fixed at their final values obtained during the first pass.
 Initializing K-Means Algorithms: One way to find k initial centroids is to pick a random point, and then choose k − 1 additional points, each as far away as possible from the previously chosen points. An alternative is to start with a small sample of points and use a hierarchical clustering to merge them into k clusters.
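The first initialization strategy (often called farthest-first traversal) can be sketched as follows; the function name and toy points are ours, and we take the first point deterministically where in practice it would be random.

```python
from math import dist

def farthest_first(points, k):
    """Pick k initial centroids: a starting point, then repeatedly the
    point whose minimum distance to the points chosen so far is largest."""
    chosen = [points[0]]          # in practice, a random starting point
    while len(chosen) < k:
        chosen.append(max(points,
                          key=lambda p: min(dist(p, c) for c in chosen)))
    return chosen

pts = [(0, 0), (1, 0), (10, 0), (5, 5)]
print(farthest_first(pts, 3))   # -> [(0, 0), (10, 0), (5, 5)]
```

The spread-out picks make it likely that each true cluster contributes one initial centroid, which is the point of the heuristic.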
 Picking K in a K-Means Algorithm: If the number of clusters is unknown, we can use a binary-search technique, trying a k-means clustering with different values of k. We search for the largest value of k for which a decrease below k clusters results in a radically higher average diameter of the clusters. This search can be carried out in a number of clustering operations that is logarithmic in the true value of k.
 The BFR Algorithm: This algorithm is a version of k-means designed to handle data that is too large to fit in main memory. It assumes clusters are normally distributed about the axes.
 Representing Clusters in BFR: Points are read from disk one chunk at a time. Clusters are represented in main memory by the count of the number of points, the vector sum of all the points, and the vector formed by summing the squares of the components of the points in each dimension. Other collections of points, too far from a cluster centroid to be included in a cluster, are represented as “miniclusters” in the same way as the k clusters, while still other points, which are not near any other point, are represented as themselves and called “retained” points.
 Processing Points in BFR: Most of the points in a mainmemory load will be assigned to a nearby cluster and the parameters for that cluster will be adjusted to account for the new points. Unassigned points can be formed into new miniclusters, and these miniclusters can be merged with previously discovered miniclusters or retained points. After the last memory load, the miniclusters and retained points can be merged to their nearest cluster or kept as outliers.
 The CURE Algorithm: This algorithm is of the point-assignment type. It is designed for a Euclidean space, but clusters can have any shape. It handles data that is too large to fit in main memory.
 Representing Clusters in CURE: The algorithm begins by clustering a small sample of points. It then selects representative points for each cluster, by picking points in the cluster that are as far away from each other as possible. The goal is to find representative points on the fringes of the cluster. However, the representative points are then moved a fraction of the way toward the centroid of the cluster, so they lie somewhat in the interior of the cluster.
 Processing Points in CURE: After creating representative points for each cluster, the entire set of points can be read from disk and assigned to a cluster. We assign a given point to the cluster of the representative point that is closest to the given point.
 The GRGPF Algorithm: This algorithm is of the point-assignment type. It handles data that is too big to fit in main memory, and it does not assume a Euclidean space.
 Representing Clusters in GRGPF: A cluster is represented by the count of points in the cluster, the clustroid, a set of points nearest the clustroid, and a set of points furthest from the clustroid. The nearby points allow us to change the clustroid if the cluster evolves, and the distant points allow for merging clusters efficiently in appropriate circumstances. For each of these points, we also record the rowsum, that is, the square root of the sum of the squares of the distances from that point to all the other points of the cluster.
 Tree Organization of Clusters in GRGPF: Cluster representations are organized into a tree structure like a B-tree, where nodes of the tree are typically disk blocks and contain information about many clusters. The leaves hold the representation of as many clusters as possible, while interior nodes hold a sample of the clustroids of the clusters at their descendant leaves. We organize the tree so that the clusters whose representatives are in any subtree are as close as possible.
 Processing Points in GRGPF: After initializing clusters from a sample of points, we insert each point into the cluster with the nearest clustroid. Because of the tree structure, we can start at the root and choose to visit the child with the sample clustroid nearest to the given point. Following this rule down one path in the tree leads us to a leaf, where we insert the point into the cluster with the nearest clustroid on that leaf.
 Clustering Streams: A generalization of the DGIM Algorithm (for counting 1’s in the sliding window of a stream) can be used to cluster points that are part of a slowly evolving stream. The BDMO Algorithm uses buckets similar to those of DGIM, with allowable bucket sizes forming a sequence where each size is twice the previous size.
 Representation of Buckets in BDMO: The size of a bucket is the number of points it represents. The bucket itself holds only a representation of the clusters of these points, not the points themselves. A cluster representation includes a count of the number of points, the centroid or clustroid, and other information that is needed for merging clusters according to some selected strategy.
 Merging Buckets in BDMO: When buckets must be merged, we find the best matching of clusters, one from each of the buckets, and merge them in pairs. If the stream evolves slowly, then we expect consecutive buckets to have almost the same cluster centroids, so this matching makes sense.
 Answering Queries in BDMO: A query is a length of a suffix of the sliding window. We take all the clusters in all the buckets that are at least partially within that suffix and merge them using some strategy. The resulting clusters are the answer to the query.
 Clustering Using MapReduce: We can divide the data into chunks and cluster each chunk in parallel, using a Map task. The clusters from each Map task can be further clustered in a single Reduce task.
7.8 References for Chapter 7
8 Advertising on the Web
One of the big surprises of the 21st century has been the ability of all sorts of interesting Web applications to support themselves through advertising, rather than subscription. While radio and television have managed to use advertising as their primary revenue source, most media – newspapers and magazines, for example – have had to use a hybrid approach, combining revenue from advertising and subscriptions.
By far the most lucrative venue for online advertising has been search, and much of the effectiveness of search advertising comes from the “adwords” model of matching search queries to advertisements. We shall therefore devote much of this chapter to algorithms for optimizing the way this assignment is done. The algorithms used are of an unusual type; they are greedy and they are “online” in a particular technical sense to be discussed. We shall therefore digress to discuss these two algorithmic issues – greediness and online algorithms – in general, before tackling the adwords problem.
A second interesting online advertising problem involves selecting items to advertise at an online store. This problem involves “collaborative filtering,” where we try to find customers with similar behavior in order to suggest they buy things that similar customers have bought. This subject will be treated in Section 9.3.
8.1 Issues in On-Line Advertising
8.1.1 Advertising Opportunities
8.1.2 Direct Placement of Ads
8.1.3 Issues for Display Ads
8.2 On-Line Algorithms
8.2.1 On-Line and Off-Line Algorithms
8.2.2 Greedy Algorithms
8.2.3 The Competitive Ratio
8.2.4 Exercises for Section 8.2
8.3 The Matching Problem
8.3.1 Matches and Perfect Matches
8.3.2 The Greedy Algorithm for Maximal Matching
8.3.3 Competitive Ratio for Greedy Matching
8.3.4 Exercises for Section 8.3
8.4 The Adwords Problem
8.4.1 History of Search Advertising
8.4.2 Definition of the Adwords Problem
8.4.3 The Greedy Approach to the Adwords Problem
8.4.4 The Balance Algorithm
8.4.5 A Lower Bound on Competitive Ratio for Balance
8.4.6 The Balance Algorithm with Many Bidders
8.4.7 The Generalized Balance Algorithm
8.4.8 Final Observations About the Adwords Problem
8.4.9 Exercises for Section 8.4
8.5 Adwords Implementation
8.5.1 Matching Bids and Search Queries
8.5.2 More Complex Matching Problems
8.5.3 A Matching Algorithm for Documents and Bids
8.6 Summary of Chapter 8
 Targeted Advertising: The big advantage that Web-based advertising has over advertising in conventional media such as newspapers is that Web advertising can be selected according to the interests of each individual user. This advantage has enabled many Web services to be supported entirely by advertising revenue.
 On- and Off-Line Algorithms: Conventional algorithms that are allowed to see all their data before producing an answer are called off-line. An on-line algorithm is required to make a response to each element in a stream immediately, with knowledge only of the past, not of the future elements in the stream.
 Greedy Algorithms: Many online algorithms are greedy, in the sense that they select their action at every step by minimizing some objective function.
 Competitive Ratio: We can measure the quality of an online algorithm by taking the minimum, over all possible inputs, of the value of the online algorithm’s result compared with the value of the result of the best possible offline algorithm.
 Bipartite Matching: This problem involves two sets of nodes and a set of edges between members of the two sets. The goal is to find a maximal matching – as large a set of edges as possible that includes no node more than once.
 On-Line Solution to the Matching Problem: One greedy algorithm for finding a match in a bipartite graph (or any graph, for that matter) is to order the edges in some way, and for each edge in turn, add it to the match if neither of its ends is yet part of an edge previously selected for the match. This algorithm can be proved to have a competitive ratio of 1/2; that is, it never fails to match at least half as many nodes as the best offline algorithm matches.
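The greedy rule, and a worst-case edge order that realizes the 1/2 ratio, can be sketched as follows (the graph is a toy example of ours):

```python
def greedy_matching(edges):
    """Take each edge in the given order if neither endpoint is matched."""
    matched, match = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            match.append((u, v))
            matched.update((u, v))
    return match

# Left nodes a, b; right nodes x, y.  Seeing (a, x) first is fatal:
edges = [("a", "x"), ("a", "y"), ("b", "x")]
print(greedy_matching(edges))   # -> [('a', 'x')]
```

Greedy matches only one edge here, while the best off-line matching, {(a, y), (b, x)}, has two; that is exactly the competitive ratio of 1/2.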
 Search Ad Management: A search engine receives bids from advertisers on certain search queries. Some ads are displayed with each search query, and the search engine is paid the amount of the bid only if the queryer clicks on the ad. Each advertiser can give a budget, the total amount they are willing to pay for clicks in a month.
 The Adwords Problem: The data for the adwords problem is a set of bids by advertisers on certain search queries, together with a total budget for each advertiser and information about the historical click-through rate for each ad for each query. Another part of the data is the stream of search queries received by the search engine. The objective is to select on-line a fixed-size set of ads in response to each query that will maximize the revenue to the search engine.
 Simplified Adwords Problem: To see some of the nuances of ad selection, we considered a simplified version in which all bids are either 0 or 1, only one ad is shown with each query, and all advertisers have the same budget. Under this model the obvious greedy algorithm of giving the ad placement to anyone who has bid on the query and has budget remaining can be shown to have a competitive ratio of 1/2.
 The Balance Algorithm: This algorithm improves on the simple greedy algorithm. A query’s ad is given to the advertiser who has bid on the query and has the largest remaining budget. Ties can be broken arbitrarily.
 Competitive Ratio of the Balance Algorithm: For the simplified adwords model, the competitive ratio of the Balance Algorithm is 3/4 for the case of two advertisers and 1 − 1/e, or about 63%, for any number of advertisers.
 The Balance Algorithm for the Generalized Adwords Problem: When bidders can make differing bids, have different budgets, and have different click-through rates for different queries, the Balance Algorithm awards an ad to the advertiser with the highest value of the function Φ = x(1 − e^(−f)). Here, x is the product of the bid and the click-through rate for that advertiser and query, and f is the fraction of the advertiser’s budget that remains unspent.
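A small numeric sketch of the scoring function (the helper name `phi` and the bidder figures are ours): a higher bid does not win when the bidder's budget is nearly exhausted.

```python
import math

def phi(bid, ctr, budget, spent):
    """Generalized Balance score x * (1 - e**(-f)), where x is the bid
    times the click-through rate and f is the unspent budget fraction."""
    x = bid * ctr
    f = 1.0 - spent / budget
    return x * (1.0 - math.exp(-f))

# Bidder A bids more but has almost no budget left; bidder B wins.
a = phi(bid=2.0, ctr=0.5, budget=100, spent=95)   # x = 1.0,  f = 0.05
b = phi(bid=1.5, ctr=0.5, budget=100, spent=10)   # x = 0.75, f = 0.90
print(a < b)   # -> True
```

The factor 1 − e^(−f) shrinks toward 0 as the budget is spent, so the algorithm spreads queries across advertisers instead of draining the highest bidder first.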
 Implementing an Adwords Algorithm: The simplest version of the implementation serves in situations where the bids are on exactly the set of words in the search query. We can represent a query by the list of its words, in sorted order. Bids are stored in a hash table or similar structure, with a hash key equal to the sorted list of words. A search query can then be matched against bids by a straightforward lookup in the table.
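The exact-match case reduces to a dictionary keyed by the sorted word list; a sketch in Python (the helper names and sample ads are ours):

```python
bids = {}

def add_bid(advertiser, words):
    """Store a bid under the sorted tuple of its words."""
    bids.setdefault(tuple(sorted(words)), []).append(advertiser)

def match_query(query_words):
    """Find all bids on exactly this set of words."""
    return bids.get(tuple(sorted(query_words)), [])

add_bid("ad1", ["cheap", "flights"])
add_bid("ad2", ["flights", "cheap"])       # same word set, same key
print(match_query(["cheap", "flights"]))   # -> ['ad1', 'ad2']
```

Sorting the words makes the key canonical, so any ordering of the same words in a query hashes to the same bucket.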
 Matching Word Sets Against Documents: A harder version of the adwords implementation problem allows bids, which are still small sets of words as in a search query, to be matched against larger documents, such as emails or tweets. A bid set matches the document if all the words appear in the document, in any order and not necessarily adjacent.
 Hash Storage of Word Sets: A useful data structure stores the words of each bid set in the order rarest-first. Documents have their words sorted in the same order. Word sets are stored in a hash table with the first word, in the rarest-first order, as the key.
 Processing Documents for Bid Matches: We process the words of the document rarest-first. Word sets whose first word is the current word are copied to a temporary hash table, with the second word as the key. Sets already in the temporary hash table are examined to see if the word that is their key matches the current word, and, if so, they are rehashed using their next word. Sets whose last word is matched are copied to the output.
8.7 References for Chapter 8
9 Recommendation Systems
There is an extensive class of Web applications that involve predicting user responses to options. Such a facility is called a recommendation system. We shall begin this chapter with a survey of the most important examples of these systems. However, to bring the problem into focus, two good examples of recommendation systems are:
1. Offering news articles to online newspaper readers, based on a prediction of reader interests.
2. Offering customers of an online retailer suggestions about what they might like to buy, based on their past history of purchases and/or product searches.
Recommendation systems use a number of different technologies. We can classify these systems into two broad groups.
 Content-based systems examine properties of the items recommended. For instance, if a Netflix user has watched many cowboy movies, then recommend a movie classified in the database as having the “cowboy” genre.
 Collaborative filtering systems recommend items based on similarity measures between users and/or items. The items recommended to a user are those preferred by similar users. This sort of recommendation system can use the groundwork laid in Chapter 3 on similarity search and Chapter 7 on clustering. However, these technologies by themselves are not sufficient, and there are some new algorithms that have proven effective for recommendation systems.
9.1 A Model for Recommendation Systems
9.1.1 The Utility Matrix
9.1.2 The Long Tail
9.1.3 Applications of Recommendation Systems
9.1.4 Populating the Utility Matrix
9.2 Content-Based Recommendations
9.2.1 Item Profiles
9.2.2 Discovering Features of Documents
9.2.3 Obtaining Item Features From Tags
9.2.4 Representing Item Profiles
9.2.5 User Profiles
9.2.6 Recommending Items to Users Based on Content
9.2.7 Classification Algorithms
9.2.8 Exercises for Section 9.2
9.3 Collaborative Filtering
9.3.1 Measuring Similarity
9.3.2 The Duality of Similarity
9.3.3 Clustering Users and Items
9.3.4 Exercises for Section 9.3
9.4 Dimensionality Reduction
9.4.1 UV-Decomposition
9.4.2 Root-Mean-Square Error
9.4.3 Incremental Computation of a UV-Decomposition
9.4.4 Optimizing an Arbitrary Element
9.4.5 Building a Complete UV-Decomposition Algorithm
9.4.6 Exercises for Section 9.4
9.5 The NetFlix Challenge
9.6 Summary of Chapter 9
 Utility Matrices: Recommendation systems deal with users and items. A utility matrix offers known information about the degree to which a user likes an item. Normally, most entries are unknown, and the essential problem of recommending items to users is predicting the values of the unknown entries based on the values of the known entries.
 Two Classes of Recommendation Systems: These systems attempt to predict a user’s response to an item by discovering similar items and the response of the user to those. One class of recommendation system is content-based; it measures similarity by looking for common features of the items. A second class of recommendation system uses collaborative filtering; these measure similarity of users by their item preferences and/or measure similarity of items by the users who like them.
 Item Profiles: These consist of features of items. Different kinds of items have different features on which contentbased similarity can be based. Features of documents are typically important or unusual words. Products have attributes such as screen size for a television. Media such as movies have a genre and details such as actor or performer. Tags can also be used as features if they can be acquired from interested users.
 User Profiles: A content-based system can construct profiles for users by measuring the frequency with which features appear in the items the user likes. We can then estimate the degree to which a user will like an item by the closeness of the item’s profile to the user’s profile.
 Classification of Items: An alternative to constructing a user profile is to build a classifier for each user, e.g., a decision tree. The row of the utility matrix for that user becomes the training data, and the classifier must predict the response of the user to all items, whether or not the row had an entry for that item.
 Similarity of Rows and Columns of the Utility Matrix: Collaborative filtering algorithms must measure the similarity of rows and/or columns of the utility matrix. Jaccard distance is appropriate when the matrix consists only of 1’s and blanks (for “not rated”). Cosine distance works for more general values in the utility matrix. It is often useful to normalize the utility matrix by subtracting the average value (either by row, by column, or both) before measuring the cosine distance.
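The effect of normalization can be sketched as follows, with blanks represented as `None` and treated as 0 after centering (the user names and ratings are toy data of ours):

```python
from math import sqrt

def normalize(row):
    """Subtract the row's average rating from its non-blank entries;
    blanks become 0 after centering."""
    rated = [v for v in row if v is not None]
    avg = sum(rated) / len(rated)
    return [v - avg if v is not None else 0.0 for v in row]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Two users whose raw ratings differ in scale but agree in preference:
alice = normalize([5, 3, None, 1])
bob   = normalize([4, None, 2, 0])
print(cosine(alice, bob))   # -> 1.0 after centering
```

Raw ratings would make these users look merely similar; after subtracting each user's average, both reduce to [2, 0, 0, −2], and the cosine similarity is exactly 1.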
 Clustering Users and Items: Since the utility matrix tends to be mostly blanks, distance measures such as Jaccard or cosine often have too little data with which to compare two rows or two columns. A preliminary step or steps, in which similarity is used to cluster users and/or items into small groups with strong similarity, can help provide more common components with which to compare rows or columns.
 UV-Decomposition: One way of predicting the blank values in a utility matrix is to find two long, thin matrices U and V, whose product is an approximation to the given utility matrix. Since the matrix product UV gives values for all user-item pairs, that value can be used to predict the value of a blank in the utility matrix. The intuitive reason this method makes sense is that often there are a relatively small number of issues (that number is the “thin” dimension of U and V) that determine whether or not a user likes an item.
 Root-Mean-Square Error: A good measure of how close the product UV is to the given utility matrix is the RMSE (root-mean-square error). The RMSE is computed by averaging the square of the differences between UV and the utility matrix, in those elements where the utility matrix is non-blank. The square root of this average is the RMSE.
 Computing U and V: One way of finding a good choice for U and V in a UV-decomposition is to start with arbitrary matrices U and V. Repeatedly adjust one of the elements of U or V to minimize the RMSE between the product UV and the given utility matrix. The process converges to a local optimum, although to have a good chance of obtaining a global optimum we must either repeat the process from many starting matrices, or search from the starting point in many different ways.
 The NetFlix Challenge: An important driver of research into recommendation systems was the NetFlix challenge. A prize of $1,000,000 was offered for a contestant who could produce an algorithm that was 10% better than NetFlix’s own algorithm at predicting movie ratings by users. The prize was awarded in Sept., 2009.
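The element-at-a-time adjustment of U and V summarized above can be sketched in pure Python. This is a minimal illustration under assumptions: a made-up 5×5 utility matrix, thin dimension d = 2, all-ones starting matrices, and a fixed number of passes. Each adjustment sets a single element of U or V to the value that minimizes the RMSE over the non-blank entries, with everything else held fixed.

```python
# Toy utility matrix: rows are users, columns are items, None = blank.
M = [[5, 2, 4, 4, 3],
     [3, 1, 2, 4, 1],
     [2, None, 3, 1, 4],
     [2, 5, 4, 3, 5],
     [4, 4, 5, 4, None]]
n, m, d = len(M), len(M[0]), 2          # d is the "thin" dimension

U = [[1.0] * d for _ in range(n)]       # n x d, arbitrary start
V = [[1.0] * m for _ in range(d)]       # d x m, arbitrary start

def rmse():
    """RMSE between the product UV and M over the non-blank entries."""
    se, cnt = 0.0, 0
    for i in range(n):
        for j in range(m):
            if M[i][j] is not None:
                p = sum(U[i][k] * V[k][j] for k in range(d))
                se += (M[i][j] - p) ** 2
                cnt += 1
    return (se / cnt) ** 0.5

def adjust_u(r, s):
    """Set U[r][s] to the RMSE-minimizing value, all else held fixed."""
    num = den = 0.0
    for j in range(m):
        if M[r][j] is None:
            continue
        rest = sum(U[r][k] * V[k][j] for k in range(d) if k != s)
        num += V[s][j] * (M[r][j] - rest)
        den += V[s][j] ** 2
    if den:
        U[r][s] = num / den

def adjust_v(r, s):
    """Set V[r][s] to the RMSE-minimizing value, all else held fixed."""
    num = den = 0.0
    for i in range(n):
        if M[i][s] is None:
            continue
        rest = sum(U[i][k] * V[k][s] for k in range(d) if k != r)
        num += U[i][r] * (M[i][s] - rest)
        den += U[i][r] ** 2
    if den:
        V[r][s] = num / den

before = rmse()
for _ in range(5):                      # a few full passes over U and V
    for i in range(n):
        for k in range(d):
            adjust_u(i, k)
    for k in range(d):
        for j in range(m):
            adjust_v(k, j)
after = rmse()
```

Because each step is an exact one-dimensional minimization, the RMSE never increases, and the process settles into a local optimum; different starting matrices can reach different local optima.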
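The row normalization and cosine measure mentioned above can likewise be sketched directly; the two sample rating rows are made up for illustration. Blanks (None) are ignored when computing each row's average and treated as 0 afterward.

```python
import math

def normalize(row):
    """Subtract the row's average over its rated entries; blanks become 0."""
    rated = [v for v in row if v is not None]
    avg = sum(rated) / len(rated)
    return [0.0 if v is None else v - avg for v in row]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

u1 = normalize([4, 5, None, 3])   # average 4 -> [0, 1, 0, -1]
u2 = normalize([2, 3, 1, None])   # average 2 -> [0, 1, -1, 0]
sim = cosine_similarity(u1, u2)   # dot 1, each norm sqrt(2) -> 0.5
```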
9.7 References for Chapter 9
10 Mining Social-Network Graphs
There is much information to be gained by analyzing the large-scale data that is derived from social networks. The best-known example of a social network is the “friends” relation found on sites like Facebook. However, as we shall see, there are many other sources of data that connect people or other entities.
In this chapter, we shall study techniques for analyzing such networks. An important question about a social network is how to identify “communities,” that is, subsets of the nodes (people or other entities that form the network) with unusually strong connections. Some of the techniques used to identify communities are similar to the clustering algorithms we discussed in Chapter 7. However, communities almost never partition the set of nodes in a network. Rather, communities usually overlap. For example, you may belong to several communities of friends or classmates. The people from one community tend to know each other, but people from two different communities rarely know each other. You would not want to be assigned to only one of the communities, nor would it make sense to cluster all the people from all your communities into one cluster.
Also in this chapter we explore efficient algorithms for discovering other properties of graphs. We look at “simrank,” a way to discover similarities among nodes of a graph. We explore triangle counting as a way to measure the connectedness of a community. We give efficient algorithms for exact and approximate measurement of the neighborhood sizes of nodes in a graph. Finally, we look at efficient algorithms for computing the transitive closure.
10.1 Social Networks as Graphs
10.1.1 What is a Social Network?
10.1.2 Social Networks as Graphs
10.1.3 Varieties of Social Networks
10.1.4 Graphs With Several Node Types
10.1.5 Exercises for Section 10.1
10.2 Clustering of Social-Network Graphs
10.2.1 Distance Measures for Social-Network Graphs
10.2.2 Applying Standard Clustering Methods
10.2.3 Betweenness
10.2.4 The Girvan-Newman Algorithm
10.2.5 Using Betweenness to Find Communities
10.2.6 Exercises for Section 10.2
10.3 Direct Discovery of Communities
10.3.1 Finding Cliques
10.3.2 Complete Bipartite Graphs
10.3.3 Finding Complete Bipartite Subgraphs
10.3.4 Why Complete Bipartite Graphs Must Exist
10.3.5 Exercises for Section 10.3
10.4 Partitioning of Graphs
10.4.1 What Makes a Good Partition?
10.4.2 Normalized Cuts
10.4.3 Some Matrices That Describe Graphs
10.4.4 Eigenvalues of the Laplacian Matrix
10.4.5 Alternative Partitioning Methods
10.4.6 Exercises for Section 10.4
10.5 Finding Overlapping Communities
10.5.1 The Nature of Communities
10.5.2 Maximum-Likelihood Estimation
10.5.3 The Affiliation-Graph Model
10.5.4 Avoiding the Use of Discrete Membership Changes
10.5.5 Exercises for Section 10.5
10.6 Simrank
10.6.1 Random Walkers on a Social Graph
10.6.2 Random Walks with Restart
10.6.3 Exercises for Section 10.6
10.7 Counting Triangles
10.7.1 Why Count Triangles?
10.7.2 An Algorithm for Finding Triangles
10.7.3 Optimality of the Triangle-Finding Algorithm
10.7.4 Finding Triangles Using MapReduce
10.7.5 Using Fewer Reduce Tasks
10.7.6 Exercises for Section 10.7
10.8 Neighborhood Properties of Graphs
10.8.1 Directed Graphs and Neighborhoods
10.8.2 The Diameter of a Graph
10.8.3 Transitive Closure and Reachability
10.8.4 Transitive Closure Via MapReduce
10.8.5 Smart Transitive Closure
10.8.6 Transitive Closure by Graph Reduction
10.8.7 Approximating the Sizes of Neighborhoods
10.8.8 Exercises for Section 10.8
10.9 Summary of Chapter 10
 Social-Network Graphs: Graphs that represent the connections in a social network are not only large, but they exhibit a form of locality, where small subsets of nodes (communities) have a much higher density of edges than the average density.
 Communities and Clusters: While communities resemble clusters in some ways, there are also significant differences. Individuals (nodes) normally belong to several communities, and the usual distance measures fail to represent closeness among nodes of a community. As a result, standard algorithms for finding clusters in data do not work well for community finding.
 Betweenness: One way to separate nodes into communities is to measure the betweenness of edges, which is the sum over all pairs of nodes of the fraction of shortest paths between those nodes that go through the given edge. Communities are formed by deleting the edges whose betweenness is above a given threshold.
 The Girvan-Newman Algorithm: The Girvan-Newman Algorithm is an efficient technique for computing the betweenness of edges. A breadth-first search from each node is performed, and a sequence of labeling steps computes the share of paths from the root to each other node that go through each of the edges. The shares for an edge that are computed for each root are summed to get the betweenness.
 Communities and Complete Bipartite Graphs: A complete bipartite graph has two groups of nodes, all possible edges between pairs of nodes chosen one from each group, and no edges between nodes of the same group. Any sufficiently dense community (a set of nodes with many edges among them) will have a large complete bipartite graph.
 Finding Complete Bipartite Graphs: We can find complete bipartite graphs by the same techniques we used for finding frequent itemsets. Nodes of the graph can be thought of both as the items and as the baskets. The basket corresponding to a node is the set of adjacent nodes, thought of as items. Finding a complete bipartite subgraph with node groups of size t and s then amounts to finding frequent itemsets of size t with support s.
 Graph Partitioning: One way to find communities is to partition a graph repeatedly into pieces of roughly similar sizes. A cut is a partition of the nodes of the graph into two sets, and its size is the number of edges that have one end in each set. The volume of a set of nodes is the number of edges with at least one end in that set.
 Normalized Cuts: We can normalize the size of a cut by taking the ratio of the size of the cut and the volume of each of the two sets formed by the cut. Then add these two ratios to get the normalized cut value. Normalized cuts with a low sum are good, in the sense that they tend to divide the nodes into two roughly equal parts, and have a relatively small size.
 Adjacency Matrices: These matrices describe a graph. The entry in row i and column j is 1 if there is an edge between nodes i and j, and 0 otherwise.
 Degree Matrices: The degree matrix for a graph has d in the ith diagonal entry if d is the degree of the ith node. Off the diagonal, all entries are 0.
 Laplacian Matrices: The Laplacian matrix for a graph is its degree matrix minus its adjacency matrix. That is, the entry in row i and column i of the Laplacian matrix is the degree of the ith node of the graph, and the entry in row i and column j, for i ≠ j, is −1 if there is an edge between nodes i and j, and 0 otherwise.
 Spectral Method for Partitioning Graphs: The lowest eigenvalue for any Laplacian matrix is 0, and its corresponding eigenvector consists of all 1’s. The eigenvectors corresponding to small eigenvalues can be used to guide a partition of the graph into two parts of similar size with a small cut value. For one example, putting the nodes with a positive component in the eigenvector with the second-smallest eigenvalue into one set and those with a negative component into the other is usually good.
 Overlapping Communities: Typically, individuals are members of several communities. In graphs describing social networks, it is normal for the probability that two individuals are friends to rise as the number of communities of which both are members grows.
 The Affiliation-Graph Model: An appropriate model for membership in communities is to assume that for each community there is a probability that because of this community two members become friends (have an edge in the social network graph). Thus, the probability that two nodes have an edge is 1 minus the product of the probabilities that none of the communities of which both are members cause there to be an edge between them. We then find the assignment of nodes to communities and the values of those probabilities that best describes the observed social graph.
 Maximum-Likelihood Estimation: An important modeling technique, useful for modeling communities as well as many other things, is to compute, as a function of all choices of parameter values that the model allows, the probability that the observed data would be generated. The values that yield the highest probability are assumed to be correct, and called the maximum-likelihood estimate (MLE).
 Use of Gradient Descent: If we know membership in communities, we can find the MLE by gradient descent or other methods. However, we cannot find the best membership in communities by gradient descent, because membership is discrete, not continuous.
 Improved Community Modeling by Strength of Membership: We can formulate the problem of finding the MLE of communities in a social graph by assuming individuals have a strength of membership in each community, possibly 0 if they are not a member. If we define the probability of an edge between two nodes to be a function of their membership strengths in their common communities, we can turn the problem of finding the MLE into a continuous problem and solve it using gradient descent.
 Simrank: One way to measure the similarity of nodes in a graph with several types of nodes is to start a random walker at one node and allow it to wander, with a fixed probability of restarting at the same node. The distribution of where the walker can be expected to be is a good measure of the similarity of nodes to the starting node. This process must be repeated with each node as the starting node if we are to get all-pairs similarity.
 Triangles in Social Networks: The number of triangles per node is an important measure of the closeness of a community and often reflects its maturity. We can enumerate or count the triangles in a graph with m edges in O(m^(3/2)) time, but no more efficient algorithm exists in general.
 Triangle Finding by MapReduce: We can find triangles in a single round of MapReduce by treating it as a threeway join. Each edge must be sent to a number of reducers proportional to the cube root of the total number of reducers, and the total computation time spent at all the reducers is proportional to the time of the serial algorithm for triangle finding.
 Neighborhoods: The neighborhood of radius d for a node v in a directed or undirected graph is the set of nodes reachable from v along paths of length at most d. The neighborhood profile of a node is the sequence of neighborhood sizes for all distances from 1 upwards. The diameter of a connected graph is the smallest d for which the neighborhood of radius d for any starting node includes the entire graph.
 Transitive Closure: A node v can reach node u if u is in the neighborhood of v for some radius. The transitive closure of a graph is the set of pairs of nodes (v, u) such that v can reach u.
 Computing Transitive Closure: Since the transitive closure can have a number of facts equal to the square of the number of nodes of a graph, it is infeasible to compute transitive closure directly for large graphs. One approach is to find strongly connected components of the graph and collapse them each to a single node before computing the transitive closure.
 Transitive Closure and MapReduce: We can view transitive closure computation as the iterative join of a path relation (pairs of nodes v and u such that u is known to be reachable from v) and the arc relation of the graph. Such an approach requires a number of MapReduce rounds equal to the diameter of the graph.
 Transitive Closure by Recursive Doubling: An approach that uses fewer MapReduce rounds is to join the path relation with itself at each round. At each round, we double the length of paths that are able to contribute to the transitive closure. Thus, the number of needed rounds is only the base-2 logarithm of the diameter of the graph.
 Smart Transitive Closure: While recursive doubling can cause the same path to be considered many times, and thus increases the total computation time (compared with iteratively joining paths with single arcs), a variant called smart transitive closure avoids discovering the same path more than once. The trick is to require that when joining two paths, the first has a length that is a power of 2.
 Approximating Neighborhood Sizes: By using the Flajolet-Martin technique for approximating the number of distinct elements in a stream, we can find the neighborhood sizes at different radii approximately. We maintain a set of tail lengths for each node. To increase the radius by 1, we examine each edge (u, v) and for each tail length for u we set it equal to the corresponding tail length for v if the latter is larger than the former.
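The Girvan-Newman credit computation summarized above can be sketched as follows (a simplified rendering, with a made-up adjacency-list graph): a BFS from each root assigns levels and shortest-path counts, credit then flows from the leaves back toward the root, split among parent edges in proportion to path counts, and each edge's total is halved because every shortest path is discovered from both of its endpoints.

```python
from collections import deque, defaultdict

def edge_betweenness(graph):
    """Girvan-Newman betweenness for an undirected adjacency-list graph:
    BFS from every root, then propagate credit from the deepest nodes
    back up; totals are halved since each path is seen from both ends."""
    total = defaultdict(float)
    for root in graph:
        level, npaths, order = {root: 0}, {root: 1}, [root]
        q = deque([root])
        while q:
            u = q.popleft()
            for v in graph[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    npaths[v] = 0
                    order.append(v)
                    q.append(v)
                if level[v] == level[u] + 1:
                    npaths[v] += npaths[u]    # shortest paths via u
        credit = {v: 1.0 for v in order}      # every node starts with 1
        credit[root] = 0.0
        for v in reversed(order):             # deepest levels first
            for u in graph[v]:
                if u in level and level[u] == level[v] - 1:
                    share = credit[v] * npaths[u] / npaths[v]
                    total[tuple(sorted((u, v)))] += share
                    credit[u] += share        # parent absorbs the share
    return {e: b / 2 for e, b in total.items()}

path = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
bt = edge_betweenness(path)   # each edge lies on 2 of the 3 shortest paths
```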
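The O(m^(3/2)) triangle enumeration mentioned above can also be sketched with the standard degree-ordering trick (a hedged illustration; the heavy-hitter analysis behind the bound is omitted): orient every edge from its lower-ranked endpoint to its higher-ranked one, then intersect out-neighbor sets, so each triangle is counted exactly once.

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles in an undirected graph given as a list of edges.
    Nodes are ranked by (degree, name); each edge is directed toward its
    higher-ranked endpoint, and for each edge we intersect the out-
    neighborhoods of its endpoints, which counts each triangle once."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = lambda x: (deg[x], x)      # break degree ties by name
    out = defaultdict(set)
    for u, v in edges:
        lo, hi = sorted((u, v), key=rank)
        out[lo].add(hi)
    count = 0
    for u, v in edges:
        lo, hi = sorted((u, v), key=rank)
        count += len(out[lo] & out[hi])   # common higher neighbors
    return count
```

For example, the complete graph on four nodes contains four triangles, while a two-edge path contains none.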
10.10 References for Chapter 10
11 Dimensionality Reduction
There are many sources of data that can be viewed as a large matrix. We saw in Chapter 5 how the Web can be represented as a transition matrix. In Chapter 9, the utility matrix was a point of focus. And in Chapter 10 we examined matrices that represent social networks. In many of these matrix applications, the matrix can be summarized by finding “narrower” matrices that in some sense are close to the original. These narrow matrices have only a small number of rows or a small number of columns, and therefore can be used much more efficiently than can the original large matrix. The process of finding these narrow matrices is called dimensionality reduction.
We saw a preliminary example of dimensionality reduction in Section 9.4. There, we discussed UV-decomposition of a matrix and gave a simple algorithm for finding this decomposition. Recall that a large matrix M was decomposed into two matrices U and V whose product UV was approximately M. The matrix U had a small number of columns whereas V had a small number of rows, so each was significantly smaller than M, and yet together they represented most of the information in M that was useful in predicting ratings of items by individuals.
In this chapter we shall explore the idea of dimensionality reduction in more detail. We begin with a discussion of eigenvalues and their use in “principal component analysis” (PCA). We cover singular-value decomposition, a more powerful version of UV-decomposition. Finally, because we are always interested in the largest data sizes we can handle, we look at another form of decomposition, called CUR-decomposition, which is a variant of singular-value decomposition that keeps the matrices of the decomposition sparse if the original matrix is sparse.
11.1 Eigenvalues and Eigenvectors
11.1.1 Definitions
11.1.2 Computing Eigenvalues and Eigenvectors
11.1.3 Finding Eigenpairs by Power Iteration
11.1.4 The Matrix of Eigenvectors
11.1.5 Exercises for Section 11.1
11.2 Principal-Component Analysis
11.2.1 An Illustrative Example
11.2.2 Using Eigenvectors for Dimensionality Reduction
11.2.3 The Matrix of Distances
11.2.4 Exercises for Section 11.2
11.3 Singular-Value Decomposition
11.3.1 Definition of SVD
11.3.2 Interpretation of SVD
11.3.3 Dimensionality Reduction Using SVD
11.3.4 Why Zeroing Low Singular Values Works
11.3.5 Querying Using Concepts
11.3.6 Computing the SVD of a Matrix
11.3.7 Exercises for Section 11.3
11.4 CUR Decomposition
11.4.1 Definition of CUR
11.4.2 Choosing Rows and Columns Properly
11.4.3 Constructing the Middle Matrix
11.4.4 The Complete CUR Decomposition
11.4.5 Eliminating Duplicate Rows and Columns
11.4.6 Exercises for Section 11.4
11.5 Summary of Chapter 11
 Dimensionality Reduction: The goal of dimensionality reduction is to replace a large matrix by two or more other matrices whose sizes are much smaller than the original, but from which the original can be approximately reconstructed, usually by taking their product.
 Eigenvalues and Eigenvectors: A matrix may have several eigenvectors such that when the matrix multiplies the eigenvector, the result is a constant multiple of the eigenvector. That constant is the eigenvalue associated with this eigenvector. Together the eigenvector and its eigenvalue are called an eigenpair.
 Finding Eigenpairs by Power Iteration: We can find the principal eigenvector (eigenvector with the largest eigenvalue) by starting with any vector and repeatedly multiplying the current vector by the matrix to get a new vector. When the changes to the vector become small, we can treat the result as a close approximation to the principal eigenvector. By modifying the matrix, we can then use the same iteration to get the second eigenpair (that with the second-largest eigenvalue), and similarly get each of the eigenpairs in turn, in order of decreasing value of the eigenvalue.
 Principal-Component Analysis: This technique for dimensionality reduction views data consisting of a collection of points in a multidimensional space as a matrix, with rows corresponding to the points and columns to the dimensions. The product of this matrix and its transpose has eigenpairs, and the principal eigenvector can be viewed as the direction in the space along which the points best line up. The second eigenvector represents the direction in which deviations from the principal eigenvector are the greatest, and so on.
 Dimensionality Reduction by PCA: By representing the matrix of points by a small number of its eigenvectors, we can approximate the data in a way that minimizes the root-mean-square error for the given number of columns in the representing matrix.
 Singular-Value Decomposition: The singular-value decomposition of a matrix consists of three matrices, U, Σ, and V. The matrices U and V are column-orthonormal, meaning that as vectors, the columns are orthogonal, and their lengths are 1. The matrix Σ is a diagonal matrix, and the values along its diagonal are called singular values. The product of U, Σ, and the transpose of V equals the original matrix.
 Concepts: SVD is useful when there are a small number of concepts that connect the rows and columns of the original matrix. For example, if the original matrix represents the ratings given by movie viewers (rows) to movies (columns), the concepts might be the genres of the movies. The matrix U connects rows to concepts, Σ represents the strengths of the concepts, and V connects the concepts to columns.
 Queries Using the SingularValue Decomposition: We can use the decomposition to relate new or hypothetical rows of the original matrix to the concepts represented by the decomposition. Multiply a row by the matrix V of the decomposition to get a vector indicating the extent to which that row matches each of the concepts.
 Using SVD for Dimensionality Reduction: In a complete SVD for a matrix, U and V are typically as large as the original. To use fewer columns for U and V, delete the columns corresponding to the smallest singular values from U, V, and Σ. This choice minimizes the error in reconstructing the original matrix from the modified U, Σ, and V.
 Decomposing Sparse Matrices: Even in the common case where the given matrix is sparse, the matrices constructed by SVD are dense. The CUR decomposition seeks to decompose a sparse matrix into sparse, smaller matrices whose product approximates the original matrix.
 CUR Decomposition: This method chooses from a given sparse matrix a set of columns C and a set of rows R, which play the role of U and V^T in SVD; the user can pick any number of rows and columns. The choice of rows and columns is made randomly with a distribution that depends on the Frobenius norm, or the square root of the sum of the squares of the elements. Between C and R is a square matrix called U that is constructed by a pseudoinverse of the intersection of the chosen rows and columns.
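Power iteration as summarized above can be sketched in pure Python; the 2×2 symmetric matrix here is a toy example chosen for illustration. After the principal eigenpair (lam1, x1) is found, subtracting lam1 times the outer product x1 x1^T removes that eigenpair, and running the same loop on the modified matrix recovers the next one.

```python
def power_iteration(M, iters=200):
    """Approximate the principal eigenpair of a square matrix by
    repeatedly multiplying a vector by M and normalizing to length 1."""
    n = len(M)
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # For a unit eigenvector x, the eigenvalue is x^T M x.
    Mx = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    lam = sum(x[i] * Mx[i] for i in range(n))
    return lam, x

M = [[3.0, 2.0], [2.0, 6.0]]        # eigenvalues 7 and 2
lam1, x1 = power_iteration(M)       # principal eigenpair

# Deflate: M2 = M - lam1 * x1 x1^T now has the second eigenpair
# of M as its principal eigenpair.
M2 = [[M[i][j] - lam1 * x1[i] * x1[j] for j in range(2)] for i in range(2)]
lam2, x2 = power_iteration(M2)
```

For this matrix the iteration converges to lam1 = 7 with x1 proportional to (1, 2), and deflation then yields lam2 = 2.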
11.6 References for Chapter 11
12 Large-Scale Machine Learning
Many algorithms are today classified as “machine learning.” These algorithms share, with the other algorithms studied in this book, the goal of extracting information from data. All algorithms for analysis of data are designed to produce a useful summary of the data, from which decisions are made. Among many examples, the frequent-itemset analysis that we did in Chapter 6 produces information like association rules, which can then be used for planning a sales strategy or for many other purposes.
However, algorithms called “machine learning” not only summarize our data; they are perceived as learning a model or classifier from the data, and thus discover something about data that will be seen in the future. For instance, the clustering algorithms discussed in Chapter 7 produce clusters that not only tell us something about the data being analyzed (the training set), but they allow us to classify future data into one of the clusters that result from the clustering algorithm. Thus, machine-learning enthusiasts often speak of clustering with the neologism “unsupervised learning”; the term unsupervised refers to the fact that the input data does not tell the clustering algorithm what the clusters should be. In supervised machine learning, which is the subject of this chapter, the available data includes information about the correct way to classify at least some of the data. The data classified already is called the training set.
In this chapter, we do not attempt to cover all the different approaches to machine learning. We concentrate on methods that are suitable for very large data and that have the potential for parallel implementation. We consider the classical “perceptron” approach to learning a data classifier, where a hyperplane that separates two classes is sought. Then, we look at more modern techniques involving support-vector machines. Similar to perceptrons, these methods look for hyperplanes that best divide the classes, so that few, if any, members of the training set lie close to the hyperplane. We end with a discussion of nearest-neighbor techniques, where data is classified according to the class(es) of their nearest neighbors in some space.
12.1 The Machine-Learning Model
12.1.1 Training Sets
12.1.2 Some Illustrative Examples
12.1.3 Approaches to Machine Learning
12.1.4 Machine-Learning Architecture
12.1.5 Exercises for Section 12.1
12.2 Perceptrons
12.2.1 Training a Perceptron with Zero Threshold
12.2.2 Convergence of Perceptrons
12.2.3 The Winnow Algorithm
12.2.4 Allowing the Threshold to Vary
12.2.5 Multiclass Perceptrons
12.2.6 Transforming the Training Set
12.2.7 Problems With Perceptrons
12.2.8 Parallel Implementation of Perceptrons
12.2.9 Exercises for Section 12.2
12.3 Support-Vector Machines
12.3.1 The Mechanics of an SVM
12.3.2 Normalizing the Hyperplane
12.3.3 Finding Optimal Approximate Separators
12.3.4 SVM Solutions by Gradient Descent
12.3.5 Stochastic Gradient Descent
12.3.6 Parallel Implementation of SVM
12.3.7 Exercises for Section 12.3
12.4 Learning from Nearest Neighbors
12.4.1 The Framework for Nearest-Neighbor Calculations
12.4.2 Learning with One Nearest Neighbor
12.4.3 Learning One-Dimensional Functions
12.4.4 Kernel Regression
12.4.5 Dealing with High-Dimensional Euclidean Data
12.4.6 Dealing with Non-Euclidean Distances
12.4.7 Exercises for Section 12.4
12.5 Comparison of Learning Methods
12.6 Summary of Chapter 12
 Training Sets: A training set consists of pairs: a feature vector, each component of which is a feature, and a label indicating the class to which the object represented by the feature vector belongs. Features can be categorical – belonging to an enumerated list of values – or numerical.
 Test Sets and Overfitting: When training a classifier on a training set, it is useful to remove some of the training set and use the removed data as a test set. After producing a model or classifier without using the test set, we can run the classifier on the test set to see how well it does. If the classifier does not perform as well on the test set as on the training set used, then we have overfit the training set by conforming to peculiarities of the training-set data that are not present in the data as a whole.
 Batch Versus On-Line Learning: In batch learning, the training set is available at any time and can be used in repeated passes. On-line learning uses a stream of training examples, each of which can be used only once.
 Perceptrons: This machinelearning method assumes the training set has only two class labels, positive and negative. Perceptrons work when there is a hyperplane that separates the feature vectors of the positive examples from those of the negative examples. We converge to that hyperplane by adjusting our estimate of the hyperplane by a fraction – the learning rate – of the direction that is the average of the currently misclassified points.
 The Winnow Algorithm: This algorithm is a variant of the perceptron algorithm that requires components of the feature vectors to be 0 or 1. Training examples are examined in a round-robin fashion, and if the current classification of a training example is incorrect, the components of the estimated separator where the feature vector has 1 are adjusted up or down, in the direction that will make it more likely this training example is correctly classified in the next round.
 Nonlinear Separators: When the training points do not have a linear function that separates two classes, it may still be possible to use a perceptron to classify them. We must find a function we can use to transform the points so that in the transformed space, the separator is a hyperplane.
 Support-Vector Machines: The SVM improves upon perceptrons by finding a separating hyperplane that not only separates the positive and negative points, but does so in a way that maximizes the margin – the distance perpendicular to the hyperplane to the nearest points. The points that lie exactly at this minimum distance are the support vectors. Alternatively, the SVM can be designed to allow points that are too close to the hyperplane, or even on the wrong side of the hyperplane, but minimize the error due to such misplaced points.
 Solving the SVM Equations: We can set up a function of the vector that is normal to the hyperplane, the length of the vector (which determines the margin), and the penalty for points on the wrong side of the margins. The regularization parameter determines the relative importance of a wide margin and a small penalty. The equations can be solved by several methods, including gradient descent and quadratic programming.
 Nearest-Neighbor Learning: In this approach to machine learning, the entire training set is used as the model. For each (“query”) point to be classified, we search for its k nearest neighbors in the training set. The classification of the query point is some function of the labels of these k neighbors. The simplest case is when k = 1, in which case we can take the label of the query point to be the label of the nearest neighbor.
 Regression: A common case of nearest-neighbor learning, called regression, occurs when there is only one feature vector, and it, as well as the label, are real numbers; i.e., the data defines a real-valued function of one variable. To estimate the label, i.e., the value of the function, for an unlabeled data point, we can perform some computation involving the k nearest neighbors. Examples include averaging the neighbors or taking a weighted average, where the weight of a neighbor is some decreasing function of its distance from the point whose label we are trying to determine.
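A zero-threshold perceptron of the kind summarized above can be sketched as follows. This is a hedged illustration: it uses the common per-example update (rather than averaging all misclassified points at once), and the training data and learning rate are made up.

```python
def train_perceptron(points, labels, eta=0.1, epochs=1000):
    """Learn a weight vector w so that sign(w . x) matches each label
    (+1 or -1), nudging w by eta * y * x for each misclassified point."""
    w = [0.0] * len(points[0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:           # converged: every point separated
            break
    return w

pts = [(1, 2), (2, 1), (-1, -1), (-2, -1)]
lbl = [1, 1, -1, -1]                # separable by a line through the origin
w = train_perceptron(pts, lbl)
```

If the classes are linearly separable by a hyperplane through the origin, the perceptron convergence theorem guarantees this loop stops after finitely many updates.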
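The weighted-average form of nearest-neighbor regression described above can be sketched the same way; the Gaussian weight function, the bandwidth, and the toy one-dimensional data are assumptions of the example (here all training points act as neighbors, with far ones weighted toward zero).

```python
import math

def kernel_regression(train, x, bandwidth=1.0):
    """Estimate f(x) as a weighted average of training labels, where a
    neighbor's weight decays with its distance from the query point."""
    num = den = 0.0
    for xi, yi in train:
        w = math.exp(-((x - xi) / bandwidth) ** 2)
        num += w * yi
        den += w
    return num / den

data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]   # samples of f(x) = x
estimate = kernel_regression(data, 1.0)       # symmetric weights -> 1.0
```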