Data Science for Mining of Massive Datasets

Table of contents
  1. Story
  2. Slides
  3. Spotfire Dashboard
  4. Research Notes
  5. CS246: Mining Massive Datasets Slides
    1. Chapter 1 Introduction
      1. Slide 1 Mining of Massive Datasets: Course Introduction
      2. Slide 2 What is Data Mining? Knowledge discovery from data
      3. Slide 3 Some Massive Data Statistics
      4. Slide 4 Data contains value and knowledge
      5. Slide 5 Data Mining
      6. Slide 6 Good news: Demand for Data Mining
      7. Slide 7 What is Data Mining?
      8. Slide 8 Data Mining Tasks
      9. Slide 9 Meaningfulness of Analytic Answers 1
      10. Slide 10 Meaningfulness of Analytic Answers 2
      11. Slide 11 What matters when dealing with data?
      12. Slide 12 Data Mining: Cultures
      13. Slide 13 This Class: CS246
      14. Slide 14 What will we learn? 1
      15. Slide 15 What will we learn? 2
      16. Slide 16 How It All Fits Together
      17. Slide 17 How do you want that data?
      18. Slide 18 About the Course
      19. Slide 19 2014 CS246 Course Staff
      20. Slide 20 Course Logistics
      21. Slide 21 Logistics: Communication
      22. Slide 22 Work for the Course 1
      23. Slide 23 Work for the Course 2
      24. Slide 24 Course Calendar
      25. Slide 25 Prerequisites
      26. Slide 26 Recitation Sessions
      27. Slide 27 What's after the class
      28. Slide 28 3 To-do items
  6. Hadoop Tutorial
    1. General Instructions
    2. 1 Setting up a virtual machine
    3. 2 Running Hadoop jobs
      1. 2.1 Creating a Hadoop project in Eclipse
        1. Figure 1: Create a Hadoop Project
        2. Figure 2: Create a Hadoop Project
        3. Figure 3: Create a Hadoop Project
      2. 2.2 Running Hadoop jobs in standalone mode
        1. Figure 4: Run a Hadoop Project
        2. Figure 5: Run a Hadoop Project
        3. Figure 6: Run a Hadoop Project
        4. Figure 7: Run a Hadoop Project
        5. Figure 8: Run a Hadoop Project
        6. Figure 9: Run a Hadoop Project
      3. 2.3 Running Hadoop in pseudo-distributed mode
        1. Figure 10: Run a Hadoop Project
        2. Figure 11: Run a Hadoop Project
        3. Figure 12: Run a Hadoop Project
      4. 2.4 Debugging Hadoop jobs
        1. Figure 13: Debug a Hadoop project
        2. Figure 14: Run a Hadoop Project
      5. 2.5 Example project
        1. Figure 15: Create a Hadoop Project
        2. Figure 16: Create a Hadoop Project
        3. Figure 17: Create a Hadoop Project
        4. Figure 18: Create a Hadoop Project
        5. Figure 19: Create a Hadoop Project
        6. Figure 20: Create a Hadoop Project
        7. Figure 21: Create a Hadoop Project
        8. Figure 22: Create a Java file
        9. Figure 23: Create a Java file
        10. Figure 24: Create WordCount.java
        11. Figure 25: Create WordCount.java
        12. Figure 26: Create WordCount.java
        13. Figure 27: Run WordCount.java
        14. Figure 28: Run WordCount.java
        15. Figure 29: Run WordCount.java
        16. Figure 30: Run WordCount.java
        17. Figure 31: Run WordCount.java
        18. Figure 32: Export a Hadoop project
        19. Figure 33: Run WordCount.java
        20. Figure 34: Export a Hadoop project
        21. Figure 35: Export a Hadoop project
        22. Figure 36: Run WordCount job
        23. Figure 37: Run WordCount job
        24. Figure 38: Run WordCount job
        25. Figure 39: View WordCount job logs
        26. Figure 40: View WordCount job logs
        27. Figure 41: View WordCount job logs
        28. Figure 42: View WordCount job logs
      6. 2.6 Using your local machine for development
      7. Further Hadoop tutorials
      8. Further Eclipse tutorials
    4. 3 Task: Write your own Hadoop Job
  7. CS246: Mining Massive Datasets Winter 2015
    1. Course Information
    2. Topics
    3. Assignments and grading
    4. Homework policy
    5. Prerequisites
    6. Materials
    7. Important dates
    8. Next steps for students
  8. Mining of Massive Datasets
    1. The book
    2. The MOOC (Massive Open Online Course)
    3. The 2nd edition of the book (v2.1)
    4. Stanford big data courses
      1. CS246
      2. CS341
      3. CS224W
      4. You can take Stanford courses!
    5. Supporting materials
    6. Previous versions of the book
      1. Version 1.0
  9. Mining of Massive Datasets
    1. Preface
      1. What the Book Is About
      2. Prerequisites
      3. Exercises
      4. Support on the Web
      5. Gradiance Automated Homework
      6. Acknowledgements
    2. Contents
    3. 1 Data Mining
      1. 1.1 What is Data Mining?
        1. 1.1.1 Statistical Modeling
          1. Example 1.1
        2. 1.1.2 Machine Learning
        3. 1.1.3 Computational Approaches to Modeling
        4. 1.1.4 Summarization
          1. Example 1.2
          2. Figure 1.1: Plotting cholera cases on a map of London
        5. 1.1.5 Feature Extraction
      2. 1.2 Statistical Limits on Data Mining
        1. 1.2.1 Total Information Awareness
        2. 1.2.2 Bonferroni’s Principle
        3. 1.2.3 An Example of Bonferroni’s Principle
        4. 1.2.4 Exercises for Section 1.2
          1. Exercise 1.2.1
          2. Exercise 1.2.2
      3. 1.3 Things Useful to Know
        1. 1.3.1 Importance of Words in Documents
        2. Example 1.3
        3. 1.3.2 Hash Functions
        4. Example 1.4
        5. 1.3.3 Indexes
          1. Figure 1.2: A hash table used as an index
          2. Example 1.5
        6. 1.3.4 Secondary Storage
        7. 1.3.5 The Base of Natural Logarithms
          1. Example 1.6
        8. 1.3.6 Power Laws
          1. Figure 1.3: A power law with a slope of −2
          2. The Matthew Effect
          3. Example 1.8
        9. 1.3.7 Exercises for Section 1.3
          1. Exercise 1.3.1
          2. Exercise 1.3.2
          3. Exercise 1.3.3
          4. Exercise 1.3.4
          5. Exercise 1.3.5
      4. 1.4 Outline of the Book
      5. 1.5 Summary of Chapter 1
      6. 1.6 References for Chapter 1
      7. 1.7 Footnotes for Chapter 1
        1. 1
        2. 2
        3. 3
    4. 2 MapReduce and the New Software Stack
      1. 2.1 Distributed File Systems
        1. 2.1.1 Physical Organization of Compute Nodes
        2. 2.1.2 Large-Scale File-System Organization
      2. 2.2 MapReduce
        1. 2.2.1 The Map Tasks
        2. 2.2.2 Grouping by Key
        3. 2.2.3 The Reduce Tasks
        4. 2.2.4 Combiners
        5. 2.2.5 Details of MapReduce Execution
        6. 2.2.6 Coping With Node Failures
        7. 2.2.7 Exercises for Section 2.2
      3. 2.3 Algorithms Using MapReduce
        1. 2.3.1 Matrix-Vector Multiplication by MapReduce
        2. 2.3.2 If the Vector v Cannot Fit in Main Memory
        3. 2.3.3 Relational-Algebra Operations
        4. 2.3.4 Computing Selections by MapReduce
        5. 2.3.5 Computing Projections by MapReduce
        6. 2.3.6 Union, Intersection, and Difference by MapReduce
        7. 2.3.7 Computing Natural Join by MapReduce
        8. 2.3.8 Grouping and Aggregation by MapReduce
        9. 2.3.9 Matrix Multiplication
        10. 2.3.10 Matrix Multiplication with One MapReduce Step
        11. 2.3.11 Exercises for Section 2.3
      4. 2.4 Extensions to MapReduce
        1. 2.4.1 Workflow Systems
        2. 2.4.2 Recursive Extensions to MapReduce
        3. 2.4.3 Pregel
        4. 2.4.4 Exercises for Section 2.4
      5. 2.5 The Communication Cost Model
        1. 2.5.1 Communication-Cost for Task Networks
        2. 2.5.2 Wall-Clock Time
        3. 2.5.3 Multiway Joins
        4. 2.5.4 Exercises for Section 2.5
      6. 2.6 Complexity Theory for MapReduce
        1. 2.6.1 Reducer Size and Replication Rate
        2. 2.6.2 An Example: Similarity Joins
        3. 2.6.3 A Graph Model for MapReduce Problems
        4. 2.6.4 Mapping Schemas
        5. 2.6.5 When Not All Inputs Are Present
        6. 2.6.6 Lower Bounds on Replication Rate
        7. 2.6.7 Case Study: Matrix Multiplication
        8. 2.6.8 Exercises for Section 2.6
      7. 2.7 Summary of Chapter 2
      8. 2.8 References for Chapter 2
    5. 3 Finding Similar Items
      1. 3.1 Applications of Near-Neighbor Search
        1. 3.1.1 Jaccard Similarity of Sets
        2. 3.1.2 Similarity of Documents
        3. 3.1.3 Collaborative Filtering as a Similar-Sets Problem
        4. 3.1.4 Exercises for Section 3.1
      2. 3.2 Shingling of Documents
        1. 3.2.1 k-Shingles
        2. 3.2.2 Choosing the Shingle Size
        3. 3.2.3 Hashing Shingles
        4. 3.2.4 Shingles Built from Words
        5. 3.2.5 Exercises for Section 3.2
      3. 3.3 Similarity-Preserving Summaries of Sets
        1. 3.3.1 Matrix Representation of Sets
        2. 3.3.2 Minhashing
        3. 3.3.3 Minhashing and Jaccard Similarity
        4. 3.3.4 Minhash Signatures
        5. 3.3.5 Computing Minhash Signatures
        6. 3.3.6 Exercises for Section 3.3
      4. 3.4 Locality-Sensitive Hashing for Documents
        1. 3.4.1 LSH for Minhash Signatures
        2. 3.4.2 Analysis of the Banding Technique
        3. 3.4.3 Combining the Techniques
        4. 3.4.4 Exercises for Section 3.4
      5. 3.5 Distance Measures
        1. 3.5.1 Definition of a Distance Measure
        2. 3.5.2 Euclidean Distances
        3. 3.5.3 Jaccard Distance
        4. 3.5.4 Cosine Distance
        5. 3.5.5 Edit Distance
        6. 3.5.6 Hamming Distance
        7. 3.5.7 Exercises for Section 3.5
      6. 3.6 The Theory of Locality-Sensitive Functions
        1. 3.6.1 Locality-Sensitive Functions
        2. 3.6.2 Locality-Sensitive Families for Jaccard Distance
        3. 3.6.3 Amplifying a Locality-Sensitive Family
        4. 3.6.4 Exercises for Section 3.6
      7. 3.7 LSH Families for Other Distance Measures
        1. 3.7.1 LSH Families for Hamming Distance
        2. 3.7.2 Random Hyperplanes and the Cosine Distance
        3. 3.7.3 Sketches
        4. 3.7.4 LSH Families for Euclidean Distance
        5. 3.7.5 More LSH Families for Euclidean Spaces
        6. 3.7.6 Exercises for Section 3.7
      8. 3.8 Applications of Locality-Sensitive Hashing
        1. 3.8.1 Entity Resolution
        2. 3.8.2 An Entity-Resolution Example
        3. 3.8.3 Validating Record Matches
        4. 3.8.4 Matching Fingerprints
        5. 3.8.5 A LSH Family for Fingerprint Matching
        6. 3.8.6 Similar News Articles
        7. 3.8.7 Exercises for Section 3.8
      9. 3.9 Methods for High Degrees of Similarity
        1. 3.9.1 Finding Identical Items
        2. 3.9.2 Representing Sets as Strings
        3. 3.9.3 Length-Based Filtering
        4. 3.9.4 Prefix Indexing
        5. 3.9.5 Using Position Information
        6. 3.9.6 Using Position and Length in Indexes
        7. 3.9.7 Exercises for Section 3.9
      10. 3.10 Summary of Chapter 3
      11. 3.11 References for Chapter 3
    6. 4 Mining Data Streams
      1. 4.1 The Stream Data Model
        1. 4.1.1 A Data-Stream-Management System
        2. 4.1.2 Examples of Stream Sources
        3. 4.1.3 Stream Queries
        4. 4.1.4 Issues in Stream Processing
      2. 4.2 Sampling Data in a Stream
        1. 4.2.1 A Motivating Example
        2. 4.2.2 Obtaining a Representative Sample
        3. 4.2.3 The General Sampling Problem
        4. 4.2.4 Varying the Sample Size
        5. 4.2.5 Exercises for Section 4.2
      3. 4.3 Filtering Streams
        1. 4.3.1 A Motivating Example
        2. 4.3.2 The Bloom Filter
        3. 4.3.3 Analysis of Bloom Filtering
        4. 4.3.4 Exercises for Section 4.3
      4. 4.4 Counting Distinct Elements in a Stream
        1. 4.4.1 The Count-Distinct Problem
        2. 4.4.2 The Flajolet-Martin Algorithm
        3. 4.4.3 Combining Estimates
        4. 4.4.4 Space Requirements
        5. 4.4.5 Exercises for Section 4.4
      5. 4.5 Estimating Moments
        1. 4.5.1 Definition of Moments
        2. 4.5.2 The Alon-Matias-Szegedy Algorithm for Second Moments
        3. 4.5.3 Why the Alon-Matias-Szegedy Algorithm Works
        4. 4.5.4 Higher-Order Moments
        5. 4.5.5 Dealing With Infinite Streams
        6. 4.5.6 Exercises for Section 4.5
      6. 4.6 Counting Ones in a Window
        1. 4.6.1 The Cost of Exact Counts
        2. 4.6.2 The Datar-Gionis-Indyk-Motwani Algorithm
        3. 4.6.3 Storage Requirements for the DGIM Algorithm
        4. 4.6.4 Query Answering in the DGIM Algorithm
        5. 4.6.5 Maintaining the DGIM Conditions
        6. 4.6.6 Reducing the Error
        7. 4.6.7 Extensions to the Counting of Ones
        8. 4.6.8 Exercises for Section 4.6
      7. 4.7 Decaying Windows
      8. 4.8 Summary of Chapter 4
      9. 4.9 References for Chapter 4
    7. 5 Link Analysis
      1. 5.1 PageRank
        1. 5.1.1 Early Search Engines and Term Spam
        2. 5.1.2 Definition of PageRank
        3. 5.1.3 Structure of the Web
        4. 5.1.4 Avoiding Dead Ends
        5. 5.1.5 Spider Traps and Taxation
        6. 5.1.6 Using PageRank in a Search Engine
        7. 5.1.7 Exercises for Section 5.1
      2. 5.2 Efficient Computation of PageRank
        1. 5.2.1 Representing Transition Matrices
        2. 5.2.2 PageRank Iteration Using MapReduce
        3. 5.2.3 Use of Combiners to Consolidate the Result Vector
        4. 5.2.4 Representing Blocks of the Transition Matrix
        5. 5.2.5 Other Efficient Approaches to PageRank Iteration
        6. 5.2.6 Exercises for Section 5.2
      3. 5.3 Topic-Sensitive PageRank
        1. 5.3.1 Motivation for Topic-Sensitive Page Rank
        2. 5.3.2 Biased Random Walks
        3. 5.3.3 Using Topic-Sensitive PageRank
        4. 5.3.4 Inferring Topics from Words
        5. 5.3.5 Exercises for Section 5.3
      4. 5.4 Link Spam
        1. 5.4.1 Architecture of a Spam Farm
        2. 5.4.2 Analysis of a Spam Farm
        3. 5.4.3 Combating Link Spam
        4. 5.4.4 TrustRank
        5. 5.4.5 Spam Mass
        6. 5.4.6 Exercises for Section 5.4
      5. 5.5 Hubs and Authorities
        1. 5.5.1 The Intuition Behind HITS
        2. 5.5.2 Formalizing Hubbiness and Authority
        3. 5.5.3 Exercises for Section 5.5
      6. 5.6 Summary of Chapter 5
      7. 5.7 References for Chapter 5
    8. 6 Frequent Itemsets
      1. 6.1 The Market-Basket Model
        1. 6.1.1 Definition of Frequent Itemsets
        2. 6.1.2 Applications of Frequent Itemsets
        3. 6.1.3 Association Rules
        4. 6.1.4 Finding Association Rules with High Confidence
        5. 6.1.5 Exercises for Section 6.1
      2. 6.2 Market Baskets and the A-Priori Algorithm
        1. 6.2.1 Representation of Market-Basket Data
        2. 6.2.2 Use of Main Memory for Itemset Counting
        3. 6.2.3 Monotonicity of Itemsets
        4. 6.2.4 Tyranny of Counting Pairs
        5. 6.2.5 The A-Priori Algorithm
        6. 6.2.6 A-Priori for All Frequent Itemsets
        7. 6.2.7 Exercises for Section 6.2
      3. 6.3 Handling Larger Datasets in Main Memory
        1. 6.3.1 The Algorithm of Park, Chen, and Yu
        2. 6.3.2 The Multistage Algorithm
        3. 6.3.3 The Multihash Algorithm
        4. 6.3.4 Exercises for Section 6.3
      4. 6.4 Limited-Pass Algorithms
        1. 6.4.1 The Simple, Randomized Algorithm
        2. 6.4.2 Avoiding Errors in Sampling Algorithms
        3. 6.4.3 The Algorithm of Savasere, Omiecinski, and Navathe
        4. 6.4.4 The SON Algorithm and MapReduce
        5. 6.4.5 Toivonen’s Algorithm
        6. 6.4.6 Why Toivonen’s Algorithm Works
        7. 6.4.7 Exercises for Section 6.4
      5. 6.5 Counting Frequent Items in a Stream
        1. 6.5.1 Sampling Methods for Streams
        2. 6.5.2 Frequent Itemsets in Decaying Windows
        3. 6.5.3 Hybrid Methods
        4. 6.5.4 Exercises for Section 6.5
      6. 6.6 Summary of Chapter 6
      7. 6.7 References for Chapter 6
    9. 7 Clustering
      1. 7.1 Introduction to Clustering Techniques
        1. 7.1.1 Points, Spaces, and Distances
        2. 7.1.2 Clustering Strategies
        3. 7.1.3 The Curse of Dimensionality
        4. 7.1.4 Exercises for Section 7.1
      2. 7.2 Hierarchical Clustering
        1. 7.2.1 Hierarchical Clustering in a Euclidean Space
        2. 7.2.2 Efficiency of Hierarchical Clustering
        3. 7.2.3 Alternative Rules for Controlling Hierarchical Clustering
        4. 7.2.4 Hierarchical Clustering in Non-Euclidean Spaces
        5. 7.2.5 Exercises for Section 7.2
      3. 7.3 K-means Algorithms
        1. 7.3.1 K-Means Basics
        2. 7.3.2 Initializing Clusters for K-Means
        3. 7.3.3 Picking the Right Value of k
        4. 7.3.4 The Algorithm of Bradley, Fayyad, and Reina
        5. 7.3.5 Processing Data in the BFR Algorithm
        6. 7.3.6 Exercises for Section 7.3
      4. 7.4 The CURE Algorithm
        1. 7.4.1 Initialization in CURE
        2. 7.4.2 Completion of the CURE Algorithm
        3. 7.4.3 Exercises for Section 7.4
      5. 7.5 Clustering in Non-Euclidean Spaces
        1. 7.5.1 Representing Clusters in the GRGPF Algorithm
        2. 7.5.2 Initializing the Cluster Tree
        3. 7.5.3 Adding Points in the GRGPF Algorithm
        4. 7.5.4 Splitting and Merging Clusters
        5. 7.5.5 Exercises for Section 7.5
      6. 7.6 Clustering for Streams and Parallelism
        1. 7.6.1 The Stream-Computing Model
        2. 7.6.2 A Stream-Clustering Algorithm
        3. 7.6.3 Initializing Buckets
        4. 7.6.4 Merging Buckets
        5. 7.6.5 Answering Queries
        6. 7.6.6 Clustering in a Parallel Environment
        7. 7.6.7 Exercises for Section 7.6
      7. 7.7 Summary of Chapter 7
      8. 7.8 References for Chapter 7
    10. 8 Advertising on the Web
      1. 8.1 Issues in On-Line Advertising
        1. 8.1.1 Advertising Opportunities
        2. 8.1.2 Direct Placement of Ads
        3. 8.1.3 Issues for Display Ads
      2. 8.2 On-Line Algorithms
        1. 8.2.1 On-Line and Off-Line Algorithms
        2. 8.2.2 Greedy Algorithms
        3. 8.2.3 The Competitive Ratio
        4. 8.2.4 Exercises for Section 8.2
      3. 8.3 The Matching Problem
        1. 8.3.1 Matches and Perfect Matches
        2. 8.3.2 The Greedy Algorithm for Maximal Matching
        3. 8.3.3 Competitive Ratio for Greedy Matching
        4. 8.3.4 Exercises for Section 8.3
      4. 8.4 The Adwords Problem
        1. 8.4.1 History of Search Advertising
        2. 8.4.2 Definition of the Adwords Problem
        3. 8.4.3 The Greedy Approach to the Adwords Problem
        4. 8.4.4 The Balance Algorithm
        5. 8.4.5 A Lower Bound on Competitive Ratio for Balance
        6. 8.4.6 The Balance Algorithm with Many Bidders
        7. 8.4.7 The Generalized Balance Algorithm
        8. 8.4.8 Final Observations About the Adwords Problem
        9. 8.4.9 Exercises for Section 8.4
      5. 8.5 Adwords Implementation
        1. 8.5.1 Matching Bids and Search Queries
        2. 8.5.2 More Complex Matching Problems
        3. 8.5.3 A Matching Algorithm for Documents and Bids
      6. 8.6 Summary of Chapter 8
      7. 8.7 References for Chapter 8
    11. 9 Recommendation Systems
      1. 9.1 A Model for Recommendation Systems
        1. 9.1.1 The Utility Matrix
        2. 9.1.2 The Long Tail
        3. 9.1.3 Applications of Recommendation Systems
        4. 9.1.4 Populating the Utility Matrix
      2. 9.2 Content-Based Recommendations
        1. 9.2.1 Item Profiles
        2. 9.2.2 Discovering Features of Documents
        3. 9.2.3 Obtaining Item Features From Tags
        4. 9.2.4 Representing Item Profiles
        5. 9.2.5 User Profiles
        6. 9.2.6 Recommending Items to Users Based on Content
        7. 9.2.7 Classification Algorithms
        8. 9.2.8 Exercises for Section 9.2
      3. 9.3 Collaborative Filtering
        1. 9.3.1 Measuring Similarity
        2. 9.3.2 The Duality of Similarity
        3. 9.3.3 Clustering Users and Items
        4. 9.3.4 Exercises for Section 9.3
      4. 9.4 Dimensionality Reduction
        1. 9.4.1 UV-Decomposition
        2. 9.4.2 Root-Mean-Square Error
        3. 9.4.3 Incremental Computation of a UV-Decomposition
        4. 9.4.4 Optimizing an Arbitrary Element
        5. 9.4.5 Building a Complete UV-Decomposition Algorithm
        6. 9.4.6 Exercises for Section 9.4
      5. 9.5 The NetFlix Challenge
      6. 9.6 Summary of Chapter 9
      7. 9.7 References for Chapter 9
    12. 10 Mining Social-Network Graphs
      1. 10.1 Social Networks as Graphs
        1. 10.1.1 What is a Social Network?
        2. 10.1.2 Social Networks as Graphs
        3. 10.1.3 Varieties of Social Networks
        4. 10.1.4 Graphs With Several Node Types
        5. 10.1.5 Exercises for Section 10.1
      2. 10.2 Clustering of Social-Network Graphs
        1. 10.2.1 Distance Measures for Social-Network Graphs
        2. 10.2.2 Applying Standard Clustering Methods
        3. 10.2.3 Betweenness
        4. 10.2.4 The Girvan-Newman Algorithm
        5. 10.2.5 Using Betweenness to Find Communities
        6. 10.2.6 Exercises for Section 10.2
      3. 10.3 Direct Discovery of Communities
        1. 10.3.1 Finding Cliques
        2. 10.3.2 Complete Bipartite Graphs
        3. 10.3.3 Finding Complete Bipartite Subgraphs
        4. 10.3.4 Why Complete Bipartite Graphs Must Exist
        5. 10.3.5 Exercises for Section 10.3
      4. 10.4 Partitioning of Graphs
        1. 10.4.1 What Makes a Good Partition?
        2. 10.4.2 Normalized Cuts
        3. 10.4.3 Some Matrices That Describe Graphs
        4. 10.4.4 Eigenvalues of the Laplacian Matrix
        5. 10.4.5 Alternative Partitioning Methods
        6. 10.4.6 Exercises for Section 10.4
      5. 10.5 Finding Overlapping Communities
        1. 10.5.1 The Nature of Communities
        2. 10.5.2 Maximum-Likelihood Estimation
        3. 10.5.3 The Affiliation-Graph Model
        4. 10.5.4 Avoiding the Use of Discrete Membership Changes
        5. 10.5.5 Exercises for Section 10.5
      6. 10.6 Simrank
        1. 10.6.1 Random Walkers on a Social Graph
        2. 10.6.2 Random Walks with Restart
        3. 10.6.3 Exercises for Section 10.6
      7. 10.7 Counting Triangles
        1. 10.7.1 Why Count Triangles?
        2. 10.7.2 An Algorithm for Finding Triangles
        3. 10.7.3 Optimality of the Triangle-Finding Algorithm
        4. 10.7.4 Finding Triangles Using MapReduce
        5. 10.7.5 Using Fewer Reduce Tasks
        6. 10.7.6 Exercises for Section 10.7
      8. 10.8 Neighborhood Properties of Graphs
        1. 10.8.1 Directed Graphs and Neighborhoods
        2. 10.8.2 The Diameter of a Graph
        3. 10.8.3 Transitive Closure and Reachability
        4. 10.8.4 Transitive Closure Via MapReduce
        5. 10.8.5 Smart Transitive Closure
        6. 10.8.6 Transitive Closure by Graph Reduction
        7. 10.8.7 Approximating the Sizes of Neighborhoods
        8. 10.8.8 Exercises for Section 10.8
      9. 10.9 Summary of Chapter 10
      10. 10.10 References for Chapter 10
    13. 11 Dimensionality Reduction
      1. 11.1 Eigenvalues and Eigenvectors
        1. 11.1.1 Definitions
        2. 11.1.2 Computing Eigenvalues and Eigenvectors
        3. 11.1.3 Finding Eigenpairs by Power Iteration
        4. 11.1.4 The Matrix of Eigenvectors
        5. 11.1.5 Exercises for Section 11.1
      2. 11.2 Principal-Component Analysis
        1. 11.2.1 An Illustrative Example
        2. 11.2.2 Using Eigenvectors for Dimensionality Reduction
        3. 11.2.3 The Matrix of Distances
        4. 11.2.4 Exercises for Section 11.2
      3. 11.3 Singular-Value Decomposition
        1. 11.3.1 Definition of SVD
        2. 11.3.2 Interpretation of SVD
        3. 11.3.3 Dimensionality Reduction Using SVD
        4. 11.3.4 Why Zeroing Low Singular Values Works
        5. 11.3.5 Querying Using Concepts
        6. 11.3.6 Computing the SVD of a Matrix
        7. 11.3.7 Exercises for Section 11.3
      4. 11.4 CUR Decomposition
        1. 11.4.1 Definition of CUR
        2. 11.4.2 Choosing Rows and Columns Properly
        3. 11.4.3 Constructing the Middle Matrix
        4. 11.4.4 The Complete CUR Decomposition
        5. 11.4.5 Eliminating Duplicate Rows and Columns
        6. 11.4.6 Exercises for Section 11.4
      5. 11.5 Summary of Chapter 11
      6. 11.6 References for Chapter 11
    14. 12 Large-Scale Machine Learning
      1. 12.1 The Machine-Learning Model 
        1. 12.1.1 Training Sets
        2. 12.1.2 Some Illustrative Examples
        3. 12.1.3 Approaches to Machine Learning
        4. 12.1.4 Machine-Learning Architecture
        5. 12.1.5 Exercises for Section 12.1
      2. 12.2 Perceptrons
        1. 12.2.1 Training a Perceptron with Zero Threshold
        2. 12.2.2 Convergence of Perceptrons
        3. 12.2.3 The Winnow Algorithm
        4. 12.2.4 Allowing the Threshold to Vary
        5. 12.2.5 Multiclass Perceptrons
        6. 12.2.6 Transforming the Training Set
        7. 12.2.7 Problems With Perceptrons
        8. 12.2.8 Parallel Implementation of Perceptrons
        9. 12.2.9 Exercises for Section 12.2
      3. 12.3 Support-Vector Machines
        1. 12.3.1 The Mechanics of an SVM
        2. 12.3.2 Normalizing the Hyperplane
        3. 12.3.3 Finding Optimal Approximate Separators
        4. 12.3.4 SVM Solutions by Gradient Descent
        5. 12.3.5 Stochastic Gradient Descent
        6. 12.3.6 Parallel Implementation of SVM
        7. 12.3.7 Exercises for Section 12.3
      4. 12.4 Learning from Nearest Neighbors
        1. 12.4.1 The Framework for Nearest-Neighbor Calculations
        2. 12.4.2 Learning with One Nearest Neighbor
        3. 12.4.3 Learning One-Dimensional Functions
        4. 12.4.4 Kernel Regression
        5. 12.4.5 Dealing with High-Dimensional Euclidean Data
        6. 12.4.6 Dealing with Non-Euclidean Distances
        7. 12.4.7 Exercises for Section 12.4
      5. 12.5 Comparison of Learning Methods
      6. 12.6 Summary of Chapter 12
      7. 12.7 References for Chapter 12

        3. 3.4.3 Combining the Techniques
        4. 3.4.4 Exercises for Section 3.4
      5. 3.5 Distance Measures
        1. 3.5.1 Definition of a Distance Measure
        2. 3.5.2 Euclidean Distances
        3. 3.5.3 Jaccard Distance
        4. 3.5.4 Cosine Distance
        5. 3.5.5 Edit Distance
        6. 3.5.6 Hamming Distance
        7. 3.5.7 Exercises for Section 3.5
      6. 3.6 The Theory of Locality-Sensitive Functions
        1. 3.6.1 Locality-Sensitive Functions
        2. 3.6.2 Locality-Sensitive Families for Jaccard Distance
        3. 3.6.3 Amplifying a Locality-Sensitive Family
        4. 3.6.4 Exercises for Section 3.6
      7. 3.7 LSH Families for Other Distance Measures
        1. 3.7.1 LSH Families for Hamming Distance
        2. 3.7.2 Random Hyperplanes and the Cosine Distance
        3. 3.7.3 Sketches
        4. 3.7.4 LSH Families for Euclidean Distance
        5. 3.7.5 More LSH Families for Euclidean Spaces
        6. 3.7.6 Exercises for Section 3.7
      8. 3.8 Applications of Locality-Sensitive Hashing
        1. 3.8.1 Entity Resolution
        2. 3.8.2 An Entity-Resolution Example
        3. 3.8.3 Validating Record Matches
        4. 3.8.4 Matching Fingerprints
        5. 3.8.5 A LSH Family for Fingerprint Matching
        6. 3.8.6 Similar News Articles
        7. 3.8.7 Exercises for Section 3.8
      9. 3.9 Methods for High Degrees of Similarity
        1. 3.9.1 Finding Identical Items
        2. 3.9.2 Representing Sets as Strings
        3. 3.9.3 Length-Based Filtering
        4. 3.9.4 Prefix Indexing
        5. 3.9.5 Using Position Information
        6. 3.9.6 Using Position and Length in Indexes
        7. 3.9.7 Exercises for Section 3.9
      10. 3.10 Summary of Chapter 3
      11. 3.11 References for Chapter 3
    6. 4 Mining Data Streams
      1. 4.1 The Stream Data Model
        1. 4.1.1 A Data-Stream-Management System
        2. 4.1.2 Examples of Stream Sources
        3. 4.1.3 Stream Queries
        4. 4.1.4 Issues in Stream Processing
      2. 4.2 Sampling Data in a Stream
        1. 4.2.1 A Motivating Example
        2. 4.2.2 Obtaining a Representative Sample
        3. 4.2.3 The General Sampling Problem
        4. 4.2.4 Varying the Sample Size
        5. 4.2.5 Exercises for Section 4.2
      3. 4.3 Filtering Streams
        1. 4.3.1 A Motivating Example
        2. 4.3.2 The Bloom Filter
        3. 4.3.3 Analysis of Bloom Filtering
        4. 4.3.4 Exercises for Section 4.3
      4. 4.4 Counting Distinct Elements in a Stream
        1. 4.4.1 The Count-Distinct Problem
        2. 4.4.2 The Flajolet-Martin Algorithm
        3. 4.4.3 Combining Estimates
        4. 4.4.4 Space Requirements
        5. 4.4.5 Exercises for Section 4.4
      5. 4.5 Estimating Moments
        1. 4.5.1 Definition of Moments
        2. 4.5.2 The Alon-Matias-Szegedy Algorithm for Second Moments
        3. 4.5.3 Why the Alon-Matias-Szegedy Algorithm Works
        4. 4.5.4 Higher-Order Moments
        5. 4.5.5 Dealing With Infinite Streams
        6. 4.5.6 Exercises for Section 4.5
      6. 4.6 Counting Ones in a Window
        1. 4.6.1 The Cost of Exact Counts
        2. 4.6.2 The Datar-Gionis-Indyk-Motwani Algorithm
        3. 4.6.3 Storage Requirements for the DGIM Algorithm
        4. 4.6.4 Query Answering in the DGIM Algorithm
        5. 4.6.5 Maintaining the DGIM Conditions
        6. 4.6.6 Reducing the Error
        7. 4.6.7 Extensions to the Counting of Ones
        8. 4.6.8 Exercises for Section 4.6
      7. 4.7 Decaying Windows
      8. 4.8 Summary of Chapter 4
      9. 4.9 References for Chapter 4
    7. 5 Link Analysis
      1. 5.1 PageRank
        1. 5.1.1 Early Search Engines and Term Spam
        2. 5.1.2 Definition of PageRank
        3. 5.1.3 Structure of the Web
        4. 5.1.4 Avoiding Dead Ends
        5. 5.1.5 Spider Traps and Taxation
        6. 5.1.6 Using PageRank in a Search Engine
        7. 5.1.7 Exercises for Section 5.1
      2. 5.2 Efficient Computation of PageRank
        1. 5.2.1 Representing Transition Matrices
        2. 5.2.2 PageRank Iteration Using MapReduce
        3. 5.2.3 Use of Combiners to Consolidate the Result Vector
        4. 5.2.4 Representing Blocks of the Transition Matrix
        5. 5.2.5 Other Efficient Approaches to PageRank Iteration
        6. 5.2.6 Exercises for Section 5.2
      3. 5.3 Topic-Sensitive PageRank
        1. 5.3.1 Motivation for Topic-Sensitive Page Rank
        2. 5.3.2 Biased Random Walks
        3. 5.3.3 Using Topic-Sensitive PageRank
        4. 5.3.4 Inferring Topics from Words
        5. 5.3.5 Exercises for Section 5.3
      4. 5.4 Link Spam
        1. 5.4.1 Architecture of a Spam Farm
        2. 5.4.2 Analysis of a Spam Farm
        3. 5.4.3 Combating Link Spam
        4. 5.4.4 TrustRank
        5. 5.4.5 Spam Mass
        6. 5.4.6 Exercises for Section 5.4
      5. 5.5 Hubs and Authorities
        1. 5.5.1 The Intuition Behind HITS
        2. 5.5.2 Formalizing Hubbiness and Authority
        3. 5.5.3 Exercises for Section 5.5
      6. 5.6 Summary of Chapter 5
      7. 5.7 References for Chapter 5
    8. 6 Frequent Itemsets
      1. 6.1 The Market-Basket Model
        1. 6.1.1 Definition of Frequent Itemsets
        2. 6.1.2 Applications of Frequent Itemsets
        3. 6.1.3 Association Rules
        4. 6.1.4 Finding Association Rules with High Confidence
        5. 6.1.5 Exercises for Section 6.1
      2. 6.2 Market Baskets and the A-Priori Algorithm
        1. 6.2.1 Representation of Market-Basket Data
        2. 6.2.2 Use of Main Memory for Itemset Counting
        3. 6.2.3 Monotonicity of Itemsets
        4. 6.2.4 Tyranny of Counting Pairs
        5. 6.2.5 The A-Priori Algorithm
        6. 6.2.6 A-Priori for All Frequent Itemsets
        7. 6.2.7 Exercises for Section 6.2
      3. 6.3 Handling Larger Datasets in Main Memory
        1. 6.3.1 The Algorithm of Park, Chen, and Yu
        2. 6.3.2 The Multistage Algorithm
        3. 6.3.3 The Multihash Algorithm
        4. 6.3.4 Exercises for Section 6.3
      4. 6.4 Limited-Pass Algorithms
        1. 6.4.1 The Simple, Randomized Algorithm
        2. 6.4.2 Avoiding Errors in Sampling Algorithms
        3. 6.4.3 The Algorithm of Savasere, Omiecinski, and Navathe
        4. 6.4.4 The SON Algorithm and MapReduce
        5. 6.4.5 Toivonen’s Algorithm
        6. 6.4.6 Why Toivonen’s Algorithm Works
        7. 6.4.7 Exercises for Section 6.4
      5. 6.5 Counting Frequent Items in a Stream
        1. 6.5.1 Sampling Methods for Streams
        2. 6.5.2 Frequent Itemsets in Decaying Windows
        3. 6.5.3 Hybrid Methods
        4. 6.5.4 Exercises for Section 6.5
      6. 6.6 Summary of Chapter 6
      7. 6.7 References for Chapter 6
    9. 7 Clustering
      1. 7.1 Introduction to Clustering Techniques
        1. 7.1.1 Points, Spaces, and Distances
        2. 7.1.2 Clustering Strategies
        3. 7.1.3 The Curse of Dimensionality
        4. 7.1.4 Exercises for Section 7.1
      2. 7.2 Hierarchical Clustering
        1. 7.2.1 Hierarchical Clustering in a Euclidean Space
        2. 7.2.2 Efficiency of Hierarchical Clustering
        3. 7.2.3 Alternative Rules for Controlling Hierarchical Clustering
        4. 7.2.4 Hierarchical Clustering in Non-Euclidean Spaces
        5. 7.2.5 Exercises for Section 7.2
      3. 7.3 K-means Algorithms
        1. 7.3.1 K-Means Basics
        2. 7.3.2 Initializing Clusters for K-Means
        3. 7.3.3 Picking the Right Value of k
        4. 7.3.4 The Algorithm of Bradley, Fayyad, and Reina
        5. 7.3.5 Processing Data in the BFR Algorithm
        6. 7.3.6 Exercises for Section 7.3
      4. 7.4 The CURE Algorithm
        1. 7.4.1 Initialization in CURE
        2. 7.4.2 Completion of the CURE Algorithm
        3. 7.4.3 Exercises for Section 7.4
      5. 7.5 Clustering in Non-Euclidean Spaces
        1. 7.5.1 Representing Clusters in the GRGPF Algorithm
        2. 7.5.2 Initializing the Cluster Tree
        3. 7.5.3 Adding Points in the GRGPF Algorithm
        4. 7.5.4 Splitting and Merging Clusters
        5. 7.5.5 Exercises for Section 7.5
      6. 7.6 Clustering for Streams and Parallelism
        1. 7.6.1 The Stream-Computing Model
        2. 7.6.2 A Stream-Clustering Algorithm
        3. 7.6.3 Initializing Buckets
        4. 7.6.4 Merging Buckets
        5. 7.6.5 Answering Queries
        6. 7.6.6 Clustering in a Parallel Environment
        7. 7.6.7 Exercises for Section 7.6
      7. 7.7 Summary of Chapter 7
      8. 7.8 References for Chapter 7
    10. 8 Advertising on the Web
      1. 8.1 Issues in On-Line Advertising
        1. 8.1.1 Advertising Opportunities
        2. 8.1.2 Direct Placement of Ads
        3. 8.1.3 Issues for Display Ads
      2. 8.2 On-Line Algorithms
        1. 8.2.1 On-Line and Off-Line Algorithms
        2. 8.2.2 Greedy Algorithms
        3. 8.2.3 The Competitive Ratio
        4. 8.2.4 Exercises for Section 8.2
      3. 8.3 The Matching Problem
        1. 8.3.1 Matches and Perfect Matches
        2. 8.3.2 The Greedy Algorithm for Maximal Matching
        3. 8.3.3 Competitive Ratio for Greedy Matching
        4. 8.3.4 Exercises for Section 8.3
      4. 8.4 The Adwords Problem
        1. 8.4.1 History of Search Advertising
        2. 8.4.2 Definition of the Adwords Problem
        3. 8.4.3 The Greedy Approach to the Adwords Problem
        4. 8.4.4 The Balance Algorithm
        5. 8.4.5 A Lower Bound on Competitive Ratio for Balance
        6. 8.4.6 The Balance Algorithm with Many Bidders
        7. 8.4.7 The Generalized Balance Algorithm
        8. 8.4.8 Final Observations About the Adwords Problem
        9. 8.4.9 Exercises for Section 8.4
      5. 8.5 Adwords Implementation
        1. 8.5.1 Matching Bids and Search Queries
        2. 8.5.2 More Complex Matching Problems
        3. 8.5.3 A Matching Algorithm for Documents and Bids
      6. 8.6 Summary of Chapter 8
      7. 8.7 References for Chapter 8
    11. 9 Recommendation Systems
      1. 9.1 A Model for Recommendation Systems
        1. 9.1.1 The Utility Matrix
        2. 9.1.2 The Long Tail
        3. 9.1.3 Applications of Recommendation Systems
        4. 9.1.4 Populating the Utility Matrix
      2. 9.2 Content-Based Recommendations
        1. 9.2.1 Item Profiles
        2. 9.2.2 Discovering Features of Documents
        3. 9.2.3 Obtaining Item Features From Tags
        4. 9.2.4 Representing Item Profiles
        5. 9.2.5 User Profiles
        6. 9.2.6 Recommending Items to Users Based on Content
        7. 9.2.7 Classification Algorithms
        8. 9.2.8 Exercises for Section 9.2
      3. 9.3 Collaborative Filtering
        1. 9.3.1 Measuring Similarity
        2. 9.3.2 The Duality of Similarity
        3. 9.3.3 Clustering Users and Items
        4. 9.3.4 Exercises for Section 9.3
      4. 9.4 Dimensionality Reduction
        1. 9.4.1 UV-Decomposition
        2. 9.4.2 Root-Mean-Square Error
        3. 9.4.3 Incremental Computation of a UV-Decomposition
        4. 9.4.4 Optimizing an Arbitrary Element
        5. 9.4.5 Building a Complete UV-Decomposition Algorithm
        6. 9.4.6 Exercises for Section 9.4
      5. 9.5 The NetFlix Challenge
      6. 9.6 Summary of Chapter 9
      7. 9.7 References for Chapter 9
    12. 10 Mining Social-Network Graphs
      1. 10.1 Social Networks as Graphs
        1. 10.1.1 What is a Social Network?
        2. 10.1.2 Social Networks as Graphs
        3. 10.1.3 Varieties of Social Networks
        4. 10.1.4 Graphs With Several Node Types
        5. 10.1.5 Exercises for Section 10.1
      2. 10.2 Clustering of Social-Network Graphs
        1. 10.2.1 Distance Measures for Social-Network Graphs
        2. 10.2.2 Applying Standard Clustering Methods
        3. 10.2.3 Betweenness
        4. 10.2.4 The Girvan-Newman Algorithm
        5. 10.2.5 Using Betweenness to Find Communities
        6. 10.2.6 Exercises for Section 10.2
      3. 10.3 Direct Discovery of Communities
        1. 10.3.1 Finding Cliques
        2. 10.3.2 Complete Bipartite Graphs
        3. 10.3.3 Finding Complete Bipartite Subgraphs
        4. 10.3.4 Why Complete Bipartite Graphs Must Exist
        5. 10.3.5 Exercises for Section 10.3
      4. 10.4 Partitioning of Graphs
        1. 10.4.1 What Makes a Good Partition?
        2. 10.4.2 Normalized Cuts
        3. 10.4.3 Some Matrices That Describe Graphs
        4. 10.4.4 Eigenvalues of the Laplacian Matrix
        5. 10.4.5 Alternative Partitioning Methods
        6. 10.4.6 Exercises for Section 10.4
      5. 10.5 Finding Overlapping Communities
        1. 10.5.1 The Nature of Communities
        2. 10.5.2 Maximum-Likelihood Estimation
        3. 10.5.3 The Affiliation-Graph Model
        4. 10.5.4 Avoiding the Use of Discrete Membership Changes
        5. 10.5.5 Exercises for Section 10.5
      6. 10.6 Simrank
        1. 10.6.1 Random Walkers on a Social Graph
        2. 10.6.2 Random Walks with Restart
        3. 10.6.3 Exercises for Section 10.6
      7. 10.7 Counting Triangles
        1. 10.7.1 Why Count Triangles?
        2. 10.7.2 An Algorithm for Finding Triangles
        3. 10.7.3 Optimality of the Triangle-Finding Algorithm
        4. 10.7.4 Finding Triangles Using MapReduce
        5. 10.7.5 Using Fewer Reduce Tasks
        6. 10.7.6 Exercises for Section 10.7
      8. 10.8 Neighborhood Properties of Graphs
        1. 10.8.1 Directed Graphs and Neighborhoods
        2. 10.8.2 The Diameter of a Graph
        3. 10.8.3 Transitive Closure and Reachability
        4. 10.8.4 Transitive Closure Via MapReduce
        5. 10.8.5 Smart Transitive Closure
        6. 10.8.6 Transitive Closure by Graph Reduction
        7. 10.8.7 Approximating the Sizes of Neighborhoods
        8. 10.8.8 Exercises for Section 10.8
      9. 10.9 Summary of Chapter 10
      10. 10.10 References for Chapter 10
    13. 11 Dimensionality Reduction
      1. 11.1 Eigenvalues and Eigenvectors
        1. 11.1.1 Definitions
        2. 11.1.2 Computing Eigenvalues and Eigenvectors
        3. 11.1.3 Finding Eigenpairs by Power Iteration
        4. 11.1.4 The Matrix of Eigenvectors
        5. 11.1.5 Exercises for Section 11.1
      2. 11.2 Principal-Component Analysis
        1. 11.2.1 An Illustrative Example
        2. 11.2.2 Using Eigenvectors for Dimensionality Reduction
        3. 11.2.3 The Matrix of Distances
        4. 11.2.4 Exercises for Section 11.2
      3. 11.3 Singular-Value Decomposition
        1. 11.3.1 Definition of SVD
        2. 11.3.2 Interpretation of SVD
        3. 11.3.3 Dimensionality Reduction Using SVD
        4. 11.3.4 Why Zeroing Low Singular Values Works
        5. 11.3.5 Querying Using Concepts
        6. 11.3.6 Computing the SVD of a Matrix
        7. 11.3.7 Exercises for Section 11.3
      4. 11.4 CUR Decomposition
        1. 11.4.1 Definition of CUR
        2. 11.4.2 Choosing Rows and Columns Properly
        3. 11.4.3 Constructing the Middle Matrix
        4. 11.4.4 The Complete CUR Decomposition
        5. 11.4.5 Eliminating Duplicate Rows and Columns
        6. 11.4.6 Exercises for Section 11.4
      5. 11.5 Summary of Chapter 11
      6. 11.6 References for Chapter 11
    14. 12 Large-Scale Machine Learning
      1. 12.1 The Machine-Learning Model 
        1. 12.1.1 Training Sets
        2. 12.1.2 Some Illustrative Examples
        3. 12.1.3 Approaches to Machine Learning
        4. 12.1.4 Machine-Learning Architecture
        5. 12.1.5 Exercises for Section 12.1
      2. 12.2 Perceptrons
        1. 12.2.1 Training a Perceptron with Zero Threshold
        2. 12.2.2 Convergence of Perceptrons
        3. 12.2.3 The Winnow Algorithm
        4. 12.2.4 Allowing the Threshold to Vary
        5. 12.2.5 Multiclass Perceptrons
        6. 12.2.6 Transforming the Training Set
        7. 12.2.7 Problems With Perceptrons
        8. 12.2.8 Parallel Implementation of Perceptrons
        9. 12.2.9 Exercises for Section 12.2
      3. 12.3 Support-Vector Machines
        1. 12.3.1 The Mechanics of an SVM
        2. 12.3.2 Normalizing the Hyperplane
        3. 12.3.3 Finding Optimal Approximate Separators
        4. 12.3.4 SVM Solutions by Gradient Descent
        5. 12.3.5 Stochastic Gradient Descent
        6. 12.3.6 Parallel Implementation of SVM
        7. 12.3.7 Exercises for Section 12.3
      4. 12.4 Learning from Nearest Neighbors
        1. 12.4.1 The Framework for Nearest-Neighbor Calculations
        2. 12.4.2 Learning with One Nearest Neighbor
        3. 12.4.3 Learning One-Dimensional Functions
        4. 12.4.4 Kernel Regression
        5. 12.4.5 Dealing with High-Dimensional Euclidean Data
        6. 12.4.6 Dealing with Non-Euclidean Distances
        7. 12.4.7 Exercises for Section 12.4
      5. 12.5 Comparison of Learning Methods
      6. 12.6 Summary of Chapter 12
      7. 12.7 References for Chapter 12

Story

Slides

Spotfire Dashboard

Research Notes

CS246: Mining Massive Datasets Slides

Chapter 1 Introduction

Slides

Slide 1 Mining of Massive Datasets: Course Introduction

JureLeskovecChapter01Slide1.PNG

Slide 2 What is Data Mining? Knowledge discovery from data

JureLeskovecChapter01Slide2.PNG

Slide 3 Some Massive Data Statistics

JureLeskovecChapter01Slide3.PNG

Slide 4 Data contains value and knowledge

JureLeskovecChapter01Slide4.PNG

Slide 5 Data Mining

JureLeskovecChapter01Slide5.PNG

Slide 6 Good news: Demand for Data Mining

JureLeskovecChapter01Slide6.PNG

Slide 7 What is Data Mining?

JureLeskovecChapter01Slide7.PNG

Slide 8 Data Mining Tasks

JureLeskovecChapter01Slide8.PNG

Slide 9 Meaningfulness of Analytic Answers 1

JureLeskovecChapter01Slide9.PNG

Slide 10 Meaningfulness of Analytic Answers 2

JureLeskovecChapter01Slide10.PNG

Slide 11 What matters when dealing with data?

JureLeskovecChapter01Slide11.PNG

Slide 12 Data Mining: Cultures

JureLeskovecChapter01Slide12.PNG

Slide 13 This Class: CS246

JureLeskovecChapter01Slide13.PNG

Slide 14 What will we learn? 1

JureLeskovecChapter01Slide14.PNG

Slide 15 What will we learn? 2

JureLeskovecChapter01Slide15.PNG

Slide 16 How It All Fits Together

JureLeskovecChapter01Slide16.PNG

Slide 17 How do you want that data?

JureLeskovecChapter01Slide17.PNG

Slide 18 About the Course

JureLeskovecChapter01Slide18.PNG

Slide 19 2014 CS246 Course Staff

JureLeskovecChapter01Slide19.PNG

Slide 20 Course Logistics

JureLeskovecChapter01Slide20.PNG

Slide 21 Logistics: Communication

JureLeskovecChapter01Slide21.PNG

Slide 22 Work for the Course 1

JureLeskovecChapter01Slide22.PNG

Slide 23 Work for the Course 2

JureLeskovecChapter01Slide23.PNG

Slide 24 Course Calender

JureLeskovecChapter01Slide24.PNG

Slide 25 Prerequisites

JureLeskovecChapter01Slide25.PNG

Slide 26 Recitation Sessions

JureLeskovecChapter01Slide26.PNG

Slide 27 What's after the class

JureLeskovecChapter01Slide27.PNG

Slide 28 3 To-do items

JureLeskovecChapter01Slide28.PNG

Hadoop Tutorial

Source: http://web.stanford.edu/class/cs246/...s/tutorial.pdf (PDF)

General Instructions

The purpose of this tutorial is (1) to get you started with Hadoop and (2) to get you acquainted with the code and homework submission system. Completing the tutorial is optional, but students who hand in the results on time will earn 5 points. This tutorial is to be completed individually.

Here you will learn how to write, compile, debug, and execute a simple Hadoop program. The first part of the assignment serves as a tutorial, and the second part asks you to write your own Hadoop program.

Section 1 describes the virtual machine environment. Instead of the virtual machine, you are welcome to set up your own pseudo-distributed or fully distributed cluster if you prefer. Any version of Hadoop that is at least 1.0 will suffice. (For an easy way to set up a cluster, try Cloudera Manager: http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin.) If you choose to set up your own cluster, you are responsible for making sure the cluster is working properly. The TAs will be unable to help you debug configuration issues in your own cluster.

Section 2 explains how to use the Eclipse environment in the virtual machine, including how to create a project, how to run jobs, and how to debug jobs. Section 2.5 gives an end-to-end example of creating a project, adding code, building, running, and debugging it.

Section 3 is the actual homework assignment. There are no deliverables for Sections 1 and 2. In Section 3, you are asked to write and submit your own MapReduce job.

This assignment requires you to upload the code and hand-in the output for Section 3.

All students should submit the output via Scoryst and upload the code:

Scoryst: create account at https://scoryst.com/enroll/suyVLAjbns

Upload the code: http://snap.stanford.edu/submit/

1 Setting up a virtual machine

  • Download and install VirtualBox on your machine: http://virtualbox.org/wiki/Downloads
  • Download the Cloudera Quickstart VM at http://www.cloudera.com/content/clou...cdh-5-3-x.html
  • Uncompress the VM archive. It is compressed with 7-zip. If needed you can download a tool to uncompress the archive at http://www.7-zip.org/.
  • Start VirtualBox and click Import Appliance in the File dropdown menu. Click the folder icon beside the location field. Browse to the uncompressed archive folder, select the .ovf file, and click the Open button. Click the Continue button. Click the Import button.
  • Your virtual machine should now appear in the left column. Select it and click on Start to launch it.
  • To verify that the VM is running and you can access it, open a browser to the URL: http://localhost:8088. You should see the resource manager UI. The VM uses port forwarding for the common Hadoop ports, so when the VM is running, those ports on localhost will redirect to the VM.
  • Optional: Open the VirtualBox preferences (File -> Preferences -> Network) and select the Adapter 2 tab. Click the Enable Network Adapter checkbox. Select Host-only Adapter. If the list of networks is empty, add a new network. Click OK. If you do this step, you will be able to connect to the running virtual machine via SSH from the host OS at 192.168.56.101 (see the example command below). The username and password are 'cloudera'.
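
For example, with the host-only adapter enabled, you could connect from a terminal on the host OS using a standard SSH client:

ssh cloudera@192.168.56.101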

The virtual machine includes the following software

  • CentOS 6.4
  • JDK 7 (1.7.0_67)
  • Hadoop 2.5.0
  • Eclipse 4.2.6 (Juno)

The virtual machine runs best with 4096MB of RAM, but has been tested to function with 1024MB. Note that at 1024MB, while it did technically function, it was very slow to start up.

2 Running Hadoop jobs

Generally Hadoop can be run in three modes.

1. Standalone (or local) mode: There are no daemons used in this mode. Hadoop uses the local file system as a substitute for the HDFS file system. The jobs will run as if there were 1 mapper and 1 reducer.

2. Pseudo-distributed mode: All the daemons run on a single machine and this setting mimics the behavior of a cluster. All the daemons run on your machine locally using the HDFS protocol. There can be multiple mappers and reducers.

3. Fully-distributed mode: This is how Hadoop runs on a real cluster.

In this homework we will show you how to run Hadoop jobs in Standalone mode (very useful for developing and debugging) and also in Pseudo-distributed mode (to mimic the behavior of a cluster environment).

2.1 Creating a Hadoop project in Eclipse

(There is a plugin for Eclipse that makes it simple to create a new Hadoop project and execute Hadoop jobs, but the plugin is only well maintained for Hadoop 1.0.4, which is a rather old version of Hadoop. There is a project at https://github.com/winghc/hadoop2x-eclipse-plugin that is working to update the plugin for Hadoop 2.0. You can try it out if you like, but your mileage may vary.)

To create a project:

1. Open Eclipse. If you just launched the VM, you may have to close the Firefox window to find the Eclipse icon on the desktop.

2. Right-click on the training node in the Package Explorer and select Copy. See Figure 1.

Figure 1: Create a Hadoop Project

Figure1CreateaHadoopProject.png

3. Right-click on the training node in the Package Explorer and select Paste. See Figure 2.

Figure 2: Create a Hadoop Project

Figure2CreateaHadoopProject.png

4. In the pop-up dialog, enter the new project name in the Project Name field and click OK. See Figure 3.

Figure 3: Create a Hadoop Project

Figure3CreateaHadoopProject.png

5. Modify or replace the stub classes found in the src directory as needed.

2.2 Running Hadoop jobs in standalone mode

Once you've created your project and written the source code, to run the project in standalone mode, do the following:

1. Right-click on the project and select Run As -> Run Configurations. See Figure 4.

Figure 4: Run a Hadoop Project

Figure4RunaHadoopProject.png

2. In the pop-up dialog, select the Java Application node and click the New launch configuration button in the upper left corner. See Figure 5.

Figure 5: Run a Hadoop Project

Figure5RunaHadoopProject.png

3. Enter a name in the Name field and the name of the main class in the Main class field. See Figure 6.

Figure 6: Run a Hadoop Project

Figure6RunaHadoopProject.png

4. Switch to the Arguments tab and input the required arguments. Click Apply. See Figure 7. To run the job immediately, click on the Run button. Otherwise click Close and complete the following step.

Figure 7: Run a Hadoop Project

Figure7RunaHadoopProject.png

5. Right-click on the project and select Run As -> Java Application. See Figure 8.

Figure 8: Run a Hadoop Project

Figure8RunaHadoopProject.png

6. In the pop-up dialog select the main class from the selection list and click OK. See Figure 9.

Figure 9: Run a Hadoop Project

Figure9RunaHadoopProject.png

After you have setup the run configuration the first time, you can skip steps 1 and 2 above in subsequent runs, unless you need to change the arguments. You can also create more than one launch configuration if you'd like, such as one for each set of common arguments.

2.3 Running Hadoop in pseudo-distributed mode

Once you've created your project and written the source code, to run the project in pseudo-distributed mode, do the following:

1. Right-click on the project and select Export. See Figure 10.

Figure 10: Run a Hadoop Project

Figure10RunaHadoopProject.png

2. In the pop-up dialog, expand the Java node and select JAR file. See Figure 11. Click Next >

Figure 11: Run a Hadoop Project

Figure11RunaHadoopProject.png

3. Enter a path in the JAR file field and click Finish. See Figure 12.

Figure 12: Run a Hadoop Project

Figure12RunaHadoopProject.png

4. Open a terminal and run the following command:

hadoop jar path/to/file.jar <input path> <output path>
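
For example, the WordCount job that is exported as a JAR in Section 2.5 below is launched (from the directory containing the JAR) with:

hadoop jar wordcount.jar edu.stanford.cs246.wordcount.WordCount pg100.txt output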

After modifications to the source files, repeat all of the above steps to run the job again.

2.4 Debugging Hadoop jobs

To debug an issue with a job, the easiest approach is to run the job in stand-alone mode and use a debugger. To debug your job, do the following steps:

1. Right-click on the project and select Debug As -> Java Application. See Figure 13.

Figure 13: Debug a Hadoop project

Figure13DebugaHadoopProject.png

2. In the pop-up dialog select the main class from the selection list and click OK. See Figure 14.

Figure 14: Run a Hadoop Project

Figure14RunaHadoopProject.png

You can use the Eclipse debugging features to debug your job execution. See the additional Eclipse tutorials at the end of section 2.6 for help using the Eclipse debugger.

When running your job in pseudo-distributed mode, the output from the job is logged in the task tracker's log files, which can be accessed most easily by pointing a web browser to port 8088 of the server, which will be localhost. From the job tracker web page, you can drill down into the failing job, the failing task, the failed attempt, and finally the log files. Note that the logs for stdout and stderr are separated, which can be useful when trying to isolate specific debugging print statements.

2.5 Example project

In this section you will create a new Eclipse Hadoop project, compile it, and execute it. The program will count the frequency of all the words in a given large text file. In your virtual machine, Hadoop, the Java environment, and Eclipse have already been pre-installed.

Open Eclipse. If you just launched the VM, you may have to close the Firefox window to find the Eclipse icon on the desktop.

Right-click on the training node in the Package Explorer and select Copy. See Figure 15.

Figure 15: Create a Hadoop Project

Figure15CreateaHadoopProject.png

Right-click on the training node in the Package Explorer and select Paste. See Figure 16.

Figure 16: Create a Hadoop Project

Figure16CreateaHadoopProject.png

In the pop-up dialog, enter the new project name in the Project Name field and click OK. See Figure 17.

Figure 17: Create a Hadoop Project

Figure17CreateaHadoopProject.png

Create a new package called edu.stanford.cs246.wordcount by right-clicking on the src node and selecting New -> Package. See Figure 18.

Figure 18: Create a Hadoop Project

Figure18CreateaHadoopProject.png

Enter edu.stanford.cs246.wordcount in the Name field and click Finish. See Figure 19.

Figure 19: Create a Hadoop Project

Figure19CreateaHadoopProject.png

Create a new class in that package called WordCount by right-clicking on the edu.stanford.cs246.wordcount node and selecting New -> Class. See Figure 20.

Figure 20: Create a Hadoop Project

Figure20CreateaHadoopProject.png

In the pop-up dialog, enter WordCount as the Name. See Figure 21.

Figure 21: Create a Hadoop Project

Figure21CreateaHadoopProject.png

In the Superclass field, enter Configured and click the Browse button. From the pop-up window select Configured - org.apache.hadoop.conf and click the OK button. See Figure 22.

Figure 22: Create a java file

Figure22CreateaJavaFile.png

In the Interfaces section, click the Add button. From the pop-up window select Tool - org.apache.hadoop.util and click the OK button. See Figure 23.

Figure 23: Create a java file

Figure23CreateaJavaFile.png

Check the boxes for public static void main(String args[]) and Inherited abstract methods and click the Finish button. See Figure 24.

Figure 24: Create WordCount.java

Figure24CreateaWordCount.java.png

You will now have a rough skeleton of a Java file as in Figure 25. You can now add code to this class to implement your Hadoop job.

Figure 25: Create WordCount.java

Figure25CreateaWordCount.java.png

Rather than implement a job from scratch, copy the contents from http://snap.stanford.edu/class/cs246...WordCount.java and paste it into the WordCount.java file. See Figure 26. The code in WordCount.java calculates the frequency of each word in a given dataset.
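
As a rough illustration of what that file contains (a minimal sketch using the Hadoop 2.x mapreduce API, not necessarily the exact code at the link above), a word-count job has a mapper that emits (word, 1) pairs, a reducer that sums the counts, and a driver that wires them together:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

   // Mapper: for each input line, emit (word, 1) for every whitespace-separated token.
   public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
         for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
               word.set(token);
               context.write(word, ONE);
            }
         }
      }
   }

   // Reducer: sum the 1s for each word and emit (word, total count).
   public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable value : values) {
            sum += value.get();
         }
         context.write(key, new IntWritable(sum));
      }
   }

   // Driver: configure the job and take the input and output paths from the arguments.
   @Override
   public int run(String[] args) throws Exception {
      Job job = Job.getInstance(getConf(), "WordCount");
      job.setJarByClass(WordCount.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setMapperClass(Map.class);
      job.setReducerClass(Reduce.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      return job.waitForCompletion(true) ? 0 : 1;
   }

   public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new Configuration(), new WordCount(), args));
   }
}

The program arguments pg100.txt and output used in the run configuration below correspond to args[0] and args[1] here.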

Figure 26: Create WordCount.java

Figure26CreateaWordCount.java.png

Download the Complete Works of William Shakespeare from Project Gutenberg at http://www.gutenberg.org/cache/epub/100/pg100.txt. You can do this simply with cURL, but you also have to be aware of the byte order mark (BOM). You can download the file and remove the BOM in one line by opening a terminal, changing to the ~/workspace/WordCount directory, and running the following command:

curl http://www.gutenberg.org/cache/epub/100/pg100.txt | perl -pe 's/^\xEF\xBB\xBF//' > pg100.txt

If you copy the above command, beware of the quotes, as copy/paste will likely mistranslate them.

Right-click on the project and select Run As -> Run Configurations. See Figure 27.

Figure 27: Run WordCount.java

Figure27CreateaWordCount.java.png

In the pop-up dialog, select the Java Application node and click the New launch configuration button in the upper left corner. See Figure 28.

Figure 28: Run WordCount.java

Figure28RunaWordCount.java.png

Enter a name in the Name field and WordCount in the Main class field. See Figure 29.

Figure 29: Run WordCount.java

Figure29RunaWordCount.java.png

Switch to the Arguments tab and put pg100.txt output in the Program arguments field. See Figure 30. Click Apply and Close.

Figure 30: Run WordCount.java

Figure30RunaWordCount.java.png

Right-click on the project and select Run As -> Java Application. See Figure 31.

Figure 31: Run WordCount.java

Figure31RunaWordCount.java.png

In the pop-up dialog select WordCount - edu.stanford.cs246.wordcount from the selection list and click OK. See Figure 32.

Figure 32: Export a hadoop project

Figure32RunaWordCount.java.png

You will see the command output in the console window, and if the job succeeds, you'll find the results in the ~/workspace/WordCount/output directory. If the job fails complaining that it cannot find the input file, make sure that the pg100.txt file is located in the ~/workspace/WordCount directory.

Right-click on the project and select Export. See Figure 33.

Figure 33: Run WordCount.java

Figure33RunaWordCount.java.png

In the pop-up dialog, expand the Java node and select JAR file. See Figure 34. Click Next >

Figure 34: Export a Hadoop project

Figure34ExportaHadoopproject.png

Enter /home/cloudera/wordcount.jar in the JAR file field and click Finish. See Figure 35.

Figure 35: Export a Hadoop project

Figure35ExportaHadoopproject.png

If you see an error dialog warning that the project compiled with warnings, you can simply click OK.

Open a terminal and run the following commands:

hadoop fs -put workspace/WordCount/pg100.txt

hadoop jar wordcount.jar edu.stanford.cs246.wordcount.WordCount pg100.txt output

Run the command: hadoop fs -ls output

You should see an output file for each reducer. Since there was only one reducer for this job, you should only see one part-* file. Note that sometimes the files will be called part-NNNNN, and sometimes they'll be called part-r-NNNNN. See Figure 36.

Figure 36: Run WordCount job

Figure36RunaWordCountjob.png

Run the command:

hadoop fs -cat output/part* | head

You should see the same output as when you ran the job locally, as shown in Figure 37.

Figure 37: Run WordCount job

Figure37RunaWordCountjob.png

To view the job's logs, open the browser in the VM and point it to http://localhost:8088 as in Figure 38.

Figure 38: Run WordCount job

Figure38RunaWordCountjob.png

Click on the link for the completed job. See Figure 39.

Figure 39: View WordCount job logs

Figure39RunaWordCountjob.png

Click the link for the map tasks. See Figure 40.

Figure 40: View WordCount job logs

Figure40ViewWordCountjoblogs.png

Click the link for the first attempt. See Figure 41.

Figure 41: View WordCount job logs

Figure41ViewWordCountjoblogs.png

Click the link for the full logs. See Figure 42.

Figure 42: View WordCount job logs

Figure42ViewWordCountjoblogs.png

2.6 Using your local machine for development

If you'd rather use your own development environment instead of working in the IDE, follow these steps:

1. Make sure that you have an entry for localhost.localdomain in your /etc/hosts file, e.g. 127.0.0.1 localhost localhost.localdomain

2. Install a copy of Hadoop locally. The easiest way to do that is to simply download the archive from http://archive.cloudera.com/cdh5/cdh...-latest.tar.gz and unpack it.

3. In the unpacked archive, you'll find an etc/hadoop directory. In that directory, open the core-site.xml file and modify it as follows:

Section2.6Code1.png

4. Next, open the yarn-site.xml file in the same directory and modify it as follows:

Section2.6Code2.png
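
The exact property values to use are the ones shown in the two screenshots above. As a rough sketch only (the host name and port below are assumptions based on the VM defaults and port forwarding described in Section 1, not values read from the screenshots), a client-side core-site.xml typically points the default file system at the VM, and yarn-site.xml points at its resource manager:

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

yarn-site.xml:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
</configuration>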

You can now run the Hadoop binaries located in the bin directory in the archive, and they will connect to the cluster running in your virtual machine.

Further Hadoop tutorials

Yahoo! Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/

Cloudera Hadoop Tutorial: http://www.cloudera.com/content/clou...-Tutorial.html

How to Debug MapReduce Programs: http://wiki.apache.org/hadoop/HowToD...ReducePrograms

Further Eclipse tutorials

General Eclipse tutorial: http://www.vogella.com/articles/Eclipse/article.html.

Tutorial on how to use the Eclipse debugger: http://www.vogella.com/articles/Ecli...g/article.html.

3 Task: Write your own Hadoop Job

Now you will write your first MapReduce job to accomplish the following task:

  • Write a Hadoop MapReduce program which outputs the number of words that start with each letter. This means that for every letter we want to count the total number of words that start with that letter. In your implementation ignore the letter case, i.e., consider all words as lower case. You can ignore all non-alphabetic characters. (A minimal mapper sketch follows this list.)
  • Run your program over the same input data as above.
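
A hedged starting point, assuming you reuse the WordCount skeleton from Section 2.5 (a sketch only, not a complete or official solution): replace the mapper so it lower-cases each token, strips non-alphabetic characters, and emits the word's first letter as the key; the summing reducer from WordCount can stay as it is.

   public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text letter = new Text();

      @Override
      public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
         for (String token : value.toString().toLowerCase().split("\\s+")) {
            // Ignore non-alphabetic characters, then count the first letter, if any.
            String word = token.replaceAll("[^a-z]", "");
            if (!word.isEmpty()) {
               letter.set(word.substring(0, 1));
               context.write(letter, ONE);
            }
         }
      }
   }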

What to hand in: Hand in the printout of the output file and upload the source code.

CS246: Mining Massive Datasets Winter 2015

Source: http://web.stanford.edu/class/cs246/intro_handout.pdf (PDF)

Course Information

Instructor: Jure Leskovec
Office Hours: Wednesdays 9-10am, Gates 418
Lectures: 9:30AM - 10:45AM Tuesday and Thursday in NVidia, Huang Engineering Center
Course website: http://cs246.stanford.edu

Contact:
E-mail us at cs246-win1415-staff@lists.stanford.edu
Use Piazza to post questions: http://piazza.com/stanford/winter2015/cs246
SCPD students can attend office hours remotely via a Google Hangout; the link will be posted on Piazza just before the office hours start.
TAs and office hours: See the course website for times and locations.

Topics

  • MapReduce and Hadoop
  • Frequent itemsets and Association rules
  • Near Neighbor Search in High Dimensions
  • Locality Sensitive Hashing (LSH)
  • Dimensionality reduction: SVD and CUR
  • Recommender systems
  • Clustering
  • Analysis of massive graphs
  • Link Analysis: PageRank, HITS
  • Web spam and TrustRank
  • Proximity search on graphs
  • Large-scale supervised machine learning
  • Mining data streams
  • Learning through experimentation
  • Web advertising
  • Optimizing submodular functions

Assignments and grading

  • Four problem sets requiring coding and theory (40%)
  • Final exam (40%)
  • Gradiance quizzes (20%)
  • Piazza and course participation (up to 1% extra credit)

Homework policy

Questions We try very hard to make questions unambiguous, but some ambiguities may remain. Ask (i.e., post a question on Piazza) if confused or state your assumptions explicitly. Reasonable assumptions will be accepted in case of ambiguous questions.

Honor code We take the honor code extremely seriously (http://stanford.io/1F3TWNO). The standard penalty includes a one-quarter suspension from the University and 40 hours of community service. We strongly encourage students to form study groups. Students may discuss and work on homework problems in groups. However, each student must write down the solutions and the code independently. In addition, each student should write down the set of people with whom he or she interacted.

Late assignments Each student will have a total of two late periods to use for homeworks. One late period lasts until 5pm of the next class day. (If the assignment is due on Thursday at 5pm, then the late period expires the next Tuesday at 5pm.) No assignment will be accepted more than one late period after its due date.

Assignment submission All students (SCPD and non-SCPD) submit their homeworks via Scoryst (http://www.scoryst.com). Students can typeset or scan their homeworks. Students also need to upload their code at http://snap.stanford.edu/submit. Put all the code for a single question into a single file and upload it. Refer to the course FAQ for more info.

Regrade requests We take great care to ensure that grading is fair and consistent. Since we will always use the same grading procedure, any grades you receive are unlikely to change significantly. However, if you feel that your work deserves a regrade, submit a written request via Scoryst within a week of receiving your grade. Note, however, that we reserve the right to regrade the entire assignment. Moreover, if the regrade request is unjustified and thus not honored, then every future unsuccessful regrade request will be penalized 5 points.

Gradiance Quizzes are posted on Friday afternoon and due exactly a week later (hard deadline Friday 11:59pm Pacific time). Once the deadline has passed, students will not be able to submit the quiz.

Prerequisites

Students are expected to have the following background (recitation sessions will refresh these topics):

  • The ability to write very non-trivial computer programs (at a minimum, at the level of CS107). Good knowledge of Java will be extremely helpful since most assignments will require the use of Hadoop/Java.
  • Familiarity with basic probability theory is essential (at a minimum, at the level of CS109 or Stat116).
  • Familiarity with writing rigorous proofs (at a minimum at the level of CS 103).
  • Familiarity with basic linear algebra (e.g., any of Math 51, Math 103, Math 113, CS 205, or EE 263).
  • Familiarity with algorithmic analysis (e.g., CS 161).

Materials

Notes and reading assignments will be posted on the course web site. Reading for the class will be from: Mining Massive Datasets by J. Leskovec, A. Rajaraman, J. Ullman (PDFs at http://mmds.org).

Important dates

Assignment        Out Date   Due Date (all 5pm)
Hadoop tutorial   now        Jan 13
Assignment 1      Jan 8      Jan 22
Assignment 2      Jan 22     Feb 5
Assignment 3      Feb 5      Feb 19
Assignment 4      Feb 19     Mar 5

Alternate final              Mar 16, 7-10pm
Final exam                   Mar 20, 12:15-3:15pm

We will also hold two review sessions in the first two weeks of the course (sessions will be video recorded):

  • Review of basic probability. Friday, January 9, at 4:15-5:30pm in Gates B01.
  • Review of basic linear algebra. Friday, January 16, at 4:15-5:30pm in Gates B01.

Next steps for students

Mining of Massive Datasets

Source: http://www.mmds.org/#courses

 
 
Big-data is transforming the world. Here you will learn data mining and machine learning techniques to process large datasets and extract valuable knowledge from them.

The book

The book is based on Stanford Computer Science course CS246: Mining Massive Datasets (and CS345A: Data Mining).

The book, like the course, is designed at the undergraduate computer science level with no formal prerequisites. To support deeper explorations, most of the chapters are supplemented with further reading references.

The Mining of Massive Datasets book has been published by Cambridge University Press. You can get a 20% discount here.

By agreement with the publisher, you can download the book for free from this page. Cambridge University Press does, however, retain copyright on the work, and we expect that you will obtain their permission and acknowledge our authorship if you republish parts or all of it.

We welcome your feedback on the manuscript.

The MOOC (Massive Open Online Course)

We are launching an online course based on the Mining Massive Datasets book:

Mining Massive Datasets MOOC

The course starts September 29 2014 and will run for 9 weeks with 7 weeks of lectures. Additional information and registration.

The 2nd edition of the book (v2.1)

The following is the second edition of the book. There are three new chapters, on mining large graphs, dimensionality reduction, and machine learning. There is also a revised Chapter 2 that treats map-reduce programming in a manner closer to how it is used in practice.

Together with each chapter there is also a set of lecture slides that we use for teaching the Stanford CS246: Mining Massive Datasets course. Note that the slides do not necessarily cover all the material in the corresponding chapters.

 

Chapter      Title                                    Book   Slides
             Preface and Table of Contents            PDF
Chapter 1    Data Mining                              PDF    PDF, PPT
Chapter 2    Map-Reduce and the New Software Stack    PDF    PDF, PPT
Chapter 3    Finding Similar Items                    PDF    PDF, PPT
Chapter 4    Mining Data Streams                      PDF    Part 1: PDF, PPT; Part 2: PDF, PPT
Chapter 5    Link Analysis                            PDF    Part 1: PDF, PPT; Part 2: PDF, PPT
Chapter 6    Frequent Itemsets                        PDF    PDF, PPT
Chapter 7    Clustering                               PDF    PDF, PPT
Chapter 8    Advertising on the Web                   PDF    PDF, PPT
Chapter 9    Recommendation Systems                   PDF    Part 1: PDF, PPT; Part 2: PDF, PPT
Chapter 10   Mining Social-Network Graphs             PDF    Part 1: PDF, PPT; Part 2: PDF, PPT
Chapter 11   Dimensionality Reduction                 PDF    PDF, PPT
Chapter 12   Large-Scale Machine Learning             PDF    Part 1: PDF, PPT; Part 2: PDF, PPT
             Index                                    PDF
             Errata                                   HTML

 

Download the latest version of the book as a single big PDF file (511 pages, 3 MB).

Download the full version of the book with a hyper-linked table of contents that makes it easy to jump around: PDF file (513 pages, 3.69 MB).

The Errata for the second edition of the book: HTML.

Note to the users of provided slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org/.

Comments and corrections are most welcome. Please let us know if you are using these materials in your course and we will list and link to your course.

Stanford big data courses

CS246

CS246: Mining Massive Datasets is a graduate-level course that discusses data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis is on MapReduce as a tool for creating parallel algorithms that can process very large amounts of data.

CS341

CS341 Project in Mining Massive Data Sets is an advanced, project-based course. Students work on data mining and machine learning algorithms for analyzing very large amounts of data. Both interesting big datasets and computational infrastructure (a large MapReduce cluster) are provided by the course staff. Generally, students first take CS246 and then CS341.

CS341 is generously supported by Amazon, which gives us access to its EC2 platform.

CS224W

CS224W: Social and Information Networks is a graduate-level course that covers recent research on the structure and analysis of large social and information networks, and on models and algorithms that abstract their basic properties. The class explores how to practically analyze large-scale network data and how to reason about it through models for network structure and evolution.

You can take Stanford courses!

If you are not a Stanford student, you can still take CS246 as well as CS224W or earn a Stanford Mining Massive Datasets graduate certificate by completing a sequence of four Stanford Computer Science courses. A graduate certificate is a great way to keep the skills and knowledge in your field current. More information is available at the Stanford Center for Professional Development (SCPD).

Supporting materials

If you are an instructor interested in using the Gradiance Automated Homework System with this book, start by creating an account for yourself here. Then, email your chosen login and the request to become an instructor for the MMDS book to support@gradiance.com. You will then be able to create a class using these materials. Manuals explaining the use of the system are available here.

Students who want to use the Gradiance Automated Homework System for self-study can register here. Then, use the class token 1EDD8A1D to join the "omnibus class" for the MMDS book. See The Student Guide for more information.

Previous versions of the book

Version 1.0

The following materials are equivalent to the published book, with errata corrected to July 4, 2012.

Chapter Title Book
  Preface and Table of Contents PDF
Chapter 1 Data Mining PDF
Chapter 2 Large-Scale File Systems and Map-Reduce PDF
Chapter 3 Finding Similar Items PDF
Chapter 4 Mining Data Streams PDF
Chapter 5 Link Analysis PDF
Chapter 6 Frequent Itemsets PDF
Chapter 7 Clustering PDF
Chapter 8 Advertising on the Web PDF
Chapter 9 Recommendation Systems PDF
  Index PDF
  Errata HTML

Download the book as published here (340 pages, 2 MB).

Mining of Massive Datasets

Source: http://infolab.stanford.edu/~ullman/mmds/bookL.pdf (PDF)

Jure Leskovec
Stanford Univ.

Anand Rajaraman
Milliway Labs

Jeffrey D. Ullman
Stanford Univ.

Copyright © 2010, 2011, 2012, 2013, 2014 Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman

Preface

This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course CS345A, titled “Web Mining,” was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates. When Jure Leskovec joined the Stanford faculty, we reorganized the material considerably. He introduced a new course CS224W on network analysis and added material to CS345A, which was renumbered CS246. The three authors also introduced a large-scale data-mining project course, CS341. The book now contains material taught in all three courses.

What the Book Is About

At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort. The principal topics covered are:

1. Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.

2. Similarity search, including the key techniques of minhashing and locality-sensitive hashing.

3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.

4. The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.

5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.

6. Algorithms for clustering very large, high-dimensional datasets.

7. Two key problems for Web applications: managing advertising and recommendation systems.

8. Algorithms for analyzing and mining the structure of very large graphs, especially social-network graphs.

9. Techniques for obtaining the important properties of a large dataset by dimensionality reduction, including singular-value decomposition and latent semantic indexing.

10. Machine-learning algorithms that can be applied to very large data, such as perceptrons, support-vector machines, and gradient descent.

Prerequisites

To appreciate fully the material in this book, we recommend the following prerequisites:

1. An introduction to database systems, covering SQL and related programming systems.

2. A sophomore-level course in data structures, algorithms, and discrete math.

3. A sophomore-level course in software systems, software engineering, and programming languages.

Exercises

The book contains extensive exercises, with some for almost every section. We indicate harder exercises or parts of exercises with an exclamation point. The hardest exercises have a double exclamation point.

Support on the Web

Go to http://www.mmds.org for slides, homework assignments, project requirements, and exams from courses related to this book.

Gradiance Automated Homework

There are automated exercises based on this book, using the Gradiance root-question technology, available at www.gradiance.com/services. Students may enter a public class by creating an account at that site and entering the class with code 1EDD8A1D. Instructors may use the site by making an account there and then emailing support at gradiance dot com with their login name, the name of their school, and a request to use the MMDS materials.

Acknowledgements

Cover art is by Scott Ullman.

We would like to thank Foto Afrati, Arun Marathe, and Rok Sosic for critical readings of a draft of this manuscript.

Errors were also reported by Rajiv Abraham, Apoorv Agarwal, Aris Anagnostopoulos, Atilla Soner Balkir, Arnaud Belletoile, Robin Bennett, Susan Biancani, Amitabh Chaudhary, Leland Chen, Anastasios Gounaris, Shrey Gupta, Waleed Hameid, Saman Haratizadeh, Lachlan Kang, Ed Knorr, Haewoon Kwak, Ellis Lau, Greg Lee, Ethan Lozano, Yunan Luo, Michael Mahoney, Justin Meyer, Bryant Moscon, Brad Penoff, Philips Kokoh Prasetyo, Qi Ge, Rich Seiter, Hitesh Shetty, Angad Singh, Sandeep Sripada, Dennis Sidharta, Krzysztof Stencel, Mark Storus, Roshan Sumbaly, Zack Taylor, Tim Triche Jr., Wang Bin, Weng Zhen-Bin, Robert West, Oscar Wu, Xie Ke, Nicolas Zhao, and Zhou Jingbo. The remaining errors are ours, of course.

J. L.
A. R.
J. D. U.

Palo Alto, CA

March, 2014

Contents

1 Data Mining

In this introductory chapter we begin with the essence of data mining and a discussion of how data mining is treated by the various disciplines that contribute to this field. We cover “Bonferroni’s Principle,” which is really a warning about overusing the ability to mine data. This chapter is also the place where we summarize a few useful ideas that are not data mining but are useful in understanding some important data-mining concepts. These include the TF.IDF measure of word importance, behavior of hash functions and indexes, and identities involving e, the base of natural logarithms. Finally, we give an outline of the topics covered in the balance of the book.

1.1 What is Data Mining?

The most commonly accepted definition of “data mining” is the discovery of “models” for data. A “model,” however, can be one of several things. We mention below the most important directions in modeling.

1.1.1 Statistical Modeling

Statisticians were the first to use the term “data mining.” Originally, “data mining” or “data dredging” was a derogatory term referring to attempts to extract information that was not supported by the data. Section 1.2 illustrates the sort of errors one can make by trying to extract what really isn’t in the data. Today, “data mining” has taken on a positive meaning. Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.

Example 1.1

Suppose our data is a set of numbers. This data is much simpler than data that would be data-mined, but it will serve as an example. A statistician might decide that the data comes from a Gaussian distribution and use a formula to compute the most likely parameters of this Gaussian. The mean and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the “Netflix challenge” to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs 1 to use machine learning to locate people’s resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

2. Extracting the most prominent features of the data and ignoring the rest.

We shall explore these two approaches in the following sections.

1.1.4 Summarization

One of the most interesting forms of summarization is the PageRank idea, which made Google successful and which we shall cover in Chapter 5. In this form of Web mining, the entire complex structure of the Web is summarized by a single number for each page. This number, the “PageRank” of the page, is (oversimplifying somewhat) the probability that a random walker on the graph would be at that page at any given time. The remarkable property this ranking has is that it reflects very well the “importance” of the page – the degree to which typical searchers would like that page returned as an answer to their search query.

Another important form of summary – clustering – will be covered in Chapter 7. Here, data is viewed as points in a multidimensional space. Points that are “close” in this space are assigned to the same cluster. The clusters themselves are summarized, perhaps by giving the centroid of the cluster and the average distance from the centroid of points in the cluster. These cluster summaries become the summary of the entire data set.

Example 1.2

A famous instance of clustering to solve a problem took place long ago in London, and it was done entirely without computers. 2 The physician John Snow, dealing with a Cholera outbreak, plotted the cases on a map of the city. A small illustration suggesting the process is shown in Fig. 1.1.

Figure 1.1: Plotting cholera cases on a map of London

The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of Cholera would not have been discovered.

1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of “baskets” of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these “frequent itemsets” are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an online store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for “similar” customers and recommend something many of these customers have bought. This process is called “collaborative filtering.” If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including “Bonferroni’s Principle,” a warning against overzealous use of data mining.

1.2.1 Total Information Awareness

In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information in order to track terrorist activity. This idea naturally caused great concern among privacy advocates, and the project, called TIA, or Total Information Awareness, was eventually killed by Congress, although it is unclear whether the project in fact exists under another name. It is not the purpose of this book to discuss the difficult issue of the privacy-security tradeoff. However, the prospect of TIA or a system like it does raise technical questions about its feasibility and the realism of its assumptions.

The concern raised by many is that if you look at so much data, and you try to find within it activities that look like terrorist behavior, are you not going to find many innocent activities – or even illicit activities that are not terrorism – that will result in visits from the police and maybe worse than just a visit? The answer is that it all depends on how narrowly you define the activities that you look for. Statisticians have seen this problem in many guises and have a theory, which we introduce in the next section.

1.2.2 Bonferroni’s Principle

Suppose you have a certain amount of data, and you look for events of a certain type within that data. You can expect events of this type to occur, even if the data is completely random, and the number of occurrences of these events will grow as the size of the data grows. These occurrences are “bogus,” in the sense that they have no cause other than that random data will always have some number of unusual features that look significant but aren’t. A theorem of statistics, known as the Bonferroni correction, gives a statistically sound way to avoid most of these bogus positive responses to a search through the data. Without going into the statistical details, we offer an informal version, Bonferroni’s principle, that helps us avoid treating random occurrences as if they were real. Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for. This observation is the informal statement of Bonferroni’s principle.

In a situation like searching for terrorists, where we expect that there are few terrorists operating at any one time, Bonferroni’s principle says that we may only detect terrorists by looking for events that are so rare that they are unlikely to occur in random data. We shall give an extended example in the next section.

1.2.3 An Example of Bonferroni’s Principle

Suppose there are believed to be some “evil-doers” out there, and we want to detect them. Suppose further that we have reason to believe that periodically, evil-doers gather at a hotel to plot their evil. Let us make the following assumptions about the size of the problem:

1. There are one billion people who might be evil-doers.

2. Everyone goes to a hotel one day in 100.

3. A hotel holds 100 people. Hence, there are 100,000 hotels – enough to hold the 1% of a billion people who visit a hotel on any given day.

4. We shall examine hotel records for 1000 days.

To find evil-doers in this data, we shall look for people who, on two different days, were both at the same hotel. Suppose, however, that there really are no evil-doers. That is, everyone behaves at random, deciding with probability 0.01 to visit a hotel on any given day, and if so, choosing one of the 10^5 hotels at random. Would we find any pairs of people who appear to be evil-doers?

We can do a simple approximate calculation as follows. The probability of any two people both deciding to visit a hotel on any given day is 0.0001. The chance that they will visit the same hotel is this probability divided by 10^5, the number of hotels. Thus, the chance that they will visit the same hotel on one given day is 10^-9. The chance that they will visit the same hotel on two different given days is the square of this number, 10^-18. Note that the hotels can be different on the two days.

Now, we must consider how many events will indicate evil-doing. An “event” in this sense is a pair of people and a pair of days, such that the two people were at the same hotel on each of the two days. To simplify the arithmetic, note that for large n, the number of ways to choose two things out of n, C(n, 2), is about n^2/2. We shall use this approximation in what follows. Thus, the number of pairs of people is C(10^9, 2) ≈ 5 × 10^17. The number of pairs of days is C(1000, 2) ≈ 5 × 10^5. The expected number of events that look like evil-doing is the product of the number of pairs of people, the number of pairs of days, and the probability that any one pair of people and pair of days is an instance of the behavior we are looking for. That number is

5 × 10^17 × 5 × 10^5 × 10^-18 = 250,000

That is, there will be a quarter of a million pairs of people who look like evil-doers, even though they are not.

Now, suppose there really are 10 pairs of evil-doers out there. The police will need to investigate a quarter of a million other pairs in order to find the real evil-doers. In addition to the intrusion on the lives of half a million innocent people, the work involved is sufficiently great that this approach to finding evil-doers is probably not feasible.
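
The arithmetic above is easy to check mechanically. Below is a minimal sketch, in plain Java, of the same back-of-the-envelope estimate; the class and variable names are ours, chosen only for illustration.

public class EvilDoerEstimate {
    public static void main(String[] args) {
        double people = 1e9;       // candidate evil-doers
        double days = 1000;        // days of hotel records examined
        double hotels = 1e5;       // number of hotels
        double pVisit = 0.01;      // chance a given person visits a hotel on a given day

        // Chance that two specific people are at the same hotel on one specific day:
        // both visit (0.01 * 0.01) and independently pick the same hotel (1/hotels).
        double pSameHotelOneDay = pVisit * pVisit / hotels;      // 10^-9
        // The same coincidence on two specific days.
        double pTwoDays = pSameHotelOneDay * pSameHotelOneDay;   // 10^-18

        // Approximate C(n, 2) by n^2/2 for large n, as in the text.
        double pairsOfPeople = people * people / 2;              // 5 x 10^17
        double pairsOfDays = days * days / 2;                    // 5 x 10^5

        double expectedBogusPairs = pairsOfPeople * pairsOfDays * pTwoDays;
        System.out.println(expectedBogusPairs);                  // about 250,000
    }
}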

1.2.4 Exercises for Section 1.2
Exercise 1.2.1

Using the information from Section 1.2.3, what would be the number of suspected pairs if the following changes were made to the data (and all other numbers remained as they were in that section)?

(a) The number of days of observation was raised to 2000.

(b) The number of people observed was raised to 2 billion (and there were therefore 200,000 hotels).

(c) We only reported a pair as suspect if they were at the same hotel at the same time on three different days.

Exercise 1.2.2

Suppose we have information about the supermarket purchases of 100 million people. Each person goes to the supermarket 100 times in a year and buys 10 of the 1000 items that the supermarket sells. We believe that a pair of terrorists will buy exactly the same set of 10 items (perhaps the ingredients for a bomb?) at some time during the year. If we search for pairs of people who have bought the same set of items, would we expect that any such people found were truly terrorists? 3

1.3 Things Useful to Know

In this section, we offer brief introductions to subjects that you may or may not have seen in your study of other courses. Each will be useful in the study of data mining. They include:

1. The TF.IDF measure of word importance.

2. Hash functions and their use.

3. Secondary storage (disk) and its effect on running time of algorithms.

4. The base e of natural logarithms and identities involving that constant.

5. Power laws.

1.3.1 Importance of Words in Documents

In several applications of data mining, we shall be faced with the problem of categorizing documents (sequences of words) by their topic. Typically, topics are identified by finding the special words that characterize documents about that topic. For instance, articles about baseball would tend to have many occurrences of words like “ball,” “bat,” “pitch,” “run,” and so on. Once we have classified documents to determine they are about baseball, it is not hard to notice that words such as these appear unusually frequently. However, until we have made the classification, it is not possible to identify these words as characteristic.

Thus, classification often starts by looking at documents, and finding the significant words in those documents. Our first guess might be that the words appearing most frequently in a document are the most significant. However, that intuition is exactly opposite of the truth. The most frequent words will most surely be the common words such as “the” or “and,” which help build ideas but do not carry any significance themselves. In fact, the several hundred most common words in English (called stop words) are often removed from documents before any attempt to classify them.

In fact, the indicators of the topic are relatively rare words. However, not all rare words are equally useful as indicators. There are certain words, for example “notwithstanding” or “albeit,” that appear rarely in a collection of documents, yet do not tell us anything useful. On the other hand, a word like “chukker” is probably equally rare, but tips us off that the document is about the sport of polo. The difference between rare words that tell us something and those that do not has to do with the concentration of the useful words in just a few documents. That is, the presence of a word like “albeit” in a document does not make it terribly more likely that it will appear multiple times. However, if an article mentions “chukker” once, it is likely to tell us what happened in the “first chukker,” then the “second chukker,” and so on. That is, the word is likely to be repeated if it appears at all.

The formal measure of how concentrated into relatively few documents are the occurrences of a given word is called TF.IDF (Term Frequency times Inverse Document Frequency). It is normally computed as follows. Suppose we have a collection of N documents. Define f_ij to be the frequency (number of occurrences) of term (word) i in document j. Then, define the term frequency TF_ij to be:

TF_ij = f_ij / max_k f_kj

That is, the term frequency of term i in document j is f_ij normalized by dividing it by the maximum number of occurrences of any term (perhaps excluding stop words) in the same document. Thus, the most frequent term in document j gets a TF of 1, and other terms get fractions as their term frequency for this document.

The IDF for a term is defined as follows. Suppose term i appears in n_i of the N documents in the collection. Then IDF_i = log2(N/n_i). The TF.IDF score for term i in document j is then defined to be TF_ij × IDF_i. The terms with the highest TF.IDF score are often the terms that best characterize the topic of the document.

Example 1.3

Suppose our repository consists of 2^20 = 1,048,576 documents. Suppose word w appears in 2^10 = 1024 of these documents. Then IDF_w = log2(2^20/2^10) = log2(2^10) = 10. Consider a document j in which w appears 20 times, and that is the maximum number of times in which any word appears (perhaps after eliminating stop words). Then TF_wj = 1, and the TF.IDF score for w in document j is 10.

Suppose that in document k, word w appears once, while the maximum number of occurrences of any word in this document is 20. Then TF_wk = 1/20, and the TF.IDF score for w in document k is 1/2.
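
A minimal sketch of the TF.IDF computation defined above, reproducing the numbers of Example 1.3; the class and method names are illustrative only.

public class TfIdf {
    // TF_ij = f_ij / max_k f_kj
    static double tf(int occurrencesInDoc, int maxOccurrencesOfAnyTerm) {
        return (double) occurrencesInDoc / maxOccurrencesOfAnyTerm;
    }

    // IDF_i = log2(N / n_i), where term i appears in n_i of the N documents.
    static double idf(long totalDocs, long docsContainingTerm) {
        return Math.log((double) totalDocs / docsContainingTerm) / Math.log(2);
    }

    public static void main(String[] args) {
        long N = 1L << 20;    // 1,048,576 documents in the repository
        long nw = 1L << 10;   // word w appears in 1024 of them
        double idfW = idf(N, nw);                  // 10

        // Document j: w appears 20 times, and 20 is the maximum for any word.
        System.out.println(tf(20, 20) * idfW);     // 10.0

        // Document k: w appears once; the maximum for any word is 20.
        System.out.println(tf(1, 20) * idfW);      // 0.5
    }
}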

1.3.2 Hash Functions

The reader has probably heard of hash tables, and perhaps used them in Java classes or similar packages. The hash functions that make hash tables feasible are also essential components in a number of data-mining algorithms, where the hash table takes an unfamiliar form. We shall review the basics here.

First, a hash function h takes a hash-key value as an argument and produces a bucket number as a result. The bucket number is an integer, normally in the range 0 to B − 1, where B is the number of buckets. Hash-keys can be of any type. There is an intuitive property of hash functions that they “randomize” hash-keys. To be precise, if hash-keys are drawn randomly from a reasonable population of possible hash-keys, then h will send approximately equal numbers of hash-keys to each of the B buckets. It would be impossible to do so if, for example, the population of possible hash-keys were smaller than B. Such a population would not be “reasonable.” However, there can be more subtle reasons why a hash function fails to achieve an approximately uniform distribution into buckets.

Example 1.4

Suppose hash-keys are positive integers. A common and simple hash function is to pick h(x) = x mod B, that is, the remainder when x is divided by B. That choice works fine if our population of hash-keys is all positive integers. 1/Bth of the integers will be assigned to each of the buckets. However, suppose our population is the even integers, and B = 10. Then only buckets 0, 2, 4, 6, and 8 can be the value of h(x), and the hash function is distinctly nonrandom in its behavior. On the other hand, if we picked B = 11, then we would find that 1/11th of the even integers get sent to each of the 11 buckets, so the hash function would work very well.

The generalization of Example 1.4 is that when hash-keys are integers, choosing B so it has any common factor with all (or even most of) the possible hash-keys will result in a nonrandom distribution into buckets. Thus, it is normally preferred that we choose B to be a prime. That choice reduces the chance of nonrandom behavior, although we still have to consider the possibility that all hash-keys have B as a factor. Of course there are many other types of hash functions not based on modular arithmetic. We shall not try to summarize the options here, but some sources of information will be mentioned in the bibliographic notes.
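
A minimal sketch illustrating Example 1.4 and the point above about choosing B: hashing a sample of even integers with h(x) = x mod B leaves half the buckets empty when B = 10, but spreads the keys roughly evenly when B = 11, a prime sharing no factor with the keys.

import java.util.Arrays;

public class ModuloBuckets {
    static int[] histogram(int numBuckets) {
        int[] counts = new int[numBuckets];
        for (int x = 0; x < 1000; x += 2) {   // a sample of even hash-keys
            counts[x % numBuckets]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(histogram(10)));  // only buckets 0, 2, 4, 6, 8 are used
        System.out.println(Arrays.toString(histogram(11)));  // all 11 buckets get roughly 45 keys each
    }
}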

What if hash-keys are not integers? In a sense, all data types have values that are composed of bits, and sequences of bits can always be interpreted as integers. However, there are some simple rules that enable us to convert common types to integers. For example, if hash-keys are strings, convert each character to its ASCII or Unicode equivalent, which can be interpreted as a small integer. Sum the integers before dividing by B. As long as B is smaller than the typical sum of character codes for the population of strings, the distribution into buckets will be relatively uniform. If B is larger, then we can partition the characters of a string into groups of several characters each. Treat the concatenation of the codes for the characters of a group as a single integer. Sum the integers associated with all the groups of a string, and divide by B as before. For instance, if B is around a billion, or 2^30, then grouping characters four at a time will give us 32-bit integers. The sum of several of these will distribute fairly evenly into a billion buckets.

For more complex data types, we can extend the idea used for converting strings to integers, recursively:

  • For a type that is a record, each of whose components has its own type, recursively convert the value of each component to an integer, using the algorithm appropriate for the type of that component. Sum the integers for the components, and convert the integer sum to buckets by dividing by B.

  • For a type that is an array, set, or bag of elements of some one type, convert the values of the elements’ type to integers, sum the integers, and divide by B.
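
A minimal sketch of the recipe above for hashing non-integer keys: convert each character of a string to its integer code, sum the codes, and take the remainder modulo B. Here B is a small prime chosen only for illustration; for a much larger B, the text suggests concatenating the codes of several characters into one integer before summing.

public class SimpleStringHash {
    static int hash(String key, int numBuckets) {
        long sum = 0;
        for (char c : key.toCharArray()) {
            sum += c;                   // each character code treated as a small integer
        }
        return (int) (sum % numBuckets);
    }

    public static void main(String[] args) {
        int B = 101;                    // a prime number of buckets
        System.out.println(hash("chukker", B));
        System.out.println(hash("albeit", B));
    }
}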

1.3.3 Indexes

An index is a data structure that makes it efficient to retrieve objects given the value of one or more elements of those objects. The most common situation is one where the objects are records, and the index is on one of the fields of that record. Given a value v for that field, the index lets us retrieve all the records with value v in that field. For example, we could have a file of (name, address, phone) triples, and an index on the phone field. Given a phone number, the index allows us to find quickly the record or records with that phone number.

There are many ways to implement indexes, and we shall not attempt to survey the matter here. The bibliographic notes give suggestions for further reading. However, a hash table is one simple way to build an index. The field or fields on which the index is based form the hash-key for a hash function. Records have the hash function applied to the value of the hash-key, and the record itself is placed in the bucket whose number is determined by the hash function. The bucket could be a list of records in main memory, or a disk block, for example.

Then, given a hash-key value, we can hash it, find the bucket, and need to search only that bucket to find the records with that value for the hash-key. If we choose the number of buckets B to be comparable to the number of records in the file, then there will be relatively few records in any bucket, and the search of a bucket takes little time.

Figure 1.2: A hash table used as an index; phone numbers are hashed to buckets, and the entire record is placed in the bucket whose number is the hash value of the phone

Example 1.5

Figure 1.2 suggests what a main-memory index of records with name, address, and phone fields might look like. Here, the index is on the phone field, and buckets are linked lists. We show the phone 800-555-1212 hashed to bucket number 17. There is an array of bucket headers, whose ith element is the head of a linked list for the bucket numbered i. We show expanded one of the elements of the linked list. It contains a record with name, address, and phone fields. This record is in fact one with the phone number 800-555-1212. Other records in that bucket may or may not have this phone number. We only know that whatever phone number they have is a phone that hashes to 17.
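
A minimal sketch of the hash-table index suggested by Figure 1.2: each record is placed in the bucket to which its phone number hashes, and a lookup searches only that one bucket. The field names, sample records, and bucket count are illustrative and not taken from the figure.

import java.util.ArrayList;
import java.util.List;

public class PhoneIndex {
    record Person(String name, String address, String phone) {}

    private final List<List<Person>> buckets;
    private final int numBuckets;

    PhoneIndex(int numBuckets) {
        this.numBuckets = numBuckets;
        this.buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) buckets.add(new ArrayList<>());
    }

    private int bucketOf(String phone) {
        return Math.floorMod(phone.hashCode(), numBuckets);
    }

    void insert(Person p) {
        buckets.get(bucketOf(p.phone())).add(p);
    }

    List<Person> lookup(String phone) {
        List<Person> matches = new ArrayList<>();
        for (Person p : buckets.get(bucketOf(phone))) {
            if (p.phone().equals(phone)) matches.add(p);   // other records in the bucket merely collide
        }
        return matches;
    }

    public static void main(String[] args) {
        PhoneIndex index = new PhoneIndex(1000);
        index.insert(new Person("Sue", "123 Maple St.", "800-555-1212"));
        index.insert(new Person("Sam", "45 Oak Ave.", "800-555-3434"));
        System.out.println(index.lookup("800-555-1212"));
    }
}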

1.3.4 Secondary Storage

It is important, when dealing with large-scale data, that we have a good understanding of the difference in time taken to perform computations when the data is initially on disk, as opposed to the time needed if the data is initially in main memory. The physical characteristics of disks are another subject on which we could say much, but we shall say only a little and leave the interested reader to follow the bibliographic notes.

Disks are organized into blocks, which are the minimum units that the operating system uses to move data between main memory and disk. For example, the Windows operating system uses blocks of 64K bytes (i.e., 2^16 = 65,536 bytes to be exact). It takes approximately ten milliseconds to access (move the disk head to the track of the block and wait for the block to rotate under the head) and read a disk block. That delay is at least five orders of magnitude (a factor of 10^5) slower than the time taken to read a word from main memory, so if all we want to do is access a few bytes, there is an overwhelming benefit to having data in main memory. In fact, if we want to do something simple to every byte of a disk block, e.g., treat the block as a bucket of a hash table and search for a particular value of the hash-key among all the records in that bucket, then the time taken to move the block from disk to main memory will be far larger than the time taken to do the computation.

By organizing our data so that related data is on a single cylinder (the collection of blocks reachable at a fixed radius from the center of the disk, and therefore accessible without moving the disk head), we can read all the blocks on the cylinder into main memory in considerably less than 10 milliseconds per block. You can assume that a disk cannot transfer data to main memory at more than a hundred million bytes per second, no matter how that data is organized. That is not a problem when your dataset is a megabyte. But a dataset of a hundred gigabytes or a terabyte presents problems just accessing it, let alone doing anything useful with it.
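
A back-of-the-envelope sketch of the disk figures above, contrasting random block accesses at roughly ten milliseconds each with sequential transfer at roughly one hundred million bytes per second; both rates are the rough estimates used in the text, not measurements.

public class DiskEstimate {
    public static void main(String[] args) {
        double seekMs = 10.0;                    // time to access one random block
        double transferBytesPerSec = 100e6;      // sustained sequential transfer rate
        long blockBytes = 64 * 1024;             // 64K-byte block, as in the text

        long datasetBytes = 1_000_000_000_000L;  // one terabyte
        long blocks = datasetBytes / blockBytes;

        double randomHours = blocks * seekMs / 1000 / 3600;
        double sequentialHours = datasetBytes / transferBytesPerSec / 3600;

        System.out.printf("random block accesses: about %.0f hours%n", randomHours);     // roughly 42 hours
        System.out.printf("one sequential read: about %.1f hours%n", sequentialHours);   // roughly 2.8 hours
    }
}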

1.3.5 The Base of Natural Logarithms

The constant e = 2.7182818 · · · has a number of useful special properties. In particular, e is the limit of (1 + 1/x)^x as x goes to infinity. The values of this expression for x = 1, 2, 3, 4 are approximately 2, 2.25, 2.37, 2.44, so you should find it easy to believe that the limit of this sequence is around 2.72.

Some algebra lets us obtain approximations to many seemingly complex expressions. Consider (1 + a)^b, where a is small. We can rewrite the expression as (1 + a)^((1/a)(ab)). Then substitute a = 1/x and 1/a = x, so we have (1 + 1/x)^(x(ab)), which is

((1 + 1/x)^x)^(ab)

Since a is assumed small, x is large, so the subexpression (1 + 1/x)^x will be close to the limiting value of e. We can thus approximate (1 + a)^b as e^(ab).

Similar identities hold when a is negative. That is, the limit as x goes to infinity of (1 − 1/x)^x is 1/e. It follows that the approximation (1 + a)^b ≈ e^(ab) holds even when a is a small negative number. Put another way, (1 − a)^b is approximately e^(-ab) when a is small and b is large.

Some other useful approximations follow from the Taylor expansion of e^x. That is, e^x = Σ_{i≥0} x^i/i!, or e^x = 1 + x + x^2/2 + x^3/6 + x^4/24 + · · · . When x is large, the above series converges slowly, although it does converge because n! grows faster than x^n for any constant x. However, when x is small, either positive or negative, the series converges rapidly, and only a few terms are necessary to get a good approximation.

Example 1.6

Let x = 1/2. Then

e^(1/2) = 1 + 1/2 + 1/8 + 1/48 + 1/384 + · · ·

or approximately e^(1/2) ≈ 1.64844.

Let x = −1. Then

e^(-1) = 1 − 1 + 1/2 − 1/6 + 1/24 − 1/120 + 1/720 − 1/5040 + · · ·

or approximately e^(-1) ≈ 0.36786.
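
A minimal sketch checking the two approximations above: (1 + a)^b ≈ e^(ab) for small a, and the Taylor series for e^x evaluated with only a few terms, as in Example 1.6. The class and method names are ours.

public class EApprox {
    // Sum the first 'terms' terms of the Taylor series e^x = 1 + x + x^2/2! + x^3/3! + ...
    static double taylorExp(double x, int terms) {
        double sum = 0, term = 1;                 // 'term' holds x^i / i!
        for (int i = 0; i < terms; i++) {
            sum += term;
            term *= x / (i + 1);
        }
        return sum;
    }

    public static void main(String[] args) {
        // (1 + a)^b versus e^(ab) for a small a and a large b.
        double a = 0.01, b = 500;
        System.out.println(Math.pow(1 + a, b));   // about 144.77
        System.out.println(Math.exp(a * b));      // e^5, about 148.41

        // Example 1.6: a few terms already give good approximations.
        System.out.println(taylorExp(0.5, 8));    // about 1.6487, close to e^(1/2)
        System.out.println(taylorExp(-1.0, 8));   // about 0.36786, close to 1/e
    }
}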

1.3.6 Power Laws

There are many phenomena that relate two variables by a power law, that is, a linear relationship between the logarithms of the variables. Figure 1.3 suggests such a relationship. If x is the horizontal axis and y is the vertical axis, then the relationship is log_10 y = 6 − 2 log_10 x.

Figure 1.3: A power law with a slope of −2

Example 1.7

We might examine book sales at Amazon.com, and let x represent the rank of books by sales. Then y is the number of sales of the xth best-selling book over some period. The implication of the graph of Fig. 1.3 would be that the best-selling book sold 1,000,000 copies, the 10th best-selling book sold 10,000 copies, the 100th best-selling book sold 100 copies, and so on for all ranks between these numbers and beyond. The implication that above rank 1000 the sales are a fraction of a book is too extreme, and we would in fact expect the line to flatten out for ranks much higher than 1000.

The Matthew Effect

Often, the existence of power laws with values of the exponent higher than 1 are explained by the Matthew effect. In the biblical Book of Matthew, there is a verse about “the rich get richer.” Many phenomena exhibit this behavior, where getting a high value of some property causes that very property to increase. For example, if a Web page has many links in, then people are more likely to find the page and may choose to link to it from one of their pages as well. As another example, if a book is selling well on Amazon, then it is likely to be advertised when customers go to the Amazon site. Some of these people will choose to buy the book as well, thus increasing the sales of this book.

The general form of a power law relating x and y is log y = b + a log x. If we raise the base of the logarithm (which doesn’t actually matter), say e, to the values on both sides of this equation, we get y = e^b e^(a log x) = e^b x^a. Since e^b is just “some constant,” let us replace it by the constant c. Thus, a power law can be written as y = cx^a for some constants a and c.

Example 1.8

In Fig. 1.3 we see that when x = 1, y = 10^6, and when x = 1000, y = 1. Making the first substitution, we see 10^6 = c. The second substitution gives us 1 = c(1000)^a. Since we now know c = 10^6, the second equation gives us 1 = 10^6(1000)^a, from which we see a = −2. That is, the law expressed by Fig. 1.3 is y = 10^6 x^(-2), or y = 10^6/x^2.
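
A minimal sketch of the fitting done in Example 1.8: given two points read off a log-log plot, the slope between them gives the exponent a, and either point then gives the constant c in y = cx^a. The two points below are the ones used in the example; the class name is illustrative.

public class PowerLawFit {
    public static void main(String[] args) {
        double x1 = 1, y1 = 1e6;      // best-selling book: 1,000,000 copies
        double x2 = 1000, y2 = 1;     // rank 1000: 1 copy

        // log y = b + a log x, so a is the slope between the two points in log-log space.
        double a = (Math.log(y2) - Math.log(y1)) / (Math.log(x2) - Math.log(x1));
        double c = y1 / Math.pow(x1, a);

        System.out.println("a = " + a);               // about -2.0
        System.out.println("c = " + c);               // 1000000.0
        System.out.println(c * Math.pow(100, a));     // predicted sales at rank 100: about 100
    }
}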

We shall meet in this book many ways that power laws govern phenomena. Here are some examples:

1. Node Degrees in the Web Graph: Order all pages by the number of inlinks to that page. Let x be the position of a page in this ordering, and let y be the number of in-links to the xth page. Then y as a function of x looks very much like Fig. 1.3. The exponent a is slightly larger than the −2 shown there; it has been found closer to 2.1.

2. Sales of Products: Order products, say books at Amazon.com, by their sales over the past year. Let y be the number of sales of the xth most popular book. Again, the function y(x) will look something like Fig. 1.3. We shall discuss the consequences of this distribution of sales in Section 9.1.2, where we take up the matter of the “long tail.”

3. Sizes of Web Sites: Count the number of pages at Web sites, and order sites by the number of their pages. Let y be the number of pages at the xth site. Again, the function y(x) follows a power law.

4. Zipf’s Law: This power law originally referred to the frequency of words in a collection of documents. If you order words by frequency, and let y be the number of times the xth word in the order appears, then you get a power law, although with a much shallower slope than that of Fig. 1.3. Zipf’s observation was that y = cx^(-1/2). Interestingly, a number of other kinds of data follow this particular power law. For example, if we order states in the US by population and let y be the population of the xth most populous state, then x and y obey Zipf’s law approximately.

1.3.7 Exercises for Section 1.3
Exercise 1.3.1

Suppose there is a repository of ten million documents. What (to the nearest integer) is the IDF for a word that appears in (a) 40 documents (b) 10,000 documents?

Exercise 1.3.2

Suppose there is a repository of ten million documents, and word w appears in 320 of them. In a particular document d, the maximum number of occurrences of a word is 15. Approximately what is the TF.IDF score for w if that word appears (a) once (b) five times?

Exercise 1.3.3

Suppose hash-keys are drawn from the population of all nonnegative integers that are multiples of some constant c, and hash function h(x) is x mod 15. For what values of c will h be a suitable hash function, i.e., a large random choice of hash-keys will be divided roughly equally into buckets?

Exercise 1.3.4

In terms of e, give approximations to

(a) (1.01)^500 (b) (1.05)^1000 (c) (0.9)^40

Exercise 1.3.5

Use the Taylor expansion of e^x to compute, to three decimal places: (a) e^(1/10) (b) e^(-1/10) (c) e^2.

1.4 Outline of the Book

This section gives brief summaries of the remaining chapters of the book.

Chapter 2 is not about data mining per se. Rather, it introduces us to the MapReduce methodology for exploiting parallelism in computing clouds (racks of interconnected processors). There is reason to believe that cloud computing, and MapReduce in particular, will become the normal way to compute when analysis of very large amounts of data is involved. A pervasive issue in later chapters will be the exploitation of the MapReduce methodology to implement the algorithms we cover.

Chapter 3 is about finding similar items. Our starting point is that items can be represented by sets of elements, and similar sets are those that have a large fraction of their elements in common. The key techniques of minhashing and locality-sensitive hashing are explained. These techniques have numerous applications and often give surprisingly efficient solutions to problems that appear impossible for massive data sets.

In Chapter 4, we consider data in the form of a stream. The difference between a stream and a database is that the data in a stream is lost if you do not do something about it immediately. Important examples of streams are the streams of search queries at a search engine or clicks at a popular Web site. In this chapter, we see several of the surprising applications of hashing that make management of stream data feasible.

Chapter 5 is devoted to a single application: the computation of PageRank. This computation is the idea that made Google stand out from other search engines, and it is still an essential part of how search engines know what pages the user is likely to want to see. Extensions of PageRank are also essential in the fight against spam (euphemistically called “search engine optimization”), and we shall examine the latest extensions of the idea for the purpose of combating spam.

Then, Chapter 6 introduces the market-basket model of data, and its canonical problems of association rules and finding frequent itemsets. In the market-basket model, data consists of a large collection of baskets, each of which contains a small set of items. We give a sequence of algorithms capable of finding all frequent pairs of items, that is, pairs of items that appear together in many baskets. Another sequence of algorithms is useful for finding most of the frequent itemsets larger than pairs, with high efficiency.

Chapter 7 examines the problem of clustering. We assume a set of items with a distance measure defining how close or far one item is from another. The goal is to examine a large amount of data and partition it into subsets (clusters), each cluster consisting of items that are all close to one another, yet far from items in the other clusters.

Chapter 8 is devoted to on-line advertising and the computational problems it engenders. We introduce the notion of an on-line algorithm – one where a good response must be given immediately, rather than waiting until we have seen the entire dataset. The idea of competitive ratio is another important concept covered in this chapter; it is the ratio of the guaranteed performance of an on-line algorithm compared with the performance of the optimal algorithm that is allowed to see all the data before making any decisions. These ideas are used to give good algorithms that match bids by advertisers for the right to display their ad in response to a query against the search queries arriving at a search engine.

Chapter 9 is devoted to recommendation systems. Many Web applications involve advising users on what they might like. The Netflix challenge is one example, where it is desired to predict what movies a user would like; another is Amazon’s problem of pitching a product to a customer based on information about what they might be interested in buying. There are two basic approaches to recommendation. We can characterize items by features, e.g., the stars of a movie, and recommend items with the same features as those the user is known to like. Or, we can look at other users with preferences similar to that of the user in question, and see what they liked (a technique known as collaborative filtering).

In Chapter 10, we study social networks and algorithms for their analysis. The canonical example of a social network is the graph of Facebook friends, where the nodes are people, and edges connect two people if they are friends. Directed graphs, such as followers on Twitter, can also be viewed as social networks. A common example of a problem to be addressed is identifying “communities,” that is, small sets of nodes with an unusually large number of edges among them. Other questions about social networks are general questions about graphs, such as computing the transitive closure or diameter of a graph, but are made more difficult by the size of typical networks.

Chapter 11 looks at dimensionality reduction. We are given a very large matrix, typically sparse. Think of the matrix as representing a relationship between two kinds of entities, e.g., ratings of movies by viewers. Intuitively, there are a small number of concepts, many fewer concepts than there are movies or viewers, that explain why certain viewers like certain movies. We offer several algorithms that simplify matrices by decomposing them into a product of matrices that are much smaller in one of the two dimensions. One matrix relates entities of one kind to the small number of concepts and another relates the concepts to the other kind of entity. If done correctly, the product of the smaller matrices will be very close to the original matrix.

Finally, Chapter 12 discusses algorithms for machine learning from very large datasets. Techniques covered include perceptrons, support-vector machines, finding models by gradient descent, nearest-neighbor models, and decision trees.

1.5 Summary of Chapter 1

  • Data Mining: This term refers to the process of extracting useful models of data. Sometimes, a model can be a summary of the data, or it can be the set of most extreme features of the data.
  • Bonferroni’s Principle: If we are willing to view as an interesting feature of data something of which many instances can be expected to exist in random data, then we cannot rely on such features being significant. This observation limits our ability to mine data for features that are not sufficiently rare in practice.
  • TF.IDF: The measure called TF.IDF lets us identify words in a collection of documents that are useful for determining the topic of each document. A word has high TF.IDF score in a document if it appears in relatively few documents, but appears in this one, and when it appears in a document it tends to appear many times.
  • Hash Functions: A hash function maps hash-keys of some data type to integer bucket numbers. A good hash function distributes the possible hash-key values approximately evenly among buckets. Any data type can be the domain of a hash function.
  • Indexes: An index is a data structure that allows us to store and retrieve data records efficiently, given the value in one or more of the fields of the record. Hashing is one way to build an index.
  • Storage on Disk: When data must be stored on disk (secondary memory), it takes very much more time to access a desired data item than if the same data were stored in main memory. When data is large, it is important that algorithms strive to keep needed data in main memory.
  • Power Laws: Many phenomena obey a law that can be expressed as y = cxa for some power a, often around −2. Such phenomena include the sales of the xth most popular book, or the number of in-links to the xth most popular page.

1.6 References for Chapter 1

[7] is a clear introduction to the basics of data mining. [2] covers data mining principally from the point of view of machine learning and statistics.

For construction of hash functions and hash tables, see [4]. Details of the TF.IDF measure and other matters regarding document processing can be found in [5]. See [3] for more on managing indexes, hash tables, and data on disk.

Power laws pertaining to the Web were explored by [1]. The Matthew effect was first observed in [6].

1. A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Weiner, “Graph structure in the web,” Computer Networks 33:1–6, pp. 309–320, 2000.

2. M.M. Gaber, Scientific Data Mining and Knowledge Discovery — Principles and Foundations, Springer, New York, 2010.

3. H. Garcia-Molina, J.D. Ullman, and J. Widom, Database Systems: The Complete Book Second Edition, Prentice-Hall, Upper Saddle River, NJ, 2009.

4. D.E. Knuth, The Art of Computer Programming Vol. 3 (Sorting and Searching), Second Edition, Addison-Wesley, Upper Saddle River, NJ, 1998.

5. C.D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge Univ. Press, 2008.

6. R.K. Merton, “The Matthew effect in science,” Science 159:3810, pp. 56– 63, Jan. 5, 1968.

7. P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, Upper Saddle River, NJ, 2005.

1.7 Footnotes for Chapter 1

1. This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.

3. That is, assume our hypothesis that terrorists will surely buy a set of 10 items in common at some time during the year. We don’t want to address the matter of whether or not terrorists would necessarily do so.

2 MapReduce and the New Software Stack

Modern data-mining applications, often called “big-data” analysis, require us to manage immense amounts of data quickly. In many of these applications, the data is extremely regular, and there is ample opportunity to exploit parallelism. Important examples are:

1. The ranking of Web pages by importance, which involves an iterated matrix-vector multiplication where the dimension is many billions.

2. Searches in “friends” networks at social-networking sites, which involve graphs with hundreds of millions of nodes and many billions of edges.

To deal with applications such as these, a new software stack has evolved. These programming systems are designed to get their parallelism not from a “supercomputer,” but from “computing clusters” – large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches. The software stack begins with a new form of file system, called a “distributed file system,” which features much larger units than the disk blocks in a conventional operating system. Distributed file systems also provide replication of data or redundancy to protect against the frequent media failures that occur when data is distributed over thousands of low-cost compute nodes.

On top of these file systems, many different higher-level programming systems have been developed. Central to the new software stack is a programming system called MapReduce. Implementations of MapReduce enable many of the most common calculations on large-scale data to be performed on computing clusters efficiently and in a way that is tolerant of hardware failures during the computation.

MapReduce systems are evolving and extending rapidly. Today, it is common for MapReduce programs to be created from still higher-level programming systems, often an implementation of SQL. Further, MapReduce turns out to be a useful, but simple, case of more general and powerful ideas. We include in this chapter a discussion of generalizations of MapReduce, first to systems that support acyclic workflows and then to systems that implement recursive algorithms.

Our last topic for this chapter is the design of good MapReduce algorithms, a subject that often differs significantly from the matter of designing good parallel algorithms to be run on a supercomputer. When designing MapReduce algorithms, we often find that the greatest cost is in the communication. We thus investigate communication cost and what it tells us about the most efficient MapReduce algorithms. For several common applications of MapReduce we are able to give families of algorithms that optimally trade the communication cost against the degree of parallelism.

2.1 Distributed File Systems

2.1.1 Physical Organization of Compute Nodes
2.1.2 Large-Scale File-System Organization

2.2 MapReduce

2.2.1 The Map Tasks
2.2.2 Grouping by Key
2.2.3 The Reduce Tasks
2.2.4 Combiners
2.2.5 Details of MapReduce Execution
2.2.6 Coping With Node Failures
2.2.7 Exercises for Section 2.2

2.3 Algorithms Using MapReduce

2.3.1 Matrix-Vector Multiplication by MapReduce
2.3.2 If the Vector v Cannot Fit in Main Memory
2.3.3 Relational-Algebra Operations
2.3.4 Computing Selections by MapReduce
2.3.5 Computing Projections by MapReduce
2.3.6 Union, Intersection, and Difference by MapReduce
2.3.7 Computing Natural Join by MapReduce
2.3.8 Grouping and Aggregation by MapReduce
2.3.9 Matrix Multiplication
2.3.10 Matrix Multiplication with One MapReduce Step
2.3.11 Exercises for Section 2.3

2.4 Extensions to MapReduce

2.4.1 Workflow Systems
2.4.2 Recursive Extensions to MapReduce
2.4.3 Pregel
2.4.4 Exercises for Section 2.4

2.5 The Communication Cost Model

2.5.1 Communication-Cost for Task Networks
2.5.2 Wall-Clock Time
2.5.3 Multiway Joins
2.5.4 Exercises for Section 2.5

2.6 Complexity Theory for MapReduce

2.6.1 Reducer Size and Replication Rate
2.6.2 An Example: Similarity Joins
2.6.3 A Graph Model for MapReduce Problems
2.6.4 Mapping Schemas
2.6.5 When Not All Inputs Are Present
2.6.6 Lower Bounds on Replication Rate
2.6.7 Case Study: Matrix Multiplication
2.6.8 Exercises for Section 2.6

2.7 Summary of Chapter 2

  • Cluster Computing: A common architecture for very large-scale applications is a cluster of compute nodes (processor chip, main memory, and disk). Compute nodes are mounted in racks, and the nodes on a rack are connected, typically by gigabit Ethernet. Racks are also connected by a high-speed network or switch.
  • Distributed File Systems: An architecture for very large-scale file systems has developed recently. Files are composed of chunks of about 64 megabytes, and each chunk is replicated several times, on different compute nodes or racks.
  • MapReduce: This programming system allows one to exploit parallelism inherent in cluster computing, and manages the hardware failures that can occur during a long computation on many nodes. Many Map tasks and many Reduce tasks are managed by a Master process. Tasks on a failed compute node are rerun by the Master.
  • The Map Function: This function is written by the user. It takes a collection of input objects and turns each into zero or more key-value pairs. Keys are not necessarily unique.
  • The Reduce Function: A MapReduce programming system sorts all the key-value pairs produced by all the Map tasks, forms all the values associated with a given key into a list, and distributes key-list pairs to Reduce tasks. Each Reduce task combines the elements on each list, by applying the function written by the user. The results produced by all the Reduce tasks form the output of the MapReduce process. A minimal word-count sketch of the Map and Reduce functions appears after this list.
  • Reducers: It is often convenient to refer to the application of the Reduce function to a single key and its associated value list as a “reducer.”
  • Hadoop: This programming system is an open-source implementation of a distributed file system (HDFS, the Hadoop Distributed File System) and MapReduce (Hadoop itself). It is available through the Apache Foundation.
  • Managing Compute-Node Failures: MapReduce systems support restart of tasks that fail because their compute node, or the rack containing that node, fails. Because Map and Reduce tasks deliver their output only after they finish, it is possible to restart a failed task without concern for possible repetition of the effects of that task. It is necessary to restart the entire job only if the node at which the Master executes fails.
  • Applications of MapReduce: While not all parallel algorithms are suitable for implementation in the MapReduce framework, there are simple implementations of matrix-vector and matrix-matrix multiplication. Also, the principal operators of relational algebra are easily implemented in MapReduce.
  • Workflow Systems: MapReduce has been generalized to systems that support any acyclic collection of functions, each of which can be instantiated by any number of tasks, each responsible for executing that function on a portion of the data.
  • Recursive Workflows: When implementing a recursive collection of functions, it is not always possible to preserve the ability to restart any failed task, because recursive tasks may have produced output that was consumed by another task before the failure. A number of schemes for checkpointing parts of the computation to allow restart of single tasks, or restart all tasks from a recent point, have been proposed.
  • Communication Cost: Many applications of MapReduce or similar systems do very simple things for each task. Then, the dominant cost is usually the cost of transporting data from where it is created to where it is used. In these cases, efficiency of a MapReduce algorithm can be estimated by calculating the sum of the sizes of the inputs to all the tasks.
  • Multiway Joins: It is sometimes more efficient to replicate tuples of the relations involved in a join and have the join of three or more relations computed as a single MapReduce job. The technique of Lagrangean multipliers can be used to optimize the degree of replication for each of the participating relations.
  • Star Joins: Analytic queries often involve a very large fact table joined with smaller dimension tables. These joins can always be done efficiently by the multiway-join technique. An alternative is to distribute the fact table and replicate the dimension tables permanently, using the same strategy as would be used if we were taking the multiway join of the fact table and every dimension table.
  • Replication Rate and Reducer Size: It is often convenient to measure communication by the replication rate, which is the communication per input. Also, the reducer size is the maximum number of inputs associated with any reducer. For many problems, it is possible to derive a lower bound on replication rate as a function of the reducer size.
  • Representing Problems as Graphs: It is possible to represent many problems that are amenable to MapReduce computation by a graph in which nodes represent inputs and outputs. An output is connected to all the inputs that are needed to compute that output.
  • Mapping Schemas: Given the graph of a problem, and given a reducer size, a mapping schema is an assignment of the inputs to one or more reducers so that no reducer is assigned more inputs than the reducer size permits, and yet for every output there is some reducer that gets all the inputs needed to compute that output. The requirement that there be a mapping schema for any MapReduce algorithm is a good expression of what makes MapReduce algorithms different from general parallel computations.
  • Matrix Multiplication by MapReduce: There is a family of one-pass MapReduce algorithms that performs multiplication of n × n matrices with the minimum possible replication rate r = 2n^2/q, where q is the reducer size. On the other hand, a two-pass MapReduce algorithm for the same problem with the same reducer size can use up to a factor of n less communication.
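
As promised above, here is a minimal, single-process sketch of the Map and Reduce functions just summarized, using word count as the running example. A real MapReduce implementation such as Hadoop distributes the map calls, the grouping by key, and the reduce calls across many tasks; this sketch only simulates the three stages in one program, and all names in it are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map function: turn one input line into (word, 1) key-value pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Reduce function: combine the list of values for one key into a single count.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = List.of("the cat sat on the mat", "the dog sat");

        // Group-by-key stage: collect all the values emitted for each key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }

        // Apply the Reduce function to each key and its associated value list.
        grouped.forEach((word, counts) -> System.out.println(word + "\t" + reduce(word, counts)));
    }
}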

2.8 References for Chapter 2

3 Finding Similar Items

A fundamental data-mining problem is to examine data for “similar” items. We shall take up applications in Section 3.1, but an example would be looking at a collection of Web pages and finding near-duplicate pages. These pages could be plagiarisms, for example, or they could be mirrors that have almost the same content but differ in information about the host and about other mirrors.

We begin by phrasing the problem of similarity as one of finding sets with a relatively large intersection. We show how the problem of finding textually similar documents can be turned into such a set problem by the technique known as “shingling.” Then, we introduce a technique called “minhashing,” which compresses large sets in such a way that we can still deduce the similarity of the underlying sets from their compressed versions. Other techniques that work when the required degree of similarity is very high are covered in Section 3.9.

Another important problem that arises when we search for similar items of any kind is that there may be far too many pairs of items to test each pair for their degree of similarity, even if computing the similarity of any one pair can be made very easy. That concern motivates a technique called “locality-sensitive hashing,” for focusing our search on pairs that are most likely to be similar.

Finally, we explore notions of “similarity” that are not expressible as intersection of sets. This study leads us to consider the theory of distance measures in arbitrary spaces. It also motivates a general framework for locality-sensitive hashing that applies for other definitions of “similarity.”

3.1 Applications of Near-Neighbor Search

3.1.1 Jaccard Similarity of Sets
3.1.2 Similarity of Documents
3.1.3 Collaborative Filtering as a Similar-Sets Problem
3.1.4 Exercises for Section 3.1

3.2 Shingling of Documents

3.2.1 k-Shingles
3.2.2 Choosing the Shingle Size
3.2.3 Hashing Shingles
3.2.4 Shingles Built from Words
3.2.5 Exercises for Section 3.2

3.3 Similarity-Preserving Summaries of Sets

3.3.1 Matrix Representation of Sets
3.3.2 Minhashing
3.3.3 Minhashing and Jaccard Similarity
3.3.4 Minhash Signatures
3.3.5 Computing Minhash Signatures
3.3.6 Exercises for Section 3.3

3.4 Locality-Sensitive Hashing for Documents

3.4.1 LSH for Minhash Signatures
3.4.2 Analysis of the Banding Technique
3.4.3 Combining the Techniques
3.4.4 Exercises for Section 3.4

3.5 Distance Measures

3.5.1 Definition of a Distance Measure
3.5.2 Euclidean Distances
3.5.3 Jaccard Distance
3.5.4 Cosine Distance
3.5.5 Edit Distance
3.5.6 Hamming Distance
3.5.7 Exercises for Section 3.5

3.6 The Theory of Locality-Sensitive Functions

3.6.1 Locality-Sensitive Functions
3.6.2 Locality-Sensitive Families for Jaccard Distance
3.6.3 Amplifying a Locality-Sensitive Family
3.6.4 Exercises for Section 3.6

3.7 LSH Families for Other Distance Measures

3.7.1 LSH Families for Hamming Distance
3.7.2 Random Hyperplanes and the Cosine Distance
3.7.3 Sketches
3.7.4 LSH Families for Euclidean Distance
3.7.5 More LSH Families for Euclidean Spaces
3.7.6 Exercises for Section 3.7

3.8 Applications of Locality-Sensitive Hashing

3.8.1 Entity Resolution
3.8.2 An Entity-Resolution Example
3.8.3 Validating Record Matches
3.8.4 Matching Fingerprints
3.8.5 A LSH Family for Fingerprint Matching
3.8.6 Similar News Articles
3.8.7 Exercises for Section 3.8

3.9 Methods for High Degrees of Similarity

3.9.1 Finding Identical Items
3.9.2 Representing Sets as Strings
3.9.3 Length-Based Filtering
3.9.4 Prefix Indexing
3.9.5 Using Position Information
3.9.6 Using Position and Length in Indexes
3.9.7 Exercises for Section 3.9

3.10 Summary of Chapter 3

  • Jaccard Similarity: The Jaccard similarity of sets is the ratio of the size of the intersection of the sets to the size of the union. This measure of similarity is suitable for many applications, including textual similarity of documents and similarity of buying habits of customers.
  • Shingling: A k-shingle is any k characters that appear consecutively in a document. If we represent a document by its set of k-shingles, then the Jaccard similarity of the shingle sets measures the textual similarity of documents. Sometimes, it is useful to hash shingles to bit strings of shorter length, and use sets of hash values to represent documents. A small sketch of shingling and Jaccard similarity appears after this summary.
  • Minhashing: A minhash function on sets is based on a permutation of the universal set. Given any such permutation, the minhash value for a set is that element of the set that appears first in the permuted order.
  • Minhash Signatures: We may represent sets by picking some list of permutations and computing for each set its minhash signature, which is the sequence of minhash values obtained by applying each permutation on the list to that set. Given two sets, the expected fraction of the permutations that will yield the same minhash value is exactly the Jaccard similarity of the sets.
  • Efficient Minhashing: Since it is not really possible to generate random permutations, it is normal to simulate a permutation by picking a random hash function and taking the minhash value for a set to be the least hash value of any of the set’s members.
  • Locality-Sensitive Hashing for Signatures: This technique allows us to avoid computing the similarity of every pair of sets or their minhash signatures. If we are given signatures for the sets, we may divide them into bands, and only measure the similarity of a pair of sets if they are identical in at least one band. By choosing the size of bands appropriately, we can eliminate from consideration most of the pairs that do not meet our threshold of similarity. A sketch that combines minhashing with banding appears after this summary.
  • Distance Measures: A distance measure is a function on pairs of points in a space that satisfies certain axioms. The distance between two points is 0 if the points are the same, but greater than 0 if the points are different. The distance is symmetric; it does not matter in which order we consider the two points. A distance measure must satisfy the triangle inequality: the distance between two points is never more than the sum of the distances between those points and some third point.
  • Euclidean Distance: The most common notion of distance is the Euclidean distance in an n-dimensional space. This distance, sometimes called the L2-norm, is the square root of the sum of the squares of the differences between the points in each dimension. Another distance suitable for Euclidean spaces, called Manhattan distance or the L1-norm, is the sum of the magnitudes of the differences between the points in each dimension.
  • Jaccard Distance: One minus the Jaccard similarity is a distance measure, called the Jaccard distance.
  • Cosine Distance: The angle between vectors in a vector space is the cosine distance measure. We can compute the cosine of that angle by taking the dot product of the vectors and dividing by the lengths of the vectors.
  • Edit Distance: This distance measure applies to a space of strings, and is the number of insertions and/or deletions needed to convert one string into the other. The edit distance can also be computed as the sum of the lengths of the strings minus twice the length of the longest common subsequence of the strings.
  • Hamming Distance: This distance measure applies to a space of vectors. The Hamming distance between two vectors is the number of positions in which the vectors differ.
  • Generalized Locality-Sensitive Hashing: We may start with any collection of functions, such as the minhash functions, that can render a decision as to whether or not a pair of items should be candidates for similarity checking. The only constraint on these functions is that they provide a lower bound on the probability of saying “yes” if the distance (according to some distance measure) is below a given limit, and an upper bound on the probability of saying “yes” if the distance is above another given limit. We can then increase the probability of saying “yes” for nearby items and at the same time decrease the probability of saying “yes” for distant items to as great an extent as we wish, by applying an AND construction and an OR construction.
  • Random Hyperplanes and LSH for Cosine Distance: We can get a set of basis functions to start a generalized LSH for the cosine distance measure by identifying each function with a list of randomly chosen vectors. We apply a function to a given vector v by taking the dot product of v with each vector on the list. The result is a sketch consisting of the signs (+1 or −1) of the dot products. The fraction of positions in which the sketches of two vectors disagree, multiplied by 180, is an estimate of the angle (in degrees) between the two vectors.
  • LSH For Euclidean Distance: A set of basis functions to start LSH for Euclidean distance can be obtained by choosing random lines and projecting points onto those lines. Each line is broken into fixed-length intervals, and the function answers “yes” to a pair of points that fall into the same interval.
  • High-Similarity Detection by String Comparison: An alternative approach to finding similar items, when the threshold of Jaccard similarity is close to 1, avoids using minhashing and LSH. Rather, the universal set is ordered, and sets are represented by strings, consisting of their elements in order. The simplest way to avoid comparing all pairs of sets or their strings is to note that highly similar sets will have strings of approximately the same length. If we sort the strings, we can compare each string with only a small number of the immediately following strings.
  • Character Indexes: If we represent sets by strings, and the similarity threshold is close to 1, we can index all strings by their first few characters. The prefix whose characters must be indexed is approximately the length of the string times the maximum Jaccard distance (1 minus the minimum Jaccard similarity).
  • Position Indexes: We can index strings not only on the characters in their prefixes, but on the position of that character within the prefix. We reduce the number of pairs of strings that must be compared, because if two strings share a character that is not in the first position in both strings, then we know that either there are some preceding characters that are in the union but not the intersection, or there is an earlier symbol that appears in both strings.
  • Suffix Indexes: We can also index strings based not only on the characters in their prefixes and the positions of those characters, but on the length of the character’s suffix – the number of positions that follow it in the string. This structure further reduces the number of pairs that must be compared, because a common symbol with different suffix lengths implies additional characters that must be in the union but not in the intersection.
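
A minimal Java sketch of the shingling and Jaccard-similarity ideas above (referenced from the Shingling bullet). The sample strings and the choice of k = 3 are illustrative assumptions; the chapter suggests larger k for real documents (around 5 for short texts such as emails, and about 9 for large documents).

    import java.util.HashSet;
    import java.util.Set;

    public class ShingleJaccard {

        // Build the set of all k-character shingles of a document.
        static Set<String> shingles(String doc, int k) {
            Set<String> result = new HashSet<>();
            for (int i = 0; i + k <= doc.length(); i++) {
                result.add(doc.substring(i, i + k));
            }
            return result;
        }

        // Jaccard similarity = |intersection| / |union|.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
        }

        public static void main(String[] args) {
            String d1 = "the quick brown fox";   // made-up example documents
            String d2 = "the quick brown dog";
            int k = 3;                           // illustrative shingle size
            System.out.println(jaccard(shingles(d1, k), shingles(d2, k)));
        }
    }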
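
The next sketch, referenced from the Locality-Sensitive Hashing bullet, simulates permutations with random hash functions of the form h(x) = (ax + b) mod p, builds minhash signatures, and then bands the signatures to produce candidate pairs. The class name, the tiny example sets, and the parameters (20 hash functions, 5 bands of 4 rows) are assumptions chosen only for illustration.

    import java.util.*;

    public class MinhashLSH {

        // Simulate numHashes random permutations with hash functions h(x) = (a*x + b) mod p,
        // as described in the Efficient Minhashing bullet.
        static int[][] signatures(List<Set<Integer>> sets, int numHashes, long seed) {
            Random rng = new Random(seed);
            final long p = 2147483647L;  // a large prime
            long[] a = new long[numHashes], b = new long[numHashes];
            for (int i = 0; i < numHashes; i++) {
                a[i] = 1 + rng.nextInt(Integer.MAX_VALUE - 1);
                b[i] = rng.nextInt(Integer.MAX_VALUE);
            }
            int[][] sig = new int[sets.size()][numHashes];
            for (int[] row : sig) Arrays.fill(row, Integer.MAX_VALUE);
            for (int s = 0; s < sets.size(); s++) {
                for (int element : sets.get(s)) {
                    for (int i = 0; i < numHashes; i++) {
                        long h = (a[i] * element + b[i]) % p;
                        sig[s][i] = (int) Math.min(sig[s][i], h);  // keep the least hash value
                    }
                }
            }
            return sig;
        }

        // Band the signatures: sets whose signatures agree in every row of some band
        // become candidate pairs.
        static Set<String> candidatePairs(int[][] sig, int bands) {
            int rowsPerBand = sig[0].length / bands;
            Set<String> candidates = new TreeSet<>();
            for (int band = 0; band < bands; band++) {
                Map<List<Integer>, List<Integer>> buckets = new HashMap<>();
                for (int s = 0; s < sig.length; s++) {
                    List<Integer> key = new ArrayList<>();
                    for (int r = 0; r < rowsPerBand; r++) key.add(sig[s][band * rowsPerBand + r]);
                    buckets.computeIfAbsent(key, x -> new ArrayList<>()).add(s);
                }
                for (List<Integer> bucket : buckets.values())
                    for (int i = 0; i < bucket.size(); i++)
                        for (int j = i + 1; j < bucket.size(); j++)
                            candidates.add(bucket.get(i) + "-" + bucket.get(j));
            }
            return candidates;
        }

        public static void main(String[] args) {
            List<Set<Integer>> sets = Arrays.asList(
                    new HashSet<Integer>(Arrays.asList(1, 2, 3, 4, 5)),
                    new HashSet<Integer>(Arrays.asList(1, 2, 3, 4, 6)),
                    new HashSet<Integer>(Arrays.asList(7, 8, 9)));
            int[][] sig = signatures(sets, 20, 42L);
            // Sets 0 and 1 (Jaccard similarity 2/3) are likely, though not certain,
            // to appear as a candidate pair; set 2 should not pair with either.
            System.out.println(candidatePairs(sig, 5));
        }
    }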

3.11 References for Chapter 3

4 Mining Data Streams

Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make another assumption: data arrives in a stream or streams, and if it is not processed immediately or stored, then it is lost forever. Moreover, we shall assume that the data arrives so rapidly that it is not feasible to store it all in active storage (i.e., in a conventional database), and then interact with it at the time of our choosing.

The algorithms for processing streams each involve summarization of the stream in some way. We shall start by considering how to make a useful sample of a stream and how to filter a stream to eliminate most of the “undesirable” elements. We then show how to estimate the number of different elements in a stream using much less storage than would be required if we listed all the elements we have seen.

Another approach to summarizing a stream is to look at only a fixed-length “window” consisting of the last n elements for some (typically large) n. We then query the window as if it were a relation in a database. If there are many streams and/or n is large, we may not be able to store the entire window for every stream, so we need to summarize even the windows. We address the fundamental problem of maintaining an approximate count on the number of 1’s in the window of a bit stream, while using much less space than would be needed to store the entire window itself. This technique generalizes to approximating various kinds of sums.

4.1 The Stream Data Model

4.1.1 A Data-Stream-Management System
4.1.2 Examples of Stream Sources
4.1.3 Stream Queries
4.1.4 Issues in Stream Processing

4.2 Sampling Data in a Stream

4.2.1 A Motivating Example
4.2.2 Obtaining a Representative Sample
4.2.3 The General Sampling Problem
4.2.4 Varying the Sample Size
4.2.5 Exercises for Section 4.2

4.3 Filtering Streams

4.3.1 A Motivating Example
4.3.2 The Bloom Filter
4.3.3 Analysis of Bloom Filtering
4.3.4 Exercises for Section 4.3

4.4 Counting Distinct Elements in a Stream

4.4.1 The Count-Distinct Problem
4.4.2 The Flajolet-Martin Algorithm
4.4.3 Combining Estimates
4.4.4 Space Requirements
4.4.5 Exercises for Section 4.4

4.5 Estimating Moments

4.5.1 Definition of Moments
4.5.2 The Alon-Matias-Szegedy Algorithm for Second Moments
4.5.3 Why the Alon-Matias-Szegedy Algorithm Works
4.5.4 Higher-Order Moments
4.5.5 Dealing With Infinite Streams
4.5.6 Exercises for Section 4.5

4.6 Counting Ones in a Window

4.6.1 The Cost of Exact Counts
4.6.2 The Datar-Gionis-Indyk-Motwani Algorithm
4.6.3 Storage Requirements for the DGIM Algorithm
4.6.4 Query Answering in the DGIM Algorithm
4.6.5 Maintaining the DGIM Conditions
4.6.6 Reducing the Error
4.6.7 Extensions to the Counting of Ones
4.6.8 Exercises for Section 4.6

4.7 Decaying Windows

4.7.1 The Problem of Most-Common Elements

4.7.2 Definition of the Decaying Window

4.7.3 Finding the Most Popular Elements

4.8 Summary of Chapter 4

  • The Stream Data Model: This model assumes data arrives at a processing engine at a rate that makes it infeasible to store everything in active storage. One strategy for dealing with streams is to maintain summaries of the streams, sufficient to answer the expected queries about the data. A second approach is to maintain a sliding window of the most recently arrived data.
  • Sampling of Streams: To create a sample of a stream that is usable for a class of queries, we identify a set of key attributes for the stream. By hashing the key of any arriving stream element, we can use the hash value to decide consistently whether all or none of the elements with that key will become part of the sample.
  • Bloom Filters: This technique allows us to filter streams so elements that belong to a particular set are allowed through, while most nonmembers are deleted. We use a large bit array, and several hash functions. Members of the selected set are hashed to buckets, which are bits in the array, and those bits are set to 1. To test a stream element for membership, we hash the element to a set of bits using each of the hash functions, and only accept the element if all these bits are 1. A minimal sketch appears after this summary.
  • Counting Distinct Elements: To estimate the number of different elements appearing in a stream, we can hash elements to integers, interpreted as binary numbers. Then 2 raised to the power that is the length of the longest tail of 0’s (trailing 0’s) seen in the hash value of any stream element is an estimate of the number of different elements. By using many hash functions and combining these estimates, first by taking averages within groups, and then taking the median of the averages, we get a reliable estimate. A second sketch after this summary illustrates this combination.
  • Moments of Streams: The kth moment of a stream is the sum of the kth powers of the counts of each element that appears at least once in the stream. The 0th moment is the number of distinct elements, and the 1st moment is the length of the stream.
  • Estimating Second Moments: A good estimate for the second moment, or surprise number, is obtained by choosing a random position in the stream, taking twice the number of times this element appears in the stream from that position onward, subtracting 1, and multiplying by the length of the stream. Many random variables of this type can be combined like the estimates for counting the number of distinct elements, to produce a reliable estimate of the second moment.
  • Estimating Higher Moments: The technique for second moments works for kth moments as well, as long as we replace the formula 2x − 1 (where x is the number of times the element appears at or after the selected position) by x^k − (x − 1)^k.
  • Estimating the Number of 1’s in a Window: We can estimate the number of 1’s in a window of 0’s and 1’s by grouping the 1’s into buckets. Each bucket has a number of 1’s that is a power of 2; there are one or two buckets of each size, and sizes never decrease as we go back in time. If we record only the position and size of the buckets, we can represent the contents of a window of size N with O(log^2 N) space.
  • Answering Queries About Numbers of 1’s: If we want to know the approximate numbers of 1’s in the most recent k elements of a binary stream, we find the earliest bucket B that is at least partially within the last k positions of the window and estimate the number of 1’s to be the sum of the sizes of each of the more recent buckets plus half the size of B. This estimate can never be off by more than 50% of the true count of 1’s.
  • Closer Approximations to the Number of 1’s: By changing the rule for how many buckets of a given size can exist in the representation of a binary window, so that either r or r − 1 of a given size may exist, we can assure that the approximation to the true number of 1’s is never off by more than 1/r.
  • Exponentially Decaying Windows: Rather than fixing a window size, we can imagine that the window consists of all the elements that ever arrived in the stream, but with the element that arrived t time units ago weighted by e^(−ct) for some time-constant c. Doing so allows us to maintain certain summaries of an exponentially decaying window easily. For instance, the weighted sum of elements can be recomputed, when a new element arrives, by multiplying the old sum by 1 − c and then adding the new element.
  • Maintaining Frequent Elements in an Exponentially Decaying Window: We can imagine that each item is represented by a binary stream, where 0 means the item was not the element arriving at a given time, and 1 means that it was. We can find the elements whose sum of their binary stream is at least 1/2. When a new element arrives, multiply all recorded sums by 1 minus the time constant, add 1 to the count of the item that just arrived, and delete from the record any item whose sum has fallen below 1/2.
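
A minimal sketch of the Bloom-filter idea in the bullet above, assuming a bit array of about one million bits and three hash functions simulated by salting Java's built-in hashCode; the class name, parameters, and example strings are illustrative assumptions.

    import java.util.BitSet;

    public class BloomFilterSketch {
        private final BitSet bits;
        private final int size;
        private final int numHashes;

        BloomFilterSketch(int size, int numHashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.numHashes = numHashes;
        }

        // Simulate several hash functions by salting the element with the hash index.
        private int bucket(String element, int i) {
            return Math.floorMod((element + "#" + i).hashCode(), size);
        }

        // Set the bit chosen by each hash function for a member of the selected set.
        void add(String element) {
            for (int i = 0; i < numHashes; i++) bits.set(bucket(element, i));
        }

        // Accept a stream element only if every hash function leads to a 1 bit;
        // false positives are possible, false negatives are not.
        boolean mightContain(String element) {
            for (int i = 0; i < numHashes; i++)
                if (!bits.get(bucket(element, i))) return false;
            return true;
        }

        public static void main(String[] args) {
            BloomFilterSketch filter = new BloomFilterSketch(1 << 20, 3);
            filter.add("alice@example.com");
            System.out.println(filter.mightContain("alice@example.com"));   // true
            System.out.println(filter.mightContain("mallory@example.com")); // almost certainly false
        }
    }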
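
A second sketch illustrates the count-distinct estimate: 2 raised to the length of the longest tail of 0's, with many hash functions combined by averaging within groups and then taking the median of the averages. The linear hash with a random odd multiplier, the synthetic stream of about 5,000 distinct values, and the group sizes are assumptions made only for this illustration.

    import java.util.Arrays;
    import java.util.Random;

    public class CountDistinctSketch {

        // Length of the tail of 0's in the binary representation of the hash value.
        static int tailZeros(long hash) {
            return hash == 0 ? 0 : Long.numberOfTrailingZeros(hash);  // treat a zero hash conservatively
        }

        // One estimate: 2^R, where R is the longest tail of 0's seen for this hash function.
        static double estimate(int[] stream, long a, long b) {
            int maxTail = 0;
            for (int x : stream) {
                long h = a * x + b;     // simple linear hash; an illustrative assumption
                maxTail = Math.max(maxTail, tailZeros(h));
            }
            return Math.pow(2, maxTail);
        }

        public static void main(String[] args) {
            Random rng = new Random(7);
            int[] stream = new int[100000];
            for (int i = 0; i < stream.length; i++) stream[i] = rng.nextInt(5000); // ~5000 distinct values

            // Combine estimates: average within small groups, then take the median of the averages.
            int groups = 10, perGroup = 10;
            double[] groupAverages = new double[groups];
            for (int g = 0; g < groups; g++) {
                double sum = 0;
                for (int i = 0; i < perGroup; i++)
                    sum += estimate(stream, 2 * rng.nextLong() + 1, rng.nextLong());
                groupAverages[g] = sum / perGroup;
            }
            Arrays.sort(groupAverages);
            System.out.println("estimated distinct elements: " + groupAverages[groups / 2]);
        }
    }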

4.9 References for Chapter 4

5 Link Analysis

One of the biggest changes in our lives in the decade following the turn of the century was the availability of efficient and accurate Web search, through search engines such as Google. While Google was not the first search engine, it was the first able to defeat the spammers who had made search almost useless. Moreover, the innovation provided by Google was a nontrivial technological advance, called “PageRank.” We shall begin the chapter by explaining what PageRank is and how it is computed efficiently.

Yet the war between those who want to make the Web useful and those who would exploit it for their own purposes is never over. When PageRank was established as an essential technique for a search engine, spammers invented ways to manipulate the PageRank of a Web page, often called link spam. That development led to the response of TrustRank and other techniques for preventing spammers from attacking PageRank. We shall discuss TrustRank and other approaches to detecting link spam.

Finally, this chapter also covers some variations on PageRank. These techniques include topic-sensitive PageRank (which can also be adapted for combating link spam) and the HITS, or “hubs and authorities,” approach to evaluating pages on the Web.

5.1 PageRank

5.1.1 Early Search Engines and Term Spam
5.1.2 Definition of PageRank
5.1.3 Structure of the Web
5.1.4 Avoiding Dead Ends
5.1.5 Spider Traps and Taxation
5.1.6 Using PageRank in a Search Engine
5.1.7 Exercises for Section 5.1

5.2 Efficient Computation of PageRank

5.2.1 Representing Transition Matrices
5.2.2 PageRank Iteration Using MapReduce
5.2.3 Use of Combiners to Consolidate the Result Vector
5.2.4 Representing Blocks of the Transition Matrix
5.2.5 Other Efficient Approaches to PageRank Iteration
5.2.6 Exercises for Section 5.2

5.3 Topic-Sensitive PageRank

5.3.1 Motivation for Topic-Sensitive Page Rank
5.3.2 Biased Random Walks
5.3.3 Using Topic-Sensitive PageRank
5.3.4 Inferring Topics from Words
5.3.5 Exercises for Section 5.3

5.4 Link Spam

5.4.1 Architecture of a Spam Farm
5.4.2 Analysis of a Spam Farm
5.4.3 Combating Link Spam
5.4.4 TrustRank
5.4.5 Spam Mass
5.4.6 Exercises for Section 5.4

5.5 Hubs and Authorities

5.5.1 The Intuition Behind HITS
5.5.2 Formalizing Hubbiness and Authority
5.5.3 Exercises for Section 5.5

5.6 Summary of Chapter 5

  • Term Spam: Early search engines were unable to deliver relevant results because they were vulnerable to term spam – the introduction into Web pages of words that misrepresented what the page was about.
  • The Google Solution to Term Spam: Google was able to counteract term spam by two techniques. First was the PageRank algorithm for determining the relative importance of pages on the Web. The second was a strategy of believing what other pages said about a given page, in or near their links to that page, rather than believing only what the page said about itself.
  • PageRank: PageRank is an algorithm that assigns a real number, called its PageRank, to each page on the Web. The PageRank of a page is a measure of how important the page is, or how likely it is to be a good response to a search query. In its simplest form, PageRank is a solution to the recursive equation “a page is important if important pages link to it.”
  • Transition Matrix of the Web: We represent links in the Web by a matrix whose ith row and ith column represent the ith page of the Web. If there are one or more links from page j to page i, then the entry in row i and column j is 1/k, where k is the number of pages to which page j links. Other entries of the transition matrix are 0.
  • Computing PageRank on Strongly Connected Web Graphs: For strongly connected Web graphs (those where any node can reach any other node), PageRank is the principal eigenvector of the transition matrix. We can compute PageRank by starting with any nonzero vector and repeatedly multiplying the current vector by the transition matrix, to get a better estimate. After about 50 iterations, the estimate will be very close to the limit, which is the true PageRank.
  • The Random Surfer Model: Calculation of PageRank can be thought of as simulating the behavior of many random surfers, who each start at a random page and at any step move, at random, to one of the pages to which their current page links. The limiting probability of a surfer being at a given page is the PageRank of that page. The intuition is that people tend to create links to the pages they think are useful, so random surfers will tend to be at a useful page.
  • Dead Ends: A dead end is a Web page with no links out. The presence of dead ends will cause the PageRank of some or all of the pages to go to 0 in the iterative computation, including pages that are not dead ends. We can eliminate all dead ends before undertaking a PageRank calculation by recursively dropping nodes with no arcs out. Note that dropping one node can cause another, which linked only to it, to become a dead end, so the process must be recursive.
  • Spider Traps: A spider trap is a set of nodes that, while they may link to each other, have no links out to other nodes. In an iterative calculation of PageRank, the presence of spider traps causes all the PageRank to be captured within that set of nodes.
  • Taxation Schemes: To counter the effect of spider traps (and of dead ends, if we do not eliminate them), PageRank is normally computed in a way that modifies the simple iterative multiplication by the transition matrix. A parameter β is chosen, typically around 0.85. Given an estimate of the PageRank, the next estimate is computed by multiplying the estimate by β times the transition matrix, and then adding (1−β)/n to the estimate for each page, where n is the total number of pages. A sketch of one such taxed iteration appears after this summary.
  • Taxation and Random Surfers: The calculation of PageRank using taxation parameter β can be thought of as giving each random surfer a probability 1 − β of leaving the Web, and introducing an equivalent number of surfers randomly throughout the Web.
  • Efficient Representation of Transition Matrices: Since a transition matrix is very sparse (almost all entries are 0), it saves both time and space to represent it by listing its nonzero entries. However, in addition to being sparse, the nonzero entries have a special property: they are all the same in any given column; the value of each nonzero entry is the inverse of the number of nonzero entries in that column. Thus, the preferred representation is column-by-column, where the representation of a column is the number of nonzero entries, followed by a list of the rows where those entries occur.
  • Very Large-Scale Matrix–Vector Multiplication: For Web-sized graphs, it may not be feasible to store the entire PageRank estimate vector in the main memory of one machine. Thus, we can break the vector into k segments and break the transition matrix into k^2 squares, called blocks, assigning each square to one machine. The vector segments are each sent to k machines, so there is a small additional cost in replicating the vector.
  • Representing Blocks of a Transition Matrix: When we divide a transition matrix into square blocks, the columns are divided into k segments. To represent a segment of a column, nothing is needed if there are no nonzero entries in that segment. However, if there are one or more nonzero entries, then we need to represent the segment of the column by the total number of nonzero entries in the column (so we can tell what value the nonzero entries have) followed by a list of the rows with nonzero entries.
  • Topic-Sensitive PageRank: If we know the queryer is interested in a certain topic, then it makes sense to bias the PageRank in favor of pages on that topic. To compute this form of PageRank, we identify a set of pages known to be on that topic, and we use it as a “teleport set.” The PageRank calculation is modified so that only the pages in the teleport set are given a share of the tax, rather than distributing the tax among all pages on the Web.
  • Creating Teleport Sets: For topic-sensitive PageRank to work, we need to identify pages that are very likely to be about a given topic. One approach is to start with the pages that the open directory (DMOZ) identifies with that topic. Another is to identify words known to be associated with the topic, and select for the teleport set those pages that have an unusually high number of occurrences of such words.
  • Link Spam: To fool the PageRank algorithm, unscrupulous actors have created spam farms. These are collections of pages whose purpose is to concentrate high PageRank on a particular target page.
  • Structure of a Spam Farm: Typically, a spam farm consists of a target page and very many supporting pages. The target page links to all the supporting pages, and the supporting pages link only to the target page. In addition, it is essential that some links from outside the spam farm be created. For example, the spammer might introduce links to their target page by writing comments in other people’s blogs or discussion groups.
  • TrustRank: One way to ameliorate the effect of link spam is to compute a topic-sensitive PageRank called TrustRank, where the teleport set is a collection of trusted pages. For example, the home pages of universities could serve as the trusted set. This technique avoids sharing the tax in the PageRank calculation with the large numbers of supporting pages in spam farms and thus preferentially reduces their PageRank.
  • Spam Mass: To identify spam farms, we can compute both the conventional PageRank and the TrustRank for all pages. Those pages that have much lower TrustRank than PageRank are likely to be part of a spam farm.
  • Hubs and Authorities: While PageRank gives a one-dimensional view of the importance of pages, an algorithm called HITS tries to measure two different aspects of importance. Authorities are those pages that contain valuable information. Hubs are pages that, while they do not themselves contain the information, link to places where the information can be found.
  • Recursive Formulation of the HITS Algorithm: Calculation of the hubs and authorities scores for pages depends on solving the recursive equations: “a hub links to many authorities, and an authority is linked to by many hubs.” The solution to these equations is essentially an iterated matrix–vector multiplication, just like PageRank’s. However, the existence of dead ends or spider traps does not affect the solution to the HITS equations in the way they do for PageRank, so no taxation scheme is necessary.
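
A sketch of the taxed PageRank iteration referenced from the Taxation Schemes bullet: v' = β M v + (1 − β)/n, repeated about 50 times. The tiny three-page Web and the dense matrix representation are illustrative simplifications; a real computation would use the sparse, column-by-column representation summarized above.

    public class PageRankSketch {

        // One taxed iteration: next = beta * M * v + (1 - beta)/n.
        static double[] iterate(double[][] m, double[] v, double beta) {
            int n = v.length;
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0;
                for (int j = 0; j < n; j++) sum += m[i][j] * v[j];
                next[i] = beta * sum + (1 - beta) / n;
            }
            return next;
        }

        public static void main(String[] args) {
            // Transition matrix for a made-up 3-page Web: column j holds 1/k in each row
            // that page j links to, where k is the number of links out of page j.
            // Here page 0 links to 1 and 2, page 1 links to 0, page 2 links to 0 and 1.
            double[][] m = {
                    {0.0, 1.0, 0.5},
                    {0.5, 0.0, 0.5},
                    {0.5, 0.0, 0.0}
            };
            int n = 3;
            double[] v = {1.0 / n, 1.0 / n, 1.0 / n};
            double beta = 0.85;  // typical taxation parameter, per the summary
            for (int step = 0; step < 50; step++) v = iterate(m, v, beta);
            for (int i = 0; i < n; i++) System.out.printf("PageRank of page %d = %.4f%n", i, v[i]);
        }
    }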

5.7 References for Chapter 5

6 Frequent Itemsets

We turn in this chapter to one of the major families of techniques for characterizing data: the discovery of frequent itemsets. This problem is often viewed as the discovery of “association rules,” although the latter is a more complex characterization of data, whose discovery depends fundamentally on the discovery of frequent itemsets.

To begin, we introduce the “market-basket” model of data, which is essentially a many-many relationship between two kinds of elements, called “items” and “baskets,” but with some assumptions about the shape of the data. The frequent-itemsets problem is that of finding sets of items that appear in (are related to) many of the same baskets.

The problem of finding frequent itemsets differs from the similarity search discussed in Chapter 3. Here we are interested in the absolute number of baskets that contain a particular set of items. In Chapter 3 we wanted items that have a large fraction of their baskets in common, even if the absolute number of baskets is small.

The difference leads to a new class of algorithms for finding frequent itemsets. We begin with the A-Priori Algorithm, which works by eliminating most large sets as candidates by looking first at smaller sets and recognizing that a large set cannot be frequent unless all its subsets are. We then consider various improvements to the basic A-Priori idea, concentrating on very large data sets that stress the available main memory.

Next, we consider approximate algorithms that work faster but are not guaranteed to find all frequent itemsets. Also in this class of algorithms are those that exploit parallelism, including the parallelism we can obtain through a MapReduce formulation. Finally, we discuss briefly how to find frequent itemsets in a data stream.

6.1 The Market-Basket Model

6.1.1 Definition of Frequent Itemsets
6.1.2 Applications of Frequent Itemsets
6.1.3 Association Rules
6.1.4 Finding Association Rules with High Confidence
6.1.5 Exercises for Section 6.1

6.2 Market Baskets and the A-Priori Algorithm

6.2.1 Representation of Market-Basket Data
6.2.2 Use of Main Memory for Itemset Counting
6.2.3 Monotonicity of Itemsets
6.2.4 Tyranny of Counting Pairs
6.2.5 The A-Priori Algorithm
6.2.6 A-Priori for All Frequent Itemsets
6.2.7 Exercises for Section 6.2

6.3 Handling Larger Datasets in Main Memory

6.3.1 The Algorithm of Park, Chen, and Yu
6.3.2 The Multistage Algorithm
6.3.3 The Multihash Algorithm
6.3.4 Exercises for Section 6.3

6.4 Limited-Pass Algorithms

6.4.1 The Simple, Randomized Algorithm
6.4.2 Avoiding Errors in Sampling Algorithms
6.4.3 The Algorithm of Savasere, Omiecinski, and Navathe
6.4.4 The SON Algorithm and MapReduce
6.4.5 Toivonen’s Algorithm
6.4.6 Why Toivonen’s Algorithm Works
6.4.7 Exercises for Section 6.4

6.5 Counting Frequent Items in a Stream

6.5.1 Sampling Methods for Streams
6.5.2 Frequent Itemsets in Decaying Windows
6.5.3 Hybrid Methods
6.5.4 Exercises for Section 6.5

6.6 Summary of Chapter 6

  • Market-Basket Data: This model of data assumes there are two kinds of entities: items and baskets. There is a many–many relationship between items and baskets. Typically, baskets are related to small sets of items, while items may be related to many baskets.
  • Frequent Itemsets: The support for a set of items is the number of baskets containing all those items. Itemsets with support that is at least some threshold are called frequent itemsets.
  • Association Rules: These are implications that if a basket contains a certain set of items I, then it is likely to contain another particular item j as well. The probability that j is also in a basket containing I is called the confidence of the rule. The interest of the rule is the amount by which the confidence deviates from the fraction of all baskets that contain j.
  • The Pair-Counting Bottleneck: To find frequent itemsets, we need to examine all baskets and count the number of occurrences of sets of a certain size. For typical data, with a goal of producing a small number of itemsets that are the most frequent of all, the part that often takes the most main memory is the counting of pairs of items. Thus, methods for finding frequent itemsets typically concentrate on how to minimize the main memory needed to count pairs.
  • Triangular Matrices: While one could use a two-dimensional array to count pairs, doing so wastes half the space, because there is no need to count pair {i, j} in both the i-j and j-i array elements. By arranging the pairs (i, j) for which i < j in lexicographic order, we can store only the needed counts in a one-dimensional array with no wasted space, and yet be able to access the count for any pair efficiently. A sketch of this layout appears after this summary.
  • Storage of Pair Counts as Triples: If fewer than 1/3 of the possible pairs actually occur in baskets, then it is more space-efficient to store counts of pairs as triples (i, j, c), where c is the count of the pair {i, j}, and i < j. An index structure such as a hash table allows us to find the triple for (i, j) efficiently.
  • Monotonicity of Frequent Itemsets: An important property of itemsets is that if a set of items is frequent, then so are all its subsets. We exploit this property to eliminate the need to count certain itemsets by using its contrapositive: if an itemset is not frequent, then neither are its supersets.
  • The A-Priori Algorithm for Pairs: We can find all frequent pairs by making two passes over the baskets. On the first pass, we count the items themselves, and then determine which items are frequent. On the second pass, we count only the pairs of items both of which are found frequent on the first pass. Monotonicity justifies our ignoring other pairs. A two-pass sketch appears after this summary.
  • Finding Larger Frequent Itemsets: A-Priori and many other algorithms allow us to find frequent itemsets larger than pairs, if we make one pass over the baskets for each size itemset, up to some limit. To find the frequent itemsets of size k, monotonicity lets us restrict our attention to only those itemsets such that all their subsets of size k − 1 have already been found frequent.
  • The PCY Algorithm: This algorithm improves on A-Priori by creating a hash table on the first pass, using all main-memory space that is not needed to count the items. Pairs of items are hashed, and the hash-table buckets are used as integer counts of the number of times a pair has hashed to that bucket. Then, on the second pass, we only have to count pairs of frequent items that hashed to a frequent bucket (one whose count is at least the support threshold) on the first pass.
  • The Multistage Algorithm: We can insert additional passes between the first and second pass of the PCY Algorithm to hash pairs to other, independent hash tables. At each intermediate pass, we only have to hash pairs of frequent items that have hashed to frequent buckets on all previous passes.
  • The Multihash Algorithm: We can modify the first pass of the PCY Algorithm to divide available main memory into several hash tables. On the second pass, we only have to count a pair of frequent items if they hashed to frequent buckets in all hash tables.
  • Randomized Algorithms: Instead of making passes through all the data, we may choose a random sample of the baskets, small enough that it is possible to store both the sample and the needed counts of itemsets in main memory. The support threshold must be scaled down in proportion. We can then find the frequent itemsets for the sample, and hope that it is a good representation of the data as a whole. While this method uses at most one pass through the whole dataset, it is subject to false positives (itemsets that are frequent in the sample but not the whole) and false negatives (itemsets that are frequent in the whole but not the sample).
  • The SON Algorithm: An improvement on the simple randomized algorithm is to divide the entire file of baskets into segments small enough that all frequent itemsets for the segment can be found in main memory. Candidate itemsets are those found frequent for at least one segment. A second pass allows us to count all the candidates and find the exact collection of frequent itemsets. This algorithm is especially appropriate in a MapReduce setting.
  • Toivonen’s Algorithm: This algorithm starts by finding frequent itemsets in a sample, but with the threshold lowered so there is little chance of missing an itemset that is frequent in the whole. Next, we examine the entire file of baskets, counting not only the itemsets that are frequent in the sample, but also, the negative border – itemsets that have not been found frequent, but all their immediate subsets are. If no member of the negative border is found frequent in the whole, then the answer is exact. But if a member of the negative border is found frequent, then the whole process has to repeat with another sample.
  • Frequent Itemsets in Streams: If we use a decaying window with constant c, then we can start counting an item whenever we see it in a basket. We start counting an itemset if we see it contained within the current basket, and all its immediate proper subsets already are being counted. As the window is decaying, we multiply all counts by 1 − c and eliminate those that are less than 1/2.
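
A sketch of the triangular-matrix layout referenced from the Triangular Matrices bullet. The index formula used here, slot = (i − 1)(2n − i)/2 + (j − i) − 1 for a pair {i, j} with 1 ≤ i < j ≤ n (zero-based slots), is one standard way to realize the lexicographic layout; the class and method names are illustrative.

    public class TriangularPairCounts {
        private final int n;          // number of items, numbered 1..n
        private final int[] counts;   // one slot per pair {i, j} with i < j

        TriangularPairCounts(int n) {
            this.n = n;
            this.counts = new int[n * (n - 1) / 2];
        }

        // Map the pair {i, j}, 1 <= i < j <= n, to its slot in the one-dimensional array.
        private int index(int i, int j) {
            if (i > j) { int t = i; i = j; j = t; }   // order the pair so i < j
            return (i - 1) * (2 * n - i) / 2 + (j - i) - 1;
        }

        void increment(int i, int j) { counts[index(i, j)]++; }

        int count(int i, int j) { return counts[index(i, j)]; }

        public static void main(String[] args) {
            TriangularPairCounts pairs = new TriangularPairCounts(5);
            pairs.increment(2, 4);
            pairs.increment(4, 2);                 // same pair, regardless of order
            System.out.println(pairs.count(2, 4)); // 2
        }
    }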
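
A two-pass sketch of the A-Priori Algorithm for frequent pairs, referenced from that bullet. The baskets and the support threshold of 2 are made-up examples, and pairs are counted in a hash map keyed by the sorted pair, rather than in a triangular array, purely to keep the sketch short.

    import java.util.*;

    public class AprioriPairs {

        public static void main(String[] args) {
            List<List<String>> baskets = Arrays.asList(
                    Arrays.asList("milk", "bread", "beer"),
                    Arrays.asList("milk", "bread"),
                    Arrays.asList("bread", "beer"),
                    Arrays.asList("milk", "bread", "diapers"));
            int support = 2;  // illustrative support threshold

            // Pass 1: count individual items and keep the frequent ones.
            Map<String, Integer> itemCounts = new HashMap<>();
            for (List<String> basket : baskets)
                for (String item : basket)
                    itemCounts.merge(item, 1, Integer::sum);
            Set<String> frequentItems = new HashSet<>();
            for (Map.Entry<String, Integer> e : itemCounts.entrySet())
                if (e.getValue() >= support) frequentItems.add(e.getKey());

            // Pass 2: count only pairs whose members are both frequent (monotonicity).
            Map<String, Integer> pairCounts = new HashMap<>();
            for (List<String> basket : baskets) {
                List<String> items = new ArrayList<>(basket);
                Collections.sort(items);
                for (int i = 0; i < items.size(); i++)
                    for (int j = i + 1; j < items.size(); j++)
                        if (frequentItems.contains(items.get(i)) && frequentItems.contains(items.get(j)))
                            pairCounts.merge(items.get(i) + "," + items.get(j), 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : pairCounts.entrySet())
                if (e.getValue() >= support)
                    System.out.println("frequent pair {" + e.getKey() + "} with support " + e.getValue());
        }
    }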

6.7 References for Chapter 6

7 Clustering

Clustering is the process of examining a collection of “points,” and grouping the points into “clusters” according to some distance measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another. A suggestion of what clusters might look like was seen in Fig. 1.1. However, there the intent was that there were three clusters around three different road intersections, but two of the clusters blended into one another because they were not sufficiently separated.

Our goal in this chapter is to offer methods for discovering clusters in data. We are particularly interested in situations where the data is very large, and/or where the space either is high-dimensional, or the space is not Euclidean at all. We shall therefore discuss several algorithms that assume the data does not fit in main memory. However, we begin with the basics: the two general approaches to clustering and the methods for dealing with clusters in a non-Euclidean space.

7.1 Introduction to Clustering Techniques

7.1.1 Points, Spaces, and Distances
7.1.2 Clustering Strategies
7.1.3 The Curse of Dimensionality
7.1.4 Exercises for Section 7.1

7.2 Hierarchical Clustering

7.2.1 Hierarchical Clustering in a Euclidean Space
7.2.2 Efficiency of Hierarchical Clustering
7.2.3 Alternative Rules for Controlling Hierarchical Clustering
7.2.4 Hierarchical Clustering in Non-Euclidean Spaces
7.2.5 Exercises for Section 7.2

7.3 K-means Algorithms

7.3.1 K-Means Basics
7.3.2 Initializing Clusters for K-Means
7.3.3 Picking the Right Value of k
7.3.4 The Algorithm of Bradley, Fayyad, and Reina
7.3.5 Processing Data in the BFR Algorithm
7.3.6 Exercises for Section 7.3

7.4 The CURE Algorithm

7.4.1 Initialization in CURE
7.4.2 Completion of the CURE Algorithm
7.4.3 Exercises for Section 7.4

7.5 Clustering in Non-Euclidean Spaces

7.5.1 Representing Clusters in the GRGPF Algorithm
7.5.2 Initializing the Cluster Tree
7.5.3 Adding Points in the GRGPF Algorithm
7.5.4 Splitting and Merging Clusters
7.5.5 Exercises for Section 7.5

7.6 Clustering for Streams and Parallelism

7.6.1 The Stream-Computing Model
7.6.2 A Stream-Clustering Algorithm
7.6.3 Initializing Buckets
7.6.4 Merging Buckets
7.6.5 Answering Queries
7.6.6 Clustering in a Parallel Environment
7.6.7 Exercises for Section 7.6

7.7 Summary of Chapter 7

  • Clustering: Clusters are often a useful summary of data that is in the form of points in some space. To cluster points, we need a distance measure on that space. Ideally, points in the same cluster have small distances between them, while points in different clusters have large distances between them.
  • Clustering Algorithms: Clustering algorithms generally have one of two forms. Hierarchical clustering algorithms begin with all points in a cluster of their own, and nearby clusters are merged iteratively. Point-assignment clustering algorithms consider points in turn and assign them to the cluster in which they best fit.
  • The Curse of Dimensionality: Points in high-dimensional Euclidean spaces, as well as points in non-Euclidean spaces, often behave unintuitively. Two unexpected properties of these spaces are that random points are almost always at about the same distance, and random vectors are almost always orthogonal.
  • Centroids and Clustroids: In a Euclidean space, the members of a cluster can be averaged, and this average is called the centroid. In non-Euclidean spaces, there is no guarantee that points have an “average,” so we are forced to use one of the members of the cluster as a representative or typical element of the cluster. That representative is called the clustroid.
  • Choosing the Clustroid: There are many ways we can define a typical point of a cluster in a non-Euclidean space. For example, we could choose the point with the smallest sum of distances to the other points, the smallest sum of the squares of those distances, or the smallest maximum distance to any other point in the cluster.
  • Radius and Diameter: Whether or not the space is Euclidean, we can define the radius of a cluster to be the maximum distance from the centroid or clustroid to any point in that cluster. We can define the diameter of the cluster to be the maximum distance between any two points in the cluster. Alternative definitions, especially of the radius, are also known, for example, average distance from the centroid to the other points.
  • Hierarchical Clustering: This family of algorithms has many variations, which differ primarily in two areas. First, we may choose in various ways which two clusters to merge next. Second, we may decide when to stop the merge process in various ways.
  • Picking Clusters to Merge: One strategy for deciding on the best pair of clusters to merge in a hierarchical clustering is to pick the clusters with the closest centroids or clustroids. Another approach is to pick the pair of clusters with the closest points, one from each cluster. A third approach is to use the average distance between points from the two clusters.
  • Stopping the Merger Process: A hierarchical clustering can proceed until there are a fixed number of clusters left. Alternatively, we could merge until it is impossible to find a pair of clusters whose merger is sufficiently compact, e.g., the merged cluster has a radius or diameter below some threshold. Another approach involves merging as long as the resulting cluster has a sufficiently high “density,” which can be defined in various ways, but is the number of points divided by some measure of the size of the cluster, e.g., the radius.
  • K-Means Algorithms: This family of algorithms is of the point-assignment type and assumes a Euclidean space. It is assumed that there are exactly k clusters for some known k. After picking k initial cluster centroids, the points are considered one at a time and assigned to the closest centroid. The centroid of a cluster can migrate during point assignment, and an optional last step is to reassign all the points, while holding the centroids fixed at their final values obtained during the first pass. A basic sketch of the assignment loop appears after this summary.
  • Initializing K-Means Algorithms: One way to find k initial centroids is to pick a random point, and then choose k − 1 additional points, each as far away as possible from the previously chosen points. An alternative is to start with a small sample of points and use a hierarchical clustering to merge them into k clusters.
  • Picking K in a K-Means Algorithm: If the number of clusters is unknown, we can use a binary-search technique, trying a k-means clustering with different values of k. We search for the largest value of k for which a decrease below k clusters results in a radically higher average diameter of the clusters. This search can be carried out in a number of clustering operations that is logarithmic in the true value of k.
  • The BFR Algorithm: This algorithm is a version of k-means designed to handle data that is too large to fit in main memory. It assumes clusters are normally distributed about the axes.
  • Representing Clusters in BFR: Points are read from disk one chunk at a time. Clusters are represented in main memory by the count of the number of points, the vector sum of all the points, and the vector formed by summing the squares of the components of the points in each dimension. Other collections of points, too far from a cluster centroid to be included in a cluster, are represented as “miniclusters” in the same way as the k clusters, while still other points, which are not near any other point, are represented as themselves and called “retained” points.
  • Processing Points in BFR: Most of the points in a main-memory load will be assigned to a nearby cluster and the parameters for that cluster will be adjusted to account for the new points. Unassigned points can be formed into new miniclusters, and these miniclusters can be merged with previously discovered miniclusters or retained points. After the last memory load, the miniclusters and retained points can be merged to their nearest cluster or kept as outliers.
  • The CURE Algorithm: This algorithm is of the point-assignment type. It is designed for a Euclidean space, but clusters can have any shape. It handles data that is too large to fit in main memory.
  • Representing Clusters in CURE: The algorithm begins by clustering a small sample of points. It then selects representative points for each cluster, by picking points in the cluster that are as far away from each other as possible. The goal is to find representative points on the fringes of the cluster. However, the representative points are then moved a fraction of the way toward the centroid of the cluster, so they lie somewhat in the interior of the cluster.
  • Processing Points in CURE: After creating representative points for each cluster, the entire set of points can be read from disk and assigned to a cluster. We assign a given point to the cluster of the representative point that is closest to the given point.
  • The GRGPF Algorithm: This algorithm is of the point-assignment type. It handles data that is too big to fit in main memory, and it does not assume a Euclidean space.
  • Representing Clusters in GRGPF: A cluster is represented by the count of points in the cluster, the clustroid, a set of points nearest the clustroid and a set of points furthest from the clustroid. The nearby points allow us to change the clustroid if the cluster evolves, and the distant points allow for merging clusters efficiently in appropriate circumstances. For each of these points, we also record the rowsum, that is the square root of the sum of the squares of the distances from that point to all the other points of the cluster.
  • Tree Organization of Clusters in GRGPF: Cluster representations are organized into a tree structure like a B-tree, where nodes of the tree are typically disk blocks and contain information about many clusters. The leaves hold the representation of as many clusters as possible, while interior nodes hold a sample of the clustroids of the clusters at their descendant leaves. We organize the tree so that the clusters whose representatives are in any subtree are as close as possible.
  • Processing Points in GRGPF: After initializing clusters from a sample of points, we insert each point into the cluster with the nearest clustroid. Because of the tree structure, we can start at the root and choose to visit the child with the sample clustroid nearest to the given point. Following this rule down one path in the tree leads us to a leaf, where we insert the point into the cluster with the nearest clustroid on that leaf.
  • Clustering Streams: A generalization of the DGIM Algorithm (for counting 1’s in the sliding window of a stream) can be used to cluster points that are part of a slowly evolving stream. The BDMO Algorithm uses buckets similar to those of DGIM, with allowable bucket sizes forming a sequence where each size is twice the previous size.
  • Representation of Buckets in BDMO: The size of a bucket is the number of points it represents. The bucket itself holds only a representation of the clusters of these points, not the points themselves. A cluster representation includes a count of the number of points, the centroid or clustroid, and other information that is needed for merging clusters according to some selected strategy.
  • Merging Buckets in BDMO: When buckets must be merged, we find the best matching of clusters, one from each of the buckets, and merge them in pairs. If the stream evolves slowly, then we expect consecutive buckets to have almost the same cluster centroids, so this matching makes sense.
  • Answering Queries in BDMO: A query is a length of a suffix of the sliding window. We take all the clusters in all the buckets that are at least partially within that suffix and merge them using some strategy. The resulting clusters are the answer to the query.
  • Clustering Using MapReduce: We can divide the data into chunks and cluster each chunk in parallel, using a Map task. The clusters from each Map task can be further clustered in a single Reduce task.
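
A basic sketch of the k-means assignment loop referenced from the K-Means Algorithms bullet, assuming two-dimensional points, Euclidean distance, k = 2, and a fixed number of iterations. The initialization simply takes two of the points; the summary describes better strategies (far-apart points, or clustering a small sample).

    import java.util.Arrays;

    public class KMeansSketch {

        static double dist(double[] a, double[] b) {
            double dx = a[0] - b[0], dy = a[1] - b[1];
            return Math.sqrt(dx * dx + dy * dy);  // Euclidean (L2) distance in two dimensions
        }

        public static void main(String[] args) {
            double[][] points = {{1, 1}, {1.5, 2}, {1, 0}, {8, 8}, {9, 9}, {8, 9.5}};
            int k = 2;
            double[][] centroids = {points[0].clone(), points[3].clone()};  // naive initialization
            int[] assignment = new int[points.length];

            for (int iteration = 0; iteration < 10; iteration++) {
                // Assign each point to the closest centroid.
                for (int p = 0; p < points.length; p++) {
                    int best = 0;
                    for (int c = 1; c < k; c++)
                        if (dist(points[p], centroids[c]) < dist(points[p], centroids[best])) best = c;
                    assignment[p] = best;
                }
                // Recompute each centroid as the average of its assigned points.
                for (int c = 0; c < k; c++) {
                    double sumX = 0, sumY = 0; int count = 0;
                    for (int p = 0; p < points.length; p++)
                        if (assignment[p] == c) { sumX += points[p][0]; sumY += points[p][1]; count++; }
                    if (count > 0) centroids[c] = new double[]{sumX / count, sumY / count};
                }
            }
            System.out.println("assignments: " + Arrays.toString(assignment));
            System.out.println("centroids: " + Arrays.deepToString(centroids));
        }
    }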

7.8 References for Chapter 7

8 Advertising on the Web

One of the big surprises of the 21st century has been the ability of all sorts of interesting Web applications to support themselves through advertising, rather than subscription. While radio and television have managed to use advertising as their primary revenue source, most media – newspapers and magazines, for example – have had to use a hybrid approach, combining revenue from advertising and subscriptions.

By far the most lucrative venue for on-line advertising has been search, and much of the effectiveness of search advertising comes from the “adwords” model of matching search queries to advertisements. We shall therefore devote much of this chapter to algorithms for optimizing the way this assignment is done. The algorithms used are of an unusual type; they are greedy and they are “online” in a particular technical sense to be discussed. We shall therefore digress to discuss these two algorithmic issues – greediness and on-line algorithms – in general, before tackling the adwords problem.

A second interesting on-line advertising problem involves selecting items to advertise at an on-line store. This problem involves “collaborative filtering,” where we try to find customers with similar behavior in order to suggest they buy things that similar customers have bought. This subject will be treated in Section 9.3.

8.1 Issues in On-Line Advertising

8.1.1 Advertising Opportunities
8.1.2 Direct Placement of Ads
8.1.3 Issues for Display Ads

8.2 On-Line Algorithms

8.2.1 On-Line and Off-Line Algorithms
8.2.2 Greedy Algorithms
8.2.3 The Competitive Ratio
8.2.4 Exercises for Section 8.2

8.3 The Matching Problem

8.3.1 Matches and Perfect Matches
8.3.2 The Greedy Algorithm for Maximal Matching
8.3.3 Competitive Ratio for Greedy Matching
8.3.4 Exercises for Section 8.3

8.4 The Adwords Problem

8.4.1 History of Search Advertising
8.4.2 Definition of the Adwords Problem
8.4.3 The Greedy Approach to the Adwords Problem
8.4.4 The Balance Algorithm
8.4.5 A Lower Bound on Competitive Ratio for Balance
8.4.6 The Balance Algorithm with Many Bidders
8.4.7 The Generalized Balance Algorithm
8.4.8 Final Observations About the Adwords Problem
8.4.9 Exercises for Section 8.4

8.5 Adwords Implementation

8.5.1 Matching Bids and Search Queries
8.5.2 More Complex Matching Problems
8.5.3 A Matching Algorithm for Documents and Bids

8.6 Summary of Chapter 8

  • Targeted Advertising: The big advantage that Web-based advertising has over advertising in conventional media such as newspapers is that Web advertising can be selected according to the interests of each individual user. This advantage has enabled many Web services to be supported entirely by advertising revenue.
  • On- and Off-Line Algorithms: Conventional algorithms that are allowed to see all their data before producing an answer are called off-line. An on-line algorithm is required to make a response to each element in a stream immediately, with knowledge of only the past, not the future elements in the stream.
  • Greedy Algorithms: Many on-line algorithms are greedy, in the sense that they select their action at every step by minimizing some objective function.
  • Competitive Ratio: We can measure the quality of an on-line algorithm by its competitive ratio: the minimum, over all possible inputs, of the ratio of the value of the result of the on-line algorithm to the value of the result of the best possible off-line algorithm.
  • Bipartite Matching: This problem involves two sets of nodes and a set of edges between members of the two sets. The goal is to find a maximal matching – as large a set of edges as possible that includes no node more than once.
  • On-Line Solution to the Matching Problem: One greedy algorithm for finding a match in a bipartite graph (or any graph, for that matter) is to order the edges in some way, and for each edge in turn, add it to the match if neither of its ends are yet part of an edge previously selected for the match. This algorithm can be proved to have a competitive ratio of 1/2; that is, it never fails to match at least half as many nodes as the best off-line algorithm matches.
  • Search Ad Management: A search engine receives bids from advertisers on certain search queries. Some ads are displayed with each search query, and the search engine is paid the amount of the bid only if the queryer clicks on the ad. Each advertiser can give a budget, the total amount they are willing to pay for clicks in a month.
  • The Adwords Problem: The data for the adwords problem is a set of bids by advertisers on certain search queries, together with a total budget for each advertiser and information about the historical click-through rate for each ad for each query. Another part of the data is the stream of search queries received by the search engine. The objective is to select on-line a fixed-size set of ads in response to each query that will maximize the revenue to the search engine.
  • Simplified Adwords Problem: To see some of the nuances of ad selection, we considered a simplified version in which all bids are either 0 or 1, only one ad is shown with each query, and all advertisers have the same budget. Under this model the obvious greedy algorithm of giving the ad placement to anyone who has bid on the query and has budget remaining can be shown to have a competitive ratio of 1/2.
  • The Balance Algorithm: This algorithm improves on the simple greedy algorithm. A query’s ad is given to the advertiser who has bid on the query and has the largest remaining budget. Ties can be broken arbitrarily.
  • Competitive Ratio of the Balance Algorithm: For the simplified adwords model, the competitive ratio of the Balance Algorithm is 3/4 for the case of two advertisers and 1−1/e, or about 63% for any number of advertisers.
  • The Balance Algorithm for the Generalized Adwords Problem: When bidders can make differing bids, have different budgets, and have different click-through rates for different queries, the Balance Algorithm awards an ad to the advertiser with the highest value of the function x(1 − e^(−f)). Here, x is the product of the bid and the click-through rate for that advertiser and query, and f is the fraction of the advertiser’s budget that remains unspent. A sketch of this selection rule appears after this summary.
  • Implementing an Adwords Algorithm: The simplest version of the implementation serves in situations where the bids are on exactly the set of words in the search query. We can represent a query by the list of its words, in sorted order. Bids are stored in a hash table or similar structure, with a hash key equal to the sorted list of words. A search query can then be matched against bids by a straightforward lookup in the table.
  • Matching Word Sets Against Documents: A harder version of the adwords-implementation problem allows bids, which are still small sets of words as in a search query, to be matched against larger documents, such as emails or tweets. A bid set matches the document if all the words appear in the document, in any order and not necessarily adjacent.
  • Hash Storage of Word Sets: A useful data structure stores the words of each bid set in the order rarest-first. Documents have their words sorted in the same order. Word sets are stored in a hash table with the first word, in the rarest-first order, as the key.
  • Processing Documents for Bid Matches: We process the words of the document rarest-first. Word sets whose first word is the current word are copied to a temporary hash table, with the second word as the key. Sets already in the temporary hash table are examined to see if the word that is their key matches the current word, and, if so, they are rehashed using their next word. Sets whose last word is matched are copied to the output.
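
A sketch of the Generalized Balance selection rule referenced from that bullet: each advertiser who bid on the query is scored by x(1 − e^(−f)), where x is the bid times the click-through rate and f is the unspent fraction of the budget, and the ad goes to the highest scorer. The two hypothetical advertisers and their numbers are made up to show that a largely unspent budget can outweigh a somewhat higher bid.

    public class GeneralizedBalanceSketch {

        // Score an advertiser for a query: x is bid times click-through rate,
        // f is the fraction of the advertiser's budget still unspent.
        static double score(double bid, double clickThroughRate, double budget, double spent) {
            double x = bid * clickThroughRate;
            double f = (budget - spent) / budget;
            return x * (1 - Math.exp(-f));
        }

        public static void main(String[] args) {
            // Two hypothetical advertisers bidding on the same query.
            double scoreA = score(1.00, 0.05, 100.0, 90.0);  // higher bid, nearly exhausted budget
            double scoreB = score(0.80, 0.05, 100.0, 10.0);  // lower bid, mostly unspent budget
            System.out.println("A: " + scoreA + "  B: " + scoreB);
            System.out.println("award the ad to " + (scoreA >= scoreB ? "A" : "B"));
        }
    }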

8.7 References for Chapter 8

9 Recommendation Systems

There is an extensive class of Web applications that involve predicting user responses to options. Such a facility is called a recommendation system. We shall begin this chapter with a survey of the most important examples of these systems. However, to bring the problem into focus, two good examples of recommendation systems are:

1. Offering news articles to on-line newspaper readers, based on a prediction of reader interests.

2. Offering customers of an on-line retailer suggestions about what they might like to buy, based on their past history of purchases and/or product searches.

Recommendation systems use a number of different technologies. We can classify these systems into two broad groups.

  • Content-based systems examine properties of the items recommended. For instance, if a Netflix user has watched many cowboy movies, then recommend a movie classified in the database as having the “cowboy” genre.
  • Collaborative filtering systems recommend items based on similarity measures between users and/or items. The items recommended to a user are those preferred by similar users. This sort of recommendation system can use the groundwork laid in Chapter 3 on similarity search and Chapter 7 on clustering. However, these technologies by themselves are not sufficient, and there are some new algorithms that have proven effective for recommendation systems.

9.1 A Model for Recommendation Systems

9.1.1 The Utility Matrix
9.1.2 The Long Tail
9.1.3 Applications of Recommendation Systems
9.1.4 Populating the Utility Matrix

9.2 Content-Based Recommendations

9.2.1 Item Profiles
9.2.2 Discovering Features of Documents
9.2.3 Obtaining Item Features From Tags
9.2.4 Representing Item Profiles
9.2.5 User Profiles
9.2.6 Recommending Items to Users Based on Content
9.2.7 Classification Algorithms
9.2.8 Exercises for Section 9.2

9.3 Collaborative Filtering

9.3.1 Measuring Similarity
9.3.2 The Duality of Similarity
9.3.3 Clustering Users and Items
9.3.4 Exercises for Section 9.3

9.4 Dimensionality Reduction

9.4.1 UV-Decomposition
9.4.2 Root-Mean-Square Error
9.4.3 Incremental Computation of a UV-Decomposition
9.4.4 Optimizing an Arbitrary Element
9.4.5 Building a Complete UV-Decomposition Algorithm
9.4.6 Exercises for Section 9.4

9.5 The NetFlix Challenge

9.6 Summary of Chapter 9

  • Utility Matrices: Recommendation systems deal with users and items. A utility matrix offers known information about the degree to which a user likes an item. Normally, most entries are unknown, and the essential problem of recommending items to users is predicting the values of the unknown entries based on the values of the known entries.
  • Two Classes of Recommendation Systems: These systems attempt to predict a user’s response to an item by discovering similar items and the response of the user to those. One class of recommendation system is content-based; it measures similarity by looking for common features of the items. A second class of recommendation system uses collaborative filtering; these measure similarity of users by their item preferences and/or measure similarity of items by the users who like them.
  • Item Profiles: These consist of features of items. Different kinds of items have different features on which content-based similarity can be based. Features of documents are typically important or unusual words. Products have attributes such as screen size for a television. Media such as movies have a genre and details such as actor or performer. Tags can also be used as features if they can be acquired from interested users.
  • User Profiles: A content-based collaborative filtering system can construct profiles for users by measuring the frequency with which features appear in the items the user likes. We can then estimate the degree to which a user will like an item by the closeness of the item’s profile to the user’s profile.
  • Classification of Items: An alternative to constructing a user profile is to build a classifier for each user, e.g., a decision tree. The row of the utility matrix for that user becomes the training data, and the classifier must predict the response of the user to all items, whether or not the row had an entry for that item.
  • Similarity of Rows and Columns of the Utility Matrix: Collaborative filtering algorithms must measure the similarity of rows and/or columns of the utility matrix. Jaccard distance is appropriate when the matrix consists only of 1’s and blanks (for “not rated”). Cosine distance works for more general values in the utility matrix. It is often useful to normalize the utility matrix by subtracting the average value (either by row, by column, or both) before measuring the cosine distance.
  • Clustering Users and Items: Since the utility matrix tends to be mostly blanks, distance measures such as Jaccard or cosine often have too little data with which to compare two rows or two columns. A preliminary step or steps, in which similarity is used to cluster users and/or items into small groups with strong similarity, can help provide more common components with which to compare rows or columns.
  • UV-Decomposition: One way of predicting the blank values in a utility matrix is to find two long, thin matrices U and V , whose product is an approximation to the given utility matrix. Since the matrix product UV gives values for all user-item pairs, that value can be used to predict the value of a blank in the utility matrix. The intuitive reason this method makes sense is that often there are a relatively small number of issues (that number is the “thin” dimension of U and V ) that determine whether or not a user likes an item.
  • Root-Mean-Square Error: A good measure of how close the product UV is to the given utility matrix is the RMSE (root-mean-square error). The RMSE is computed by averaging the square of the differences between UV and the utility matrix, in those elements where the utility matrix is nonblank. The square root of this average is the RMSE. A short sketch of this computation appears after this list.
  • Computing U and V : One way of finding a good choice for U and V in a UV-decomposition is to start with arbitrary matrices U and V . Repeatedly adjust one of the elements of U or V to minimize the RMSE between the product UV and the given utility matrix. The process converges to a local optimum, although to have a good chance of obtaining a global optimum we must either repeat the process from many starting matrices, or search from the starting point in many different ways.
  • The NetFlix Challenge: An important driver of research into recommendation systems was the NetFlix challenge. A prize of $1,000,000 was offered for a contestant who could produce an algorithm that was 10% better than NetFlix’s own algorithm at predicting movie ratings by users. The prize was awarded in September 2009.
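
The RMSE used to judge a UV-decomposition is easy to state in code. The following is a minimal Java sketch, not code from the text; it assumes the utility matrix M is stored as a double[][] with Double.NaN marking blank entries (an assumption of this example), U is n-by-d, and V is d-by-m, with at least one known entry in M.

    public class UVRmse {
        // RMSE between the product UV and the utility matrix M.
        // Blank entries of M are assumed to be marked with Double.NaN.
        static double rmse(double[][] M, double[][] U, double[][] V) {
            int n = M.length, m = M[0].length, d = U[0].length;
            double sumSq = 0;
            int known = 0;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < m; j++) {
                    if (Double.isNaN(M[i][j])) continue;   // skip unknown entries
                    double pred = 0;                       // entry (i, j) of UV
                    for (int k = 0; k < d; k++) pred += U[i][k] * V[k][j];
                    double diff = pred - M[i][j];
                    sumSq += diff * diff;
                    known++;
                }
            }
            return Math.sqrt(sumSq / known);
        }
    }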

9.7 References for Chapter 9

10 Mining Social-Network Graphs

There is much information to be gained by analyzing the large-scale data that is derived from social networks. The best-known example of a social network is the “friends” relation found on sites like Facebook. However, as we shall see, there are many other sources of data that connect people or other entities.

In this chapter, we shall study techniques for analyzing such networks. An important question about a social network is how to identify “communities,” that is, subsets of the nodes (people or other entities that form the network) with unusually strong connections. Some of the techniques used to identify communities are similar to the clustering algorithms we discussed in Chapter 7. However, communities almost never partition the set of nodes in a network. Rather, communities usually overlap. For example, you may belong to several communities of friends or classmates. The people from one community tend to know each other, but people from two different communities rarely know each other. You would not want to be assigned to only one of the communities, nor would it make sense to cluster all the people from all your communities into one cluster.

Also in this chapter we explore efficient algorithms for discovering other properties of graphs. We look at “simrank,” a way to discover similarities among nodes of a graph. We explore triangle counting as a way to measure the connectedness of a community. We give efficient algorithms for exact and approximate measurement of the neighborhood sizes of nodes in a graph. Finally, we look at efficient algorithms for computing the transitive closure.

10.1 Social Networks as Graphs

10.1.1 What is a Social Network?
10.1.2 Social Networks as Graphs
10.1.3 Varieties of Social Networks
10.1.4 Graphs With Several Node Types
10.1.5 Exercises for Section 10.1

10.2 Clustering of Social-Network Graphs

10.2.1 Distance Measures for Social-Network Graphs
10.2.2 Applying Standard Clustering Methods
10.2.3 Betweenness
10.2.4 The Girvan-Newman Algorithm
10.2.5 Using Betweenness to Find Communities
10.2.6 Exercises for Section 10.2

10.3 Direct Discovery of Communities

10.3.1 Finding Cliques
10.3.2 Complete Bipartite Graphs
10.3.3 Finding Complete Bipartite Subgraphs
10.3.4 Why Complete Bipartite Graphs Must Exist
10.3.5 Exercises for Section 10.3

10.4 Partitioning of Graphs

10.4.1 What Makes a Good Partition?
10.4.2 Normalized Cuts
10.4.3 Some Matrices That Describe Graphs
10.4.4 Eigenvalues of the Laplacian Matrix
10.4.5 Alternative Partitioning Methods
10.4.6 Exercises for Section 10.4

10.5 Finding Overlapping Communities

10.5.1 The Nature of Communities
10.5.2 Maximum-Likelihood Estimation
10.5.3 The Affiliation-Graph Model
10.5.4 Avoiding the Use of Discrete Membership Changes
10.5.5 Exercises for Section 10.5

10.6 Simrank

10.6.1 Random Walkers on a Social Graph
10.6.2 Random Walks with Restart
10.6.3 Exercises for Section 10.6

10.7 Counting Triangles

10.7.1 Why Count Triangles?
10.7.2 An Algorithm for Finding Triangles
10.7.3 Optimality of the Triangle-Finding Algorithm
10.7.4 Finding Triangles Using MapReduce
10.7.5 Using Fewer Reduce Tasks
10.7.6 Exercises for Section 10.7

10.8 Neighborhood Properties of Graphs

10.8.1 Directed Graphs and Neighborhoods
10.8.2 The Diameter of a Graph
10.8.3 Transitive Closure and Reachability
10.8.4 Transitive Closure Via MapReduce
10.8.5 Smart Transitive Closure
10.8.6 Transitive Closure by Graph Reduction
10.8.7 Approximating the Sizes of Neighborhoods
10.8.8 Exercises for Section 10.8

10.9 Summary of Chapter 10

  • Social-Network Graphs: Graphs that represent the connections in a social network are not only large, but they exhibit a form of locality, where small subsets of nodes (communities) have a much higher density of edges than the average density.
  • Communities and Clusters: While communities resemble clusters in some ways, there are also significant differences. Individuals (nodes) normally belong to several communities, and the usual distance measures fail to represent closeness among nodes of a community. As a result, standard algorithms for finding clusters in data do not work well for community finding.
  • Betweenness: One way to separate nodes into communities is to measure the betweenness of edges, which is the sum over all pairs of nodes of the fraction of shortest paths between those nodes that go through the given edge. Communities are formed by deleting the edges whose betweenness is above a given threshold.
  • The Girvan-Newman Algorithm: The Girvan-Newman Algorithm is an efficient technique for computing the betweenness of edges. A breadth-first search from each node is performed, and a sequence of labeling steps computes the share of paths from the root to each other node that go through each of the edges. The shares for an edge that are computed for each root are summed to get the betweenness.
  • Communities and Complete Bipartite Graphs: A complete bipartite graph has two groups of nodes, all possible edges between pairs of nodes chosen one from each group, and no edges between nodes of the same group. Any sufficiently dense community (a set of nodes with many edges among them) will have a large complete bipartite graph.
  • Finding Complete Bipartite Graphs: We can find complete bipartite graphs by the same techniques we used for finding frequent itemsets. Nodes of the graph can be thought of both as the items and as the baskets. The basket corresponding to a node is the set of adjacent nodes, thought of as items. Finding a complete bipartite subgraph with node groups of size t and s can then be viewed as finding frequent itemsets of size t with support s.
  • Graph Partitioning: One way to find communities is to partition a graph repeatedly into pieces of roughly similar sizes. A cut is a partition of the nodes of the graph into two sets, and its size is the number of edges that have one end in each set. The volume of a set of nodes is the number of edges with at least one end in that set.
  • Normalized Cuts: We can normalize the size of a cut by taking the ratio of the size of the cut and the volume of each of the two sets formed by the cut. Then add these two ratios to get the normalized cut value. Normalized cuts with a low sum are good, in the sense that they tend to divide the nodes into two roughly equal parts, and have a relatively small size.
  • Adjacency Matrices: These matrices describe a graph. The entry in row i and column j is 1 if there is an edge between nodes i and j, and 0 otherwise.
  • Degree Matrices: The degree matrix for a graph has d in the ith diagonal entry if d is the degree of the ith node. Off the diagonal, all entries are 0.
  • Laplacian Matrices: The Laplacian matrix for a graph is its degree matrix minus its adjacency matrix. That is, the entry in row i and column i of the Laplacian matrix is the degree of the ith node of the graph, and the entry in row i and column j, for i ≠ j, is −1 if there is an edge between nodes i and j, and 0 otherwise. A short sketch of this construction appears after this list.
  • Spectral Method for Partitioning Graphs: The lowest eigenvalue for any Laplacian matrix is 0, and its corresponding eigenvector consists of all 1’s. The eigenvectors corresponding to small eigenvalues can be used to guide a partition of the graph into two parts of similar size with a small cut value. For one example, putting the nodes with a positive component in the eigenvector with the second-smallest eigenvalue into one set and those with a negative component into the other is usually good.
  • Overlapping Communities: Typically, individuals are members of several communities. In graphs describing social networks, it is normal for the probability that two individuals are friends to rise as the number of communities of which both are members grows.
  • The Affiliation-Graph Model: An appropriate model for membership in communities is to assume that for each community there is a probability that because of this community two members become friends (have an edge in the social network graph). Thus, the probability that two nodes have an edge is 1 minus the product, over the communities of which both are members, of the probability that the community does not cause there to be an edge between them. We then find the assignment of nodes to communities and the values of those probabilities that best describes the observed social graph.
  • Maximum-Likelihood Estimation: An important modeling technique, useful for modeling communities as well as many other things, is to compute, as a function of all choices of parameter values that the model allows, the probability that the observed data would be generated. The values that yield the highest probability are assumed to be correct, and called the maximum-likelihood estimate (MLE).
  • Use of Gradient Descent: If we know membership in communities, we can find the MLE by gradient descent or other methods. However, we cannot find the best membership in communities by gradient descent, because membership is discrete, not continuous.
  • Improved Community Modeling by Strength of Membership: We can formulate the problem of finding the MLE of communities in a social graph by assuming individuals have a strength of membership in each community, possibly 0 if they are not a member. If we define the probability of an edge between two nodes to be a function of their membership strengths in their common communities, we can turn the problem of finding the MLE into a continuous problem and solve it using gradient descent.
  • Simrank: One way to measure the similarity of nodes in a graph with several types of nodes is to start a random walker at one node and allow it to wander, with a fixed probability of restarting at the same node. The distribution of where the walker can be expected to be is a good measure of the similarity of nodes to the starting node. This process must be repeated with each node as the starting node if we are to get all-pairs similarity.
  • Triangles in Social Networks: The number of triangles per node is an important measure of the closeness of a community and often reflects its maturity. We can enumerate or count the triangles in a graph with m edges in O(m^(3/2)) time, but no more efficient algorithm exists in general.
  • Triangle Finding by MapReduce: We can find triangles in a single round of MapReduce by treating it as a three-way join. Each edge must be sent to a number of reducers proportional to the cube root of the total number of reducers, and the total computation time spent at all the reducers is proportional to the time of the serial algorithm for triangle finding.
  • Neighborhoods: The neighborhood of radius d for a node v in a directed or undirected graph is the set of nodes reachable from v along paths of length at most d. The neighborhood profile of a node is the sequence of neighborhood sizes for all distances from 1 upwards. The diameter of a connected graph is the smallest d for which the neighborhood of radius d for any starting node includes the entire graph.
  • Transitive Closure: A node v can reach node u if u is in the neighborhood of v for some radius. The transitive closure of a graph is the set of pairs of nodes (v, u) such that v can reach u.
  • Computing Transitive Closure: Since the transitive closure can have a number of facts equal to the square of the number of nodes of a graph, it is infeasible to compute transitive closure directly for large graphs. One approach is to find strongly connected components of the graph and collapse them each to a single node before computing the transitive closure.
  • Transitive Closure and MapReduce: We can view transitive closure computation as the iterative join of a path relation (pairs of nodes v and u such that u is known to be reachable from v) and the arc relation of the graph. Such an approach requires a number of MapReduce rounds equal to the diameter of the graph.
  • Transitive Closure by Recursive Doubling: An approach that uses fewer MapReduce rounds is to join the path relation with itself at each round. At each round, we double the length of paths that are able to contribute to the transitive closure. Thus, the number of needed rounds is only the base-2 logarithm of the diameter of the graph.
  • Smart Transitive Closure: While recursive doubling can cause the same path to be considered many times, and thus increases the total computation time (compared with iteratively joining paths with single arcs), a variant called smart transitive closure avoids discovering the same path more than once. The trick is to require that when joining two paths, the first has a length that is a power of 2.
  • Approximating Neighborhood Sizes: By using the Flajolet-Martin technique for approximating the number of distinct elements in a stream, we can find the neighborhood sizes at different radii approximately. We maintain a set of tail lengths for each node. To increase the radius by 1, we examine each edge (u, v) and for each tail length for u we set it equal to the corresponding tail length for v if the latter is larger than the former.
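
The adjacency, degree, and Laplacian matrices described above can be built directly from an edge list. The following is a minimal Java sketch, not code from the text; it assumes an undirected graph on nodes numbered 0 through n−1, given as an array of two-element edge arrays, with each edge listed once.

    public class Laplacian {
        // Build L = D - A for an undirected graph on nodes 0..n-1.
        static int[][] laplacian(int n, int[][] edges) {
            int[][] A = new int[n][n];   // adjacency matrix
            int[] degree = new int[n];   // diagonal of the degree matrix
            for (int[] e : edges) {
                A[e[0]][e[1]] = 1;
                A[e[1]][e[0]] = 1;
                degree[e[0]]++;
                degree[e[1]]++;
            }
            int[][] L = new int[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    L[i][j] = (i == j) ? degree[i] : -A[i][j];
            return L;
        }
    }

The eigenvectors of this matrix corresponding to its smallest nonzero eigenvalues are what the spectral partitioning method described above uses to split the graph.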

10.10 References for Chapter 10

11 Dimensionality Reduction

There are many sources of data that can be viewed as a large matrix. We saw in Chapter 5 how the Web can be represented as a transition matrix. In Chapter 9, the utility matrix was a point of focus. And in Chapter 10 we examined matrices that represent social networks. In many of these matrix applications, the matrix can be summarized by finding “narrower” matrices that in some sense are close to the original. These narrow matrices have only a small number of rows or a small number of columns, and therefore can be used much more efficiently than can the original large matrix. The process of finding these narrow matrices is called dimensionality reduction.

We saw a preliminary example of dimensionality reduction in Section 9.4. There, we discussed UV-decomposition of a matrix and gave a simple algorithm for finding this decomposition. Recall that a large matrix M was decomposed into two matrices U and V whose product UV was approximately M. The matrix U had a small number of columns whereas V had a small number of rows, so each was significantly smaller than M, and yet together they represented most of the information in M that was useful in predicting ratings of items by individuals.

In this chapter we shall explore the idea of dimensionality reduction in more detail. We begin with a discussion of eigenvalues and their use in “principal component analysis” (PCA). We cover singular-value decomposition, a more powerful version of UV-decomposition. Finally, because we are always interested in the largest data sizes we can handle, we look at another form of decomposition, called CUR-decomposition, which is a variant of singular-value decomposition that keeps the matrices of the decomposition sparse if the original matrix is sparse.

11.1 Eigenvalues and Eigenvectors

11.1.1 Definitions
11.1.2 Computing Eigenvalues and Eigenvectors
11.1.3 Finding Eigenpairs by Power Iteration
11.1.4 The Matrix of Eigenvectors
11.1.5 Exercises for Section 11.1

11.2 Principal-Component Analysis

11.2.1 An Illustrative Example
11.2.2 Using Eigenvectors for Dimensionality Reduction
11.2.3 The Matrix of Distances
11.2.4 Exercises for Section 11.2

11.3 Singular-Value Decomposition

11.3.1 Definition of SVD
11.3.2 Interpretation of SVD
11.3.3 Dimensionality Reduction Using SVD
11.3.4 Why Zeroing Low Singular Values Works
11.3.5 Querying Using Concepts
11.3.6 Computing the SVD of a Matrix
11.3.7 Exercises for Section 11.3

11.4 CUR Decomposition

11.4.1 Definition of CUR
11.4.2 Choosing Rows and Columns Properly
11.4.3 Constructing the Middle Matrix
11.4.4 The Complete CUR Decomposition
11.4.5 Eliminating Duplicate Rows and Columns
11.4.6 Exercises for Section 11.4

11.5 Summary of Chapter 11

  • Dimensionality Reduction: The goal of dimensionality reduction is to replace a large matrix by two or more other matrices whose sizes are much smaller than the original, but from which the original can be approximately reconstructed, usually by taking their product.
  • Eigenvalues and Eigenvectors: A matrix may have several eigenvectors such that when the matrix multiplies the eigenvector, the result is a constant multiple of the eigenvector. That constant is the eigenvalue associated with this eigenvector. Together the eigenvector and its eigenvalue are called an eigenpair.
  • Finding Eigenpairs by Power Iteration: We can find the principal eigenvector (the eigenvector with the largest eigenvalue) by starting with any vector and repeatedly multiplying the current vector by the matrix to get a new vector. When the changes to the vector become small, we can treat the result as a close approximation to the principal eigenvector. By modifying the matrix, we can then use the same iteration to get the second eigenpair (that with the second-largest eigenvalue), and similarly get each of the eigenpairs in turn, in order of decreasing eigenvalue. A short sketch of the basic iteration appears after this list.
  • Principal-Component Analysis: This technique for dimensionality reduction views data consisting of a collection of points in a multidimensional space as a matrix, with rows corresponding to the points and columns to the dimensions. The product of this matrix and its transpose has eigenpairs, and the principal eigenvector can be viewed as the direction in the space along which the points best line up. The second eigenvector represents the direction in which deviations from the principal eigenvector are the greatest, and so on.
  • Dimensionality Reduction by PCA: By representing the matrix of points by a small number of its eigenvectors, we can approximate the data in a way that minimizes the root-mean-square error for the given number of columns in the representing matrix.
  • Singular-Value Decomposition: The singular-value decomposition of a matrix consists of three matrices, U, Σ, and V. The matrices U and V are column-orthonormal, meaning that as vectors, the columns are orthogonal, and their lengths are 1. The matrix Σ is a diagonal matrix, and the values along its diagonal are called singular values. The product of U, Σ, and the transpose of V equals the original matrix.
  • Concepts: SVD is useful when there are a small number of concepts that connect the rows and columns of the original matrix. For example, if the original matrix represents the ratings given by movie viewers (rows) to movies (columns), the concepts might be the genres of the movies. The matrix U connects rows to concepts, Σ represents the strengths of the concepts, and V connects the concepts to columns.
  • Queries Using the Singular-Value Decomposition: We can use the decomposition to relate new or hypothetical rows of the original matrix to the concepts represented by the decomposition. Multiply a row by the matrix V of the decomposition to get a vector indicating the extent to which that row matches each of the concepts.
  • Using SVD for Dimensionality Reduction: In a complete SVD for a matrix, U and V are typically as large as the original. To use fewer columns for U and V, delete the columns corresponding to the smallest singular values from U, V, and Σ. This choice minimizes the error in reconstructing the original matrix from the modified U, Σ, and V.
  • Decomposing Sparse Matrices: Even in the common case where the given matrix is sparse, the matrices constructed by SVD are dense. The CUR decomposition seeks to decompose a sparse matrix into sparse, smaller matrices whose product approximates the original matrix.
  • CUR Decomposition: This method chooses from a given sparse matrix a set of columns C and a set of rows R, which play the role of U and V^T in SVD; the user can pick any number of rows and columns. The choice of rows and columns is made randomly with a distribution that depends on the Frobenius norm, or the square root of the sum of the squares of the elements. Between C and R is a square matrix called U that is constructed by a pseudo-inverse of the intersection of the chosen rows and columns.
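
The basic power-iteration step described above is short. The following is a minimal Java sketch, not code from the text; it repeatedly multiplies a starting vector by the matrix and rescales it to unit length until the change per iteration falls below a tolerance, yielding an approximation to the principal eigenvector (the corresponding eigenvalue can then be estimated from the quadratic form x^T M x of the resulting unit vector).

    public class PowerIteration {
        // Approximate the principal eigenvector of the square matrix M.
        static double[] principalEigenvector(double[][] M, int maxIters, double tol) {
            int n = M.length;
            double[] x = new double[n];
            java.util.Arrays.fill(x, 1.0);           // arbitrary nonzero starting vector
            for (int iter = 0; iter < maxIters; iter++) {
                double[] y = new double[n];          // y = M x
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < n; j++)
                        y[i] += M[i][j] * x[j];
                double norm = 0;
                for (double v : y) norm += v * v;
                norm = Math.sqrt(norm);
                double change = 0;
                for (int i = 0; i < n; i++) {
                    y[i] /= norm;                    // rescale to unit length
                    change += Math.abs(y[i] - x[i]);
                }
                x = y;
                if (change < tol) break;             // vector has essentially stopped changing
            }
            return x;
        }
    }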

11.6 References for Chapter 11

12 Large-Scale Machine Learning

Many algorithms are today classified as “machine learning.” These algorithms share, with the other algorithms studied in this book, the goal of extracting information from data. All algorithms for analysis of data are designed to produce a useful summary of the data, from which decisions are made. Among many examples, the frequent-itemset analysis that we did in Chapter 6 produces information like association rules, which can then be used for planning a sales strategy or for many other purposes.

However, algorithms called “machine learning” not only summarize our data; they are perceived as learning a model or classifier from the data, and thus discover something about data that will be seen in the future. For instance, the clustering algorithms discussed in Chapter 7 produce clusters that not only tell us something about the data being analyzed (the training set), but they allow us to classify future data into one of the clusters that result from the clustering algorithm. Thus, machine-learning enthusiasts often speak of clustering with the neologism “unsupervised learning”; the term unsupervised refers to the fact that the input data does not tell the clustering algorithm what the clusters should be. In supervised machine learning, which is the subject of this chapter, the available data includes information about the correct way to classify at least some of the data. The data classified already is called the training set.

In this chapter, we do not attempt to cover all the different approaches to machine learning. We concentrate on methods that are suitable for very large data and that have the potential for parallel implementation. We consider the classical “perceptron” approach to learning a data classifier, where a hyperplane that separates two classes is sought. Then, we look at more modern techniques involving support-vector machines. Similar to perceptrons, these methods look for hyperplanes that best divide the classes, so that few, if any, members of the training set lie close to the hyperplane. We end with a discussion of nearest-neighbor techniques, where data is classified according to the class(es) of their nearest neighbors in some space.

12.1 The Machine-Learning Model 

12.1.1 Training Sets
12.1.2 Some Illustrative Examples
12.1.3 Approaches to Machine Learning
12.1.4 Machine-Learning Architecture
12.1.5 Exercises for Section 12.1

12.2 Perceptrons

12.2.1 Training a Perceptron with Zero Threshold
12.2.2 Convergence of Perceptrons
12.2.3 The Winnow Algorithm
12.2.4 Allowing the Threshold to Vary
12.2.5 Multiclass Perceptrons
12.2.6 Transforming the Training Set
12.2.7 Problems With Perceptrons
12.2.8 Parallel Implementation of Perceptrons
12.2.9 Exercises for Section 12.2

12.3 Support-Vector Machines

12.3.1 The Mechanics of an SVM
12.3.2 Normalizing the Hyperplane
12.3.3 Finding Optimal Approximate Separators
12.3.4 SVM Solutions by Gradient Descent
12.3.5 Stochastic Gradient Descent
12.3.6 Parallel Implementation of SVM
12.3.7 Exercises for Section 12.3

12.4 Learning from Nearest Neighbors

12.4.1 The Framework for Nearest-Neighbor Calculations
12.4.2 Learning with One Nearest Neighbor
12.4.3 Learning One-Dimensional Functions
12.4.4 Kernel Regression
12.4.5 Dealing with High-Dimensional Euclidean Data
12.4.6 Dealing with Non-Euclidean Distances
12.4.7 Exercises for Section 12.4

12.5 Comparison of Learning Methods

12.6 Summary of Chapter 12

  • Training Sets: A training set consists of examples, each of which pairs a feature vector, whose components are the features, with a label indicating the class to which the object represented by the feature vector belongs. Features can be categorical – belonging to an enumerated list of values – or numerical.
  • Test Sets and Overfitting: When training some classifier on a training set, it is useful to remove some of the training set and use the removed data as a test set. After producing a model or classifier without using the test set, we can run the classifier on the test set to see how well it does. If the classifier does not perform as well on the test set as on the training set used, then we have overfit the training set by conforming to peculiarities of the training-set data that are not present in the data as a whole.
  • Batch Versus On-Line Learning: In batch learning, the training set is available at any time and can be used in repeated passes. On-line learning uses a stream of training examples, each of which can be used only once.
  • Perceptrons: This machine-learning method assumes the training set has only two class labels, positive and negative. Perceptrons work when there is a hyperplane that separates the feature vectors of the positive examples from those of the negative examples. We converge to that hyperplane by adjusting our estimate of the hyperplane by a fraction – the learning rate – of the direction that is the average of the currently misclassified points. A short sketch of this update appears after this list.
  • The Winnow Algorithm: This algorithm is a variant of the perceptron algorithm that requires components of the feature vectors to be 0 or 1. Training examples are examined in a round-robin fashion, and if the current classification of a training example is incorrect, the components of the estimated separator where the feature vector has 1 are adjusted up or down, in the direction that will make it more likely this training example is correctly classified in the next round.
  • Nonlinear Separators: When the training points do not have a linear function that separates two classes, it may still be possible to use a perceptron to classify them. We must find a function we can use to transform the points so that in the transformed space, the separator is a hyperplane.
  • Support-Vector Machines: The SVM improves upon perceptrons by finding a separating hyperplane that not only separates the positive and negative points, but does so in a way that maximizes the margin – the distance perpendicular to the hyperplane to the nearest points. The points that lie exactly at this minimum distance are the support vectors. Alternatively, the SVM can be designed to allow points that are too close to the hyperplane, or even on the wrong side of the hyperplane, but minimize the error due to such misplaced points.
  • Solving the SVM Equations: We can set up a function of the vector that is normal to the hyperplane, the length of the vector (which determines the margin), and the penalty for points on the wrong side of the margins. The regularization parameter determines the relative importance of a wide margin and a small penalty. The equations can be solved by several methods, including gradient descent and quadratic programming.
  • Nearest-Neighbor Learning: In this approach to machine learning, the entire training set is used as the model. For each (“query”) point to be classified, we search for its k nearest neighbors in the training set. The classification of the query point is some function of the labels of these k neighbors. The simplest case is when k = 1, in which case we can take the label of the query point to be the label of the nearest neighbor.
  • Regression: A common case of nearest-neighbor learning, called regression, occurs when there is only one feature, and both it and the label are real numbers; i.e., the data defines a real-valued function of one variable. To estimate the label, i.e., the value of the function, for an unlabeled data point, we can perform some computation involving the k nearest neighbors. Examples include averaging the neighbors or taking a weighted average, where the weight of a neighbor is some decreasing function of its distance from the point whose label we are trying to determine.
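
The averaged perceptron update described above is simple to sketch. The following minimal Java example is not code from the text: it assumes labels of +1 or −1, a zero threshold, and a learning rate eta, and at each step it moves the weight vector by eta times the average of (label × feature vector) over the currently misclassified points, stopping when no point is misclassified.

    public class Perceptron {
        // Batch-style perceptron training with a zero threshold.
        // X holds the feature vectors, y the labels (+1 or -1).
        static double[] train(double[][] X, int[] y, double eta, int maxSteps) {
            int d = X[0].length;
            double[] w = new double[d];                // start at the origin
            for (int step = 0; step < maxSteps; step++) {
                double[] dir = new double[d];          // sum of y_i * x_i over misclassified points
                int mistakes = 0;
                for (int i = 0; i < X.length; i++) {
                    double dot = 0;
                    for (int j = 0; j < d; j++) dot += w[j] * X[i][j];
                    if (y[i] * dot <= 0) {             // misclassified (or on the boundary)
                        for (int j = 0; j < d; j++) dir[j] += y[i] * X[i][j];
                        mistakes++;
                    }
                }
                if (mistakes == 0) break;              // all training points classified correctly
                for (int j = 0; j < d; j++) w[j] += eta * dir[j] / mistakes;
            }
            return w;
        }
    }

If the classes are linearly separable, the loop terminates once a separating hyperplane through the origin is found; otherwise it simply stops after maxSteps passes.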

12.7 References for Chapter 12
