Table of contents
  1. Story
  2. Spotfire Screen Captures
    1. Cover Page
    2. Chapter 03 Data Set
    3. Chapter 04 Data Set
    4. Chapter 05 Data Set
    5. Chapter 06 Data Set
    6. Chapter 07 Data Set Scoring
    7. Chapter 07 Data Set Training
    8. Chapter 08 Data Set: MyModel
    9. Chapter 09 Data Set Scoring
    10. Chapter 09 Data Set Training
    11. Chapter 10 Data Set Scoring
    12. Chapter 10 Data Set Training
    13. Chapter 11 Data Set Scoring
    14. Chapter 11 Data Set Training
    15. Chapter 11 Exercise Training Data
  3. Spotfire Dashboard
  4. Research Notes
    1. Scientific Data Mining GMU Fall Semester
      1. Syllabus
      2. Reading Assignments
    2. 2014 George Mason and IBM Symposium on Diverse Data Analytics Applications
      1. Preface
      2. Slides
      3. Agenda
  5. Data Mining for the Masses
  6. Data Mining for the Masses Data Sets
  7. Data Mining for the Masses
    1. Cover Page
    2. Dedication
    3. Acknowledgements
    4. Section One: Data Mining Basics
      1. Chapter One: Introduction to Data Mining and CRISP-DM
        1. Introduction
        2. A Note About Tools
        3. The Data Mining Process
          1. Figure 1-1: CRISP-DM Conceptual Model
          2. CRISP-DM Step 1: Business (Organizational) Understanding
          3. CRISP-DM Step 2: Data Understanding
          4. CRISP-DM Step 3: Data Preparation
          5. CRISP-DM Step 4: Modeling
          6. Figure 1-2. Types of Data Mining Models
          7. CRISP-DM Step 5: Evaluation
          8. CRISP-DM Step 6: Deployment
        4. Data Mining and You
      2. Chapter Two: Organizational Understanding and Data Understanding
        1. Context and Perspective
        2. Learning Objectives
        3. Purposes, Intents and Limitations of Data Mining
        4. Database, Data Warehouse, Data Mart, Data Set…?
          1. Figure 2-1: Data arranged in columns and rows
          2. Figure 2-2: A simple database with a relation between two tables
          3. Figure 2-3: A combination of the tables into a single data set
        5. Types of Data
        6. A Note about Privacy and Security
        7. Chapter Summary
        8. Review Questions
        9. Exercises
      3. Chapter Three: Data Preparation
        1. Context and Perspective
        2. Learning Objectives
        3. Collation
        4. Data Scrubbing
        5. Hands on Exercise
        6. Preparing RapidMiner, Importing Data, and Handling Missing Data
        7. Data Reduction
        8. Handling Inconsistent Data
        9. Attribute Reduction
        10. Chapter Summary
        11. Review Questions
        12. Exercise
    5. Section Two: Data Mining Models and Methods
      1. Chapter Four: Correlation
        1. Context and Perspective
        2. Organizational Understanding
        3. Data Understanding
        4. Data Preparation
        5. Modeling
        6. Evaluation
        7. Deployment
        8. Chapter Summary
        9. Review Questions
        10. Exercise
          1. Figure 4-8. A two-dimensional scatterplot with a colored third dimension and a slight jitter
          2. Figure 4-9. A three-dimensional scatterplot with a colored fourth dimension
      2. Chapter Five: Association Rules
        1. Context and Perspective
        2. Learning Objectives
        3. Organizational Understanding
        4. Data Understanding
        5. Data Preparation
        6. Modeling
        7. Evaluation
        8. Deployment
        9. Chapter Summary
        10. Review Questions
        11. Exercise
      3. Chapter Six: k-Means Clustering
        1. Context and Perspective
        2. Learning Objectives
        3. Organizational Understanding
        4. Data Understanding
        5. Data Preparation
        6. Modeling
        7. Evaluation
        8. Deployment
        9. Chapter Summary
        10. Review Questions
        11. Exercise
      4. Chapter Seven: Discriminant Analysis
        1. Context and Perspective
        2. Learning Objectives
        3. Organizational Understanding
        4. Data Understanding
        5. Data Preparation
        6. Modeling
        7. Evaluation
        8. Deployment
        9. Chapter Summary
        10. Review Questions
        11. Exercise
      5. Chapter Eight: Linear Regression
        1. Context and Perspective
        2. Learning Objectives
        3. Organizational Understanding
        4. Data Understanding
        5. Data Preparation
        6. Modeling
        7. Evaluation
        8. Deployment
        9. Chapter Summary
        10. Review Questions
        11. Exercise
      6. Chapter Nine: Logistic Regression
        1. Context and Perspective
        2. Learning Objectives
        3. Organizational Understanding
        4. Data Understanding
        5. Data Preparation
        6. Modeling
        7. Evaluation
        8. Deployment
        9. Chapter Summary
        10. Review Questions
        11. Exercise
      7. Chapter Ten: Decision Trees
        1. Context and Perspective
        2. Learning Objectives
        3. Organizational Understanding
        4. Data Understanding
        5. Data Preparation
        6. Modeling
        7. Evaluation
          1. Figure 10-12. Tree resulting from a gini_index algorithm
        8. Deployment
        9. Chapter Summary
        10. Review Questions
        11. Exercise
      8. Chapter Eleven: Neural Networks
        1. Context and Perspective
        2. Learning Objectives
        3. Organizational Understanding
        4. Data Understanding
        5. Data Preparation
        6. Modeling
        7. Evaluation
          1. Figure 11-5. A graphical view of our neural network showing different strength neurons and the four nodes for each of the possible Team_Value categories
        8. Deployment
        9. Chapter Summary
        10. Review Questions
        11. Exercise
      9. Chapter Twelve: Text Mining
        1. Context and Perspective
        2. Learning Objectives
        3. Organizational Understanding
        4. Data Understanding
        5. Data Preparation
        6. Modeling
        7. Evaluation
        8. Deployment
        9. Chapter Summary
        10. Review Questions
        11. Exercise
    6. Section Three: Special Considerations in Data Mining
      1. Chapter Thirteen: Evaluation and Deployment
        1. How Far We’ve Come
        2. Learning Objectives
        3. Cross-Validation
        4. Chapter Summary: The Value of Experience
        5. Review Questions
        6. Exercise
      2. Chapter Fourteen: Data Mining Ethics
        1. Why Data Mining Ethics?
          1. FIGURE 14-1. This Just In: BEING AN ETHICAL DATA MINER IS IMPORTANT
        2. Ethical Frameworks and Suggestions
        3. Conclusion
    7. Glossary and Index
      1. Antecedent
      2. Archived Data
      3. Association Rules
      4. Attribute
      5. Average
      6. Binomial
      7. Binominal
      8. Business Understanding
      9. Case
      10. Case Sensitive
      11. Classification
      12. Code
      13. Coefficient
      14. Column
      15. Comma Separated Values (CSV)
      16. Conclusion
      17. Confidence (Alpha) Level
      18. Confidence Percent
      19. Consequent
      20. Correlation
      21. CRISP-DM
      22. Cross-validation
      23. Data
      24. Data Analysis
      25. Data Mart
      26. Data Mining
      27. Data Preparation
      28. Data Set
      29. Data Type
      30. Data Understanding
      31. Data Warehouse
      32. Database
      33. Decision Tree
      34. Denormalization
      35. Dependent Variable (Attribute)
      36. Deployment
      37. Descartes' Rule of Change
      38. Design Perspective
      39. Discriminant Analysis
      40. Ethics
      41. Evaluation
      42. False Positive
      43. Field
      44. Frequency Pattern
      45. Fuzzy Logic
      46. Gain Ratio
      47. Gini Index
      48. Heterogeneity
      49. Inconsistent Data
      50. Independent Variable (Attribute)
      51. Jittering
      52. Join
      53. Kant's Categorical Imperative
      54. k-Means Clustering
      55. Label
      56. Laws
      57. Leaf
      58. Linear Regression
      59. Logistic Regression
      60. Markets
      61. Mean
      62. Median
      63. Meta Data
      64. Missing Data
      65. Mode
      66. Model
      67. Name (Attribute)
      68. Neural Network
      69. n-Gram
      70. Node
      71. Normalization
      72. Null
      73. Observation
      74. Online Analytical Processing (OLAP)
      75. Online Transaction Processing (OLTP)
      76. Operational Data
      77. Operator
      78. Organizational Data
      79. Organizational Understanding
      80. Parameters
      81. Port
      82. Prediction
      83. Premise
      84. Privacy
      85. Professional Code of Conduct
      86. Query
      87. Record
      88. Relational Database
      89. Repository
      90. Results Perspective
      91. Role (Attribute)
      92. Row
      93. Sample
      94. Scoring Data
      95. Social Norms
      96. Spline
      97. Standard Deviation
      98. Standard Operating Procedures
      99. Statistical Significance
      100. Stemming
      101. Stopwords
      102. Stream
      103. Structured Query Language (SQL)
      104. Sub-process
      105. Support Percent
      106. Table
      107. Target Attribute
      108. Technology
      109. Text Mining
      110. Token (Tokenize)
      111. Training Data
      112. Tuple
      113. Variable
      114. View
    8. About the Author
  8. NEXT

Data Science for Data Mining


Story

Data Science for Data Mining

I suggested "Data Mining for the Masses" by Matthew North. It uses the CRISP-DM (Cross Industry Standard Process for Data Mining) conceptual model, the same model used in the Data Science for Business book I did the tutorial for.

GMU Professor Borne uses the title in his talks and the book in his Data Science Class: http://kirkborne.net/cds401fall2014/ My Note: See below

Available at Amazon.com: http://www.amazon.com/Data-Mining-Ma...for+the+masses...

Book datasets are available: https://sites.google.com/site/dataminingforthemasses/ My Note: See below

Recent book review: http://www.onlineprogrammingbooks.co...mining-masses/ My Note: See below

Free PDF download of the book: http://dl.dropbox.com/u/31779972/Dat...rTheMasses.pdf My Note: See below

I will do a tutorial on this and would welcome anyone else doing and presenting on this as well. The steps I followed are as follows:

  • I merged the 14 CSV files into one Excel spreadsheet.
  • I copied the book's PDF content into MindTouch by first creating the Table of Contents structure and then copying individual sections of the book to support the exploratory data analysis I did with Spotfire.
  • Instead of the book's text mining exercises and four text files in Chapter 12, I text mined the entire publication by building a structured knowledge base in the Excel spreadsheet.
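The first step above, combining the chapter data sets into one workbook, can be sketched in plain Python. This is only an illustration of the merge idea: the file names and rows below are hypothetical placeholders, and the actual merge described here was done by hand in Excel.

```python
import csv
import io

# Hypothetical in-memory stand-ins for the book's chapter CSV files.
# In practice these would be read from the 14 files downloaded from
# the book's data-set site.
csv_files = {
    "Chapter03DataSet.csv": "Name,Age\nAlice,34\nBob,29\n",
    "Chapter04DataSet.csv": "Team,Score\nRed,10\nBlue,7\n",
}

# One "workbook": a dict mapping sheet name -> list of rows, analogous
# to a single Excel spreadsheet with one worksheet per data set.
workbook = {}
for filename, text in csv_files.items():
    sheet_name = filename.rsplit(".", 1)[0]          # drop the .csv extension
    workbook[sheet_name] = list(csv.reader(io.StringIO(text)))

print(sorted(workbook))
print(workbook["Chapter03DataSet"][1])
```

A library such as pandas could write this dict out as a real multi-sheet .xlsx file, but the dict-of-row-lists form is enough to show how the 14 separate data sets become one navigable structure.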

Question: Can we do RapidMiner with Spotfire?

The answer is yes, as shown in the Spotfire Screen Captures below.

Spotfire Screen Captures

Cover Page

Data Source: Spreadsheet

Data Dictionary: See below

DataMiningFortheMasses-Spotfire-Cover Page.png

Chapter 03 Data Set

Data Source: Spreadsheet

Data Dictionary: Hands on Exercise

DataMiningFortheMasses-Spotfire-Chapter 03 Data Set.png

Chapter 04 Data Set

Data Source: Spreadsheet

Data Dictionary: Data Understanding

DataMiningFortheMasses-Spotfire-Chapter 04 Data Set.png

Chapter 05 Data Set

Data Source: Spreadsheet

Data Dictionary: Data Understanding

DataMiningFortheMasses-Spotfire-Chapter 05 Data Set.png

Chapter 06 Data Set

Data Source: Spreadsheet

Data Dictionary: Data Understanding

DataMiningFortheMasses-Spotfire-Chapter 06 Data Set.png

Chapter 07 Data Set Scoring

Data Source: Spreadsheet

Data Dictionary: Data Understanding

DataMiningFortheMasses-Spotfire-Chapter 07 Data Set Scoring.png

Chapter 07 Data Set Training

Data Source: Spreadsheet

Data Dictionary: Data Understanding

DataMiningFortheMasses-Spotfire-Chapter 07 Data Set Training.png

Chapter 08 Data Set: MyModel

Data Source: Spreadsheet

Data Dictionary: Data Understanding

DataMiningFortheMasses-Spotfire-Chapter 08 Data Set-MyModel.png

Chapter 09 Data Set Scoring

Data Source: Spreadsheet

Data Dictionary: Data Understanding

DataMiningFortheMasses-Spotfire-Chapter 09 DataSet Scoring.png

Chapter 09 Data Set Training

Data Source: Spreadsheet

Data Dictionary: Data Understanding

DataMiningFortheMasses-Spotfire-Chapter 09 Data Set Training.png

Chapter 10 Data Set Scoring

Data Source: Spreadsheet

Data Dictionary: Data Understanding

DataMiningFortheMasses-Spotfire-Chapter 10 Data Set Scoring.png

Chapter 10 Data Set Training

Data Source: Spreadsheet

Data Dictionary: Data Understanding

DataMiningFortheMasses-Spotfire-Chapter 10 Data Set Training.png

Chapter 11 Data Set Scoring

Data Source: Spreadsheet

Data Dictionary: Data Understanding

DataMiningFortheMasses-Spotfire-Chapter 11 Data Set Scoring.png

Chapter 11 Data Set Training

Data Source: Spreadsheet

Data Dictionary: Data Understanding

DataMiningFortheMasses-Spotfire-Chapter 11 Data Set Training.png

Chapter 11 Exercise Training Data

Data Source: Spreadsheet

Data Dictionary: Exercise

DataMiningFortheMasses-Spotfire-Chapter 11 Exercise Training Data.png

Spotfire Dashboard

For Internet Explorer users, and for full-screen display, use the Web Player. iPad users can get the Spotfire for iPad app.

Note: If the embedded dashboard does not display, use Google Chrome.

Research Notes

Scientific Data Mining GMU Fall Semester

Syllabus

Source: http://kirkborne.net/cds401fall2014/

School of Physics, Astronomy, and Computational Sciences 
George Mason University -- College of Science 
CDS 401
Scientific Data Mining
Fall Semester 2014
  • Course Syllabus Website:   http://kirkborne.net/cds401fall2014/
  • *****IMPORTANT NOTE:   Last Day to Add Classes = September 2, 2014.
  • Course Description (from GMU Catalog):
    • Data mining techniques from statistics, machine learning, and visualization as applied to scientific knowledge discovery. Students will be given a set of case studies and projects to test their understanding of this field and to provide a foundation for future applications in their careers. (3 credits)
  • Detailed Course Overview:
    • This course provides a broad overview of the data mining component of the knowledge discovery process, as applied to scientific research. Scientific databases are growing at near-exponential rates. As the amount of data has grown, so has the difficulty in analyzing these large databases. Data mining is the search for hidden, meaningful patterns in such databases. Identifying these patterns and rules can provide significant competitive advantage to scientific research projects and in other career settings. Data mining is motivated and analyzed as the killer app for large scientific databases. Data mining techniques, algorithms, and applications are covered, as well as the key concepts of machine learning, data types, data preparation, previewing, noise handling, feature selection, normalization, data transformation, similarity measures, and distance metrics. Algorithms and techniques will be analyzed specifically in terms of their application to solving particular problems. Several scientific case studies will be presented from the science research literature. The techniques that are presented will be drawn from well known statistical, machine learning, visualization, and database algorithms, including clustering, decision trees, regression, Bayes theorem, nearest neighbor, neural networks, and genetic algorithms. Topics will include informatics, semantic knowledge mining, and the integration of data mining with large (and often distributed) scientific databases.
  • Course Objectives and Learning Outcomes:  
    1. to develop an understanding of data mining and its scientific applications;
    2. to become familiar with a variety of data mining concepts, techniques, and algorithms;
    3. to become capable in applying these techniques and algorithms to solve scientific problems; and
    4. to provide a foundation and develop the skills for future data-intensive applications in the student's career.
  • Supplemental Syllabus Information:
  • Honor Code:  
    • Instructors may submit Exam Papers, Homework solutions, or any other student assignment to either the TurnItIn.com or the SafeAssign plagiarism-detection services, in compliance with all of the following: GMU policy, Provost approval, and the GMU Honor Code.
    • Plagiarism will not be tolerated.

  • Course Material:   Please log into http://mymason.gmu.edu/ each week to get announcements, lecture slides, assignments, and grades.
  • Lecture Day/Time:   Thursdays 4:30-7:10 PM   (August 28 – December 11) (see https://patriotweb.gmu.edu/)
    • No Class on THANKSGIVING HOLIDAY (Thursday November 27)
  • Lecture Place:   Innovation Hall, Room 317
  • Midterm Exam:   Thursday October 16, 2014 at 4:30pm (in our classroom)
  • Final Exam:   Thursday December 11, 4:30pm-7:15pm (in our classroom)
  • Grading:  
    • 40% = Lab Exercises (the lowest score will be dropped)
    • 25% = Midterm Exam
    • 35% = Final Exam
  • Course Instructor: Dr. Kirk Borne, Professor of Astrophysics and Computational Science
  • Prerequisites:
    • CDS 302 (Scientific Data and Databases).
    • Or else ... Permission of Instructor, depending on whether you are a Junior or Senior majoring in science
  • Reading Assignments:   http://kirkborne.net/cds401fall2014/cds401-reading.htm
  • Required Textbook:  
    1. Matthew North, Data Mining for the Masses. Global Text Project, 2012. ISBN: 9780615684376.
  • Optional Supplemental Reading:  
    1. P.-N. Tan, M. Steinbach, & V. Kumar, Introduction to Data Mining. Addison-Wesley, 2005. ISBN: 9780321321367.
    2. M. Dunham, Data Mining: Introductory and Advanced Topics. Prentice-Hall, 2002. ISBN: 9780130888921.
    3. R. J. Roiger & M. W. Geatz, Data Mining: A Tutorial-Based Primer. Addison-Wesley, 2002. ISBN: 9780201741285.
  • Technology Requirements:  
  • Weekly Schedule: (subject to change)
    • Week 1: Basics and Introduction to Data Mining
    • Week 2: Data Understanding and Data Profiling
    • Week 3: Data Preparation
    • Week 4: Introduction to Unsupervised Learning, and Correlation Mining
    • Week 5: Association Rule Mining
    • Week 6: Clustering
    • Week 7: Linear Regression
    • Week 8: Midterm Exam +plus+ Novelty Mining
    • Week 9: Introduction to Supervised Learning
    • Week 10: Decision Trees
    • Week 11: Neural Networks
    • Week 12: Unstructured Data = Text Mining
    • Week 13: Cross-Validation, and Special Topics: Bayes, Markov Modeling
    • Week 14: Thanksgiving Holiday
    • Week 15: Data Ethics, and More Special Topics
    • *FINAL EXAM*

    Author:   Kirk D. Borne 
    Last Update:   28 August 2014

Reading Assignments

Source: http://kirkborne.net/cds401fall2014/...01-reading.htm

Reading assignments from Data Mining for the Masses (by Matthew North):
Week 1 (8/28/2014) Chapter 1 (Basics and Introduction)
Week 2 (9/04/2014) Chapter 2 (Data Understanding and Data Profiling)
Week 3 (9/11/2014) Chapter 3 (Data Preparation)
Week 4 (9/18/2014) Chapter 4 (Introduction to Unsupervised Learning, and Correlation Mining)
Week 5 (9/25/2014) Chapter 5 (Association Rule Mining)
Week 6 (10/2/2014) Chapter 6 (Clustering)
Week 7 (10/9/2014) Chapter 8 (Linear Regression)
Week 8 (10/16/2014) MIDTERM EXAM on Thursday Oct.16 
plus short lecture on Novelty Mining (no reading assignment)
Week 9 (10/23/2014) Introduction to Supervised Learning (no reading assignment)
Week 10 (10/30/2014) Chapter 10 (Decision Trees)
Week 11 (11/6/2014) Chapter 11 (Neural Networks)
Week 12 (11/13/2014) Chapter 12 (Unstructured Data = Text Mining)
Week 13 (11/20/2014) Chapter 13 (Cross-Validation, and Special Topics: Bayes, Markov Modeling)
Week 14 (11/27/2014) THANKSGIVING HOLIDAY
Week 15 (12/4/2014) Chapter 14 (Data Ethics, and More Special Topics)

2014 George Mason and IBM Symposium on Diverse Data Analytics Applications

Source: https://www-950.ibm.com/events/wwe/g...S&locale=en_US

Preface

In the last decade, the data explosion and the rise of robust analytics tools have made “Big Data and Analytics” among the most popular terms in the computer engineering and IT industry. In addition, cloud, social, and mobile environments generate a tremendous amount of personalized, geospatial, and temporal data that is extremely valuable to education, business operations, government services, and the intelligence community. According to the Forum for Innovation, 90 percent of the world's data has been produced in the last two years. The operational need and market demand continue to grow. To create an opportunity to share Big Data and Analytics knowledge and technologies among academic institutions, industry leaders, and government customers, George Mason University and IBM will host the conference "Big Data and Analytics 2014" on November 4th at George Mason University. For this conference, we have invited leaders and experts from academia, industry, and government to discuss how big data and analytics hold the key to unlocking value. If you are a student, you will learn about topics like data mining, statistical models, predictive analytics, and data visualization; if you are a researcher, you can compare notes with analytics experts from George Mason University, IBM, and other colleagues; if you are a technology provider, you will get an update on the cutting edge of analytics in industry and research.

Agenda

Johnson Center Cinema
George Mason University
Fairfax, Virginia 
United States

8:30 AM - 9:00 AM Continental Breakfast and Check-In
9:00 AM - 9:15 AM Opening Remarks: Monty Hayes, George Mason University, ECE Chair
9:15 AM - 9:30 AM IBM Welcome: Frank Stein, Director, Analytics Solution Center, IBM
9:30 AM - 10:00 AM Keynote: How Can Organizations Use Big Data to Achieve Sustainability Objectives?
Dr. Jane Snowdon, Chief Innovation Officer, IBM Federal
10:00 AM - 10:30 AM Connecting the Dots through the Discovery of Evidence, Hypotheses, and Arguments from Masses of Data
Gheorghe Tecuci, George Mason University
10:30 AM - 10:45 AM Break - Light Refreshments Provided
10:45 AM - 11:15 AM Using Advanced Analytics with Big Data in Federal/Healthcare
Rashmi Mathur, IBM
11:15 AM - 11:45 AM Heterogeneous Architectures in 3D for Next Generation Big Data Server Platform
Houman Homayoun, George Mason University 
11:45 AM - 12:15 PM Applications of Advanced Analytics in a Large-Scale Performance Measurement Study
Hua Ni, IBM
12:15 PM - 1:00 PM Lunch w/ Speaker: Predicting Household Demographics with Cable TV Viewing Data
Michael Schader, George Mason University
1:00 PM - 1:30 PM Using Analytics to Capture the Voice of the Customer and Influence Product Development
Dr. Spyros Kontogiorgis, IBM
1:30 PM - 2:00 PM Machine Learning Approaches for Annotating Large Scale Biological Datasets
Huzefa Rangwala, George Mason University
2:00 PM - 2:30 PM How Big Data and Analytics Can Detect and Prevent Fraud
Keith Jones, George Mason University
2:30 PM - 3:00 PM Applications of Big Data Analytics in Air Transportation
Lance Sherry, George Mason University
3:00 PM - 3:45 PM Panel Discussion – Diverse Careers in Data Analytics – Bob Osgood, Moderator
J. P. Auffret, George Mason University 
Jerry Hanweck, George Mason University 
Frank Stein, IBM
Alissa Eichinger
3:45 PM - 4:00 PM Closing: Frank Stein, Director, Analytics Solution Center, IBM 

Data Mining for the Masses

Source: http://www.onlineprogrammingbooks.co...mining-masses/

Posted on September 20th, 2014

Data Mining for the Masses

In Data Mining for the Masses, professor Matt North—a former risk analyst and database developer for eBay.com—uses simple examples, clear explanations and free, powerful, easy-to-use software to teach you the basics of data mining; techniques that can help you answer some of your toughest business questions.

Description
Topics included: Introduction to Data Mining and CRISP-DM • Organizational Understanding and Data Understanding • Data Preparation • Correlation • Association Rules • k-Means Clustering • Discriminant Analysis • Linear Regression • Logistic Regression • Decision Trees • Neural Networks • Text Mining • Evaluation and Deployment • Data Mining Ethics.

Book Details

Author(s): Dr. Matthew North
Publisher: Global Text Project
Published: August 2012
Format(s): PDF
File size: 16.70 MB
Number of pages: 264
Download / View Link(s): PDF | Alternative link

Data Mining for the Masses Data Sets

Source: https://sites.google.com/site/dataminingforthemasses/

My Note: See all Data Sets (Except for 4 Text Files) in Spreadsheet

This web site is designed to serve as a repository for all data sets referred to in Data Mining for the Masses, a textbook by Dr. Matt North.  The book is available on Amazon.com, ISBN: 978-0615684376.  Questions regarding the book, the data sets, or other related matters can be directed to Dr. North.

All data sets are available in the list below.  They are organized according to their corresponding chapters in the book.  All are in comma separated values format in order to ease portability and usability.  To download a data set, simply click on the down arrow to the far right of the file name.

Chapter03DataSet.csv (1k), Matt North, Aug 13, 2012, 12:35 PM

Chapter04DataSet.csv (23k), Matt North, Aug 13, 2012, 12:35 PM

Chapter05DataSet.csv (117k), Matt North, Aug 13, 2012, 12:35 PM

Chapter06DataSet.csv (6k), Matt North, Aug 13, 2012, 12:35 PM

Chapter07DataSet_Scoring.csv (39k), Matt North, Aug 13, 2012, 12:35 PM

Chapter07DataSet_Training.csv (15k), Matt North, Aug 13, 2012, 12:35 PM

Chapter08DataSet.csv (625k), Matt North, Aug 13, 2012, 12:36 PM

Chapter09DataSet_Scoring.csv (13k), Matt North, Aug 13, 2012, 12:36 PM

Chapter09DataSet_Training.csv (3k), Matt North, Aug 13, 2012, 12:36 PM

Chapter10DataSet_Scoring.csv (24k), Matt North, Aug 13, 2012, 12:31 PM

Chapter10DataSet_Training.csv (41k), Matt North, Aug 13, 2012, 12:31 PM

Chapter11DataSet_Scoring.csv (4k), Matt North, Aug 13, 2012, 12:31 PM

Chapter11DataSet_Training.csv (22k), Matt North, Aug 13, 2012, 12:31 PM

Chapter11Exercise_TrainingData.csv (28k), Matt North, Aug 13, 2012, 12:31 PM

Chapter12_Federalist05_Jay.txt (8k), Matt North, Aug 13, 2012, 12:36 PM

Chapter12_Federalist14_Madison.txt (12k), Matt North, Aug 13, 2012, 12:31 PM

Chapter12_Federalist17_Hamilton.txt (9k), Matt North, Aug 13, 2012, 12:31 PM

Chapter12_Federalist18_Collaboration.txt (13k), Matt North, Aug 13, 2012, 12:31 PM
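Since all of the data sets are plain CSV files, they can be opened with almost any tool. As a minimal sketch using Python's standard csv module (with an inline sample whose column names are hypothetical, since each chapter's file has its own layout), loading one of these files amounts to:

```python
import csv
import io

# Hypothetical sample mimicking one of the book's CSV data sets;
# the real files' column names vary by chapter.
sample = """Age,Income,RiskCategory
25,31000,Low
52,84000,High
"""

def load_csv_rows(text):
    """Parse CSV text into a list of dicts, one per observation."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_csv_rows(sample)
```

Replacing the inline sample with the contents of a downloaded file such as Chapter03DataSet.csv reads it the same way.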

Data Mining for the Masses

Source: http://dl.dropbox.com/u/31779972/Dat...rTheMasses.pdf (PDF)

Dr. Matthew North

Cover Page

A Global Text Project Book
This book is available on Amazon.com.
© 2012 Dr. Matthew A. North. This book is licensed under a Creative Commons Attribution 3.0 License.
All rights reserved.
ISBN: 0615684378
ISBN-13: 978-0615684376

Dedication

This book is gratefully dedicated to Dr. Charles Hannon, who gave me the chance to become a college professor and then challenged me to learn how to teach data mining to the masses.

Acknowledgements

I would not have had the expertise to write this book if not for the assistance of many colleagues at various institutions. I would like to acknowledge Drs. Thomas Hilton and Jean Pratt, formerly of Utah State University and now of University of Wisconsin—Eau Claire who served as my Master’s degree advisors. I would also like to acknowledge Drs. Terence Ahern and Sebastian Diaz of West Virginia University, who served as doctoral advisors to me.

I express my sincere and heartfelt gratitude for the assistance of Dr. Simon Fischer and the rest of the team at Rapid-I. I thank them for their excellent work on the RapidMiner software product and for their willingness to share their time and expertise with me on my visit to Dortmund.

Finally, I am grateful to the Kenneth M. Mason, Sr. Faculty Research Fund and Washington & Jefferson College, for providing financial support for my work on this text.

Section One: Data Mining Basics

Chapter One: Introduction to Data Mining and CRISP-DM

Introduction

Data mining as a discipline is largely transparent to the world. Most of the time, we never even notice that it’s happening. But whenever we sign up for a grocery store shopping card, place a purchase using a credit card, or surf the Web, we are creating data. These data are stored in large sets on powerful computers owned by the companies we deal with every day. Lying within those data sets are patterns—indicators of our interests, our habits, and our behaviors. Data mining allows people to locate and interpret those patterns, helping them make better informed decisions and better serve their customers. That being said, there are also concerns about the practice of data mining. Privacy watchdog groups in particular are vocal about organizations that amass vast quantities of data, some of which can be very personal in nature.

The intent of this book is to introduce you to concepts and practices common in data mining. It is intended primarily for undergraduate college students and for business professionals who may be interested in using information systems and technologies to solve business problems by mining data, but who likely do not have a formal background or education in computer science. Although data mining is the fusion of applied statistics, logic, artificial intelligence, machine learning and data management systems, you are not required to have a strong background in these fields to use this book. While having taken introductory college-level courses in statistics and databases will be helpful, care has been taken to explain within this book the necessary concepts and techniques required to successfully learn how to mine data.

Each chapter in this book will explain a data mining concept or technique. You should understand that the book is not designed to be an instruction manual or tutorial for the tools we will use (RapidMiner and OpenOffice Base and Calc). These software packages are capable of many types of data analysis, and this text is not intended to cover all of their capabilities, but rather, to illustrate how these software tools can be used to perform certain kinds of data mining. The book is also not exhaustive; it includes a variety of common data mining techniques, but RapidMiner in particular is capable of many, many data mining tasks that are not covered in the book.

The chapters will all follow a common format. First, chapters will present a scenario referred to as Context and Perspective. This section will help you to gain a real-world idea about a certain kind of problem that data mining can help solve. It is intended to help you think of ways that the data mining technique in that given chapter can be applied to organizational problems you might face. Following Context and Perspective, a set of Learning Objectives is offered. The idea behind this section is that each chapter is designed to teach you something new about data mining. By listing the objectives at the beginning of the chapter, you will have a better idea of what you should expect to learn by reading it. The chapter will continue with several sections addressing the chapter’s topic. In these sections, step-by-step examples will frequently be given to enable you to work alongside an actual data mining task. Finally, after the main concepts of the chapter have been delivered, each chapter will conclude with a Chapter Summary, a set of Review Questions to help reinforce the main points of the chapter, and one or more Exercises to allow you to try your hand at applying what was taught in the chapter.

A Note About Tools

There are many software tools designed to facilitate data mining; however, many of these are expensive and complicated to install, configure, and use. Simply put, they’re not a good fit for learning the basics of data mining. This book will use OpenOffice Calc and Base in conjunction with an open source software product called RapidMiner, developed by Rapid-I GmbH of Dortmund, Germany. Because OpenOffice is widely available and very intuitive, it is a logical place to begin teaching introductory level data mining concepts. However, it lacks some of the tools data miners like to use. RapidMiner is an ideal complement to OpenOffice, and was selected for this book for several reasons:

  • RapidMiner provides specific data mining functions not currently found in OpenOffice, such as decision trees and association rules, which you will learn to use later in this book.
  • RapidMiner is easy to install and will run on just about any computer.
  • RapidMiner’s maker provides a Community Edition of its software, making it free for readers to obtain and use.
  • Both RapidMiner and OpenOffice provide intuitive graphical user interface environments which make it easier for general computer-using audiences to experience the power of data mining.

All examples using OpenOffice or RapidMiner in this book will be illustrated in a Microsoft Windows environment, although it should be noted that these software packages will work on a variety of computing platforms. It is recommended that you download and install these two software packages on your computer now, so that you can work along with the examples in the book if you would like.

The Data Mining Process

Although data mining’s roots can be traced back to the late 1980s, for most of the 1990s the field was still in its infancy. Data mining was still being defined, and refined. It was largely a loose conglomeration of data models, analysis algorithms, and ad hoc outputs. In 1999, several sizeable companies including auto maker Daimler-Benz, insurance provider OHRA, hardware and software manufacturer NCR Corp. and statistical software maker SPSS, Inc. began working together to formalize and standardize an approach to data mining. The result of their work was CRISP-DM, the CRoss-Industry Standard Process for Data Mining. Although the participants in the creation of CRISP-DM certainly had vested interests in certain software and hardware tools, the process was designed independent of any specific tool. It was written in such a way as to be conceptual in nature—something that could be applied independent of any certain tool or kind of data. The process consists of six steps or phases, as illustrated in Figure 1-1.

Figure 1-1: CRISP-DM Conceptual Model

Figure1-1.png

CRISP-DM Step 1: Business (Organizational) Understanding

The first step in CRISP-DM is Business Understanding, or what will be referred to in this text as Organizational Understanding, since organizations of all kinds, not just businesses, can use data mining to answer questions and solve problems. This step is crucial to a successful data mining outcome, yet is often overlooked as folks try to dive right into mining their data. This is natural of course—we are often anxious to generate some interesting output; we want to find answers. But you wouldn’t begin building a car without first defining what you want the vehicle to do, and without first designing what you are going to build. Consider these oft-quoted lines from Lewis Carroll’s Alice’s Adventures in Wonderland:

"Would you tell me, please, which way I ought to go from here?"
"That depends a good deal on where you want to get to," said the Cat.
"I don’t much care where--" said Alice.
"Then it doesn’t matter which way you go," said the Cat.
"--so long as I get SOMEWHERE," Alice added as an explanation.
"Oh, you’re sure to do that," said the Cat, "if you only walk long enough."

Indeed. You can mine data all day long and into the night, but if you don’t know what you want to know, if you haven’t defined any questions to answer, then the efforts of your data mining are less likely to be fruitful. Start with high level ideas: What is making my customers complain so much? How can I increase my per-unit profit margin? How can I anticipate and fix manufacturing flaws and thus avoid shipping a defective product? From there, you can begin to develop the more specific questions you want to answer, and this will enable you to proceed to …

CRISP-DM Step 2: Data Understanding

As with Organizational Understanding, Data Understanding is a preparatory activity, and sometimes, its value is lost on people. Don’t let its value be lost on you! Years ago when workers did not have their own computer (or multiple computers) sitting on their desk (or lap, or in their pocket), data were centralized. If you needed information from a company’s data store, you could request a report from someone who could query that information from a central database (or fetch it from a company filing cabinet) and provide the results to you. The inventions of the personal computer, workstation, laptop, tablet computer and even smartphone have each triggered moves away from data centralization. As hard drives became simultaneously larger and cheaper, and as software like Microsoft Excel and Access became increasingly accessible and easier to use, data began to disperse across the enterprise. Over time, valuable data stores became strewn across hundreds and even thousands of devices, sequestered in marketing managers’ spreadsheets, customer support databases, and human resources file systems.

As you can imagine, this has created a multi-faceted data problem. Marketing may have wonderful data that could be a valuable asset to senior management, but senior management may not be aware of the data’s existence—either because of territorialism on the part of the marketing department, or because the marketing folks simply haven’t thought to tell the executives about the data they’ve gathered. The same could be said of the information sharing, or lack thereof, between almost any two business units in an organization. In Corporate America lingo, the term ‘silos’ is often invoked to describe the separation of units to the point where interdepartmental sharing and communication is almost non-existent. It is unlikely that effective organizational data mining can occur when employees do not know what data they have (or could have) at their disposal or where those data are currently located. In chapter two we will take a closer look at some mechanisms that organizations are using to try to bring all their data into a common location. These include databases, data marts and data warehouses.

Simply centralizing data is not enough however. There are plenty of questions that arise once an organization’s data have been corralled. Where did the data come from? Who collected them and was there a standard method of collection? What do the various columns and rows of data mean? Are there acronyms or abbreviations that are unknown or unclear? You may need to do some research in the Data Preparation phase of your data mining activities. Sometimes you will need to meet with subject matter experts in various departments to unravel where certain data came from, how they were collected, and how they have been coded and stored. It is critically important that you verify the accuracy and reliability of the data as well. The old adage “It’s better than nothing” does not apply in data mining. Inaccurate or incomplete data could be worse than nothing in a data mining activity, because decisions based upon partial or wrong data are likely to be partial or wrong decisions. Once you have gathered, identified and understood your data assets, then you may engage in…

CRISP-DM Step 3: Data Preparation

Data come in many shapes and formats. Some data are numeric, some are in paragraphs of text, and others are in picture form such as charts, graphs and maps. Some data are anecdotal or narrative, such as comments on a customer satisfaction survey or the transcript of a witness’s testimony. Data that aren’t in rows or columns of numbers shouldn’t be dismissed though—sometimes non-traditional data formats can be the most information rich. We’ll talk in this book about approaches to formatting data, beginning in Chapter 2. Although rows and columns will be one of our most common layouts, we’ll also get into text mining where paragraphs can be fed into RapidMiner and analyzed for patterns as well.

Data Preparation involves a number of activities. These may include joining two or more data sets together, reducing data sets to only those variables that are interesting in a given data mining exercise, scrubbing data clean of anomalies such as outlier observations or missing data, or re-formatting data for consistency purposes. For example, you may have seen a spreadsheet or database that held phone numbers in many different formats:

(555) 555-5555
555/555-5555
555-555-5555
555.555.5555
555 555 5555
5555555555

Each of these represents the same phone number, but stored in a different format. A data mining exercise is most likely to yield good, useful results when the underlying data are as consistent as possible. Data preparation can help to ensure that you improve your chances of a successful outcome when you begin…
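The re-formatting step described above can be sketched in a few lines of Python (a hypothetical helper, not part of the book's RapidMiner/OpenOffice workflow; it assumes 10-digit North American numbers):

```python
import re

def normalize_phone(raw):
    """Strip all non-digit characters, then re-format as NNN-NNN-NNNN.

    Hypothetical helper illustrating data preparation; assumes
    10-digit North American phone numbers.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {raw!r}")
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"

# All six formats from the example above collapse to one canonical form.
formats = ["(555) 555-5555", "555/555-5555", "555-555-5555",
           "555.555.5555", "555 555 5555", "5555555555"]
normalized = {normalize_phone(p) for p in formats}
```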

CRISP-DM Step 4: Modeling

A model, in data mining at least, is a computerized representation of real-world observations. Models are the application of algorithms to seek out, identify, and display any patterns or messages in your data. There are two basic kinds or types of models in data mining: those that classify and those that predict.

Figure 1-2. Types of Data Mining Models

Figure1-2.png

As you can see in Figure 1-2, there is some overlap between the types of models data mining uses. For example, this book will teach you about decision trees. Decision Trees are a predictive model used to determine which attributes of a given data set are the strongest indicators of a given outcome. The outcome is usually expressed as the likelihood that an observation will fall into a certain category. Thus, Decision Trees are predictive in nature, but they also help us to classify our data. This will probably make more sense when we get to the chapter on Decision Trees, but for now, it’s important just to understand that models help us to classify and predict based on patterns the models find in our data.
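The book builds its decision trees in RapidMiner, but the classify-and-predict idea can be sketched with a one-split "decision stump" in plain Python (the data, attribute names, and threshold here are all hypothetical):

```python
def stump_predict(observations, split_attr, threshold, target):
    """Split observations on one attribute; for each branch, report
    the majority target class and its proportion -- the 'likelihood'
    that an observation in that branch falls into that category."""
    branches = {"left": [], "right": []}
    for obs in observations:
        side = "left" if obs[split_attr] <= threshold else "right"
        branches[side].append(obs[target])
    summary = {}
    for side, labels in branches.items():
        if not labels:
            continue
        majority = max(set(labels), key=labels.count)
        summary[side] = (majority, labels.count(majority) / len(labels))
    return summary

# Hypothetical toy data: does spending level indicate loyalty-card signup?
data = [
    {"spend": 20, "signup": "no"},
    {"spend": 25, "signup": "no"},
    {"spend": 80, "signup": "yes"},
    {"spend": 95, "signup": "yes"},
    {"spend": 90, "signup": "no"},
]
result = stump_predict(data, "spend", 50, "signup")
```

Each branch both classifies (majority class) and predicts (the proportion serves as a likelihood), which is exactly the overlap Figure 1-2 describes.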

Models may be simple or complex. They may contain only a single process, or stream, or they may contain sub-processes. Regardless of their layout, models are where data mining moves from preparation and understanding to development and interpretation. We will build a number of example models in this text. Once a model has been built, it is time for…

CRISP-DM Step 5: Evaluation

All analyses of data have the potential for false positives. Even if a model doesn’t yield false positives, however, it may not find any interesting patterns in your data. This may be because the model isn’t set up well to find the patterns, you could be using the wrong technique, or there simply may not be anything interesting in your data for the model to find. The Evaluation phase of CRISP-DM is there specifically to help you determine how valuable your model is, and what you might want to do with it.

Evaluation can be accomplished using a number of techniques, both mathematical and logical in nature. This book will examine techniques for cross-validation and testing for false positives using RapidMiner. For some models, the power or strength indicated by certain test statistics will also be discussed. Beyond these measures however, model evaluation must also include a human aspect. As individuals gain experience and expertise in their field, they will have operational knowledge which may not be measurable in a mathematical sense, but is nonetheless indispensable in determining the value of a data mining model. This human element will also be discussed throughout the book. Using both data-driven and instinctive evaluation techniques to determine a model’s usefulness, we can then decide how to move on to…
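The cross-validation mentioned above is performed in RapidMiner later in the book; as a sketch of the underlying idea only, the k-fold splitting logic looks like this in plain Python (the data here is a stand-in list of ten observations):

```python
def k_fold_splits(data, k):
    """Yield (training, holdout) pairs: each of the k folds serves
    once as the holdout set while the remaining folds are combined
    into the training set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        holdout = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, holdout

# Stand-in data: ten observations, split five ways.
data = list(range(10))
splits = list(k_fold_splits(data, 5))
```

Evaluating a model on each holdout set in turn, then averaging the results, gives a more honest estimate of performance than a single train/test split.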

CRISP-DM Step 6: Deployment

If you have successfully identified your questions, prepared data that can answer those questions, and created a model that passes the test of being interesting and useful, then you have arrived at the point of actually using your results. This is deployment, and it is a happy and busy time for a data miner. Activities in this phase include automating your model, meeting with consumers of your model’s outputs, integrating with existing management or operational information systems, feeding new learning from model use back into the model to improve its accuracy and performance, and monitoring and measuring the outcomes of model use. Be prepared for a bit of distrust of your model at first—you may even face pushback from groups who feel their jobs are threatened by this new tool, or who may not trust the reliability or accuracy of the outputs. But don’t let this discourage you! Remember that CBS did not trust the initial predictions of the UNIVAC, one of the first commercial computer systems, when the network used it to predict the eventual outcome of the 1952 presidential election on election night. With only 5% of the votes counted, UNIVAC predicted Dwight D. Eisenhower would defeat Adlai Stevenson in a landslide; something no pollster or election insider considered likely, or even possible. In fact, most ‘experts’ expected Stevenson to win by a narrow margin, with some acknowledging that because they expected it to be close, Eisenhower might also prevail in a tight vote. It was only late that night, when human vote counts confirmed that Eisenhower was running away with the election, that CBS went on the air to acknowledge first that Eisenhower had won, and second, that UNIVAC had predicted this very outcome hours earlier, but network brass had refused to trust the computer’s prediction. UNIVAC was further vindicated later, when its prediction was found to be within 1% of what the eventual tally showed.
New technology is often unsettling to people, and it is hard sometimes to trust what computers show. Be patient and specific as you explain how a new data mining model works, what the results mean, and how they can be used.

While the UNIVAC example illustrates the power and utility of predictive computer modeling (despite inherent mistrust), it should not be construed as a reason for blind trust either. In the days of UNIVAC, the biggest problem was the newness of the technology. It was doing something no one really expected or could explain, and because few people understood how the computer worked, it was hard to trust it. Today we face a different but equally troubling problem: computers have become ubiquitous, and too often, we don’t question enough whether or not the results are accurate and meaningful. In order for data mining models to be effectively deployed, balance must be struck. By clearly communicating a model’s function and utility to stakeholders, thoroughly testing and proving the model, then planning for and monitoring its implementation, data mining models can be effectively introduced into the organizational flow. Failure to carefully and effectively manage deployment, however, can sink even the best and most effective models.

Data Mining and You

Because data mining can be applied to such a wide array of professional fields, this book has been written with the intent of explaining data mining in plain English, using software tools that are accessible and intuitive to everyone. You may not have studied algorithms, data structures, or programming, but you may have questions that can be answered through data mining. It is our hope that by writing in an informal tone and by illustrating data mining concepts with accessible, logical examples, data mining can become a useful tool for you regardless of your previous level of data analysis or computing expertise. Let’s start digging!

Chapter Two: Organizational Understanding and Data Understanding

Context and Perspective

Consider some of the activities you’ve been involved with in the past three or four days. Have you purchased groceries or gasoline? Attended a concert, movie or other public event? Perhaps you went out to eat at a restaurant, stopped by your local post office to mail a package, made a purchase online, or placed a phone call to a utility company. Every day, our lives are filled with interactions – encounters with companies, other individuals, the government, and various other organizations.

In today’s technology-driven society, many of those encounters involve the transfer of information electronically. That information is recorded and passed across networks in order to complete financial transactions, reassign ownership or responsibility, and enable delivery of goods and services. Think about the amount of data collected each time even one of these activities occurs.

Take the grocery store for example. If you take items off the shelf, those items will have to be replenished for future shoppers – perhaps even for yourself – after all you’ll need to make similar purchases again when that case of cereal runs out in a few weeks. The grocery store must constantly replenish its supply of inventory, keeping the items people want in stock while maintaining freshness in the products they sell. It makes sense that large databases are running behind the scenes, recording data about what you bought and how much of it, as you check out and pay your grocery bill. All of that data must be recorded and then reported to someone whose job it is to reorder items for the store’s inventory.

However, in the world of data mining, simply keeping inventory up-to-date is only the beginning. Does your grocery store require you to carry a frequent shopper card or similar device which, when scanned at checkout time, gives you the best price on each item you’re buying? If so, they can now track not only store-wide purchasing trends, but individual purchasing trends as well. The store can target market to you by sending mailers with coupons for products you tend to purchase most frequently.

Now let’s take it one step further. Remember, if you can, what types of information you provided when you filled out the form to receive your frequent shopper card. You probably indicated your address, date of birth (or at least birth year), whether you’re male or female, and perhaps the size of your family, annual household income range, or other such information. Think about the range of possibilities now open to your grocery store as they analyze that vast amount of data they collect at the cash register each day:

  • Using ZIP codes, the store can locate the areas of greatest customer density, perhaps aiding their decision about the construction location for their next store.
  • Using information regarding customer gender, the store may be able to tailor marketing displays or promotions to the preferences of male or female customers.
  • With age information, the store can avoid mailing coupons for baby food to elderly customers, or promotions for feminine hygiene products to households with a single male occupant.

These are only a few of the many examples of potential uses for data mining. Perhaps as you read through this introduction, some other potential uses for data mining came to your mind. You may have also wondered how ethical some of these applications might be. This text has been designed to help you understand not only the possibilities brought about through data mining, but also the techniques involved in making those possibilities a reality while accepting the responsibility that accompanies the collection and use of such vast amounts of personal information.

Learning Objectives

After completing the reading and exercises in this chapter, you should be able to:

  • Define the discipline of Data Mining
  • List and define various types of data
  • List and define various sources of data
  • Explain the fundamental differences between databases, data warehouses and data sets
  • Explain some of the ethical dilemmas associated with data mining and outline possible solutions
Purposes, Intents and Limitations of Data Mining

Data mining, as explained in Chapter 1 of this text, applies statistical and logical methods to large data sets. These methods can be used to categorize the data, or they can be used to create predictive models. Categorizations of large sets may include grouping people into similar types of classifications, or identifying similar characteristics across a large number of observations.

Predictive models, however, transform these descriptions into expectations upon which we can base decisions. For example, the owner of a book-selling Web site could project how frequently she may need to restock her supply of a given title, or the owner of a ski resort may attempt to predict the earliest possible opening date based on projected snow arrivals and accumulations.

It is important to recognize that data mining cannot provide answers to every question, nor can we expect that predictive models will always yield results which will in fact turn out to be the reality. Data mining is limited to the data that has been collected. And those limitations may be many. We must remember that the data may not be completely representative of the group of individuals to which we would like to apply our results. The data may have been collected incorrectly, or it may be out-of-date. There is an expression which can adequately be applied to data mining, among many other things: GIGO, or Garbage In, Garbage Out. The quality of our data mining results will directly depend upon the quality of our data collection and organization. Even after doing our very best to collect high quality data, we must still remember to base decisions not only on data mining results, but also on available resources, acceptable amounts of risk, and plain old common sense.

Database, Data Warehouse, Data Mart, Data Set…?

In order to understand data mining, it is important to understand the nature of databases, data collection and data organization. This is fundamental to the discipline of Data Mining, and will directly impact the quality and reliability of all data mining activities. In this section, we will examine the differences between databases, data warehouses, and data sets. We will also examine some of the variations in terminology used to describe data attributes.

Although we will be examining the differences between databases, data warehouses and data sets, we will begin by discussing what they have in common. In Figure 2-1, we see some data organized into rows (shown here as 1, 2, etc.) and columns (shown here as A, B, etc.). In varying data environments, these may be referred to by differing names. In a database, rows would be referred to as tuples or records, while the columns would be referred to as fields.

Figure 2-1: Data arranged in columns and rows

Figure2-1.png

In data warehouses and data sets, rows are sometimes referred to as observations, examples or cases, and columns are sometimes called variables or attributes. For purposes of consistency in this book, we will use the terminology of observations for rows and attributes for columns. It is important to note that RapidMiner will use the term examples for rows of data, so keep this in mind throughout the rest of the text.

A database is an organized grouping of information within a specific structure. Database containers, such as the one pictured in Figure 2-2, are called tables in a database environment. Most databases in use today are relational databases—they are designed using many tables which relate to one another in a logical fashion. Relational databases generally contain dozens or even hundreds of tables, depending upon the size of the organization.

Figure 2-2: A simple database with a relation between two tables

Figure2-2.png

Figure 2-2 depicts a relational database environment with two tables. The first table contains information about pet owners; the second, information about pets. The tables are related by the single column they have in common: Owner_ID. By relating tables to one another, we can reduce redundancy of data and improve database performance. The process of breaking tables apart and thereby reducing data redundancy is called normalization.

Most relational databases which are designed to handle a high number of reads and writes (updates and retrievals of information) are referred to as OLTP (online transaction processing) systems. OLTP systems are very efficient for high volume activities such as cashiering, where many items are being recorded via bar code scanners in a very short period of time. However, using OLTP databases for analysis is generally not very efficient, because in order to retrieve data from multiple tables at the same time, a query containing joins must be written. A query is simply a method of retrieving data from database tables for viewing. Queries are usually written in a language called SQL (Structured Query Language; pronounced ‘sequel’). Because it is not very useful to only query pet names or owner names, for example, we must join two or more tables together in order to retrieve both pets and owners at the same time. Joining requires that the computer match the Owner_ID column in the Owners table to the Owner_ID column in the Pets table. When tables contain thousands or even millions of rows of data, this matching process can be very intensive and time consuming on even the most robust computers.
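To make the idea of a join concrete, here is a small sketch in Python using the standard library’s sqlite3 module. The table layout mirrors the Owners and Pets example from Figure 2-2, but the specific names and ID values are invented purely for illustration:

```python
import sqlite3

# Build an in-memory database with two related tables, mirroring Figure 2-2.
# Table names and Owner_ID match the chapter; the sample values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Owners (Owner_ID INTEGER PRIMARY KEY, Owner_Name TEXT)")
conn.execute("CREATE TABLE Pets (Pet_ID INTEGER PRIMARY KEY, Pet_Name TEXT, Owner_ID INTEGER)")
conn.executemany("INSERT INTO Owners VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO Pets VALUES (?, ?, ?)",
                 [(10, "Rex", 1), (11, "Whiskers", 2), (12, "Fido", 1)])

# The join matches Owner_ID in the Pets table to Owner_ID in the Owners
# table, so pets and their owners can be retrieved in a single query.
rows = conn.execute("""
    SELECT Owners.Owner_Name, Pets.Pet_Name
    FROM Owners
    JOIN Pets ON Owners.Owner_ID = Pets.Owner_ID
    ORDER BY Pets.Pet_ID
""").fetchall()
print(rows)  # [('Alice', 'Rex'), ('Bob', 'Whiskers'), ('Alice', 'Fido')]
```

With only a handful of rows the matching is instant; it is when the tables grow to millions of rows that this per-query matching becomes the performance burden described above.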

For much more on database design and management, check out geekgirls.com: (http://www.geekgirls.com/menu_databases.htm).

In order to keep our transactional databases running quickly and smoothly, we may wish to create a data warehouse. A data warehouse is a type of large database that has been denormalized and archived. Denormalization is the process of intentionally combining some tables into a single table in spite of the fact that this may introduce duplicate data in some columns (or in other words, attributes).

Figure 2-3: A combination of the tables into a single data set

Figure2-3.png

Figure 2-3 depicts what our simple example data might look like if it were in a data warehouse. When we design databases in this way, we reduce the number of joins necessary to query related data, thereby speeding up the process of analyzing our data. Databases designed in this manner are called OLAP (online analytical processing) systems.

Transactional systems and analytical systems have conflicting purposes when it comes to database speed and performance. For this reason, it is difficult to design a single system which will serve both purposes. This is why data warehouses generally contain archived data. Archived data are data that have been copied out of a transactional database. Denormalization typically takes place at the time data are copied out of the transactional system. It is important to keep in mind that if a copy of the data is made in the data warehouse, the data may become out-of-synch. This happens when a copy is made in the data warehouse and then later, a change to the original record (observation) is made in the source database. Data mining activities performed on out-of-synch observations may be useless, or worse, misleading. An alternative archiving method would be to move the data out of the transactional system. This ensures that data won’t get out-of-synch; however, it also makes the data unavailable should a user of the transactional system need to view or update it.

A data set is a subset of a database or a data warehouse. It is usually denormalized so that only one table is used. The creation of a data set may contain several steps, including appending or combining tables from source database tables, or simplifying some data expressions. One example of this may be changing a date/time format from ‘10-DEC-2002 12:21:56’ to ‘12/10/02’. If this latter date format is adequate for the type of data mining being performed, it would make sense to simplify the attribute containing dates and times when we create our data set. Data sets may be made up of a representative sample of a larger set of data, or they may contain all observations relevant to a specific group. We will discuss sampling methods and practices in Chapter 3.
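The date simplification described above can be sketched in a few lines of Python; this is a generic illustration of the idea, not tied to any particular database product:

```python
from datetime import datetime

# Simplify a verbose date/time stamp to a short date, as described above.
# The input format '10-DEC-2002 12:21:56' is day, abbreviated month, year.
raw = "10-DEC-2002 12:21:56"
parsed = datetime.strptime(raw, "%d-%b-%Y %H:%M:%S")
simplified = parsed.strftime("%m/%d/%y")  # U.S.-style month/day/two-digit year
print(simplified)  # 12/10/02
```

Applying a transformation like this to an entire date/time attribute discards the time-of-day detail, which is exactly the point when that level of detail is irrelevant to the data mining being performed.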

Types of Data

Thus far in this text, you’ve read about some fundamental aspects of data which are critical to the discipline of data mining. But we haven’t spent much time discussing where that data are going to come from. In essence, there are really two types of data that can be mined: operational and organizational.

The most elemental type of data, operational data, comes from transactional systems which record everyday activities. Simple encounters like buying gasoline, making an online purchase, or checking in for a flight at the airport all result in the creation of operational data. The times, prices and descriptions of the goods or services we have purchased are all recorded. This information can be combined in a data warehouse or may be extracted directly into a data set from the OLTP system.

Oftentimes, transactional data is too detailed to be of much use, or the detail may compromise individuals’ privacy. In many instances, government, academic or not-for-profit organizations may create data sets and then make them available to the public. For example, if we wanted to identify regions of the United States which are historically at high risk for influenza, it would be difficult to obtain permission and to collect doctor visit records nationwide and compile this information into a meaningful data set. However, the U.S. Centers for Disease Control and Prevention (CDCP) do exactly that every year. Government agencies do not always make this information immediately available to the general public, but it often can be requested. Other organizations create such summary data as well. The grocery store mentioned at the beginning of this chapter wouldn’t necessarily want to analyze records of individual cans of green beans sold, but they may want to watch trends for daily, weekly or perhaps monthly totals. Organizational data sets can help to protect peoples’ privacy, while still proving useful to data miners watching for trends in a given population.
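The summarization step described above—rolling individual transactions up into daily totals—can be sketched in plain Python. The sales records here are invented for illustration:

```python
from collections import defaultdict

# Hypothetical line-item sales records: (date, item, quantity sold).
# Summarizing to daily totals hides individual transactions while
# preserving the trend-level view described above.
transactions = [
    ("2012-06-01", "green beans", 2),
    ("2012-06-01", "green beans", 1),
    ("2012-06-02", "green beans", 4),
]

daily_totals = defaultdict(int)
for date, item, qty in transactions:
    daily_totals[(date, item)] += qty

print(dict(daily_totals))
# {('2012-06-01', 'green beans'): 3, ('2012-06-02', 'green beans'): 4}
```

Once aggregated this way, the summary data can be shared for trend analysis without exposing who bought what in any single transaction.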

Another type of data often overlooked within organizations is something called a data mart. A data mart is an organizational data store, similar to a data warehouse, but often created with specific business units’ needs in mind, such as Marketing or Customer Service, for reporting and management purposes. Data marts are usually intentionally created by an organization to be a type of one-stop shop for employees throughout the organization to find data they might be looking for. Data marts may contain wonderful data, prime for data mining activities, but they must be known, current, and accurate to be useful. They should also be well-managed in terms of privacy and security.

All of these types of organizational data carry with them some concern. Because they are secondary, meaning they have been derived from other more detailed primary data sources, they may lack adequate documentation, and the rigor with which they were created can be highly variable. Such data sources may also not be intended for general distribution, and it is always wise to ensure proper permission is obtained before engaging in data mining activities on any data set. Remember, simply because a data set may have been acquired from the Internet does not mean it is in the public domain; and simply because a data set may exist within your organization does not mean it can be freely mined. Checking with relevant managers, authors and stakeholders is critical before beginning data mining activities.

A Note about Privacy and Security

In 2003, JetBlue Airlines supplied more than one million passenger records to a U.S. government contractor, Torch Concepts. Torch then subsequently augmented the passenger data with additional information such as family sizes and social security numbers—information purchased from a data broker called Acxiom. The data were intended for a data mining project in order to develop potential terrorist profiles. All of this was done without notification or consent of passengers. When news of the activities got out however, dozens of privacy lawsuits were filed against JetBlue, Torch and Acxiom, and several U.S. senators called for an investigation into the incident.

This incident serves several valuable purposes for this book. First, we should be aware that as we gather, organize and analyze data, there are real people behind the figures. These people have certain rights to privacy and protection against crimes such as identity theft. We as data miners have an ethical obligation to protect these individuals’ rights. This requires the utmost care in terms of information security. Simply because a government representative or contractor asks for data does not mean it should be given.

Beyond technological security, however, we must also consider our moral obligation to those individuals behind the numbers. Recall the grocery store shopping card example given at the beginning of this chapter. In order to encourage use of frequent shopper cards, grocery stores frequently list two prices for items, one with use of the card and one without. The answer may vary for each individual, but consider the following question for yourself: At what price mark-up has the grocery store crossed an ethical line between encouraging consumers to participate in frequent shopper programs, and forcing them to participate in order to afford to buy groceries? Your answer will differ from others’, but it is important to keep such moral obligations in mind when gathering, storing and mining data.

The objectives hoped for through data mining activities should never justify unethical means of achievement. Data mining can be a powerful tool for customer relationship management, marketing, operations management, and production; however, in all cases the human element must be kept sharply in focus. When working long hours at a data mining task, interacting primarily with hardware, software, and numbers, it can be easy to forget about the people, which is why the human element is so emphasized here.

Chapter Summary

This chapter has introduced you to the discipline of data mining. Data mining brings statistical and logical methods of analysis to large data sets for the purposes of describing them and using them to create predictive models. Databases, data warehouses and data sets are all unique kinds of digital record keeping systems, however, they do share many similarities. Data mining is generally most effectively executed on data sets extracted from OLAP rather than OLTP systems. Both operational data and organizational data provide good starting points for data mining activities, however both come with their own issues that may inhibit quality data mining activities. These should be mitigated before beginning to mine the data. Finally, when mining data, it is critical to remember the human factor behind manipulation of numbers and figures. Data miners have an ethical responsibility to the individuals whose lives may be affected by the decisions that are made as a result of data mining activities.

Review Questions

1) What is data mining in general terms?
2) What is the difference between a database, a data warehouse and a data set?
3) What are some of the limitations of data mining? How can we address those limitations?
4) What is the difference between operational and organizational data? What are the pros and cons of each?
5) What are some of the ethical issues we face in data mining? How can they be addressed?
6) What is meant by out-of-synch data? How can this situation be remedied?
7) What is normalization? What are some reasons why it is a good thing in OLTP systems, but not so good in OLAP systems?

Exercises

1) Design a relational database with at least three tables. Be sure to create the columns necessary within each table to relate the tables to one another.
2) Design a data warehouse table with some columns which would usually be normalized. Explain why it makes sense to denormalize in a data warehouse.
3) Perform an Internet search to find information about data security and privacy. List three web sites that you found that provided information that could be applied to data mining. Explain how it might be applied.
4) Find a newspaper, magazine or Internet news article related to information privacy or security. Summarize the article and explain how it might be related to data mining.

5) Using the Internet, locate a data set which is available for download. Describe the data set (contents, purpose, size, age, etc.). Classify the data set as operational or organizational. Summarize any requirements placed on individuals who may wish to use the data set.
6) Obtain a copy of an application for a grocery store shopping card. Summarize the type of data requested when filling out the application. Give an example of how that data may aid in a data mining activity. What privacy concerns arise regarding the data being collected?

Chapter Three: Data Preparation

Context and Perspective

Jerry is the marketing manager for a small Internet design and advertising firm. Jerry’s boss asks him to develop a data set containing information about Internet users. The company will use this data to determine what kinds of people are using the Internet and how the firm may be able to market their services to this group of users.

To accomplish his assignment, Jerry creates an online survey and places links to the survey on several popular Web sites. Within two weeks, Jerry has collected enough data to begin analysis, but he finds that his data needs to be denormalized. He also notes that some observations in the set are missing values or they appear to contain invalid values. Jerry realizes that some additional work on the data needs to take place before analysis begins.

Learning Objectives
Collation
Data Scrubbing
Hands on Exercise

Starting now, and throughout the next chapters of this book, there will be opportunities for you to put your hands on your computer and follow along. In order to do this, you will need to be sure to install OpenOffice and RapidMiner, as was discussed in the section A Note about Tools in Chapter 1. You will also need to have an Internet connection to access this book’s companion web site, where copies of all data sets used in the chapter exercises are available. The companion web site is located at:
https://sites.google.com/site/dataminingforthemasses/

You can download the Chapter 3 data set, which is an export of the view created in OpenOffice Base, from the web site by locating it in the list of files and then clicking the down arrow to the far right of the file name, as indicated by the black arrows in Figure 3-4. You may want to consider creating a folder labeled ‘data mining’ or something similar where you can keep copies of your data—more files will be required and created as we continue through the rest of the book, especially when we get into building data mining models in RapidMiner. Having a central place to keep everything together will simplify things, and upon your first launch of the RapidMiner software, you’ll be prompted to create a repository, so it’s a good idea to have a space ready. Once you’ve downloaded the Chapter 3 data set, you’re ready to begin learning how to handle and prepare data for mining in RapidMiner.

Preparing RapidMiner, Importing Data, and Handling Missing Data
Data Reduction
Handling Inconsistent Data
Attribute Reduction
Chapter Summary

This chapter has introduced you to a number of concepts related to data preparation. Recall that Data Preparation is the third step in the CRISP-DM process. Once you have established Organizational Understanding as it relates to your data mining plans, and developed Data Understanding in terms of what data you need, what data you have, where it is located, and so forth; you can begin to prepare your data for mining. This has been the focus of this chapter.

The chapter used a small and very simple data set to help you learn to set up the RapidMiner data mining environment. You have learned about viewing data sets in OpenOffice Base, and learned some ways that data sets in relational databases can be collated. You have also learned about comma separated values (CSV) files.

We have then stepped through adding CSV files to a RapidMiner data repository in order to handle missing data, reduce data through observation filtering, handle inconsistencies in data, and reduce the number of attributes in a model. All of these methods will be used in future chapters to prepare data for modeling.
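As a plain-language illustration of one of these steps, the following Python sketch replaces a missing value with the mean of the attribute’s observed values—one common strategy for handling missing data, and the kind of work RapidMiner performs through its missing-value operators. The data values are invented for the example:

```python
# A small, made-up attribute with one missing value, represented as None.
ages = [34, 29, None, 41, 36]

# Replace the missing entry with the mean of the observed values.
# This preserves the attribute's average while filling the gap.
observed = [a for a in ages if a is not None]
fill_value = sum(observed) / len(observed)  # (34+29+41+36)/4 = 35.0
filled = [a if a is not None else fill_value for a in ages]
print(filled)  # [34, 29, 35.0, 41, 36]
```

Mean replacement is only one option; depending on the attribute, filling with a constant, the median, or simply filtering out the incomplete observation may be more appropriate, as later chapters discuss.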

Data mining is most successful when conducted upon a foundation of well-prepared data. Recall the quotation from Chapter 1 from Alice’s Adventures in Wonderland—which way you go does not matter very much if you don’t know, or don’t care, where you are going. Likewise, the value of where you arrive when you complete a data mining exercise will largely depend upon how well you prepared to get there. Sometimes we hear the phrase “It’s better than nothing”. Well, in data mining, results gleaned from poorly prepared data might be “Worse than nothing”, because they may be misleading. Decisions based upon them could lead an organization down a detrimental and costly path. Learn to value the process of data preparation, and you will learn to be a better data miner.

Review Questions

1) What are the four main processes of data preparation discussed in this chapter? What do they accomplish and why are they important?
2) What are some ways to collate data from a relational database?
3) For what kinds of problems might a data set need to be scrubbed?
4) Why is it often better to perform reductions using operators rather than excluding attributes or observations as data are imported?
5) What is a data repository in RapidMiner and how is one created?
6) How might inconsistent data cause later trouble in data mining activities?

Exercise

1) Locate a data set of any number of attributes and observations. You may have access to data sets through personal data collection or through your employment, although if you use an employer’s data, make sure to do so only by permission! You can also search the Internet for data set libraries. A simple search on the term ‘data sets’ in your favorite search engine will yield a number of web sites that offer libraries of data sets that you can use for academic and learning purposes. Download a data set that looks interesting to you and complete the following:
2) Format the data set into a CSV file. It may come in this format, or you may need to open the data in OpenOffice Calc or some similar software, and then use the File > Save As feature to save your data as a CSV file.

3) Import your data into your RapidMiner repository. Save it in the repository as Chapter3_Exercise.
4) Create a new, blank process stream in RapidMiner and drag your data set into the process window.
5) Run your process and examine your data set in both meta data view and Data View. Note if any attributes have missing or inconsistent data.
6) If you found any missing or inconsistent data, use operators to handle these. Perhaps try browsing through the folder tree in the Operators tab and experiment with some operators that were not covered in this chapter.
7) Try filtering out some observations based on some attribute’s value, and filter out some attributes.
8) Document where you found your data set, how you prepared it for import into RapidMiner, and what data preparation activities you applied to it.

Section Two: Data Mining Models and Methods

Chapter Four: Correlation

Context and Perspective

Sarah is a regional sales manager for a nationwide supplier of fossil fuels for home heating. Recent volatility in market prices for heating oil specifically, coupled with wide variability in the size of each order for home heating oil, has Sarah concerned. She feels a need to understand the types of behaviors and other factors that may influence the demand for heating oil in the domestic market. What factors are related to heating oil usage, and how might she use a knowledge of such factors to better manage her inventory, and anticipate demand? Sarah believes that data mining can help her begin to formulate an understanding of these factors and interactions.

Organizational Understanding
Data Understanding

In order to investigate her question, Sarah has enlisted our help in creating a correlation matrix of six attributes. Working together, using Sarah’s employer’s data resources which are primarily drawn from the company’s billing database, we create a data set comprised of the following attributes:

  • Insulation: This is a density rating, ranging from one to ten, indicating the thickness of each home’s insulation. A home with a density rating of one is poorly insulated, while a home with a density of ten has excellent insulation.
  • Temperature: This is the average outdoor ambient temperature at each home for the most recent year, measured in degrees Fahrenheit.
  • Heating_Oil: This is the total number of units of heating oil purchased by the owner of each home in the most recent year.
  • Num_Occupants: This is the total number of occupants living in each home.
  • Avg_Age: This is the average age of those occupants.
  • Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size. The higher the number, the larger the home.
Data Preparation
Modeling
Evaluation
Deployment
Chapter Summary

This chapter has introduced the concept of correlation as a data mining model. It has been chosen as the first model for this book because it is relatively simple to construct, run and interpret, thus serving as an easy starting point upon which to build. Future models will become more complex, but continuing to develop your skills in RapidMiner and getting comfortable with the tools will make the more complex models easier for you to achieve as we move forward.

Recall from Chapter 1 (Figure 1-2) that data mining has two somewhat interconnected sides: Classification, and Prediction. Correlation has been shown to be primarily on the side of Classification. We do not infer causation using correlation metrics, nor do we use correlation coefficients to predict one attribute’s value based on another’s. We can however quickly find general trends in data sets using correlations, and we can anticipate how strongly movement in one attribute will coincide with movement in another.

Correlation can be a quick and easy way to see how elements of a given problem may be interacting with one another. Whenever you find yourself asking how certain factors in a problem you’re trying to solve interact with one another, consider building a correlation matrix to find out. For example, does customer satisfaction change based on time of year? Does the amount of rainfall change the price of a crop? Does household income influence which restaurants a person patronizes? The answer to each of these questions is probably ‘yes’, but correlation can not only help us know if that’s true, it can also help us learn how strong the interactions are when, and if, they occur.

Review Questions

1) What are some of the limitations of correlation models?
2) What is a correlation coefficient? How is it interpreted?
3) What is the difference between a positive and a negative correlation? If two attributes have values that decrease at essentially the same rate, is that a negative correlation? Why or why not?
4) How is correlation strength measured? What are the ranges for strengths of correlation?
5) The number of heating oil consuming devices was suggested as a possibly interesting attribute that could be added to the example data set for this chapter. Can you think of others? Why might they be interesting? To what other attributes in the data set do you think your suggested attributes might be correlated? What would be the value in knowing if they are?

Exercise

It is now your turn to develop a correlation model, generate a coefficient matrix, and analyze the results. To complete this chapter’s exercise, follow the steps below.
1) Select a professional sporting organization that you enjoy, or of which you are aware. Locate that organization’s web site and search it for statistics, facts and figures about the athletes in that organization.
2) Open OpenOffice Calc, and starting in Cell A across Row 1 of the spreadsheet, define some attributes (at least three or four) to hold data about each athlete. Some possible attributes you may wish to consider could be annual_salary, points_per_game, years_as_pro, height, weight, age, etc. The list is potentially unlimited, will vary based on the type of sport you choose, and will depend on the data available to you on the web site you’ve selected. Measurements of the athletes’ salaries and performance in competition are likely to be the most interesting. You may include the athletes’ names; however, keep in mind that correlations can only be conducted on numeric data, so the name attribute would need to be reduced out of your data set before creating your correlation matrix. (Remember the Select Attributes operator!)
3) Look up the statistics for each of your selected attributes and enter them as observations into your spreadsheet. Try to find as many as you can—at least thirty is a good rule of thumb in order to achieve at least a basic level of statistical validity. More is better.
4) Once you’ve created your data set, use the menu to save it as a CSV file. Click File, then Save As. Enter a file name, and change ‘Save as type:’ to be Text CSV (.csv). Be sure to save the file in your data mining data folder.
5) Open RapidMiner and import your data set into your RapidMiner repository. Name it Chapter4Exercise, or something descriptive so that you will remember what data are contained in the data set when you look in your repository.
6) Add the data set to a new process in RapidMiner. Ensure that the out port is connected to a res port and run your model. Save your process with a descriptive name if you wish. Examine your data in results perspective and ensure there are no missing, inconsistent, or other potentially problematic data that might need to be handled as part of your Data Preparation phase. Return to design perspective and handle any data preparation tasks that may be necessary.
7) Add a Correlation Matrix operator to your stream and ensure that the mat port is connected to a res port. Run your model again. Interpret your correlation coefficients as displayed on the matrix tab.
8) Document your findings. What correlations exist? How strong are they? Are they surprising to you and if so, why? What other attributes would you like to add? Are there any you’d eliminate now that you’ve mined your data?

Challenge step!
9) While still in results perspective, click on the ExampleSet tab (which exists assuming you left the exa port connected to a res port when you were in design perspective). Click on the Plot View radio button. Examine correlations that you found in your model visually by creating a scatter plot of your data. Choose one attribute for your x-Axis and a correlated one for your y-Axis. Experiment with the Jitter slide bar. What is it doing? (Hint: Try an Internet search on the term ‘jittering statistics’.) For an additional visual experience, try a Scatter 3D or Scatter 3D Color plot. Consider Figures 4-8 and 4-9 as examples. Note that with 3D plots in RapidMiner, you can click and hold to rotate your plot in order to better see the interactions between the data.

Figure 4-8. A two-dimensional scatterplot with a colored third dimension and a slight jitter


Figure 4-9. A three-dimensional scatterplot with a colored fourth dimension

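What the Jitter slide bar in the challenge step is doing can be sketched in a few lines: jittering adds a small amount of random noise to each plotted value so that identical observations no longer sit exactly on top of one another in a scatter plot. The `jitter` helper and sample Home_Size ratings below are invented for this illustration:

```python
import random

def jitter(values, amount=0.1, seed=42):
    """Return a copy of values with small uniform noise added.

    Jittering spreads overlapping points apart so each one is visible
    on a scatter plot; it changes only the picture, not the data used
    in any calculation."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

home_size = [5, 5, 5, 6, 6, 8]   # many identical ratings would overlap
print(jitter(home_size, amount=0.2))
```

Each jittered value stays within `amount` of the original, which is why a light jitter reveals point density without visibly distorting the plot.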

Chapter Five: Association Rules

Context and Perspective

Roger is a city manager for a medium-sized, but steadily growing, city. The city has limited resources, and like most municipalities, there are more needs than there are resources. He feels like the citizens in the community are fairly active in various community organizations, and believes that he may be able to get a number of groups to work together to meet some of the needs in the community. He knows there are churches, social clubs, hobby enthusiasts and other types of groups in the community. What he doesn’t know is if there are connections between the groups that might enable natural collaborations between two or more groups that could work together on projects around town. He decides that before he can begin asking community organizations to begin working together and to accept responsibility for projects, he needs to find out if there are any existing associations between the different types of groups in the area.

Learning Objectives
Organizational Understanding
Data Understanding

In order to answer his question, Roger has enlisted our help in creating an association rules data mining model. Association rules are a data mining methodology that seeks to find frequent connections between attributes in a data set. Association rules are very common when doing shopping basket analysis. Marketers and vendors in many sectors use this data mining approach to try to find which products are most frequently purchased together. If you have ever purchased items on an e-Commerce retail site like Amazon.com, you have probably seen the fruits of association rule data mining. These are most commonly found in the recommendations sections of such web sites. You might notice that when you search for a smartphone, screen protectors, protective cases, and other accessories such as charging cords or data cables are often recommended to you. The items being recommended are identified by mining for items that previous customers bought in conjunction with the item you search for. In other words, those items are found to be associated with the item you are looking for, and that association is so frequent in the web site’s data set that the association might be considered a rule. Thus is born the name of this data mining approach: “association rules”. While association rules are most common in shopping basket analysis, this modeling technique can be applied to a broad range of questions. We will help Roger by creating an association rule model to try to find linkages across types of community organizations.
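The strength of an association rule is usually measured with two metrics, support and confidence, which the chapter summary and review questions return to. As a hedged illustration, the sketch below computes both in plain Python; the respondents and their group memberships are invented for the example (the book itself does this work with RapidMiner's FP-Growth operator):

```python
# Each transaction is the set of groups one (hypothetical) respondent belongs to
transactions = [
    {"Religious", "Family"},
    {"Religious", "Family", "Hobbies"},
    {"Hobbies", "Social_Club"},
    {"Religious", "Family"},
    {"Political"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(premise, conclusion, transactions):
    """Of the transactions containing the premise, the fraction that
    also contain the conclusion."""
    return support(premise | conclusion, transactions) / support(premise, transactions)

# Candidate rule: Religious -> Family
print(support({"Religious", "Family"}, transactions))        # → 0.6
print(confidence({"Religious"}, {"Family"}, transactions))   # → 1.0
```

In this toy sample, 60% of respondents belong to both a religious and a family organization, and every respondent in a religious organization is also in a family one, so the rule Religious → Family would look strong.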

Working together, we use Roger’s knowledge of the local community to create a short survey which we will administer online via a web site. In order to ensure a measure of data integrity and to try to protect against possible abuse, our web survey is password protected. Each organization invited to participate in the survey is given a unique password. The leader of that organization is asked to share the password with his or her membership and to encourage participation in the survey. Community members are given a month to respond, and each time an individual logs on to complete the survey, the password used is recorded so that we can determine how many people from each organization responded. After the month ends, we have a data set comprised of the following attributes:

  • Elapsed_Time: This is the amount of time each respondent spent completing our survey. It is expressed in decimal minutes (e.g. 4.5 in this attribute would be four minutes, thirty seconds).
  • Time_in_Community: This question on the survey asked the person if they have lived in the area for 0-2 years, 3-9 years, or 10+ years; and is recorded in the data set as Short, Medium, or Long respectively.
  • Gender: The survey respondent’s gender.
  • Working: A yes/no column indicating whether or not the respondent currently has a paid job.
  • Age: The survey respondent’s age in years.
  • Family: A yes/no column indicating whether or not the respondent is currently a member of a family-oriented community organization, such as Big Brothers/Big Sisters, children’s recreation or sports leagues, genealogy groups, etc.
  • Hobbies: A yes/no column indicating whether or not the respondent is currently a member of a hobby-oriented community organization, such as amateur radio, outdoor recreation, motorcycle or bicycle riding, etc.
  • Social_Club: A yes/no column indicating whether or not the respondent is currently a member of a community social organization, such as Rotary International, Lion’s Club, etc.
  • Political: A yes/no column indicating whether or not the respondent is currently a member of a political organization with regular meetings in the community, such as a political party, a grass-roots action group, a lobbying effort, etc.
  • Professional: A yes/no column indicating whether or not the respondent is currently a member of a professional organization with local chapter meetings, such as a chapter of a law or medical society, a small business owner’s group, etc.
  • Religious: A yes/no column indicating whether or not the respondent is currently a member of a church in the community.
  • Support_Group: A yes/no column indicating whether or not the respondent is currently a member of a support-oriented community organization, such as Alcoholics Anonymous, an anger management group, etc.

In order to preserve a level of personal privacy, individual respondents’ names were not collected through the survey, and no respondent was asked to give personally identifiable information when responding.

Data Preparation
Modeling
Evaluation
Deployment
Chapter Summary

This chapter’s fictional scenario with Roger’s desire to use community groups to improve his city has shown how association rule data mining can identify linkages in data that can have a practical application. In addition to learning about the process of creating association rule models in RapidMiner, we introduced a new operator that enabled us to change attributes’ data types. We also used CRISP-DM’s cyclical nature to understand that sometimes data mining involves some back and forth ‘digging’ before moving on to the next step. You learned how support and confidence percentages are calculated and about the importance of these two metrics in identifying rules and determining their strength in a data set.

Review Questions

1) What are association rules? What are they good for?
2) What are the two main metrics that are calculated in association rules and how are they calculated?
3) What data type must a data set’s attributes be in order to use Frequent Pattern operators in RapidMiner?
4) How are rule results interpreted? In this chapter’s example, what was our strongest rule? How do we know?

Exercise

In explaining support and confidence percentages in this chapter, the classic example of shopping basket analysis was used. For this exercise, you will do a shopping basket association rule analysis. Complete the following steps:
1) Using the Internet, locate a sample shopping basket data set. Search terms such as ‘association rule data set’ or ‘shopping basket data set’ will yield a number of downloadable examples. With a little effort, you will be able to find a suitable example.
2) If necessary, convert your data set to CSV format and import it into your RapidMiner repository. Give it a descriptive name and drag it into a new process window.
3) As necessary, conduct your Data Understanding and Data Preparation activities on your data set. Ensure that all of your variables have consistent data and that their data types are appropriate for the FP-Growth operator.

4) Generate association rules for your data set. Modify your confidence and support values in order to identify their most ideal levels such that you will have some interesting rules with reasonable confidence and support. Look at the other measures of rule strength such as LaPlace or Conviction.
5) Document your findings. What rules did you find? What attributes are most strongly associated with one another? Are there products that are frequently connected that surprise you? Why do you think this might be? How much did you have to test different support and confidence values before you found some association rules? Were any of your association rules good enough that you would base decisions on them? Why or why not?
Challenge Step!
6) Build a new association rule model using your same data set, but this time, use the W-FPGrowth operator. (Hints for using the W-FPGrowth operator: (1) This operator creates its own rules without help from other operators; and (2) This operator’s support and confidence parameters are labeled U and C, respectively.)
Exploration!
7) The Apriori algorithm is often used in data mining for associations. Search the RapidMiner Operators tree for Apriori operators and add them to your data set in a new process. Use the Help tab in RapidMiner’s lower right hand corner to learn about these operators’ parameters and functions (be sure you have the operator selected in your main process window in order to see its help content).

Chapter Six: k-Means Clustering

Context and Perspective

Sonia is a program director for a major health insurance provider. Recently she has been reading in medical journals and other articles, and found a strong emphasis on the influence of weight, gender and cholesterol on the development of coronary heart disease. The research she’s read confirms time after time that there is a connection between these three variables, and while there is little that can be done about one’s gender, there are certainly life choices that can be made to alter one’s cholesterol and weight. She begins brainstorming ideas for her company to offer weight and cholesterol management programs to individuals who receive health insurance through her employer. As she considers where her efforts might be most effective, she finds herself wondering if there are natural groups of individuals who are most at risk for high weight and high cholesterol, and if there are such groups, where the natural dividing lines between the groups occur.

Learning Objectives
Organizational Understanding
Data Understanding

Using the insurance company’s claims database, Sonia extracts three attributes for 547 randomly selected individuals. The three attributes are the insured’s weight in pounds as recorded on the person’s most recent medical examination, their last cholesterol level determined by blood work in their doctor’s lab, and their gender. As is typical in many data sets, the gender attribute uses 0 to indicate Female and 1 to indicate Male. We will use this sample data from Sonia’s employer’s database to build a cluster model to help Sonia understand how her company’s clients, the health insurance policy holders, appear to group together on the basis of their weights, genders and cholesterol levels. We should remember as we do this that means are particularly susceptible to undue influence by extreme outliers, so watching for inconsistent data when using the k-Means clustering data mining methodology is very important.
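The grouping Sonia is after is exactly what the k-Means algorithm does: pick k starting centroids, assign each observation to its nearest centroid, then move each centroid to the mean of its assigned observations, repeating until things settle. The book runs this through RapidMiner's k-Means operator; the minimal sketch below implements the same idea directly, with invented (weight, cholesterol) pairs standing in for the claims data:

```python
import math
import random

def kmeans(points, k, iters=20, seed=1):
    """Basic Lloyd's algorithm: repeatedly assign each point to its
    nearest centroid, then move each centroid to the mean of its
    assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if a cluster emptied out
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

# Hypothetical (weight_lbs, cholesterol) pairs: two groups visible by eye
points = [(140, 180), (150, 190), (145, 185),
          (230, 250), (240, 260), (235, 255)]
print(sorted(kmeans(points, k=2)))   # → [(145.0, 185.0), (235.0, 255.0)]
```

The centroids that come back are the "group averages" the chapter describes, and because they are means, a single extreme outlier in weight or cholesterol would drag one of them noticeably, which is why the text stresses watching for inconsistent data.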

Data Preparation
Modeling
Evaluation
Deployment
Chapter Summary

k-Means clustering is a data mining model that falls primarily on the side of Classification when referring to the Venn diagram from Chapter 1 (Figure 1-2). For this chapter’s example, it does not necessarily predict which insurance policy holders will or will not develop heart disease. It simply takes known indicators from the attributes in a data set, and groups them together based on those attributes’ similarity to group averages. Because any attributes that can be quantified can also have means calculated, k-means clustering provides an effective way of grouping observations together based on what is typical or normal for that group. It also helps us understand where one group begins and the other ends, or in other words, where the natural breaks occur between groups in a data set.

k-Means clustering is very flexible in its ability to group observations together. The k-Means operator in RapidMiner allows data miners to set the number of clusters they wish to generate, to dictate the number of sample means used to determine the clusters, and to use a number of different algorithms to evaluate means. While fairly simple in its set-up and definition, k-Means clustering is a powerful method for finding natural groups of observations in a data set.

Review Questions

1) What does the k in k-Means clustering stand for?
2) How are clusters identified? What process does RapidMiner use to define clusters and place observations in a given cluster?
3) What does the Centroid Table tell the data miner? How do you interpret the values in a Centroid Table?
4) How do descriptive statistics aid in the process of evaluating and deploying a k-Means clustering model?

5) How might the presence of outliers in the attributes of a data set influence the usefulness of a k-Means clustering model? What could be done to address the problem?

Exercise

Think of an example of a problem that could be at least partially addressed by being able to group observations in a data set into clusters. Some examples might be grouping kids who might be at risk for delinquency, grouping product sale volumes, grouping workers by productivity and effectiveness, etc. Search the Internet or other resources available to you for a data set that would allow you to investigate your question using a k-means model. As with all exercises in this text, please ensure that you have permission to use any data set that might belong to your employer or another entity. When you have secured your data set, complete the following steps:
1) Ensure that your data set is saved as a CSV file. Import your data set into your RapidMiner repository and save it with a meaningful name. Drag it into a new process window in RapidMiner.
2) Conduct any data preparation that you need for your data set. This may include handling inconsistent data, dealing with missing values, or changing data types. Remember that in order to calculate means, each attribute in your data set will need to be numeric. If, for example, one of your attributes contains the values ‘yes’ and ‘no’, you may need to change these to be 1 and 0 respectively, in order for the k-Means operator to work.
3) Connect a k-Means operator to your data set, configure your parameters (especially set your k to something meaningful for your question) and then run your model.
4) Investigate your Centroid Table, Folder View, and the other evaluation tools.
5) Report your findings for your clusters. Discuss what is interesting about them and describe what iterations of modeling you went through, such as experimentation with different parameter values, to generate the clusters. Explain how your findings are relevant to your original question.

Challenge Step!
6) Experiment with the other k-Means operators in RapidMiner, such as Kernel or Fast. How are they different from your original model? Did the use of these operators change your clusters, and if so, how?

Chapter Seven: Discriminant Analysis

Context and Perspective

Gill runs a sports academy designed to help high school aged athletes achieve their maximum athletic potential. On the boys’ side of his academy, he focuses on four major sports: Football, Basketball, Baseball and Hockey. He has found that while many high school athletes enjoy participating in a number of sports in high school, as they begin to consider playing a sport at the college level, they would prefer to specialize in one sport. As he’s worked with athletes over the years, Gill has developed an extensive data set, and he now is wondering if he can use past performance from some of his previous clients to predict prime sports for up-and-coming high school athletes. Ultimately, he hopes he can make a recommendation to each athlete as to the sport in which they should most likely choose to specialize. By evaluating each athlete’s performance across a battery of tests, Gill hopes we can help him figure out for which sport each athlete has the highest aptitude.

Learning Objectives
Organizational Understanding
Data Understanding

In order to begin to formulate a plan, we sit down with Gill to review his data assets. Every athlete that has enrolled at Gill’s academy over the past several years has taken a battery of tests measuring a number of athletic and personal traits. The battery has been administered to both boys and girls participating in a number of different sports, but for this preliminary study we have decided with Gill that we will look at data only for boys. Because the academy has been operating for some time, Gill has the benefit of knowing which of his former pupils have gone on to specialize in a single sport, and which sport it was for each of them. Working with Gill, we gather the results of the batteries for all former clients who have gone on to specialize; Gill adds the sport each person specialized in, and we have a data set comprised of 493 observations containing the following attributes:

  • Age: This is the age in years (one decimal precision for the part of the year since the client’s last birthday) at the time that the athletic and personality trait battery test was administered. Participants ranged in age from 13-19 years old at the time they took the battery.
  • Strength: This is the participant’s strength measured through a series of weight lifting exercises and recorded on a scale of 0-10, with 0 being limited strength and 10 being sufficient strength to perform all lifts without any difficulty. No participant scored 8, 9 or 10, but some participants did score 0.
  • Quickness: This is the participant’s performance on a series of responsiveness tests. Participants were timed on how quickly they were able to press buttons when they were illuminated or to jump when a buzzer sounded. Their response times were tabulated on a scale of 0-6, with 6 being extremely quick response and 0 being very slow. Participants scored all along the spectrum for this attribute.
  • Injury: This is a simple yes (1) / no (0) column indicating whether or not the young athlete had already suffered an athletic-related injury that was severe enough to require surgery or other major medical intervention. Common injuries treated with ice, rest, stretching, etc. were entered as 0. Injuries that took more than three weeks to heal, or that required physical therapy or surgery, were flagged as 1.
  • Vision: Athletes were not only tested on the usual 20/20 vision scale using an eye chart, but were also tested using eye-tracking technology to see how well they were able to pick up objects visually. This test challenged participants to identify items that moved quickly across their field of vision, and to estimate speed and direction of moving objects. Their scores were recorded on a 0 to 4 scale with 4 being perfect vision and identification of moving objects. No participant scored a perfect 4, but the scores did range from 0 to 3.
  • Endurance: Participants were subjected to an array of physical fitness tests including running, calisthenics, aerobic and cardiovascular exercise, and distance swimming. Their performance was rated on a scale of 0-10, with 10 representing the ability to perform all tasks without fatigue of any kind. Scores ranged from 0 to 6 on this attribute. Gill has acknowledged to us that even finely tuned professional athletes would not be able to score a 10 on this portion of the battery, as it is specifically designed to test the limits of human endurance.
  • Agility: This is the participant’s score on a series of tests of their ability to move, twist, turn, jump, change direction, etc. The test checked the athlete’s ability to move nimbly, precisely, and powerfully in a full range of directions. This metric is comprehensive in nature, and is influenced by some of the other metrics, as agility is often dictated by one’s strength, quickness, etc. Participants were scored between 0 and 100 on this attribute, and in our data set from Gill, we have found performance between 13 and 80.
  • Decision_Making: This portion of the battery tests the athlete’s process of deciding what to do in athletic situations. Athletes participated in simulations that tested their choices of whether or not to swing a bat, pass a ball, move to a potentially advantageous location of a playing surface, etc. Their scores were to have been recorded on a scale of 0 to 100, though Gill has indicated that no one who completed the test should have been able to score lower than a 3, as three points are awarded simply for successfully entering and exiting the decision making part of the battery. Gill knows that all 493 of his former athletes represented in this data set successfully entered and exited this portion, but there are a few scores lower than 3, and also a few over 100 in the data set, so we know we have some data preparation in our future.
  • Prime_Sport: This attribute is the sport each of the 493 athletes went on to specialize in after they left Gill’s academy. This is the attribute Gill is hoping to be able to predict for his current clients. For the boys in this study, this attribute will be one of four sports: Football (American, not soccer; sorry, soccer fans), Basketball, Baseball, or Hockey.
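The out-of-range Decision_Making scores point to a simple data preparation step: since every valid score must fall between 3 and 100, rows outside that range can be flagged for correction or filtered out. A hedged sketch of that step, with a handful of invented scores including the impossible values the text warns about:

```python
# Hypothetical Decision_Making scores, including impossible values
# (anything below 3 or above 100 cannot be a real result)
scores = [45, 2, 88, 101, 67, 0, 93]

valid = [s for s in scores if 3 <= s <= 100]
flagged = [s for s in scores if not 3 <= s <= 100]

print(valid)    # → [45, 88, 67, 93]
print(flagged)  # → [2, 101, 0]
```

Whether the flagged rows are removed entirely or investigated and corrected is a Data Preparation judgment call; in RapidMiner the equivalent filtering is done with operators rather than code.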
Data Preparation
Modeling
Evaluation
Deployment
Chapter Summary

Discriminant analysis helps us to cross the threshold between Classification and Prediction in data mining. Prior to Chapter 7, our data mining models and methodologies focused primarily on categorization of data. With Discriminant Analysis, we can take a process that is very similar in nature to k-means clustering, and with the right target attribute in a training data set, generate predictions for a scoring data set. This can become a powerful addition to k-means models, giving us the ability to apply our clusters to other data sets that haven’t yet been classified.

Discriminant analysis can be useful where the classification for some observations is known and is not known for others. Some classic applications of discriminant analysis are in the fields of biology and organizational behavior. In biology, for example, discriminant analysis has been successfully applied to the classification of plant and animal species based on the traits of those living things. In organizational behavior, this type of data modeling has been used to help workers identify potentially successful career paths based on personality traits, preferences and aptitudes. By coupling known past performance with unknown but similarly structured data, we can use discriminant analysis to effectively train a model that can then score the unknown records for us, giving us a picture of what categories the unknown observations would likely be in.
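The train-then-score flow described above can be sketched in miniature. Real linear discriminant analysis also accounts for the covariance of the attributes; the simplified stand-in below classifies each scoring observation by its nearest class centroid, which conveys the training/scoring idea without the full mathematics. All names and values here are hypothetical, loosely shaped like Gill's data:

```python
import math

def train(rows, labels):
    """'Training': compute the mean (centroid) of each labeled class."""
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(lab, []).append(row)
    return {lab: tuple(sum(dim) / len(g) for dim in zip(*g))
            for lab, g in groups.items()}

def score(model, rows):
    """'Scoring': assign each unlabeled row to the nearest class centroid."""
    return [min(model, key=lambda lab: math.dist(row, model[lab]))
            for row in rows]

# Hypothetical (Strength, Quickness) training pairs with known Prime_Sport
training = [(9, 2), (8, 3), (3, 6), (2, 5)]
labels = ["Football", "Football", "Basketball", "Basketball"]

model = train(training, labels)
# Scoring data: athletes whose Prime_Sport is not yet known
print(score(model, [(7, 2), (2, 6)]))   # → ['Football', 'Basketball']
```

The training set supplies the label (Prime_Sport) that the scoring set lacks; the model learned from the labeled rows is then applied to the unlabeled ones, which is exactly the role RapidMiner's Apply Model operator plays in the chapter.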

Review Questions

1) What type of attribute does a data set need in order to conduct discriminant analysis instead of k-means clustering?
2) What is a ‘label’ role in RapidMiner and why do you need an attribute with this role in order to conduct discriminant analysis?
3) What is the difference between a training data set and a scoring data set?
4) What is the purpose of the Apply Model operator in RapidMiner?
5) What are confidence percent attributes used for in RapidMiner? What was the likely reason that we did not find any in this chapter’s example? Are there attributes about young athletes that you can think of that were not included in our data sets that might have helped us find some confidence percents? (Hint: think of things that are fairly specific to only one or two sports.)
6) What would be problematic about including both male and female athletes in this chapter’s example data?

Exercise

For this chapter’s exercise, you will compile your own data set based on people you know and the cars they drive, and then create a linear discriminant analysis of your data in order to predict categories for a scoring data set. Complete the following steps:
1) Open a new blank spreadsheet in OpenOffice Calc. At the bottom of the spreadsheet there will be three default tabs labeled Sheet1, Sheet2, Sheet3. Rename the first one Training and the second one Scoring. You can rename the tabs by double clicking on their labels. You can delete or ignore the third default sheet.
2) On the training sheet, starting in cell A1 and going across, create attribute labels for six attributes: Age, Gender, Marital_Status, Employment, Housing, and Car_Type.
3) Copy each of these attribute names except Car_Type into the Scoring sheet.
4) On the Training sheet, enter values for each of these attributes for several people that you know who have a car. These could be family members, friends and neighbors, coworkers or fellow students, etc. Try to do at least 20 observations; 30 or more would be better. Enter husband and wife couples as two separate observations, so long as each spouse has a different vehicle. Use the following to guide your data entry:
a. For Age, you could put the person’s actual age in years, or you could put them in buckets. For example, you could put 10 for people aged 10-19; 20 for people aged 20-29; etc.
b. For Gender, enter 0 for female and 1 for male.
c. For Marital_Status, use 0 for single, 1 for married, 2 for divorced, and 3 for widowed.
d. For Employment, enter 0 for student, 1 for full-time, 2 for part-time, and 3 for retired.
e. For Housing, use 0 for lives rent-free with someone else, 1 for rents housing, and 2 for owns housing.
f. For Car_Type, you can record data in a number of ways. This will be your label, or the attribute you wish to predict. You could record each person’s car by make (e.g. Toyota, Honda, Ford, etc.), or you could record it by body style (e.g. Car, Truck, SUV, etc.). Be consistent in assigning classifications, and note that depending on the size of the data set you create, you won’t want to have too many possible classifications, or your predictions in the scoring data set will be spread out too much. With small data sets containing only 20-30 observations, the number of categories should be limited to three or four. You might even consider using Japanese, American, European as your Car_Type values.
5) Once you’ve compiled your Training data set, switch to the Scoring sheet in OpenOffice Calc. Repeat the data entry process for at least 20 people (more is better) that you know who do not have a car. You will use the training set to try to predict the type of car each of these people would drive if they had one.
6) Use the File > Save As menu option in OpenOffice Calc to save your Training and Scoring sheets as CSV files.
7) Import your two CSV files into your RapidMiner repository. Be sure to give them descriptive names.
8) Drag your two data sets into a new process window. If you have prepared your data well in OpenOffice Calc, you shouldn’t have any missing or inconsistent data to contend with, so data preparation should be minimal. Rename the two retrieve operators so you can tell the difference between your training and scoring data sets.
9) One necessary data preparation step is to add a Set Role operator and define the Car_Type attribute as your label.
10) Add a Linear Discriminant Analysis operator to your Training stream.
11) Apply your LDA model to your scoring data and run your model. Evaluate and report your results. Did you get any confidence percentages? Do the predicted Car_Types seem reasonable and consistent with your training data? Why or why not?
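If you would like to check your intuition outside RapidMiner, steps 8–11 can be sketched with scikit-learn’s LinearDiscriminantAnalysis. Everything below is a hypothetical stand-in: the inline rows replace your Training and Scoring CSV files, and the column codings follow steps 4a–4e.

```python
# A minimal sketch of the exercise's workflow using scikit-learn's
# LinearDiscriminantAnalysis in place of RapidMiner's LDA operator.
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical training data, coded as in steps 4a-4e, plus the label.
training = pd.DataFrame({
    "Age":            [20, 30, 40, 20, 50, 30],
    "Gender":         [0, 1, 1, 0, 1, 0],
    "Marital_Status": [0, 1, 2, 0, 1, 1],
    "Employment":     [0, 1, 1, 2, 3, 1],
    "Housing":        [0, 2, 2, 1, 2, 1],
    "Car_Type":       ["Japanese", "American", "American",
                       "Japanese", "American", "European"],
})

# Hypothetical scoring data: same attributes, no Car_Type (we predict it).
scoring = pd.DataFrame({
    "Age":            [20, 40],
    "Gender":         [1, 0],
    "Marital_Status": [0, 1],
    "Employment":     [0, 1],
    "Housing":        [0, 2],
})

features = ["Age", "Gender", "Marital_Status", "Employment", "Housing"]
lda = LinearDiscriminantAnalysis()
lda.fit(training[features], training["Car_Type"])  # Set Role: Car_Type = label

scoring["prediction"] = lda.predict(scoring[features])
print(scoring)
```

With a real data set of 20-30 observations per sheet, the same pattern applies; only the DataFrames would come from your CSV files instead of inline rows.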

Challenge Step!
12) Change your LDA operator to a different type of discriminant analysis (e.g. Quadratic) operator. Re-run your model. Consider doing some research to learn about the difference between linear and quadratic discriminant analysis. Compare your new results to the LDA results and report any interesting findings or differences.
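The challenge step’s comparison can also be sketched in scikit-learn, which offers both discriminant variants. The synthetic data below are an assumption chosen so the two classes have different covariance structure, which is the situation where QDA’s per-class covariances can outperform LDA’s single shared covariance.

```python
# Sketch comparing linear vs. quadratic discriminant analysis on
# synthetic data (invented for illustration, not from the exercise).
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# Class B is deliberately more spread out than class A, so the two
# classes differ in covariance, not just in their means.
X_a = rng.normal(0.0, 1.0, size=(100, 2))
X_b = rng.normal(1.0, 2.5, size=(100, 2))
X = np.vstack([X_a, X_b])
y = np.array(["A"] * 100 + ["B"] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)   # one shared covariance
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # one covariance per class

print("LDA training accuracy:", lda.score(X, y))
print("QDA training accuracy:", qda.score(X, y))
```

Because QDA estimates a covariance matrix for every class, it needs more observations per class than LDA to fit reliably; that trade-off is worth noting in your report.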

Chapter Eight: Linear Regression

Context and Perspective

Sarah, the regional sales manager from the Chapter 4 example, is back for more help. Business is booming, her sales team is signing up thousands of new clients, and she wants to be sure the company will be able to meet this new level of demand. She was so pleased with our assistance in finding correlations in her data, she now is hoping we can help her do some prediction as well. She knows that there is some correlation between the attributes in her data set (things like temperature, insulation, and occupant ages), and she’s now wondering if she can use the data set from Chapter 4 to predict heating oil usage for new customers. You see, these new customers haven’t begun consuming heating oil yet, there are a lot of them (42,650 to be exact), and she wants to know how much oil she needs to expect to keep in stock in order to meet these new customers’ demand. Can she use data mining to examine household attributes and known past consumption quantities to anticipate and meet her new customers’ needs?

Learning Objectives
Organizational Understanding
Data Understanding

As a review, our data set from Chapter 4 contains the following attributes:

  • Insulation: This is a density rating, ranging from one to ten, indicating the thickness of each home’s insulation. A home with a density rating of one is poorly insulated, while a home with a density of ten has excellent insulation.
  • Temperature: This is the average outdoor ambient temperature at each home for the most recent year, measured in degrees Fahrenheit.
  • Heating_Oil: This is the total number of units of heating oil purchased by the owner of each home in the most recent year.
  • Num_Occupants: This is the total number of occupants living in each home.
  • Avg_Age: This is the average age of those occupants.
  • Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size. The higher the number, the larger the home.

We will use the Chapter 4 data set as our training data set in this chapter. Sarah has assembled a separate Comma Separated Values file containing all of these same attributes, except of course for Heating_Oil, for her 42,650 new clients. She has provided this data set to us to use as the scoring data set in our model.

Data Preparation
Modeling
Evaluation
Deployment
Chapter Summary

Linear regression is a predictive model that uses training and scoring data sets to generate numeric predictions in data. It is important to remember that linear regression uses numeric data types for all of its attributes. It uses the algebraic formula for calculating the slope of a line to determine where an observation would fall along an imaginary line through the scoring data. Each attribute in the data set is evaluated statistically for its ability to predict the target attribute. Attributes that are not strong predictors are removed from the model. Those attributes that are good predictors are assigned coefficients which give them weight in the prediction formula. Any observations whose attribute values fall in the range of corresponding training attribute values can be plugged into the formula in order to predict the target.
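As an illustration of the summary above (not the chapter’s actual model), the sketch below fits a line through invented training data, reads off the coefficients that weight each attribute, and then plugs a scoring observation into the resulting formula. The attribute names echo the heating oil data set, but every number is fabricated.

```python
# Illustrative sketch: coefficients as weights in the prediction formula.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
# Hypothetical training attributes: Insulation (1-10) and Avg_Age.
insulation = rng.integers(1, 11, size=50)
avg_age = rng.integers(18, 80, size=50)
X_train = np.column_stack([insulation, avg_age])
# Made-up usage that depends on both attributes, plus a little noise.
heating_oil = 150 + 8 * insulation + 2 * avg_age + rng.normal(0, 5, 50)

model = LinearRegression().fit(X_train, heating_oil)
print("coefficients (weights):", model.coef_)
print("intercept:", model.intercept_)

# "Plugging in" a scoring observation that lies inside the training ranges:
X_score = np.array([[5, 40]])
print("predicted Heating_Oil:", model.predict(X_score)[0])
```

The fitted coefficients recover weights close to the 8 and 2 used to generate the data, which is exactly the “good predictors are assigned coefficients” idea in the paragraph above.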

Once linear regression predictions are calculated, the results can be summarized in order to determine if there are differences in the predictions in subsets of the scoring data. As more data are collected, they can be added into the training data set in order to create a more robust training data set, or to expand the ranges of some attributes to include even more values. It is very important to remember that the ranges for the scoring attributes must fall within the ranges for the training attributes in order to ensure valid predictions.
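That range caveat can be checked mechanically before scoring. A small pandas sketch (with invented numbers) flags any scoring observation whose attribute values fall outside the minimums and maximums seen in training:

```python
# Flag scoring rows that fall outside the training attribute ranges.
import pandas as pd

training = pd.DataFrame({"Insulation": [2, 5, 9], "Avg_Age": [25, 44, 71]})
scoring = pd.DataFrame({"Insulation": [4, 10], "Avg_Age": [30, 50]})

lo, hi = training.min(), training.max()          # per-attribute bounds
out_of_range = (scoring.lt(lo) | scoring.gt(hi)).any(axis=1)
print(scoring[out_of_range])  # rows whose predictions we should not trust
```

Here the second scoring row has an Insulation of 10, above the training maximum of 9, so its prediction would be an extrapolation rather than a valid prediction.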

Review Questions

1) What data type does linear regression expect for all attributes? What data type will the predicted attribute be when it is calculated?
2) Why are the attribute ranges so important when doing linear regression data mining?

3) What are linear regression coefficients? What does ‘weight’ mean?
4) What is the linear regression mathematical formula, and how is it arranged?
5) How are linear regression results interpreted?
Extra thought question:
6) If you have an attribute that you want to use in a linear regression model, but it contains text data, such as the make or model of a car, what could you do in order to be able to use that attribute in your model?

Exercise

In the Chapter 4 exercise, you compiled your own data set about professional athletes. For this exercise, we will enhance this data set and then build a linear regression model on it. Complete the following steps:

1) Open the data set you compiled for the Chapter 4 exercise. If you did not do that exercise, please turn back to Chapter 4 and complete steps 1 – 4.
2) Split your data set’s observations in two: a training portion and a scoring portion. Be sure that you have at least 20 observations in your training data set, and at least 10 in your scoring data set. More would be better, so if you only have 30 observations total, perhaps it would be good to take some time to look up ten or so more athletes to add to your scoring data set. Also, we are going to try to predict each athlete’s salary, so if Salary is not one of your attributes, look it up for each athlete in your training data set (don’t look it up for the scoring data set athletes, we’re going to try to predict these). Also, if there are other attributes that you don’t have, but that you think would be great predictors of salary, look these up, and add them to both your training and scoring data sets. These might be things like points per game, defensive statistics, etc. Be sure your attributes are numeric.

3) Import both of your data sets into your RapidMiner repository. Be sure to give them descriptive names. Drag and drop them into a new process, and rename them as Training and Scoring so that you can tell them apart.
4) Use a Set Role operator to designate the Salary attribute as the label for the training data.
5) Add a linear regression operator and apply your model to your scoring data set.
6) Run your model. In results perspective, examine your attribute coefficients and the predictions for the athletes’ salaries in your scoring data set.
7) Report your results:
a. Which attributes have the greatest weight?
b. Were any attributes dropped from the data set as non-predictors? If so, which ones and why do you think they weren’t effective predictors?
c. Look up a few of the salaries for some of your scoring data athletes and compare their actual salary to the predicted salary. Is it very close? Why or why not, do you think?
d. What other attributes do you think would help your model better predict professional athletes’ salaries?
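If you want to sanity-check steps 3 through 7 outside RapidMiner, the sketch below does the same thing with scikit-learn: fit the regression on a training portion, pair each attribute name with its coefficient (weight), and predict salaries for the scoring athletes. All rows, attribute names, and numbers are invented for illustration.

```python
# Sketch of the athlete-salary exercise with scikit-learn.
import pandas as pd
from sklearn.linear_model import LinearRegression

training = pd.DataFrame({
    "Points_Per_Game": [28, 12, 19, 24, 8, 15, 22, 10],
    "Years_Pro":       [10, 3, 6, 8, 1, 5, 9, 2],
    "Salary":          [30.0, 4.5, 11.0, 22.0, 1.2, 7.5, 19.0, 2.8],  # $M
})
scoring = pd.DataFrame({
    "Points_Per_Game": [18, 26],
    "Years_Pro":       [4, 7],
})

features = ["Points_Per_Game", "Years_Pro"]
model = LinearRegression().fit(training[features], training["Salary"])

# Step 6: examine the attribute coefficients...
for name, coef in zip(features, model.coef_):
    print(f"{name}: weight {coef:.2f}")

# ...and the predicted salaries for the scoring athletes.
scoring["predicted_Salary"] = model.predict(scoring[features])
print(scoring)
```

With real athlete data the coefficients tell you which attributes carry the greatest weight (step 7a), and any attribute scikit-learn effectively zeroes out is a candidate non-predictor (step 7b).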

Chapter Nine: Logistic Regression

Context and Perspective

Remember Sonia, the health insurance program director from Chapter 6? Well, she’s back for more help too! Her k-means clustering project was so helpful in finding groups of folks who could benefit from her programs that she wants to do more. This time around, she is concerned with helping those who have suffered heart attacks. She wants to help them improve lifestyle choices, including management of weight and stress, in order to improve their chances of not suffering a second heart attack. Sonia is wondering if, with the right training data, we can predict the chances of her company’s policy holders suffering second heart attacks. She feels like she could really help some of her policy holders who have suffered heart attacks by offering weight, cholesterol and stress management classes or support groups. By lowering these key heart attack risk factors, her employer’s clients will live healthier lives, and her employer’s risk of having to pay costs associated with treatment of second heart attacks will also go down. Sonia thinks she might even be able to educate the insured individuals about ways to save money in other aspects of their lives, such as their life insurance premiums, by being able to demonstrate that they are now a lower risk policy holder.

Learning Objectives
Organizational Understanding
Data Understanding

Sonia has access to the company’s medical claims database. With this access, she is able to generate two data sets for us. The first is a list of people who have suffered heart attacks, with an attribute indicating whether or not they have had more than one; and the second is a list of those who have had a first heart attack, but not a second. The former data set, comprised of 138 observations, will serve as our training data; while the latter, comprised of 690 people’s data, will be for scoring. Sonia’s hope is to help this latter group of people avoid becoming second heart attack victims. In compiling the two data sets we have defined the following attributes:

  • Age: The age in years of the person, rounded to the nearest whole year.
  • Marital_Status: The person’s current marital status, indicated by a coded number: 0–Single, never married; 1–Married; 2–Divorced; 3–Widowed.
  • Gender: The person’s gender: 0 for female; 1 for male.
  • Weight_Category: The person’s weight categorized into one of three levels: 0 for normal weight range; 1 for overweight; and 2 for obese.
  • Cholesterol: The person’s cholesterol level, as recorded at the time of their treatment for their most recent heart attack (their only heart attack, in the case of those individuals in the scoring data set).
  • Stress_Management: A binary attribute indicating whether or not the person has previously attended a stress management course: 0 for no; 1 for yes.
  • Trait_Anxiety: A score on a scale of 0 to 100 measuring each person’s natural stress level and ability to cope with stress. A short time after each person in each of the two data sets had recovered from their first heart attack, they were administered a standard test of natural anxiety. Their scores are tabulated and recorded in this attribute in five-point increments. A score of 0 would indicate that the person never feels anxiety, pressure or stress in any situation, while a score of 100 would indicate that the person lives in a constant state of being overwhelmed and unable to deal with his or her circumstances.
  • 2nd_Heart_Attack: This attribute exists only in the training data set. It will be our label, the prediction or target attribute. In the training data set, the attribute is set to ‘yes’ for individuals who have suffered second heart attacks, and ‘no’ for those who have not.
Data Preparation
Modeling
Evaluation
Deployment
Chapter Summary

Logistic regression is an excellent way to predict whether or not something will happen, and how confident we are in such predictions. It takes a number of numeric attributes into account and then uses those through a training data set to predict the probable outcomes in a comparable scoring data set. Logistic regression uses a nominal target attribute (or label, in RapidMiner) to categorize observations in a scoring data set into their probable outcomes.
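In scikit-learn terms (a stand-in for RapidMiner, with entirely made-up data), predict_proba() plays the role of the confidence(Yes) and confidence(No) columns the chapter describes: numeric predictors go in, and a nominal prediction comes out alongside a probability for each possible outcome.

```python
# Sketch: logistic regression with probabilities as 'confidences'.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
# Invented predictors echoing the chapter's attributes.
cholesterol = rng.normal(200, 30, n)
weight_cat = rng.integers(0, 3, n)
# Made-up rule: risk rises with cholesterol and weight category.
logit = 0.04 * (cholesterol - 200) + 0.9 * weight_cat - 1.0
second_attack = np.where(
    rng.random(n) < 1 / (1 + np.exp(-logit)), "Yes", "No")

X = np.column_stack([cholesterol, weight_cat])
model = LogisticRegression().fit(X, second_attack)

# Probabilities for two hypothetical scoring patients:
patients = np.array([[180.0, 0], [260.0, 2]])
for row, proba in zip(patients, model.predict_proba(patients)):
    print(row, dict(zip(model.classes_, proba.round(2))))
```

As expected, the higher-cholesterol, higher-weight-category patient receives a higher ‘Yes’ probability, which is the kind of confidence comparison Sonia would use to prioritize her outreach.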

As with linear regression, the scoring data must have ranges that fall within their corresponding training data ranges. Without such bounds, it is unsafe and unwise to draw assumptions about observations in the scoring data set, since there are no comparable observations in the training data upon which to base your scoring assumptions. When used within these bounds however, logistic regression can help us quickly and easily predict the outcome of some phenomenon in a data set, and to determine how confident we can be in the accuracy of that prediction.

Review Questions

1) What is the appropriate data type for independent variables (predictor attributes) in logistic regression? What about for the dependent variable (target or label attribute)?
2) Compare the predictions for Row 15 and 669 in the chapter’s example model.
a. What is the single difference between these two people, and how does it affect their predicted 2nd_Heart_Attack risk?
b. Locate other 67 year old men in the results and compare them to the men on rows 15 and 669. How do they compare?
c. Can you spot areas when the men represented on rows 15 and 669 could improve their chances of not suffering a second heart attack?
3) What is the difference between confidence(Yes) and confidence(No) in this chapter’s example?
4) How can you set an attribute’s role to be ‘label’ in RapidMiner without using the Set Role operator? What is one drawback to doing it that way?

Exercise

For this chapter’s exercise, you will use logistic regression to try to predict whether or not young people you know will eventually graduate from college. Complete the following steps:
1) Open a new blank spreadsheet in OpenOffice Calc. At the bottom of the spreadsheet there will be three default tabs labeled Sheet1, Sheet2, Sheet3. Rename the first one Training and the second one Scoring. You can rename the tabs by double clicking on their labels. You can delete or ignore the third default sheet.
2) On the training sheet, starting in cell A1 and going across, create attribute labels for five attributes: Parent_Grad, Gender, Income_Level, Num_Siblings, and Graduated.
3) Copy each of these attribute names except Graduated into the Scoring sheet.

4) On the Training sheet, enter values for each of these attributes for several adults that you know who are at the age that they could have graduated from college by now. These could be family members, friends and neighbors, coworkers or fellow students, etc. Try to do at least 20 observations; 30 or more would be better. Enter husband and wife couples as two separate observations. Use the following to guide your data entry:
a. For Parent_Grad, enter a 0 if neither of the person’s parents graduated from college, a 1 if one parent did, and a 2 if both parents did. If the person’s parents went on to earn graduate degrees, you could experiment with making this attribute even more interesting by using it to hold the total number of college degrees by the person’s parents. For example, if the person represented in the observation had a mother who earned a bachelor’s, master’s and doctorate, and a father who earned a bachelor’s and a master’s, you could enter a 5 in this attribute for that person.
b. For Gender, enter 0 for female and 1 for male.
c. For Income_Level, enter a 0 if the person lives in a household with an income level you would consider to be below average, a 1 for average, and a 2 for above average. You can estimate or generalize. Be sensitive to others when gathering your data—don’t snoop too much or risk offending your data subjects.
d. For Num_Siblings, enter the number of siblings the person has.
e. For Graduated, put ‘Yes’ if the person has graduated from college and ‘No’ if they have not.
5) Once you’ve compiled your Training data set, switch to the Scoring sheet in OpenOffice Calc. Repeat the data entry process for at least 20 (more is better) young people between the ages of 0 and 18 that you know. You will use the training set to try to predict whether or not these young people will graduate from college, and if so, how confident you are in your prediction. Remember this is your scoring data, so you won’t provide the Graduated attribute, you’ll predict it shortly.
6) Use the File > Save As menu option in OpenOffice Calc to save your Training and Scoring sheets as CSV files.
7) Import your two CSV files into your RapidMiner repository. Be sure to give them descriptive names.

8) Drag your two data sets into a new process window. If you have prepared your data well in OpenOffice Calc, you shouldn’t have any missing or inconsistent data to contend with, so data preparation should be minimal. Rename the two retrieve operators so you can tell the difference between your training and scoring data sets.
9) One necessary data preparation step is to add a Set Role operator and define the Graduated attribute as your label in your training data. Alternatively, you can set your Graduated attribute as the label during data import.
10) Add a Logistic Regression operator to your Training stream.
11) Apply your Logistic Regression model to your scoring data and run your model. Evaluate and report your results. Are your confidence percentages interesting? Surprising? Do the predicted Graduation values seem reasonable and consistent with your training data? Does any one independent variable (predictor attribute) seem to be a particularly good predictor of the dependent variable (label or prediction attribute)? If so, why do you think so?
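A rough stand-in for steps 8–11, assuming the coded attributes from step 4: scikit-learn’s LogisticRegression replaces the RapidMiner operator, and the rows below are invented examples, not real people.

```python
# Sketch of the graduation-prediction exercise with scikit-learn.
import pandas as pd
from sklearn.linear_model import LogisticRegression

training = pd.DataFrame({
    "Parent_Grad":  [0, 2, 1, 0, 2, 1, 0, 2, 1, 0, 2, 0],
    "Gender":       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "Income_Level": [0, 2, 1, 0, 2, 2, 1, 2, 0, 0, 1, 2],
    "Num_Siblings": [4, 1, 2, 3, 0, 1, 5, 2, 3, 4, 1, 2],
    "Graduated":    ["No", "Yes", "Yes", "No", "Yes", "Yes",
                     "No", "Yes", "No", "No", "Yes", "Yes"],
})
scoring = pd.DataFrame({
    "Parent_Grad":  [2, 0],
    "Gender":       [0, 1],
    "Income_Level": [2, 0],
    "Num_Siblings": [1, 4],
})

features = ["Parent_Grad", "Gender", "Income_Level", "Num_Siblings"]
model = LogisticRegression().fit(training[features], training["Graduated"])

scoring["prediction"] = model.predict(scoring[features])
# classes_ is sorted alphabetically, so columns are (No, Yes).
scoring[["conf_No", "conf_Yes"]] = model.predict_proba(scoring[features])
print(scoring)
```

With your real 20-30 observations per sheet, the conf_Yes column is the confidence percentage the exercise asks you to evaluate in step 11.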

Challenge Step!
12) Change your Logistic Regression operator to a different type of Logistic operator (for example, maybe try the Weka W-Logistic operator). Re-run your model. Consider doing some research to learn about the difference between algorithms underlying different logistic approaches. Compare your new results to the original Logistic Regression results and report any interesting findings or differences.

Chapter Ten: Decision Trees

Context and Perspective

Richard works for a large online retailer. His company is launching a next-generation eReader soon, and they want to maximize the effectiveness of their marketing. They have many customers, some of whom purchased one of the company’s previous generation digital readers. Richard has noticed that certain types of people were the most anxious to get the previous generation device, while other folks seemed content to wait to buy the electronic gadget later. He’s wondering what makes some people motivated to buy something as soon as it comes out, while others are less driven to have the product.

Richard’s employer helps to drive the sales of its new eReader by offering specific products and services for the eReader through its massive web site—for example, eReader owners can use the company’s web site to buy digital magazines, newspapers, books, music, and so forth. The company also sells thousands of other types of media, such as traditional printed books and electronics of every kind. Richard believes that by mining the customers’ data regarding general consumer behaviors on the web site, he’ll be able to figure out which customers will buy the new eReader early, which ones will buy next, and which ones will buy later on. He hopes that by predicting when a customer will be ready to buy the next-gen eReader, he’ll be able to time his target marketing to the people most ready to respond to advertisements and promotions.

Learning Objectives
Organizational Understanding
Data Understanding

Richard has engaged us to help him with his project. We have decided to use a decision tree model in order to find good early predictors of buying behavior. Because Richard’s company does all of its business through its web site, there is a rich data set of information for each customer, including items they have just browsed for, and those they have actually purchased. He has prepared two data sets for us to use. The training data set contains the web site activities of customers who bought the company’s previous generation reader, and the timing with which they bought their reader. The second is comprised of attributes of current customers whom Richard hopes will buy the new eReader. He hopes to figure out which category of adopter each person in the scoring data set will fall into based on the profiles and buying timing of those people in the training data set.

In analyzing his data set, Richard has found that customers’ activity in the areas of digital media and books, and their general activity with electronics for sale on his company’s site, seem to have a lot in common with when a person buys an eReader. With this in mind, we have worked with Richard to compile data sets comprised of the following attributes:

  • User_ID: A numeric, unique identifier assigned to each person who has an account on the company’s web site.
  • Gender: The customer’s gender, as identified in their customer account. In this data set, it is recorded as ‘M’ for male and ‘F’ for female. The Decision Tree operator can handle non-numeric data types.
  • Age: The person’s age at the time the data were extracted from the web site’s database. This is calculated to the nearest year by taking the difference between the system date and the person’s birthdate as recorded in their account.
  • Marital_Status: The person’s marital status as recorded in their account. People who indicated on their account that they are married are entered in the data set as ‘M’. Since the web site does not distinguish single types of people, those who are divorced or widowed are included with those who have never been married (indicated in the data set as ‘S’).
  • Website_Activity: This attribute is an indication of how active each customer is on the company’s web site. Working with Richard, we used the web site database’s information which records the duration of each customer’s visits to the web site to calculate how frequently, and for how long each time, the customers use the web site. This is then translated into one of three categories: Seldom, Regular, or Frequent.
  • Browsed_Electronics_12Mo: This is simply a Yes/No column indicating whether or not the person browsed for electronic products on the company’s web site in the past year.
  • Bought_Electronics_12Mo: Another Yes/No column indicating whether or not they purchased an electronic item through Richard’s company’s web site in the past year.
  • Bought_Digital_Media_18Mo: This attribute is a Yes/No field indicating whether or not the person has purchased some form of digital media (such as MP3 music) in the past year and a half. This attribute does not include digital book purchases.
  • Bought_Digital_Books: Richard believes that as an indicator of buying behavior relative to the company’s new eReader, this attribute will likely be the best indicator. Thus, this attribute has been set apart from the purchase of other types of digital media. Further, this attribute indicates whether or not the customer has ever bought a digital book, not just in the past year or so.
  • Payment_Method: This attribute indicates how the person pays for their purchases. In cases where the person has paid in more than one way, the mode, or most frequent method of payment is used. There are four options:
    • Bank Transfer—payment via e-check or other form of wire transfer directly from the bank to the company.
    • Website Account—the customer has set up a credit card or permanent electronic funds transfer on their account so that purchases are directly charged through their account at the time of purchase.
    • Credit Card—the person enters a credit card number and authorization each time they purchase something through the site.
    • Monthly Billing—the person makes purchases periodically and receives a paper or electronic bill which they pay later either by mailing a check or through the company web site’s payment system.
  • eReader_Adoption: This attribute exists only in the training data set. It consists of data for customers who purchased the previous-gen eReader. Those who purchased within a week of the product’s release are recorded in this attribute as ‘Innovator’. Those who purchased after the first week but within the second or third weeks are entered as ‘Early Adopter’. Those who purchased after three weeks but within the first two months are ‘Early Majority’. Those who purchased after the first two months are ‘Late Majority’. This attribute will serve as our label when we apply our training data to our scoring data.
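The four adopter categories above are defined by purchase timing. As a sketch, a label like eReader_Adoption could be derived from a days-since-release value; note that the 60-day cutoff for “two months” is an assumption for illustration:

```python
# Hypothetical derivation of the eReader_Adoption label from timing data.
def adopter_category(days_since_release: int) -> str:
    """Map purchase timing to the eReader_Adoption labels in the text."""
    if days_since_release <= 7:        # within the first week
        return "Innovator"
    elif days_since_release <= 21:     # second or third week
        return "Early Adopter"
    elif days_since_release <= 60:     # within the first two months (assumed)
        return "Early Majority"
    else:                              # after the first two months
        return "Late Majority"

print([adopter_category(d) for d in (3, 10, 45, 90)])
```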

With Richard’s data and an understanding of what it means, we can now proceed to…

Data Preparation
Modeling
Evaluation
Figure 10-12. Tree resulting from a gini_index algorithm


Deployment
Chapter Summary

Decision trees are excellent predictive models when the target attribute is categorical in nature, and when the data set is of mixed types. Although this chapter’s data sets did not contain any examples, decision trees are also better than more statistics-based approaches at handling attributes that have missing or inconsistent values: decision trees will work around such data and still generate usable results.

Decision trees are made of nodes and leaves (connected by labeled branch arrows), representing the best predictor attributes in a data set. These nodes and leaves lead to confidence percentages based on the actual attributes in the training data set, and can then be applied to similarly structured scoring data in order to generate predictions for the scoring observations. Decision trees tell us what is predicted, how confident we can be in the prediction, and how we arrived at the prediction. The ‘how we arrived at’ portion of a decision tree’s output is shown in a graphical view of the tree.
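As a sketch of the summary above, the small tree below (fit on a tiny invented stand-in data set, not the chapter’s) prints its nodes and leaves, which is the “how we arrived at the prediction” part, and reports leaf class proportions, which is where the confidence percentages come from.

```python
# Sketch: a decision tree's nodes, leaves, and leaf-level confidences.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Age":          [22, 35, 58, 41, 29, 63, 47, 19],
    "Bought_Books": [1, 1, 0, 1, 0, 0, 1, 0],
    "Adopter":      ["Early", "Early", "Late", "Early",
                     "Late", "Late", "Early", "Late"],
})

features = ["Age", "Bought_Books"]
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data[features], data["Adopter"])

# 'How we arrived at the prediction': the tree's nodes and leaves.
print(export_text(tree, feature_names=features))

# Confidence percentages come from class proportions at the leaf reached.
example = pd.DataFrame({"Age": [30], "Bought_Books": [1]})
print(tree.predict(example), tree.predict_proba(example))
```

In this toy data set Bought_Books separates the classes perfectly, so the printed tree has a single split and the confidences at each leaf are 100%; real data would produce deeper trees with mixed leaves.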

Review Questions

1) What characteristics of a data set’s attributes might prompt you to choose a decision tree data mining methodology, rather than a logistic or linear regression approach? Why?
2) Run this chapter’s model using the gain_ratio algorithm and make a note of three or four individuals’ prediction and confidences. Then re-run the model under gini_index. Locate the people you noted. Did their prediction and/or confidences change? Look at their attribute values and compare them to the nodes and leaves in the decision tree. Explain why you think at least one person’s prediction changed under Gini, based on that person’s attributes and the tree’s nodes.
3) What are confidence percentages used for, and why would they be important to consider, in addition to just considering the prediction attribute?
4) How do you keep an attribute, such as a person’s name or ID number, that should not be considered predictive in a process’s model, but is useful to have in the data mining results?

5) If your decision tree is large or hard to read, how can you adjust its visual layout to improve readability?

Exercise

For this chapter’s exercise, you will make a decision tree to predict whether or not you, and others you know, would have lived, died, or been lost if you had been on the Titanic. Complete the following steps.
1) Conduct an Internet search for passenger lists for the Titanic. The search term ‘Titanic passenger list’ in your favorite search engine will yield a number of web sites containing lists of passengers.
2) Select from the sources you find a sample of passengers. You do not need to construct a training data set of every passenger on the Titanic (unless you want to), but get at least 30, and preferably more. The more robust your training data set is, the more interesting your results will be.
3) In a spreadsheet in OpenOffice Calc, enter these passengers’ data.
a. Record attributes such as their name, age, gender, class of service they traveled in, race or nationality if known, or other attributes that may be available to you depending on the detail level of the data source you find.
b. Be sure to have at least four attributes, preferably more. Remember that the passengers’ names or ID numbers won’t be predictive, so that attribute shouldn’t be counted as one of your predictor attributes.
c. Add to your data set whether the person lived (i.e. was rescued from a life boat or from the water), died (i.e. their body was recovered), or was lost (i.e. was on the Titanic’s manifest but was never accounted for and therefore presumed dead after the ship’s sinking). Call this attribute ‘Survival_Result’.
d. Save this spreadsheet as a CSV file and then import it into your RapidMiner repository. Set the Survival_Result attribute’s role to be your label. Set other attributes which are not predictive, such as names, to not be considered in the decision tree model.
e. Add a Decision Tree operator to your stream.
4) In a new, blank spreadsheet in OpenOffice Calc, duplicate the attribute names from your training data set, with the exception of Survival_Result. You will predict this attribute using your decision tree.
5) Enter data for yourself and people that you know into this spreadsheet.
a. For some attributes, you may have to decide what to put. For example, the author acknowledges that based on how relentlessly he searches for the absolutely cheapest ticket when shopping for airfare, he almost certainly would have been in 3rd class if he had been on the Titanic. He further knows some people who very likely would have been in 1st class.
b. If you want to include some people in your data set but you don’t know every single attribute for them, remember, decision trees can handle some missing values.
c. Save this spreadsheet as a CSV file and import it into your RapidMiner repository.
d. Drag this data set into your process and ensure that attributes that are not predictive, such as names, will not be included as predictors in the model.
6) Apply your decision tree model to your scoring data set.
7) Run your model using gain_ratio. Report your tree nodes, and discuss whether you and the people you know would have lived, died or been lost.
8) Re-run your model using gini_index. Report differences in your tree’s structure. Discuss whether your chances for survival increase under Gini.
9) Experiment with changing leaf and split sizes, and other decision tree algorithm criteria, such as information_gain. Analyze and report your results.
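Step 9’s experiment can be sketched in scikit-learn terms: RapidMiner’s gini_index and information_gain roughly correspond to the ‘gini’ and ‘entropy’ criteria here, and min_samples_leaf stands in for leaf size. The passenger rows are invented examples, not real Titanic records.

```python
# Sketch: comparing decision tree splitting criteria and leaf sizes.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

passengers = pd.DataFrame({
    "Age":    [22, 38, 26, 35, 28, 2, 27, 54, 14, 4],
    "Gender": [1, 0, 0, 0, 1, 1, 1, 1, 0, 0],   # 1 = male, 0 = female
    "Class":  [3, 1, 3, 1, 3, 3, 2, 1, 3, 2],
    "Survival_Result": ["Lost", "Lived", "Lived", "Lived", "Lost",
                        "Lost", "Lost", "Lost", "Lived", "Lived"],
})

features = ["Age", "Gender", "Class"]
X, y = passengers[features], passengers["Survival_Result"]

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion,
                                  min_samples_leaf=2, random_state=0)
    tree.fit(X, y)
    print(criterion, "depth:", tree.get_depth(),
          "training accuracy:", tree.score(X, y))
```

On a realistic passenger sample the two criteria can produce different tree structures and different predictions, which is exactly the comparison steps 7 and 8 ask you to report.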

Chapter Eleven: Neural Networks

Context and Perspective

Juan is a statistical performance analyst for a major professional athletic team. His team has been steadily improving over recent seasons, and heading into the coming season management believes that by adding between two and four excellent players, the team will have an outstanding shot at achieving the league championship. They have tasked Juan with identifying their best options from among a list of 59 experienced players that will be available to them. All of these players have experience: some have played professionally before, and some have many years of experience as amateurs. None are to be ruled out without being assessed for their potential ability to add star power and productivity to the existing team. The executives Juan works for are anxious to get going on contacting the most promising prospects, so Juan needs to quickly evaluate these athletes’ past performance and make recommendations based on his analysis.

Learning Objectives
Organizational Understanding
Data Understanding

Juan knows the business of athletic statistical analysis. He has seen how performance in one area, such as scoring, is often interconnected with other areas such as defense or fouls. The best athletes generally have strong connections between two or more performance areas, while more typical athletes may have a strength in one area but weaknesses in others. For example, good role players are often good defenders, but can’t contribute much scoring to the team. Using league data and his knowledge of and experience with the players in the league, Juan prepares a training data set comprising 263 observations and 19 attributes. The 59 prospective athletes Juan’s team could acquire form the scoring data set, and he has the same attributes for each of these people. We will help Juan build a neural network, which is a data mining methodology that can predict categories or classifications in much the same way that decision trees do, but neural networks are better at finding the strength of connections between attributes, and it is those very connections that Juan is interested in. The attributes our neural network will evaluate are:

  • Player_Name: This is the player’s name. In our data preparation phase, we will set its role to ‘id’, since it is not predictive in any way, but is important to keep in our data set so that Juan can quickly make his recommendations without having to match the data back to the players’ names later. (Note that the names in this chapter’s data sets were created using a random name generator. They are fictitious and any similarity to real persons is unintended and purely coincidental.)
  • Position_ID: For the sport Juan’s team plays, there are 12 possible positions. Each one is represented as an integer from 0 to 11 in the data sets.
  • Shots: This is the total number of shots, or scoring opportunities, each player took in their most recent season.
  • Makes: This is the number of times the athlete scored when shooting during the most recent season.
  • Personal_Points: This is the number of points the athlete personally scored during the most recent season.
  • Total_Points: This is the total number of points the athlete contributed to scoring in the most recent season. In the sport Juan’s team plays, this statistic is recorded for each point an athlete contributes to scoring. In other words, each time an athlete scores a personal point, their total points increase by one, and every time an athlete contributes to a teammate scoring, their total points increase by one as well.
  • Assists: This is a defensive statistic indicating the number of times the athlete helped his team get the ball away from the opposing team during the most recent season.
  • Concessions: This is the number of times the athlete’s play directly caused the opposing team to concede an offensive advantage during the most recent season.
  • Blocks: This is the number of times the athlete directly and independently blocked the opposing team’s shot during the most recent season.
  • Block_Assists: This is the number of times an athlete collaborated with a teammate to block the opposing team’s shot during the most recent season. If recorded as a block assist, two or more players must have been involved. If only one player blocked the shot, it is recorded as a block. Since the playing surface is large and the players are spread out, it is much more likely for an athlete to record a block than for two or more to record block assists.
  • Fouls: This is the number of times, in the most recent season, that the athlete committed a foul. Since fouling the other team gives them an advantage, the lower this number, the better the athlete’s performance for his own team.
  • Years_Pro: In the training data set, this is the number of years the athlete has played at the professional level. In the scoring data set, this is the number of years of experience the athlete has, including years as a professional, if any, and years in organized, competitive amateur leagues.
  • Career_Shots: This is the same as the Shots attribute, except it is cumulative for the athlete’s entire career. All career attributes are an attempt to assess the person’s ability to perform consistently over time.
  • Career_Makes: This is the same as the Makes attribute, except it is cumulative for the athlete’s entire career.
  • Career_PP: This is the same as the Personal Points attribute, except it is cumulative for the athlete’s entire career.
  • Career_TP: This is the same as the Total Points attribute, except it is cumulative for the athlete’s entire career.
  • Career_Assists: This is the same as the Assists attribute, except it is cumulative for the athlete’s entire career.
  • Career_Con: This is the same as the Concessions attribute, except it is cumulative for the athlete’s entire career.
  • Team_Value: This is a categorical attribute summarizing the athlete’s value to his team. It is present only in the training data, as it will serve as our label to predict a Team_Value for each observation in the scoring data set. There are four categories:
    • Role Player: This is an athlete who is good enough to play at the professional level, and may be really good in one area, but is not excellent overall.
    • Contributor: This is an athlete who contributes across several categories of defense and offense and can be counted on to regularly help the team win.
    • Franchise Player: This is an athlete whose skills are so broad, strong and consistent that the team will want to hang on to them for a long time. These players are of such a talent level that they can form the foundation of a really good, competitive team.
    • Superstar: This is that rare individual whose gifts are so superior that they make a difference in every game. Most teams in the league will have one such player, but teams with two or three always contend for the league title.

Juan’s data are ready and we understand the attributes available to us. We can now proceed to…

Data Preparation
Modeling
Evaluation
Figure 11-5. A graphical view of our neural network showing different strength neurons and the four nodes for each of the possible Team_Value categories


Deployment
Chapter Summary

Neural networks try to mimic the human brain by using artificial ‘neurons’ to compare attributes to one another and look for strong connections. By taking in attribute values, processing them, and generating nodes connected by neurons, this data mining model can offer predictions and confidence percentages, even amid uncertainty in some data. Neural networks are not as limited regarding value ranges as some other methodologies.

In their graphical representation, neural nets are drawn using nodes and neurons. The thicker or darker the line between nodes, the stronger the connection represented by that neuron. Stronger neurons equate to a stronger ability by that attribute to predict. Although the graphical view can be difficult to read, which can often happen when there are a large number of attributes, the computer is able to read the network and apply the model to scoring data in order to make predictions. Confidence percentages can further inform the value of an observation’s prediction, as was illustrated with our hypothetical athlete Lance Goodwin in this chapter. Between the prediction and confidence percentages, we can use neural networks to find interesting observations that may not be obvious, but still represent good opportunities to answer questions or solve problems.
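Although RapidMiner performs all of this for us, the arithmetic inside each neuron is straightforward. Below is a minimal, hypothetical sketch in pure Python (the inputs and weights are invented for illustration and do not come from Juan’s actual model) of a single forward pass through one hidden layer, ending in the kind of confidence percentages discussed above for the four Team_Value categories:

```python
import math

def forward(inputs, hidden_w, output_w):
    """One forward pass: weighted sums through a sigmoid hidden layer,
    then a softmax over the four output nodes (confidence percentages)."""
    hidden = [1 / (1 + math.exp(-sum(w * x for w, x in zip(ws, inputs))))
              for ws in hidden_w]
    raw = [sum(w * h for w, h in zip(ws, hidden)) for ws in output_w]
    exps = [math.exp(r) for r in raw]
    return [e / sum(exps) for e in exps]  # percentages summing to 1

# Two normalized inputs (say, scoring rate and career consistency) and
# invented weights; thicker 'neurons' in the graph correspond to larger weights.
inputs = [0.9, 0.7]
hidden_w = [[2.0, 1.0], [-1.5, 2.5]]                          # two hidden neurons
output_w = [[0.2, 0.1], [0.5, 0.4], [1.2, 0.9], [2.0, 1.6]]   # four output nodes

labels = ["Role Player", "Contributor", "Franchise Player", "Superstar"]
for label, c in zip(labels, forward(inputs, hidden_w, output_w)):
    print(f"{label}: {c:.1%}")
```

During training, the network adjusts these weights until its predictions fit the training data; the graphical view in Figure 11-5 is simply a picture of the weights it settled on.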

Review Questions

1) Where do neural networks get their name? What characteristics of the model make it ‘neural’?
2) Find another observation in this chapter’s example that is interesting but not obvious, similar to the Lance Goodwin observation. Why is the observation you found interesting? Why is it less obvious than some?
3) How should confidence percentages be used in conjunction with a neural network’s predictions?
4) Why might a data miner prefer a neural network over a decision tree?
5) If you want to see a node’s details in a RapidMiner graph of a neural network, what can you do?

Exercise

For this chapter’s exercise, you will create a neural network to predict risk levels for loan applicants at a bank. Complete the following steps.
1) Access the companion web site for this text. Locate and download the training data set labeled Chapter11Exercise_TrainingData.csv.
2) Import the training data set into your RapidMiner repository and name it descriptively. Drag and drop the data set into a new, blank main process.
3) Set the Credit_Risk attribute as your label. Remember that Applicant_ID is not predictive.
4) Add a Neural Net operator to your model.

5) Create your own scoring data set using the attributes in the training data set as a guide. Enter at least 20 observations. You can enter data for people that you know (you may have to estimate some of their attribute values, e.g. their credit score), or you can simply test different values for each of the attributes. For example, you might choose to enter four consecutive observations with the same values in all attributes except for the credit score, where you might increment each observation’s credit score by 100 from 400 up to 800.
6) Import your scoring data set and apply your model to it.
7) Run your model and review your predictions for each of your scoring observations. Report your results, including any interesting or unexpected results.
 

Challenge Step!
8) See if you can experiment with different lower bounds for each attribute to find the point at which a person will be predicted in the ‘DO NOT LEND’ category. Use a combination of Declare Missing Values and Replace Missing Values operators to try different thresholds on various attributes. Report your results.
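If you would rather generate your step 5 scoring file programmatically than type it by hand, a short script can produce the incrementing-credit-score pattern suggested above. The attribute names here are hypothetical; substitute the actual column headers from the Chapter 11 exercise training data set:

```python
import csv

# Hypothetical attribute names -- check them against the training CSV's header.
fields = ["Applicant_ID", "Credit_Score", "Income", "Years_Employed"]

rows = []
base = {"Income": 45000, "Years_Employed": 6}
for i, score in enumerate(range(400, 801, 100), start=1):
    # Identical applicants except for credit score, to isolate its effect.
    rows.append({"Applicant_ID": i, "Credit_Score": score, **base})

with open("Chapter11_MyScoringData.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
```

Holding every other attribute constant while one attribute varies is a simple form of sensitivity testing: any change in the model’s prediction across these rows must be due to the credit score.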

Chapter Twelve: Text Mining

Context and Perspective

Gillian is a historian and archivist at a national museum in the United States. She has recently curated an exhibit on the Federalist Papers. The Federalist Papers are a series of dozens of essays that were written and published in the late 1700s. The essays were published in two different newspapers in the state of New York over the course of about one year, and they were released anonymously under the author name ‘Publius’. Their intent was to educate the American people about the new nation’s proposed constitution, and to advocate in favor of its ratification. No one really knew at the time if ‘Publius’ was one individual or many, but several individuals familiar with the authors and framers of the constitution had spotted some patterns in vocabulary and sentence structure that seemed similar to sections of the U.S. Constitution. Years later, after Alexander Hamilton died in the year 1804, some notes were discovered that revealed that he (Hamilton), James Madison and John Jay had been the authors of the papers. The notes indicated specific authors for some papers, but not for others. Specifically, John Jay was revealed to be the author for papers 3, 4 and 5; Madison for paper 14; and Hamilton for paper 17. Paper 18 had no author named, but there was evidence that Hamilton and Madison worked on that one together.

Learning Objectives
Organizational Understanding
Data Understanding

Gillian’s data set is simple: we will include the full text of Federalist Papers numbers 5 (Jay), 14 (Madison), 17 (Hamilton), and 18 (suspected collaboration between Madison and Hamilton). The Federalist Papers are available through a number of sources: they have been re-published in book form, they are available on a number of different web sites, and their text is archived in many libraries throughout the world. For this chapter’s exercise, the text of these four papers has been added to the book’s companion web site. There are four files for you to download:

  • Chapter12_Federalist05_Jay.txt
  • Chapter12_Federalist14_Madison.txt
  • Chapter12_Federalist17_Hamilton.txt
  • Chapter12_Federalist18_Collaboration.txt

Please download these now, but do not import them into a RapidMiner repository. The process of handling textual data in RapidMiner is a bit different than what we have done in past chapters. With these four papers’ text available to us, we can move directly into the CRISP-DM phase of…

Data Preparation
Modeling
Evaluation
Deployment
Chapter Summary

Text mining is a powerful way of analyzing data in an unstructured format such as in paragraphs of text. Text can be fed into a model in different ways, and then that text can be broken down into tokens. Once tokenized, words can be further manipulated to address matters such as case sensitivity, phrases or word groupings, and word stems. The results of these analyses can reveal the frequency and commonality of strong words or grams across groups of documents. This can reveal trends in the text, such as what topics are most important to author(s), or what message should be taken away from the text when reading the documents.

Further, once the documents’ tokens are organized into attributes, the documents can be modeled, just as other, more structured data sets can be modeled. Multiple documents can be handled by a single Process Document operator in RapidMiner, which will apply the same set of tokenization and token handlers to all documents at once through the sub-process stream. After a model has been applied to a set of documents, additional documents can be added to the stream, passed through the document processor, and run through the model to yield better-trained and more specific results.
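To make the summary concrete, here is a minimal pure-Python sketch of the three token handlers discussed in this chapter: tokenization with case handling, a deliberately crude stemmer (real text miners use algorithms such as Porter stemming), and n-gram generation. RapidMiner’s operators are far more sophisticated; this simply illustrates the ideas:

```python
import re

def tokenize(text):
    """Lowercase the text and split on non-letters (case handling + tokenization)."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """A crude suffix-stripping stem, so 'governs' and 'governing' match."""
    for suffix in ("ation", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n=2):
    """Adjacent word groups (bigrams by default) capture multi-word phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The federal government governs; governing Federally.")
print([stem(t) for t in tokens])
print(ngrams(tokens))
```

Notice that after stemming, ‘governs’ and ‘governing’ collapse into the single token ‘govern’, which is exactly what lets frequency counts reveal a document’s dominant themes.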

Review Questions

1) What are some of the benefits of text mining as opposed to the other models you’ve learned in this book?
2) What are some ways that text-based data can be imported into RapidMiner?
3) What is a sub-process and when do you use one in RapidMiner?
4) Define the following terms: token, stem, n-gram, case-sensitive.
5) How does tokenization enable the application of data mining models to text-based data?
6) How do you view a k-Means cluster’s details?

Exercise

For this chapter’s exercise, you will mine text for common complaints against a company or industry. Complete the following steps.

1) Using your favorite search engine, locate a web site or discussion forum on the Internet where people have posted complaints, criticisms or pleas for help regarding a company or an industry (e.g. airlines, utility companies, insurance companies, etc.).
2) Copy and paste at least ten of these posts or comments into a text editor, saving each one as its own text document with a unique name.
3) Open a new, blank process in RapidMiner, and using the Read Documents operator, connect to each of your ten (or more) text documents containing the customer complaints you found.
4) Process these documents in RapidMiner. Be sure you tokenize and use other handlers in your sub-process as you deem appropriate/necessary. Experiment with grams and stems.
5) Use a k-Means cluster to group your documents into two, three or more clusters. Output your word list as well.
6) Report the following:
a. Based on your word list, what seem to be the most common complaints or issues in your documents? Why do you think that is? What evidence can you give to support your claim?
b. Based on your word list, are there some terms or phrases that show up in all, or at least most of your documents? Why do you think these are so common?
c. Based on your clusters, what groups did you get? What are the common themes in each of your clusters? Is this surprising? Why or why not?
d. How might a customer service manager use your model to address the common concerns or issues you found?

Challenge Step!
7) Using your knowledge from past chapters, removed the k-Means clustering operator, and try to apply a different data mining methodology such as association rules or decision trees to your text documents. Report your results.

Section Three: Special Considerations in Data Mining

Chapter Thirteen: Evaluation and Deployment

How Far We’ve Come

The purpose of this book, which was explained in Chapter 1, is to introduce non-experts and non-computer scientists to some of the methods and tools of data mining. Certainly there have been a number of processes, tools, operators, data manipulation techniques, etc., demonstrated in this book, but perhaps the most important lesson to take away from this broad treatment of data mining is that the field has become huge, complex, and dynamic. You have learned about the CRISP-DM process, and had it shown to you numerous times as you have seen data mining models that classified, predicted and did both. You have seen a number of data processing tools and techniques, and as you have done this, you have hopefully noticed thy myriad other operators in RapidMiner that we did not use or discuss. Although you may be feeling like you’re getting good at data mining (and we hope you do), please recognize that there is a world of data mining that this book has not touched on—so there is still much for you to learn.

This chapter and the next will discuss some cautions that should be taken before putting any real-world data mining results into practice. This chapter will demonstrate a method for using RapidMiner to conduct some validation for data mining models; while Chapter 14 will discuss the choices you will make as a data miner, and some ways to guide those choices in good directions. Remember from Chapter 1 that CRISP-DM is cyclical—you should always be learning from the work you are doing, and feeding what you’ve learned from your work back into your next data mining activity.

For example, suppose you used a Replace Missing Values operator in a data mining model to set all missing values in a data set to the average for each attribute. Suppose further that you used results of that data mining model in making decisions for your company, and that those decisions turned out to be less than ideal. What if you traced those decisions back to your data mining activities and found that by using the average, you made some general assumptions that weren’t really very realistic. Perhaps you don’t need to throw out the data mining model entirely, but for the next run of that model you should be sure to change it to either remove observations with missing values, or use a more appropriate replacement value based upon what you have learned. Even if you used your data mining results and had excellent outcomes, remember that your business is constantly moving, and through the day-to-day operations of your organization, you are gathering more data. Be sure to add this data to training data sets, compare actual outcomes to predictions, and tune your data mining models in accordance with your experience and the expertise you are developing. Consider Sarah, our hypothetical sales manager from Chapters 4 and 8. Certainly now that we’ve helped her predict heating oil usage by home through a linear regression model, Sarah can track these homes’ actual heating oil orders to see how well their actual use matches our predictions. Once these customers have established several months or years of actual heating oil consumption, their data can be fed into Sarah’s model’s training data set, helping it to be even more accurate in its predictions.

One of the benefits of connecting RapidMiner to a database or data warehouse, rather than importing data via a file (CSV, etc.) is that data can be added to the data sets in real time and fed straight into the RapidMiner models. If you were to acquire some new training data, as Sarah could in the scenario just proposed in the previous paragraph, it could be immediately incorporated into the RapidMiner model if the data were in a connected database. With a CSV file, the new training data would have to be added into the file, and then re-imported into the RapidMiner repository.

As we tune and hone our models, they perform better for us. In addition to using our growing expertise and adding more training data, there are some built-in ways that we can check a model’s performance in RapidMiner.

Learning Objectives
Cross-Validation
Chapter Summary: The Value of Experience

So now we have seen one way to statistically evaluate a model’s reliability. You have seen that there are a number of cross-validation and performance operators that you can use to check a training data set’s ability to perform. But the bottom line is that there is no substitute for experience and expertise. Use subject matter experts to review your data mining results. Ask them to give you feedback on your model’s output. Run pilot tests and use focus groups to try out your model’s predictions before rolling them out organization-wide. Do not be offended if someone questions or challenges the reliability of your model’s results—be humble enough to take their questions as an opportunity to validate and strengthen your model. Remember that ‘pride goeth before the fall’! Data mining is a process. If you present your data mining results and recommendations as infallible, you are not participating in the cyclical nature of CRISP-DM, and you’ll likely end up looking foolish sooner or later. CRISP-DM is such a good process precisely because of its ability to help us investigate data, learn from our investigation, and then do it again from a more informed position. Evaluation and Deployment are the two steps in the process where we establish that more informed position.

Review Questions

1) What is cross-validation and why should you do it?
2) What is a false positive and why might one be generated?
3) Why would false positives not negate all value for a data mining model?
4) How does a model’s overall performance percentage relate to the target attribute’s (label’s) individual performance percentages?
5) How can changing a data mining methodology’s underlying algorithm affect a model’s cross-validation performance percentages?

Exercise

For this chapter’s exercise, you will create a cross-validation model for your Chapter 10 exercise training data set. Complete the following steps.
1) Open RapidMiner to a new, blank process and add the training data set you created for your Chapter 10 exercise (the Titanic survival data set).
2) Set roles as necessary.
3) Apply a cross-validation operator to the data set.
4) Configure your sub-process using gain_ratio for the Decision Tree operator’s algorithm. Apply the model and run it through a Performance (Classification) operator.

5) Report your training data set’s ability to predict.
6) Change your Decision Tree operator’s algorthim to gini_index and re-run your model.
7) Report your results in the context of any changes that occurred in your training data set’s ability to predict.
Challenge Step!
8) Change your Decision Tree operator’s algorithm to one of the other options, such as information_schema, and report your results again, comparative to gain_ratio and gini_index.
Extra Challenge Step!
9) Repeat steps 1-7 for the linear regression training data set (Chapter 8). You will need to use a slightly different Performance operator. Report your results. If you would like, repeat step 8 for your Chapter 8 exercise training data set and report your results.

Chapter Fourteen: Data Mining Ethics

Why Data Mining Ethics?

It has been said that when you are teaching someone something, you should leave the thing that you want them to remember most to the very end. It will be the last thing they remember hearing from you, the thing they take with them as they depart from your instruction. It is in harmony with this philosophy that the chapter on data mining ethics has been left to the end of this book. Please don’t misconstrue this chapter’s placement as an afterthought. It is here at the end so you’ll take it with you and remember it. It is believed that especially if you make a big deal out of it, the last thing you share with your audience will end up being what they remember from your teaching, so here is our effort at making a big deal about data mining ethics:

FIGURE 14-1. This Just In: BEING AN ETHICAL DATA MINER IS IMPORTANT

Figure14-1.png

In all seriousness, when we are dealing with data, those data represent peoples’ lives. In this book alone, we have touched on peoples’ buying behaviors, ownership of creative works, and even serious health issues. Imagine the ethical ramifications of using a decision tree to predict the risk levels of juvenile delinquents as just one example. You’d be profiling, potentially branding these youth, so any attempt to do so must be done ethically. But what does this mean? Ethics is the set of moral codes, above and beyond the legally required minimums, that an individual uses to make right and respectful decisions. When mining data, questions of an ethical nature will invariably arise. Simply because it is legal to gather and mine certain data does not make it ethical.

Because of these serious matters, there are some in the world who fear, shun and even fight against data mining. These types of reactions have led some data mining advocates and leaders to respond with attempts to defend and explain data mining technologies. One such response came in the year 2003. The Association for Computing Machinery (ACM) is the world’s foremost professional organization for computing professionals in all disciplines. This includes the ACM Special Interest Group for Knowledge Discovery and Data Mining (SIGKDD). At that time, a number of criticisms and calls against data mining were occurring, mostly driven by concerns over citizens’ privacy as the United States government increased its use of data mining in anti-terrorism activities in the years following the September 11th terrorist attacks. Certainly any time a government increases its scrutiny of its own citizens and those of other countries, it can be unsettling; however the leaders of ACM SIGKDD were likewise unsettled by the blame being placed on data mining itself. These leaders felt that the tool should be separated from the way it was being used. In response, the executive committee of ACM SIGKDD, a group that included such pioneers as Gregory Piatetsky-Shapiro, Usama Fayyad, Jiawei Han, and others, penned an open letter titled “Data Mining” is NOT Against Civil Liberties. (The two-page text of their letter is easily available on the Internet and you are encouraged to read and consider it). Their objective in writing this letter was not to defend government, or any data mining programs, but rather, to help individuals see that there is a large difference between a technology and the choices people make in the ways they use that technology.

In truth, every technology will have its detractors. It may seem a silly example, but consider a chair as a technology. It is a tool, invented by mankind to serve a purpose: sitting. If it is ergonomically designed and made of the right materials, it can facilitate very comfortable sitting. If it is fancy enough, it may exclude certain socio-economic classes from being able to own it. If pointed into a corner and associated with misbehavior, it becomes an object for punishment. If equipped with restraining straps and voltage high enough to take someone’s life, it becomes a politicized object of controversy. If picked up and used to strike another person it becomes a weapon; and yet, it is still a chair. So it is with essentially all technologies—all tools invented by mankind to do work. It is not the tool, but the choices we make in how to use it, that create and answer the questions of ethics.

This is not a simple proposition. Every one of us have a different moral compass. Each is guided by a different set of values and influenced by a unique set of backgrounds, experiences and forces. No one set of ethical guidelines is completely right or completely wrong. However, there are ways for each of us to reflectively evaluate, at least for our own, and hopefully for our organizations’ purposes, what our ethical parameters will be for each data mining activity we undertake. In order to aid in this process, we offer here a series of…

Ethical Frameworks and Suggestions

The brilliant legal scholar Lawrence Lessig has offered four mechanisms whereby we can frame and contain computing activities within reasonable bounds. These are:

  • Laws: These are statutes, enacted by a government and enforced by the same. If these are violated, they carry with them a prescribed punishment, adjudicated by a judge or a jury. Adherence to laws as a mechanism for right behavior represents the basest form of ethical decision making, because at this level, a person is merely doing what they have to do to stay out of trouble. Lessig suggests that while we often look to laws first as a method to enforce good behavior, there are other more reasonable and perhaps more effective methods.
  • Markets: Here Lessig suggests an economic solution to guiding behavior. If bad behavior is not profitable or would not enable an organization to stay in business, then bad behavior will not be prevalent. There are many ways that market forces, such as a good reputation for high quality products, excellent customer service, reliability, etc., can help guide good actions.
  • Code: In computing disciplines, code is a powerful guide of behavior, because it can be written to allow some actions while stopping others. If we feel that although it would not be illegal for members of a web site to access one another’s accounts, but that it would be unethical, we can write code to require usernames and passwords, making it more difficult for users to get into each other’s personal information. Further, we can write a code of conduct, usually referred to as an Acceptable Use Policy, which dictates what users can and cannot do. The policy is not a law, that is, it is not enacted or enforced by a government, but it is an agreement to abide by certain rules or risk losing the privilege of using the site’s services.
  • Social Norms: This form of determining what is ethical is based on what is acceptable in our society. As we look around us, interact with our friends, family, neighbors, and associates, ethical bounds can be established by what is acceptable to these people. Often, if we would be embarrassed, humiliated or otherwise shamed by our behavior, if we find ourselves wanting to hide what we’re doing from others, we have a strong indication that our activity is not ethical. We can also contribute to the establishment of social norms as ethical guides by making our own expectations of what is acceptable clear to others.
  • Organizational Standard Operating Procedures: Ethical standards can often be established by creating a set of acceptable practices for your organization. Such an effort should be undertaken by company leadership, with input from a broad cross-section of employees. These should be well-documented and communicated to employees, and reviewed regularly. Checks and balances can be built into work processes to help ensure that workers are adhering to established procedures.
  • Professional Code of Conduct: Similar to organizational operating standards, professional codes of conduct can help to establish boundaries of ethical conduct. The aforementioned Association for Computing Machinery maintains a Code of Ethics and Professional Conduct that is an excellent resource for computing professionals seeking guidance (http://www.acm.org/about/code-of-ethics). Other organizations also have codes of conduct that could be consulted in order to frame ethical decision making in data mining.
  • Immanuel Kant’s Categorical Imperative: Immanuel Kant was a German philosopher and anthropologist who lived in the 1700’s. Among his extensive writings on ethical morality, Kant’s Categorical Imperative is perhaps his most famous. This maxim states that if a given action cannot ethically be taken by anyone in a certain situation, then it should not be taken at all. In data mining, we could use this philosophy to determine: Would it be ethical for any business to collect and mine these data? What would be the outcome if every business mined data in this way? If the answers to such questions are negative and appear to be unethical, then we should not undertake the data mining project either.
  • Rene Descartes’ Rule of Change: Rene Descartes was a French philosopher and mathematician who, like Kant, wrote extensively about moral decision making. His rule of change reflects his mathematical background. It states that if an act cannot be taken repeatedly, it is not ethical to take that act even once. Again, to apply this to data mining, we can ask: Can I collect and mine these data on an ongoing basis without causing problems for myself, my organization, our customers or others? If you cannot do it repeatedly, according to Descartes, then you shouldn’t do it at all.

There are a few other, less formally defined ways to seek out ethical boundaries. There is the old adage known as the Golden Rule, which dictates that we should treat others the way we hope they would treat us. There are also philosophies that help us to consider how our actions might be perceived by others and how those actions might make others feel. Some ethical frameworks are built around actions that will bring the greatest good to the largest number of people.

Conclusion

We can protect privacy by aggregating data, anonymizing observations through removal of names and personally identifiable information, and by storing it in secure and protected environments. When you are busy working with numbers, attributes and observations, it can be easy to forget about the people behind the data. We should be cautious when data mining models might brand a person as a certain risk. Be sensitive to people’s feelings and rights. When appropriate, ask for their permission to gather and use data about them. Don’t rationalize a justification for your data mining project—ensure that you’re doing fair and just work that will help and benefit others.

Regardless of the mechanism you use to determine your ethical boundaries, our hope is that you will always keep ethical behavior in mind when mining data. Remember the personal side of what you are doing. As we began this book, we talked about the desire to introduce the subject of data mining to a new, non-traditional audience. We hope you are gaining confidence in your data mining skills, and that your creativity is helping you to envision your own data mining solutions to real-world problems you might be facing. Go exploring both within RapidMiner and through other tools for ways to find unexpected and interesting patterns in your data. The purpose of this book from the outset was to be a beginner’s guide, a way to get started in data mining—even if you don’t have a background in computer science or data analysis. Hopefully through the chapter examples and exercises, you’ve learned a lot and are well on your way to becoming an accomplished, and ethical, data miner. You’ve learned enough to be dangerous…don’t be. Apply what you’ve learned to use data as a powerful and beneficial advantage. And so as we close this book, let us do so as we began it: Let’s start digging!

Glossary and Index

This glossary contains key words, bolded throughout the text, and their definitions as they are used for the purposes of this book. The page number listed is not the only page where the term is found, but it is the page where the term is introduced or where it is primarily defined and discussed.

Antecedent

In an association rules data mining model, the antecedent is the attribute which precedes the consequent in an identified rule. Attribute order makes a difference when calculating the confidence percentage, so identifying which attribute comes first is necessary even if the reciprocal of the association is also a rule. (Page 85)

Archived Data

Data which have been copied out of a live production database and into a data warehouse or other permanent system where they can be accessed and analyzed, but not by primary operational business systems. (Page 18)

Association Rules

A data mining methodology which compares attributes in a data set across all observations to identify areas where two or more attributes are frequently found together. If their frequency of coexistence is high enough throughout the data set, the association of those attributes can be said to be a rule. (Page 74)

Attribute

In columnar data, an attribute is one column. It is named in the data so that it can be referred to by a model and used in data mining. The term attribute is sometimes interchanged with the terms ‘field’, ‘variable’, or ‘column’. (Page 16)

Average

The arithmetic mean, calculated by summing all values and dividing by the count of the values. (Pages 47, 77)

Binomial

A data type for any set of values that is limited to one of two numeric options. (Page 80)

Binominal

In RapidMiner, the data type binominal is used instead of binomial, enabling both numerical and character-based sets of values that are limited to one of two options. (Page 80)

Business Understanding

See Organizational Understanding. (Page 6)

Case

See Observation. (Page 16)

Case Sensitive

A situation where a computer program recognizes the uppercase version of a letter or word as being different from the lowercase version of the same letter or word. (Page 199)

Classification

One of the two main goals of conducting data mining activities, with the other being prediction. Classification creates groupings in a data set based on the similarity of the observations’ attributes. Some data mining methodologies, such as decision trees, can predict an observation’s classification. (Page 9)

Code

Code is the product of a programmer’s work: a set of instructions, written in a specific grammar and syntax, that a computer can understand and execute. According to Lawrence Lessig, it is one of four methods humans can use to set and control boundaries for behavior when interacting with computer systems. (Page 233)

Coefficient

In data mining, a coefficient is a value that is calculated based on the values in a data set that can be used as a multiplier or as an indicator of the relative strength of some attribute or component in a data mining model. (Page 63)

Column

See Attribute. (Page 16)

Comma Separated Values (CSV)

A common text-based format for data sets in which the divisions between attributes (columns of data) are indicated by commas. If commas occur naturally within some of the values in the data set and those values are not enclosed in quotation marks, software reading the CSV file will misinterpret those commas as attribute separators, leading to misalignment of attributes. (Page 35)
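The quoting that prevents this misalignment can be illustrated outside RapidMiner; this small Python sketch uses the standard csv module on a hypothetical two-attribute data set:

```python
import csv
import io

# Illustrative data: the first value naturally contains a comma.
rows = [["Name", "Age"], ["Smith, John", 34]]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)  # the writer quotes "Smith, John" automatically
text = buffer.getvalue()

# Reading it back keeps the quoted value intact: two attributes, not three.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed[1])  # ['Smith, John', '34']
```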

Conclusion

See Consequent. (Page 85)

Confidence (Alpha) Level

A value, usually 5% or 0.05, used to test for statistical significance in some data mining methods. If statistical significance is found, a data miner can say that there is a 95% likelihood that a calculated or predicted value is not a false positive. (Page 132)

Confidence Percent

In predictive data mining, this is the level of confidence the model has calculated for one or more possible predicted values. It is a measure of the likelihood of false positives in predictions. Regardless of the number of possible predicted values, their collective confidence percentages will always total 100%. (Page 84)

Consequent

In an association rules data mining model, the consequent is the attribute which results from the antecedent in an identified rule. If an association rule were characterized as “If this, then that”, the consequent would be that—in other words, the outcome. (Page 85)

Correlation

A statistical measure of the strength of affinity, based on the similarity of observational values, of the attributes in a data set. These can be positive (as one attribute’s values go up or down, so too does the correlated attribute’s values); or negative (correlated attributes’ values move in opposite directions). Correlations are indicated by coefficients which fall on a scale between -1 (complete negative correlation) and 1 (complete positive correlation), with 0 indicating no correlation at all between two attributes.
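As a sketch of how such a coefficient is computed, the Python function below implements the standard Pearson formula on two small illustrative attributes:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # close to 1 (complete positive correlation)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # close to -1 (complete negative correlation)
```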

CRISP-DM

An acronym for Cross-Industry Standard Process for Data Mining. This process was jointly developed by several major multi-national corporations around the turn of the new millennium in order to standardize the approach to mining data. It is comprised of six cyclical steps: Business (Organizational) Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. (Page 5)

Cross-validation

A method of statistically evaluating a training data set for its likelihood of producing false positives in a predictive data mining model. (Page 221).
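RapidMiner handles this internally, but the k-fold idea behind cross-validation can be sketched in plain Python (the fold count and observations here are illustrative):

```python
def k_fold_splits(observations, k=3):
    """Split a data set into k folds; each fold serves once as the
    hold-out set while the remaining folds are used for training."""
    folds = [observations[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        training = [obs for j, fold in enumerate(folds) if j != i
                    for obs in fold]
        yield training, holdout

data = list(range(9))  # nine illustrative observations
for training, holdout in k_fold_splits(data, k=3):
    print(len(training), len(holdout))  # 6 3, three times
```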

Data

Data are any arrangement and compilation of facts. Data may be structured (e.g. arranged in columns (attributes) and rows (observations)) or unstructured (e.g. paragraphs of text, a computer log file). (Page 3)

Data Analysis

The process of examining data in a repeatable and structured way in order to extract meaning, patterns or messages from a set of data. (Page 3)

Data Mart

A location where data are stored for easy access by a broad range of people in an organization. Data in a data mart are generally archived data, enabling analysis in a setting that does not impact live operations. (Page 20)

Data Mining

A computational process of analyzing data sets, usually large in nature, using both statistical and logical methods, in order to uncover hidden, previously unknown, and interesting patterns that can inform organizational decision making. (Page 3)

Data Preparation

The third in the six steps of CRISP-DM. At this stage, the data miner ensures that the data to be mined are clean and ready for mining. This may include handling outliers or other inconsistent data, dealing with missing values, reducing attributes or observations, setting attribute roles for modeling, etc. (Page 8)

Data Set

Any compilation of data that is suitable for analysis. (Page 18)

Data Type

In a data set, each attribute is assigned a data type based on the kind of data stored in the attribute. There are many data types which can be generalized into one of three areas: Character (Text) based; Numeric; and Date/Time. Within these categories, RapidMiner has several data types. For example, in the Character area, RapidMiner has Polynominal, Binominal, etc.; and in the Numeric area it has Real, Integer, etc. (Page 39)

Data Understanding

The second in the six steps of CRISP-DM. At this stage, the data miner seeks out sources of data in the organization, and works to collect, compile, standardize, define and document the data. The data miner develops a comprehension of where the data have come from, how they were collected and what they mean. (Page 7)

Data Warehouse

A large-scale repository for archived data which are available for analysis. Data in a data warehouse are often stored in multiple formats (e.g. by week, month, quarter and year), facilitating large scale analyses at higher speeds. The data warehouse is populated by extracting data from operational systems so that analyses do not interfere with live business operations. (Page 18)

Database

A structured organization of facts that is organized such that the facts can be reliably and repeatedly accessed. The most common type of database is a relational database, in which facts (data) are arranged in tables of columns and rows. The data are then accessed using a query language, usually SQL (Structured Query Language), in order to extract meaning from the tables. (Page 16)

Decision Tree

A data mining methodology where leaves and nodes are generated to construct a predictive tree, whereby a data miner can see the attributes which are most predictive of each possible outcome in a target (label) attribute. (Pages 9, 159).

Denormalization

The process of removing relational organization from data, reintroducing redundancy into the data, but simultaneously eliminating the need for joins in a relational database, enabling faster querying. (Page 18)

Dependent Variable (Attribute)

The attribute in a data set that is being acted upon by the other attributes. It is the thing we want to predict, the target, or label, attribute in a predictive model. (Page 108)

Deployment

The sixth and final of the six steps of CRISP-DM. At this stage, the data miner takes the results of data mining activities and puts them into practice in the organization. The data miner watches closely and collects data to determine if the deployment is successful and ethical. Deployment can happen in stages, such as through pilot programs before a full-scale roll out. (Page 10)

Descartes' Rule of Change

An ethical framework set forth by Rene Descartes which states that if an action cannot be taken repeatedly, it cannot be ethically taken even once. (Page 235)

Design Perspective

The view in RapidMiner where a data miner adds operators to a data mining stream, sets those operators’ parameters, and runs the model. (Page 41)

Discriminant Analysis

A predictive data mining model which attempts to compare the values of all observations across all attributes and identify where natural breaks occur from one category to another, and then predict which category each observation in the data set will fall into. (Page 108)

Ethics

A set of moral codes or guidelines that an individual develops to guide his or her decision making in order to make fair and respectful decisions and engage in right actions. Ethical standards are higher than legally required minimums. (Page 232)

Evaluation

The fifth of the six steps of CRISP-DM. At this stage, the data miner reviews the results of the data mining model, interprets results and determines how useful they are. He or she may also conduct an investigation into false positives or other potentially misleading results. (Page 10)

False Positive

A predicted value that ends up not being correct. (Page 221)

Field

See Attribute (Page 16).

Frequency Pattern

A recurrence of the same, or similar, observations numerous times in a single data set. (Page 81)

Fuzzy Logic

A data mining concept often associated with neural networks where predictions are made using a training data set, even though some uncertainty exists regarding the data and a model’s predictions. (Page 181)

Gain Ratio

One of several algorithms used to construct decision tree models. (Page 168)

Gini Index

An algorithm created by Corrado Gini that can be used to generate decision tree models. (Page 168)

Heterogeneity

In statistical analysis, this is the amount of variety found in the values of an attribute. (Page 119)

Inconsistent Data

These are values in an attribute in a data set that are out-of-the-ordinary among the whole set of values in that attribute. They can be statistical outliers, or other values that simply don’t make sense in the context of the ‘normal’ range of values for the attribute. They are generally replaced or removed during the Data Preparation phase of CRISP-DM. (Page 50)

Independent Variable (Attribute)

These are attributes that act on the dependent attribute (the target, or label). They are used to help predict the label in a predictive model. (Page 133)

Jittering

The process of adding a small, random decimal to discrete values in a data set so that when they are plotted in a scatter plot, they are slightly apart from one another, enabling the analyst to better see clustering and density. (Pages 17, 70)
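The book applies jittering through RapidMiner’s plotting options, but the underlying idea is simple enough to sketch in Python (the jitter amount and values below are illustrative):

```python
import random

def jitter(values, amount=0.1, seed=42):
    """Add a small random decimal to each value so identical points
    no longer overlap when plotted (seeded here for reproducibility)."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

ratings = [3, 3, 3, 4, 4]  # discrete values that would overlap in a scatter plot
print(jitter(ratings))     # each result lies within 0.1 of its original value
```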

Join

The process of connecting two or more tables in a relational database together so that their attributes can be accessed in a single query, such as in a view. (Page 17)

Kant's Categorical Imperative

An ethical framework proposed by Immanuel Kant which states that if everyone cannot ethically take some action, then no one can ethically take that action. (Page 234)

k-Means Clustering

A data mining methodology that uses the mean (average) values of the attributes in a data set to group each observation into a cluster of other observations whose values are most similar to the mean for that cluster. (Page 92)
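RapidMiner’s clustering operators do this across many attributes at once; a minimal one-dimensional sketch in Python shows the assign-then-recompute loop at the heart of the method (data and starting centroids are illustrative):

```python
def k_means_1d(values, centroids, iterations=10):
    """Minimal one-dimensional k-means: assign each value to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for v in values:
            nearest = min(centroids, key=lambda c: abs(v - c))
            clusters[nearest].append(v)
        centroids = [sum(vs) / len(vs) if vs else c
                     for c, vs in clusters.items()]
    return sorted(centroids)

weights = [50, 52, 55, 90, 92, 95]  # illustrative values with two natural groups
print(k_means_1d(weights, centroids=[50, 90]))  # the two cluster means
```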

Label

In RapidMiner, this is the role that must be set in order to use an attribute as the dependent, or target, attribute in a predictive model. (Page 108)

Laws

These are regulatory statutes which have associated consequences that are established and enforced by a governmental agency. According to Lawrence Lessig, these are one of the four methods for establishing boundaries to define and regulate social behavior. (Page 233)

Leaf

In a decision tree data mining model, this is the terminal end point of a branch, indicating the predicted outcome for observations whose values follow that branch of the tree. (Page 164)

Linear Regression

A predictive data mining method which uses the algebraic formula for calculating the slope of a line in order to predict where a given observation will likely fall along that line. (Page 128)
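The slope-and-intercept calculation can be sketched directly; this Python function implements the standard least-squares formulas on illustrative points that happen to lie exactly on y = 2x + 1:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    slope = num / den
    intercept = my - slope * mx
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points on y = 2x + 1
print(slope, intercept)       # 2.0 1.0
print(slope * 5 + intercept)  # 11.0 -- the predicted value for x = 5
```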

Logistic Regression

A predictive data mining method which uses the nonlinear logistic function to predict one of a set of possible outcomes, along with a probability that the prediction will be the actual outcome. (Page 142)

Markets

A socio-economic construct in which peoples’ buying, selling, and exchanging behaviors define the boundaries of acceptable or unacceptable behavior. Lawrence Lessig offers this as one of four methods for defining the parameters of appropriate behavior. (Page 233)

Mean

See Average. (Pages 47, 77)

Median

With the Mean and Mode, this is one of three generally used Measures of Central Tendency. It is an arithmetic way of defining what ‘normal’ looks like in a numeric attribute. It is calculated by rank ordering the values in an attribute and finding the one in the middle. If there are an even number of observations, the two in the middle are averaged to find the median. (Page 47)
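The rank-and-pick procedure described here translates directly into a short Python sketch:

```python
def median(values):
    """Rank-order the values and return the middle one, or the average
    of the two middle values when the count is even."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

print(median([7, 1, 3]))     # 3
print(median([7, 1, 3, 5]))  # 4.0 (the average of 3 and 5)
```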

Meta Data

These are facts that describe the observational values in an attribute. Meta data may include who collected the data, when, why, where, how, how often; and usually include some descriptive statistics such as the range, average, standard deviation, etc. (Page 42)

Missing Data

These are instances in an observation where one or more attributes does not have a value. It is not the same as zero, because zero is a value. Missing data are like Null values in a database, they are either unknown or undefined. These are usually replaced or removed during the Data Preparation phase of CRISP-DM. (Page 30)

Mode

With Mean and Median, this is one of three common Measures of Central Tendency. It is the value in an attribute which is the most common. It can be numerical or text. If an attribute contains two or more values that appear an equal number of times and more than any other values, then all are listed as the mode, and the attribute is said to be Bimodal or Multimodal. (Pages 42, 47)
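A small Python sketch makes the bimodal case concrete, returning every value tied for the highest frequency:

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency; more than one
    result means the attribute is bimodal or multimodal."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes(["red", "blue", "red", "green"]))  # ['red']
print(modes([1, 1, 2, 2, 3]))                  # [1, 2] -- bimodal
```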

Model

A computer-based representation of real-life events or activities, constructed upon the basis of data which represent those events. (Page 8)

Name (Attribute)

This is the text descriptor of each attribute in a data set. In RapidMiner, the first row of an imported data set should be designated as the attribute name, so that these are not interpreted as the first observation in the data set. (Page 38)

Neural Network

A predictive data mining methodology which tries to mimic human brain processes by comparing the values of all attributes in a data set to one another through the use of a hidden layer of nodes. The frequencies with which the attribute values match, or are strongly similar, create neurons which become stronger at higher frequencies of similarity. (Page 176)

n-Gram

In text mining, this is a combination of words or word stems that represents a phrase that may have more meaning or significance than the single word or stem would alone. (Page 201)
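Building n-grams amounts to sliding a window across the token list, as this Python sketch with illustrative tokens shows:

```python
def n_grams(tokens, n=2):
    """Slide a window of size n across a token list to build n-grams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["customer", "service", "was", "great"]
print(n_grams(tokens, 2))
# ['customer service', 'service was', 'was great']
```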

Node

A terminal point or mid-point in decision trees and neural networks where an attribute branches or forks away from other terminals or branches because the values represented at that point have become significantly different from all other values for that attribute. (Page 164)

Normalization

In a relational database, this is the process of breaking data out into multiple related tables in order to reduce redundancy and eliminate multivalued dependencies. (Page 18)

Null

The absence of a value in a database. The value is unrecorded, unknown, or undefined. See Missing Values. (Page 30)

Observation

A row of data in a data set. It consists of the value assigned to each attribute for one record in the data set. It is sometimes called a tuple in database language. (Page 16)

Online Analytical Processing (OLAP)

A database concept where data are collected and organized in a way that facilitates analysis, rather than practical, daily operational work. Evaluating data in a data warehouse is an example of OLAP. The underlying structure that collects and holds the data makes analysis faster, but would slow down transactional work. (Page 18)

Online Transaction Processing (OLTP)

A database concept where data are collected and organized in a way that facilitates fast and repeated transactions, rather than broader analytical work. Scanning items being purchased at a cash register is an example of OLTP. The underlying structure that collects and holds the data makes transactions faster, but would slow down analysis. (Page 17)

Operational Data

Data which are generated as a result of day-to-day work (e.g. the entry of work orders for an electrical service company). (Page 19)

Operator

In RapidMiner, an operator is any one of more than 100 tools that can be added to a data mining stream in order to perform some function. Functions range from adding a data set, to setting an attribute’s role, to applying a modeling algorithm. Operators are connected into a stream by way of ports connected by splines. (Pages 34, 41)

Organizational Data

These are data which are collected by an organization, often in aggregate or summary format, in order to address a specific question, tell a story, or answer a specific question. They may be constructed from Operational Data, or added to through other means such as surveys, questionnaires or tests. (Page 19)

Organizational Understanding

The first step in the CRISP-DM process, usually referred to as Business Understanding, where the data miner develops an understanding of an organization’s goals, objectives, questions, and anticipated outcomes relative to data mining tasks. The data miner must understand why the data mining task is being undertaken before proceeding to gather and understand data. (Page 6)

Parameters

In RapidMiner, these are the settings that control values and thresholds that an operator will use to perform its job. These may be the attribute name and role in a Set Role operator, or the algorithm the data miner desires to use in a model operator. (Page 44)

Port

The input or output required for an operator to perform its function in RapidMiner. These are connected to one another using splines. (Page 41)

Prediction

The target, or label, or dependent attribute that is generated by a predictive model, usually for a scoring data set in a model. (Page 8)

Premise

See Antecedent. (Page 85)

Privacy

The concept describing a person’s right to be let alone; to have information about them kept away from those who should not, or do not need to, see it. A data miner must always respect and safeguard the privacy of individuals represented in the data he or she mines. (Page 20)

Professional Code of Conduct

A helpful guide or documented set of parameters by which an individual in a given profession agrees to abide. These are usually written by a board or panel of experts and adopted formally by a professional organization. (Page 234)

Query

A method of structuring a question, usually using code, that can be submitted to, interpreted, and answered by a computer. (Page 17)

Record

See Observation. (Page 16)

Relational Database

A computerized repository, comprised of entities that relate to one another through keys. The most basic and elemental entity in a relational database is the table, and tables are made up of attributes. One or more of these attributes serves as a key that can be matched (or related) to a corresponding attribute in another table, creating the relational effect which reduces data redundancy and eliminates multivalued dependencies. (Page 16)

Repository

In RapidMiner, this is the place where imported data sets are stored so that they are accessible for modeling. (Page 34)

Results Perspective

The view in RapidMiner that is seen when a model has been run. It is usually comprised of two or more tabs which show meta data, data in a spreadsheet-like view, and predictions and model outcomes (including graphical representations where applicable). (Page 41)

Role (Attribute)

In a data mining model, each attribute must be assigned a role. The role is the part the attribute plays in the model. It is usually equated to serving as an independent variable (regular), or dependent variable (label). (Page 39)

Row

See Observation. (Page 16)

Sample

A subset of an entire data set, selected randomly or in a structured way. This usually reduces a data set down, allowing models to be run faster, especially during development and proof-of-concept work on a model. (Page 49)

Scoring Data

A data set with the same attributes as a training data set in a predictive model, with the exception of the label. The training data set, with the label defined, is used to create a predictive model, and that model is then applied to a scoring data set possessing the same attributes in order to predict the label for each scoring observation. (Page 108)

Social Norms

These are the sets of behaviors and actions that are generally tolerated and found to be acceptable in a society. According to Lawrence Lessig, these are one of four methods of defining and regulating appropriate behavior. (Page 233)

Spline

In RapidMiner, these lines connect the ports between operators, creating the stream of a data mining model. (Page 41)

Standard Deviation

One of the most common statistical measures of how dispersed the values in an attribute are. This measure can help determine whether or not there are outliers (a common type of inconsistent data) in a data set. (Page 77)
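A common (assumed) rule of thumb flags values more than two standard deviations from the mean as possible outliers; the Python sketch below applies it to illustrative data:

```python
import math

def std_dev(values):
    """Sample standard deviation (sum of squared deviations over n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def outliers(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations
    from the mean (an assumed, commonly used rule of thumb)."""
    mean = sum(values) / len(values)
    sd = std_dev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]

ages = [22, 25, 24, 23, 26, 24, 120]  # 120 looks like inconsistent data
print(outliers(ages))                 # [120]
```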

Standard Operating Procedures

These are organizational guidelines that are documented and shared with employees which help to define the boundaries for appropriate and acceptable behavior in the business setting. They are usually created and formally adopted by a group of leaders in the organization, with input from key stakeholders in the organization. (Page 234)

Statistical Significance

In statistically-based data mining activities, this is the measure of whether or not the model has yielded any results that are mathematically reliable enough to be used. Any model lacking statistical significance should not be used in operational decision making. (Page 133)

Stemming

In text mining, this is the process of reducing like-terms down into a single, common token (e.g. country, countries, country’s, countryman, etc. → countr). (Page 201)
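The following deliberately naive Python sketch conveys the idea of suffix stripping; production text mining tools use full algorithms such as Porter stemming instead:

```python
def crude_stem(word):
    """A naive suffix-stripping sketch, not a real stemming algorithm:
    strip the first matching suffix if enough of the word remains."""
    for suffix in ("ies", "ied", "ing", "'s", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

print([crude_stem(w) for w in ["countries", "counted", "counting", "counts"]])
# ['countr', 'count', 'count', 'count']
```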

Stopwords

In text mining, these are small words that are necessary for grammatical correctness, but which carry little meaning or power in the message of the text being mined. These are often articles, prepositions or conjunctions, such as ‘a’, ‘the’, ‘and’, etc., and are usually removed in the Process Document operator’s sub-process. (Page 199)
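Stopword removal is a straightforward filter; this Python sketch uses a small assumed stopword list (real tools ship much longer lists):

```python
# A small assumed stopword list; real text mining tools use longer ones.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "in", "is", "to"}

def remove_stopwords(tokens):
    """Drop grammatical filler words before mining the remaining tokens."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "the service in the store is fast and friendly".split()
print(remove_stopwords(tokens))
# ['service', 'store', 'fast', 'friendly']
```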

Stream

This is the string of operators in a data mining model, connected through the operators’ ports via splines, that represents all actions that will be taken on a data set in order to mine it. (Page 41)

Structured Query Language (SQL)

The set of codes, reserved keywords and syntax defined by the American National Standards Institute used to create, manage and use relational databases. (Page 17)

Sub-process

In RapidMiner, this is a stream of operators set up to apply a series of actions to all inputs connected to the parent operator. (Page 197)

Support Percent

In an association rule data mining model, this is the percent of observations in which the antecedent and consequent are found together. Since it is calculated as the number of times the two are found together divided by the total number of times they could have been found together, the Support Percent is the same for reciprocal rules. (Page 84)
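The calculation can be sketched in a few lines of Python on illustrative market baskets; note that swapping the antecedent and consequent gives the same result:

```python
def support_percent(transactions, antecedent, consequent):
    """Times the two items occur together, divided by the total number
    of transactions -- identical for reciprocal rules."""
    together = sum(1 for t in transactions
                   if antecedent in t and consequent in t)
    return together / len(transactions)

baskets = [{"milk", "bread"}, {"milk", "eggs"},
           {"milk", "bread", "eggs"}, {"bread"}]
print(support_percent(baskets, "milk", "bread"))  # 0.5 (2 of 4 baskets)
print(support_percent(baskets, "bread", "milk"))  # 0.5, the same reciprocally
```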

Table

In data collection, a table is a grid of columns and rows, where in general, the columns are individual attributes in the data set, and the rows are observations across those attributes. Tables are the most elemental entity in relational databases. (Page 16)

Target Attribute

See Label; Dependent Variable. (Page 108)

Technology

Any tool or process invented by mankind to do or improve work. (Page 11)

Text Mining

The process of data mining unstructured text-based data such as essays, news articles, speech transcripts, etc. to discover patterns of word or phrase usage to reveal deeper or previously unrecognized meaning. (Page 190)

Token (Tokenize)

In text mining, a token is a word or term from the input document(s) that has been turned into an attribute that can be mined; tokenizing is the process of creating these tokens. (Page 197)
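A simple tokenizing pass can be sketched with a regular expression in Python (the pattern here is an illustrative choice, keeping lowercase letters and apostrophes):

```python
import re

def tokenize(document):
    """Split free text into lowercase word tokens that can become attributes."""
    return re.findall(r"[a-z']+", document.lower())

print(tokenize("Data mining, for the masses!"))
# ['data', 'mining', 'for', 'the', 'masses']
```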

Training Data

In a predictive model, this data set already has the label, or dependent variable defined, so that it can be used to create a model which can be applied to a scoring data set in order to generate predictions for the latter. (Page 108)

Tuple

See Observation. (Page 16)

Variable

See Attribute. (Page 16)

View

A type of pseudo-table in a relational database which is actually a named, stored query. This query runs against one or more tables, retrieving a defined number of attributes that can then be referenced as if they were in a table in the database. Views can limit users’ ability to see attributes to only those that are relevant and/or approved for those users to see. They can also speed up the query process because although they may contain joins, the key columns for the joins can be indexed and cached, making the view’s query run faster than it would if it were not stored as a view. Views can be useful in data mining as data miners can be given read-only access to the view, upon which they can build data mining models, without having to have broader administrative rights on the database itself. (Page 27)
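A stored view can be demonstrated with Python’s built-in sqlite3 module; the two-table schema below is an assumption made up for illustration:

```python
import sqlite3

# An in-memory example of a stored view joining two tables (schema is assumed).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Sam');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 15.0), (3, 2, 30.0);
    -- The view stores the join; a data miner can query it read-only.
    CREATE VIEW customer_totals AS
        SELECT c.name, SUM(o.total) AS spent
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name;
""")
print(db.execute("SELECT * FROM customer_totals ORDER BY name").fetchall())
# [('Ada', 35.0), ('Sam', 30.0)]
```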

About the Author

Dr. Matthew North is Associate Professor of Computing and Information Studies at Washington & Jefferson College in Washington, Pennsylvania, USA. He has taught data management and data mining for more than a decade, and previously worked in industry as a data miner, most recently at eBay.com. He continues to consult with various organizations on data mining projects as well.

Dr. North holds a Bachelor of Arts degree in Latin American History and Portuguese from Brigham Young University; a Master of Science in Business Information Systems from Utah State University; and a Doctorate in Technology Education from West Virginia University. He is the author of the book Life Lessons & Leadership (Agami Press, 2011), and numerous papers and articles on technology and pedagogy. His dissertation, on the topic of teaching models and learning styles in introductory data mining courses, earned him a New Faculty Fellows award from the Center for Advancement of Scholarship on Engineering Education (CASEE); and in 2010, he was awarded the Ben Bauman Award for Excellence by the International Association for Computer Information Systems (IACIS). He lives with his wife, Joanne, and their three daughters in southwestern Pennsylvania.

To contact Dr. North regarding this text, consulting or training opportunities, or for speaking engagements, please access this book’s companion web site at: https://sites.google.com/site/dataminingforthemasses/
