Table of contents
  1. Story
  2. SAS Exercises Slide Numbers
  3. Slides-SAS Exercises
    1. Cover Page
    2. Knowledge Base and Data Sets
    3. Distributions
      1. Charts and Plots
      2. Descriptive Statistics
      3. Tests for Normality
      4. Confidence Intervals
    4. t-Tests
      1. One Sample
      2. Paired Sample
      3. Two Sample
    5. Correlation
      1. Graphical Evaluation
      2. Pearson Correlation
      3. Spearman Correlation
    6. Linear Regression
      1. Simple
      2. Multiple
      3. Polynomial
      4. Stepwise
    7. Analysis of Variance
      1. One-Way ANOVA
      2. Two-Way ANOVA
      3. Mixed Models
      4. Repeated Measures
    8. Tests of Association
      1. Pearson Chi-Square
      2. Likelihood Ratio Chi-Square
      3. Mantel-Haenszel Chi-Square
    9. Logistic Regression
      1. Simple Binary
      2. Multiple Binary
    10. Nonparametric Analyses: Independent Samples
      1. 1 Sample
      2. 2 Sample
      3. k Samples
    11. Nonparametric Analyses: Dependent Samples
      1. 2 Sample
      2. k Samples
    12. Survival Analysis
      1. Test of Equality over Strata
      2. Comparing Survival Functions
    13. Extra
      1. Oranges
  4. Spotfire Dashboard-SAS Exercises
  5. Slides-Spotfire Tutorial
    1. Slide 1 Cover Page
    2. Slide 2 Data Relationships: Linear Regression
    3. Slide 3 Data Relationships: Spearman R
    4. Slide 4 Data Relationships: ANOVA
    5. Slide 5 Data Relationships: Kruskal-Wallis
    6. Slide 6 Data Relationships: Chi-square
    7. Slide 7 Knowledge Base & Data Sets
  6. Spotfire Dashboard-Spotfire Tutorial
  7. Take a Quick-Start Tutorial!
    1. Overview
      1. About SAS Enterprise Guide
      2. Slide Show: The Workspace
    2. Working with Data
      1. Types of Data in SAS Enterprise Guide
      2. Understanding Data Properties
      3. Slide Show: Adding Data to a Project
      4. Try It: Add Data and View Properties
    3. Using Tasks
      1. About SAS Tasks
      2. Deciding Which Task to Use
      3. Slide Show: Opening Tasks
      4. Slide Show: A Typical Task Window
      5. Try It: Create a List Report
    4. Working with Results
      1. Changing the Result Format and Style
      2. Try It: Create PDF and Export Results
  8. Distributions
    1. Charts and Plots
      1. Solve Exercises in Your Own Statistical Software
        1. Hot Dogs (Bar Chart)
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Typing Speed (Stem & Leaf Plot)
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Student Survey 1 (Histogram)
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        4. Student Survey 2 (Box Plot)
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        5. Muzzle Velocities 1 (Box Plot)
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        6. Candy Bars (Pie Chart)
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        7. Car Types (Bar Chart)
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        8. Arrests (Stem & Leaf Plot)
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        9. Cholesterol Levels (Quantiles Plot)
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Descriptive Statistics
      1. Solve Exercises in Your Own Statistical Software
        1. Football: Bench Presses and Squats
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Muzzle Velocities 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    3. Tests for Normality
      1. Solve Exercises in Your Own Statistical Software
        1. Typing Speed
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Tree Weights
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    4. Confidence Intervals
      1. Solve Exercises in Your Own Statistical Software
        1. Fat Content of Candy
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  9. t-Tests
    1. One Sample
      1. Solve Exercises in Your Own Statistical Software
        1. Physics Aptitude
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Bell Peppers
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Paired Sample
      1. Solve Exercises in Your Own Statistical Software
        1. Arrivals
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Blood Pressure 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    3. Two Sample
      1. Solve Exercises in Your Own Statistical Software
        1. Value of Owner-Occupied Homes
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  10. Correlation
    1. Graphical Evaluation
      1. Solve Exercises in Your Own Statistical Software
        1. Crime and Education in Cities
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Pearson Correlation
      1. Solve Exercises in Your Own Statistical Software
        1. Football: Height and Neck Measurements
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Student Survey 3
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Crime and Recreation in Cities
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    3. Spearman Correlation
      1. Solve Exercises in Your Own Statistical Software
        1. Top Movies
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Mail-Order Customers 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  11. Linear Regression
    1. Simple
      1. Solve Exercises in Your Own Statistical Software
        1. Student Survey 4
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Livestock Auctions 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Multiple
      1. Solve Exercises in Your Own Statistical Software
        1. Livestock Auctions 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    3. Polynomial
      1. Solve Exercises in Your Own Statistical Software
        1. Weight to Height
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Tree Weights and Trunk Girths
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    4. Stepwise
      1. Solve Exercises in Your Own Statistical Software
        1. Mass Prediction
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Corn Yield
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  12. Analysis of Variance
    1. One-Way ANOVA
      1. Solve Exercises in Your Own Statistical Software
        1. Wafers
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Blood Pressure 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Two-Way ANOVA
      1. Solve Exercises in Your Own Statistical Software
        1. Starch Out
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Treadmill Exercise
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    3. Mixed Models
      1. Solve Exercises in Your Own Statistical Software
        1. Orange Grove Irrigation
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Wanderers
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    4. Repeated Measures
      1. Solve Exercises in Your Own Statistical Software
        1. Foxes and Coyotes
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  13. Tests of Association
    1. Pearson Chi-Square
      1. Solve Exercises in Your Own Statistical Software
        1. Student Survey 5
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Colon Cancer
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Likelihood Ratio Chi-Square
      1. Solve Exercises in Your Own Statistical Software
        1. Pancreatic Cancer
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Survivors
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    3. Mantel-Haenszel Chi-Square
      1. Solve Exercises in Your Own Statistical Software
        1. Popularity 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Cotton Plants
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  14. Logistic Regression
    1. Simple Binary
      1. Solve Exercises in Your Own Statistical Software
        1. Rainy Days 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Space Shuttles
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Multiple Binary
      1. Solve Exercises in Your Own Statistical Software
        1. Rainy Days 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Mail-Order Customers 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  15. Nonparametric Analyses: Independent Samples
    1. 1 Sample
      1. Sign Test
        1. Solve Exercises in Your Own Statistical Software
        2. College Course Test Scores 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Wired 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
      2. Wilcoxon Signed Rank Test
        1. Solve Exercises in Your Own Statistical Software
        2. College Course Test Scores 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Wired 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. 2 Sample
      1. Median Test
        1. Solve Exercises in Your Own Statistical Software
        2. Hoop Ratings
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Physician Referrals 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        4. Adverse Reactions 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        5. Adverse Reactions 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
      2. Wilcoxon Rank Sum Test
        1. Solve Exercises in Your Own Statistical Software
        2. Physician Referrals 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Teen Growth
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
      3. Kruskal-Wallis Test
        1. Solve Exercises in Your Own Statistical Software
        2. Physician Referrals 3
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Adverse Reactions 3
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    3. k Samples
      1. Kruskal-Wallis Test
        1. Solve Exercises in Your Own Statistical Software
        2. Synthetic Wood Veneer
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Adverse Reactions 4
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  16. Nonparametric Analyses: Dependent Samples
    1. 2 Sample
      1. Wilcoxon Signed Rank Test
        1. Solve Exercises in Your Own Statistical Software
        2. College Course Test Scores 3
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Wired 3
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. k Samples
      1. Friedman Test
        1. Solve Exercises in Your Own Statistical Software
        2. Skin Potential for Emotions
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Adverse Reactions 5
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
      2. Spearman Correlation
        1. Solve Exercises in Your Own Statistical Software
        2. Arts and Economics in Cities
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Hoop Ratings
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  17. Survival Analysis
    1. Test of Equality over Strata
      1. Solve Exercises in Your Own Statistical Software
        1. Maintenance Therapy
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Comparing Survival Functions
      1. Solve Exercises in Your Own Statistical Software
        1. Heart Disease
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result

SAS Public Data Sets

  10. Correlation
    1. Graphical Evaluation
      1. Solve Exercises in Your Own Statistical Software
        1. Crime and Education in Cities
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Pearson Correlation
      1. Solve Exercises in Your Own Statistical Software
        1. Football: Height and Neck Measurements
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Student Survey 3
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Crime and Recreation in Cities
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    3. Spearman Correlation
      1. Solve Exercises in Your Own Statistical Software
        1. Top Movies
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Mail-Order Customers 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  11. Linear Regression
    1. Simple
      1. Solve Exercises in Your Own Statistical Software
        1. Student Survey 4
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Livestock Auctions 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Multiple
      1. Solve Exercises in Your Own Statistical Software
        1. Livestock Auctions 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    3. Polynomial
      1. Solve Exercises in Your Own Statistical Software
        1. Weight to Height
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Tree Weights and Trunk Girths
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    4. Stepwise
      1. Solve Exercises in Your Own Statistical Software
        1. Mass Prediction
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Corn Yield
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  12. Analysis of Variance
    1. One-Way ANOVA
      1. Solve Exercises in Your Own Statistical Software
        1. Wafers
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Blood Pressure 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Two-Way ANOVA
      1. Solve Exercises in Your Own Statistical Software
        1. Starch Out
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Treadmill Exercise
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    3. Mixed Models
      1. Solve Exercises in Your Own Statistical Software
        1. Orange Grove Irrigation
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Wanderers
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    4. Repeated Measures
      1. Solve Exercises in Your Own Statistical Software
        1. Foxes and Coyotes
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  13. Tests of Association
    1. Pearson Chi-Square
      1. Solve Exercises in Your Own Statistical Software
        1. Student Survey 5
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Colon Cancer
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Likelihood Ratio Chi-Square
      1. Solve Exercises in Your Own Statistical Software
        1. Pancreatic Cancer
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Survivors
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    3. Mantel-Haenszel Chi-Square
      1. Solve Exercises in Your Own Statistical Software
        1. Popularity 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Cotton Plants
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  14. Logistic Regression
    1. Simple Binary
      1. Solve Exercises in Your Own Statistical Software
        1. Rainy Days 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Space Shuttles
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Multiple Binary
      1. Solve Exercises in Your Own Statistical Software
        1. Rainy Days 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        2. Mail Order Customers 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  15. Nonparametric Analyses: Independent Samples
    1. 1 Sample
      1. Sign Test
        1. Solve Exercises in Your Own Statistical Software
        2. College Course Test Scores 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Wired 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
      2. Wilcoxon Signed Rank Test
        1. Solve Exercises in Your Own Statistical Software
        2. College Course Test Scores 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Wired 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. 2 Sample
      1. Median Test
        1. Solve Exercises in Your Own Statistical Software
        2. Hoop Ratings
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Physician Referrals 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        4. Adverse Reactions 1
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        5. Adverse Reactions 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
      2. Wilcoxon Rank Sum Test
        1. Solve Exercises in Your Own Statistical Software
        2. Physician Referrals 2
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Teen Growth
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
      3. Kruskal-Wallis Test
        1. Solve Exercises in Your Own Statistical Software
        2. Physician Referrals 3
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Adverse Reactions 3
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    3. k Samples
      1. Kruskal-Wallis Test
        1. Solve Exercises in Your Own Statistical Software
        2. Synthetic Wood Veneer
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Adverse Reactions 4
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  16. Nonparametric Analyses: Dependent Samples
    1. 2 Sample
      1. Wilcoxon Signed Rank Test
        1. Solve Exercises in Your Own Statistical Software
        2. College Course Test Scores 3
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Wired 3
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. k Samples
      1. Friedman Test
        1. Solve Exercises in Your Own Statistical Software
        2. Skin Potential for Emotions
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Adverse Reactions 5
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
      2. Spearman Correlation
        1. Solve Exercises in Your Own Statistical Software
        2. Arts and Economics in Cities
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
        3. Hoop Ratings
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
  17. Survival Analysis
    1. Test of Equality over Strata
      1. Solve Exercises in Your Own Statistical Software
        1. Maintenance Therapy
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result
    2. Comparing Survival Functions
      1. Solve Exercises in Your Own Statistical Software
        1. Heart Disease
          1. Problem
          2. Sample Data
          3. Source of Data
          4. Result

Story

SAS Public Data Sets as Big Data

A recent Information Week article caught my attention with: Big Data Professors Want Your Data Sets - Access to large, relevant data sets is the biggest problem facing business intelligence and data analytics instructors and students, new survey finds.

That article did not provide a solution, so I wanted to provide the simulated SAS data sets in the SAS Visual Analytics online demo to the DC Data Community Sandbox I am building here. However, I was told that the ability to download or screen scrape those data sets had recently been disabled to protect the investment made in them. The SAS support folks I spoke with said they would see about making them public. In my previous analysis of SAS Visual Analytics, I was able to download and reuse a small sample of the Toy Store data set.

In searching for other data sets, I found that the SAS Online Resource For Statistics Education provides three functions:

The third option says: In this section, you can work through statistics exercises using any statistical software. We've provided sample data and solutions for each exercise, which have been evaluated by a team of experts. Select an analysis from the list at left to see the available exercises.

I organized the list of 10 topics (below) and 30 subtopics (see Research Notes in the Wiki below) to facilitate use of the data sets in Spotfire to recreate the exercises.

  • Distributions
  • t-Tests
  • Correlation
  • Linear Regression
  • Analysis of Variance
  • Tests of Association
  • Logistic Regression
  • Nonparametric Analyses: Independent Samples
  • Nonparametric Analyses: Dependent Samples
  • Survival Analysis

I found that Spotfire provides most of the same statistical functions as SAS, but without programming, and allows a data scientist to be very creative in exploring the SAS Education data sets. My previous work with SAS in Excel and Spotfire is provided elsewhere.

I recreated about half the 30 SAS Online Exercises below using 29 of the 48 data sets. Then I decided to use the SAS Online Exercise data sets (48) in the Spotfire Users Guide to create a SAS Data Sets - Spotfire Tutorial that links to my Spotfire Users Guide Knowledge Base.

The results are shown below in the Slides-Spotfire Tutorial and the Spotfire Dashboard-Spotfire Tutorial for Data Relationships as follows:

Next I want to test using the SAS Public Data Sets - Spotfire Tutorial on more than one data set to see if it is a good template for students to use for the other 47 data sets.

Then I want to do a SAS Public Data Sets - Spotfire Tutorial that uses the other Spotfire tools as follows:

Then I will be able to say whether Spotfire is able to do what SAS does or more.

IN PROCESS

SAS Exercises Slide Numbers

Data Source: http://support.sas.com/learn/statlib.../top_exer1.htm

Slide Numbers

1 Cover Page

Distributions
2 Charts and Plots
3 Descriptive Statistics
4 Tests for Normality
5 Confidence Intervals

t-Tests
6 One Sample
7 Paired Sample
8 Two Sample

Correlation
9 Graphical Evaluation
10 Pearson Correlation
11 Spearman Correlation

Linear Regression
12 Simple
13 Multiple
14 Polynomial
15 Stepwise

Analysis of Variance
16 One-Way ANOVA
17 Two-Way ANOVA
18 Mixed Models
19 Repeated Measures

Tests of Association
20 Pearson Chi-Square
21 Likelihood Ratio Chi-Square
22 Mantel-Haenszel Chi-Square

Logistic Regression
23 Simple Binary
24 Multiple Binary

Nonparametric Analyses: Independent Samples
25 1 Sample
26 2 Sample
27 k Samples

Nonparametric Analyses: Dependent Samples
28 2 Sample
29 k Samples

Survival Analysis
30 Test of Equality over Strata
31 Comparing Survival Functions

Slides-SAS Exercises

Cover Page

SASPublicDataSetsAsBigDataSlide1.png

Knowledge Base and Data Sets

SASPublicDataSetsAsBigDataSlide1A.png

 

Distributions

Charts and Plots

SASPublicDataSetsAsBigDataSlide2.png


Descriptive Statistics

SASPublicDataSetsAsBigDataSlide3.png


Tests for Normality

SASPublicDataSetsAsBigDataSlide4&5.png


Confidence Intervals

SASPublicDataSetsAsBigDataSlide4&5.png

t-Tests

One Sample

SASPublicDataSetsAsBigDataSlide6&7&8.png


Paired Sample

SASPublicDataSetsAsBigDataSlide6&7&8.png


Two Sample

SASPublicDataSetsAsBigDataSlide6&7&8.png

Correlation

Graphical Evaluation

SASPublicDataSetsAsBigDataSlide9.png


Pearson Correlation


Spearman Correlation

SASPublicDataSetsAsBigDataSlide11A.png

 

SASPublicDataSetsAsBigDataSlide11B.png

Linear Regression

Simple

SASPublicDataSetsAsBigDataSlide12.png


Multiple

SASPublicDataSetsAsBigDataSlide13.png


Polynomial

SASPublicDataSetsAsBigDataSlide14.png


Stepwise

SASPublicDataSetsAsBigDataSlide15.png

 

Analysis of Variance

One-Way ANOVA


Two-Way ANOVA


Mixed Models


Repeated Measures

 

Tests of Association

Pearson Chi-Square

SASPublicDataSetsAsBigDataSlide20.png


Likelihood Ratio Chi-Square


Mantel-Haenszel Chi-Square

 

Logistic Regression

Simple Binary

SASPublicDataSetsAsBigDataSlide23.png


Multiple Binary

 

Nonparametric Analyses: Independent Samples

1 Sample


2 Sample


k Samples

 

Nonparametric Analyses: Dependent Samples

2 Sample


k Samples

 

Survival Analysis

Test of Equality over Strata


Comparing Survival Functions

SASPublicDataSetsAsBigDataSlide31.png

Extra

Oranges

SASPublicDataSetsAsBigDataSlide32.png

Spotfire Dashboard-SAS Exercises

For Internet Explorer Users and Those Wanting Full Screen Display Use: Web Player Get Spotfire for iPad App

Slides-Spotfire Tutorial

Slide 1 Cover Page

SASPublicDataSets-SpotfireTutorialSlide1.png

Slide 2 Data Relationships: Linear Regression

SASPublicDataSets-SpotfireTutorialSlide2.png

Slide 3 Data Relationships: Spearman R

SASPublicDataSets-SpotfireTutorialSlide3.png

Slide 4 Data Relationships: ANOVA

SASPublicDataSets-SpotfireTutorialSlide4.png

Slide 5 Data Relationships: Kruskal-Wallis

SASPublicDataSets-SpotfireTutorialSlide5.png

Slide 6 Data Relationships: Chi-square

SASPublicDataSets-SpotfireTutorialSlide6.png

Slide 7 Knowledge Base & Data Sets

SASPublicDataSets-SpotfireTutorialSlide7.png

Spotfire Dashboard-Spotfire Tutorial

For Internet Explorer Users and Those Wanting Full Screen Display Use: Web Player Get Spotfire for iPad App

Take a Quick-Start Tutorial!

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/tut_index.htm

Learn the basics in minutes!

To learn about SAS Enterprise Guide, select a topic below.

My Note: This was done in 2009 and the Slide Show and Try It features no longer seem to work.

Overview

About SAS Enterprise Guide

Source: http://support.sas.com/learn/statlib...g4.2/tut_1.htm

You might have heard about SAS software and thought that you had to be a programmer to use it. Well, you don't! SAS Enterprise Guide has ready-to-use tasks that you can use to create detail, summary, and graphical reports, and to perform statistical analyses. When you use these tasks, SAS Enterprise Guide writes the SAS program for you and sends it to SAS. SAS runs the program and sends the formatted results back to SAS Enterprise Guide. If you need to manipulate your data, SAS Enterprise Guide also has a point-and-click interface that enables you to perform queries without doing any programming.

SAS Enterprise Guide also makes it easy to present the results of the tasks that you perform. You can create results in several different formats, such as HTML, PDF, and RTF, which you can paste into a Microsoft Word or Microsoft PowerPoint document.

 

Slide Show: The Workspace

Source: http://support.sas.com/learn/statlib...g4.2/tut_2.htm

Working with Data

Types of Data in SAS Enterprise Guide

Source: http://support.sas.com/learn/statlib...g4.2/tut_3.htm

Before you can use tasks, you need to add the data that you want to analyze to your project. In addition to SAS data files, SAS Enterprise Guide can read most PC data files such as Microsoft Access, dBASE, Microsoft Excel, Microsoft Exchange, IBM Lotus 1-2-3, Paradox, and HTML files. You can open data that is located on your own computer or on any server that you are authorized to access.

When you open data, a shortcut to the data is automatically added to the project and the data opens in a data grid. By default, the data opens in read-only mode. In the Process Flow window that is shown below, there are shortcuts to a SAS data set, an Excel data file, and a text file. The text file has been imported to create a SAS data set.

 

 

 

 

In addition to opening existing data, you can also work with data in the following ways:

  • use the Import Data wizard to create SAS data sets from raw data files and Excel files
  • use the New Data wizard to create a new SAS data file
  • use the Query Builder to manipulate your data (for example, filter, join, add columns, and sort)
  • change your data to update mode and make changes directly to data in the data grid
  • run tasks from the Data menu

Understanding Data Properties

Source: http://support.sas.com/learn/statlib...g4.2/tut_4.htm

Before you start working with data, it's helpful to know something about what your data contains. You can explore your data by viewing it in the data grid or by viewing the properties of the data. You can view the properties of data by right-clicking the data object in the Project Explorer or Process Flow window and selecting Properties from the pop-up menu.

One of the important things to know about your data is what type of data each column (or variable) contains. SAS reads the data in each column as either character data or numeric data. If the data in a column contains letters, it is character data. If the data in a column contains numbers, it can be either character or numeric. Numeric data is grouped into four different types of data depending on how it is displayed. The table below shows the icon that is associated with each type of data. These icons appear in the column headings of the data grid.
 

Variable Type  Icon                   Data Type
Character      character data symbol  character
Numeric        numeric data symbol    numeric
Numeric        date data symbol       date
Numeric        time data symbol       time
Numeric        currency data symbol   currency

 

 

You also see these icons when you run a task. The icons help you know what things you can do with the columns, or variables, in your task.

Slide Show: Adding Data to a Project

Source: http://support.sas.com/learn/statlib...g4.2/tut_5.htm

Try It: Add Data and View Properties

Source: http://support.sas.com/learn/statlib...g4.2/tut_6.htm

Using Tasks

About SAS Tasks

Source: http://support.sas.com/learn/statlib...g4.2/tut_7.htm

After you have data in your project, you generally want to work with it in some way. In SAS Enterprise Guide, you use tasks to do everything from manipulating data, to running specific analytical procedures, to creating reports. Tasks are based on SAS programming procedures, and as you make selections in a task window, SAS Enterprise Guide writes a program to send to SAS.

Each task has its own window. For example, if you want to create a bar chart, you select the data that you want to use, and then select the Bar Chart task. Some tasks have wizard versions that simplify the task even more for you.

 

 

Task windows in SAS Enterprise Guide

 
In each task window, there are certain steps that you must complete before you can run the task. For example, you must specify which variables you want to analyze and how you want to analyze them. After that, you can select from a variety of options that pertain to the particular task. The most common options for each task are selected for you, so after you've specified the information that is necessary to run the task, the Run button becomes available and you can run the task and get the default results. 


Deciding Which Task to Use

Source: http://support.sas.com/learn/statlib...g4.2/tut_8.htm

You might be wondering how you know which task to use. Most of the time the name of the task will be enough to help you choose, but if you aren't sure, you can view a description and business example for each SAS task in the What is window. You can open this window by selecting View, then What is. As you move your mouse pointer over a task in the Task List, a description of that task appears in the What is window. The image below shows the What is window with a description for the Summary Tables task.

 

 



 
Note: The What is window opens on the right side of the workspace, but you can undock it by dragging its title bar out of the workspace and positioning it wherever you want.

If you need help using a task, you can press F1 to open the Help window with specific help for the task. In addition, you can view a short description of each option that you can select in a task by moving your mouse pointer over the option and reading the description in the help pane at the bottom of the task window.


Slide Show: Opening Tasks

Source: http://support.sas.com/learn/statlib...g4.2/tut_9.htm


Slide Show: A Typical Task Window

Source: http://support.sas.com/learn/statlib...4.2/tut_10.htm


Try It: Create a List Report

Source: http://support.sas.com/learn/statlib...4.2/tut_11.htm


Working with Results

Changing the Result Format and Style

Source: http://support.sas.com/learn/statlib...4.2/tut_12.htm

The default report type that SAS Enterprise Guide creates is an HTML report that uses the EGDefault style. Suppose you want to create an HTML report that uses a different style and you also want to generate the same report in PDF format. You don't have to start over to do this; you just need to change the style of the HTML results and select an additional report format to generate.

You can do this for all the tasks that you run by using the Options window. When you make changes in the Options window, you change the default settings. You can also override the default settings for an individual task by changing options in the Properties window for the task.

 

 

Options and Properties windows 

 

When you run a task, SAS Enterprise Guide creates HTML output that is part of a SAS Enterprise Guide project. If you want to make this HTML report available on the Web, you can export the HTML file to a location on your computer or on a server. You can also automatically export the file each time you run the project so that your reports are based on the most up-to-date data. 


Try It: Create PDF and Export Results

Source: http://support.sas.com/learn/statlib...4.2/tut_13.htm

Distributions

Charts and Plots

Source: http://support.sas.com/learn/statlib...eg_dist_1a.htm

Problem: Create box plots to display the distributions of the muzzle velocities of cartridges made from types of gunpowder.

It seems that the use of powder 2 generally results in a higher muzzle velocity, because the overall distribution of its values is higher.

Problem: Create a pie chart to show the distribution of car type for data on various car models.

It appears that the medium car type occurred most frequently in the sample. For small differences such as this, you may need to refer to the labels.

Problem: Create a bar chart to show the distribution of car type for data on various car models.

It appears that the medium car type occurred most frequently in the sample.

Problem: Create a stem plot for the total number of arrests in the U.S. over a specified time period.

To get the actual value in the data set represented by each stem-leaf combination, multiply these pairs by 1,000,000.

Problem: Create a normal quantile plot to examine the normality of cholesterol levels.

The points on the q-q plot do not display an overall strong departure from a linear pattern, so we can conclude that the total cholesterol levels could be normally distributed.
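The coordinates behind a normal quantile plot can be computed by hand, which makes the "linear pattern" judgment above easier to follow. The following is a minimal Python sketch, not the SAS procedure used in the exercise. The cholesterol readings are invented for illustration, and the plotting position (i - 0.5) / n is one common convention, assumed here.

```python
from statistics import NormalDist

# Invented cholesterol readings (mg/dL) for illustration; NOT the exercise data.
levels = [172, 185, 190, 198, 204, 211, 215, 223, 230, 248]

def normal_qq_points(values):
    """Pair each sorted value with the standard normal quantile at (i - 0.5) / n."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(values)))

points = normal_qq_points(levels)
for quantile, value in points:
    print(f"{quantile:6.2f}  {value}")
```

If the printed pairs fall roughly on a straight line when plotted, the data are compatible with a normal distribution; strong curvature at either end suggests skewness or heavy tails.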

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/dist_1.htm#KB1

Hot Dogs (Bar Chart)

Create a bar chart to display the distribution of hot dog type.

Problem

An investigation of the taste and nutritional content of three types of hot dogs was carried out. The types of franks included in the study were categorized as beef, meat, and poultry.

Construct a bar chart of hot dog type and use it to compare the type frequencies.

Sample Data

The Hot_dogs data set is from a study on the taste and nutritional content of hot dogs. Fifty-four brands of hot dogs were included. Information was collected on the type of hot dog, a description of the taste, weight (ounces), protein content, calories, sodium, and protein fat.

These are the variables in the data set:

Name          Type  Description
Product_Name  char  brand of hot dog
Type          char  type of hot dog (beef, meat, poultry)
Taste         char  description of taste (bland, medium, scrumptious)
_oz           num   weight of frank (ounces)
_lb_Protein   num   protein content
Calories      num   caloric content
Sodium        num   sodium content
Protein_Fat   num   protein from fat

Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

Using the variable Type as the categorical variable to chart for this problem, the horizontal axis of the bar chart is divided into three classes, one for each hot dog type. The vertical axis indicates the frequency (count) of the hot dog types. Vertical bars are constructed for each class, with the height of each bar representing the class frequency. The meat and poultry types had the same hot dog count (17), while the beef type was the most common in the study (20 hot dogs).
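The frequency count behind this bar chart can be reproduced in a few lines of Python. This is only a sketch using the standard library, not the SAS or Spotfire workflow. The Type values are rebuilt from the counts reported in the result rather than read from the Hot_dogs file, and the function names are hypothetical.

```python
from collections import Counter

# Type values rebuilt from the reported counts (20 beef, 17 meat, 17 poultry);
# the real Hot_dogs data file is not read here.
types = ["beef"] * 20 + ["meat"] * 17 + ["poultry"] * 17

def frequency_table(values):
    """Return {category: count}, sorted by category name."""
    return dict(sorted(Counter(values).items()))

def text_bar_chart(freq):
    """Render one line per category; bar length equals the class frequency."""
    width = max(len(name) for name in freq)
    return "\n".join(f"{name:<{width}} | {'#' * n} {n}" for name, n in freq.items())

freq = frequency_table(types)
print(text_bar_chart(freq))
```

The three counts sum to 54, which matches the number of brands stated in the sample data description.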

Typing Speed (Stem & Leaf Plot)

Create a stem and leaf plot to display the distribution of typing speeds.

Problem

Three brands of typewriters were tested for typing speed by having expert typists type identical passages of text. 

Construct a stem and leaf plot to examine the distribution of typing speeds. Would you describe the shape of the distribution as symmetric, skewed to the left, or skewed to the right?

Sample Data

The Typing_data data set contains the results of a test for typing speed on three different brands of typewriters. The speeds were recorded as words typed per minute.

These are the variables in the data set:

Name   Type  Description
brand  char  brand of typewriter
speed  num   typing speed (words per minute)

Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

Stem Leaf
6 12
6 668
7 001223
7 779
8 01
8 7

The distribution of typing speeds appears to be skewed to the right.
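As a rough sketch of how such a display is built, the Python below regenerates the split-stem plot from the 17 speeds that can be read back from it (stem = tens digit, leaf = units digit). The function name is hypothetical. Comparing the mean with the median is an informal check added here, not part of the original exercise.

```python
from statistics import mean, median

# Speeds read back from the stem-and-leaf display above.
speeds = [61, 62, 66, 66, 68, 70, 70, 71, 72, 72, 73, 77, 77, 79, 80, 81, 87]

def split_stem_and_leaf(values):
    """Group values by tens digit, with separate rows for leaves 0-4 and 5-9."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        key = (stem, leaf >= 5)  # False = low half of the stem, True = high half
        rows.setdefault(key, []).append(str(leaf))
    # Rows with no values are simply omitted in this sketch.
    return [f"{stem} {''.join(leaves)}" for (stem, _), leaves in sorted(rows.items())]

plot = split_stem_and_leaf(speeds)
print("\n".join(plot))
print("mean > median:", mean(speeds) > median(speeds))
```

A mean above the median is consistent with a right-skewed shape, although with only 17 observations the difference is small and should be read as a hint rather than a test.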

Student Survey 1 (Histogram)

Investigate characteristics of college students using survey results.

Problem

To determine the characteristics of the students in their large introductory statistics class, a group of college professors administered a survey to the students in their classes. Some of the questions asked include: the student’s age in years, height (in inches), shoe size, and high school GPA.

Create a histogram of the age of these students in years. 

a. What shape does this histogram take on?

b. Where is the main cluster of the students’ ages?

c. How old is the oldest student in this class (approximately)?

d. Examine the shape of this histogram. Give an explanation in layman’s terms for why this histogram takes on the shape that it does.

Sample Data

The Survey data set is the result of a survey administered to a large introductory statistics class. The data set contains answers from 485 participants. Not all questions are answered, resulting in missing data. Missing data is indicated by a period.
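When the data are loaded outside SAS, the period used as a missing-value marker has to be handled explicitly. The following Python sketch shows one way to do that with the standard library. The three rows are invented for illustration, the comma-separated layout is an assumption about how the file might be exported, and only the column names Gender, Age, and Height come from the data set's variable list.

```python
import csv
import io

# Hypothetical three-row excerpt; the values are invented for illustration.
raw = """Gender,Age,Height
female,19,64
male,.,70
female,21,.
"""

NUMERIC_COLUMNS = {"Age", "Height"}

def parse_row(row):
    """Convert numeric fields to float, mapping the '.' missing marker to None."""
    parsed = {}
    for name, value in row.items():
        if name in NUMERIC_COLUMNS:
            parsed[name] = None if value.strip() == "." else float(value)
        else:
            parsed[name] = value
    return parsed

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(raw))]
ages = [r["Age"] for r in rows if r["Age"] is not None]
print(len(rows), "rows;", len(ages), "with a recorded age")
```

Filtering out None before charting or summarizing keeps missing answers from being counted as zeros.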

These are the variables in the data set:

Name       Type  Description
Gender     char  gender (male or female) of the respondent
Age        num   age of the subject in years
Textbook   num   answer to the question “How much did you spend for textbooks this term (to nearest dollar)?”
Cigs       num   answer to the question “How many cigarettes did you smoke yesterday?”
ColGPA     num   answer to the question “What is your cumulative Grade Point Average at this institution?”
HSGPA      num   answer to the question “What was your cumulative high school GPA (4 point scale)?”
Height     num   height of the respondent in inches
Mateh      num   the height of the respondent’s “ideal mate”
Shoe       num   the respondent’s shoe size
Breakfast  char  answer to the question “Did you eat breakfast this morning?”
Flight     char  answer to the question “Did you fly on a commercial airline during the past 30 days? (Yes or No)”
Play       char  answer to the question “Have you seen a play in a live theater in the past 6 months? Yes or No”
Vote       char  answer to the question “Are you registered to vote? Yes or No”
Credit     num   answer to the question “How many credit hours are you taking this term?”

Source of Data

This data was collected by Roger Woodard of North Carolina State University in 2005.

Result

a. The histogram is skewed to the right. 

b. The main cluster of ages is in the 19 to 22 year age range.

c. The oldest student is about 60 years of age. 

d. This histogram is skewed to the right. This is because the main cluster of students represents traditional students who have gone to college immediately after high school. The age of college students is limited on the low end by the practical limitation of attending college. It is unreasonable to expect many individuals under age 16 to attend college. However, there is no age limit on attending college, so the long right tail of the distribution represents students who have taken a few years longer to complete college, non-traditional students, and lifelong learners.
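The binning behind a histogram like this one can be sketched in Python as follows. The ages are invented to mimic the shape described in the answers (a cluster around 19 to 22 and a long right tail); they are not values from the Survey data set. The bin origin and width are arbitrary choices made for the illustration.

```python
from collections import Counter

# Invented ages for illustration only; NOT values from the Survey data set.
ages = [18, 19, 19, 19, 20, 20, 20, 21, 21, 22, 22, 24, 27, 33, 45, 60]

START, WIDTH = 15, 4  # hypothetical bin origin and bin width, in years

def histogram(values, start, width):
    """Count values in bins [start + k*width, start + (k+1)*width)."""
    counts = Counter((v - start) // width for v in values)
    # Include empty bins so gaps in the long right tail stay visible.
    return {start + k * width: counts[k] for k in range(max(counts) + 1)}

bins = histogram(ages, START, WIDTH)
for left, n in bins.items():
    print(f"{left}-{left + WIDTH - 1}: {'#' * n}")
```

The tallest bar sits at the low end of the axis while a few isolated bars trail off toward higher ages, which is the right-skewed pattern the answers describe.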

Student Survey 2 (Box Plot)

Investigate characteristics of college students using survey results.

Problem

To determine the characteristics of the students in their large introductory statistics class, a group of college professors administered a survey to the students in their classes. Some of the questions asked include: the student’s age in years, height (in inches), shoe size and high school GPA.

Create box plots of the heights of these students separated by gender. Compare the shape, location and spread of the height of the groups.

Sample Data

The Survey data set is the result of a survey administered to a large introductory statistics class. The data set contains answers from 485 participants. Not all questions are answered, resulting in missing data. Missing data is indicated by a period.

These are the variables in the data set:

Name Type Description
Gender char gender (male or female) of the respondent
Age num age of the subject in years
Textbook num answer to the question “How much did you spend for textbooks this term (to nearest dollar)?”
Cigs num answer to the question “How many cigarettes did you smoke yesterday?”
ColGPA num answer to the question “What is your cumulative Grade Point Average at this institution?”
HSGPA num answer to the question “What was your cumulative high school GPA (4 point scale)?”
Height num height of the respondent in inches
Mateh num the height of the respondent’s “ideal mate”
Shoe num the respondent’s shoe size
Breakfast char answer to the question “Did you eat breakfast this morning?”
Flight char answer to the question “Did you fly on a commercial airline during the past 30 days? (Yes or No)”
Play char answer to the question “Have you seen a play in a live theater in the past 6 months? Yes or No”
Vote char answer to the question “Are you registered to vote? Yes or No”
Credit num answer to the question “How many credit hours are you taking this term?”
Source of Data

This data was collected by Roger Woodard of North Carolina State University in 2005.

Result

Shape: For both genders the height is approximately symmetric. We notice that the males are centered around 71 inches, while the females are centered at about 65 inches. In terms of spread we notice that the males are more spread out than the females; however, the spread of the middle 50% (as represented by the box of the boxplot) is about the same for both males and females.

Muzzle Velocities 1 (Box Plot)

Display the distributions of muzzle velocities of cartridges made from different types of gunpowder.

Problem

A government bureau in charge of assessing the efficiency of firearms for law enforcement agencies performed an experiment where the muzzle velocities of cartridges made from two types of gunpowder were recorded. The same type of firearm and cartridge was used for both types of gunpowder in the study.

Create box plots for each gunpowder type. Then, based on a comparison of the distributions, decide which gunpowder seems generally to result in a higher muzzle velocity.

Sample Data

The Bullets data set contains data that was collected to determine whether there is a difference in the muzzle velocity of cartridges made from two types of gunpowder.

These are the variables in the data set:

Name Type Description
powder num type of gunpowder
velocity num muzzle velocity
Source of Data

This data is sample data from SAS Institute Inc.

Result

It seems that the use of powder 2 generally results in a higher muzzle velocity, because the overall distribution of its values is higher.

Candy Bars (Pie Chart)

Show the distribution of candy bar brands in data collected to study daily diet.

Problem

The United States Department of Agriculture (USDA) and the Department of Health and Human Services (HHS) have suggested that a daily diet should consist of an appropriate number of calories (among other things), of which 30% or fewer should be calories from fat. For adults consuming 2000 calories per day, this works out to 65 grams of fat or less.

A sample of various candy bars and non-bar candies (such as M&Ms, Skittles, etc.) was collected, and nutritional facts from each candy were recorded including total fat and saturated fat (measured in grams).

Create a pie chart to display the number of candy bars from each brand that were included in the sample.

Sample Data

The Candy data set contains nutritional facts about candy bars and non-bar candies such as M&Ms, Reese's Pieces, Skittles, and Super Hot Tamales.

These are the variables in the data set:

Name Type Description
Brand char brand of candy
Name char name of candy
Serving_pkg num servings per package
Oz_pkg num ounces per package
Calories num calories
Total_fat_g num total fat content in grams
Saturated_fat_g num saturated fat content in grams
Cholesterol_g num cholesterol content in grams
Sodium_mg num sodium content in milligrams
Carbohydrate_g num carbohydrate content in grams
Dietary_fiber_g num dietary fiber content in grams
Sugars_g num sugars content in grams
Protein_g num protein content in grams
Vitamin_A_RDI num vitamin A content as a percentage of the RDI (Reference Daily Intake)
Vitamin_C_RDI num vitamin C content as a percentage of the RDI (Reference Daily Intake)
Calcium_RDI num calcium content as a percentage of the RDI (Reference Daily Intake)
Iron_RDI num iron content as a percentage of the RDI (Reference Daily Intake)
Source of Data

This data is sample data from SAS Institute Inc.

Result

It appears that most of the candy bars in the sample were of the Hershey brand.

Car Types (Bar Chart)

Show the distribution of car type for data on various car models.

Problem

A set of data was collected on 116 cars from different countries, containing information such as weight, gas tank size, turning radius, horsepower and engine displacement.

Create a bar chart to display the distribution of car type (compact, sporty, small, medium, and large).

Sample Data

The Cars data set contains data about cars from different countries.

These are the variables in the data set:

Name Type Description
Model char model
Country char country
Type char type (Compact, Small, Medium, Large, Sporty)
Weight num weight
TurningRadius num turning radius
Displacement num engine displacement
Horsepower num horsepower
GasTank num capacity of gas tank
Source of Data

This data is sample data from SAS Institute Inc.

Result

It appears that the medium car type occurred most frequently in the sample.
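
The tally behind a bar chart (or a pie chart) is a count of observations per category. The following Python sketch illustrates that step only; the list of car types is made up for illustration and is not the Cars data set.

```python
from collections import Counter

# Hypothetical car types; the real Cars data set has 116 rows.
car_types = ["Medium", "Small", "Medium", "Compact", "Sporty", "Large", "Medium", "Small"]

# Count how often each category occurs.
counts = Counter(car_types)
most_common_type, most_common_count = counts.most_common(1)[0]

# Print categories from most to least frequent, as a bar chart would rank them.
for car_type, n in counts.most_common():
    print(f"{car_type:8s} {n}")
```

The same counts would feed either chart type; only the display differs.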

Arrests (Stem & Leaf Plot)

Plot the total number of arrests in the U.S. over a specified time period.

Problem

A division of the U.S. Department of Justice collected data on the total number of arrests in the United States from the year 1970 to 1999. Variables in the data set include year, total number of arrests, total number of arrests by age group, total population of the U.S. on July 1 of the given year, etc.

Use a stem plot to display the distribution of total number of arrests in the U.S. from 1970 to 1999.

Sample Data

The Totarrests data set contains data about the total number of arrests in the United States from the year 1970 to 1999.

These are the variables in the data set:

Name Type Description
Year num year of arrest
TotalArrests num total number of arrests
AGE1 num number of arrests for age group 1
AGE2 num number of arrests for age group 2
AGE3 num number of arrests for age group 3
AGE4 num number of arrests for age group 4
AGE5 num number of arrests for age group 5
ArrestRate num total arrests per 100 thousand in population
AGE1rate num arrests per 100 thousand in population for age group 1
AGE2rate num arrests per 100 thousand in population for age group 2
AGE3rate num arrests per 100 thousand in population for age group 3
AGE4rate num arrests per 100 thousand in population for age group 4
AGE5rate num arrests per 100 thousand in population for age group 5
Population num total population of the U.S. on July 1 of the given year
AGE1pop num population of the U.S. in age group 1 on July 1 of the given year
AGE2pop num population of the U.S. in age group 2 on July 1 of the given year
AGE3pop num population of the U.S. in age group 3 on July 1 of the given year
AGE4pop num population of the U.S. in age group 4 on July 1 of the given year
AGE5pop num population of the U.S. in age group 5 on July 1 of the given year
Source of Data

This data is sample data from SAS Institute Inc.

Result

This is the stem plot:

Stem Leaf
   15 123
   14 00122356
   13 8
   12 157
   11 679
   10 22348
    9 0136
    8 167

To get the approximate value in the data set represented by each stem-leaf combination, read the pair as stem.leaf and multiply by 1,000,000. For example, stem 15 with leaf 1 represents roughly 15.1 million arrests.
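
As a rough illustration of how the plot is decoded, the Python sketch below rebuilds approximate values from the stems and leaves shown above. It assumes the usual convention that each pair is read as stem.leaf before scaling; the function name decode is only illustrative.

```python
# Stems and leaves copied from the stem plot above.
stem_plot = {
    15: "123",
    14: "00122356",
    13: "8",
    12: "157",
    11: "679",
    10: "22348",
    9: "0136",
    8: "167",
}

def decode(stem, leaf, multiplier=1_000_000):
    """Read the pair as stem.leaf and scale it (values are rounded to the leaf digit)."""
    return (stem * 10 + int(leaf)) * multiplier // 10

values = sorted(
    decode(stem, leaf) for stem, leaves in stem_plot.items() for leaf in leaves
)
print(len(values), values[0], values[-1])
```

The plot holds one leaf per year, so the decoded list has 30 entries, matching the years 1970 through 1999.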

Cholesterol Levels (Quantiles Plot)

Examine the normality of cholesterol levels.

Problem

In a study to investigate the relationships between various factors and heart disease, blood lipid screenings were conducted on a group of patients. Three months after an initial screening, data was collected from a second screening that included information such as gender, age, weight, total cholesterol, and history of heart disease.

Use a normal q-q plot to inspect the normality of total cholesterol level before we proceed with a simple linear regression with this variable as our response.

Sample Data

The Lipid data set contains data about blood lipid screenings and patient history.

These are the variables in the data set:

Name Type Description
Name char name
Gender char gender
Age num age
Weight num weight at first screening
Cholesterol num total cholesterol level at first screening
Triglycerides num triglycerides level at first screening
HDL num HDL level at first screening
LDL num LDL level at first screening
PercentIdeal num percentage of ideal weight at first screening
Height num height
Skinfold num skinfold measurement
SystolicBP num systolic blood pressure
DiastolicBP num diastolic blood pressure
Weight3 num weight at 3-month screening
PercentIdeal3 num percentage of ideal weight at 3-month screening
Triglyceride3 num triglycerides level at 3-month screening
Cholesterol3 num total cholesterol level at 3-month screening
HDL3 num HDL level at 3-month screening
LDL3 num LDL level at 3-month screening
Exercise num exercise
Coffee num coffee consumption (cups per day)
Smoking char smoking behavior (none, quit, cigar, pipes, cigarettes)
Alcohol char alcohol consumption (number of drinks per day)
HeartDisease char history of heart disease
CholesterolLoss num reduction in cholesterol level between first and 3-month screening
Source of Data

This data is sample data from SAS Institute Inc.

Result

The points on the q-q plot do not display an overall strong departure from a linear pattern, so we can conclude that the total cholesterol levels could be normally distributed.
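
For readers working outside SAS Enterprise Guide, the points of a normal q-q plot can be computed by hand. The sketch below is one way to do it in Python; the plotting positions (i - 0.5) / n are a common choice rather than necessarily the one SAS uses, and the cholesterol readings are invented for illustration.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Pair each sorted observation with a standard normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are one common convention among several.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical cholesterol readings, not taken from the Lipid data set.
cholesterol = [182, 195, 201, 176, 210, 188, 223, 199, 205, 191]
points = normal_qq_points(cholesterol)
for q, x in points:
    print(f"{q:6.2f} {x}")
```

Plotting the observed values against the theoretical quantiles gives the q-q plot; points that fall close to a straight line are consistent with normality.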

Descriptive Statistics

Source: http://support.sas.com/learn/statlib.../eg_dist_2.htm

Problem: Generate descriptive statistics for the muzzle velocities of cartridges made from different types of gunpowder.

The mean and standard deviation of muzzle velocity for powder 1 are 27.6375 and 0.3925648.  For powder 2, these statistics are 28.0600 and 0.3062316.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/dist_2.htm#KB1

Football: Bench Presses and Squats

Find the value of the covariance between bench press and squat.

Problem

A set of data was collected for the Brigham Young football team. Information on players’ positions, heights, weights, percent body fat, neck measurements, and performances on various weight lifts were recorded. 

Determine the value of the covariance between bench press and squat, and state what this says about the relationship between the two variables.

Sample Data

The Football data set was collected from the Brigham Young University football program. Various body composition measurements (such as weight, height, percent body fat) and physical performance measurements (such as speed, bench press, and squat) are included in the data set. Note: Some observations have missing data values.

These are the variables in the data set:

Name Type Description
Height num height of player (inches)
Weight num weight of player (pounds)
Fat num percent body fat
Speed num evaluation of player’s speed
Neck num neck measurement (inches)
Bench num player’s bench press (pounds)
Squat num player’s squat (pounds)
LegPress num player’s leg press (pounds)
Position num primary position
Position2 num secondary position
Speed2 num second evaluation of player’s speed
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

Using SAS Enterprise Guide, cov(Bench, Squat) = 3530.19 (in squared pounds, since both variables are measured in pounds). The positive covariance indicates that bench press and squat tend to increase together: players who bench press more also tend to squat more, and vice versa. Covariance alone does not indicate how strong the relationship is, because its magnitude depends on the units of measurement.

Muzzle Velocities 1

Find muzzle velocities of cartridges made from different types of gunpowder.

Problem

A government agency, assigned to compare the efficiency of weaponry used by state-level law enforcement with that of weaponry used by the general public, performed an experiment where the muzzle velocities of cartridges made from two types of gunpowder were recorded. The same type of firearm and cartridge was used for both types of gunpowder in the study. However, one type of powder (powder 1) is available for the general public and the other (powder 2) is made available primarily to law enforcement officials. 

Determine the mean and standard deviation, and give the five-number summary for each type of gunpowder.

Sample Data

The Bullets data set contains data that was collected to determine whether there is a difference in the muzzle velocity of cartridges made from two types of gunpowder.

These are the variables in the data set:

Name Type Description
powder num type of gunpowder
velocity num muzzle velocity
Source of Data

This data is sample data from SAS Institute Inc.

Result

The mean and standard deviation of muzzle velocity for powder 1 are 27.6375 and 0.3925648. For powder 2, these statistics are 28.0600 and 0.3062316.

For powder 1, this is the five-number summary:
27.10 27.35 27.55 28.05 28.10

For powder 2, this is the five-number summary: 
27.60 27.90 28.00 28.30 28.50
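
The same summary can be sketched in Python with the standard library. The velocities below are hypothetical, not the Bullets data set, and quartile definitions differ between packages, so the quartiles from this sketch need not match SAS output exactly.

```python
import statistics

def summarize(values):
    """Mean, sample standard deviation, and five-number summary."""
    # "inclusive" is one of two quartile methods offered by the statistics module.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "five_number": (min(values), q1, median, q3, max(values)),
    }

# Hypothetical velocities, not the Bullets data set.
velocity = [27.2, 27.3, 27.5, 27.6, 27.8, 27.9, 28.0, 28.1]
summary = summarize(velocity)
print(summary)
```

To reproduce the exercise, the function would be applied once per powder type.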

Tests for Normality

Source: http://support.sas.com/learn/statlib.../eg_dist_3.htm

Problem: Test the normality of tree weights for regression analysis.

You can base your conclusion on any of the four test results given in the output table.  For instance, using the Shapiro-Wilk test, the outcome is not significant (at a level as high as α = 0.10), since the p-value = 0.1235.  So, we do not have strong enough evidence to reject the null hypothesis of normality.  Hence, we cannot claim that the condition for the normality of tree weights is violated (i.e., fail to reject H0).

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/dist_3.htm#KB1

Typing Speed

Determine whether the data for typing speeds are normally distributed.

Problem

Three brands of typewriters were tested for typing speed by having expert typists type identical passages of text. 

Perform the Shapiro-Wilk test to test that the typing speeds have a normal distribution.

Sample Data

The Typing_speed data set is from a test in which three brands of typewriters were tested for typing speed by 17 expert typists.

These are the variables in the data set:

Name Type Description
brand char brand of typewriter
speed num typing speed (words per minute)
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The output for the Shapiro-Wilk test (generated by using SAS Enterprise Guide) yields a p-value of 0.9214. This indicates that we cannot conclude that the typing speed data are non-Normal.

Tree Weights

Test the normality of tree weights for regression analysis.

Problem

A forestry commission once sought a way to accurately estimate the weights of trees without having to go through the damaging process of cutting the trees down to weigh them. The weights and trunk girths of 104 tree specimens were measured, in hopes that girth would be useful in predicting weight.

Begin checking the conditions for the simple linear regression analysis by performing a test for normality for the response, tree weight.

Sample Data

The Tree data set contains data about the weights and trunk girths of 104 tree specimens (eight specimens from each of thirteen rootstocks).

These are the variables in the data set:

Name Type Description
RootStock char rootstock (I – XIII)
TrunkGirth num trunk girth of specimen
Weight num weight of specimen
Source of Data

This data is sample data from SAS Institute Inc.

Result

You can base your conclusion on any of the following four test results: Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling. For instance, using the Shapiro-Wilk test, the outcome is not significant (at a level as high as α = 0.10), since the p-value = 0.1235. So, we do not have strong enough evidence to reject the null hypothesis of normality. Hence, we cannot claim that the condition for the normality of tree weights is violated (i.e., fail to reject H0).

Confidence Intervals

Source: http://support.sas.com/learn/statlib.../eg_dist_4.htm

Problem: Generate summary statistics to determine the 95% confidence interval for mean total fat in candy bars and non-bar candies.

You can be 95% confident that the mean total fat for candy bars and non-bar candies lies between 10.55 and 13.19 grams; that is, intervals constructed this way capture the true mean in about 95% of samples of this size. The 95% confidence interval is [10.55g, 13.19g]; the estimate is 11.87g; and the standard error is 0.6614g.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/dist_4.htm#KB1

Fat Content of Candy

Determine the mean total fat in candy bars and non-bar candies.

Problem

The United States Department of Agriculture (USDA) and the Department of Health and Human Services (HHS) have suggested that a daily diet should consist of an appropriate number of calories (among other things), of which 30% or fewer should be calories from fat. For adults consuming 2000 calories per day, this works out to 65 grams of fat or less. A sample of various candy bars and non-bar candies (such as M&Ms, Skittles, etc.) was collected, and nutritional facts from each candy were recorded, including total fat and saturated fat (measured in grams).

Determine the 95% confidence interval for the mean total fat μ for candy bars and non-bar candies. Also identify the estimate and the standard error of the estimate, and interpret the interval.

Sample Data

The Candy data set contains nutritional facts about candy bars and non-bar candies such as M&Ms, Reese's Pieces, Skittles, and Super Hot Tamales.

These are the variables in the data set:

Name Type Description
Brand char brand of candy
Name char name of candy
Serving_pkg num servings per package
Oz_pkg num ounces per package
Calories num calories
Total_fat_g num total fat content in grams
Saturated_fat_g num saturated fat content in grams
Cholesterol_g num cholesterol content in grams
Sodium_mg num sodium content in milligrams
Carbohydrate_g num carbohydrate content in grams
Dietary_fiber_g num dietary fiber content in grams
Sugars_g num sugars content in grams
Protein_g num protein content in grams
Vitamin_A_RDI num vitamin A content as a percentage of the RDI (Reference Daily Intake)
Vitamin_C_RDI num vitamin C content as a percentage of the RDI (Reference Daily Intake)
Calcium_RDI num calcium content as a percentage of the RDI (Reference Daily Intake)
Iron_RDI num iron content as a percentage of the RDI (Reference Daily Intake)
Source of Data

This data is sample data shipped with SAS Enterprise Guide by SAS Institute Inc.

Result

You can be 95% confident that the mean total fat for candy bars and non-bar candies lies between 10.55 and 13.19 grams; that is, intervals constructed this way capture the true mean in about 95% of samples of this size. The 95% confidence interval is [10.55g, 13.19g]; the estimate is 11.87g; and the standard error is 0.6614g.
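
A t-based confidence interval has the form estimate ± (critical value) × (standard error), so the reported numbers can be checked against one another. The short Python sketch below does only that arithmetic with the values quoted above; it does not recompute anything from the Candy data set.

```python
# Values quoted in the result above.
lower, upper = 10.55, 13.19
standard_error = 0.6614

estimate = (lower + upper) / 2   # midpoint of the interval
margin = (upper - lower) / 2     # half-width of the interval
# The multiplier implied by the reported margin and standard error.
implied_multiplier = margin / standard_error

print(f"estimate = {estimate:.2f} g")
print(f"margin of error = {margin:.2f} g")
print(f"implied critical value = {implied_multiplier:.3f}")
```

The midpoint reproduces the reported estimate, and the implied multiplier is close to 2, which is what one would expect from a t critical value at the 95% level for a sample of moderate size.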

t-Tests

One Sample

Source: http://support.sas.com/learn/statlib...eg_ttest_1.htm

Problem: Perform a one-sample t-test to determine if bell peppers hang on plants at an angle.

  1. Looking for evidence that peppers hang on plants at angles, you would test the hypotheses
    H0: μ = 0  versus  Ha: μ ≠ 0,
    where μ is the mean angle at which bell peppers hang.

    The p-value of 0.0037 is less than .05, so there is sufficient evidence to reject the null hypothesis.  Thus, you can conclude that bell peppers hang on plants at an angle.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/ttest_1.htm#KB2

Physics Aptitude

Test whether students' score on a physics test should be higher than 420.

Problem

Over 5000 students were tested on their abilities in Calculus and Physics for a study performed by the Third International Mathematics and Science Study. The testing was separated into four regions of the United States. Some students took the Calculus test, some took the Physics test, and some took both. Assume that the scores represent a random sample for each of the four regions of the U.S. 

Test the claim made by Physics teachers that the overall mean U.S. score on the Physics test should be higher than 420. Use a significance level of α = 0.05.

Sample Data

The Scores data set contains the scores on a Calculus and a Physics test given to 1000 students in a particular region of the U.S.

These are the variables in the data set:

Name Type Description
region num region of the U.S. the tests were administered in (categorical)
Calculus_Score num score on Calculus test
Physics_Score num score on Physics test
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The p-value for our one-sided test (given by SAS Enterprise Guide), with alternative hypothesis Ha: μ > 420, is 0.09375. This outcome is not significant at the α = 0.05 level. Thus, the data do not provide enough evidence to support the claim that the overall mean U.S. score on the Physics test is higher than 420.

Bell Peppers

Determine whether bell peppers hang on plants at an angle.

Problem

An agricultural company interested in cutting down the time needed to harvest its crops hired a mechanical engineer to design a mechanical harvester for bell peppers. To heighten the precision of his machine, the engineer measured and recorded the angle at which peppers hang on the plant.

Perform a one-sample t-test to determine whether the data gives good evidence that peppers hang on plants at an angle (different from zero).

Sample Data

The Peppers data set contains data about the angle at which peppers hang on the plant.

These are the variables in the data set:

Name Type Description
angle num angle at which the pepper hangs on the plant
Source of Data

This data is sample data from SAS Institute Inc.

Result

You’re testing the following hypotheses:

H0: μ = 0  versus  Ha: μ ≠ 0,
where μ is the mean angle at which bell peppers hang.

The p-value of 0.0037 is less than .05, so there is sufficient evidence to reject the null hypothesis.  Thus, you can conclude that bell peppers hang on plants at an angle.
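
The test statistic behind this result is the sample mean divided by its standard error. The Python sketch below computes it for a made-up set of angles (not the Peppers data set); turning the statistic into a p-value requires the t distribution, which the Python standard library does not provide, so that step is left to statistical software.

```python
import math
import statistics

def one_sample_t(values, mu0=0.0):
    """Return the t statistic and degrees of freedom for H0: mu = mu0."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    t_stat = (statistics.mean(values) - mu0) / standard_error
    return t_stat, n - 1

# Hypothetical angles in degrees, not the Peppers data set.
angles = [3.0, -1.0, 4.0, 2.0, 5.0, 0.0, 1.0, 2.0]
t_stat, df = one_sample_t(angles)
print(f"t = {t_stat:.3f}, df = {df}")
```

A statistic far from zero in either direction counts against H0 in this two-sided test.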

Paired Sample

Source: http://support.sas.com/learn/statlib...eg_ttest_2.htm

Problem: Perform a paired-sample t-test to determine whether a medication was successful in reducing blood pressure.

Looking for evidence of a reduction in blood pressure, you would test the hypotheses

H0: μ = 0  versus  Ha: μ > 0,

where μ is the mean difference in blood pressure (BaselineBP - NewBP).

The p-value given in the output (< 0.0001) is the result for the two-sided test with the alternative hypothesis Ha: μ ≠ 0. Because the observed mean difference is positive, the result for our one-sided test is half of this value, which is still less than the .05 significance level.  Hence, we can conclude that there is strong evidence to suggest that the medication was effective in reducing blood pressure.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/ttest_2.htm#KB2

Arrivals

Determine whether on-time arrivals differ between months.

Problem

The Department of Transportation recorded data that contains the percentage of airlines’ planes that arrived on time in 29 airports. Suppose you want to examine the differences between March and June.

Perform a matched pairs test to determine whether there is a difference in on-time arrivals between the two months. Use a significance level of α = 0.05.

Sample Data

The On_time_arrivals data set gives information on the percentage of on-time arrivals for airplanes of 10 airlines for the months of March, June, and August in the year 1999.

These are the variables in the data set:

Name Type Description
Airline char airplane airline
March_1999 num percentage of airplanes that arrived on-time in March
June_1999 num percentage of airplanes that arrived on-time in June
August_1999 num percentage of airplanes that arrived on-time in August
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

Using SAS Enterprise Guide, the p-value for the matched pairs test is found to be 0.0021, which is significant at the α = 0.05 level. Hence, we can conclude that there is a significant difference in on-time arrivals between the months of March and June.

Blood Pressure 1

Determine whether a medication was successful in reducing blood pressure.

Problem

To observe the effectiveness of a medication in reducing blood pressure, an experiment was conducted in which researchers collected data from a random sample of individuals who were considered to have high blood pressure. The diastolic blood pressure of these individuals was recorded, after which they were placed on the medication. One month later, their diastolic pressure was recorded again.

Determine whether the data gives good evidence that the medication was effective in reducing blood pressure by carrying out a test of significance (at level α = 0.05), pairing the two blood pressure recordings for each subject.

Sample Data

The Bloodpressure data set contains data from a random sample of individuals with high blood pressure. Variables include subject, age, an initial measurement of diastolic pressure, and a later measurement of diastolic pressure after one month of medication.

These are the variables in the data set:

Name Type Description
Subject char subject code
Age num subject’s age
BaselineBP num subject’s baseline blood pressure
NewBP num subject’s blood pressure one month after starting to take medication
Source of Data

This data is sample data from SAS Institute Inc.

Result

The mean difference between the two blood pressure recordings is greater than zero, which indicates an overall reduction in blood pressure, on average.

You’re testing the following hypotheses:

H0: μ = 0  versus  Ha: μ > 0,
where μ is the mean difference in blood pressure (BaselineBP - NewBP).

The p-value is less than 0.0001, which is less than the .05 significance level.  Hence, you can conclude that there is strong evidence to suggest that the medication was effective in reducing blood pressure.
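
A paired test works on the within-subject differences, so the pairing step matters more than the formula. The Python sketch below shows that step with invented readings (not the Bloodpressure data set): each subject's new reading is subtracted from that subject's baseline, and the t statistic is then computed from the differences alone.

```python
import math
import statistics

# Hypothetical readings (mm Hg), not the Bloodpressure data set.
baseline_bp = [95, 98, 102, 94, 99, 101]
new_bp = [90, 95, 96, 93, 94, 97]

# Pair the readings subject by subject: BaselineBP - NewBP.
differences = [b - n for b, n in zip(baseline_bp, new_bp)]

mean_diff = statistics.mean(differences)
standard_error = statistics.stdev(differences) / math.sqrt(len(differences))
t_stat = mean_diff / standard_error
print(differences, mean_diff, round(t_stat, 3))
```

With the difference defined as BaselineBP - NewBP, a positive mean difference and a large positive t statistic point toward a reduction, matching the direction of Ha above.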

Two Sample

Problem: Perform a two-sample t-test to compare the median values of owner-occupied homes in two groups of housing tracts.

Source: http://support.sas.com/learn/statlib...eg_ttest_3.htm

  1. The second table in the TTEST Procedure output gives the result of the two-sample t-test. Looking for evidence that the median home value for homes near the Charles River is greater than that for homes farther away, we would test the hypotheses

    H0: μ = 0  versus  Ha: μ < 0,
    where μ is the mean difference in median home value (Far - Near).

    For more accuracy, we will rely on the results of the Satterthwaite method, which accounts for the possibility of unequal variances between the two populations (which is tested by the F-test in the bottom table).

    The t statistic has a value of -3.11, which indicates evidence that the median home value for “near homes” is greater than the median home value for “far homes.”

    The p-value given in the table (0.0036) is for a two-sided test.  To obtain the p-value for our one-sided test, we would divide this by two, which would result in an outcome that is still significant at the α = 0.05 level (p-value = 0.0018).

    So, we can conclude that on average, the median value of owner-occupied homes near the Charles River is greater than that of owner-occupied homes farther from the river.
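
The Satterthwaite method mentioned above replaces the pooled variance with separate variance terms for each group and adjusts the degrees of freedom accordingly. The Python sketch below shows the standard unequal-variance formulas; the two samples are invented for illustration and are not the Bostonhousing data, and the function name welch_t is only a label.

```python
import statistics

def welch_t(sample_a, sample_b):
    """Unequal-variance t statistic and Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Each term is a sample variance divided by its sample size.
    v_a = statistics.variance(sample_a) / n_a
    v_b = statistics.variance(sample_b) / n_b
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / (v_a + v_b) ** 0.5
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t_stat, df

# Hypothetical median home values in $1000's, not the Bostonhousing data.
far = [20.1, 22.4, 19.8, 23.0, 21.5, 24.2, 18.9, 22.8]
near = [27.5, 31.0, 24.9, 33.2, 29.8]

t_stat, df = welch_t(far, near)
print(f"t = {t_stat:.2f}, Satterthwaite df = {df:.1f}")
```

With the difference taken as Far - Near, a negative statistic is the direction that supports Ha. The degrees of freedom are generally not a whole number and fall between the smaller group's n - 1 and the pooled n1 + n2 - 2.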

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/ttest_3.htm#KB2

Value of Owner-Occupied Homes

Compare the median values of owner-occupied homes in two groups of housing tracts.

Problem

A census study was carried out, in which data was collected from 506 housing tracts in the Boston area. Some of the variables of interest for each tract were median home value, crime rate, relative distance from the Charles River (near or far), percentage of lower economic status families, and so on.

Use a two-sample t-test (with significance level α = 0.05) to determine whether, on average, the median value of owner-occupied homes near the Charles River is greater than that of homes farther away from it. 

Sample Data

The Bostonhousing data set contains census information for 506 housing tracts in the Boston area.

These are the variables in the data set:

Name Type Description
MedianValue num median value of owner-occupied homes in $1000's
Crime num per capita crime rate by town
Zone num proportion of residential land zoned for lots over 25,000 sq.ft
Industry num proportion of non-retail business acres per town
Charles char Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX num nitric oxides concentration (parts per 10 million)
Rooms num average number of rooms per dwelling
Bef_1940 num proportion of owner-occupied units built prior to 1940
DistEmp num weighted distances to five Boston employment centers
Highways num index of accessibility to highways
TaxRate char full-value property-tax rate per $10,000
PTRatio char pupil-teacher ratio by town
LowStatus char % lower status of the population
Source of Data

This data is sample data from SAS Institute Inc.

Result

Looking for evidence that the median home value for homes near the Charles River is greater than that for homes farther away, you test the hypotheses

H0: μ = 0  versus  Ha: μ < 0,

where μ is the mean difference in median home value (Far - Near). The mean of the median values for tracts near the Charles River is 28.44, and the mean for tracts farther away is 22.094. The difference is -6.346. Whether this difference is judged significant can depend on the method you use for the t-test. For more accuracy, you can use the Satterthwaite method, which accounts for the possibility of unequal variances between the two populations (which is tested by the F-test). Using the Satterthwaite method, the t statistic has a value of -3.11, which indicates evidence that the median home value for “near homes” is greater than the median home value for “far homes.”

The p-value for a two-sided test is 0.0036. To obtain the p-value for a one-sided test, you divide this by two, which results in an outcome that is still significant at the α = 0.05 level (p-value = 0.0018).

So, you can conclude that on average, the median value of owner-occupied homes near the Charles River is greater than that of owner-occupied homes farther from the river.
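
Halving the two-sided p-value is valid only when the test statistic points in the direction stated by the alternative hypothesis. The small Python helper below makes that condition explicit; the function name and its arguments are illustrative, and the numbers passed in are the two-sided p-value and t statistic quoted above.

```python
def one_sided_p(two_sided_p, t_stat, alternative):
    """Convert a two-sided t-test p-value to a one-sided one.

    alternative is "less" (Ha: mu < 0) or "greater" (Ha: mu > 0).
    """
    if alternative not in ("less", "greater"):
        raise ValueError("alternative must be 'less' or 'greater'")
    in_direction = t_stat < 0 if alternative == "less" else t_stat > 0
    half = two_sided_p / 2
    # Halving only applies when the statistic points the way Ha predicts.
    return half if in_direction else 1 - half

p = one_sided_p(0.0036, -3.11, "less")
print(f"{p:.4f}")
```

Here the statistic is negative and the alternative is μ < 0, so the one-sided p-value is half of the two-sided value. Had the statistic been positive, the one-sided p-value would have been close to 1 instead.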

Correlation

Graphical Evaluation

Source: http://support.sas.com/learn/statlib.../eg_corr_1.htm

Problem: Make a scatter plot to display the relationship between two quantitative variables.

View the scatter plot. There appears to be a weak, positive linear relationship between crime and education. The relationship is positive in the sense that as the education score increases, the crime score also tends to increase for these western cities, which one would think is an undesirable effect.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/corr_1.htm#KB3

Crime and Education in Cities

Plot the relationship between crime and education in western cities

Problem

In a study to investigate the prevalence of relationships between different socioeconomic factors, 52 western cities were rated by nine criteria: climate and terrain, housing, health care and environment, crime, transportation, education, arts, recreation, and economics. For housing and crime, the lower the rating score, the better. For the remaining seven criteria, the higher the score, the better.

Sample Data

The Westernrates data set contains data about ratings on nine criteria (climate and terrain, housing, health care and environment, crime, transportation, education, arts, recreation, and economics) for 52 western cities.

These are the variables in the data set:

Name Type Description
City char city
State char state
ClimateTerrain num rating of climate and terrain
Housing num rating of housing
HealthCareEnvironment num rating of health care and environment
Crime num rating of crime
Transportation num rating of transportation
Education num rating of education
Arts num rating of the arts
Recreation num rating of recreation
Economics num rating of economics
Source of Data

This data is sample data from SAS Institute Inc.

Result

There appears to be a weak, positive linear relationship between crime and education. The relationship is positive in the sense that as the education score increases, the crime score also tends to increase for these western cities, which one would think is an undesirable effect.

Pearson Correlation

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/eg_corr_2.htm

Problem: Determine the Pearson correlation coefficient to measure the strength of the linear relationship between two quantitative variables.

See the value of the Pearson correlation coefficient for the two variables. The correlation value of 0.30337 suggests that there is a weak positive relationship between crime and recreation. Furthermore, this implies that there is an increase in the crime score as the recreation score increases for these western cities.
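
For readers working outside SAS, the following Python sketch shows one way to compute a Pearson correlation coefficient from two lists of scores. The function name `pearson_r` and the small rating lists are illustrative assumptions, not values from the Westernrates data set.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical recreation and crime ratings for six cities
recreation = [1200, 1800, 2500, 1400, 3100, 2000]
crime = [700, 950, 900, 1100, 1250, 800]
print(round(pearson_r(recreation, crime), 5))
```

Values near 0 indicate a weak linear relationship and values near -1 or 1 a strong one; the sign gives the direction.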

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/corr_2.htm#KB3

Football: Height and Neck Measurements

Determine the correlations between height and neck measurement, and weight and neck measurement.

Problem

A set of data was collected for the Brigham Young football team. Information on players’ positions, heights, weights, percent body fat, neck measurements, and performances on various weight lifts were recorded.

Use the Pearson correlation coefficient to evaluate the relationship between height and neck measurement, and weight and neck measurement. With which of these two variables does neck measurement seem to be more correlated?

Sample Data

The Football data set was collected from the Brigham Young University football program. Various body composition measurements (such as weight, height, percent body fat) and physical performance measurements (such as speed, bench press, and squat) are included in the data set. Note: some observations have missing data values.

These are the variables in the data set:

Name Type Description
Height num height of player (inches)
Weight num weight of player (pounds)
Fat num percent body fat
Speed num evaluation of player’s speed
Neck num neck measurement (inches)
Bench num player’s bench press (pounds)
Squat num player’s squat (pounds)
LegPress num player’s leg press (pounds)
Position num primary position
Position2 num secondary position
Speed2 num second evaluation of player’s speed
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

Using SAS Enterprise Guide, the value of the Pearson correlation coefficient between height and neck measurement is 0.47769. The value of the correlation between weight and neck measurement is 0.81370. So, based on these values, neck measurement appears to be more closely correlated with weight than with height.

Student Survey 3

Investigate characteristics of college students using survey results.

Problem

To determine the characteristics of the students in their large introductory statistics class, a group of college professors administered a survey to the students in their classes. Some of the questions asked include: the student’s age in years, height (in inches), shoe size and high school GPA.

Exercise: 
a. Before you actually do any calculations or construct graphics, consider the relationship between a student’s height and the height of their ideal mate. What type of relationship would you expect for these variables? (Positive or negative? Strong or weak?)
b. Explain in layman’s terms what a positive correlation between these variables would mean. Then explain (again in layman’s terms) what a negative correlation would mean.
c. Calculate Pearson’s correlation for these variables. Is it positive or negative? Does this match your intuition? Explain.
d. Create a scatterplot of these variables. Does the scatterplot match the correlation you calculated? Explain. 
e. Further explore the data to develop an explanation for the calculated correlation.

Sample Data

The Survey data is the result of a survey administered to a large introductory statistics class. The data set contains answers from 485 participants. Not all questions are answered, resulting in missing data. Missing data is indicated by a period.

These are the variables in the data set:

Name Type Description
Gender char gender (male or female) of the respondent
Age num age of the subject in years
Textbook num answer to the question “How much did you spend for textbooks this term (to nearest dollar)?”
Cigs num answer to the question “How many cigarettes did you smoke yesterday?”
ColGPA num answer to the question “What is your cumulative Grade Point Average at this institution?”
HSGPA num answer to the question “What was your cumulative high school GPA (4 point scale)?”
Height num height of the respondent in inches
Mateh num the height of the respondent’s “ideal mate”
Shoe num the respondent’s shoe size
Breakfast char answer to the question “Did you eat breakfast this morning?”
Flight char answer to the question “Did you fly on a commercial airline during the past 30 days? (Yes or No)”
Play char answer to the question “Have you seen a play in a live theater in the past 6 months? Yes or No”
Vote char answer to the question “Are you registered to vote? Yes or No”
Credit num answer to the question “How many credit hours are you taking this term?”
Source of Data

This data was collected by Roger Woodard of the North Carolina State University in 2005.

Result

a. Most people would expect that a student’s height and the height of their ideal mate are positively correlated.
b. A positive relationship would indicate that a taller individual would want a taller mate. Thus, a tall basketball player would want a tall mate rather than a very short one. A negative correlation would indicate that tall people want a short mate and a short person wants a tall mate.
c. The correlation is -0.289. This does not match what most people would expect.
d. The scatterplot is given below. It does match the correlation.
e. The scatterplot has two clusters. Further exploration might lead a person to split the scatterplot by gender. Doing this (see the second scatterplot) shows that there are two positive correlations joined together. The positive correlation matches what we expect. If we calculate the correlations separately by gender, we find that the correlations for males (0.468) and females (0.401) are both positive.

Teacher's note: This is a good example of why one should not take a correlation at face value, but instead always make a scatterplot of the relationship. It is a continuous example of Simpson’s paradox where a third variable changes the apparent relationship of two variables.
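
The effect described in this note can be reproduced with a few made-up numbers. The Python sketch below is only an illustration: the (height, ideal mate height) pairs are hypothetical and are not taken from the Survey data, and `pearson_r` is a helper written for this example.

```python
from math import sqrt


def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


# Hypothetical (own height, ideal mate height) pairs in inches
males = [(68, 63), (70, 65), (72, 64), (74, 66), (76, 67)]
females = [(60, 69), (62, 71), (64, 70), (66, 72), (67, 73)]

print("males:  ", round(pearson_r(males), 3))            # positive
print("females:", round(pearson_r(females), 3))          # positive
print("pooled: ", round(pearson_r(males + females), 3))  # negative
```

Within each group, taller respondents name taller ideal mates, but pooling the two clusters reverses the sign because the taller group names shorter mates overall.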

Crime and Recreation in Cities

Find the correlation of crime and recreation in western cities.

Problem

In a study to investigate the prevalence of relationships between different socioeconomic factors, 52 western cities were rated by nine criteria: climate and terrain, housing, health care and environment, crime, transportation, education, arts, recreation, and economics. For housing and crime, the lower the rating score, the better. For the remaining seven criteria, the higher the score, the better.

Find the correlation of crime and recreation for this sample data. Based on your value for the Pearson correlation coefficient, would you describe the linear relationship between crime and recreation as strong or weak, positive or negative?

Sample Data

The Westernrates data set contains data about ratings on nine criteria (climate and terrain, housing, health care and environment, crime, transportation, education, arts, recreation, and economics) for 52 western cities.

These are the variables in the data set:

Name Type Description
City char city
State char state
ClimateTerrain num rating of climate and terrain
Housing num rating of housing
HealthCareEnvironment num rating of health care and environment
Crime num rating of crime
Transportation num rating of transportation
Education num rating of education
Arts num rating of the arts
Recreation num rating of recreation
Economics num rating of economics
Source of Data

This data is sample data from SAS Institute Inc.

Result

The correlation value of 0.30337 suggests that there is a weak positive relationship between crime and recreation. Furthermore, this implies that there is an increase in the crime score as the recreation score increases for these western cities.

Spearman Correlation

Source: http://support.sas.com/learn/statlib.../eg_corr_3.htm

Problem: Determine the Spearman correlation coefficient to measure the strength of the relationship between purchase level and age.

View the value of the Spearman correlation coefficient for the two variables. The correlation value of 0.03070 indicates a very weak association between purchase level and age.
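
The Spearman coefficient is the Pearson correlation of the ranks of the two variables, with tied values sharing the average of their ranks. Ties matter here because purchase level takes only the values 0 and 1. The Python sketch below illustrates the idea; the function names and the small data lists are assumptions for illustration and are not the Sales data.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Hypothetical purchase levels (0/1) and ages
purchase = [0, 1, 0, 1, 1, 0, 0, 1]
age = [23, 45, 31, 29, 52, 60, 38, 41]
print(round(spearman_rho(purchase, age), 5))
```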

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/corr_3.htm#KB3

Top Movies

Determine the correlation between the year of release and the amount of money a movie makes.

Problem

A set of data was compiled, which lists the top grossing movies of all time (as of June 2003). It contains information showing the name of the movie, the amount of money it made both in the U.S. and foreign markets, its year of release, and the type of movie.

Use Spearman’s correlation to assess the relationship between the amount of money made in the U.S. by movies and their years of release.

Sample Data

The Movies data set contains a list of the top grossing movies of all time (as of June 2003). The names of the 277 movies in the list are provided, along with the type and rating of each movie, year of release, money grossed in both U.S. and foreign markets, and the name of the movie director.

These are the variables in the data set:

Name Type Description
Movie char name of movie
Type char type of movie (comedy, family, drama, etc.)
Rating char movie rating (G, PG, PG-13, R)
Year num year of release
Domestic_ num money made in U.S. (in millions of dollars)
Worldwide_ num money made in foreign market (in millions of dollars)
Director char movie director
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

Using SAS Enterprise Guide, the value of the Spearman correlation coefficient was found to be 0.13396. So, the sample data suggests that there is a weak, positive, monotonic relationship between the amount of money made in the U.S. market by movies and their years of release.

Mail-Order Customers 1

Measure the relationship between customers’ purchase level and age.

Problem

A mail-order company has decided that customers who spend 100 dollars or more on purchases should be the focus of its advertising efforts. To help identify this target group, the company collected information from its customers including purchase level (1 = at least $100, 0 = less than $100), gender, income level, and age.

Use the Spearman correlation coefficient to measure the strength of association between purchase level and age.

Sample Data

The Sales data set contains data about customers of a mail-order company.

These are the variables in the data set:

Name Type Description
purchase num customer’s purchase level (1 = at least $100, 0 = less than $100)
age num customer’s age
gender char customer’s gender
income char customer’s income level (Low, Medium, High)
Source of Data

This data is sample data from SAS Institute Inc.

Result

The correlation value of 0.03070 indicates a very weak association between purchase level and age.

Linear Regression

Simple

Source: http://support.sas.com/learn/statlib...g_linreg_1.htm

Problem: Use simple linear regression to describe the relationship between the cost of operations of auction markets and the number of cattle sold.

The first of these tables displays some statistics for the sample data.

The last table in the output gives the parameter estimates for regressing cost on cattle sold, along with the standard errors for these estimates and the p-values for their tests of significance to the model.

The estimate for the intercept parameter is 7.19650, and the estimate for the slope parameter is 4.56396.  So, the equation of the simple linear regression line is

y-hat = 7.19650 + 4.56396x,

where y-hat is the predicted cost of operations (in thousands of dollars) and x is the number of cattle sold (in thousands).
For a market anticipating the sale of 13,500 cattle, the predicted cost of operations is

y-hat = 7.19650 + 4.56396(13.5) = 68.810

So, about $68,810 in operating costs is expected for 13,500 cattle sold.
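
The intercept and slope reported above are least-squares estimates. As a minimal sketch of how such estimates and a prediction are computed, the Python code below fits a line to a few made-up (cattle, cost) pairs. The data and the function names `fit_line` and `predict` are illustrative assumptions; they are not the Market data set or SAS Enterprise Guide output.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x


# Hypothetical data: cattle sold (thousands), cost of operations (thousands of dollars)
cattle = [3.4, 5.6, 8.1, 10.2, 14.7, 20.3]
cost = [23.0, 33.5, 44.1, 55.0, 73.8, 99.6]

b0, b1 = fit_line(cattle, cost)
print(round(b0, 4), round(b1, 4))
print(round(predict(b0, b1, 13.5), 3))  # x is in thousands, so 13.5 means 13,500 head
```

Note that x must be entered in the same units used to fit the model, which is why 13,500 cattle is entered as 13.5.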

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/linreg_1.htm#KB4

Student Survey 4

Investigate characteristics of college students using survey results.

Problem

To determine the characteristics of the students in their large introductory statistics class, a group of college professors administered a survey to the students in their classes. Some of the questions asked include: the student’s age in years, height (in inches), shoe size and high school GPA. 

Exercise: 
a. Can we estimate a person’s height from their shoe size? Create a scatterplot of the relationship between shoe size and height. Choose the appropriate variable to be on the horizontal axis. 
b. Examine the scatterplot. Is the relationship positive or negative? 
c. Calculate the least squares regression line that relates height to shoe size. What is the least squares line? 
d. Explain in layman’s terms the meaning of the slope term in this regression line.
e. Is the slope between shoe size and height significant? Explain how you know. 
f. Does the intercept term make sense in this setting? Explain. 
g. Calculate the R-square statistic for the relationship between shoe size and height. Explain what this statistic means.
h. We find a student shoe print that was a size 10. What height would you estimate for this student? Also calculate a prediction interval. Would this interval be useful in identifying the student? Explain.

Sample Data

The Survey data is the result of a survey administered to a large introductory statistics class. The data set contains answers from 485 participants. Not all questions are answered, resulting in missing data. Missing data is indicated by a period.

These are the variables in the data set:

Name Type Description
Gender char gender (male or female) of the respondent
Age num age of the subject in years
Textbook num answer to the question “How much did you spend for textbooks this term (to nearest dollar)?”
Cigs num answer to the question “How many cigarettes did you smoke yesterday?”
ColGPA num answer to the question “What is your cumulative Grade Point Average at this institution?”
HSGPA num answer to the question “What was your cumulative high school GPA (4 point scale)?”
Height num height of the respondent in inches
Mateh num the height of the respondent’s “ideal mate”
Shoe num the respondent’s shoe size
Breakfast char answer to the question “Did you eat breakfast this morning?”
Flight char answer to the question “Did you fly on a commercial airline during the past 30 days? (Yes or No)”
Play char answer to the question “Have you seen a play in a live theater in the past 6 months? Yes or No”
Vote char answer to the question “Are you registered to vote? Yes or No”
Credit num answer to the question “How many credit hours are you taking this term?”
Source of Data

This data was collected by Roger Woodard of the North Carolina State University in 2005.

Result

a. The scatterplot is given below. If we are predicting height with the shoe size we should put the shoe size on the horizontal axis.
b. For this scatterplot we see a strong positive correlation.
c. The least squares line is “height=52.05+1.64*shoe”.
d. The slope term 1.64 indicates that the expected height of a student increases by 1.64 inches for each one-size increase in shoe size.
e. The slope term is significant based on the t-test of significance. This term has a p-value of less than 0.001.
f. The intercept term would indicate that a person with shoe size zero would be 52.05 inches tall. It is impossible to have a shoe size of zero so this term does not have a practical interpretation. 
g. The R-square is 0.7318. This indicates that about 73% of the variability in the height of a subject can be explained by the straight line relationship with shoe size.
h. We would predict a height of about 68.45 inches. A prediction interval around this estimate ranges from 64.34 to 72.55 inches. This interval is not very informative: even without the information about shoe size, we might guess that someone would be between 5’4” and 6’0” tall.
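
The arithmetic behind answers c and h can be checked with a short Python sketch. The function names are assumptions for illustration; the coefficients and interval bounds are the ones given in the result above.

```python
def predict_height(shoe_size, intercept=52.05, slope=1.64):
    """Predicted height in inches from the least squares line in answer c."""
    return intercept + slope * shoe_size


def feet_inches(total_inches):
    """Split a height in inches into whole feet and remaining inches."""
    feet = int(total_inches // 12)
    return feet, total_inches - 12 * feet


print(round(predict_height(10), 2))  # 68.45

# Express the prediction interval bounds in feet and inches
for bound in (64.34, 72.55):
    feet, inches = feet_inches(bound)
    print(feet, round(inches, 2))  # 5 4.34, then 6 0.55
```

The interval spans roughly 5 feet 4 inches to 6 feet, which is why it does little to narrow down the student.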

Teacher's note: This example can be well motivated by the question of evaluating crime scene evidence. It can serve as a good lead-in for multiple regression by considering adding variables such as gender and age to the relationship.

Livestock Auctions 1

Describe the relationship between the cost of operations and the number of cattle sold at auctions.

Problem

A group of livestock auction market managers were interested in learning how the number of cattle sold at their markets influenced the cost of operations of their markets. Data was collected from 19 such auction markets.

Find the simple linear regression equation for cost of operations on cattle sold. Use this equation to predict the cost of operations for a market that anticipates selling 13,500 cattle.

Sample Data

The Market data set contains data from 19 livestock auction markets, including the number of head of cattle sold (in thousands), the cost of operations of the auction market (in thousands of dollars), and the market identifier.

These are the variables in the data set:

Name Type Description
marketid char market identifier
cattle num numbers of head of cattle sold (in thousands)
cost num cost of operations of the auction market (in thousands of dollars)
Source of Data

This data is sample data from SAS Institute Inc.

Result

The estimate for the intercept parameter is 7.19650, and the estimate for the slope parameter is 4.56396. So, the equation of the simple linear regression line is

y-hat = 7.19650 + 4.56396x,

where y-hat is the predicted cost of operations (in thousands of dollars) and x is the number of cattle sold (in thousands).

For a market anticipating the sale of 13,500 cattle, the predicted cost of operations is

y-hat = 7.19650 + 4.56396(13.5) = 68.810

So, about $68,810 in operating costs is expected for 13,500 cattle sold.

Multiple

Source: http://support.sas.com/learn/statlib...g_linreg_2.htm

Problem: Use multiple linear regression to describe the relationship between the cost of operations of auction markets and the number of livestock sold.

The last part of the output gives the parameter estimates for regressing cost on the number of sheep, hogs, calves and cattle sold, along with the standard errors for these estimates and the p-values for their tests of significance to the model.
Based on the parameter estimates, the multiple linear regression function is given by

Y-hat = 2.28842 + 3.21552X1 + 1.61315X2 + 0.81485X3 + 0.80258X4,

where Y-hat is the predicted cost of operations (in thousands of dollars) and X1 is the number of cattle sold, X2 is the number of calves sold, X3  is the number of hogs sold, and X4 is the number of sheep sold (in thousands).
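
To show how the fitted function is used, the Python sketch below evaluates it for one market. The coefficients are the ones reported above; the function name `predict_cost` and the sales figures are hypothetical and chosen only for illustration.

```python
# Parameter estimates reported in the text (cost in thousands of dollars)
INTERCEPT = 2.28842
COEFFICIENTS = {
    "cattle": 3.21552,
    "calves": 1.61315,
    "hogs": 0.81485,
    "sheep": 0.80258,
}


def predict_cost(sold):
    """Predicted cost of operations for a dict of head sold (in thousands)."""
    missing = set(COEFFICIENTS) - set(sold)
    if missing:
        raise ValueError(f"missing predictors: {sorted(missing)}")
    return INTERCEPT + sum(COEFFICIENTS[name] * sold[name] for name in COEFFICIENTS)


# Hypothetical market: 10,000 cattle, 2,000 calves, 5,000 hogs, 1,000 sheep
market = {"cattle": 10.0, "calves": 2.0, "hogs": 5.0, "sheep": 1.0}
print(round(predict_cost(market), 3))
```

Each coefficient is the estimated change in cost (in thousands of dollars) for an additional thousand head of that class, holding the other three classes fixed.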

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/linreg_2.htm#KB4

Livestock Auctions 2

Describe the relationship between the cost of operations and the number of livestock sold at auction.

Problem

A cost analyst was hired by a group of livestock auction market managers to determine how the number of various classes of livestock (cattle, calves, hogs, and sheep) sold at their markets influenced the cost of operations of their markets. Data was provided by 19 market managers.

Find the multiple linear regression function for expressing cost of operations as a function of the number of livestock sold for the four classes.

Sample Data

The Auction data set contains data from 19 livestock auction markets, including the number of head of each of four types of livestock sold (in thousands), the cost of operation of the auction market (in thousands of dollars), and the market identifier.

These are the variables in the data set:

Name Type Description
marketid char market identifier
cattle num number of head of cattle sold (in thousands)
calves num number of head of calves sold (in thousands)
hogs num number of head of hogs sold (in thousands)
sheep num number of head of sheep sold (in thousands)
cost num cost of operation of the auction market (in thousands of dollars)
volume num total of all cattle, calves, hogs, and sheep sold in each market
Source of Data

This data is sample data from SAS Institute Inc.

Result

Based on the parameter estimates, the multiple linear regression function is given by

Y-hat = 2.28842 + 3.21552X1 + 1.61315X2 + 0.81485X3 + 0.80258X4,

where Y-hat is the predicted cost of operations (in thousands of dollars) and X1 is the number of cattle sold, X2 is the number of calves sold, X3 is the number of hogs sold, and X4 is the number of sheep sold (in thousands).

Polynomial

Source: http://support.sas.com/learn/statlib...g_linreg_3.htm

Problem: Determine whether a quadratic regression function adequately describes the relationship between tree weight and trunk girth.

View the ANOVA table for the quadratic model. It appears that the model is significant, based on the p-value being less than 0.0001.

Look at the F-tests for type I and type III sums of squares. (The type I sums of squares are the extra sums of squares obtained by sequentially adding the linear and quadratic variables to the model, in that order. The type III sums of squares are the extra sums of squares that result from adding one variable to a model that already contains the other variable.) Based on the results, it appears that both the linear and quadratic trunk girth terms are significant in predicting tree weight (p-values less than or equal to 0.0001).

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/linreg_3.htm#KB4

Weight to Height

Fit a quadratic model to describe a relationship between two variables.

Problem

A study was performed in which 72 children were measured in order to examine how the weight to height ratio changes as kids get older. The ratio and age (in months) were recorded for each young participant.

Fit a quadratic curve to the sample data to model the relationship between weight to height ratio and age, and give its equation. Display this curve on a scatterplot of the data. For your model, Ratio will be the response variable and Age will be the explanatory variable.

Sample Data

The Growth data set is the recordings of the weight to height ratios and ages (in months) of 72 children.

These are the variables in the data set:

Name Type Description
ratio num weight to height ratio
age num age in months
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

Using SAS Enterprise Guide, the estimated quadratic fit to the data is given by the equation
predicted ratio = 0.6022 + 0.01060*age - 0.00007337*age^2.

To create the scatterplot of the data, with the quadratic curve displayed, use a graphing facility to plot the data points and graph the function (with age on the horizontal axis and ratio on the vertical axis).
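
If you do not have a graphing facility at hand, the curve can be tabulated with a few lines of Python and the points passed to any plotting tool. This is a minimal sketch using the coefficients reported above; the function names and the age range 0 to 72 months are assumptions for illustration, since the range of ages in the Growth data is not stated here.

```python
# Coefficients of the quadratic fit reported in the text
B0, B1, B2 = 0.6022, 0.01060, -0.00007337


def predicted_ratio(age_months):
    """Predicted weight to height ratio at a given age in months."""
    return B0 + B1 * age_months + B2 * age_months ** 2


def curve_points(start, stop, step):
    """(age, predicted ratio) pairs for drawing the fitted curve."""
    return [(age, predicted_ratio(age)) for age in range(start, stop + 1, step)]


# Assumed plotting range: 0 to 72 months in 12-month steps
for age, ratio in curve_points(0, 72, 12):
    print(age, round(ratio, 4))

# Age at which the fitted parabola peaks (vertex of the quadratic)
peak_age = -B1 / (2 * B2)
print(round(peak_age, 2))
```

Because the quadratic coefficient is negative, the fitted curve rises more and more slowly with age and flattens out near the peak age printed at the end.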

Tree Weights and Trunk Girths

Determine whether a quadratic regression function adequately describes a weight/girth relationship.

Problem

A forestry commission once sought a way to accurately estimate the weights of trees without having to go through the damaging process of cutting the trees down to weigh them. The weights and trunk girths of 104 tree specimens were measured, in hopes that girth would be useful in predicting weight.

Find the quadratic function for regressing tree weight on trunk girth, and determine whether this function provides a good fit for the data.

Sample Data

The Tree data set contains data about the weights and trunk girths of 104 tree specimens (eight specimens from each of thirteen rootstocks).

These are the variables in the data set:

Name Type Description
RootStock char rootstock (I – XIII)
TrunkGirth num trunk girth of specimen
Weight num weight of specimen
Source of Data

This data is sample data from SAS Institute Inc.

Result

It appears that the quadratic model is significant, based on the p-value being less than 0.0001.

Based on the results of the F-tests for type I and type III sums of squares, it appears that both the linear and quadratic trunk girth terms are significant in predicting tree weight (p-values less than or equal to 0.0001).

Stepwise

Source: http://support.sas.com/learn/statlib...g_linreg_4.htm

Problem: Use the stepwise selection method to choose a model that “best” fits the corn data.

View the first part of the output to see the number of observations read and used.

Scroll down to view the results of step 1 of the stepwise selection. Notice that the variable July_Temp entered the model. The ANOVA table, followed by the estimates, standard errors, and significance tests for the parameters, is displayed for the resulting model.

Scroll down to view the results of step 2 of the stepwise selection. The variable July_Rain was added to the model in this step.

View the summary of the stepwise selection for this data. Then view the F value and p-value.

The final results of the stepwise selection show the inclusion of only the July_Temp and July_Rain variables in the “best” fit model.  So, the resulting linear regression function based on this selection method is

predicted CornYield = 163.47 + 3.49(July_Rain) - 1.68(July_Temp)

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/linreg_4.htm#KB4

Mass Prediction

Construct a model with mass as the response variable, using stepwise selection.

Problem

A statistics project was carried out in Australia in which body measurements were collected from 22 male subjects. The aim of the project was to construct a model which would predict the mass of a person based on other characteristics.

Use the stepwise selection method to fit such a model using a 0.15 significance level for entry.

Sample Data

The Body_measurements data set is from a project that involved the recording of 11 body measurements (such as mass, arm measurements, chest measurement, height, etc.) for 22 male subjects.

These are the variables in the data set:

Name Type Description
Mass num body mass
Fore num forearm measurement
Bicep num bicep measurement
Chest num chest measurement
Neck num neck measurement
Shoulder num shoulder measurement
Waist num waist measurement
Height num height
Calf num calf measurement
Thigh num circumference around one thigh
Head num circumference around head
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The resulting model for this data fit by the stepwise selection method (using the REG procedure in SAS) is the following:

predicted Mass = -80.4533 + 2.1232*Fore + 0.6656*Waist + 0.2770*Height + 0.5232*Thigh - 0.6371*Head

Corn Yield

Choose a model that “best” fits the corn data.

Problem

Data was collected to study the effect of weather related events on corn yield. The variables involved in the study were the total precipitation (in inches) for the year before the growing season, the average daily temperature (Fahrenheit) for the months May through August, the total rain for the months June through August, and corn yield (in bushels per acre).

Apply the stepwise selection method to the linear regression of corn yield on the other variables to find the model that “best” fits the data, according to this criterion.

Sample Data

The Corn data set contains information about weather related events and corn yield.

These are the variables in the data set:

Name Type Description
CornYield num corn yield (in bushels per acre)
Year num year
Pre_seasonPrecip num total precipitation (in inches) for the year prior to the start of the growing season
May_Temp num average daily temperature (in degrees Fahrenheit) for May
June_Rain num total rain (in inches) for June
June_Temp num average daily temperature (in degrees Fahrenheit) for June
July_Rain num total rain (in inches) for July
July_Temp num average daily temperature (in degrees Fahrenheit) for July
Aug_Rain num total rain (in inches) for August
Aug_Temp num average daily temperature (in degrees Fahrenheit) for August
Source of Data

This data is sample data from SAS Institute Inc.

Result

Using a significance level of 0.15 for entering and for remaining in the model, the final results of the stepwise selection show the inclusion of only the July_Temp and July_Rain variables in the “best” fit model. So, the resulting linear regression function based on this selection method is

predicted CornYield = 163.47 + 3.49(July_Rain) - 1.68(July_Temp)

Analysis of Variance

One-Way ANOVA

Source: http://support.sas.com/learn/statlib...eg_anova_1.htm

Problem: Perform a one-way ANOVA to determine whether there are significant differences between four drugs in mean increase in systolic blood pressure.

View the first part of the ANOVA table to see the number of levels and observations. Then scroll down to the next page of the output, the ANOVA table.

The p-value of .0001 is less than .05, so there is sufficient evidence to reject the null hypothesis that all means are equal. The F test does not give any specifics about which means are different, only that there is at least one pair of means that is statistically different.
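
The F statistic in the ANOVA table is the ratio of the between-group mean square to the within-group mean square. The Python sketch below computes it from raw group data; the function name `one_way_f` and the blood pressure increases for the four drugs are hypothetical and are not the data used in this example.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means about the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations about their own group mean
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, group_means))

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical increases in systolic blood pressure for four drugs
drug1 = [22, 25, 27, 24, 26]
drug2 = [24, 28, 26, 27, 25]
drug3 = [12, 15, 11, 14, 13]
drug4 = [10, 13, 12, 9, 14]

f_stat, df1, df2 = one_way_f([drug1, drug2, drug3, drug4])
print(round(f_stat, 2), df1, df2)
```

A large F value means the group means differ by more than the within-group variation would explain; the p-value is then read from the F distribution with the two degrees of freedom returned here.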

Scroll down to the results of Tukey's test and the table of groupings.

In the results for Tukey's test, groups are shown with Tukey Grouping letters assigned to them. If the same letter is assigned to any two groups, those groups are not significantly different from each other. This test shows that drugs 1 and 2 are significantly different from 3 and 4.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/anova_1.htm#KB5

Wafers

Determine whether there is a difference between probes in measuring resistance.

Problem

The National Institute of Standards and Technology (NIST) references research involving the doping of silicon wafers with phosphorus. The wafers were doped with phosphorus by neutron transmutation doping in order to have resistivities of 200 ohm·cm. Measurements of bulk resistivity of silicon wafers were made with 5 probing instruments on each of 5 days. The experimenters are interested in testing differences among the instruments.

Complete an ANOVA to determine whether there is a difference between the probes in measuring resistance. If there are differences, then distinguish them by using a test for multiple comparisons.

Sample Data

The Doped_wafers data set gives the resistance measurements for 5 different probing instruments from a study conducted by the National Institute of Standards and Technology.

These are the variables in the data set:

Name        Type  Description
Instrument  num   probing instrument (categorical)
Resistance  num   resistance measurement

Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The value of the F statistic for the ANOVA F test is 1.18 with a corresponding p-value of 0.3494, which is not significant. So, the data does not indicate that there is a difference between the probes in measuring resistance. This is supported by the results from the Tukey test for multiple comparisons. These results were generated by performing the analysis in SAS Enterprise Guide.

Blood Pressure 2

Determine whether four drugs differ in mean increase in systolic blood pressure.

Problem

Suppose that a pharmaceutical company has recently focused on the effect of its potential products on blood pressure. Researchers conducted a study in which the subjects were 72 individuals with one of three diseases. Eighteen individuals were randomly assigned to each of the four drugs. The treatments were administered over time and the increase in systolic blood pressure was recorded.

Perform a one-way ANOVA to determine whether there are significant differences between the mean increases in systolic blood pressure for the four drugs. If there are, you need to determine which drugs differ and how greatly they differ.

Sample Data

The Drug data set contains data from an experiment to evaluate the effect of four different drugs on blood pressure for individuals with one of three possible diseases.

These are the variables in the data set:

Name      Type  Description
Drug      char  drug (1, 2, 3, 4)
Disease   char  disease (A, B, C)
BPChange  char  change in systolic blood pressure

Source of Data

This data is sample data from SAS Institute Inc.

Result

The p-value of .0001 is less than .05, so there is sufficient evidence to reject the null hypothesis that all means are equal. The F test does not give any specifics about which means are different, only that there is at least one pair of means that is statistically different.

To compare means, you can use a test such as Tukey's studentized range test (HSD). For example, in the results for Tukey's test, drugs 1 and 2 are significantly different from 3 and 4.

Two-Way ANOVA

Source: http://support.sas.com/learn/statlib...eg_anova_2.htm

Problem: Perform a two-way ANOVA to determine whether stretching and wearing ankle weights have significant effects on exercise.

View the first part of the output to see information about the class level and number of observations.

Scroll down to the ANOVA table for the two-way model. It appears that the model is significant, based on the p-value of 0.0008.

The last two tables of the output give the results of the F-tests for type I and type III sums of squares, which are equivalent in this case. Based on these, with p-values of 0.0045 and 0.0032, we can say that there is sufficient evidence to claim that stretching and wearing ankle weights both have significant effects on exercise benefit (in terms of burning calories).

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/anova_2.htm#KB5

Starch Out

Determine whether sand blasting significantly affects starch removal in jeans.

Problem

Denim manufacturers are concerned with maximizing the comfort of the jeans they make. In the manufacturing process, starch is often built up in the fabric, creating stiff jeans that must be “broken in” by the wearer before they are comfortable. To minimize the break-in time for each pair of pants, the manufacturers often wash the jeans to remove as much starch as possible. In a particular study to evaluate starch removal, data was recorded including the washing method (3 types), size of the load (in pounds), whether or not the fabric was sand blasted prior to washing, and the starch content of the fabric after washing.

Carry out a two-way analysis of variance to examine if the interaction between washing method and whether or not the fabric was sand blasted has a significant effect on the removal of starch.

Sample Data

The Denim data set contains the four columns that are described below.

These are the variables in the data set:

Name            Type  Description
Method          char  method used in washing
Size of Load    num   size in pounds
Sand Blasted    char  whether or not the fabric was sand blasted before washing
Starch Content  num   the starch content of the fabric after washing

Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The value of the F statistic for the interaction effect Method*Sand_blasted_ is 1.324 with a corresponding p-value of 0.271, which is not significant even at a level as high as α = 0.10. So we can conclude that the interaction between washing method and sand blasting does not have a significant effect on the removal of starch from jeans. These results were given by the output from the GLM procedure in SAS.

Treadmill Exercise

Determine whether stretching and wearing ankle weights affect exercise on treadmills.

Problem

An exercise physiologist wants to examine whether stretching and wearing ankle weights affect the value of exercise on treadmills. To carry out her study, she recruits subjects who have roughly the same level of physical fitness, and divides them randomly into four groups: with or without ankle weights, and with or without a stretching period before the exercise.

Using the amount of calories burned as the response, carry out a two-way ANOVA to determine whether stretching and wearing ankle weights have significant effects on exercise.

Sample Data

The Exercise data set contains data about the effects of stretching and wearing ankle weights on treadmill exercise.

These are the variables in the data set:

Name          Type  Description
PreStretch    char  stretch group (Stretch, No stretch)
AnkleWeights  char  weights group (Weights, No weights)
Energy        num   calories burned
Speed         num   average speed (in meters per minute)
Oxygen        num   oxygen consumed (in liters)

Source of Data

This data is sample data from SAS Institute Inc.

Result

It appears that the two-way model is significant, based on the p-value of 0.0008. The results of the F-tests for type I and type III sums of squares are equivalent in this case. Based on these, with p-values of 0.0045 and 0.0032, we can say that there is sufficient evidence to claim that stretching and wearing ankle weights both have significant effects on exercise benefit (in terms of burning calories).

Mixed Models

Source: http://support.sas.com/learn/statlib...eg_anova_3.htm

Problem: Perform an ANOVA for a mixed model to determine whether the data indicates a difference between irrigation methods on the harvest of orange trees.

Scroll down to view the results of the F-test for the irrigation effect. The Type III test for the effect of irrigation method yields an F statistic of 3.27 with a corresponding p-value of 0.0254. So at a level of α = 0.05, we can conclude that the effect of irrigation method on the harvest of orange trees at this grove is significant.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/anova_3.htm#KB5

Orange Grove Irrigation

Determine whether there is a difference between irrigation methods on the harvest of orange trees.

Problem

A randomized complete block design study was performed to investigate the effects of irrigation method (with 5 levels) on the harvest of orange trees in a particular grove. The grove was divided into eight blocks to account for local variation in the grove, and irrigation methods were assigned to trees within each block at random. Each method appeared in every block. At harvest, the fruit from the trees was weighed.

Considering the blocking factor to be a random effect, carry out an ANOVA with a mixed model. Use the results from the appropriate F-test to decide whether the irrigation method effect is significant.

Sample Data

The Methods data set contains data about five irrigation methods that were used on an orange grove, and the weight of the fruit at harvest.

These are the variables in the data set:

Name     Type  Description
irrig    char  irrigation method
block    num   block within grove (1 – 8)
fruitwt  num   weight of fruit at harvest

Source of Data

This data is sample data from SAS Institute Inc.

Result

The Type III test for the effect of irrigation method yields an F statistic of 3.27 with a corresponding p-value of 0.0254. So at a level of α = 0.05, you can conclude that the effect of irrigation method on the harvest of orange trees at this grove is significant.

Wanderers

Construct a mixed model to examine how wandering radius differs between foxes and coyotes.

Problem

Six animals from two species (fox and coyote) were tracked, and the diameter of the area that each animal wandered was recorded. Each animal was measured four times, once per season. The animals were selected randomly from a large population, and the variability from animal to animal is from some unknown distribution. Use a random-effects mixed model, specifying subject(species) as the random effect, to test how wandering radius differs between species. (Recall that the radius is simply half of the diameter.) Base your conclusion on a significance level of α = 0.05.

Sample Data

The Animals data set contains recordings of the diameter of the area wandered by six animals from two species. The individual animals were identified as subjects within each species. Each animal was measured four times, once per season.

These are the variables in the data set:

Name     Type  Description
species  char  animal species (fox or coyote)
subject  num   animal identifier within species (has values 1, 2, 3 for each species)
miles    num   diameter of wandering area
season   char  season of the year the distance was recorded

Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The MIXED procedure in SAS gives an F statistic of 11.89 and a p-value of 0.0261 for the species effect. This outcome is significant at the α = 0.05 level, and thus we can conclude that there is a difference in wandering radius between foxes and coyotes.

Repeated Measures

Source: http://support.sas.com/learn/statlib...eg_anova_4.htm

Problem: Perform a repeated measures ANOVA to determine whether the data suggest that there is a difference in home range areas of foxes and coyotes.

The Type 3 test for the species effect yields an F statistic of 12.71 with a corresponding p-value of 0.0235. So the data suggest that the species effect is significant, and thus we can conclude that there is a difference in home range areas between foxes and coyotes.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/anova_4.htm#KB5

Foxes and Coyotes

Determine whether there is a difference in home range areas of foxes and coyotes.

Problem

Home ranges have been found to vary seasonally for both foxes and coyotes. A study was conducted in a region of central Illinois, in which three red foxes and three coyotes were captured and radio-collared. The animals were tracked throughout the course of a year, and the home range area for each one was recorded (in square miles) at the end of each season.

Carry out a repeated measures ANOVA to determine whether there is a significant difference in the home ranges of the two species.

Sample Data

The Canine data set contains data about the home range area covered by three red foxes and three coyotes each season throughout the course of a year.

These are the variables in the data set:

Name           Type  Description
species        char  species
subject        num   subject
miles          num   home range area (in square miles)
season         char  season
recode_season  num   season code

Source of Data

This data is sample data from SAS Institute Inc.

Result

The Type 3 test for the species effect yields an F statistic of 12.71 with a corresponding p-value of 0.0235. So the data suggest that the species effect is significant, and thus you can conclude that there is a difference in home range areas between foxes and coyotes.

Tests of Association

Pearson Chi-Square

Source: http://support.sas.com/learn/statlib.../eg_asso_1.htm

Problem: Perform the Pearson chi-square test to determine if there are significant differences between age groups for risk of colon cancer.

The value of the Pearson chi-square statistic (labeled Chi-Square) is 28.2566, with a p-value of 0.0004, which can be considered to be very significant. So, we can conclude that there are significant differences among age groups for risk of colon cancer.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/asso_1.htm#KB6

Student Survey 5

Investigate characteristics of college students using survey results.

Problem

To determine the characteristics of the students in their large introductory statistics class, a group of college professors administered a survey to the students in their classes. Some of the questions asked include: the student’s age in years, height (in inches), shoe size, and high school GPA. We all know that there are many physical differences in male and female college students, but what other characteristics differ? Assume that we can consider our sample results to be a random sample from the population of all undergraduate students at this institution.

Exercise:
a. Is there a relationship between the gender of a student and whether the student is registered to vote? Create a two-way table displaying the gender and voter registration status for these students. Does there appear to be a relationship?
b. Calculate Pearson’s Chi-square statistic and the associated p-value for these variables.
c. Explain the null and alternative hypotheses for the Chi-square test in this setting.
d. What conclusion can you draw from the results of the Chi-square test?
e. Conduct the Chi-square test for gender and whether or not the student ate breakfast.
f. Conduct the Chi-square test for gender and whether the student took a commercial flight in the last 30 days.
g. Conduct the Chi-square test for gender and whether the student has seen a play in the past 6 months.

Sample Data

The Survey data is the result of a survey administered to a large introductory statistics class. The data set contains answers from 485 participants. Not all questions were answered, resulting in missing data. Missing data is indicated by a period.

These are the variables in the data set:

Name       Type  Description
Gender     char  gender (male or female) of the respondent
Age        num   age of the subject in years
Textbook   num   answer to the question “How much did you spend for textbooks this term (to nearest dollar)?”
Cigs       num   answer to the question “How many cigarettes did you smoke yesterday?”
ColGPA     num   answer to the question “What is your cumulative Grade Point Average at this institution?”
HSGPA      num   answer to the question “What was your cumulative high school GPA (4 point scale)?”
Height     num   height of the respondent in inches
Mateh      num   the height of the respondent’s “ideal mate”
Shoe       num   the respondent’s shoe size
Breakfast  char  answer to the question “Did you eat breakfast this morning?”
Flight     char  answer to the question “Did you fly on a commercial airline during the past 30 days? (Yes or No)”
Play       char  answer to the question “Have you seen a play in a live theater in the past 6 months? Yes or No”
Vote       char  answer to the question “Are you registered to vote? Yes or No”
Credit     num   answer to the question “How many credit hours are you taking this term?”

Source of Data

This data was collected by Roger Woodard of North Carolina State University in 2005.

Result

a. The two-way table is given below. There does not seem to be much of a relationship: for both males and females, over 86% of the subjects were registered to vote.

Table of gender by vote (each cell: Frequency / Percent / Row Pct / Col Pct)

gender   vote = n                     vote = y                      Total
f        40 / 8.26 / 13.75 / 66.67    251 / 51.86 / 86.25 / 59.20   291 / 60.12
m        20 / 4.13 / 10.36 / 33.33    173 / 35.74 / 89.64 / 40.80   193 / 39.88
Total    60 / 12.40                   424 / 87.60                   484 / 100.00

Frequency Missing = 2


b. The test statistic is 1.2229 and the p-value is 0.2688. 
c. The null hypothesis is that gender and voter registration are independent. The alternative hypothesis is that the variables are not independent.
d. Based on the large p-value, we see that we cannot reject the null hypothesis. Therefore, we cannot reject the idea that gender and voter registration are independent.
e. The test statistic is 5.7322 and p-value is 0.0167. We can reject the null hypothesis at the 5% level (but not the 1% level). We can conclude that there is evidence that eating breakfast and gender are related among the students at this university.
f. The test statistic is 1.1028 and p-value is 0.2937. We cannot reject the null hypothesis. Therefore there is not enough evidence to conclude there is a relationship between gender and taking a flight among the students at this university.
g. The test statistic is 9.1603 and p-value is 0.0025. We can reject the null hypothesis at both the 5% and 1% level. We can conclude that there is evidence that seeing a play and gender are related among the students at this university.
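
The answer to part b can be checked without SAS. The following Python sketch computes the Pearson chi-square statistic from the gender-by-vote counts reported in part a, and the p-value for 1 degree of freedom (the case for a 2 x 2 table). The function names are assumptions made for this illustration; the results should be close to the reported 1.2229 and 0.2688.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Rows: f, m. Columns: vote = n, vote = y (counts from the two-way table in part a).
gender_by_vote = [[40, 251], [20, 173]]
chi2 = pearson_chi_square(gender_by_vote)
p_value = chi_square_p_value_df1(chi2)
print(round(chi2, 4), round(p_value, 4))
```

The closed-form p-value applies only to 1 degree of freedom; larger tables need the general chi-square distribution from a statistics library.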

Colon Cancer

Determine whether risk of colon cancer differs between age groups.

Problem

A colonoscopy screening study was performed on individuals who were considered to be at a high risk of colon cancer, due to adenoma findings in previous examinations. The data recorded from the screening were the variables Finding (coded 0 for negative examination, 1 for small adenoma, and 2 for large adenoma) and Age (rounded: 30-39 years coded as 35, 40-49 years coded as 45, and so on).

Using the Pearson chi-square test, determine whether there are significant differences among age groups for risk of colon cancer. Give the chi-square test statistic and the p-value for the test.

Sample Data

The Colonoscopy data set contains data from a colonoscopy screening study on individuals considered to be at high risk of colon cancer.

These are the variables in the data set:

Name     Type  Description
Finding  num   finding (0 for negative examination, 1 for small adenoma, and 2 for large adenoma)
Age      num   age (rounded to the midpoint of each decade; e.g., ages 30-39 years coded as 35, ages 40-49 years coded as 45)

Source of Data

This data is sample data from SAS Institute Inc.

Result

The value of the Pearson chi-square statistic is 28.2566, with a p-value of 0.0004, which can be considered to be very significant. So, we can conclude that there are significant differences among age groups for risk of colon cancer.

Likelihood Ratio Chi-Square

Source: http://support.sas.com/learn/statlib.../eg_asso_2.htm

Problem: Use the likelihood ratio chi-square test to determine if there is an association between coffee consumption and incidence of pancreatic cancer.

The value of the likelihood ratio chi-square statistic is 7.6195, with a p-value of 0.0546, which is not significant at the α = 0.05 level.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/asso_2.htm#KB6

Pancreatic Cancer

Determine whether coffee consumption is associated with the incidence of pancreatic cancer.

Problem

A case-control study was carried out to investigate the existence of a relationship between coffee consumption and pancreatic cancer. The data provided is for male subjects, with Outcome as a categorical variable representing whether or not a subject had cancer, and DailyCoffee as a continuous variable representing the daily coffee consumption of a subject.

Using the likelihood ratio chi-square test, determine whether the data indicates an association between coffee consumption and incidence of pancreatic cancer. Give the chi-square test statistic and the p-value for the test.

Sample Data

The Coffee data set contains data about coffee consumption by individuals with and without pancreatic cancer.

These are the variables in the data set:

Name         Type  Description
Outcome      char  indicator of whether individual is a case (pancreatic cancer) or a control (no cancer)
DailyCoffee  num   amount of coffee drunk (0 for none, 1.5 for 1-2 cups per day, 3.5 for 3-4 cups per day, or 5.5 for 5 or more cups per day)
AnyCoffee    char  coded value: if DailyCoffee=0, "No coffee"; otherwise "Some coffee"

Source of Data

This data is sample data from SAS Institute Inc.

Result

The value of the likelihood ratio chi-square statistic is 7.6195, with a p-value of 0.0546, which is not significant at the α = 0.05 level. However, at a significance level of α = 0.10, you could say that the data gives sufficient evidence of an association between coffee consumption and incidence of pancreatic cancer.
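
As a rough cross-check in Python, the likelihood ratio statistic is G² = 2 Σ observed × ln(observed / expected), and its p-value comes from a chi-square distribution. The sketch below assumes 3 degrees of freedom, which is what a table of Outcome (2 levels) by DailyCoffee (4 levels) would give; that pairing is an assumption, since the cell counts are not listed here. The function names and the small table are hypothetical.

```python
import math


def likelihood_ratio_chi_square(table):
    """G^2 = 2 * sum(observed * ln(observed / expected)), skipping empty cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / n
                g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def chi_square_p_value_df3(stat):
    """Closed-form upper-tail p-value of a chi-square statistic with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(stat / 2.0))
            + math.sqrt(2.0 * stat / math.pi) * math.exp(-stat / 2.0))


# Hypothetical 2 x 4 table of counts, only to show how the statistic is computed.
print(round(likelihood_ratio_chi_square([[9, 20, 25, 30], [30, 25, 20, 10]]), 4))

# p-value for the statistic reported in the result above, compared against two levels.
p_value = chi_square_p_value_df3(7.6195)
for alpha in (0.05, 0.10):
    print(alpha, "reject" if p_value < alpha else "fail to reject")
```

The same statistic leads to different decisions at α = 0.05 and α = 0.10, which is why the significance level should be fixed before looking at the output.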

Survivors

Determine whether there is a difference in survival rate among classes aboard the Titanic.

Problem

Information was gathered on the passengers of the RMS Titanic and recorded in a dataset. The four variables represent the class (first, second, third, and crew), age, sex, and survival status (yes or no) for each passenger. Test the hypothesis that there is no difference in the survival rate among classes by using the likelihood ratio chi-square test with a significance level of α = 0.01.

Sample Data

The Titanic data set contains information on the passengers of the RMS Titanic including class level, age, sex, and survival status.

These are the variables in the data set:

Name      Type  Description
Class     char  class level of passenger (first, second, third, or crew)
Age       char  age classification of passenger (adult or child)
Sex       char  sex (male or female)
Survived  char  survival status (yes or no)

Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

Based on output from the FREQ procedure in SAS, the value of the likelihood ratio chi-square statistic is 180.9014, and the p-value for the test is < 0.0001, which is significant at the α = 0.01 level. Hence, there is very strong evidence of a difference in the survival rate among the classes aboard the RMS Titanic.

Mantel-Haenszel Chi-Square

Source: http://support.sas.com/learn/statlib.../eg_asso_3.htm

Problem: Use the Mantel-Haenszel chi-square test to determine if there is a relationship between variety of cotton and the distance between planting rows.

The value of the Mantel-Haenszel chi-square statistic is 0.0376, with a p-value of 0.8462, so we can conclude that the test is not significant.  Hence, we fail to reject the null hypothesis that there is no relationship between variety of cotton and distance between planting rows.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/asso_3.htm#KB6

Popularity 2

Investigate whether there is a gender difference in the perceived importance of good grades.

Problem

M.A. Chase and G.M. Dummer conducted a study in 1992 to determine what traits children regarded as important to popularity. Demographic information was recorded, as well as the rating given to four traits assessing their importance to popularity: Grades, Sports, Looks, and Money. The rating scale was from 1 to 4, with 1 being the most important of the four options, 4 being the least.

Determine whether the importance given to making good grades differs by gender by carrying out the Mantel-Haenszel chi-square test with a significance level of α = 0.10.

Sample Data

The Childrens_popularity data contains demographic information and ratings given to four traits assessing their importance to popularity – grades, sports, looks, and money – for students in grades 4 through 6.

These are the variables in the data set:

Name         Type  Description
Gender       char  gender of student (boy or girl)
Grade        num   grade level of student
Age          num   age in years
Race         char  race (white or other)
Urban_Rural  char  type of residence area (rural, suburban, urban)
School       char  school student attends
Goals        char  area student strives for (grades, sports, popular)
Grades       num   rating on importance of grades (1=most important, 2, 3, 4=least important)
Sports       num   rating on importance of sports (1=most important, 2, 3, 4=least important)
Looks        num   rating on importance of looks (1=most important, 2, 3, 4=least important)
Money        num   rating on importance of money (1=most important, 2, 3, 4=least important)

Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

From the FREQ procedure in SAS we find that the value of the Mantel-Haenszel chi-square statistic is 0.4491, with a corresponding p-value of 0.5028, which is not significant at the α = 0.10 level. So, we can conclude that there is not enough evidence to suggest a difference by gender in the importance given to making good grades.

Cotton Plants

Determine whether variety of cotton is related to the distance between planting rows.

Problem

A student in the Department of Crop Science at NC State University was interested in investigating the relationship between cotton variety and the distance between planting rows. She used data from a concurrent experiment which included two levels of variety and two levels of spacing.

Use the Mantel-Haenszel chi-square test to test the null hypothesis of no relationship between variety and spacing. Give the chi-square test statistic and the p-value for the test.

Sample Data

The Cotton data set contains data about various characteristics of cotton plants.

These are the variables in the data set:

Name     Type  Description
variety  num   cotton variety (37, 213)
spacing  num   distance between planting rows
plant    num   plant
bollwt   num   total weight of cotton bolls
lint     num   weight of usable lint

Source of Data

This data is sample data from SAS Institute Inc.

Result

The value of the Mantel-Haenszel chi-square statistic is 0.0376, with a p-value of 0.8462, which is far from significant. Hence, we fail to reject the null hypothesis that there is no relationship between variety of cotton and distance between planting rows.
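
For readers reproducing this outside SAS, the Mantel-Haenszel chi-square reported by the FREQ procedure for a two-way table can be sketched as Q = (n - 1) r², where r is the Pearson correlation between the row and column scores, with 1 degree of freedom. The sketch below is an illustration under that definition. The counts and the scores 1 and 2 are hypothetical, because the Cotton cell counts are not listed here.

```python
import math


def mantel_haenszel_chi_square(table, row_scores, col_scores):
    """Q_MH = (n - 1) * r^2, with r the correlation between row and column scores."""
    n = sum(sum(row) for row in table)
    mean_r = sum(row_scores[i] * sum(row) for i, row in enumerate(table)) / n
    mean_c = sum(col_scores[j] * sum(col) for j, col in enumerate(zip(*table))) / n
    s_rc = s_rr = s_cc = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            dr = row_scores[i] - mean_r
            dc = col_scores[j] - mean_c
            s_rc += count * dr * dc
            s_rr += count * dr * dr
            s_cc += count * dc * dc
    r_squared = s_rc ** 2 / (s_rr * s_cc)
    return (n - 1) * r_squared


# Hypothetical 2 x 2 counts (variety by spacing), not the actual Cotton data.
counts = [[12, 13], [11, 14]]
q_mh = mantel_haenszel_chi_square(counts, row_scores=[1, 2], col_scores=[1, 2])
p_value = math.erfc(math.sqrt(q_mh / 2.0))  # upper tail, 1 degree of freedom
print(round(q_mh, 4), round(p_value, 4))
```

For a 2 x 2 table the result equals the Pearson chi-square multiplied by (n - 1)/n, so the two tests agree closely unless the sample is small.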

Logistic Regression

Simple Binary

Source: http://support.sas.com/learn/statlib...g_logreg_1.htm

Problem: Use simple binary logistic regression to describe the relationship between the temperature at the launch time of space shuttles and O-ring thermal distress.

View the Analysis of Maximum Likelihood Estimates table.

Based on the parameter estimates, we have the logistic regression equation

ln[p/(1-p)] = -15.0429 + 0.2322x,

where p is the probability of no distress and x is the temperature at launch time.

The p-value for the test of significance is 0.0320. Therefore, at a level of α = 0.05, we can say that temperature at launch time is a significant predictor for the probability of no distress.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/logreg_1.htm#KB7

Rainy Days 1

Use simple logistic regression to predict the probability of rain from temperature.

Problem

A weather record was compiled for the month of April for a city in the eastern United States. The amount of rainfall (Precip), temperature (Temp), and barometric pressure (Pressure) were recorded for each of the 30 days of the month. A variable Rained was added to the set of data to categorize rainfall based on the following formula:

Rained = "Rainy" if Precip > 0.02
Rained = "Dry" otherwise
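
If the data are prepared outside SAS, the Rained variable can be derived with a one-line rule. This Python sketch uses a hypothetical function name and hypothetical precipitation values; only the 0.02 cutoff and the two labels come from the definition above.

```python
def classify_rained(precip: float) -> str:
    """Label a day "Rainy" when precipitation exceeds 0.02, otherwise "Dry"."""
    return "Rainy" if precip > 0.02 else "Dry"


# Hypothetical daily precipitation amounts; a value of exactly 0.02 counts as "Dry".
for precip in (0.0, 0.02, 0.05):
    print(precip, classify_rained(precip))
```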

Fit a simple logistic regression function to the data to determine whether temperature is a significant predictor of the probability of rain based on this sample.

Sample Data

The Spring_rain data set is from a weather record for the month of April. The dataset contains information on the temperature, precipitation, and barometric pressure for each of the thirty days in the month. Also a categorical variable Rained is included which categorizes rainfall in the following manner:

Rained = "Rainy" if Precip > 0.02
Rained = "Dry" otherwise

These are the variables in the data set:

Name      Type  Description
date      char  date given as mm/dd
Temp      num   temperature
Precip    num   amount of rainfall
Pressure  num   barometric pressure
Rained    char  categorical: if Precip > 0.02, then Rained = “Rainy”; otherwise Rained = “Dry”

Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

Based on the output generated by the LOGISTIC procedure in SAS, we find that the coefficient on temperature is very small (-0.00863) and the chi-square statistic is not significant (p-value=0.8707). So, we can conclude that temperature alone is not a significant predictor of the probability of rain.

Space Shuttles

Describe the relationship between the space shuttle temperature and O-ring thermal distress.

Problem

In the 23 launches preceding the Challenger mission, data was collected recording the temperature at launch time and the presence or absence of O-ring thermal distress (coded as 0 for no distress, and 1 for distress).

Find the equation for the logistic regression of the presence or absence of O-ring thermal distress on temperature at launch time. Determine whether launch-time temperature is a significant predictor for the probability of no distress.

Sample Data

The O_ring data set contains data about temperature and O-ring thermal distress for the 23 space shuttle launches preceding the Challenger mission.

These are the variables in the data set:

Name
Type
Description
flt num flight number
temp num temperature at launch time
td num indicator of whether or not there was thermal distress during the launch (0 for no distress, 1 for distress)
Source of Data

This data is sample data from SAS Institute Inc.

Result

The probability modeled is for Td = 0 or no distress.

Based on the parameter estimates, we have this logistic regression equation:

ln[p/(1-p)] = -15.0429 + 0.2322x

where p is the probability of no distress and x is the temperature at launch time.

The p-value for the test of significance is 0.0320. Therefore, at a level of α = 0.05, we can say that temperature at launch time is a significant predictor for the probability of no distress.
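
To see what the fitted equation implies, the log-odds can be converted back to a probability with p = 1/(1 + exp(-logit)). The following Python sketch does this for a few hypothetical launch temperatures; the function name is an assumption, and only the two coefficients come from the result above.

```python
import math


def prob_no_distress(temp: float) -> float:
    """Probability of no thermal distress (td = 0) from the fitted logistic equation."""
    logit = -15.0429 + 0.2322 * temp
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical launch temperatures, in the same units as the temp variable.
for temp in (60, 70, 80):
    print(temp, round(prob_no_distress(temp), 3))

# Temperature at which the modeled probability of no distress is exactly one half.
print(round(15.0429 / 0.2322, 2))
```

Because the temperature coefficient is positive, the modeled probability of no distress rises with launch temperature.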

Multiple Binary

Source: http://support.sas.com/learn/statlib...g_logreg_2.htm

Problem: Use multiple binary logistic regression to describe the relationship between purchase level and factors such as age, gender, and income level.

Notice that the probability modeled is for purchase = 1 or “$100 or more.”

Scroll down again to view the Type 3 Analysis of Effects table.

The results of the Wald chi-square test yield a test statistic value of W = 5.9494 and a corresponding p-value = 0.0147 for the gender effect. So, based on these findings we can conclude that gender is a significant predictor for the probability of purchasing $100 or more.
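
The reported p-value can be checked by hand, assuming the gender effect has 1 degree of freedom (that is, gender has two levels; this is an assumption, since the degrees of freedom are not stated here). A minimal Python sketch with a hypothetical function name:

```python
import math


def wald_p_value_df1(w: float) -> float:
    """Upper-tail chi-square p-value for a Wald statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(w / 2.0))


p_value = wald_p_value_df1(5.9494)
print(round(p_value, 4))
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```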

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/logreg_2.htm#KB7

Rainy Days 2

Use multiple logistic regression to predict the probability of rain from temperature and barometric pressure.

Problem

A weather record was compiled for the month of April for a city in the eastern United States. The amount of rainfall (Precip), temperature (Temp), and barometric pressure (Pressure) were recorded for each of the 30 days of the month. A variable Rained was added to the set of data to categorize rainfall based on the following formula:

Rained = "Rainy" if Precip > 0.02, "Dry" otherwise

Find the equation of the logistic regression function fitting the probability of “Dry” against temperature and pressure. Use the coefficient of determination to make a statement about how the model fits the data.

Sample Data

The Spring_rain data set is from a weather record for the month of April. The data set contains information on the temperature, precipitation, and barometric pressure for each of the thirty days in the month. It also includes a categorical variable, Rained, which categorizes rainfall as follows:

Rained = "Rainy" if Precip > 0.02, "Dry" otherwise

These are the variables in the data set:

Name
Type
Description
date char date given as mm/dd
Temp num temperature
Precip num amount of rainfall
Pressure num barometric pressure
Rained char categorical: if Precip > 0.02, then Rained = “Rainy”; otherwise Rained = “Dry”
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

Using PROC LOGISTIC in SAS, the logistic equation is given by

ln(p/(1 - p)) = -445.8 - 0.0551*Temp + 15.3079*Pressure

which is equivalent to

p = 1/(1 + exp( - (-445.8 - 0.0551*Temp + 15.3079*Pressure)))

where p is the prediction probability of “Dry.”

The coefficient of determination has a value of r^2 = 0.3555, which means that only about 36% of the variability in the probability of “Dry” is explained by its regression on temperature and barometric pressure.
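
The two forms of the equation describe the same model: one gives the log-odds of “Dry” and the other gives the probability itself. The Python sketch below checks this numerically. The function names and the example day are assumptions for illustration and are not taken from the Spring_rain data.

```python
import math

def log_odds_dry(temp, pressure):
    """Linear predictor from the fitted equation (log-odds of "Dry")."""
    return -445.8 - 0.0551 * temp + 15.3079 * pressure

def prob_dry(temp, pressure):
    """Probability of "Dry", obtained by inverting the logit."""
    return 1.0 / (1.0 + math.exp(-log_odds_dry(temp, pressure)))

# Hypothetical day: these values are illustrative, not observations from the data set.
temp, pressure = 50.0, 29.3
p = prob_dry(temp, pressure)

# Applying ln(p/(1-p)) to the probability recovers the linear predictor.
recovered = math.log(p / (1.0 - p))
print(round(p, 4), round(recovered, 4), round(log_odds_dry(temp, pressure), 4))
```

The signs of the coefficients carry the interpretation: with the other variable held fixed, higher pressure raises the predicted probability of “Dry” and higher temperature lowers it slightly.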

Mail Order Customers 2

Describe the relationship between purchase level and factors such as age, gender, and income level.

Problem

A mail-order company has decided that customers who spend $100 or more on purchases should be the focus of its advertising efforts. To help identify this target group, the company collected information from its customers including purchase level (1 = at least $100, 0 = less than $100), gender, income level, and age.

Using the logistic regression of purchase level on gender, income level, and age, determine whether gender is a significant predictor for the probability of purchasing $100 or more.

Sample Data

The Sales data set contains data about customers of a mail-order company.

These are the variables in the data set:

Name
Type
Description
purchase num customer’s purchase level (1 = at least $100, 0 = less than $100)
age num customer’s age
gender char customer’s gender
income char customer’s income level (Low, Medium, High)
Source of Data

This data is sample data from SAS Institute Inc.

Result

The results of the Wald chi-square test yield a test statistic value of W = 5.9494 and a corresponding p-value = 0.0147 for the gender effect. So, based on these findings we can conclude that gender is a significant predictor for the probability of purchasing $100 or more.
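
If your software reports only the Wald statistic, you can recover the p-value from the chi-square distribution. The Python sketch below assumes the gender effect has 1 degree of freedom (that is, gender has two levels), which the source does not state explicitly. The helper name is hypothetical.

```python
import math

def chi2_sf_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # A chi-square(1) variable is a squared standard normal, so the tail
    # probability can be written with the complementary error function.
    return math.erfc(math.sqrt(statistic / 2.0))

wald = 5.9494  # Wald chi-square for the gender effect, from the result above
p_value = chi2_sf_1df(wald)
print(round(p_value, 4))
```

Under that assumption the computed value should land close to the reported p-value of 0.0147.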

Nonparametric Analyses: Independent Samples

1 Sample

Sign Test

Source: http://support.sas.com/learn/statlib..._nonpara_1.htm

Problem: Perform the sign test to determine if there is a difference between pretest and posttest scores for students in a college course.

The value of the sign test statistic is M = -1, with a p-value of 0.7744. This gives insufficient evidence of a significant difference between pretest and posttest scores for the students.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_1.htm#KB9

College Course Test Scores 1

Determine whether there is a difference between pretest and posttest scores for students.

Problem

An instructor at a community college is interested in examining a set of score changes between a pair of tests given in one of his college courses. He issued a pretest on the first day of class, and after a few weeks of lecture he issued a posttest on the same material. The instructor recorded the scores on both tests, as well as the difference in scores (posttest-pretest) for each student.

Carry out the sign test to determine whether there is evidence that the scores on the posttest were different than those on the pretest for the students.

Sample Data

The Score data set contains data about pretest and posttest scores for students in a college course.

These are the variables in the data set:

Name
Type
Description
Student char student
PreTest num pretest score
PostTest num posttest score
ScoreChange num difference between posttest score and pretest score
Source of Data

This data is sample data from SAS Institute Inc.

Result

The value of the sign test statistic is M = -1, with a p-value of 0.7744. This gives insufficient evidence of a significant difference between pretest and posttest scores for the students.
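
The sign test uses only the signs of the nonzero differences. The Python sketch below computes M = (n+ - n-)/2 and an exact two-sided p-value from the binomial distribution. The list of differences is hypothetical: it is not the Score data, only a set of values with 5 positive and 7 negative differences, which is one configuration consistent with the reported M = -1.

```python
from math import comb

def sign_test(differences):
    """Return (M, two-sided exact p-value) for the sign test; zero differences are dropped."""
    pos = sum(1 for d in differences if d > 0)
    neg = sum(1 for d in differences if d < 0)
    n = pos + neg
    m = (pos - neg) / 2
    # Under H0 the number of positive signs is Binomial(n, 0.5).
    k = min(pos, neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return m, min(1.0, 2 * tail)

# Hypothetical score changes: 5 positive and 7 negative values.
differences = [3, 1, 4, 2, 5, -1, -2, -3, -6, -2, -4, -1]
m, p = sign_test(differences)
print(m, round(p, 4))  # -1.0 0.7744
```

With these assumed counts the sketch reproduces both M = -1 and a p-value of 0.7744, which shows how the reported figures fit together.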

Wired 1

Determine whether electrical measurements taken outside and inside a chamber differ.

Problem

An experiment was conducted in which electrical measurements were taken on 24 wiring boards. Each board was measured first when soldering was completed, and again after three weeks in a chamber with a controlled environment of high temperature and humidity.

The Shapiro-Wilk W test for Normality for the data yielded a p-value of 0.0076, indicating that the data are significantly non-Normal. Use the sign test to determine if there is a significant difference between the outside and inside chamber measurements, at the α = 0.01 level.

Sample Data

The Chamber data set represents electrical measurements on 24 wiring boards. Measurements were taken both outside and inside a chamber, and the difference between these measurements (outside – inside) was also recorded.

These are the variables in the data set:

Name
Type
Description
board num identifier for wiring board
outside num electrical measurement taken outside the chamber
inside num electrical measurement taken inside the chamber
diff num the difference between the measurements (outside – inside)
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The analysis variable for this procedure is the variable diff which is calculated as outside – inside. The results of the sign test (as given by the UNIVARIATE procedure in SAS) yield a p-value of 0.0066, which gives evidence of a significant difference between the outside and inside chamber measurements at the α = 0.01 level.

Wilcoxon Signed Rank Test

Source: http://support.sas.com/learn/statlib..._nonpara_2.htm

Problem: Perform the Wilcoxon signed rank test to determine if there is a difference between pretest and posttest scores for students in a college course.

The value of the Wilcoxon signed rank test statistic is S = -8.5, with a p-value of 0.5278. This gives insufficient evidence of a significant difference between pretest and posttest scores for the students.
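
The signed rank statistic S is the sum of the ranks of the positive differences minus its expected value n(n+1)/4, where n is the number of nonzero differences and tied absolute values share an average rank. The Python sketch below computes S only (not the p-value). The function name and the example differences are assumptions for illustration.

```python
def signed_rank_statistic(differences):
    """Return S = (sum of ranks of positive differences) - n(n+1)/4, dropping zeros."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    positive_sum = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    return positive_sum - n * (n + 1) / 4.0

# Hypothetical differences, not the Score data.
print(signed_rank_statistic([1, -2, 3, -4]))
```

A negative S, as in the result above, means the positive differences carry less rank weight than expected under the null hypothesis.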

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_2.htm#KB9

College Course Test Scores 2

Determine whether there is a difference between pretest and posttest scores for students.

Problem

An instructor at a community college is interested in examining a set of score changes between a pair of tests given in one of his college courses. He issued a pretest on the first day of class, and after a few weeks of lecture he issued a posttest on the same material. The instructor recorded the scores on both tests, as well as the difference in scores (posttest-pretest) for each student.

Perform the Wilcoxon signed rank test to determine whether there is evidence that the scores on the posttest were different than those on the pretest for the students.

Sample Data

The Score data set contains data about pretest and posttest scores for students in a college course.

These are the variables in the data set:

Name
Type
Description
Student char student
PreTest num pretest score
PostTest num posttest score
ScoreChange num difference between posttest score and pretest score
Source of Data

This data is sample data from SAS Institute Inc.

Result

The value of the Wilcoxon signed rank test statistic is S = -8.5, with a p-value of 0.5278. This gives insufficient evidence of a significant difference between pretest and posttest scores for the students.

Wired 2

Determine whether electrical measurements taken outside and inside a chamber differ.

Problem

An experiment was conducted in which electrical measurements were taken on 24 wiring boards. Each board was measured first when soldering was completed, and again after three weeks in a chamber with a controlled environment of high temperature and humidity.

The Shapiro-Wilk W test for Normality for the data yielded a p-value of 0.0076, indicating that the data are significantly non-Normal. Use the Wilcoxon signed-rank test to determine if there is a significant difference (at level α = 0.05) between the outside and inside chamber measurements.

Sample Data

The Chamber data set represents electrical measurements on 24 wiring boards. Measurements were taken both outside and inside a chamber, and the difference between these measurements (outside – inside) was also recorded.

These are the variables in the data set:

Name
Type
Description
board num identifier for wiring board
outside num electrical measurement taken outside the chamber
inside num electrical measurement taken inside the chamber
diff num the difference between the measurements (outside – inside)
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

Using the UNIVARIATE procedure in SAS, the analysis variable for this procedure is the variable diff, which is calculated as outside – inside. The Wilcoxon signed-rank test gives a p-value of 0.0106, which gives evidence of a significant difference between the outside and inside chamber measurements (at the α = 0.05 level).

2 Sample

Median Test

Source: http://support.sas.com/learn/statlib..._nonpara_3.htm

Problem: Carry out the two-sample median test to determine if referrals received from physicians differs between types of hospice marketing visits.

  1. View the first table in the output to see the median score statistics. 
     
  2. View the results of the two-sample median test for the following hypotheses:

    H0: no difference in change in referrals after one
    month between the two types of visits

    against the two-sided alternative

    Ha: there is a difference in change in referrals after
    one month between the two types of visits

    The median test statistic is M = 6.9375, with a standardized value of z = 0.4275. The p-value for the two-sided test is 0.6690, which fails to give sufficient evidence against the null hypothesis. So, we cannot conclude that there is a difference in the change in referrals after one month between the two types of visits.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_3.htm#KB9

Hoop Ratings

Examine the correspondence between the team ratings of collegiate basketball polls.

Problem

The Basketball data set contains the preseason ratings of collegiate basketball teams for the 1985-86 season as given by ten media outlets. Use the Spearman correlation to give an assessment of the relationship between the team ratings by the UPI and AP polls.

Sample Data

The Basketball data set gives the pre-season ratings of collegiate basketball teams for the 1985-86 season as given by ten media outlets.

These are the variables in the data set:

Name
Type
Description
school char university team represents
CSN num team rating by CSN
Durham Sun num team rating by Durham Sun
Durham Herald num team rating by Durham Herald
Washington Post num team rating by Washington Post
USA Today num team rating by USA Today
Sports Magazine num team rating by Sports Magazine
In Sport num team rating by In Sport
UPI num team rating by UPI
AP num team rating by AP
Sports Illustrated num team rating by Sports Illustrated
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The value of the Spearman correlation coefficient is 0.95. This indicates that there is a very strong positive association between the ratings by the UPI and AP polls. So, we can say that high team ratings by the UPI poll are associated with high ratings by the AP poll, and low ratings by the UPI poll are associated with low ratings by the AP poll.
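
The Spearman coefficient is the Pearson correlation computed on ranks instead of raw values. The Python sketch below shows that calculation with average ranks for ties. The function names and the two short rating lists are hypothetical and are not the Basketball data.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical ratings of five teams by two polls.
poll_a = [1, 2, 3, 4, 5]
poll_b = [2, 1, 4, 3, 5]
print(round(spearman(poll_a, poll_b), 2))
```

Because only ranks enter the calculation, the coefficient measures how consistently the two polls order the teams, not whether their rating scales agree.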

Physician Referrals 1

Determine whether the type of hospice marketing visit affects physician referrals.

Problem

A study was done to investigate the effect of hospice marketing visits on the change in number of referrals received from doctors. There were two types of visits (one accompanied by a physician, and one by the hospice staff only), and the change in referrals after one month (change1) and after three months (change3) were recorded, along with other variables.

Use the median test to determine whether the change in referrals after one month differs for the two types of visits.

Sample Data

The Hospice data set contains data about referrals received from physicians after a visit by a hospice marketing nurse.

These are the variables in the data set:

Name
Type
Description
ID num physician ID
Practice char type of practice
Date char date of visit
Change3 num change in number of referrals after 3 months
Change1 num change in number of referrals after one month
Visit char type of visit
Source of Data

This data is sample data from SAS Institute Inc.

Result

You’re testing the following hypotheses:

H0: no difference in change in referrals after
one month between the two types of visits

against the two-sided alternative

Ha: there is a difference in change in referrals after
one month between the two types of visits

The median test statistic is M = 6.9375, with a standardized value of z = 0.4275. The p-value for the two-sided test is 0.6690, which fails to give sufficient evidence against the null hypothesis. So, we cannot conclude that there is a difference in the change in referrals after one month between the two types of visits.
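
The median test scores each observation as 1 if it lies above the pooled median and 0 otherwise, sums the scores in one group, and standardizes that sum. The Python sketch below follows this recipe with a Normal approximation. It is a simplified illustration: the function name and sample values are hypothetical, and observations tied at the median are simply scored 0, which may differ from how a given package handles such ties.

```python
import math
from statistics import median

def median_test(group_a, group_b):
    """Return (statistic, z, two-sided p) for a two-sample median test."""
    pooled = list(group_a) + list(group_b)
    cutoff = median(pooled)
    # Median scores: 1 above the pooled median, 0 otherwise.
    scores = [1.0 if value > cutoff else 0.0 for value in pooled]
    n_a, n_b, n = len(group_a), len(group_b), len(pooled)
    statistic = sum(scores[:n_a])
    mean_score = sum(scores) / n
    expected = n_a * mean_score
    # Permutation variance of a sum of n_a scores drawn without replacement.
    variance = n_a * n_b / (n * (n - 1)) * sum((s - mean_score) ** 2 for s in scores)
    z = (statistic - expected) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return statistic, z, p

# Hypothetical changes in referrals for two visit types.
statistic, z, p = median_test([1, 2, 3, 4], [5, 6, 7, 8])
print(statistic, round(z, 4), round(p, 4))

# The reported z = 0.4275 maps to a two-sided p-value in the same way.
print(round(math.erfc(0.4275 / math.sqrt(2.0)), 4))
```

The last line shows how the standardized value in the result above relates to its two-sided p-value of about 0.669.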

Adverse Reactions 1

Test whether adverse reaction times differ significantly between two groups of patients.

Problem

The manufacturers of a medication were concerned about adverse reactions in patients that were treated with their drug. Data on adverse reactions was gathered and stored in a file. The duration of the adverse reaction was recorded as the dependent variable. Patients were either given a placebo or received the standard drug regimen.

Test whether there is a significant difference in adverse reaction times between the two groups using the nonparametric median test. Use a significance level of α = 0.10.

Sample Data

The Adverser data set contains information on patients and their adverse reactions to a drug treatment.

These are the variables in the data set:

Name
Type
Description
PATIENT_ID num patient identification number
TREATMENT_GROUP char treatment patient received (placebo or standard drug)
TOTAL_DAILY_DOSE num daily dosage
DAY_ON_DRUG num number of days patient was on treatment
AGE num age
SEX char sex
WEIGHT num weight
ADVERSE_REACTION char type of adverse reaction
RACE char race
ADR_SEVERITY char level of severity of adverse reaction
RELATION_TO_DRUG char relation of adverse reaction to drug
ADR_DURATION char duration of adverse reaction
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The results of the two-sample median test generated by the NPAR1WAY procedure in SAS show a p-value of 0.1132. This outcome is not significant at the α = 0.10 level (since p-value > 0.10). So we can conclude that the data does not provide strong enough evidence to suggest that there is a significant difference in adverse reaction times between the two treatment groups.

Adverse Reactions 2

Test whether adverse reaction times differ significantly between two groups of patients.

Problem

The manufacturers of a medication were concerned about adverse reactions in patients that were treated with their drug. Data on adverse reactions was gathered and stored in a file. The duration of the adverse reaction was recorded as the dependent variable.

Patients were either given a placebo or received the standard drug regimen. Test whether there is a significant difference in adverse reaction times between the two groups using the nonparametric median test. Use a significance level of α = 0.10.

Sample Data

The Adverser data set contains information on patients and their adverse reactions to a drug treatment.

These are the variables in the data set:

Name
Type
Description
PATIENT_ID num patient identification number
TREATMENT_GROUP char treatment patient received (placebo or standard drug)
TOTAL_DAILY_DOSE num daily dosage
DAY_ON_DRUG num number of days patient was on treatment
AGE num age
SEX char sex
WEIGHT num weight
ADVERSE_REACTION char type of adverse reaction
RACE char race
ADR_SEVERITY char level of severity of adverse reaction
RELATION_TO_DRUG char relation of adverse reaction to drug
ADR_DURATION char duration of adverse reaction
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The results of the two-sample median test generated by the NPAR1WAY procedure in SAS show a p-value of 0.1132. This outcome is not significant at the α = 0.10 level (since p-value > 0.10). So we can conclude that the data does not provide strong enough evidence to suggest that there is a significant difference in adverse reaction times between the two treatment groups.

Wilcoxon Rank Sum Test

Source: http://support.sas.com/learn/statlib..._nonpara_4.htm

Problem: Perform a Wilcoxon rank sum test to determine if there is a significant difference between types of hospice marketing visits for referrals received from physicians.

  1. View the first table in the output to see the Wilcoxon rank sums statistics.
     
  2. View the second table. This table gives the results of the Wilcoxon two-sample test for the following hypotheses:

    H0: no difference in change in referrals after
    one month between the two types of visits

    versus the two-sided alternative

    Ha: there is a difference in change in referrals after
    one month between the two types of visits

    The rank sum statistic is S = 378.50. The table gives results for three different tests.  You could choose any of the tests to draw your conclusion, but for our case suppose we wanted to make our inference based on the exact distribution of the rank sum statistic. 
     
  3. Scroll down to the results of the Exact Test.

    Since our alternative is two-sided, we will rely on the two-sided p-value, which is equal to 0.6531.  This does not give sufficient evidence to reject the null hypothesis, so we cannot conclude that there is a difference in the change in referrals after one month between the two types of visits.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_4.htm#KB9

Physician Referrals 2

Determine whether the type of hospice marketing visit affects physician referrals.

Problem

A study was done to investigate the effect of hospice marketing visits on the change in number of referrals received from doctors. For the two types of visits (one accompanied by a physician, and one by the hospice staff only), the change in referrals after one month (change1) and after three months (change3) were recorded, along with other variables.

Perform the Wilcoxon rank sum test to determine whether the change in referrals after one month differs for the two types of visits.

Sample Data

The Hospice data set contains data about referrals received from physicians after a visit by a hospice marketing nurse.

These are the variables in the data set:

Name
Type
Description
ID num physician ID
Practice char type of practice
Date char date of visit
Change3 num change in number of referrals after 3 months
Change1 num change in number of referrals after one month
Visit char type of visit
Source of Data

This data is sample data from SAS Institute Inc.

Result

You’re testing the following hypotheses: 

H0: no difference in change in referrals after
one month between the two types of visits

versus the two-sided alternative

Ha: there is a difference in change in referrals after
one month between the two types of visits

In the results of the Wilcoxon two-sample test for three different tests, the rank sum statistic is S = 378.50. You could choose any of the tests to draw your conclusion. However, in this situation, suppose you want to make your inference based on the exact distribution of the rank sum statistic.

Since your alternative is two-sided, you will rely on the two-sided p-value, which is equal to 0.6531. This does not give sufficient evidence to reject the null hypothesis, so you cannot conclude that there is a difference in the change in referrals after one month between the two types of visits.
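
An exact test of the rank sum works by enumerating every way the pooled ranks could be split between the two groups and counting the splits at least as extreme as the one observed. The Python sketch below does this by brute force, so it is practical only for small samples. The function names and data are hypothetical, and the statistic is taken as the rank sum of the first group, which is an assumption made to keep the example simple.

```python
from itertools import combinations

def midranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def exact_rank_sum_test(group_a, group_b):
    """Return (rank sum of group_a, exact two-sided p-value)."""
    pooled = list(group_a) + list(group_b)
    ranks = midranks(pooled)
    n_a = len(group_a)
    observed = sum(ranks[:n_a])
    expected = n_a * (len(pooled) + 1) / 2.0
    observed_deviation = abs(observed - expected)
    total = 0
    extreme = 0
    # Every subset of n_a ranks is equally likely under the null hypothesis.
    for subset in combinations(ranks, n_a):
        total += 1
        if abs(sum(subset) - expected) >= observed_deviation - 1e-12:
            extreme += 1
    return observed, extreme / total

# Hypothetical samples with complete separation between the groups.
rank_sum, p_exact = exact_rank_sum_test([1, 2, 3], [4, 5, 6])
print(rank_sum, p_exact)
```

With three observations per group there are only 20 possible splits, so even complete separation cannot give a two-sided p-value below 2/20. That is one reason to prefer the exact distribution over an approximation when samples are small.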

Teen Growth

Determine whether there is a significant difference in the heights of 15-year-old males and females.

Problem

A study compiled the heights and weights of 39 teenagers, all age 15. Use the Wilcoxon rank sum test to determine if there is a significant difference (at level α = 0.05) in the mean heights of 15-year-old males and females.

Sample Data

The Htwt15 data set contains the heights and weights of 39 15-year-olds.

These are the variables in the data set:

Name
Type
Description
gender char gender (male or female)
height num height (in inches)
weight num weight (in pounds)
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The NPAR1WAY procedure in SAS gives the results of two approximations to the Wilcoxon – the Normal and the t approximations (with p-values 0.0002 and 0.0006, respectively) – along with the results of the exact test (p-value = 0.000075). All three of these give strong evidence of a significant difference in the mean heights of 15-year-old males and females (since all p-values are less than α = 0.05).

Kruskal-Wallis Test

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/eg_nonpara_5.htm

Problem: Carry out the Kruskal-Wallis test to determine if there is a significant difference between types of hospice marketing visits for referrals received from physicians.

  1. View the first table in the output to see the Wilcoxon rank sums statistics.
     
  2. View the second table. This table gives the results of the Wilcoxon two-sample test.
     
  3. Scroll down to the last table, which gives the results of the Kruskal-Wallis test for the following hypotheses:

    H0: no difference in change in referrals after
    three months between the two types of visits

    versus the two-sided alternative

    Ha: there is a difference in change in referrals after
     three months between the two types of visits

    The Kruskal-Wallis test statistic is K = 1.5991, with a one-sided p-value of 0.2060. The p-value for our two-sided test is 2(0.2060) = 0.4120. This fails to give sufficient evidence against the null hypothesis, so we cannot conclude that there is a difference in the change in referrals after three months between the two types of visits.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_5.htm#KB9

Physician Referrals 3

Determine whether the type of hospice marketing visit affects physician referrals.

Problem

A study was done to investigate the effect of hospice marketing visits on the change in number of referrals received from doctors. There were two types of visits (one accompanied by a physician, and one by the hospice staff only), and the change in referrals after one month (change1) and after three months (change3) were recorded, along with other variables.

Use the Kruskal-Wallis test to determine whether the change in referrals after three months differs for the two types of visits.

Sample Data

The Hospice data set contains data about referrals received from physicians after a visit by a hospice marketing nurse.

These are the variables in the data set:

Name
Type
Description
ID num physician ID
Practice char type of practice
Date char date of visit
Change3 num change in number of referrals after 3 months
Change1 num change in number of referrals after one month
Visit char type of visit
Source of Data

This data is sample data from SAS Institute Inc.

Result

You’re testing the following hypotheses:

H0: no difference in change in referrals after
three months between the two types of visits

versus the two-sided alternative

Ha: there is a difference in change in referrals after
 three months between the two types of visits

The Kruskal-Wallis test statistic is K = 1.5991, with a one-sided p-value of 0.2060. The p-value for our two-sided test is 2(0.2060) = 0.4120. This fails to give sufficient evidence against the null hypothesis, so we cannot conclude that there is a difference in the change in referrals after three months between the two types of visits.

Adverse Reactions 3

Test whether adverse reaction times differ significantly based on the gender of patients.

Problem

The manufacturers of a medication were concerned about adverse reactions in patients that were treated with their drug. Data on adverse reactions was gathered and stored in a file. The duration of the adverse reaction was recorded as the dependent variable.

Patients were either given a placebo or received the standard drug regimen. Test whether there is a difference in adverse reaction times between males and females using the two-sample Kruskal-Wallis test. Use a significance level of α = 0.05.

Sample Data

The Adverser data set contains information on patients and their adverse reactions to a drug treatment.

These are the variables in the data set:

Name
Type
Description
PATIENT_ID num patient identification number
TREATMENT_GROUP char treatment patient received (placebo or standard drug)
TOTAL_DAILY_DOSE num daily dosage
DAY_ON_DRUG num number of days patient was on treatment
AGE num age
SEX char sex
WEIGHT num weight
ADVERSE_REACTION char type of adverse reaction
RACE char race
ADR_SEVERITY char level of severity of adverse reaction
RELATION_TO_DRUG char relation of adverse reaction to drug
ADR_DURATION char duration of adverse reaction
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The results of the Kruskal-Wallis test generated by the NPAR1WAY procedure in SAS show a p-value of 0.6356. This outcome is not significant at the α = 0.05 level. So we can conclude that the data does not provide evidence to suggest that there is a difference in adverse reaction times between males and females.

k Samples

Kruskal-Wallis Test

Source: http://support.sas.com/learn/statlib..._nonpara_8.htm

Problem: Perform the Friedman test to determine if there is a difference in the effect on skin potential for four emotions induced by hypnosis.

The first eight tables in the output give the frequency tables for emotion by skin response for each subject.

Scroll down to the last table to see the results of the Friedman test.

In our setting, the Friedman test statistic is identical to the Cochran-Mantel-Haenszel, Row Mean Scores Differ statistic.  The value of this test statistic is 6.45, with a corresponding p-value of 0.0917.  So at a significance level of α=0.10, we can say that the differences in skin potential for the four emotions are significant.
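
The Friedman statistic ranks the responses within each subject, sums the ranks for each treatment, and compares those sums. The Python sketch below computes the statistic for untied data and converts a statistic to a p-value using a chi-square distribution with 3 degrees of freedom, which assumes four treatments (here, four emotions). The function names and the small data set are hypothetical.

```python
import math

def friedman_statistic(blocks):
    """blocks: one list per subject with that subject's response to each treatment (no ties)."""
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        # Rank the k responses within this subject, 1 = smallest.
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

# Hypothetical data: three subjects who all order three treatments the same way.
print(friedman_statistic([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))

# p-value implied by the reported statistic of 6.45, assuming 3 degrees of freedom.
print(round(chi2_sf_3df(6.45), 4))
```

Under that degrees-of-freedom assumption, the reported statistic of 6.45 corresponds to a p-value close to the 0.0917 given above.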

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_6.htm#KB9

Synthetic Wood Veneer

Determine whether there is a difference in the durability of three brands of synthetic wood veneer.

Problem

An experiment was conducted to investigate the durability of three brands of synthetic wood veneer that is often used in office furniture and on kitchen countertops. Samples of each brand were subjected to a friction test to determine durability. The amount of veneer material that is worn away due to friction was measured. Brands that have small measurements are desirable.

Carry out the Kruskal-Wallis test to see if the data indicate a significant difference in the distributions of wear measurements for the three brands of synthetic wood veneer.

Sample Data

The Veneer data set contains data about the wear recorded for three brands of synthetic wood veneer after they were subjected to a friction test.

These are the variables in the data set:

Name
Type
Description
Wear num measurement of wear
Brand char brand of wood veneer
Source of Data

This data is sample data from SAS Institute Inc.

Result

You’re testing the following hypotheses:

H0: the distributions of wear measurements
for the three brands are all equal

versus the two-sided alternative

Ha: the distributions of wear measurements
for the three brands are not all equal

The Kruskal-Wallis test statistic is K = 7.2558, with a p-value of 0.0266. The outcome is significant at a level α = 0.05, so you have enough evidence to reject the null hypothesis at this level.

Thus, you can conclude that the distributions of wear measurements for the three brands of synthetic wood veneer are not all equal (that is, there is a significant difference in durability between the brands).
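
The Kruskal-Wallis statistic is computed from the rank sums of the groups after ranking all observations together, and its p-value comes from a chi-square distribution with (number of groups - 1) degrees of freedom. The Python sketch below computes the statistic without a tie correction, which is a simplification, and uses the closed form for 2 degrees of freedom that applies when there are three groups. The function names and sample values are hypothetical and are not the Veneer data.

```python
import math

def kruskal_wallis_h(groups):
    """Kruskal-Wallis statistic using average ranks and no tie correction."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    order = sorted(range(n), key=lambda i: pooled[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)

def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

# Hypothetical wear measurements for three brands.
print(round(kruskal_wallis_h([[1, 2], [3, 4], [5, 6]]), 4))

# p-value implied by the reported K = 7.2558 with three brands (2 degrees of freedom).
print(round(chi2_sf_2df(7.2558), 4))
```

For three groups the chi-square tail probability reduces to exp(-K/2), which is how the reported K = 7.2558 leads to a p-value of about 0.0266.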

Adverse Reactions 4

Test whether adverse reaction times differ significantly based on the race of patients.

Problem

The manufacturers of a medication were concerned about adverse reactions in patients that were treated with their drug. Data on adverse reactions was gathered and stored in a file. The duration of the adverse reaction was recorded as the dependent variable.

Patients were either given a placebo or received the standard drug regimen. Test whether there is a difference in adverse reaction times based on the race of patients using the k-sample Kruskal-Wallis test. Use a significance level of α = 0.01.

Sample Data

The Adverser data set contains information on patients and their adverse reactions to a drug treatment.

These are the variables in the data set:

Name
Type
Description
PATIENT_ID num patient identification number
TREATMENT_GROUP char treatment patient received (placebo or standard drug)
TOTAL_DAILY_DOSE num daily dosage
DAY_ON_DRUG num number of days patient was on treatment
AGE num age
SEX char sex
WEIGHT num weight
ADVERSE_REACTION char type of adverse reaction
RACE char race
ADR_SEVERITY char level of severity of adverse reaction
RELATION_TO_DRUG char relation of adverse reaction to drug
ADR_DURATION char duration of adverse reaction
Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The NPAR1WAY procedure in SAS yields a p-value < 0.001, which is significant at the α = 0.01 level. Therefore, we can conclude that there is enough evidence to support the claim that adverse reaction times differ by race.

Nonparametric Analyses: Dependent Samples

2 Sample

Wilcoxon Signed Rank Test

Source: http://support.sas.com/learn/statlib..._nonpara_7.htm

Problem: Perform the Wilcoxon signed rank test to determine if there is a significant difference between pretest and posttest scores for students in a college course.

The value of the signed rank test statistic is S = -8.5, with a p-value of 0.5278. This gives insufficient evidence of a significant difference between pretest and posttest scores for the students.
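
The statistic S reported above is the centered signed rank statistic, S = T+ - n(n+1)/4, where T+ is the sum of the ranks of the positive differences and n is the number of nonzero differences. The Python sketch below computes S under that definition; the score changes are hypothetical values for illustration, not the Score data, and the function names are not part of SAS.

```python
def midranks(values):
    """Rank values 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def signed_rank_s(differences):
    """Centered Wilcoxon signed rank statistic S = T+ - n(n+1)/4."""
    nonzero = [d for d in differences if d != 0]   # zero differences are dropped
    n = len(nonzero)
    ranks = midranks([abs(d) for d in nonzero])    # rank the absolute differences
    t_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    return t_plus - n * (n + 1) / 4.0

# Hypothetical posttest - pretest differences; illustration only.
score_changes = [1.5, -2.0, 3.0, -0.5, 4.0]
s_stat = signed_rank_s(score_changes)
print(f"S = {s_stat}")
```

A value of S near zero means that positive and negative differences carry similar rank weight, which is consistent with a large p-value such as the one reported above.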

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_7.htm#KB9

College Course Test Scores 3

Determine whether there is a difference between pretest and posttest scores.

Problem

An instructor at a community college is interested in examining a set of score changes between a pair of tests given in one of his college courses. He issued a pretest on the first day of class, and after a few weeks of lecture he issued a posttest on the same material. The instructor recorded the scores on both tests, as well as the difference in scores (posttest-pretest) for each student.

Carry out the sign test to determine whether there is evidence that the scores on the posttest were different from those on the pretest for the students.

Sample Data

The Score data set contains data about pretest and posttest scores for students in a college course.

These are the variables in the data set:

Name         Type  Description
Student      char  student
PreTest      num   pretest score
PostTest     num   posttest score
ScoreChange  num   difference between posttest score and pretest score

Source of Data

This data is sample data from SAS Institute Inc.

Result

The value of the sign test statistic is M = -1, with a p-value of 0.7744. This gives insufficient evidence of a significant difference between pretest and posttest scores for the students.
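
The sign test uses only the signs of the nonzero differences: M = (n+ - n-)/2, and the two-sided p-value is a binomial tail probability with success probability 0.5. The Python sketch below is a minimal illustration. The counts of 5 positive and 7 negative differences are an assumption chosen because they reproduce the reported M and p-value; they are not stated in the source and the actual Score data may differ.

```python
from math import comb

def sign_test(n_pos, n_neg):
    """Return (M, two-sided p-value) for the sign test on nonzero differences."""
    n = n_pos + n_neg
    m = (n_pos - n_neg) / 2.0
    smaller = min(n_pos, n_neg)
    # P(X <= smaller) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, j) for j in range(smaller + 1)) / 2 ** n
    return m, min(1.0, 2.0 * tail)

# Assumed counts (not given in the source): 5 positive, 7 negative differences.
m_stat, p_value = sign_test(5, 7)
print(f"M = {m_stat}, p-value = {p_value:.4f}")
```

Under these assumed counts the sketch gives M = -1 and a p-value of about 0.7744, matching the result above.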

Wired 3

Determine whether electrical measurements taken outside and inside a chamber differ.

Problem

An experiment was conducted in which electrical measurements were taken on 24 wiring boards. Each board was measured first when soldering was completed, and again after three weeks in a chamber with a controlled environment of high temperature and humidity.

The Shapiro-Wilk W test for Normality for the data yielded a p-value of 0.0076, indicating that the data are significantly non-Normal. Use the Wilcoxon signed-rank test to determine if there is a significant difference (at level α = 0.05) between the outside and inside chamber measurements.

Sample Data

The Chamber data set represents electrical measurements on 24 wiring boards. Measurements were taken both outside and inside a chamber, and the difference between these measurements (outside – inside) was also recorded.

These are the variables in the data set:

Name     Type  Description
board    num   identifier for wiring board
outside  num   electrical measurement taken outside the chamber
inside   num   electrical measurement taken inside the chamber
diff     num   the difference between the measurements (outside – inside)

Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The two dependent samples for this analysis were the paired outside and inside chamber measurements taken for each board. From the UNIVARIATE procedure in SAS, the Wilcoxon signed-rank test gives a p-value of 0.0106, which gives evidence of a significant difference between the outside and inside chamber measurements.

k Samples

Friedman Test

Source: http://support.sas.com/learn/statlib..._nonpara_8.htm

Problem: Perform the Friedman test to determine if there is a difference in the effect on skin potential for four emotions induced by hypnosis.

The first eight tables in the output give the frequency tables for emotion by skin response for each subject.

Scroll down to the last table to see the results of the Friedman test.

In our setting, the Friedman test statistic is identical to the Cochran-Mantel-Haenszel, Row Mean Scores Differ statistic. The value of this test statistic is 6.45, with a corresponding p-value of 0.0917. So at a significance level of α = 0.10, we can say that the differences in skin potential for the four emotions are significant.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_8.htm#KB9

Skin Potential for Emotions

Determine how hypnosis affects the skin potential for four different emotions.

Problem

A study was conducted to investigate whether hypnosis has the same effect on skin potential for four different emotions. Subjects were asked to display fear, joy, sadness and calmness under hypnosis, and the resulting skin potential (measured in millivolts) was recorded for each emotion.

Use the Friedman test to determine whether the data suggests that there is a difference in the effect on skin potential for the emotions.

Sample Data

The Hypnosis data set contains data about the skin response that was recorded for subjects who were asked to display four different emotions under hypnosis.

These are the variables in the data set:

Name          Type  Description
Emotion       char  emotion (fear, joy, sadness, or calmness)
Subject       num   subject identifier
SkinResponse  num   skin response (in millivolts)

Source of Data

This data is sample data from SAS Institute Inc.

Result

The value of the Friedman test statistic is 6.45, with a corresponding p-value of 0.0917. So at a significance level of α = 0.10, we can say that the differences in skin potential for the four emotions are significant.
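
The Friedman test ranks the treatments within each subject and compares the rank sums across treatments. The Python sketch below computes the statistic for data with no ties within a subject and checks the reported p-value against a chi-square distribution with 4 - 1 = 3 degrees of freedom. The small data set is hypothetical, the function names are illustrative, and the closed-form tail probability applies only to 3 degrees of freedom.

```python
import math

def friedman_statistic(blocks):
    """Friedman statistic; blocks holds one list of k values per subject (no ties)."""
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):   # rank within the subject
            rank_sums[j] += rank
    sum_sq = sum(r ** 2 for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)

def chi2_upper_tail_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    z = x / 2.0
    return math.erfc(math.sqrt(z)) + 2.0 * math.sqrt(z / math.pi) * math.exp(-z)

# Hypothetical responses for 3 subjects under 3 conditions; illustration only.
toy_blocks = [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5], [0.5, 0.9, 1.4]]
q_toy = friedman_statistic(toy_blocks)
print(f"toy Friedman statistic = {q_toy:.2f}")

# Check of the reported result: statistic 6.45 with four emotions (3 df).
p_reported = chi2_upper_tail_df3(6.45)
print(f"p-value for 6.45 on 3 df = {p_reported:.4f}")
```

The second calculation agrees with the reported p-value of 0.0917 to within rounding, which is below α = 0.10 but above α = 0.05.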

Adverse Reactions 5

Test whether the severity of adverse reactions differs based on the reaction's relation to the drug.

Problem

The manufacturers of a medication were concerned about adverse reactions in patients that were treated with their drug. Data on adverse reactions was gathered and stored in a file. The duration of the adverse reaction was recorded as the dependent variable. Patients were either given a placebo or received the standard drug regimen.

Test whether there is a difference in the severity of adverse reactions based on the relation of the reaction to the drug using the Friedman test. Use a significance level of α = 0.05.

Sample Data

The Adverser data set contains information on patients and their adverse reactions to a drug treatment.

These are the variables in the data set:

Name              Type  Description
PATIENT_ID        num   patient identification number
TREATMENT_GROUP   char  treatment patient received (placebo or standard drug)
TOTAL_DAILY_DOSE  num   daily dosage
DAY_ON_DRUG       num   number of days patient was on treatment
AGE               num   age
SEX               char  sex
WEIGHT            num   weight
ADVERSE_REACTION  char  type of adverse reaction
RACE              char  race
ADR_SEVERITY      char  level of severity of adverse reaction
RELATION_TO_DRUG  char  relation of adverse reaction to drug
ADR_DURATION      char  duration of adverse reaction

Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The FREQ procedure in SAS yields a p-value = 0.0061, which is significant at the α = 0.05 level. Therefore, we can conclude that there is enough evidence to support the claim that there is a difference in the severity of adverse reactions based on the relation of the reaction to the drug.

Spearman Correlation

Source: http://support.sas.com/learn/statlib..._nonpara_9.htm

Problem: Determine the Spearman correlation coefficient to measure the strength of the monotonic relationship between arts and economics.

View the first table in the output to see the variables involved in the analysis. View the second table to see the value of the Spearman correlation coefficient for the two variables. 

The correlation value of 0.27926 suggests that there is a weak positive relationship between arts and economics.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_9.htm#KB9

Arts and Economics in Cities

Measure the relationship between arts and economics in western cities.

Problem

In a study to investigate the prevalence of relationships between different socioeconomic factors, 52 western cities were rated by nine criteria: climate and terrain, housing, health care and environment, crime, transportation, education, arts, recreation, and economics. For housing and crime, the lower the rating score, the better. For the remaining seven criteria, the higher the score, the better.

Find the correlation of the arts and economics ratings for this sample data. Based on your value for the Spearman correlation coefficient, would you describe the monotonic relationship between arts and economics as strong or weak, positive or negative?

Sample Data

The Westernrates data set contains data about ratings on nine criteria (climate and terrain, housing, health care and environment, crime, transportation, education, arts, recreation, and economics) for 52 western cities.

These are the variables in the data set:

Name                   Type  Description
City                   char  city
State                  char  state
ClimateTerrain         num   rating of climate and terrain
Housing                num   rating of housing
HealthCareEnvironment  num   rating of health care and environment
Crime                  num   rating of crime
Transportation         num   rating of transportation
Education              num   rating of education
Arts                   num   rating of the arts
Recreation             num   rating of recreation
Economics              num   rating of economics

Source of Data

This data is sample data from SAS Institute Inc.

Result

The correlation value of 0.27926 suggests that there is a weak positive relationship between arts and economics.

Hoop Ratings

Examine the correspondence between the team ratings of collegiate basketball polls.

Problem

The Basketball data set contains the preseason ratings of collegiate basketball teams for the 1985-86 season as given by ten media outlets. Use the Spearman correlation to give an assessment of the relationship between the team ratings by the UPI and AP polls.

Sample Data

The Basketball data set gives the preseason ratings of collegiate basketball teams for the 1985-86 season as given by ten media outlets.

These are the variables in the data set:

Name                Type  Description
school              char  university team represents
CSN                 num   team rating by CSN
Durham Sun          num   team rating by Durham Sun
Durham Herald       num   team rating by Durham Herald
Washington Post     num   team rating by Washington Post
USA Today           num   team rating by USA Today
Sports Magazine     num   team rating by Sports Magazine
In Sport            num   team rating by In Sport
UPI                 num   team rating by UPI
AP                  num   team rating by AP
Sports Illustrated  num   team rating by Sports Illustrated

Source of Data

Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.

Result

The value of the Spearman correlation coefficient (given by the CORR procedure in SAS) is 0.95. This indicates that there is a very strong positive association between the ratings by the UPI and AP polls. So, we can say that high team ratings by the UPI poll are associated with high ratings by the AP poll, and low ratings by the UPI are associated with low ratings by the AP.
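
The Spearman coefficient is the Pearson correlation computed on ranks, so it measures how consistently two rankings move together rather than how close the raw values are to a straight line. The Python sketch below is a minimal illustration with hypothetical ratings; it is not the CORR procedure, and the data are not taken from the Basketball data set.

```python
import math

def midranks(values):
    """Rank values 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the midranks of x and y."""
    rx, ry = midranks(x), midranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)

# Hypothetical ratings of five teams by two polls; illustration only.
poll_a = [1, 2, 3, 4, 5]
poll_b = [2, 1, 3, 4, 5]
rho_polls = spearman_rho(poll_a, poll_b)
print(f"rho = {rho_polls:.2f}")

# A perfectly monotonic but nonlinear relationship still gives rho = 1.
rho_cubic = spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])
print(f"rho (cubic) = {rho_cubic:.2f}")
```

Values close to +1, such as the 0.95 reported for the UPI and AP polls, mean that the two polls order the teams almost identically.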

Survival Analysis

Test of Equality over Strata

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/eg_surv_1.htm

Problem: Perform a test of equality over strata to determine if receiving maintenance therapy is effective in lengthening the time in remission.

View the first part of the output to see the Kaplan-Meier survival estimates. Scroll down to see the rest of the table.

Scroll down again to view the summary statistics for MonitorTime for the control group.

Scroll down again to view survival estimates for the maintenance therapy group. Scroll down to see the rest of the table.

Scroll down again to view summary statistics for MonitorTime for the maintenance therapy group.

Scroll down again to the last table in the output, the results of the Test of Equality over Strata.
 
The likelihood ratio test should be used if the distribution of the event times is found to be exponential. The results from both the log-rank and Wilcoxon tests give no evidence of a difference in survival curves between patients who received the maintenance therapy and those who did not.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/surv_1.htm#KB8

Maintenance Therapy

Determine whether receiving maintenance therapy lengthens remission time.

Problem

A trial was conducted at Stanford University to investigate the effectiveness of maintenance therapy for acute myelogenous leukemia (AML). After being treated by chemotherapy until remission, patients were randomized into two groups: one that received maintenance therapy and a control group that did not. The patients were observed until they suffered a relapse, the event of interest. The event time variable is the length of time in remission, that is, the time from entry into the study (being randomized into a group) until relapse.

Use the Kaplan-Meier method to estimate the proportion of patients in remission for each group. Carry out a test of equality over strata to determine whether there is a significant difference in remission time between the patients who received maintenance therapy and those who did not.

Sample Data

The Aml_survival data set contains data about the efficacy of maintenance therapy for acute myelogenous leukemia (AML).

These are the variables in the data set:

Name         Type  Description
MonitorTime  num   length of time in remission (the time from entry into the study until relapse)
Treatment    char  indicator of whether or not the patient received maintenance therapy
Censored_    num   indicator of whether the patient suffered a relapse (0 = patient did not suffer a relapse,

Source of Data

This data is sample data from SAS Institute Inc.

Result

The likelihood ratio test should be used if the distribution of the event times is found to be exponential. The results from both the log-rank and Wilcoxon tests give no evidence of a difference in remission time between patients who received the maintenance therapy and those who did not.
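
The Kaplan-Meier estimates that underlie these tests can be sketched in a few lines of Python. At each distinct relapse time, the running survival estimate is multiplied by (1 - relapses / number still at risk), and censored patients leave the risk set without lowering the estimate. The data and the event coding (1 = relapse observed, 0 = censored) are assumptions made for this illustration; they are not the Aml_survival data or its coding.

```python
def kaplan_meier(times, events):
    """Return [(time, survival estimate)] at each distinct event time.

    events[i] is 1 if the relapse was observed and 0 if the time is censored
    (a coding assumed for this sketch).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        relapses = 0
        leaving = 0
        while i < len(data) and data[i][0] == t:   # everyone observed at time t
            relapses += data[i][1]
            leaving += 1
            i += 1
        if relapses > 0:
            survival *= 1.0 - relapses / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical remission times in weeks; illustration only.
weeks = [1, 2, 2, 3]
relapsed = [1, 0, 1, 1]
km_curve = kaplan_meier(weeks, relapsed)
for t, s in km_curve:
    print(f"t = {t}: estimated proportion in remission = {s:.2f}")
```

Running the function once per treatment group gives the two step functions that the log-rank and Wilcoxon tests compare.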

Comparing Survival Functions

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/eg_surv_2.htm

Problem: Compare the survival functions for two behavior types by examining the survival curves.

The tables in the output give the Kaplan-Meier survival estimates and the summary statistics for Personality for Type A and Type B personalities. If you like, you can view these tables in their entirety at the Table link at the end of this section.

Scroll down to the graph of the survival distribution functions.
It appears that the survival curve for the Type B behavior group lies above that of the Type A behavior group from time = 0 days to about time = 2850 days. So, we can say that the more relaxed, noncompetitive individuals had a more favorable survival experience for the first 5 years of the study.

Solve Exercises in Your Own Statistical Software

Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/surv_2.htm#KB8

Heart Disease

Compare the effects of two behavior types on the incidence of heart disease.

Problem

A prospective study was performed to investigate the effects of behavior type and smoking habits on heart disease. Participants were followed for nine years, and the time variable of interest was the interval (in days) from entry into the study until the appearance of coronary heart disease. Individuals were classified by two types of behavior on the basis of an interview: Type A is characterized by aggressiveness and competitiveness, and Type B is considered more relaxed and noncompetitive.

Use the plot of the survival curves to determine which behavior type yielded the more favorable “survival experiences” among the participants for the first 5 years (roughly 1825 days) of the study.

Sample Data

The Wcgs data set contains event time and censor variables for 614 participants, as well as measurements of two covariates of interest: smoking behavior at study entry and behavior type.

These are the variables in the data set:

Name         Type  Description
Cigarettes   num   number of cigarettes smoked per day
Personality  char  A (Type A) or B (Type B)
Censor       num   censor (0 or 1)
Time         num   number of days from entry into study until the appearance of coronary heart disease

Source of Data

This data is sample data from SAS Institute Inc.

Result

It appears that the survival curve for the Type B behavior group lies above that of the Type A behavior group from time = 0 days to about time = 2850 days. So, we can say that the more relaxed, noncompetitive individuals had a more favorable survival experience for the first 5 years of the study.

Table: http://support.sas.com/learn/statlibrary/statlib_eg4.2/images/eg_surv_2_tables.htm
