Table of contents
 Story
 SAS Exercises Slide Numbers
 Slides-SAS Exercises
 Spotfire Dashboard-SAS Exercises
 Slides-Spotfire Tutorial
 Spotfire Dashboard-Spotfire Tutorial
 Take a QuickStart Tutorial!
 Distributions
 Charts and Plots
 Descriptive Statistics
 Tests for Normality
 Confidence Intervals
 t-Tests
 Correlation
 Linear Regression
 Analysis of Variance
 Tests of Association
 Logistic Regression
 Nonparametric Analyses: Independent Samples
 Nonparametric Analyses: Dependent Samples
 Survival Analysis
Story
SAS Public Data Sets as Big Data
A recent Information Week article caught my attention: "Big Data Professors Want Your Data Sets: Access to large, relevant data sets is the biggest problem facing business intelligence and data analytics instructors and students, new survey finds."
That article did not offer a solution, so I wanted to add the simulated SAS data sets from the SAS Visual Analytics online demo to the DC Data Community Sandbox I am building here. However, I was told that the ability to download or screen scrape those data sets had recently been disabled to protect the investment made in them. The SAS support folks I spoke with said they would see about making them public. In my previous analysis of SAS Visual Analytics, I was able to download and reuse a small sample of the Toy Store data set.
In searching for other data sets, I found that the SAS Online Resource For Statistics Education provides three functions:
 Learn How In SAS My Note: Same URL as above.
 Download SAS Data My Note: I sent a request to access this. The available data sets are in either SAS or Excel and I downloaded both.
 Solve Exercises in Your Own Statistical Software
The third option says: In this section, you can work through statistics exercises using any statistical software. We've provided sample data and solutions for each exercise, which have been evaluated by a team of experts. Select an analysis from the list at left to see the available exercises.
I organized the list of 10 topics (below) and 30 subtopics (see Research Notes) in the Wiki below to facilitate use of the data sets in Spotfire to recreate the exercises.
 Distributions
 t-Tests
 Correlation
 Linear Regression
 Analysis of Variance
 Tests of Association
 Logistic Regression
 Nonparametric Analyses: Independent Samples
 Nonparametric Analyses: Dependent Samples
 Survival Analysis
I found that Spotfire provides most of the same statistical functions as SAS, but without programming, and allows a data scientist to be very creative in exploring the SAS Education data sets. My previous work with SAS in Excel and Spotfire is provided elsewhere.
I recreated about half of the 30 SAS Online Exercises below using 29 of the 48 data sets. Then I decided to use the 48 SAS Online Exercise data sets with the Spotfire Users Guide to create a SAS Data Sets - Spotfire Tutorial that links to my Spotfire Users Guide Knowledge Base.
The results are shown below in the Slides-Spotfire Tutorial and the Spotfire Dashboard-Spotfire Tutorial for Data Relationships as follows:
 Data Relationships: Linear regression algorithm
 Data Relationships: Spearman R algorithm
 Data Relationships: Anova algorithm
 Data Relationships: Kruskal-Wallis algorithm
 Data Relationships: Chi-square independence test algorithm
Next I want to test the SAS Public Data Sets - Spotfire Tutorial on more than one data set to see if it is a good template for students to use for the other 47 data sets.
Then I want to do a SAS Public Data Sets - Spotfire Tutorial that uses the other Spotfire tools as follows:
 K-means Clustering
 Line Similarity
 Hierarchical Clustering
 Regression Modeling
 Classification Modeling
Then I will be able to say whether Spotfire is able to do what SAS does or more.
IN PROCESS
SAS Exercises Slide Numbers
Data Source: http://support.sas.com/learn/statlib.../top_exer1.htm
Slide Numbers
1 Cover Page
Distributions
2 Charts and Plots
3 Descriptive Statistics
4 Tests for Normality
5 Confidence Intervals
t-Tests
6 One Sample
7 Paired Sample
8 Two Sample
Correlation
9 Graphical Evaluation
10 Pearson Correlation
11 Spearman Correlation
Linear Regression
12 Simple
13 Multiple
14 Polynomial
15 Stepwise
Analysis of Variance
16 One-Way ANOVA
17 Two-Way ANOVA
18 Mixed Models
19 Repeated Measures
Tests of Association
20 Pearson Chi-Square
21 Likelihood Ratio Chi-Square
22 Mantel-Haenszel Chi-Square
Logistic Regression
23 Simple Binary
24 Multiple Binary
Nonparametric Analyses: Independent Samples
25 1 Sample
26 2 Sample
27 k Samples
Nonparametric Analyses: Dependent Samples
28 2 Sample
29 k Samples
Survival Analysis
30 Test of Equality over Strata
31 Comparing Survival Functions
Slides-SAS Exercises
Analysis of Variance
One-Way ANOVA
Two-Way ANOVA
Mixed Models
Repeated Measures
Nonparametric Analyses: Independent Samples
1 Sample
2 Sample
k Samples
Nonparametric Analyses: Dependent Samples
2 Sample
k Samples
Spotfire Dashboard-SAS Exercises
For Internet Explorer users and those wanting full-screen display, use: Web Player. Get Spotfire for iPad App.
Slides-Spotfire Tutorial
Spotfire Dashboard-Spotfire Tutorial
For Internet Explorer users and those wanting full-screen display, use: Web Player. Get Spotfire for iPad App.
Take a QuickStart Tutorial!
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/tut_index.htm
Learn the basics in minutes!
To learn about SAS Enterprise Guide, select a topic below.
My Note: This was done in 2009 and the Slide Show and Try It features no longer seem to work.
Overview
About SAS Enterprise Guide
Source: http://support.sas.com/learn/statlib...g4.2/tut_1.htm
You might have heard about SAS software and thought that you had to be a programmer to use it. Well, you don't! SAS Enterprise Guide has ready-to-use tasks that you can use to create detail, summary, and graphical reports, and to perform statistical analyses. When you use these tasks, SAS Enterprise Guide writes the SAS program for you and sends it to SAS. SAS runs the program and sends the formatted results back to SAS Enterprise Guide. If you need to manipulate your data, SAS Enterprise Guide also has a point-and-click interface that enables you to perform queries without doing any programming.
SAS Enterprise Guide also makes it easy to present the results of the tasks that you can perform. You can create results in several different formats, such as HTML, PDF, and RTF that you paste into a Microsoft Word or Microsoft PowerPoint document.
Slide Show: The Workspace
Source: http://support.sas.com/learn/statlib...g4.2/tut_2.htm
Working with Data
Types of Data in SAS Enterprise Guide
Source: http://support.sas.com/learn/statlib...g4.2/tut_3.htm
Before you can use tasks, you need to add the data that you want to analyze to your project. In addition to SAS data files, SAS Enterprise Guide can read most PC data files such as Microsoft Access, dBASE, Microsoft Excel, Microsoft Exchange, IBM Lotus 1-2-3, Paradox, and HTML files. You can open data that is located on your own computer or on any server that you are authorized to access. When you open data, a shortcut to the data is automatically added to the project and the data opens in a data grid. By default, the data opens in read-only mode. In the Process Flow window that is shown below, there are shortcuts to a SAS data set, an Excel data file, and a text file. The text file has been imported to create a SAS data set.
In addition to opening existing data, you can also work with data in the following ways:

Understanding Data Properties
Source: http://support.sas.com/learn/statlib...g4.2/tut_4.htm
Before you start working with data, it's helpful to know something about what your data contains. You can explore your data by viewing it in the data grid or by viewing the properties of the data. You can view the properties of data by right-clicking the data object in the Project Explorer or Process Flow window and selecting Properties from the pop-up menu.
One of the important things to know about your data is what type of data each column (or variable) contains. SAS reads the data in each column as either character data or numeric data. If the data in a column contains letters, it is character data. If the data in a column contains numbers, it can be either character or numeric. Numeric data is grouped into four different types of data depending on how it is displayed. The table below shows the icon that is associated with each type of data. These icons appear in the column headings of the data grid.
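Variable Type  Icon  Data Type
Character  (icon not shown)  character
Numeric  (icon not shown)  numeric
Numeric  (icon not shown)  date
Numeric  (icon not shown)  time
Numeric  (icon not shown)  currency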
You also see these icons when you run a task. The icons help you know what things you can do with the columns, or variables, in your task.
Slide Show: Adding Data to a Project
Source: http://support.sas.com/learn/statlib...g4.2/tut_5.htm
Try It: Add Data and View Properties
Source: http://support.sas.com/learn/statlib...g4.2/tut_6.htm
Using Tasks
About SAS Tasks
Source: http://support.sas.com/learn/statlib...g4.2/tut_7.htm
After you have data in your project, you generally want to work with it in some way. In SAS Enterprise Guide, you use tasks to do everything from manipulating data, to running specific analytical procedures, to creating reports. Tasks are based on SAS programming procedures, and as you make selections in a task window, SAS Enterprise Guide writes a program to send to SAS. Each task has its own window. For example, if you want to create a bar chart, you select the data that you want to use, and then select the Bar Chart task. Some tasks have wizard versions that simplify the task even more for you.
In each task window, there are certain steps that you must complete before you can run the task. For example, you must specify which variables you want to analyze and how you want to analyze them. After that, you can select from a variety of options that pertain to the particular task. The most common options for each task are selected for you, so after you've specified the information that is necessary to run the task, the Run button becomes available and you can run the task and get the default results. 
Deciding Which Task to Use
Source: http://support.sas.com/learn/statlib...g4.2/tut_8.htm
You might be wondering how you know which task to use. Most of the time the name of the task will be enough to help you choose, but if you aren't sure, you can view a description and business example for each SAS task in the "What is" window. You can open this window by selecting View > What is. As you move your mouse pointer over a task in the Task List, a description of that task appears in the "What is" window. The image below shows the "What is" window with a description for the Summary Tables task.
If you need help using a task, you can press F1 to open the Help window with specific help for the task. In addition, you can view a short description of each option that you can select in a task by moving your mouse pointer over the option and reading the description in the help pane at the bottom of the task window.
Slide Show: Opening Tasks
Source: http://support.sas.com/learn/statlib...g4.2/tut_9.htm
Slide Show: A Typical Task Window
Source: http://support.sas.com/learn/statlib...4.2/tut_10.htm
Try It: Create a List Report
Source: http://support.sas.com/learn/statlib...4.2/tut_11.htm
Working with Results
Changing the Result Format and Style
Source: http://support.sas.com/learn/statlib...4.2/tut_12.htm
The default report type that SAS Enterprise Guide creates is an HTML report that uses the EGDefault style. Suppose you want to create an HTML report that uses a different style and you also want to generate the same report in PDF format. You don't have to start over to do this; you just need to change the style of HTML results and select an additional report format to generate. You can do this for all the tasks that you run by using the Options window. When you make changes in the Options window, you change the default settings. You can also override the default settings for an individual task by changing options in the Properties window for the task.
When you run a task, SAS Enterprise Guide creates HTML output that is part of a SAS Enterprise Guide project. If you want to make this HTML report available on the Web, you can export the HTML file to a location on your computer or on a server. You can also automatically export the file each time you run the project so that your reports are based on the most up-to-date data.
Try It: Create PDF and Export Results
Source: http://support.sas.com/learn/statlib...4.2/tut_13.htm
Distributions
Charts and Plots
Source: http://support.sas.com/learn/statlib...eg_dist_1a.htm
Problem: Create box plots to display the distributions of the muzzle velocities of cartridges made from types of gunpowder.
It seems that the use of powder 2 generally results in a higher muzzle velocity, because the overall distribution of its values is higher.
Problem: Create a pie chart to show the distribution of car type for data on various car models.
It appears that the medium car type occurred with the most frequency in the sample. For small differences such as this, you may need to refer to the labels.
Problem: Create a bar chart to show the distribution of car type for data on various car models.
It appears that the medium car type occurred with the most frequency in the sample.
Problem: Create a stem plot for the total number of arrests in the U.S. over a specified time period.
To get the actual value in the data set represented by each stem-leaf combination, read the pair as stem.leaf and multiply it by 1,000,000.
Problem: Create a normal quantile plot to examine the normality of cholesterol levels.
The points on the q-q plot do not display an overall strong departure from a linear pattern, so we can conclude that the total cholesterol levels could be normally distributed.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/dist_1.htm#KB1
Hot Dogs (Bar Chart)
Create a bar chart to display the distribution of hot dog type.
Problem
An investigation of the taste and nutritional content of three types of hot dogs was carried out. The types of franks included in the study were categorized as beef, meat, and poultry.
Construct a bar chart of hot dog type and use it to compare the type frequencies.
Sample Data
The Hot_dogs data set is from a study on the taste and nutritional content of hot dogs. Fifty-four brands of hot dogs were included. Information on the type of hot dog, a description of the taste, weight (ounces), protein content, calories, sodium, and protein fat was collected.
These are the variables in the data set:
Name  Type  Description 
Product_Name  char  brand of hot dog 
Type  char  type of hot dog (beef, meat, poultry) 
Taste  char  description of taste (bland, medium, scrumptious) 
_oz  num  weight of frank (ounces) 
_lb_Protein  num  protein content 
Calories  num  caloric content 
Sodium  num  sodium content 
Protein_Fat  num  protein from fat 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
Using the variable Type as the categorical variable to chart for this problem, the horizontal axis of the bar chart is divided into three classes, one for each hot dog type. The vertical axis indicates the frequency (count) of the hot dog types. Vertical bars are constructed for each class, with the height of each bar representing the class frequency. It turns out that the meat and poultry types had the same hot dog count (17 each), while the beef type was the most common in the study (20 hot dogs).
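The counting step behind this bar chart can be sketched in a few lines of Python. This is only an illustration: the Type column here is rebuilt from the three counts stated in the Result above (an assumption, since the Hot_dogs file itself is not reproduced on this page), and the function names are hypothetical.

```python
from collections import Counter

# Type column reconstructed from the counts stated in the Result
# (assumption: the actual Hot_dogs file is not reproduced here).
hot_dog_types = ["beef"] * 20 + ["meat"] * 17 + ["poultry"] * 17


def frequency_table(values):
    """Return category -> count; these counts are the bar heights."""
    return dict(Counter(values))


def text_bar_chart(freq):
    """Render one row per category (sorted by name), one '#' per hot dog."""
    rows = []
    for category in sorted(freq):
        rows.append(f"{category:<8} {'#' * freq[category]} ({freq[category]})")
    return "\n".join(rows)


freq = frequency_table(hot_dog_types)
print(text_bar_chart(freq))
```

The three counts add up to the 54 brands mentioned in the Sample Data description, which is a quick consistency check on the chart.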
Typing Speed (Stem & Leaf Plot)
Create a stem and leaf plot to display the distribution of typing speeds.
Problem
Three brands of typewriters were tested for typing speed by having expert typists type identical passages of text.
Construct a stem and leaf plot to examine the distribution of typing speeds. Would you describe the shape of the distribution as symmetric, skewed to the left, or skewed to the right?
Sample Data
The Typing_data data set contains the results of a test for typing speed on three different brands of typewriters. The speeds were recorded as words typed per minute.
These are the variables in the data set:
Name  Type  Description 
brand  char  brand of typewriter 
speed  num  typing speed (words per minute) 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
Stem Leaf
6 12
6 668
7 001223
7 779
8 01
8 7
The distribution of typing speeds appears to be skewed to the right.
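As a rough check, the display above can be rebuilt from the 17 speeds it encodes. The sketch below assumes each stem is the tens digit and each leaf the units digit, with every stem split into a low row (leaves 0 to 4) and a high row (leaves 5 to 9); the function name is hypothetical.

```python
from statistics import mean, median

# Speeds decoded from the stem and leaf display (stem = tens, leaf = units).
speeds = [61, 62, 66, 66, 68, 70, 70, 71, 72, 72, 73, 77, 77, 79, 80, 81, 87]


def split_stem_and_leaf(values):
    """Build a split-stem display: leaves 0-4 and 5-9 go on separate rows."""
    rows = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        key = (stem, leaf >= 5)  # False = low half of the stem, True = high half
        rows.setdefault(key, []).append(str(leaf))
    return [f"{stem} {''.join(leaves)}" for (stem, _), leaves in sorted(rows.items())]


plot = split_stem_and_leaf(speeds)
print("\n".join(plot))
print("mean:", round(mean(speeds), 2), "median:", median(speeds))
```

A mean slightly above the median is consistent with a tail stretching toward the higher speeds, although with 17 observations the skew is mild. Note that this sketch omits any half-stem row that has no leaves.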
Student Survey 1 (Histogram)
Investigate characteristics of college students using survey results.
Problem
To determine the characteristics of the students in their large introductory statistics class, a group of college professors administered a survey to the students in their classes. Some of the questions asked include: the student’s age in years, height (in inches), shoe size, and high school GPA.
Create a histogram of the age of these students in years.
a. What shape does this histogram take on?
b. Where is the main cluster of the students’ ages?
c. How old is the oldest student in this class (approximately)?
d. Examine the shape of this histogram. Give an explanation in layman’s terms for why this histogram takes on the shape that it does.
Sample Data
The Survey data set is the result of a survey administered to a large introductory statistics class. The data set contains answers from 485 participants. Not all questions are answered, resulting in missing data. Missing data is indicated by a period.
These are the variables in the data set:
Name  Type  Description 
Gender  char  gender (male or female) of the respondent 
Age  num  age of the subject in years 
Textbook  num  answer to the question “How much did you spend for textbooks this term (to nearest dollar)?” 
Cigs  num  answer to the question “How many cigarettes did you smoke yesterday?” 
ColGPA  num  answer to the question “What is your cumulative Grade Point Average at this institution?” 
HSGPA  num  answer to the question “What was your cumulative high school GPA (4 point scale)?” 
Height  num  height of the respondent in inches 
Mateh  num  the height of the respondent’s “ideal mate” 
Shoe  num  the respondent’s shoe size 
Breakfast  char  answer to the question “Did you eat breakfast this morning?” 
Flight  char  answer to the question “Did you fly on a commercial airline during the past 30 days? (Yes or No)” 
Play  char  answer to the question “Have you seen a play in a live theater in the past 6 months? Yes or No” 
Vote  char  answer to the question “Are you registered to vote? Yes or No” 
Credit  num  answer to the question “How many credit hours are you taking this term?” 
Source of Data
This data was collected by Roger Woodard of North Carolina State University in 2005.
Result
a. The histogram is skewed to the right.
b. The main cluster of ages is in the 19 to 22 year age range.
c. The oldest student is about 60 years of age.
d. This histogram is skewed to the right. This is because the main cluster of students represents traditional students who have gone to college immediately after high school. The age of college students is limited on the low end by the practical limitation of attending college. It is unreasonable to expect many individuals under age 16 to attend college. However, there is no age limit on attending college, so the long right tail of the distribution represents students who have taken a few years longer to complete college, nontraditional students, and lifelong learners.
Student Survey 2 (Box Plot)
Investigate characteristics of college students using survey results.
Problem
To determine the characteristics of the students in their large introductory statistics class, a group of college professors administered a survey to the students in their classes. Some of the questions asked include: the student’s age in years, height (in inches), shoe size and high school GPA.
Create box plots of the heights of these students separated by gender. Compare the shape, location and spread of the height of the groups.
Sample Data
The Survey data set is the result of a survey administered to a large introductory statistics class. The data set contains answers from 485 participants. Not all questions are answered resulting in missing data. Missing data is indicated by a period.
These are the variables in the data set:
Name  Type  Description 
Gender  char  gender (male or female) of the respondent 
Age  num  age of the subject in years 
Textbook  num  answer to the question “How much did you spend for textbooks this term (to nearest dollar)?” 
Cigs  num  answer to the question “How many cigarettes did you smoke yesterday?” 
ColGPA  num  answer to the question “What is your cumulative Grade Point Average at this institution?” 
HSGPA  num  answer to the question “What was your cumulative high school GPA (4 point scale)?” 
Height  num  height of the respondent in inches 
Mateh  num  the height of the respondent’s “ideal mate” 
Shoe  num  the respondent’s shoe size 
Breakfast  char  answer to the question “Did you eat breakfast this morning?” 
Flight  char  answer to the question “Did you fly on a commercial airline during the past 30 days? (Yes or No)” 
Play  char  answer to the question “Have you seen a play in a live theater in the past 6 months? Yes or No” 
Vote  char  answer to the question “Are you registered to vote? Yes or No” 
Credit  num  answer to the question “How many credit hours are you taking this term?” 
Source of Data
This data was collected by Roger Woodard of North Carolina State University in 2005.
Result
Shape: For both genders, the distribution of height is approximately symmetric. Location: The males are centered around 71 inches, while the females are centered at about 65 inches. Spread: Overall, the males are more spread out than the females; however, the spread of the middle 50% (as represented by the box of the box plot) is about the same for both males and females.
Muzzle Velocities 1 (Box Plot)
Display the distributions of muzzle velocities of cartridges made from different types of gunpowder.
Problem
A government bureau in charge of assessing the efficiency of firearms for law enforcement agencies performed an experiment where the muzzle velocities of cartridges made from two types of gunpowder were recorded. The same type of firearm and cartridge was used for both types of gunpowder in the study.
Create box plots for each gunpowder type. Then, based on a comparison of the distributions, decide which gunpowder seems generally to result in a higher muzzle velocity.
Sample Data
The Bullets data set contains data that was collected to determine whether there is a difference in the muzzle velocity of cartridges made from two types of gunpowder.
These are the variables in the data set:
Name  Type  Description 
powder  num  type of gunpowder 
velocity  num  muzzle velocity 
Source of Data
This data is sample data from SAS Institute Inc.
Result
It seems that the use of powder 2 generally results in a higher muzzle velocity, because the overall distribution of its values is higher.
Candy Bars (Pie Chart)
Show the distribution of candy bar brands in data collected to study daily diet.
Problem
The United States Department of Agriculture (USDA) and the Department of Health and Human Services (HHS) have suggested that a daily diet should consist of an appropriate number of calories (among other things), of which 30% or fewer should be calories from fat. For adults consuming 2000 calories per day, this works out to 65 grams of fat or less.
A sample of various candy bars and non-bar candies (such as M&Ms, Skittles, etc.) was collected, and nutritional facts from each candy were recorded including total fat and saturated fat (measured in grams).
Create a pie chart to display the number of candy bars from each brand that were included in the sample.
Sample Data
The Candy data set contains nutritional facts about candy bars and non-bar candies such as M&Ms, Reese's Pieces, Skittles, and Super Hot Tamales.
These are the variables in the data set:
Name  Type  Description 
Brand  char  brand of candy 
Name  char  name of candy 
Serving_pkg  num  servings per package 
Oz_pkg  num  ounces per package 
Calories  num  calories 
Total_fat_g  num  total fat content in grams 
Saturated_fat_g  num  saturated fat content in grams 
Cholesterol_g  num  cholesterol content in grams 
Sodium_mg  num  sodium content in milligrams 
Carbohydrate_g  num  carbohydrate content in grams 
Dietary_fiber_g  num  dietary fiber content in grams 
Sugars_g  num  sugars content in grams 
Protein_g  num  protein content in grams 
Vitamin_A_RDI  num  vitamin A content as a percentage of the RDI (Reference Daily Intake) 
Vitamin_C_RDI  num  vitamin C content as a percentage of the RDI (Reference Daily Intake) 
Calcium_RDI  num  calcium content as a percentage of the RDI (Reference Daily Intake) 
Iron_RDI  num  iron content as a percentage of the RDI (Reference Daily Intake) 
Source of Data
This data is sample data from SAS Institute Inc.
Result
It appears that most of the candy bars in the sample were of the Hershey brand.
Car Types (Bar Chart)
Show the distribution of car type for data on various car models.
Problem
A set of data was collected on 116 cars from different countries, containing information such as weight, gas tank size, turning radius, horsepower and engine displacement.
Create a bar chart to display the distribution of car type (compact, sporty, small, medium, and large).
Sample Data
The Cars data set contains data about cars from different countries.
These are the variables in the data set:
Name  Type  Description 
Model  char  model 
Country  char  country 
Type  char  type (Compact, Small, Medium, Large, Sporty) 
Weight  num  weight 
TurningRadius  num  turning radius 
Displacement  num  engine displacement 
Horsepower  num  horsepower 
GasTank  num  capacity of gas tank 
Source of Data
This data is sample data from SAS Institute Inc.
Result
It appears that the medium car type occurred with the most frequency in the sample.
Arrests (Stem & Leaf Plot)
Plot the total number of arrests in the U.S. over a specified time period.
Problem
A division of the U.S. Department of Justice collected data on the total number of arrests in the United States from the year 1970 to 1999. Variables in the data set include year, total number of arrests, total number of arrests by age group, total population of the U.S. on July 1 of the given year, etc.
Use a stem plot to display the distribution of total number of arrests in the U.S. from 1970 to 1999.
Sample Data
The Totarrests data set contains data about the total number of arrests in the United States from the year 1970 to 1999.
These are the variables in the data set:
Name  Type  Description 
Year  num  year of arrest 
TotalArrests  num  total number of arrests 
AGE1  num  number of arrests for age group 1 
AGE2  num  number of arrests for age group 2 
AGE3  num  number of arrests for age group 3 
AGE4  num  number of arrests for age group 4 
AGE5  num  number of arrests for age group 5 
ArrestRate  num  total arrests per 100 thousand in population 
AGE1rate  num  arrests per 100 thousand in population for age group 1 
AGE2rate  num  arrests per 100 thousand in population for age group 2 
AGE3rate  num  arrests per 100 thousand in population for age group 3 
AGE4rate  num  arrests per 100 thousand in population for age group 4 
AGE5rate  num  arrests per 100 thousand in population for age group 5 
Population  num  total population of the U.S. on July 1 of the given year 
AGE1pop  num  population of the U.S. in age group 1 on July 1 of the given year 
AGE2pop  num  population of the U.S. in age group 2 on July 1 of the given year 
AGE3pop  num  population of the U.S. in age group 3 on July 1 of the given year 
AGE4pop  num  population of the U.S. in age group 4 on July 1 of the given year 
AGE5pop  num  population of the U.S. in age group 5 on July 1 of the given year 
Source of Data
This data is sample data from SAS Institute Inc.
Result
This is the stem plot:
Stem Leaf
15 123
14 00122356
13 8
12 157
11 679
10 22348
9 0136
8 167
To get the actual value in the data set represented by each stem-leaf combination, read the pair as stem.leaf and multiply it by 1,000,000.
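The decoding rule can be sketched as follows, assuming each pair is read as stem.leaf (so stem 15 with leaf 1 stands for 15.1) before multiplying by 1,000,000. The rows are copied from the stem plot above; the function name is hypothetical, and the recovered values are rounded to the nearest 100,000 because that is all the plot retains.

```python
# Rows copied from the stem plot above.
stem_plot = """15 123
14 00122356
13 8
12 157
11 679
10 22348
9 0136
8 167"""


def decode(stem_plot_text, multiplier=1_000_000):
    """Turn each stem-leaf pair into stem.leaf * multiplier (integer math)."""
    values = []
    for row in stem_plot_text.splitlines():
        stem, leaves = row.split()
        for leaf in leaves:
            # (stem * 10 + leaf) / 10 is stem.leaf; multiply first to stay exact.
            values.append((int(stem) * 10 + int(leaf)) * multiplier // 10)
    return values


arrests = decode(stem_plot)
print(len(arrests), min(arrests), max(arrests))
```

The plot holds 30 leaves, one for each year from 1970 to 1999, which matches the time span described in the Problem.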
Cholesterol Levels (Quantiles Plot)
Examine the normality of cholesterol levels.
Problem
In a study to investigate the relationships between various factors and heart disease, blood lipid screenings were conducted on a group of patients. Three months after an initial screening, data was collected from a second screening that included information such as gender, age, weight, total cholesterol, and history of heart disease.
Use a normal q-q plot to inspect the normality of total cholesterol level before we proceed with a simple linear regression with this variable as our response.
Sample Data
The Lipid data set contains data about blood lipid screenings and patient history.
These are the variables in the data set:
Name  Type  Description 
Name  char  name 
Gender  char  gender 
Age  num  age 
Weight  num  weight at first screening 
Cholesterol  num  total cholesterol level at first screening 
Triglycerides  num  triglycerides level at first screening 
HDL  num  HDL level at first screening 
LDL  num  LDL level at first screening 
PercentIdeal  num  percentage of ideal weight at first screening 
Height  num  height 
Skinfold  num  skinfold measurement 
SystolicBP  num  systolic blood pressure 
DiastolicBP  num  diastolic blood pressure 
Weight3  num  weight at 3-month screening 
PercentIdeal3  num  percentage of ideal weight at 3-month screening 
Triglyceride3  num  triglycerides level at 3-month screening 
Cholesterol3  num  total cholesterol level at 3-month screening 
HDL3  num  HDL level at 3-month screening 
LDL3  num  LDL level at 3-month screening 
Exercise  num  exercise 
Coffee  num  coffee consumption (cups per day) 
Smoking  char  smoking behavior (none, quit, cigar, pipes, cigarettes) 
Alcohol  char  alcohol consumption (number of drinks per day) 
HeartDisease  char  history of heart disease 
CholesterolLoss  num  reduction in cholesterol level between first and 3-month screening 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The points on the q-q plot do not display an overall strong departure from a linear pattern, so we can conclude that the total cholesterol levels could be normally distributed.
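For readers working outside SAS, the points of a normal quantile plot can be computed by hand. The sketch below is one possible approach, not the method SAS uses: it pairs each ordered observation with a standard normal quantile at plotting position (i - 0.5) / n, which is one common convention among several, and then measures how close the points are to a straight line. The cholesterol readings are hypothetical values made up for illustration; they are not the Lipid data.

```python
from statistics import NormalDist, mean


def normal_qq_points(sample):
    """Pair each ordered observation with its theoretical normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n; other conventions exist.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def pearson_r(pairs):
    """Correlation of the plotted points; values near 1 mean nearly linear."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical cholesterol readings, for illustration only (not the Lipid data).
cholesterol = [172, 185, 190, 198, 204, 210, 215, 223, 231, 246]
points = normal_qq_points(cholesterol)
r = pearson_r(points)
print(f"q-q correlation: {r:.3f}")
```

A correlation close to 1 corresponds to the "no strong departure from a linear pattern" reading of the plot; it is a rough summary and not a formal test of normality.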
Descriptive Statistics
Source: http://support.sas.com/learn/statlib.../eg_dist_2.htm
Problem: Generate descriptive statistics for the muzzle velocities of cartridges made from types of gunpowder.
The mean and standard deviation of muzzle velocity for powder 1 are 27.6375 and 0.3925648. For powder 2, these statistics are 28.0600 and 0.3062316.
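The same statistics can be computed outside SAS with a short script. The sketch below uses hypothetical velocities made up for illustration (they are not the Bullets data and will not reproduce the numbers above), and it computes quartiles with the median-of-halves rule, which is an assumption: statistical packages use several quartile definitions, so results can differ slightly from SAS output.

```python
from statistics import mean, median, stdev


def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves rule for quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # values below the median position
    upper = ordered[(n + 1) // 2:]   # values above the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Hypothetical velocities, for illustration only (not the Bullets data).
velocities = [27.1, 27.3, 27.5, 27.6, 27.8, 27.9, 28.2, 28.4]

print("mean:", round(mean(velocities), 4))
print("std dev:", round(stdev(velocities), 4))  # sample standard deviation (n - 1)
print("five-number summary:", five_number_summary(velocities))
```

To compare two powders, the same three calls would be run once per group after splitting the velocities by powder type.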
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/dist_2.htm#KB1
Football: Bench Presses and Squats
Find the value of the covariance between bench press and squat.
Problem
A set of data was collected for the Brigham Young football team. Information on players’ positions, heights, weights, percent body fat, neck measurements, and performances on various weight lifts were recorded.
Determine the value of the covariance between bench press and squat, and state what this says about the relationship between the two variables.
Sample Data
The Football data set was collected from the Brigham Young University football program. Various body composition measurements (such as weight, height, percent body fat) and physical performance measurements (such as speed, bench press, and squat) are included in the data set. Note: Some observations have missing data values.
These are the variables in the data set:
Name  Type  Description 
Height  num  height of player (inches) 
Weight  num  weight of player (pounds) 
Fat  num  percent body fat 
Speed  num  evaluation of player’s speed 
Neck  num  neck measurement (inches) 
Bench  num  player’s bench press (pounds) 
Squat  num  player’s squat (pounds) 
LegPress  num  player’s leg press (pounds) 
Position  num  primary position 
Position2  num  secondary position 
Speed2  num  second evaluation of player’s speed 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
Using SAS Enterprise Guide, cov(Bench, Squat) = 3530.19 (in squared pounds, since both variables are measured in pounds). The positive sign indicates that bench press and squat tend to increase together: we would expect that the more a player bench presses, the more he can squat, and vice versa. Covariance shows only the direction of the relationship; its magnitude depends on the units, so it does not by itself measure strength.
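For readers solving this exercise outside SAS Enterprise Guide, the sample covariance can be computed directly from its definition. The Python sketch below is only an illustration: the helper name `sample_covariance` is hypothetical and the bench and squat values are made up (they are not the Football data). It assumes the usual n - 1 denominator and that pairs with missing values have already been dropped.

```python
from statistics import mean

def sample_covariance(x, y):
    """Sample covariance of two equal-length lists, using n - 1 in the denominator."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

# Illustrative values only (pounds); these are NOT the Football data.
bench = [200, 250, 300, 350]
squat = [300, 380, 420, 500]

cov = sample_covariance(bench, squat)
print(cov)  # positive: the two lifts rise together in this toy sample
```

Because the result is in squared pounds, its sign is easier to interpret than its size.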
Muzzle Velocities 1
Find muzzle velocities of cartridges made from different types of gunpowder.
Problem
A government agency, assigned to compare the efficiency of weaponry used by state-level law enforcement with that of weaponry used by the general public, performed an experiment where the muzzle velocities of cartridges made from two types of gunpowder were recorded. The same type of firearm and cartridge was used for both types of gunpowder in the study. However, one type of powder (powder 1) is available for the general public and the other (powder 2) is made available primarily to law enforcement officials.
Determine the mean and standard deviation, and give the five-number summary for each type of gunpowder.
Sample Data
The Bullets data set contains data that was collected to determine whether there is a difference in the muzzle velocity of cartridges made from two types of gunpowder.
These are the variables in the data set:
Name  Type  Description 
powder  num  type of gunpowder 
velocity  num  muzzle velocity 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The mean and standard deviation of muzzle velocity for powder 1 are 27.6375 and 0.3925648. For powder 2, these statistics are 28.0600 and 0.3062316.
For powder 1, this is the five-number summary:
27.10 27.35 27.55 28.05 28.10
For powder 2, this is the five-number summary:
27.60 27.90 28.00 28.30 28.50
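The statistics requested in this exercise (mean, standard deviation, and five-number summary) can be sketched in Python as follows. The velocities below are hypothetical, not the Bullets data, and `five_number_summary` is an illustrative helper. Quartile conventions differ between packages, so the Q1 and Q3 values from this median-of-halves rule may not match SAS output exactly.

```python
from statistics import mean, median, stdev

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    s = sorted(values)
    n = len(s)
    if n < 2:
        raise ValueError("need at least 2 observations")
    half = n // 2
    lower = s[:half]                                  # values below the median position
    upper = s[half + 1:] if n % 2 else s[half:]       # values above the median position
    return (s[0], median(lower), median(s), median(upper), s[-1])

# Hypothetical muzzle velocities for one powder type; NOT the Bullets data.
velocity = [27.1, 27.3, 27.5, 27.6, 27.9, 28.0, 28.1, 28.3]

print(mean(velocity), stdev(velocity))   # sample mean and sample standard deviation
print(five_number_summary(velocity))
```

To reproduce the exercise, the same calls would be run once per value of the powder variable.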
Tests for Normality
Source: http://support.sas.com/learn/statlib.../eg_dist_3.htm
Problem: Test the normality of tree weights for regression analysis.
You can base your conclusion on any of the four test results given in the output table. For instance, using the Shapiro-Wilk test, the outcome is not significant (even at a level as high as α = 0.10), since the p-value = 0.1235. So, we do not have strong enough evidence to reject the null hypothesis of normality. Hence, we cannot claim that the condition for the normality of tree weights is violated (i.e., fail to reject H_{0}).
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/dist_3.htm#KB1
Typing Speed
Determine whether the data for typing speeds are normally distributed.
Problem
Three brands of typewriters were tested for typing speed by having expert typists type identical passages of text.
Perform the Shapiro-Wilk test to test that the typing speeds have a normal distribution.
Sample Data
The Typing_speed data set is from a test in which three brands of typewriters were tested for typing speed by 17 expert typists.
These are the variables in the data set:
Name  Type  Description 
brand  char  brand of typewriter 
speed  num  typing speed (words per minute) 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The output for the Shapiro-Wilk test (generated by using SAS Enterprise Guide) yields a p-value of 0.9214. This indicates that we cannot conclude that the typing speed data are non-normal.
Tree Weights
Test the normality of tree weights for regression analysis.
Problem
A forestry commission once sought a way to accurately estimate the weights of trees without having to go through the damaging process of cutting the trees down to weigh them. The weights and trunk girths of 104 tree specimens were measured, in hopes that girth would be useful in predicting weight.
Begin checking the conditions for the simple linear regression analysis by performing a test for normality for the response, tree weight.
Sample Data
The Tree data set contains data about the weights and trunk girths of 104 tree specimens (eight specimens from each of thirteen rootstocks).
These are the variables in the data set:
Name  Type  Description 
RootStock  char  rootstock (I – XIII) 
TrunkGirth  num  trunk girth of specimen 
Weight  num  weight of specimen 
Source of Data
This data is sample data from SAS Institute Inc.
Result
You can base your conclusion on any of the following four test results: Shapiro-Wilk, Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling. For instance, using the Shapiro-Wilk test, the outcome is not significant (even at a level as high as α = 0.10), since the p-value = 0.1235. So, we do not have strong enough evidence to reject the null hypothesis of normality. Hence, we cannot claim that the condition for the normality of tree weights is violated (i.e., fail to reject H_{0}).
Confidence Intervals
Source: http://support.sas.com/learn/statlib.../eg_dist_4.htm
Problem: Generate summary statistics to determine the 95% confidence interval for mean total fat in candy bars and non-bar candies.
The 95% confidence interval for the mean total fat of candy bars and non-bar candies is [10.55g, 13.19g]; the estimate is 11.87g; and the standard error is 0.6614g. That is, you can be 95% confident that the population mean total fat lies between 10.55 and 13.19 grams, in the sense that about 95% of intervals constructed this way from samples of this size would contain the true mean.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/dist_4.htm#KB1
Fat Content of Candy
Determine the mean total fat in candy bars and non-bar candies.
Problem
The United States Department of Agriculture (USDA) and the Department of Health and Human Services (HHS) have suggested that a daily diet should consist of an appropriate number of calories (among other things), of which 30% or fewer should be calories from fat. For adults consuming 2000 calories per day, this works out to 65 grams of fat or less. A sample of various candy bars and non-bar candies (such as M&Ms, Skittles, etc.) was collected, and nutritional facts from each candy were recorded, including total fat and saturated fat (measured in grams).
Determine the 95% confidence interval for the mean total fat μ for candy bars and non-bar candies. Also identify the estimate and the standard error of the estimate, and interpret the interval.
Sample Data
The Candy data set contains nutritional facts about candy bars and non-bar candies such as M&Ms, Reese's Pieces, Skittles, and Super Hot Tamales.
These are the variables in the data set:
Name  Type  Description 
Brand  char  brand of candy 
Name  char  name of candy 
Serving_pkg  num  servings per package 
Oz_pkg  num  ounces per package 
Calories  num  calories 
Total_fat_g  num  total fat content in grams 
Saturated_fat_g  num  saturated fat content in grams 
Cholesterol_g  num  cholesterol content in grams 
Sodium_mg  num  sodium content in milligrams 
Carbohydrate_g  num  carbohydrate content in grams 
Dietary_fiber_g  num  dietary fiber content in grams 
Sugars_g  num  sugars content in grams 
Protein_g  num  protein content in grams 
Vitamin_A_RDI  num  vitamin A content as a percentage of the RDI (Reference Daily Intake) 
Vitamin_C_RDI  num  vitamin C content as a percentage of the RDI (Reference Daily Intake) 
Calcium_RDI  num  calcium content as a percentage of the RDI (Reference Daily Intake) 
Iron_RDI  num  iron content as a percentage of the RDI (Reference Daily Intake) 
Source of Data
This data is sample data shipped with SAS Enterprise Guide by SAS Institute Inc.
Result
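The 95% confidence interval for the mean total fat of candy bars and non-bar candies is [10.55g, 13.19g]; the estimate is 11.87g; and the standard error is 0.6614g. That is, you can be 95% confident that the population mean total fat lies between 10.55 and 13.19 grams, in the sense that about 95% of intervals constructed this way from samples of this size would contain the true mean.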
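The three reported numbers (interval, estimate, standard error) are tied together by the usual form estimate ± multiplier × standard error. The short Python check below uses only the values quoted in the result; the variable names are illustrative. It recovers the midpoint and the implied multiplier, which should be close to a t critical value slightly above the normal value of 1.96.

```python
# Values quoted in the result above (grams).
lower, upper = 10.55, 13.19   # reported 95% confidence limits
std_error = 0.6614            # reported standard error of the mean

estimate = (lower + upper) / 2      # midpoint of the interval
margin = (upper - lower) / 2        # half-width of the interval
multiplier = margin / std_error     # implied critical value used by the software

print(round(estimate, 2), round(margin, 2), round(multiplier, 3))
```

The midpoint agrees with the reported estimate of 11.87g, which is a quick way to confirm that the interval was copied correctly.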
t-Tests
One Sample
Source: http://support.sas.com/learn/statlib...eg_ttest_1.htm
Problem: Perform a one-sample t-test to determine if bell peppers hang on plants at an angle.
Looking for evidence that peppers hang on plants at angles, you would test the hypotheses
H_{0}: μ = 0 versus H_{a}: μ ≠ 0, where μ is the mean angle at which bell peppers hang. The p-value of 0.0037 is less than .05, so there is sufficient evidence to reject the null hypothesis. Thus, you can conclude that bell peppers hang on plants at an angle.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/ttest_1.htm#KB2
Physics Aptitude
Test whether students' score on a physics test should be higher than 420.
Problem
Over 5000 students were tested on their abilities in Calculus and Physics for a study performed by the Third International Mathematics and Science Study. The testing was separated into four regions of the United States. Some students took the Calculus test, some took the Physics test, and some took both. Assume that the scores represent a random sample for each of the four regions of the U.S.
Test the claim made by Physics teachers that the overall mean U.S. score on the Physics test should be higher than 420. Use a significance level of α = 0.05.
Sample Data
The Scores data set contains the scores on a Calculus and a Physics test given to 1000 students in a particular region of the U.S.
These are the variables in the data set:
Name  Type  Description 
region  num  region of the U.S. the tests were administered in (categorical) 
Calculus_Score  num  score on Calculus test 
Physics_Score  num  score on Physics test 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The p-value for our one-sided test (given by SAS Enterprise Guide), with alternative hypothesis H_{a}: μ > 420, is .09375. This outcome is not significant at the α = 0.05 level. Thus, there is not enough evidence to support the claim that the overall mean U.S. score on the Physics test is higher than 420.
Bell Peppers
Determine whether bell peppers hang on plants at an angle.
Problem
An agricultural company interested in cutting down the time needed to harvest its crops hired a mechanical engineer to design a mechanical harvester for bell peppers. To heighten the precision of his machine, the engineer measured and recorded the angle at which peppers hang on the plant.
Perform a one-sample t-test to determine whether the data gives good evidence that peppers hang on plants at an angle (different from zero).
Sample Data
The Peppers data set contains data about the angle at which peppers hang on the plant.
These are the variables in the data set:
Name  Type  Description 
angle  num  angle at which the pepper hangs on the plant 
Source of Data
This data is sample data from SAS Institute Inc.
Result
You’re testing the following hypotheses:
H_{0}: μ = 0 versus H_{a}: μ ≠ 0,
where μ is the mean angle at which bell peppers hang.
The p-value of 0.0037 is less than .05, so there is sufficient evidence to reject the null hypothesis. Thus, you can conclude that bell peppers hang on plants at an angle.
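The test statistic behind this result is t = (sample mean - hypothesized mean) / (s / sqrt(n)) with n - 1 degrees of freedom. The Python sketch below computes it with the standard library only; the angles are hypothetical (not the Peppers data) and `one_sample_t` is an illustrative name. Turning t into a p-value requires the t distribution, which is left to your statistical software.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(values, mu0=0.0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 observations")
    standard_error = stdev(values) / sqrt(n)   # estimated standard error of the mean
    return (mean(values) - mu0) / standard_error, n - 1

# Hypothetical pepper angles (degrees); NOT the Peppers data.
angles = [3.0, -1.0, 4.0, 2.0, 5.0, 1.0]

t, df = one_sample_t(angles, mu0=0.0)
print(t, df)
```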
Paired Sample
Source: http://support.sas.com/learn/statlib...eg_ttest_2.htm
Problem: Perform a paired-sample t-test to determine whether a medication was successful in reducing blood pressure.
Looking for evidence of a reduction in blood pressure, you would test the hypotheses
H_{0}: μ = 0 versus H_{a}: μ > 0,
where μ is the mean difference in blood pressure (BaselineBP - NewBP).
The p-value given in the output (< 0.0001) is the result for the two-sided test with the alternative hypothesis H_{a}: μ ≠ 0. Because the observed mean difference is in the direction of H_{a}: μ > 0, the result for our one-sided test is half of this value, which is still less than the .05 significance level. Hence, we can conclude that there is strong evidence to suggest that the medication was effective in reducing blood pressure.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/ttest_2.htm#KB2
Arrivals
Determine whether on-time arrivals differ between months.
Problem
The Department of Transportation recorded data that contains the percentage of airlines’ planes that arrived on time in 29 airports. Suppose you want to examine the differences between March and June.
Perform a matched pairs test to determine whether there is a difference in on-time arrivals between the two months. Use a significance level of α = 0.05.
Sample Data
The On_time_arrivals data set gives information on the percentage of on-time arrivals for airplanes of 10 airlines for the months of March, June, and August in the year 1999.
These are the variables in the data set:
Name  Type  Description 
Airline  char  airplane airline 
March_1999  num  percentage of airplanes that arrived on time in March 
June_1999  num  percentage of airplanes that arrived on time in June 
August_1999  num  percentage of airplanes that arrived on time in August 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
Using SAS Enterprise Guide, the p-value for the matched pairs test is found to be 0.0021, which is significant at the α = 0.05 level. Hence, we can conclude that there is a significant difference in on-time arrivals between the months of March and June.
Blood Pressure 1
Determine whether a medication was successful in reducing blood pressure.
Problem
To observe the effectiveness of a medication in reducing blood pressure, an experiment was conducted in which researchers collected data from a random sample of individuals who were considered to have high blood pressure. The diastolic blood pressure of these individuals was recorded, after which they were placed on the medication. One month later, their diastolic pressure was recorded again.
Determine whether the data gives good evidence that the medication was effective in reducing blood pressure by carrying out a test of significance (at level α = 0.05), pairing the two blood pressure recordings for each subject.
Sample Data
The Bloodpressure data set contains data from a random sample of individuals with high blood pressure. Variables include subject, age, an initial measurement of diastolic pressure, and a later measurement of diastolic pressure after one month of medication.
These are the variables in the data set:
Name  Type  Description 
Subject  char  subject code 
Age  num  subject’s age 
BaselineBP  num  subject’s baseline blood pressure 
NewBP  num  subject’s blood pressure one month after starting to take medication 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The mean difference between the two blood pressure recordings is greater than zero, which indicates an overall reduction in blood pressure, on average.
You’re testing the following hypotheses:
H_{o}: μ = 0 versus H_{a}: μ > 0,
where μ is the mean difference in blood pressure (BaselineBP - NewBP).
The p-value is less than 0.0001, which is less than the .05 significance level. Hence, you can conclude that there is strong evidence to suggest that the medication was effective in reducing blood pressure.
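A paired-sample t-test is a one-sample t-test applied to the per-subject differences, so the direction of subtraction determines the sign of the statistic. The Python sketch below follows the BaselineBP - NewBP convention used above; the readings are hypothetical (not the Bloodpressure data) and `paired_t` is an illustrative name.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Return (mean difference, t statistic, df) for differences before - after."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - a for b, a in zip(before, after)]   # one difference per subject
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return mean(diffs), t, n - 1

# Hypothetical diastolic readings; NOT the Bloodpressure data.
baseline_bp = [95, 100, 98, 104]
new_bp = [90, 96, 95, 96]

mean_diff, t, df = paired_t(baseline_bp, new_bp)
print(mean_diff, t, df)  # a positive mean difference means pressure went down
```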
Two Sample
Problem: Perform a two-sample t-test to compare the median values of owner-occupied homes in two groups of housing tracts.
Source: http://support.sas.com/learn/statlib...eg_ttest_3.htm

The second table in the TTEST Procedure output gives the result of the two-sample t-test. Looking for evidence that the median home value for homes near the Charles River is greater than that for homes farther away, we would test the hypotheses
H_{0}: μ = 0 versus H_{a}: μ < 0, where μ is the difference in mean median home value (Far - Near). For more accuracy, we will rely on the results of the Satterthwaite method, which accounts for the possibility of unequal variances between the two populations (which is tested by the F-test in the bottom table).
The t statistic has a value of -3.11, which indicates evidence that the median home value for “near homes” is greater than the median home value for “far homes.”
The p-value given in the table (0.0036) is for a two-sided test. To obtain the p-value for our one-sided test, we would divide this by two, which would result in an outcome that is still significant at the α = 0.05 level (p-value = 0.0018).
So, we can conclude that on average, the median value of owner-occupied homes near the Charles River is greater than that of owner-occupied homes farther from the river.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/ttest_3.htm#KB2
Value of Owner-Occupied Homes
Compare the median values of owner-occupied homes in two groups of housing tracts.
Problem
A census study was carried out, in which data was collected from 506 housing tracts in the Boston area. Some of the variables of interest for each tract were median home value, crime rate, relative distance from the Charles River (near or far), percentage of lower economic status families, and so on.
Use a two-sample t-test (with significance level α = 0.05) to determine whether, on average, the median value of owner-occupied homes near the Charles River is greater than that of homes farther away from it.
Sample Data
The Bostonhousing data set contains census information for 506 housing tracts in the Boston area.
These are the variables in the data set:
Name  Type  Description 
MedianValue  num  median value of owner-occupied homes in $1000's 
Crime  num  per capita crime rate by town 
Zone  num  proportion of residential land zoned for lots over 25,000 sq.ft 
Industry  num  proportion of non-retail business acres per town 
Charles  char  Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) 
NOX  num  nitric oxides concentration (parts per 10 million) 
Rooms  num  average number of rooms per dwelling 
Bef_1940  num  proportion of owner-occupied units built prior to 1940 
DistEmp  num  weighted distances to five Boston employment centers 
Highways  num  index of accessibility to highways 
TaxRate  char  full-value property-tax rate per $10,000 
PTRatio  char  pupil-teacher ratio by town 
LowStatus  char  % lower status of the population 
Source of Data
This data is sample data from SAS Institute Inc.
Result
Looking for evidence that the median home value for homes near the Charles River is greater than that for homes farther away, you test the hypotheses
H_{0}: μ = 0 versus H_{a}: μ < 0,
where μ is the difference in mean median home value (Far - Near). The mean of the median home values for tracts near the Charles River is 28.44, and the mean for tracts farther away is 22.094, so the difference (Far - Near) is -6.346. Depending on the method you use for the t-test, this difference may or may not be significant. For example, for more accuracy, you can use the Satterthwaite method, which accounts for the possibility of unequal variances between the two populations (which is tested by the F-test). Using the Satterthwaite method, the t statistic has a value of -3.11, which indicates evidence that the median home value for “near homes” is greater than the median home value for “far homes.”
The p-value for a two-sided test is 0.0036. To obtain the p-value for a one-sided test, you divide this by two, which results in an outcome that is still significant at the α = 0.05 level (p-value = 0.0018).
So, you can conclude that on average, the median value of owner-occupied homes near the Charles River is greater than that of owner-occupied homes farther from the river.
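The Satterthwaite approach mentioned above uses an unpooled standard error and an approximate degrees-of-freedom formula. The Python sketch below shows both, plus the conversion from a two-sided to a one-sided p-value. The home values are hypothetical (not the Bostonhousing data), the function names are illustrative, and the sign of t assumes the difference is taken as Far - Near, so evidence that "near" is larger shows up as a negative statistic.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Return (t, df) for mean(x) - mean(y) with the Satterthwaite approximation."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny      # per-group squared standard errors
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

def one_sided_p(two_sided_p, t, alternative):
    """Convert a two-sided p-value; alternative is 'less' or 'greater'."""
    in_direction = (t < 0) if alternative == "less" else (t > 0)
    return two_sided_p / 2 if in_direction else 1 - two_sided_p / 2

# Hypothetical median home values in $1000's; NOT the Bostonhousing data.
far = [20, 22, 19, 23]
near = [28, 30, 27, 31]
t, df = welch_t(far, near)
print(t, df)

# Values reported in the result above: two-sided p = 0.0036, H_a: mu < 0.
print(one_sided_p(0.0036, -3.11, "less"))
```

Halving the two-sided p-value is only valid when the statistic points in the direction of the alternative, which is why the helper checks the sign of t.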
Correlation
Graphical Evaluation
Source: http://support.sas.com/learn/statlib.../eg_corr_1.htm
Problem: Make a scatter plot to display the relationship between two quantitative variables.
View the scatter plot. It appears that there is somewhat of a weak, positive linear relationship between crime and education. The relationship is positive in the sense that as the education score increases, there appears to be an increase in the crime score for these western cities, which one would think is an undesirable effect.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/corr_1.htm#KB3
Crime and Education in Cities
Plot the relationship between crime and education in western cities.
Problem
In a study to investigate the prevalence of relationships between different socioeconomic factors, 52 western cities were rated by nine criteria: climate and terrain, housing, health care and environment, crime, transportation, education, arts, recreation, and economics. For housing and crime, the lower the rating score, the better. For the remaining seven criteria, the higher the score, the better.
Sample Data
The Westernrates data set contains data about ratings on nine criteria (climate and terrain, housing, health care and environment, crime, transportation, education, arts, recreation, and economics) for 52 western cities.
These are the variables in the data set:
Name  Type  Description 
City  char  city 
State  char  state 
ClimateTerrain  num  rating of climate and terrain 
Housing  num  rating of housing 
HealthCareEnvironment  num  rating of health care and environment 
Crime  num  rating of crime 
Transportation  num  rating of transportation 
Education  num  rating of education 
Arts  num  rating of the arts 
Recreation  num  rating of recreation 
Economics  num  rating of economics 
Source of Data
This data is sample data from SAS Institute Inc.
Result
It appears that there is somewhat of a weak, positive linear relationship between crime and education. The relationship is positive in the sense that as the education score increases, there appears to be an increase in the crime score for these western cities, which one would think is an undesirable effect.
Pearson Correlation
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/eg_corr_2.htm
Problem: Determine the Pearson correlation coefficient to measure the strength of the linear relationship between two quantitative variables.
See the value of the Pearson correlation coefficient for the two variables. The correlation value of 0.30337 suggests that there is a weak positive relationship between crime and recreation. Furthermore, this implies that there is an increase in the crime score as the recreation score increases for these western cities.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/corr_2.htm#KB3
Football: Height and Neck Measurements
Determine the correlations between height and neck measurement, and weight and neck measurement.
Problem
A set of data was collected for the Brigham Young football team. Information on players’ positions, heights, weights, percent body fat, neck measurements, and performances on various weight lifts were recorded.
Use the Pearson correlation coefficient to evaluate the relationship between height and neck measurement, and weight and neck measurement. With which of these two variables does neck measurement seem to be more correlated?
Sample Data
The Football data set was collected from the Brigham Young University football program. Various body composition measurements (such as weight, height, percent body fat) and physical performance measurements (such as speed, bench press, and squat) are included in the data set. Note: some observations have missing data values.
These are the variables in the data set:
Name  Type  Description 
Height  num  height of player (inches) 
Weight  num  weight of player (pounds) 
Fat  num  percent body fat 
Speed  num  evaluation of player’s speed 
Neck  num  neck measurement (inches) 
Bench  num  player’s bench press (pounds) 
Squat  num  player’s squat (pounds) 
LegPress  num  player’s leg press (pounds) 
Position  num  primary position 
Position2  num  secondary position 
Speed2  num  second evaluation of player’s speed 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
Using SAS Enterprise Guide, the value of the Pearson correlation coefficient between height and neck measurement is 0.47769. The value of the correlation between weight and neck measurement is 0.81370. So, based on these values, it appears that neck measurement is more closely correlated with weight than with height.
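Because the Football data set has missing values, a hand computation of Pearson's r needs a rule for incomplete observations. The Python sketch below drops any pair in which either value is missing, which is one common approach and an assumption here rather than a statement about what SAS Enterprise Guide does. The measurements are hypothetical and the helper names are illustrative.

```python
from math import sqrt

def complete_pairs(x, y):
    """Keep only the pairs where both values are present (not None)."""
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]

def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)

# Hypothetical heights and neck sizes (inches); None marks a missing value.
height = [70, 72, None, 75, 68]
neck = [16.0, 17.0, 18.0, 18.0, None]

pairs = complete_pairs(height, neck)
r = pearson_r(pairs)
print(len(pairs), r)
```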
Student Survey 3
Investigate characteristics of college students using survey results.
Problem
To determine the characteristics of the students in their large introductory statistics class, a group of college professors administered a survey to the students in their classes. Some of the questions asked include: the student’s age in years, height (in inches), shoe size and high school GPA.
Exercise:
a. Before you actually do any calculations or construct graphics, consider the relationship between a student’s height and the height of their ideal mate. What type of relationship would you expect for these variables? (Positive or negative? Strong or weak?)
b. Explain in layman’s terms what a positive correlation between these variables would mean. Then explain (again in layman’s terms) what a negative correlation would mean.
c. Calculate Pearson’s correlation for these variables. Is it positive or negative? Does this match your intuition? Explain.
d. Create a scatterplot of these variables. Does the scatterplot match the correlation you calculated? Explain.
e. Further explore the data to develop an explanation for the calculated correlation.
Sample Data
The Survey data is the result of a survey administered to a large introductory statistics class. The data set contains answers from 485 participants. Not all questions are answered, resulting in missing data. Missing data is indicated by a period.
These are the variables in the data set:
Name  Type  Description 
Gender  char  gender (male or female) of the respondent 
Age  num  age of the subject in years 
Textbook  num  answer to the question “How much did you spend for textbooks this term (to nearest dollar)?” 
Cigs  num  answer to the question “How many cigarettes did you smoke yesterday?” 
ColGPA  num  answer to the question “What is your cumulative Grade Point Average at this institution?” 
HSGPA  num  answer to the question “What was your cumulative high school GPA (4 point scale)?” 
Height  num  height of the respondent in inches 
Mateh  num  the height of the respondent’s “ideal mate” 
Shoe  num  the respondent’s shoe size 
Breakfast  char  answer to the question “Did you eat breakfast this morning?” 
Flight  char  answer to the question “Did you fly on a commercial airline during the past 30 days? (Yes or No)” 
Play  char  answer to the question “Have you seen a play in a live theater in the past 6 months? Yes or No” 
Vote  char  answer to the question “Are you registered to vote? Yes or No” 
Credit  num  answer to the question “How many credit hours are you taking this term?” 
Source of Data
This data was collected by Roger Woodard of the North Carolina State University in 2005.
Result
a. Most people would expect that a student’s height and the height of their ideal mate are positively correlated.
b. A positive relationship would indicate that a taller individual would want a taller mate. Thus, a tall basketball player would want a tall mate rather than a very short one. A negative correlation would indicate that tall people want a short mate and a short person wants a tall mate.
c. The correlation is -0.289. This negative value does not match what most people would expect.
d. The scatterplot is given below. It does match the correlation.
e. The scatterplot has two clusters. Further exploration might lead a person to split the scatterplot by gender. Doing this (see the second scatterplot) shows that there are two positive correlations joined together. The positive correlation matches what we expect. If we calculate the correlations separately by gender, we find that both are positive: 0.468 for males and 0.401 for females.
Teacher's note: This is a good example of why one should not take a correlation at face value, but instead always make a scatterplot of the relationship. It is a continuous example of Simpson’s paradox where a third variable changes the apparent relationship of two variables.
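The effect described in the teacher's note can be reproduced with a tiny constructed example. In the Python sketch below, the heights are invented so that the correlation is positive within each group but negative when the groups are pooled; they are not the Survey data, and the group labels and helper name are illustrative.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented heights (inches): own height and ideal mate's height per respondent.
female_height, female_mate = [62, 64, 66], [70, 72, 74]
male_height, male_mate = [70, 72, 74], [62, 64, 66]

r_female = pearson_r(female_height, female_mate)
r_male = pearson_r(male_height, male_mate)
r_pooled = pearson_r(female_height + male_height, female_mate + male_mate)

print(r_female, r_male, r_pooled)  # positive within groups, negative when pooled
```

Splitting by the third variable (here, gender) before computing the correlation is what reveals the within-group pattern.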
Crime and Recreation in Cities
Find the correlation of crime and recreation in western cities.
Problem
In a study to investigate the prevalence of relationships between different socioeconomic factors, 52 western cities were rated by nine criteria: climate and terrain, housing, health care and environment, crime, transportation, education, arts, recreation, and economics. For housing and crime, the lower the rating score, the better. For the remaining seven criteria, the higher the score, the better.
Find the correlation of crime and recreation for this sample data. Based on your value for the Pearson correlation coefficient, would you describe the linear relationship between crime and recreation as strong or weak, positive or negative?
Sample Data
The Westernrates data set contains data about ratings on nine criteria (climate and terrain, housing, health care and environment, crime, transportation, education, arts, recreation, and economics) for 52 western cities.
These are the variables in the data set:
Name  Type  Description 
City  char  city 
State  char  state 
ClimateTerrain  num  rating of climate and terrain 
Housing  num  rating of housing 
HealthCareEnvironment  num  rating of health care and environment 
Crime  num  rating of crime 
Transportation  num  rating of transportation 
Education  num  rating of education 
Arts  num  rating of the arts 
Recreation  num  rating of recreation 
Economics  num  rating of economics 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The correlation value of 0.30337 suggests that there is a weak positive relationship between crime and recreation. Furthermore, this implies that there is an increase in the crime score as the recreation score increases for these western cities.
Spearman Correlation
Source: http://support.sas.com/learn/statlib.../eg_corr_3.htm
Problem: Determine the Spearman correlation coefficient to measure the strength of the relationship between purchase level and age.
See the value of the Spearman correlation coefficient for the two variables. The correlation value of 0.03070 indicates a very weak association between purchase level and age.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/corr_3.htm#KB3
Top Movies
Determine the correlation between the year of release and the amount of money a movie makes.
Problem
A set of data was compiled, which lists the top grossing movies of all time (as of June 2003). It contains information showing the name of the movie, the amount of money it made both in the U.S. and foreign markets, its year of release, and the type of movie.
Use Spearman’s correlation to assess the relationship between the amount of money made in the U.S. by movies and their years of release.
Sample Data
The Movies data set contains a list of the top grossing movies of all time (as of June 2003). The names of the 277 movies in the list are provided, along with the type and rating of each movie, year of release, money grossed in both U.S. and foreign markets, and the name of the movie director.
These are the variables in the data set:
Name  Type  Description 
Movie  char  name of movie 
Type  char  type of movie (comedy, family, drama, etc.) 
Rating  char  movie rating (G, PG, PG13, R) 
Year  num  year of release 
Domestic_  num  money made in U.S. (in millions of dollars) 
Worldwide_  num  money made in foreign market (in millions of dollars) 
Director  char  movie director 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
Using SAS Enterprise Guide, the value of the Spearman correlation coefficient was found to be 0.13396. So, the sample data suggest that there is a weak, positive, monotonic relationship between the amount of money made in the U.S. market by movies and their years of release.
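For readers solving this exercise outside SAS, the following is a minimal sketch of how Spearman's coefficient can be computed: rank each variable (ties get the average of their positions) and take the Pearson correlation of the ranks. The function names and the six data points are hypothetical illustrations, not the Movies data set, so the result will not match 0.13396.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical release years and U.S. grosses (millions of dollars).
years = [1990, 1994, 1997, 1999, 2001, 2002]
gross = [210.0, 180.0, 305.0, 250.0, 260.0, 400.0]
print(f"Spearman rho = {spearman(years, gross):.5f}")
```

Because the coefficient depends only on ranks, it measures monotonic association; a perfectly increasing but curved relationship still yields a value of 1.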
MailOrder Customers 1
Measure the relationship between customers’ purchase level and age.
Problem
A mail-order company has decided that customers who spend 100 dollars or more on purchases should be the focus of its advertising efforts. To help identify this target group, the company collected information from its customers including purchase level (1 = at least $100, 0 = less than $100), gender, income level, and age.
Use the Spearman correlation coefficient to measure the strength of association between purchase level and age.
Sample Data
The Sales data set contains data about customers of a mail-order company.
These are the variables in the data set:
Name  Type  Description 
purchase  num  customer’s purchase level (1 = at least $100, 0 = less than $100) 
age  num  customer’s age 
gender  char  customer’s gender 
income  char  customer’s income level (Low, Medium, High) 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The correlation value of 0.03070 indicates a very weak association between purchase level and age.
Linear Regression
Simple
Source: http://support.sas.com/learn/statlib...g_linreg_1.htm
Problem: Use simple linear regression to describe the relationship between the cost of operations of auction markets and the number of cattle sold.
The first of these tables displays some statistics for the sample data.
The last table in the output gives the parameter estimates for regressing cost on cattle sold, along with the standard errors for these estimates and the p-values for their tests of significance to the model.
The estimate for the intercept parameter is 7.19650, and the estimate for the slope parameter is 4.56396. So, the equation of the simple linear regression line is
yhat = 7.19650 + 4.56396x,
where yhat is the predicted cost of operations (in thousands of dollars) and x is the number of cattle sold (in thousands).
For a market anticipating the sale of 13,500 cattle, the predicted cost of operations is
yhat = 7.19650 + 4.59396(13.5) = 69.215
So, $69,215 in operating costs is expected for 13,500 cattle sold.
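The substitution above can be checked with a few lines of code. This is only a sketch: the function and constant names are hypothetical. Note that the text quotes the slope as 4.56396, while the worked line substitutes 4.59396; the sketch evaluates both so the effect of the difference is visible.

```python
INTERCEPT = 7.19650
SLOPE_STATED = 4.56396  # slope as quoted with the parameter estimates
SLOPE_WORKED = 4.59396  # slope as it appears in the worked substitution


def predicted_cost(cattle_thousands, intercept, slope):
    """Predicted cost of operations (thousands of dollars) for a simple linear fit."""
    return intercept + slope * cattle_thousands


cattle = 13.5  # 13,500 head, expressed in thousands
for label, slope in (("stated", SLOPE_STATED), ("worked", SLOPE_WORKED)):
    yhat = predicted_cost(cattle, INTERCEPT, slope)
    print(f"{label} slope {slope}: yhat = {yhat:.3f}")
```

Only the slope of 4.59396 reproduces the reported value of 69.215.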
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/linreg_1.htm#KB4
Student Survey 4
Investigate characteristics of college students using survey results.
Problem
To determine the characteristics of the students in their large introductory statistics class, a group of college professors administered a survey to the students in their classes. Some of the questions asked include: the student’s age in years, height (in inches), shoe size and high school GPA.
Exercise:
a. Can we estimate a person’s height from their shoe size? Create a scatterplot of the relationship between shoe size and height. Choose the appropriate variable to be on the horizontal axis.
b. Examine the scatterplot. Is the relationship positive or negative?
c. Calculate the least squares regression line that relates height to shoe size. What is the least squares line?
d. Explain in layman’s terms the meaning of the slope term in this regression line.
e. Is the slope between shoe size and height significant? Explain how you know.
f. Does the intercept term make sense in this setting? Explain.
g. Calculate the R-square statistic for the relationship between shoe size and height. Explain this term’s meaning.
h. We find a student shoe print that was a size 10. What height would you estimate for this student? Also calculate a prediction interval. Would this interval be useful in identifying the student? Explain.
Sample Data
The Survey data is the result of a survey administered to a large introductory statistics class. The data set contains answers from 485 participants. Not all questions were answered, resulting in missing data. Missing data is indicated by a period.
These are the variables in the data set:
Name  Type  Description 
Gender  char  gender (male or female) of the respondent 
Age  num  age of the subject in years 
Textbook  num  answer to the question “How much did you spend for textbooks this term (to nearest dollar)?” 
Cigs  num  answer to the question “How many cigarettes did you smoke yesterday?” 
ColGPA  num  answer to the question “What is your cumulative Grade Point Average at this institution?” 
HSGPA  num  answer to the question “What was your cumulative high school GPA (4 point scale)?” 
Height  num  height of the respondent in inches 
Mateh  num  the height of the respondent’s “ideal mate” 
Shoe  num  the respondent’s shoe size 
Breakfast  char  answer to the question “Did you eat breakfast this morning?” 
Flight  char  answer to the question “Did you fly on a commercial airline during the past 30 days? (Yes or No)” 
Play  char  answer to the question “Have you seen a play in a live theater in the past 6 months? Yes or No” 
Vote  char  answer to the question “Are you registered to vote? Yes or No” 
Credit  num  answer to the question “How many credit hours are you taking this term?” 
Source of Data
This data was collected by Roger Woodard of North Carolina State University in 2005.
Result
a. Shoe size should go on the horizontal axis, because it is the variable we are using to predict height.
b. For this scatterplot we see a strong positive correlation.
c. The least squares line is “height=52.05+1.64*shoe”.
d. The slope term 1.64 indicates that the expected height of a student increases by 1.64 inches for each one-unit increase in shoe size.
e. The slope term is significant based on the t-test of significance. This term has a p-value of less than 0.001.
f. The intercept term would indicate that a person with shoe size zero would be 52.05 inches tall. It is impossible to have a shoe size of zero so this term does not have a practical interpretation.
g. The R-square is 0.7318. This indicates that about 73% of the variability in the height of a subject can be explained by the straight-line relationship with shoe size.
h. We would predict that the height would be around 68.45 inches. A prediction interval around this estimate ranges from 64.34 to 72.55 inches. This interval is very uninformative: even without the information about shoe size, we might guess that someone would be between 5’4 and 6’0 tall.
Teacher's note: This example can be well motivated by the question of evaluating crime scene evidence. It can serve as a good lead in for multiple regression by considering adding additional variables such as gender and age to the relationship.
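As a rough illustration of parts c, g, and h, the sketch below fits a least-squares line by hand and then applies the line quoted in the result to a size 10 shoe. The five data points and all names are hypothetical; they are not taken from the Survey data set, so the fitted coefficients differ from those in the result.

```python
def least_squares(x, y):
    """Return (intercept, slope, r_square) for the simple regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_square = sxy ** 2 / (sxx * syy)  # share of variability explained by the line
    return intercept, slope, r_square


# Hypothetical shoe sizes and heights (inches), for illustration only.
shoe = [7, 8, 9, 10, 11]
height = [64, 65, 67, 68, 71]
intercept, slope, r_square = least_squares(shoe, height)
print(f"height = {intercept:.2f} + {slope:.2f}*shoe, R-square = {r_square:.4f}")

# Point prediction for a size 10 shoe using the line quoted in the result.
QUOTED_INTERCEPT, QUOTED_SLOPE = 52.05, 1.64
predicted_height = QUOTED_INTERCEPT + QUOTED_SLOPE * 10
print(f"predicted height for size 10: {predicted_height:.2f} inches")
```

The prediction interval in part h also needs the residual standard error and a t quantile, which this sketch leaves to the statistical software.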
Livestock Auctions 1
Describe the relationship between the cost of operations and the number of cattle sold at auctions.
Problem
A group of livestock auction market managers were interested in learning how the number of cattle sold at their markets influenced the cost of operations of their markets. Data was collected from 19 such auction markets.
Find the simple linear regression equation for cost of operations on cattle sold. Use this equation to predict the cost of operations for a market that anticipates selling 13,500 cattle.
Sample Data
The Market data set contains data from 19 livestock auction markets, including the number of head of cattle sold (in thousands), the cost of operations of the auction market (in thousands of dollars), and the market identifier.
These are the variables in the data set:
Name  Type  Description 
marketid  char  market identifier 
cattle  num  number of head of cattle sold (in thousands) 
cost  num  cost of operations of the auction market (in thousands of dollars) 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The estimate for the intercept parameter is 7.19650, and the estimate for the slope parameter is 4.56396. So, the equation of the simple linear regression line is
yhat = 7.19650 + 4.56396x,
where yhat is the predicted cost of operations (in thousands of dollars) and x is the number of cattle sold (in thousands).
For a market anticipating the sale of 13,500 cattle, the predicted cost of operations is
yhat = 7.19650 + 4.59396(13.5) = 69.215
So, $69,215 in operating costs is expected for 13,500 cattle sold.
Multiple
Source: http://support.sas.com/learn/statlib...g_linreg_2.htm
Problem: Use multiple linear regression to describe the relationship between the cost of operations of auction markets and the number of livestock sold.
The last part of the output gives the parameter estimates for regressing cost on the number of sheep, hogs, calves and cattle sold, along with the standard errors for these estimates and the p-values for their tests of significance to the model.
Based on the parameter estimates, the multiple linear regression function is given by
Yhat = 2.28842 + 3.21552X1 + 1.61315X2 + 0.81485X3 + 0.80258X4,
where Yhat is the predicted cost of operations (in thousands of dollars) and X1 is the number of cattle sold, X2 is the number of calves sold, X3 is the number of hogs sold, and X4 is the number of sheep sold (in thousands).
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/linreg_2.htm#KB4
Livestock Auctions 2
Describe the relationship between the cost of operations and the number of livestock sold at auction.
Problem
A cost analyst was hired by a group of livestock auction market managers to determine how the number of various classes of livestock (cattle, calves, hogs, and sheep) sold at their markets influenced the cost of operations of their markets. Data was provided by 19 market managers.
Find the multiple linear regression function for expressing cost of operations as a function of the number of livestock sold for the four classes.
Sample Data
The Auction data set contains data from 19 livestock auction markets, including the number of head of each of four types of livestock sold (in thousands), the cost of operation of the auction market (in thousands of dollars), and the market identifier.
These are the variables in the data set:
Name  Type  Description 
marketid  char  market identifier 
cattle  num  number of head of cattle sold (in thousands) 
calves  num  number of head of calves sold (in thousands) 
hogs  num  number of head of hogs sold (in thousands) 
sheep  num  number of head of sheep sold (in thousands) 
cost  num  cost of operation of the auction market (in thousands of dollars) 
volume  num  total of all cattle, calves, hogs, and sheep sold in each market 
Source of Data
This data is sample data from SAS Institute Inc.
Result
Based on the parameter estimates, the multiple linear regression function is given by
Yhat = 2.28842 + 3.21552X1 + 1.61315X2 + 0.81485X3 + 0.80258X4,
where Yhat is the predicted cost of operations (in thousands of dollars) and X1 is the number of cattle sold, X2 is the number of calves sold, X3 is the number of hogs sold, and X4 is the number of sheep sold (in thousands).
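To show how the fitted function is used, here is a small sketch that evaluates it for one market. The coefficients are the ones quoted above; the function name and the livestock counts are hypothetical and chosen only for illustration.

```python
# Coefficients quoted in the result (counts are in thousands of head).
INTERCEPT = 2.28842
COEFFICIENTS = {
    "cattle": 3.21552,
    "calves": 1.61315,
    "hogs": 0.81485,
    "sheep": 0.80258,
}


def predicted_cost(sold_thousands):
    """Predicted cost of operations (thousands of dollars) for one market."""
    return INTERCEPT + sum(
        COEFFICIENTS[animal] * sold_thousands[animal] for animal in COEFFICIENTS
    )


# Hypothetical market: 10,000 cattle, 5,000 calves, 20,000 hogs, 8,000 sheep.
market = {"cattle": 10.0, "calves": 5.0, "hogs": 20.0, "sheep": 8.0}
print(f"predicted cost = {predicted_cost(market):.3f} thousand dollars")
```

Each coefficient is the estimated change in cost (in thousands of dollars) per additional thousand head of that class, holding the other classes fixed.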
Polynomial
Source: http://support.sas.com/learn/statlib...g_linreg_3.htm
Problem: Determine whether a quadratic regression function adequately describes the relationship between tree weight and trunk girth.
View the ANOVA table for the quadratic model. It appears that the model is significant, based on the p-value being less than 0.0001.
Look at the F-tests for type I and type III sums of squares. (The type I sums of squares are the extra sums of squares obtained by sequentially adding the linear and quadratic variables to the model, in that order. The type III sums of squares are the extra sums of squares that result from adding one variable to a model which already contains the other variable.) Based on the results, it appears that both the linear and quadratic trunk girth terms are significant in predicting tree weight (p-values less than or equal to 0.0001).
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/linreg_3.htm#KB4
Weight to Height
Fit a quadratic model to describe a relationship between two variables.
Problem
A study was performed in which 72 children were measured in order to examine how the weight to height ratio changes as kids get older. The ratio and age (in months) were recorded for each young participant.
Fit a quadratic curve to the sample data to model the relationship between weight to height ratio and age, and give its equation. Display this curve on a scatterplot of the data. For your model, Ratio will be the response variable and Age will be the explanatory variable.
Sample Data
The Growth data set is the recordings of the weight to height ratios and ages (in months) of 72 children.
These are the variables in the data set:
Name  Type  Description 
ratio  num  weight to height ratio 
age  num  age in months 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
Using SAS Enterprise Guide, the estimated quadratic fit to the data is given by the equation
predicted ratio = 0.6022 + 0.01060*age – 0.00007337*age^2.
To create the scatterplot of the data, with the quadratic curve displayed, use a graphing facility to plot the data points and graph the function (with age on the horizontal axis and ratio on the vertical axis).
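As one possible way to prepare the curve for plotting, the sketch below evaluates the quoted quadratic on a grid of ages and locates the age at which the fitted ratio peaks. The grid of ages (0 to 72 months in 12-month steps) and the names are assumptions made for illustration; the plotting itself is left to whatever graphing tool you use.

```python
# Coefficients quoted in the result: ratio = B0 + B1*age + B2*age^2.
B0, B1, B2 = 0.6022, 0.01060, -0.00007337


def predicted_ratio(age_months):
    """Fitted weight-to-height ratio at a given age in months."""
    return B0 + B1 * age_months + B2 * age_months ** 2


# Points on the curve, ready to overlay on a scatterplot (age on the x-axis).
curve = [(age, predicted_ratio(age)) for age in range(0, 73, 12)]
for age, ratio in curve:
    print(f"age {age:2d} months: predicted ratio {ratio:.4f}")

# A downward-opening parabola peaks where the derivative B1 + 2*B2*age is zero.
peak_age = -B1 / (2 * B2)
print(f"fitted curve peaks near {peak_age:.1f} months")
```

Predictions far outside the observed age range should be treated with caution, since a quadratic eventually turns downward regardless of the data.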
Tree Weights and Trunk Girths
Determine whether a quadratic regression function adequately describes a weight/girth relationship.
Problem
A forestry commission once sought a way to accurately estimate the weights of trees without having to go through the damaging process of cutting the trees down to weigh them. The weights and trunk girths of 104 tree specimens were measured, in hopes that girth would be useful in predicting weight.
Find the quadratic function for regressing tree weight on trunk girth, and determine whether this function provides a good fit for the data.
Sample Data
The Tree data set contains data about the weights and trunk girths of 104 tree specimens (eight specimens from each of thirteen rootstocks).
These are the variables in the data set:
Name  Type  Description 
RootStock  char  rootstock (I – XIII) 
TrunkGirth  num  trunk girth of specimen 
Weight  num  weight of specimen 
Source of Data
This data is sample data from SAS Institute Inc.
Result
It appears that the quadratic model is significant, based on the p-value being less than 0.0001.
Based on the results of the F-tests for type I and type III sums of squares, it appears that both the linear and quadratic trunk girth terms are significant in predicting tree weight (p-values less than or equal to 0.0001).
Stepwise
Source: http://support.sas.com/learn/statlib...g_linreg_4.htm
Problem: Use the stepwise selection method to choose a model that “best” fits the corn data.
View the first part of the output to see the number of observations read and used.
Scroll down to view the results of step 1 of the stepwise selection. Notice that the variable July_Temp entered the model. The ANOVA table, followed by the estimates, standard errors, and significance tests for the parameters, is displayed for the resulting model.
Scroll down to view the results of step 2 of the stepwise selection. The variable July_Rain was added to the model in this step.
View the summary of the stepwise selection for this data. Then view the F value and p-value.
The final results of the stepwise selection show the inclusion of only the July_Temp and July_Rain variables in the “best” fit model. So, the resulting linear regression function based on this selection method is
predicted CornYield = 163.47 + 3.49(July_Rain) – 1.68(July_Temp)
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/linreg_4.htm#KB4
Mass Prediction
Construct a model with mass as the response variable, using stepwise selection.
Problem
A statistics project was carried out in Australia in which body measurements were collected from 22 male subjects. The aim of the project was to construct a model which would predict the mass of a person based on other characteristics.
Use the stepwise selection method to fit such a model using a 0.15 significance level for entry.
Sample Data
The Body_measurements data set is from a project that involved the recording of 11 body measurements (such as mass, arm measurements, chest measurement, height, etc.) for 22 male subjects.
These are the variables in the data set:
Name  Type  Description 
Mass  num  body mass 
Fore  num  forearm measurement 
Bicep  num  bicep measurement 
Chest  num  chest measurement 
Neck  num  neck measurement 
Shoulder  num  numeric 
Waist  num  waist measurement 
Height  num  height 
Calf  num  calf measurement 
Thigh  num  circumference around one thigh 
Head  num  circumference around head 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The resulting model for this data fit by the stepwise selection method (using the REG procedure in SAS) is the following:
predicted Mass = 80.4533 + 2.1232*Fore + 0.6656*Waist + 0.2770*Height + 0.5232*Thigh – 0.6371*Head
Corn Yield
Choose a model that “best” fits the corn data.
Problem
Data was collected to study the effect of weather related events on corn yield. The variables involved in the study were the total precipitation (in inches) for the year before the growing season, the average daily temperature (Fahrenheit) for the months May through August, the total rain for the months June through August, and corn yield (in bushels per acre).
Apply the stepwise selection method to the linear regression of corn yield on the other variables to find the model that “best” fits the data, according to this criterion.
Sample Data
The Corn data set contains information about weather related events and corn yield.
These are the variables in the data set:
Name  Type  Description 
CornYield  num  corn yield (in bushels per acre) 
Year  num  year 
Pre_seasonPrecip  num  total precipitation (in inches) for the year prior to the start of the growing season 
May_Temp  num  average daily temperature (in degrees Fahrenheit) for May 
June_Rain  num  total rain (in inches) for June 
June_Temp  num  average daily temperature (in degrees Fahrenheit) for June 
July_Rain  num  total rain (in inches) for July 
July_Temp  num  average daily temperature (in degrees Fahrenheit) for July 
Aug_Rain  num  total rain (in inches) for August 
Aug_Temp  num  average daily temperature (in degrees Fahrenheit) for August 
Source of Data
This data is sample data from SAS Institute Inc.
Result
Using a significance level of 0.15 for entering and for remaining in the model, the final results of the stepwise selection show the inclusion of only the July_Temp and July_Rain variables in the “best” fit model. So, the resulting linear regression function based on this selection method is
predicted CornYield = 163.47 + 3.49(July_Rain) – 1.68(July_Temp)
Analysis of Variance
OneWay ANOVA
Source: http://support.sas.com/learn/statlib...eg_anova_1.htm
Problem: Perform a one-way ANOVA to determine whether there are significant differences between four drugs in mean increase in systolic blood pressure.
View the first part of the ANOVA table to see the number of levels and observations. Then scroll down to the next page of the output, the ANOVA table.
The p-value of .0001 is less than .05, so there is sufficient evidence to reject the null hypothesis that all means are equal. The F test does not give any specifics about which means are different, only that there is at least one pair of means that is statistically different.
Scroll down to the results of Tukey's test and the table of groupings.
In the results for Tukey's test, groups are shown with Tukey Grouping letters assigned to them. If the same letter is assigned to any two groups, those groups are not significantly different from each other. This test shows that drugs 1 and 2 are significantly different from 3 and 4.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/anova_1.htm#KB5
Wafers
Determine whether there is a difference between probes in measuring resistance.
Problem
The National Institute of Standards and Technology (NIST) references research involving the doping of silicon wafers with phosphorus. The wafers were doped with phosphorus by neutron transmutation doping in order to have resistivities of 200 ohm·cm. Measurements of bulk resistivity of silicon wafers were made with 5 probing instruments on each of 5 days. The experimenters are interested in testing differences among the instruments.
Complete an ANOVA to determine whether there is a difference between the probes in measuring resistance. If there are differences, then distinguish them by using a test for multiple comparisons.
Sample Data
The Doped_wafers data set gives the resistance measurements for 5 different probing instruments from a study conducted by the National Institute of Standards and Technology.
These are the variables in the data set:
Name  Type  Description 
Instrument  num  probing instrument (categorical) 
Resistance  num  resistance measurement 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The value of the F statistic for the ANOVA F test is 1.18 with a corresponding p-value of 0.3494, which is not significant. So, the data does not indicate that there is a difference between the probes in measuring resistance. This is supported by the results from the Tukey test for multiple comparisons. These results were generated by performing the analysis in SAS Enterprise Guide.
Blood Pressure 2
Determine whether four drugs differ in mean increase in systolic blood pressure.
Problem
Suppose that a pharmaceutical company has recently focused on the effect of its potential products on blood pressure. Researchers conducted a study in which the subjects were 72 individuals with one of three diseases. Eighteen individuals were randomly assigned to each of the four drugs. The treatments were administered over time and the increase in systolic blood pressure was recorded.
Perform a one-way ANOVA to determine whether there are significant differences between the mean increases in systolic blood pressure for the four drugs. If there are, you need to determine which drugs differ and how greatly they differ.
Sample Data
The Drug data set contains data from an experiment to evaluate the effect of four different drugs on blood pressure for individuals with one of three possible diseases.
These are the variables in the data set:
Name  Type  Description 
Drug  char  drug (1, 2, 3, 4) 
Disease  char  disease (A, B, C) 
BPChange  char  change in systolic blood pressure 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The p-value of .0001 is less than .05, so there is sufficient evidence to reject the null hypothesis that all means are equal. The F test does not give any specifics about which means are different, only that there is at least one pair of means that is statistically different.
To compare means, you can use a test such as Tukey's studentized range test (HSD). For example, in the results for Tukey's test, drugs 1 and 2 are significantly different from 3 and 4.
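The F statistic behind the one-way ANOVA table can be reproduced by hand. The sketch below is a minimal illustration with hypothetical data for three groups (not the Drug data set); it computes the between-group and within-group mean squares and their ratio. The p-value and Tukey's comparisons require the F and studentized range distributions, which are left to the statistical software.

```python
def mean(values):
    return sum(values) / len(values)


def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [y for group in groups for y in group]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical blood pressure changes for three treatments.
samples = {
    "drug A": [1, 2, 3],
    "drug B": [2, 3, 4],
    "drug C": [5, 6, 7],
}
f_stat, df_between, df_within = one_way_f(list(samples.values()))
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F indicates that the group means vary more than the within-group scatter would explain, which is the evidence the F test summarizes.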
TwoWay ANOVA
Source: http://support.sas.com/learn/statlib...eg_anova_2.htm
Problem: Perform a two-way ANOVA to determine whether the effects of stretching and wearing ankle weights are significant on exercise.
View the first part of the output to see information about the class level and number of observations.
Scroll down to the ANOVA table for the two-way model. It appears that the model is significant, based on the p-value of 0.0008.
The last two tables of the output give the results of the F-tests for type I and type III sums of squares, which are equivalent in this case. Based on these, with p-values of 0.0045 and 0.0032, we can say that there is sufficient evidence to claim that stretching and wearing ankle weights both have significant effects on exercise benefit (in terms of burning calories).
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/anova_2.htm#KB5
Starch Out
Determine whether sand blasting significantly affects starch removal in jeans.
Problem
Denim manufacturers are concerned with maximizing the comfort of the jeans they make. In the manufacturing process, starch is often built up in the fabric, creating stiff jeans that must be “broken in” by the wearer before they are comfortable. To minimize the break-in time for each pair of pants, the manufacturers often wash the jeans to remove as much starch as possible. In a particular study to evaluate starch removal, data was recorded including the washing method (3 types), size of the load (in pounds), whether or not the fabric was sand blasted prior to washing, and the starch content of the fabric after washing.
Carry out a two-way analysis of variance to examine if the interaction between washing method and whether or not the fabric was sand blasted has a significant effect on the removal of starch.
Sample Data
The Denim data set contains the four columns that are described below.
These are the variables in the data set:
Name  Type  Description 
Method  char  method used in washing 
Size of Load  num  size in pounds 
Sand Blasted  char  whether or not the fabric was sand blasted before washing 
Starch Content  num  the starch content of the fabric after washing 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The value of the F statistic for the interaction effect Method*Sand_blasted_ is 1.324 with a corresponding p-value of 0.271, which is not significant even at a level as high as α = 0.10. So we can conclude that the interaction between washing method and sand blasting does not have a significant effect on the removal of starch from jeans. These results were given by the output from the GLM procedure in SAS.
Treadmill Exercise
Determine whether stretching and wearing ankle weights affect exercise on treadmills.
Problem
An exercise physiologist wants to examine whether stretching and wearing ankle weights affect the value of exercise on treadmills. To carry out her study, she recruits subjects who have roughly the same level of physical fitness, and divides them randomly into four groups: with or without ankle weights, and with or without a stretching period before the exercise.
Using the amount of calories burned as the response, carry out a two-way ANOVA to determine whether stretching and wearing ankle weights have significant effects on exercise.
Sample Data
The Exercise data set contains data about the effects of stretching and wearing ankle weights on treadmill exercise.
These are the variables in the data set:
Name  Type  Description 
PreStretch  char  stretch group (Stretch, No stretch) 
AnkleWeights  char  weights group (Weights, No weights) 
Energy  num  calories burned 
Speed  num  average speed (in meters per minute) 
Oxygen  num  oxygen consumed (in liters) 
Source of Data
This data is sample data from SAS Institute Inc.
Result
It appears that the two-way model is significant, based on the p-value of 0.0008. The results of the F-tests for type I and type III sums of squares are equivalent in this case. Based on these, with p-values of 0.0045 and 0.0032, we can say that there is sufficient evidence to claim that stretching and wearing ankle weights both have significant effects on exercise benefit (in terms of burning calories).
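For a balanced design (the same number of subjects in every cell), the sums of squares for the two factors and their interaction can be computed directly from the cell, row, and column means, and the type I and type III results coincide. The sketch below assumes such a balanced layout; the data and all names are hypothetical, not the Exercise data set.

```python
def mean(values):
    return sum(values) / len(values)


def two_way_anova(cells):
    """F statistics for a balanced two-factor design.

    cells maps (a_level, b_level) to a list of responses; every cell must
    hold the same number of observations (at least two).
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if any(len(values) != n for values in cells.values()):
        raise ValueError("this sketch requires a balanced design")
    grand = mean([y for values in cells.values() for y in values])
    cell_mean = {key: mean(values) for key, values in cells.items()}
    a_mean = {a: mean([cell_mean[(a, b)] for b in b_levels]) for a in a_levels}
    b_mean = {b: mean([cell_mean[(a, b)] for a in a_levels]) for b in b_levels}
    ss_a = n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels)
    ss_b = n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels)
    ss_ab = n * sum(
        (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
        for a in a_levels
        for b in b_levels
    )
    ss_error = sum(
        (y - cell_mean[key]) ** 2 for key, values in cells.items() for y in values
    )
    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    df_error = len(a_levels) * len(b_levels) * (n - 1)
    ms_error = ss_error / df_error
    return {
        "F_A": (ss_a / df_a) / ms_error,
        "F_B": (ss_b / df_b) / ms_error,
        "F_AB": (ss_ab / (df_a * df_b)) / ms_error,
    }


# Hypothetical calories burned, two subjects per cell.
calories = {
    ("stretch", "weights"): [12, 14],
    ("stretch", "no weights"): [9, 11],
    ("no stretch", "weights"): [8, 10],
    ("no stretch", "no weights"): [5, 7],
}
result = two_way_anova(calories)
print(result)
```

With unbalanced data the factor effects are no longer orthogonal, and the type I and type III sums of squares generally differ.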
Mixed Models
Source: http://support.sas.com/learn/statlib...eg_anova_3.htm
Problem: Perform an ANOVA for a mixed model to determine whether the data indicates a difference between irrigation methods on the harvest of orange trees.
Scroll down to view the results of the F-test for the irrigation effect. The Type III test for the effect of irrigation method yields an F statistic of 3.27 with a corresponding p-value of 0.0254. So at a level of α = 0.05, we can conclude that the effect of irrigation method on the harvest of orange trees at this grove is significant.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/anova_3.htm#KB5
Orange Grove Irrigation
Determine whether there is a difference between irrigation methods on the harvest of orange trees.
Problem
A randomized complete block design study was performed to investigate the effects of irrigation method (with 5 levels) on the harvest of orange trees in a particular grove. The grove was divided into eight blocks to account for local variation in the grove, and irrigation methods were assigned to trees within each block at random. Each method appeared in every block. At harvest, the fruit from the trees was weighed.
Considering the blocking factor to be a random effect, carry out an ANOVA with a mixed model. Use the results from the appropriate F-test to decide whether the irrigation method effect is significant.
Sample Data
The Methods data set contains data about five irrigation methods that were used on an orange grove, and the weight of the fruit at harvest.
These are the variables in the data set:
Name  Type  Description 
irrig  char  irrigation method 
block  num  block within grove (1 – 8) 
fruitwt  num  weight of fruit at harvest 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The Type III test for the effect of irrigation method yields an F statistic of 3.27 with a corresponding p-value of 0.0254. So at a level of α = 0.05, you can conclude that the effect of irrigation method on the harvest of orange trees at this grove is significant.
Wanderers
Construct a mixed model to examine how wandering radius differs between foxes and coyotes.
Problem
Six animals from two species (fox and coyote) were tracked, and the diameter of the area that each animal wandered was recorded. Each animal was measured four times, once per season. The animals were selected randomly from a large population, and the variability from animal to animal is from some unknown distribution. Use a random-effects mixed model, specifying subject(species) as the random effect, to test how wandering radius differs between species. (Recall that the radius is simply half of the diameter.) Base your conclusion on a significance level of α = 0.05.
Sample Data
The Animals data set contains recordings of the diameter of the area wandered by six animals from two species. The individual animals were identified as subjects within each species. Each animal was measured four times, once per season.
These are the variables in the data set:
Name  Type  Description 
species  char  animal species (fox or coyote) 
subject  num  animal identifier within species (has values 1, 2, 3 for each species) 
miles  num  diameter of wandering area 
season  char  season of the year the distance was recorded 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The MIXED procedure in SAS gives the following result: the F test for the mixed model gives an F statistic of 11.89 and a p-value of 0.0261 for the species effect. This outcome is significant at the α = 0.05 level, and thus we can conclude that there is a difference in wandering radius between foxes and coyotes.
Repeated Measures
Source: http://support.sas.com/learn/statlib...eg_anova_4.htm
Problem: Perform a repeated measures ANOVA to determine if the data suggest that there is a difference in home range areas of foxes and coyotes.
The Type 3 test for the species effect yields an F statistic of 12.71 with a corresponding p-value of 0.0235. So the data suggest that the species effect is significant, and thus we can conclude that there is a difference in home range areas between foxes and coyotes.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/anova_4.htm#KB5
Foxes and Coyotes
Determine whether there is a difference in home range areas of foxes and coyotes.
Problem
Home ranges have been found to vary seasonally for both foxes and coyotes. A study was conducted in a region of central Illinois, in which three red foxes and three coyotes were captured and radio-collared. The animals were tracked throughout the course of a year, and the home range area for each one was recorded (in square miles) at the end of each season.
Carry out a repeated measures ANOVA to determine whether there is a significant difference in the home ranges of the two species.
Sample Data
The Canine data set contains data about the home range area covered by three red foxes and three coyotes each season throughout the course of a year.
These are the variables in the data set:
Name  Type  Description 
species  char  species 
subject  num  subject 
miles  num  home range area (in square miles) 
season  char  season 
recode_season  num  season code 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The Type 3 test for the species effect yields an F statistic of 12.71 with a corresponding p-value of 0.0235. So the data suggest that the species effect is significant, and thus you can conclude that there is a difference in home range areas between foxes and coyotes.
Tests of Association
Pearson Chi-Square
Source: http://support.sas.com/learn/statlib.../eg_asso_1.htm
Problem: Perform the Pearson chi-square test to determine if there are significant differences between age groups for risk of colon cancer.
The value of the Pearson chi-square statistic (labeled Chi-Square) is 28.2566, with a p-value of 0.0004, which can be considered to be very significant. So, we can conclude that there are significant differences among age groups for risk of colon cancer.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/asso_1.htm#KB6
Student Survey 5
Investigate characteristics of college students using survey results.
Problem
To determine the characteristics of the students in their large introductory statistics class, a group of college professors administered a survey to the students in their classes. Some of the questions asked include: the student’s age in years, height (in inches), shoe size, and high school GPA. We all know that there are many physical differences in male and female college students, but what other characteristics differ? Assume that we can consider our sample results to be a random sample from the population of all undergraduate students at this institution.
Exercise:
a. Is there a relationship between the gender of a student and whether the student is registered to vote? Create a two-way table displaying the gender and voter registration status for these students. Does there appear to be a relationship?
b. Calculate Pearson’s chi-square statistic and the associated p-value for these variables.
c. Explain the null and alternative hypotheses for the chi-square test in this setting.
d. What conclusion can you draw from the results of the chi-square test?
e. Conduct the chi-square test for gender and whether or not the student ate breakfast.
f. Conduct the chi-square test for gender and whether the student took a commercial flight in the last 30 days.
g. Conduct the chi-square test for gender and whether the student has seen a play in the past 6 months.
Sample Data
The Survey data is the result of a survey administered to a large introductory statistics class. The data set contains answers from 485 participants. Not all questions were answered, resulting in missing data, which is indicated by a period.
These are the variables in the data set:
Name  Type  Description 
Gender  char  gender (male or female) of the respondent 
Age  num  age of the subject in years 
Textbook  num  answer to the question “How much did you spend for textbooks this term (to nearest dollar)?” 
Cigs  num  answer to the question “How many cigarettes did you smoke yesterday?” 
ColGPA  num  answer to the question “What is your cumulative Grade Point Average at this institution?” 
HSGPA  num  answer to the question “What was your cumulative high school GPA (4 point scale)?” 
Height  num  height of the respondent in inches 
Mateh  num  the height of the respondent’s “ideal mate” 
Shoe  num  the respondent’s shoe size 
Breakfast  char  answer to the question “Did you eat breakfast this morning?” 
Flight  char  answer to the question “Did you fly on a commercial airline during the past 30 days? (Yes or No)” 
Play  char  answer to the question “Have you seen a play in a live theater in the past 6 months? Yes or No” 
Vote  char  answer to the question “Are you registered to vote? Yes or No” 
Credit  num  answer to the question “How many credit hours are you taking this term?” 
Source of Data
This data was collected by Roger Woodard of the North Carolina State University in 2005.
Result
a. The two-way table is given below. There does not seem to be much of a relationship: for both males and females, over 86% of the subjects were registered to vote.
Table of gender by vote (each cell lists Frequency, Percent, Row Pct, Col Pct)
gender  vote = n  vote = y  Total 
f  40  251, 51.86, 86.25, 59.20  291, 60.12 
m  20, 4.13, 10.36, 33.33  173, 35.74, 89.64, 40.80  193, 39.88 
Total  60, 12.40  424, 87.60  484, 100.00 
Frequency Missing = 2 
b. The test statistic is 1.2229 and the p-value is 0.2688.
c. The null hypothesis is that gender and voter registration are independent. The alternative hypothesis is that the variables are not independent.
d. Based on the large p-value, we cannot reject the null hypothesis. Therefore, we cannot reject the idea that gender and voter registration are independent.
e. The test statistic is 5.7322 and the p-value is 0.0167. We can reject the null hypothesis at the 5% level (but not the 1% level). We can conclude that there is evidence that eating breakfast and gender are related among the students at this university.
f. The test statistic is 1.1028 and the p-value is 0.2937. We cannot reject the null hypothesis. Therefore there is not enough evidence to conclude that there is a relationship between gender and taking a flight among the students at this university.
g. The test statistic is 9.1603 and the p-value is 0.0025. We can reject the null hypothesis at both the 5% and 1% levels. We can conclude that there is evidence that seeing a play and gender are related among the students at this university.
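For readers working outside SAS, the statistic in part (b) can be checked by hand from the four counts in the table of gender by vote. The following is a minimal Python sketch, not the SAS FREQ implementation; the function name is made up for illustration, and it assumes no continuity correction is applied. It should agree with the reported values up to rounding.

```python
import math

def pearson_chi_square_2x2(table):
    """Return (statistic, p_value) for a 2x2 table of counts, no continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Counts from the table of gender by vote: rows f, m; columns n, y.
gender_by_vote = [[40, 251], [20, 173]]
stat, p_value = pearson_chi_square_2x2(gender_by_vote)
print(f"chi-square = {stat:.4f}, p-value = {p_value:.4f}")
```

The same function can be applied to parts (e) through (g) once the corresponding 2x2 counts have been tabulated from the Survey data.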
Colon Cancer
Determine whether risk of colon cancer differs between age groups.
Problem
A colonoscopy screening study was performed on individuals who were considered to be at a high risk of colon cancer, due to adenoma findings in previous examinations. The data recorded from the screening were the variables Finding (coded 0 for negative examination, 1 for small adenoma, and 2 for large adenoma) and Age (rounded: 30-39 years coded as 35, 40-49 years coded as 45, and so on).
Using the Pearson chi-square test, determine whether there are significant differences among age groups for risk of colon cancer. Give the chi-square test statistic and the p-value for the test.
Sample Data
The Colonoscopy data set contains data from a colonoscopy screening study on individuals considered to be at high risk of colon cancer.
These are the variables in the data set:
Name  Type  Description 
Finding  num  finding (0 for negative examination, 1 for small adenoma, and 2 for large adenoma) 
Age  num  age (rounded to the midpoint of each decade; e.g., ages 30-39 years coded as 35, ages 40-49 years coded as 45) 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The value of the Pearson chi-square statistic is 28.2566, with a p-value of 0.0004, which can be considered to be very significant. So, we can conclude that there are significant differences among age groups for risk of colon cancer.
Likelihood Ratio Chi-Square
Source: http://support.sas.com/learn/statlib.../eg_asso_2.htm
Problem: Use the likelihood ratio chi-square test to determine if there is an association between coffee consumption and incidence of pancreatic cancer.
The value of the likelihood ratio chi-square statistic is 7.6195, with a p-value of 0.0546, which is not significant at the α = 0.05 level.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/asso_2.htm#KB6
Pancreatic Cancer
Determine whether coffee consumption is associated with the incidence of pancreatic cancer.
Problem
A case-control study was carried out to investigate the existence of a relationship between coffee consumption and pancreatic cancer. The data provided is for male subjects, with Outcome as a categorical variable representing whether or not a subject had cancer, and DailyCoffee as a continuous variable representing the daily coffee consumption of a subject.
Using the likelihood ratio chi-square test, determine whether the data indicates an association between coffee consumption and incidence of pancreatic cancer. Give the chi-square test statistic and the p-value for the test.
Sample Data
The Coffee data set contains data about coffee consumption by individuals with and without pancreatic cancer.
These are the variables in the data set:
Name  Type  Description 
Outcome  char  indicator of whether individual is a case (pancreatic cancer) or a control (no cancer) 
DailyCoffee  num  amount of coffee drunk (0 for none, 1.5 for 1-2 cups per day, 3.5 for 3-4 cups per day, or 5.5 for 5 or more cups per day) 
AnyCoffee  char  coded value: if DailyCoffee=0, "No coffee"; otherwise "Some coffee" 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The value of the likelihood ratio chi-square statistic is 7.6195, with a p-value of 0.0546, which is not significant at the α = 0.05 level. However, at a significance level of α = 0.10, you could say that the data gives sufficient evidence of an association between coffee consumption and incidence of pancreatic cancer.
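The likelihood ratio chi-square statistic compares observed counts with the counts expected under independence, using G2 = 2 × Σ observed × ln(observed / expected). A minimal Python sketch of that formula follows. The function name and the counts are hypothetical and are not taken from the Coffee data set, so the output is not expected to match the result above.

```python
import math

def likelihood_ratio_chi_square(table):
    """Return (G2, degrees_of_freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g2 += 2 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return g2, df

# Hypothetical counts, NOT the Coffee data: rows = case/control, columns = two exposure levels.
example = [[10, 20], [20, 10]]
g2, df = likelihood_ratio_chi_square(example)
print(f"G2 = {g2:.4f} on {df} degree(s) of freedom")
```

The statistic is then compared with a chi-square distribution on the returned degrees of freedom to obtain a p-value.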
Survivors
Determine whether there is a difference in survival rate among classes aboard the Titanic.
Problem
Information was gathered on the passengers of the RMS Titanic and recorded in a data set. The four variables represent the class (first, second, third, and crew), age, sex, and survival status (yes or no) for each passenger. Test the hypothesis that there is no difference in the survival rate among classes by using the likelihood ratio chi-square test with a significance level of α = 0.01.
Sample Data
The Titanic data set contains information on the passengers of the RMS Titanic including class level, age, sex, and survival status.
These are the variables in the data set:
Name  Type  Description 
Class  char  class level of passenger (first, second, third, or crew) 
Age  char  age classification of passenger (adult or child) 
Sex  char  sex (male or female) 
Survived  char  survival status (yes or no) 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
Based on output from the FREQ procedure in SAS, we find that the value of the likelihood ratio chi-square statistic is 180.9014, and the p-value for the test is found to be < 0.0001, which is significant at the α = 0.01 level. Hence, we can say that there is very strong evidence of a difference in the survival rate among the classes aboard the RMS Titanic.
Mantel-Haenszel Chi-Square
Source: http://support.sas.com/learn/statlib.../eg_asso_3.htm
Problem: Use the Mantel-Haenszel chi-square test to determine if there is a relationship between variety of cotton and the distance between planting rows.
The value of the Mantel-Haenszel chi-square statistic is 0.0376, with a p-value of 0.8462, so we can conclude that the test is not significant. Hence, we fail to reject the null hypothesis that there is no relationship between variety of cotton and distance between planting rows.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/asso_3.htm#KB6
Popularity 2
Investigate whether there is a gender difference in the perceived importance of good grades.
Problem
M.A. Chase and G.M. Dummer conducted a study in 1992 to determine what traits children regarded as important to popularity. Demographic information was recorded, as well as the rating given to four traits assessing their importance to popularity: Grades, Sports, Looks, and Money. The rating scale was from 1 to 4, with 1 being the most important of the four options, 4 being the least.
Determine whether there is a gender difference in the importance given to making good grades by carrying out the Mantel-Haenszel chi-square test with a significance level of α = 0.10.
Sample Data
The Childrens_popularity data contains demographic information and ratings given to four traits assessing their importance to popularity – grades, sports, looks, and money – for students in grades 4 through 6.
These are the variables in the data set:
Name  Type  Description 
Gender  char  gender of student (boy or girl) 
Grade  num  grade level of student 
Age  num  age in years 
Race  char  race (white or other) 
Urban_Rural  char  type of residence area (rural, suburban, urban) 
School  char  school student attends 
Goals  char  area student strives for (grades, sports, popular) 
Grades  num  rating on importance of grades (1=most important, 2, 3, 4=least important) 
Sports  num  rating on importance of sports (1=most important, 2, 3, 4=least important) 
Looks  num  rating on importance of looks (1=most important, 2, 3, 4=least important) 
Money  num  rating on importance of money (1=most important, 2, 3, 4=least important) 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
From the FREQ procedure in SAS we find that the value of the Mantel-Haenszel chi-square statistic is 0.4491, with a corresponding p-value of 0.5028, which is not significant at the α = 0.10 level. So, we can conclude that there is not enough evidence to suggest a gender difference in the importance given to making good grades.
Cotton Plants
Determine whether variety of cotton is related to the distance between planting rows.
Problem
A student in the Department of Crop Science at NC State University was interested in investigating the relationship between cotton variety and the distance between planting rows. She used data from a concurrent experiment which included two levels of variety and two levels of spacing.
Use the Mantel-Haenszel chi-square test to test the null hypothesis of no relationship between variety and spacing. Give the chi-square test statistic and the p-value for the test.
Sample Data
The Cotton data set contains data about various characteristics of cotton plants.
These are the variables in the data set:
Name  Type  Description 
variety  num  cotton variety (37, 213) 
spacing  num  distance between planting rows 
plant  num  plant 
bollwt  num  total weight of cotton bolls 
lint  num  weight of usable lint 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The value of the Mantel-Haenszel chi-square statistic is 0.0376, with a p-value of 0.8462, which is far from significant. Hence, we fail to reject the null hypothesis that there is no relationship between variety of cotton and distance between planting rows.
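The Mantel-Haenszel chi-square statistic for a two-way table is commonly written as Q = (n - 1) × r², where r is the Pearson correlation between the row and column scores and n is the total count; it is referred to a chi-square distribution with 1 degree of freedom. The Python sketch below illustrates that formula. The function name, the scores, and the counts are hypothetical and are not taken from the Cotton data set.

```python
import math

def mantel_haenszel_chi_square(table, row_scores, col_scores):
    """Return (Q, p_value) where Q = (n - 1) * r**2 for a table of counts."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            x, y = row_scores[i], col_scores[j]
            n += count
            sx += count * x
            sy += count * y
            sxx += count * x * x
            syy += count * y * y
            sxy += count * x * y
    # Squared Pearson correlation between row and column scores, weighted by cell counts.
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    r_squared = cov * cov / (var_x * var_y)
    q = (n - 1) * r_squared
    # Chi-square upper tail with 1 degree of freedom.
    return q, math.erfc(math.sqrt(q / 2))

# Hypothetical 2x2 counts with scores 0 and 1 for each level (NOT the Cotton data).
q, p_value = mantel_haenszel_chi_square([[10, 20], [20, 10]], [0, 1], [0, 1])
print(f"Q = {q:.4f}, p-value = {p_value:.4f}")
```

For a 2x2 table this statistic equals the Pearson chi-square multiplied by (n - 1)/n, so the two tests usually lead to the same conclusion.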
Logistic Regression
Simple Binary
Source: http://support.sas.com/learn/statlib...g_logreg_1.htm
Problem: Use simple binary logistic regression to describe the relationship between the temperature at the launch time of space shuttles and O-ring thermal distress.
The Analysis of Maximum Likelihood Estimates Table.
Based on the parameter estimates, we have the logistic regression equation
ln[p/(1 - p)] = -15.0429 + 0.2322x.
The p-value for the test of significance is 0.0320. Therefore, at a level of α = 0.05, we can say that temperature at launch time is a significant predictor for the probability of no distress.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/logreg_1.htm#KB7
Rainy Days 1
Use simple logistic regression to predict the probability of rain from temperature.
Problem
A weather record was compiled for the month of April for a city in the eastern United States. The amount of rainfall (Precip), temperature (Temp), and barometric pressure (Pressure) were recorded for each of the 30 days of the month. A variable Rained was added to the set of data to categorize rainfall based on the following formula:
Rained = "Rainy" if Precip > 0.02; "Dry" otherwise
Fit a simple logistic regression function to the data to determine whether temperature is a significant predictor of the probability of rain based on this sample.
Sample Data
The Spring_rain data set is from a weather record for the month of April. The data set contains information on the temperature, precipitation, and barometric pressure for each of the thirty days in the month. A categorical variable Rained is also included, which categorizes rainfall in the following manner:
Rained = "Rainy" if Precip > 0.02; "Dry" otherwise
These are the variables in the data set:
Name  Type  Description 
date  char  date given as mm/dd 
Temp  num  temperature 
Precip  num  amount of rainfall 
Pressure  num  barometric pressure 
Rained  char  categorical—if Precip > 0.02, then Rained = “Rained”; otherwise Rained = “Dry” 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
Based on the output generated by the LOGISTIC procedure in SAS, we find that the coefficient on temperature is very small (0.00863) and the chi-square statistic is not significant (p-value = 0.8707). So, we can conclude that temperature alone is not a significant predictor of the probability of rain.
Space Shuttles
Describe the relationship between the space shuttle temperature and O-ring thermal distress.
Problem
In the 23 launches preceding the Challenger mission, data was collected recording the temperature at launch time and the presence or absence of O-ring thermal distress (coded as 0 for no distress, and 1 for distress).
Find the equation for the logistic regression of the presence or absence of O-ring thermal distress on temperature at launch time. Determine whether launch-time temperature is a significant predictor for the probability of no distress.
Sample Data
The O_ring data set contains data about temperature and O-ring thermal distress for the 23 space shuttle launches preceding the Challenger mission.
These are the variables in the data set:
Name  Type  Description 
flt  num  flight number 
temp  num  temperature at launch time 
td  num  indicator of whether or not there was thermal distress during the launch (0 for no distress, 1 for distress) 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The probability modeled is for td = 0, that is, no distress.
Based on the parameter estimates, we have this logistic regression equation:
ln[p/(1 - p)] = -15.0429 + 0.2322x
The p-value for the test of significance is 0.0320. Therefore, at a level of α = 0.05, we can say that temperature at launch time is a significant predictor for the probability of no distress.
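To see what the fitted equation implies, the log-odds can be converted back to a probability with p = 1/(1 + exp(-log-odds)). The Python sketch below does this, taking the intercept as -15.0429 and the slope as 0.2322. The negative sign on the intercept is an assumption made here (a positive intercept would put the probability of no distress at essentially 1 for every launch temperature), and the temperatures used are illustrative values, not rows from the O_ring data set.

```python
import math

# Coefficients as read from the text; the negative sign on the intercept is an assumption.
INTERCEPT = -15.0429
SLOPE = 0.2322

def prob_no_distress(temp):
    """Probability of no thermal distress at launch temperature `temp`."""
    log_odds = INTERCEPT + SLOPE * temp
    return 1.0 / (1.0 + math.exp(-log_odds))

for temp in (55, 65, 75):  # illustrative temperatures only
    print(temp, round(prob_no_distress(temp), 3))
```

Because the slope is positive, the estimated probability of no distress rises with launch temperature.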
Multiple Binary
Source: http://support.sas.com/learn/statlib...g_logreg_2.htm
Problem: Use multiple binary logistic regression to describe the relationship between purchase level and factors such as age, gender, and income level.
Notice that the probability modeled is for purchase = 1 or “$100 or more.”
Scroll down again to view the Type 3 Analysis of Effects table.
The results of the Wald chi-square test yield a test statistic value of W = 5.9494 and a corresponding p-value of 0.0147 for the gender effect. So, based on these findings we can conclude that gender is a significant predictor for the probability of purchasing $100 or more.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/logreg_2.htm#KB7
Rainy Days 2
Use multiple logistic regression to predict the probability of rain from temperature and barometric pressure.
Problem
A weather record was compiled for the month of April for a city in the eastern United States. The amount of rainfall (Precip), temperature (Temp), and barometric pressure (Pressure) were recorded for each of the 30 days of the month. A variable Rained was added to the set of data to categorize rainfall based on the following formula:
Rained = "Rainy" if Precip > 0.02; "Dry" otherwise
Find the equation of the logistic regression function fitting the probability of “Dry” against temperature and pressure. Use the coefficient of determination to make a statement about how the model fits the data.
Sample Data
The Spring_rain data set is from a weather record for the month of April. The data set contains information on the temperature, precipitation, and barometric pressure for each of the thirty days in the month. A categorical variable Rained is also included, which categorizes rainfall in the following manner:
Rained = "Rainy" if Precip > 0.02; "Dry" otherwise
These are the variables in the data set:
Name  Type  Description 
date  char  date given as mm/dd 
Temp  num  temperature 
Precip  num  amount of rainfall 
Pressure  num  barometric pressure 
Rained  char  categorical—if Precip > 0.02, then Rained = “Rained”; otherwise Rained = “Dry” 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
Using PROC LOGISTIC in SAS, the logistic equation is given by
ln(p/(1 – p)) = -445.8 – 0.0551*Temp + 15.3079*Press
which is equivalent to
p = 1/(1 + exp(-(-445.8 – 0.0551*Temp + 15.3079*Press)))
where p is the prediction probability of “Dry.”
The coefficient of determination has a value of r^2 = 0.3555, which means that only about 36% of the variability in the probability of “Dry” is explained by its regression on temperature and barometric pressure.
Mail Order Customers 2
Describe the relationship between purchase level and factors such as age, gender, and income level.
Problem
A mail-order company has decided that customers who spend $100 or more on purchases should be the focus of its advertising efforts. To help identify this target group, the company collected information from its customers including purchase level (1 = at least $100, 0 = less than $100), gender, income level, and age.
Using the logistic regression of purchase level on gender, income level, and age, determine whether gender is a significant predictor for the probability of purchasing $100 or more.
Sample Data
The Sales data set contains data about customers of a mail-order company.
These are the variables in the data set:
Name  Type  Description 
purchase  num  customer’s purchase level (1 = at least $100, 0 = less than $100) 
age  num  customer’s age 
gender  char  customer’s gender 
income  char  customer’s income level (Low, Medium, High) 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The results of the Wald chi-square test yield a test statistic value of W = 5.9494 and a corresponding p-value of 0.0147 for the gender effect. So, based on these findings we can conclude that gender is a significant predictor for the probability of purchasing $100 or more.
Nonparametric Analyses: Independent Samples
1 Sample
Sign Test
Source: http://support.sas.com/learn/statlib..._nonpara_1.htm
Problem: Perform the sign test to determine if there is a difference between pretest and posttest scores for students in a college course.
The value of the sign test statistic is M = 1, with a p-value of 0.7744. This gives insufficient evidence of a significant difference between pretest and posttest scores for the students.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_1.htm#KB9
College Course Test Scores 1
Determine whether there is a difference between pretest and posttest scores for students.
Problem
An instructor at a community college is interested in examining a set of score changes between a pair of tests given in one of his college courses. He issued a pretest on the first day of class, and after a few weeks of lecture he issued a posttest on the same material. The instructor recorded the scores on both tests, as well as the difference in scores (posttest - pretest) for each student.
Carry out the sign test to determine whether there is evidence that the scores on the posttest were different than those on the pretest for the students.
Sample Data
The Score data set contains data about pretest and posttest scores for students in a college course.
These are the variables in the data set:
Name  Type  Description 
Student  char  student 
PreTest  num  pretest score 
PostTest  num  posttest score 
ScoreChange  num  difference between posttest score and pretest score 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The value of the sign test statistic is M = 1, with a p-value of 0.7744. This gives insufficient evidence of a significant difference between pretest and posttest scores for the students.
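The sign test only uses the number of positive and negative differences. With n+ positive and n- negative differences (zeros dropped), the statistic is M = (n+ - n-)/2, and the two-sided p-value comes from a binomial distribution with success probability 0.5. The Python sketch below illustrates the calculation. The function name is made up, and the counts 7 and 5 are hypothetical: they were not read from the Score data set, although they are consistent with M = 1.

```python
from math import comb

def sign_test(n_pos, n_neg):
    """Return (M, two_sided_p) given counts of positive and negative differences."""
    n = n_pos + n_neg
    m = (n_pos - n_neg) / 2
    k = min(n_pos, n_neg)
    # Two-sided p-value: twice the lower binomial tail under p = 0.5, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return m, min(1.0, 2 * tail)

# Hypothetical counts (not read from the Score data set).
m, p_value = sign_test(n_pos=7, n_neg=5)
print(f"M = {m}, p-value = {p_value:.4f}")
```

Because only the signs enter the calculation, the test makes no assumption about the shape of the distribution of the score changes.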
Wired 1
Determine whether electrical measurements taken outside and inside a chamber differ.
Problem
An experiment was conducted in which electrical measurements were taken on 24 wiring boards. Each board was measured first when soldering was completed, and again after three weeks in a chamber with a controlled environment of high temperature and humidity.
The Shapiro-Wilk W test for Normality for the data yielded a p-value of 0.0076, indicating that the data are significantly non-Normal. Use the sign test to determine if there is a significant difference between the outside and inside chamber measurements, at the α = 0.01 level.
Sample Data
The Chamber data set represents electrical measurements on 24 wiring boards. Measurements were taken both outside and inside a chamber, and the difference between these measurements (outside – inside) was also recorded.
These are the variables in the data set:
Name  Type  Description 
board  num  identifier for wiring board 
outside  num  electrical measurement taken outside the chamber 
inside  num  electrical measurement taken inside the chamber 
diff  num  the difference between the measurements (outside – inside) 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The analysis variable for this procedure is the variable diff, which is calculated as outside – inside. The results of the sign test (as given by the UNIVARIATE procedure in SAS) yield a p-value of 0.0066, which gives evidence of a significant difference between the outside and inside chamber measurements at the α = 0.01 level.
Wilcoxon Signed Rank Test
Source: http://support.sas.com/learn/statlib..._nonpara_2.htm
Problem: Perform the Wilcoxon signed rank test to determine if there is a difference between pretest and posttest scores for students in a college course.
The value of the Wilcoxon signed rank test statistic is S = 8.5, with a p-value of 0.5278. This gives insufficient evidence of a significant difference between pretest and posttest scores for the students.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_2.htm#KB9
College Course Test Scores 2
Determine whether there is a difference between pretest and posttest scores for students.
Problem
An instructor at a community college is interested in examining a set of score changes between a pair of tests given in one of his college courses. He issued a pretest on the first day of class, and after a few weeks of lecture he issued a posttest on the same material. The instructor recorded the scores on both tests, as well as the difference in scores (posttest - pretest) for each student.
Perform the Wilcoxon signed rank test to determine whether there is evidence that the scores on the posttest were different than those on the pretest for the students.
Sample Data
The Score data set contains data about pretest and posttest scores for students in a college course.
These are the variables in the data set:
Name  Type  Description 
Student  char  student 
PreTest  num  pretest score 
PostTest  num  posttest score 
ScoreChange  num  difference between posttest score and pretest score 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The value of the Wilcoxon signed rank test statistic is S = 8.5, with a p-value of 0.5278. This gives insufficient evidence of a significant difference between pretest and posttest scores for the students.
Wired 2
Determine whether electrical measurements taken outside and inside a chamber differ.
Problem
An experiment was conducted in which electrical measurements were taken on 24 wiring boards. Each board was measured first when soldering was completed, and again after three weeks in a chamber with a controlled environment of high temperature and humidity.
The Shapiro-Wilk W test for Normality for the data yielded a p-value of 0.0076, indicating that the data are significantly non-Normal. Use the Wilcoxon signed-rank test to determine if there is a significant difference (at level α = 0.05) between the outside and inside chamber measurements.
Sample Data
The Chamber data set represents electrical measurements on 24 wiring boards. Measurements were taken both outside and inside a chamber, and the difference between these measurements (outside – inside) was also recorded.
These are the variables in the data set:
Name  Type  Description 
board  num  identifier for wiring board 
outside  num  electrical measurement taken outside the chamber 
inside  num  electrical measurement taken inside the chamber 
diff  num  the difference between the measurements (outside – inside) 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The analysis variable for the UNIVARIATE procedure in SAS is the variable diff, which is calculated as outside – inside. The Wilcoxon signed-rank test gives a p-value of 0.0106, which gives evidence of a significant difference between the outside and inside chamber measurements (at the α = 0.05 level).
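The signed-rank statistic S reported by the UNIVARIATE procedure is usually defined as the sum of the ranks of the positive differences minus n(n + 1)/4, where the ranks are taken over the absolute values of the nonzero differences and ties share an average rank. The Python sketch below computes S under that definition. The function name and the differences are hypothetical and are not the Chamber data; the p-value calculation is omitted.

```python
def signed_rank_statistic(diffs):
    """Return S = (sum of ranks of positive differences) - n(n+1)/4, zeros dropped."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    # Indices of the differences, ordered by absolute value.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Find the run of tied absolute values and give each the average rank.
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    t_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    return t_plus - n * (n + 1) / 4

# Hypothetical differences (not the Chamber data); the zero is dropped before ranking.
s = signed_rank_statistic([1.0, -2.0, 3.0, 4.0, 0.0])
print(s)
```

A value of S near zero means the positive and negative differences carry similar rank totals, which is what the null hypothesis of no difference predicts.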
2 Sample
Median Test
Source: http://support.sas.com/learn/statlib..._nonpara_3.htm
Problem: Carry out the two-sample median test to determine if referrals received from physicians differ between types of hospice marketing visits.
 View the first table in the output to see the median score statistics.
 View the results of the two-sample median test for the following hypotheses:
H_{0}: there is no difference in the change in referrals after three months between the two types of visits
against the two-sided alternative
H_{a}: there is a difference in the change in referrals after three months between the two types of visits
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_3.htm#KB9
Hoop Ratings
Examine the correspondence between the team ratings of collegiate basketball polls.
Problem
The Basketball data set contains the preseason ratings of collegiate basketball teams for the 1985-86 season as given by ten media outlets. Use the Spearman correlation to give an assessment of the relationship between the team ratings by the UPI and AP polls.
Sample Data
The Basketball data set gives the preseason ratings of collegiate basketball teams for the 1985-86 season as given by ten media outlets.
These are the variables in the data set:
Name  Type  Description 
school  char  university team represents 
CSN  num  team rating by CSN 
Durham Sun  num  team rating by Durham Sun 
Durham Herald  num  team rating by Durham Herald 
Washington Post  num  team rating by Washington Post 
USA Today  num  team rating by USA Today 
Sports Magazine  num  team rating by Sports Magazine 
In Sport  num  team rating by In Sport 
UPI  num  team rating by UPI 
AP  num  team rating by AP 
Sports Illustrated  num  team rating by Sports Illustrated 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The value of the Spearman correlation coefficient is 0.95. This indicates that there is a very strong positive association between the ratings by the UPI and AP polls. So, we can say that high team ratings by the UPI poll are associated with high ratings by the AP poll, and low ratings by the UPI are associated with low ratings by the AP.
Physician Referrals 1
Determine whether the type of hospice marketing visit affects physician referrals.
Problem
A study was done to investigate the effect of hospice marketing visits on the change in number of referrals received from doctors. There were two types of visits (one accompanied by a physician, and one by the hospice staff only), and the change in referrals after one month (change1) and after three months (change3) were recorded, along with other variables.
Use the median test to determine whether the change in referrals after one month differs for the two types of visits.
Sample Data
The Hospice data set contains data about referrals received from physicians after a visit by a hospice marketing nurse.
These are the variables in the data set:
Name  Type  Description 
ID  num  physician ID 
Practice  char  type of practice 
Date  char  date of visit 
Change3  num  change in number of referrals after 3 months 
Change1  num  change in number of referrals after one month 
Visit  char  type of visit 
Source of Data
This data is sample data from SAS Institute Inc.
Result
You’re testing the following hypotheses:
H_{o}: no difference in change in referrals after one month between the two types of visits
against the two-sided alternative
H_{a}: there is a difference in change in referrals after one month between the two types of visits
The median test statistic is M = 6.9375, with a standardized value of z = 0.4275. The p-value for the two-sided test is 0.6690, which fails to give sufficient evidence against the null hypothesis. So, we cannot conclude that there is a difference in the change in referrals after one month between the two types of visits.
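As a rough illustration of how a two-sample median test can be computed outside SAS, the sketch below scores each observation as 1 if it lies above the pooled median and 0 otherwise, sums the scores in one group, and standardizes that sum with a normal approximation. The function name median_test and the sample values are hypothetical. The simple treatment of values tied at the median is an assumption that may differ from what the NPAR1WAY procedure does, so this code is not expected to reproduce the statistic reported above.

```python
from math import erfc, sqrt
from statistics import median


def median_test(group_a, group_b):
    """Two-sample median test with a normal approximation (illustrative sketch)."""
    pooled = list(group_a) + list(group_b)
    n_a, n = len(group_a), len(pooled)
    pooled_median = median(pooled)
    # Median score: 1 if the value lies above the pooled median, else 0.
    scores = [1.0 if value > pooled_median else 0.0 for value in pooled]
    statistic = sum(scores[:n_a])  # sum of scores in the first group
    mean_score = sum(scores) / n
    expected = n_a * mean_score
    # Permutation variance of a sum of n_a scores drawn without replacement.
    variance = (n_a * (n - n_a) / (n * (n - 1))) * sum(
        (s - mean_score) ** 2 for s in scores
    )
    z = (statistic - expected) / sqrt(variance)
    p_two_sided = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return statistic, z, p_two_sided


# Made-up changes in referrals for the two visit types (not the Hospice data).
physician_visits = [1, 2, 3, 4]
staff_only_visits = [5, 6, 7, 8]
print(median_test(physician_visits, staff_only_visits))
```

A p-value above the chosen significance level, as in the result above, means the null hypothesis of no difference is not rejected.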
Adverse Reactions 1
Test whether adverse reaction times differ significantly between two groups of patients.
Problem
The manufacturers of a medication were concerned about adverse reactions in patients that were treated with their drug. Data on adverse reactions was gathered and stored in a file. The duration of the adverse reaction was recorded as the dependent variable. Patients were either given a placebo or received the standard drug regimen.
Test whether there is a significant difference in adverse reaction times between the two groups using the nonparametric median test. Use a significance level of α = 0.10.
Sample Data
The Adverser data set contains information on patients and their adverse reactions to a drug treatment.
These are the variables in the data set:
Name  Type  Description 
PATIENT_ID  num  patient identification number 
TREATMENT_GROUP  char  treatment patient received (placebo or standard drug) 
TOTAL_DAILY_DOSE  num  daily dosage 
DAY_ON_DRUG  num  number of days patient was on treatment 
AGE  num  age 
SEX  char  sex 
WEIGHT  num  weight 
ADVERSE_REACTION  char  type of adverse reaction 
RACE  char  race 
ADR_SEVERITY  char  level of severity of adverse reaction 
RELATION_TO_DRUG  char  relation of adverse reaction to drug 
ADR_DURATION  char  duration of adverse reaction 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The results of the two-sample median test generated by the NPAR1WAY procedure in SAS show a p-value of 0.1132. This outcome is not significant at the α = 0.10 level (since p-value > 0.10). So we can conclude that the data does not provide strong enough evidence to suggest that there is a significant difference in adverse reaction times between the two treatment groups.
Adverse Reactions 2
Test whether adverse reaction times differ significantly between two groups of patients.
Problem
The manufacturers of a medication were concerned about adverse reactions in patients that were treated with their drug. Data on adverse reactions was gathered and stored in a file. The duration of the adverse reaction was recorded as the dependent variable.
Patients were either given a placebo or received the standard drug regimen. Test whether there is a significant difference in adverse reaction times between the two groups using the nonparametric median test. Use a significance level of α = 0.10.
Sample Data
The Adverser data set contains information on patients and their adverse reactions to a drug treatment.
These are the variables in the data set:
Name  Type  Description 
PATIENT_ID  num  patient identification number 
TREATMENT_GROUP  char  treatment patient received (placebo or standard drug) 
TOTAL_DAILY_DOSE  num  daily dosage 
DAY_ON_DRUG  num  number of days patient was on treatment 
AGE  num  age 
SEX  char  sex 
WEIGHT  num  weight 
ADVERSE_REACTION  char  type of adverse reaction 
RACE  char  race 
ADR_SEVERITY  char  level of severity of adverse reaction 
RELATION_TO_DRUG  char  relation of adverse reaction to drug 
ADR_DURATION  char  duration of adverse reaction 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The results of the two-sample median test generated by the NPAR1WAY procedure in SAS show a p-value of 0.1132. This outcome is not significant at the α = 0.10 level (since p-value > 0.10). So we can conclude that the data does not provide strong enough evidence to suggest that there is a significant difference in adverse reaction times between the two treatment groups.
Wilcoxon Rank Sum Test
Source: http://support.sas.com/learn/statlib..._nonpara_4.htm
Problem: Perform a Wilcoxon rank sum test to determine if there is a significant difference between types of hospice marketing visits for referrals received from physicians.
 View the first table in the output to see the Wilcoxon rank sums statistics.
 View the second table. This table gives the results of the Wilcoxon two-sample test for the following hypotheses:
H_{o}: no difference in change in referrals after one month between the two types of visits
versus the two-sided alternative
H_{a}: there is a difference in change in referrals after one month between the two types of visits
The rank sum statistic is S = 378.50. The table gives results for three different tests. You could choose any of the tests to draw your conclusion, but for our case suppose we wanted to make our inference based on the exact distribution of the rank sum statistic.
 Scroll down to the results of the Exact Test.
Since our alternative is two-sided, we will rely on the two-sided p-value, which is equal to 0.6531. This does not give sufficient evidence to reject the null hypothesis, so we cannot conclude that there is a difference in the change in referrals after one month between the two types of visits.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_4.htm#KB9
Physician Referrals 2
Determine whether the type of hospice marketing visit affects physician referrals.
Problem
A study was done to investigate the effect of hospice marketing visits on the change in number of referrals received from doctors. For the two types of visits (one accompanied by a physician, and one by the hospice staff only), the change in referrals after one month (change1) and after three months (change3) were recorded, along with other variables.
Perform the Wilcoxon rank sum test to determine whether the change in referrals after one month differs for the two types of visits.
Sample Data
The Hospice data set contains data about referrals received from physicians after a visit by a hospice marketing nurse.
These are the variables in the data set:
Name  Type  Description 
ID  num  physician ID 
Practice  char  type of practice 
Date  char  date of visit 
Change3  num  change in number of referrals after 3 months 
Change1  num  change in number of referrals after one month 
Visit  char  type of visit 
Source of Data
This data is sample data from SAS Institute Inc.
Result
You’re testing the following hypotheses:
H_{o}: no difference in change in referrals after one month between the two types of visits
versus the two-sided alternative
H_{a}: there is a difference in change in referrals after one month between the two types of visits
The output of the Wilcoxon two-sample test reports a rank sum statistic of S = 378.50 and gives results for three different tests. You could choose any of the tests to draw your conclusion. However, in this situation, suppose you want to make your inference based on the exact distribution of the rank sum statistic.
Since your alternative is two-sided, you will rely on the two-sided p-value, which is equal to 0.6531. This does not give sufficient evidence to reject the null hypothesis, so you cannot conclude that there is a difference in the change in referrals after one month between the two types of visits.
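For readers working outside SAS, the following is a minimal sketch of the rank sum statistic and its normal approximation. It ranks the pooled observations (averaging ranks for ties), sums the ranks of the first group to get S, and standardizes S. The names midranks and rank_sum_test and the data are hypothetical. No continuity correction is applied and no exact test is computed, so the p-value need not match the exact p-value quoted above.

```python
from collections import Counter
from math import erfc, sqrt


def midranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    counts = Counter(values)
    rank_of, smaller = {}, 0
    for value in sorted(counts):
        rank_of[value] = smaller + (counts[value] + 1) / 2
        smaller += counts[value]
    return [rank_of[value] for value in values]


def rank_sum_test(group_a, group_b):
    """Wilcoxon rank sum statistic S with a two-sided normal approximation."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    ranks = midranks(pooled)
    s = sum(ranks[:n_a])                 # rank sum of the first group
    mean_rank = (n + 1) / 2
    expected = n_a * mean_rank
    # Tie-aware variance; reduces to n_a * n_b * (n + 1) / 12 without ties.
    variance = n_a * n_b / (n * (n - 1)) * sum(
        (r - mean_rank) ** 2 for r in ranks
    )
    z = (s - expected) / sqrt(variance)
    return s, z, erfc(abs(z) / sqrt(2))


# Made-up changes in referrals after one month (not the Hospice data).
print(rank_sum_test([1, 2, 3], [4, 5, 6]))
```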
Teen Growth
Determine whether there is a significant difference in the heights of 15-year-old males and females.
Problem
A study compiled the heights and weights of 39 teenagers, all age 15. Use the Wilcoxon rank sum test to determine if there is a significant difference (at level α = 0.05) in the heights of 15-year-old males and females.
Sample Data
The Htwt15 data set contains the heights and weights of 39 15-year-olds.
These are the variables in the data set:
Name  Type  Description 
gender  char  gender (male or female) 
height  num  height (in inches) 
weight  num  weight (in pounds) 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The NPAR1WAY procedure in SAS gives the results of two approximations to the Wilcoxon test, the Normal and the t approximations (with p-values 0.0002 and 0.0006, respectively), along with the results of the exact test (p-value = 0.000075). All three of these give strong evidence of a significant difference in the heights of 15-year-old males and females (since all p-values are less than α = 0.05).
Kruskal-Wallis Test
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/eg_nonpara_5.htm
Problem: Carry out the Kruskal-Wallis test to determine if there is a significant difference between types of hospice marketing visits for referrals received from physicians.
 View the first table in the output to see the Wilcoxon rank sums statistics.
 View the second table. This table gives the results of the Wilcoxon two-sample test.
 Scroll down to the last table, which gives the results of the Kruskal-Wallis test for the following hypotheses:
H_{o}: no difference in change in referrals after three months between the two types of visits
versus the two-sided alternative
H_{a}: there is a difference in change in referrals after three months between the two types of visits
The Kruskal-Wallis test statistic is K = 1.5991, with a p-value of 0.2060 from the chi-square approximation with 1 degree of freedom. Because the chi-square statistic does not depend on the direction of the difference, this p-value already corresponds to the two-sided alternative. This fails to give sufficient evidence against the null hypothesis, so we cannot conclude that there is a difference in the change in referrals after three months between the two types of visits.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_5.htm#KB9
Physician Referrals 3
Determine whether the type of hospice marketing visit affects physician referrals.
Problem
A study was done to investigate the effect of hospice marketing visits on the change in number of referrals received from doctors. There were two types of visits (one accompanied by a physician, and one by the hospice staff only), and the change in referrals after one month (change1) and after three months (change3) were recorded, along with other variables.
Use the Kruskal-Wallis test to determine whether the change in referrals after three months differs for the two types of visits.
Sample Data
The Hospice data set contains data about referrals received from physicians after a visit by a hospice marketing nurse.
These are the variables in the data set:
Name  Type  Description 
ID  num  physician ID 
Practice  char  type of practice 
Date  char  date of visit 
Change3  num  change in number of referrals after 3 months 
Change1  num  change in number of referrals after one month 
Visit  char  type of visit 
Source of Data
This data is sample data from SAS Institute Inc.
Result
You’re testing the following hypotheses:
H_{o}: no difference in change in referrals after three months between the two types of visits
versus the two-sided alternative
H_{a}: there is a difference in change in referrals after three months between the two types of visits
The Kruskal-Wallis test statistic is K = 1.5991, with a p-value of 0.2060 from the chi-square approximation with 1 degree of freedom. Because the chi-square statistic does not depend on the direction of the difference, this p-value already corresponds to the two-sided alternative. This fails to give sufficient evidence against the null hypothesis, so we cannot conclude that there is a difference in the change in referrals after three months between the two types of visits.
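The Kruskal-Wallis statistic is usually referred to a chi-square distribution with (number of groups − 1) degrees of freedom, which is 1 for the two visit types. As a rough check on the figures above, this sketch computes the chi-square upper-tail probability for K = 1.5991 with 1 degree of freedom, using the fact that such a variable is the square of a standard normal. The function name is hypothetical, and the result should come out close to 0.2060.

```python
from math import erfc, sqrt


def chi2_upper_tail_1df(statistic):
    """P(X > statistic) for a chi-square variable X with 1 degree of freedom."""
    # X = Z**2 for a standard normal Z, so P(X > x) = P(|Z| > sqrt(x)),
    # which equals erfc(sqrt(x / 2)).
    return erfc(sqrt(statistic / 2))


k = 1.5991  # Kruskal-Wallis statistic reported for the Hospice data
print(chi2_upper_tail_1df(k))
```

Since the tail probability is taken on the squared scale, deviations in either direction are already counted.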
Adverse Reactions 3
Test whether adverse reaction times differ significantly based on the gender of patients.
Problem
The manufacturers of a medication were concerned about adverse reactions in patients that were treated with their drug. Data on adverse reactions was gathered and stored in a file. The duration of the adverse reaction was recorded as the dependent variable.
Patients were either given a placebo or received the standard drug regimen. Test whether there is a difference in adverse reaction times between males and females using the two-sample Kruskal-Wallis test. Use a significance level of α = 0.05.
Sample Data
The Adverser data set contains information on patients and their adverse reactions to a drug treatment.
These are the variables in the data set:
Name  Type  Description 
PATIENT_ID  num  patient identification number 
TREATMENT_GROUP  char  treatment patient received (placebo or standard drug) 
TOTAL_DAILY_DOSE  num  daily dosage 
DAY_ON_DRUG  num  number of days patient was on treatment 
AGE  num  age 
SEX  char  sex 
WEIGHT  num  weight 
ADVERSE_REACTION  char  type of adverse reaction 
RACE  char  race 
ADR_SEVERITY  char  level of severity of adverse reaction 
RELATION_TO_DRUG  char  relation of adverse reaction to drug 
ADR_DURATION  char  duration of adverse reaction 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The results of the Kruskal-Wallis test generated by the NPAR1WAY procedure in SAS show a p-value of 0.6356. This outcome is not significant at the α = 0.05 level. So we can conclude that the data does not provide evidence to suggest that there is a difference in adverse reaction times between males and females.
k Samples
Kruskal-Wallis Test
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_6.htm#KB9
Synthetic Wood Veneer
Determine whether there is a difference in the durability of three brands of synthetic wood veneer.
Problem
An experiment was conducted to investigate the durability of three brands of synthetic wood veneer that is often used in office furniture and on kitchen countertops. Samples of each brand were subjected to a friction test to determine durability. The amount of veneer material that is worn away due to friction was measured. Brands that have small measurements are desirable.
Carry out the Kruskal-Wallis test to see if the data indicate a significant difference in the distributions of wear measurements for the three brands of synthetic wood veneer.
Sample Data
The Veneer data set contains data about the wear recorded for three brands of synthetic wood veneer after they were subjected to a friction test.
These are the variables in the data set:
Name  Type  Description 
Wear  num  measurement of wear 
Brand  char  brand of wood veneer 
Source of Data
This data is sample data from SAS Institute Inc.
Result
You’re testing the following hypotheses:
H_{o}: the distributions of wear measurements for the three brands are all equal
versus the alternative
H_{a}: the distributions of wear measurements for the three brands are not all equal
The Kruskal-Wallis test statistic is K = 7.2558, with a p-value of 0.0266. The outcome is significant at a level α = 0.05, so you have enough evidence to reject the null hypothesis at this level.
Thus, you can conclude that the distributions of wear measurements for the three brands of synthetic wood veneer are not all equal (that is, there is a significant difference in durability between the brands).
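The sketch below shows one way the Kruskal-Wallis statistic could be computed for k groups, together with a chi-square upper-tail probability for integer degrees of freedom. The function names kruskal_wallis and chi2_sf and the wear values are hypothetical, and the Veneer data themselves are not reproduced here. With three brands there are 2 degrees of freedom, and chi2_sf(7.2558, 2) should land close to the p-value reported above.

```python
from collections import Counter
from math import erfc, exp, pi, sqrt


def kruskal_wallis(groups):
    """Return (H, degrees of freedom) with the usual correction for ties."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    counts = Counter(pooled)
    rank_of, smaller = {}, 0
    for value in sorted(counts):          # midrank for each distinct value
        rank_of[value] = smaller + (counts[value] + 1) / 2
        smaller += counts[value]
    h = 12 / (n * (n + 1)) * sum(
        sum(rank_of[v] for v in group) ** 2 / len(group) for group in groups
    ) - 3 * (n + 1)
    tie_correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    return h / tie_correction, len(groups) - 1


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    half = x / 2
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return exp(-half) * total
    total = erfc(sqrt(half))              # df = 1 part
    term = sqrt(2 * x / pi) * exp(-half)  # extra term needed from df = 3
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total


# Made-up wear measurements for three brands (not the Veneer data).
h, df = kruskal_wallis([[1, 2], [3, 4], [5, 6]])
print(h, df, chi2_sf(h, df))
print(chi2_sf(7.2558, 2))  # tail probability for the reported statistic
```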
Adverse Reactions 4
Test whether adverse reaction times differ significantly based on the race of patients.
Problem
The manufacturers of a medication were concerned about adverse reactions in patients that were treated with their drug. Data on adverse reactions was gathered and stored in a file. The duration of the adverse reaction was recorded as the dependent variable.
Patients were either given a placebo or received the standard drug regimen. Test whether there is a difference in adverse reaction times based on the race of patients using the k-sample Kruskal-Wallis test. Use a significance level of α = 0.01.
Sample Data
The Adverser data set contains information on patients and their adverse reactions to a drug treatment.
These are the variables in the data set:
Name  Type  Description 
PATIENT_ID  num  patient identification number 
TREATMENT_GROUP  char  treatment patient received (placebo or standard drug) 
TOTAL_DAILY_DOSE  num  daily dosage 
DAY_ON_DRUG  num  number of days patient was on treatment 
AGE  num  age 
SEX  char  sex 
WEIGHT  num  weight 
ADVERSE_REACTION  char  type of adverse reaction 
RACE  char  race 
ADR_SEVERITY  char  level of severity of adverse reaction 
RELATION_TO_DRUG  char  relation of adverse reaction to drug 
ADR_DURATION  char  duration of adverse reaction 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The NPAR1WAY procedure in SAS yields a p-value < 0.001, which is significant at the α = 0.01 level. Therefore, we can conclude that there is enough evidence to support the claim that there is a difference in adverse reaction times based on race.
Nonparametric Analyses: Dependent Samples
2 Sample
Wilcoxon Signed Rank Test
Source: http://support.sas.com/learn/statlib..._nonpara_7.htm
Problem: Perform the Wilcoxon signed rank test to determine if there is a significant difference between pretest and posttest scores for students in a college course.
The value of the signed rank test statistic is S = 8.5, with a p-value of 0.5278. This gives insufficient evidence of a significant difference between pretest and posttest scores for the students.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_7.htm#KB9
College Course Test Scores 3
Determine whether there is a difference between pretest and posttest scores.
Problem
An instructor at a community college is interested in examining a set of score changes between a pair of tests given in one of his college courses. He issued a pretest on the first day of class, and after a few weeks of lecture he issued a posttest on the same material. The instructor recorded the scores on both tests, as well as the difference in scores (posttest – pretest) for each student.
Carry out the sign test to determine whether there is evidence that the scores on the posttest were different from those on the pretest for the students.
Sample Data
The Score data set contains data about pretest and posttest scores for students in a college course.
These are the variables in the data set:
Name  Type  Description 
Student  char  student 
PreTest  num  pretest score 
PostTest  num  posttest score 
ScoreChange  num  difference between posttest score and pretest score 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The value of the sign test statistic is M = 1, with a p-value of 0.7744. This gives insufficient evidence of a significant difference between pretest and posttest scores for the students.
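To show how a sign test result of this kind can arise, here is a minimal sketch of an exact two-sided sign test. It assumes the statistic takes the form M = (number of positive differences − number of negative differences) / 2 and that the p-value comes from a binomial distribution with success probability 0.5. The function name sign_test is hypothetical, and the score changes are made up: a split of 7 positive and 5 negative differences was chosen only because it is one configuration that gives M = 1, not because it is known to match the Score data set.

```python
from math import comb


def sign_test(differences):
    """Exact two-sided sign test on paired differences (zeros are dropped)."""
    n_pos = sum(1 for d in differences if d > 0)
    n_neg = sum(1 for d in differences if d < 0)
    n = n_pos + n_neg
    m = (n_pos - n_neg) / 2
    # Two-sided p-value: twice the smaller binomial tail, capped at 1.
    smaller = min(n_pos, n_neg)
    tail = sum(comb(n, j) for j in range(smaller + 1)) / 2 ** n
    return m, min(1.0, 2 * tail)


# Hypothetical score changes: 7 improvements and 5 declines.
score_changes = [3, 5, 2, 8, 1, 4, 6, -2, -1, -7, -3, -4]
print(sign_test(score_changes))
```

Only the signs of the differences enter the calculation, so the magnitudes of the made-up changes do not affect the output.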
Wired 3
Determine whether electrical measurements taken outside and inside a chamber differ.
Problem
An experiment was conducted in which electrical measurements were taken on 24 wiring boards. Each board was measured first when soldering was completed, and again after three weeks in a chamber with a controlled environment of high temperature and humidity.
The Shapiro-Wilk W test for Normality for the data yielded a p-value of 0.0076, indicating that the data are significantly non-Normal. Use the Wilcoxon signed-rank test to determine if there is a significant difference (at level α = 0.05) between the outside and inside chamber measurements.
Sample Data
The Chamber data set represents electrical measurements on 24 wiring boards. Measurements were taken both outside and inside a chamber, and the difference between these measurements (outside – inside) was also recorded.
These are the variables in the data set:
Name  Type  Description 
board  num  identifier for wiring board 
outside  num  electrical measurement taken outside the chamber 
inside  num  electrical measurement taken inside the chamber 
diff  num  the difference between the measurements (outside – inside) 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The two dependent samples for this analysis were the paired outside and inside chamber measurements taken for each board. From the UNIVARIATE procedure in SAS, the Wilcoxon signed-rank test gives a p-value of 0.0106, which gives evidence of a significant difference between the outside and inside chamber measurements.
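A signed-rank analysis of paired differences can be sketched as follows. The code ranks the absolute nonzero differences, sums the ranks of the positive ones, centers that sum at its null expectation n(n + 1)/4, and applies a normal approximation. The function names and the differences are hypothetical. SAS may use an exact or a Student's t based p-value instead, so this is an illustration of the idea and is not expected to reproduce 0.0106.

```python
from collections import Counter
from math import erfc, sqrt


def midranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    counts = Counter(values)
    rank_of, smaller = {}, 0
    for value in sorted(counts):
        rank_of[value] = smaller + (counts[value] + 1) / 2
        smaller += counts[value]
    return [rank_of[value] for value in values]


def signed_rank_test(differences):
    """Centered signed-rank statistic S with a two-sided normal approximation."""
    d = [x for x in differences if x != 0]       # zero differences are dropped
    n = len(d)
    ranks = midranks([abs(x) for x in d])
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    s = w_plus - n * (n + 1) / 4                 # centered at its null mean
    variance = sum(r * r for r in ranks) / 4     # null variance, valid with ties
    z = s / sqrt(variance)
    return s, z, erfc(abs(z) / sqrt(2))


# Made-up outside minus inside differences (not the Chamber data).
print(signed_rank_test([1, 2, 3, 4, 5, -6]))
```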
k Samples
Friedman Test
Source: http://support.sas.com/learn/statlib..._nonpara_8.htm
Problem: Perform the Friedman test to determine if there is a difference in the effect on skin potential for four emotions induced by hypnosis.
The first eight tables in the output give the frequency tables for emotion by skin response for each subject.
Scroll down to the last table to see the results of the Friedman test.
In our setting, the Friedman test statistic is identical to the Cochran-Mantel-Haenszel Row Mean Scores Differ statistic. The value of this test statistic is 6.45, with a corresponding p-value of 0.0917. So at a significance level of α = 0.10, we can say that the differences in skin potential for the four emotions are significant.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_8.htm#KB9
Skin Potential for Emotions
Determine how hypnosis affects the skin potential for four different emotions.
Problem
A study was conducted to investigate whether hypnosis has the same effect on skin potential for four different emotions. Subjects were asked to display fear, joy, sadness and calmness under hypnosis, and the resulting skin potential (measured in millivolts) was recorded for each emotion.
Use the Friedman test to determine whether the data suggests that there is a difference in the effect on skin potential for the emotions.
Sample Data
The Hypnosis data set contains data about the skin response that was recorded for subjects who were asked to display four different emotions under hypnosis.
These are the variables in the data set:
Name  Type  Description 
Emotion  char  emotion (fear, joy, sadness, or calmness) 
Subject  num  subject identifier 
SkinResponse  num  skin response (in millivolts) 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The value of the Friedman test statistic is 6.45, with a corresponding p-value of 0.0917. So at a significance level of α = 0.10, we can say that the differences in skin potential for the four emotions are significant.
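The Friedman test ranks the responses within each subject and then compares the rank totals of the treatments. The sketch below computes a tie-aware form of the statistic and, for the four-emotion case (3 degrees of freedom), a chi-square upper-tail probability. The function names and the table of readings are hypothetical. Evaluating the tail probability at 6.45 should give a value close to the reported 0.0917.

```python
from collections import Counter
from math import erfc, exp, pi, sqrt


def midranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    counts = Counter(values)
    rank_of, smaller = {}, 0
    for value in sorted(counts):
        rank_of[value] = smaller + (counts[value] + 1) / 2
        smaller += counts[value]
    return [rank_of[value] for value in values]


def friedman_statistic(table):
    """table has one row per subject and one column per treatment."""
    n, k = len(table), len(table[0])
    ranks = [midranks(row) for row in table]          # rank within each subject
    column_sums = [sum(row[j] for row in ranks) for j in range(k)]
    centre = n * (k + 1) / 2                          # expected column rank sum
    numerator = (k - 1) * sum((c - centre) ** 2 for c in column_sums)
    denominator = sum(r * r for row in ranks for r in row) - n * k * (k + 1) ** 2 / 4
    return numerator / denominator, k - 1


def chi2_upper_tail_3df(x):
    """P(X > x) for a chi-square variable with 3 degrees of freedom."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


# Made-up skin potential readings: columns are fear, joy, sadness, calmness.
readings = [
    [23.1, 22.7, 22.5, 22.6],
    [57.6, 53.2, 53.7, 53.1],
    [10.5, 9.7, 10.8, 8.3],
]
print(friedman_statistic(readings))
print(chi2_upper_tail_3df(6.45))
```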
Adverse Reactions 5
Test whether the severity of adverse reactions differs based on the reaction's relation to the drug.
Problem
The manufacturers of a medication were concerned about adverse reactions in patients that were treated with their drug. Data on adverse reactions was gathered and stored in a file. The duration of the adverse reaction was recorded as the dependent variable. Patients were either given a placebo or received the standard drug regimen.
Test whether there is a difference in the severity of adverse reactions based on the relation of the reaction to the drug using the Friedman test. Use a significance level of α = 0.05.
Sample Data
The Adverser data set contains information on patients and their adverse reactions to a drug treatment.
These are the variables in the data set:
Name  Type  Description 
PATIENT_ID  num  patient identification number 
TREATMENT_GROUP  char  treatment patient received (placebo or standard drug) 
TOTAL_DAILY_DOSE  num  daily dosage 
DAY_ON_DRUG  num  number of days patient was on treatment 
AGE  num  age 
SEX  char  sex 
WEIGHT  num  weight 
ADVERSE_REACTION  char  type of adverse reaction 
RACE  char  race 
ADR_SEVERITY  char  level of severity of adverse reaction 
RELATION_TO_DRUG  char  relation of adverse reaction to drug 
ADR_DURATION  char  duration of adverse reaction 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The FREQ procedure in SAS yields a p-value of 0.0061, which is significant at the α = 0.05 level. Therefore, we can conclude that there is enough evidence to support the claim that there is a difference in the severity of adverse reactions based on the relation of the reaction to the drug.
Spearman Correlation
Source: http://support.sas.com/learn/statlib..._nonpara_9.htm
Problem: Determine the Spearman correlation coefficient to measure the strength of the monotonic relationship between arts and economics.
View the first table in the output to see the variables involved in the analysis. View the second table to see the value of the Spearman correlation coefficient for the two variables.
The correlation value of 0.27926 suggests that there is a weak positive relationship between arts and economics.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/nonpara_9.htm#KB9
Arts and Economics in Cities
Measure the relationship between arts and economics in western cities.
Problem
In a study to investigate the prevalence of relationships between different socioeconomic factors, 52 western cities were rated by nine criteria: climate and terrain, housing, health care and environment, crime, transportation, education, arts, recreation, and economics. For housing and crime, the lower the rating score, the better. For the remaining seven criteria, the higher the score, the better.
Find the correlation of the arts and economics ratings for this sample data. Based on your value for the Spearman correlation coefficient, would you describe the monotonic relationship between arts and economics as strong or weak, positive or negative?
Sample Data
The Westernrates data set contains data about ratings on nine criteria (climate and terrain, housing, health care and environment, crime, transportation, education, arts, recreation, and economics) for 52 western cities.
These are the variables in the data set:
Name  Type  Description 
City  char  city 
State  char  state 
ClimateTerrain  num  rating of climate and terrain 
Housing  num  rating of housing 
HealthCareEnvironment  num  rating of health care and environment 
Crime  num  rating of crime 
Transportation  num  rating of transportation 
Education  num  rating of education 
Arts  num  rating of the arts 
Recreation  num  rating of recreation 
Economics  num  rating of economics 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The correlation value of 0.27926 suggests that there is a weak positive relationship between arts and economics.
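The Spearman coefficient is the ordinary Pearson correlation applied to ranks, so it measures how consistently one variable increases with the other, whether or not the relationship is a straight line. A minimal sketch, with hypothetical function names and made-up ratings in place of the Westernrates data:

```python
from collections import Counter
from math import sqrt


def midranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    counts = Counter(values)
    rank_of, smaller = {}, 0
    for value in sorted(counts):
        rank_of[value] = smaller + (counts[value] + 1) / 2
        smaller += counts[value]
    return [rank_of[value] for value in values]


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the midranks."""
    rank_x, rank_y = midranks(x), midranks(y)
    mean_rank = (len(x) + 1) / 2          # the mean of n midranks is (n + 1) / 2
    covariance = sum((a - mean_rank) * (b - mean_rank) for a, b in zip(rank_x, rank_y))
    spread_x = sum((a - mean_rank) ** 2 for a in rank_x)
    spread_y = sum((b - mean_rank) ** 2 for b in rank_y)
    return covariance / sqrt(spread_x * spread_y)


# Made-up ratings for five cities (not the Westernrates data).
arts = [1, 2, 3, 4, 5]
economics = [2, 1, 4, 3, 5]
print(spearman(arts, economics))
```

A value near 0, such as the 0.27926 above, indicates that cities rated high on one criterion are only slightly more likely to be rated high on the other.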
Hoop Ratings
Examine the correspondence between the team ratings of collegiate basketball polls.
Problem
The Basketball data set contains the preseason ratings of collegiate basketball teams for the 1985-86 season as given by ten media outlets. Use the Spearman correlation to give an assessment of the relationship between the team ratings by the UPI and AP polls.
Sample Data
The Basketball data set gives the preseason ratings of collegiate basketball teams for the 1985-86 season as given by ten media outlets.
These are the variables in the data set:
Name  Type  Description 
school  char  university team represents 
CSN  num  team rating by CSN 
Durham Sun  num  team rating by Durham Sun 
Durham Herald  num  team rating by Durham Herald 
Washington Post  num  team rating by Washington Post 
USA Today  num  team rating by USA Today 
Sports Magazine  num  team rating by Sports Magazine 
In Sport  num  team rating by In Sport 
UPI  num  team rating by UPI 
AP  num  team rating by AP 
Sports Illustrated  num  team rating by Sports Illustrated 
Source of Data
Sall, J., Creighton, L., & Lehman, A. (2006). JMP Start Statistics, Third Edition. Cary, NC: SAS Institute Inc.
Result
The value of the Spearman correlation coefficient (given by the CORR procedure in SAS) is 0.95. This indicates that there is a very strong positive association between the ratings by the UPI and AP polls. So, we can say that high team ratings by the UPI poll are associated with high ratings by the AP poll, and low ratings by the UPI are associated with low ratings by the AP.
Survival Analysis
Test of Equality over Strata
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/eg_surv_1.htm
Problem: Perform a test of equality over strata to determine if receiving maintenance therapy is effective in lengthening the time in remission.
View the first part of the output to see the Kaplan-Meier survival estimates. Scroll down to see the rest of the table.
Scroll down again to view the summary statistics for MonitorTime for the control group.
Scroll down again to view survival estimates for the maintenance therapy group. Scroll down to see the rest of the table.
Scroll down again to view summary statistics for MonitorTime for the maintenance therapy group.
Scroll down again to the last table in the output, the results of the Test of Equality over Strata.
The likelihood ratio test should be used if the distribution of the event times is found to be exponential. The results of both the log-rank and Wilcoxon tests provide no evidence of a difference in survival curves between patients who received the maintenance therapy and those who did not.
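For readers who want to see how the survival estimates in the tables above are built, here is a minimal Python sketch of the Kaplan-Meier product-limit estimator. It is not the SAS implementation. The function name, the event coding (1 = relapse observed, 0 = censored) and the six observations are assumptions made for illustration; they are not the Aml_survival data or the coding of its Censored_ variable.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    events[i] is 1 if the event was observed at times[i], 0 if censored
    (assumed coding for this sketch).
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose recorded time equals t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after t.
        n_at_risk -= removed
    return curve


# Hypothetical remission times (illustration only).
times = [5, 8, 8, 12, 16, 23]
events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 4))
```

Censored observations never lower the curve directly; they only shrink the risk set used at later event times.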
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/surv_1.htm#KB8
Maintenance Therapy
Determine whether receiving maintenance therapy lengthens remission time.
Problem
A trial was conducted at Stanford University to investigate the effectiveness of maintenance therapy for acute myelogenous leukemia (AML). After being treated by chemotherapy until remission, patients were randomized into two groups: one that received maintenance therapy and a control group that did not. The patients were observed until they suffered a relapse, the event of interest. The event time variable is the length of time in remission, that is, the time from entry into the study (being randomized into a group) until relapse.
Use the Kaplan-Meier method to estimate the proportion of patients in remission for each group. Carry out a test of equality over strata to determine whether there is a significant difference in remission time between the patients who received maintenance therapy and those who did not.
Sample Data
The Aml_survival data set contains data about the efficacy of maintenance therapy for acute myelogenous leukemia (AML).
These are the variables in the data set:
Name  Type  Description 
MonitorTime  num  length of time in remission (the time from entry into the study until relapse) 
Treatment  char  indicator of whether or not the patient received maintenance therapy 
Censored_  num  indicator of whether the patient suffered a relapse (0 = patient did not suffer a relapse, 
Source of Data
This data is sample data from SAS Institute Inc.
Result
The likelihood ratio test should be used if the distribution of the event times is found to be exponential. The results of both the log-rank and Wilcoxon tests provide no evidence of a difference in remission time between patients who received the maintenance therapy and those who did not.
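If your own software does not offer a test of equality over strata, the two-group log-rank statistic can be computed by hand. The sketch below is a hedged Python illustration of the standard calculation (observed minus expected events in one group, with the hypergeometric variance), not a reproduction of the SAS output. The function name, the event coding (1 = event observed, 0 = censored) and the sample values are hypothetical.

```python
import math


def logrank_test(times1, events1, times2, events2):
    """Two-group log-rank test; returns (chi-square with 1 df, p-value).

    events are coded 1 = event observed, 0 = censored (assumed coding).
    """
    pooled = [(t, e, 1) for t, e in zip(times1, events1)]
    pooled += [(t, e, 2) for t, e in zip(times2, events2)]
    event_times = sorted({t for t, e, _ in pooled if e == 1})

    observed1 = expected1 = variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in pooled if u >= t)               # at risk, both groups
        n1 = sum(1 for u, _, g in pooled if u >= t and g == 1)   # at risk, group 1
        d = sum(1 for u, e, _ in pooled if u == t and e == 1)    # events at t, both groups
        d1 = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 1)
        observed1 += d1
        expected1 += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)

    chi2 = (observed1 - expected1) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical remission times for two small groups (illustration only).
chi2, p = logrank_test([3, 4], [1, 1], [1, 2], [1, 1])
print(round(chi2, 3), round(p, 3))
```

A large p-value, as in the result above, means the observed event counts are close to what would be expected if both groups shared one survival curve.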
Comparing Survival Functions
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/eg_surv_2.htm
Problem: Compare the survival functions for two behavior types by examining the survival curves.
The tables in the output give the Kaplan-Meier survival estimates and the summary statistics for the Type A and Type B levels of Personality. The tables can be viewed in their entirety at the Table link at the end of this section.
Scroll down to the graph of the survival distribution functions.
It appears that the survival curve for the Type B behavior group lies above that of the Type A behavior group from time = 0 days to about time = 2850 days. So, we can say that the more relaxed, noncompetitive individuals had a more favorable survival experience for the first 5 years of the study.
Solve Exercises in Your Own Statistical Software
Source: http://support.sas.com/learn/statlibrary/statlib_eg4.2/surv_2.htm#KB8
Heart Disease
Compare the effects of two behavior types on the incidence of heart disease.
Problem
A prospective study was performed to investigate the effects of behavior type and smoking habits on heart disease. Participants were followed for nine years, and the time variable of interest was the interval (in days) from entry into the study until the appearance of coronary heart disease. Individuals were classified by two types of behavior on the basis of an interview: Type A is characterized by aggressiveness and competitiveness, and Type B is considered more relaxed and noncompetitive.
Use the plot of the survival curves to determine which behavior type yielded the more favorable “survival experiences” among the participants for the first 5 years (roughly 1825 days) of the study.
Sample Data
The Wcgs data set contains event time and censor variables for 614 participants, as well as measurements of two covariates of interest: smoking behavior at study entry and behavior type.
These are the variables in the data set:
Name  Type  Description 
Cigarettes  num  number of cigarettes smoked per day 
Personality  char  A (Type A) or B (Type B) 
Censor  num  censor (0 or 1) 
Time  num  number of days from entry into study until the appearance of coronary heart disease 
Source of Data
This data is sample data from SAS Institute Inc.
Result
It appears that the survival curve for the Type B behavior group lies above that of the Type A behavior group from time = 0 days to about time = 2850 days. So, we can say that the more relaxed, noncompetitive individuals had a more favorable survival experience for the first 5 years of the study.
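Reading which curve "lies above" the other can also be checked numerically. The sketch below treats each estimated survival curve as a right-continuous step function and compares the two at every point where either one changes, up to a chosen horizon. The function names and both curves are hypothetical values invented for illustration; they are not the Wcgs estimates.

```python
from bisect import bisect_right


def survival_at(curve, t):
    """Value at time t of a step survival curve given as sorted (time, S) pairs.

    The curve equals 1.0 before its first event time and holds each value
    until the next event time.
    """
    times = [time for time, _ in curve]
    k = bisect_right(times, t)
    return 1.0 if k == 0 else curve[k - 1][1]


def lies_on_or_above(upper, lower, horizon):
    """True if `upper` is never below `lower` on the interval [0, horizon]."""
    # Step functions only change at event times, so checking those is enough.
    change_points = {0, horizon} | {t for t, _ in upper + lower if t <= horizon}
    return all(survival_at(upper, t) >= survival_at(lower, t) for t in change_points)


# Hypothetical survival estimates, time in days (illustration only).
type_a = [(400, 0.97), (1100, 0.93), (1700, 0.90), (2600, 0.86)]
type_b = [(700, 0.99), (1500, 0.96), (2400, 0.93), (3000, 0.85)]

print(lies_on_or_above(type_b, type_a, 1825))  # first 5 years
print(lies_on_or_above(type_b, type_a, 3000))  # longer horizon
```

With these made-up curves the Type B curve stays on or above the Type A curve through day 1825 but not through day 3000, the same kind of pattern described in the result above.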
Table: http://support.sas.com/learn/statlibrary/statlib_eg4.2/images/eg_surv_2_tables.htm