Table of contents
  1. Story
  2. Slides
    1. Slide 1 Effective Applications of the R Language
    2. Slide 2 Analytic Challenges for Enterprises
    3. Slide 3 R Can Help...But Has Its Own Challenges
    4. Slide 4 Why Embrace R?
    5. Slide 5 TIBCO Enterprise Runtime for R (TERR)
    6. Slide 6 TERR Performance
    7. Slide 7 TIBCO Spotfire Visual Analytics
    8. Slide 8 TIBCO Spotfire Predictive Analytics Ecosystem
    9. Slide 9 Example 1: Embedded TERR in Spotfire
    10. Slide 10 Power of Embedded Advanced Analytics
    11. Slide 11 Advanced Analytics Applications in Spotfire
    12. Slide 12 Example 2: TERR in TIBCO's Complex Event Processing
    13. Slide 13 Logistics Optimization
    14. Slide 14 Predictive Maintenance for Oil & Gas
    15. Slide 15 TERR Ecosystem
    16. Slide 16 TERR for Individual R Users
    17. Slide 17 TERR is R for the Enterprise
    18. Slide 18 Learn More and Try It Yourself
    19. Slide 19 TIBCO's Unique History with R/S
  3. Slides
    1. Slide 1 Extending the Reach of R to the Enterprise: TIBCO Enterprise Runtime for R
    2. Slide 2 Extending the Reach of R to the Enterprise
    3. Slide 3 Our (and My) Journey to TERR
    4. Slide 4 Why embrace R?
    5. Slide 5 What does “Embracing R” look like?
    6. Slide 6 Enterprise Challenges for R
    7. Slide 7 TIBCO Enterprise Runtime for R (TERR)
    8. Slide 8 Providing value for organizations who use R
    9. Slide 9 Providing Value for individuals who use R
    10. Slide 10 TERR Examples
    11. Slide 11 TERR vs. R Raw Performance
    12. Slide 12 TERR in Spotfire: Predictive Modeling
    13. Slide 13 Real time Fraud Detection
    14. Slide 14 Extending the Reach of R to the Enterprise
    15. Slide 15 Learn more and Try it yourself
  4. Tips and Tricks: A TIBCO Spotfire Blog
    1. Adding R Graphics to Spotfire (Case Study: Dendrograms)
    2. Interactively draw territories on a map with a TERR data function
    3. Accessing Third Party Data Sources Using RinR
    4. Add your own functions using R
  5. Spotfire Dashboard
  6. Spotfire Dashboard
  7. Research Notes
  8. How Much Did It Rain? II
    1. Competition Details
      1. Predict hourly rainfall using data from polarimetric radars
      2. Competition Description
      3. Acknowledgements
    2. Get the Data
      1. Data Files
      2. File descriptions
        1. Marshall–Palmer relation
        2. drop-size distribution
        3. radar reflectivity factor
        4. rainfall rate
      3. Data columns
      4. Referencing this data
    3. Competition Rules
      1. One account per participant
      2. No private sharing outside teams
      3. Public dissemination of entries
      4. Open licensing of winners
      5. Winning solutions must be posted or linked to in the forums.
      6. Team Mergers
      7. Team Limits
      8. Submission Limits
      9. Competition Timeline
      10. COMPETITION-SPECIFIC TERMS
      11. COMPETITION FRAMEWORK
      12. ELIGIBILITY
      13. SUBMISSIONS
      14. INTELLECTUAL PROPERTY
        1. DATA
        2. EXTERNAL DATA
        3. CODE SHARING
        4. OPEN-SOURCE CODE
      15. WINNING
        1. DETERMINING WINNERS
        2. RESOLVING TIES
        3. DECLINING PRIZES
      16. WINNERS' OBLIGATIONS
        1. DELIVERY & DOCUMENTATION
        2. PARTICIPANT INTELLECTUAL PROPERTY LICENSING
        3. PUBLIC COMPETITIONS: NON-EXCLUSIVE LICENSE
        4. RESEARCH COMPETITIONS: OPEN SOURCE LICENSE
        5. RECRUITING COMPETITION SUBMISSION LICENSE GRANT
        6. CHEATING
        7. RECEIVING PRIZES
      17. TEAMS
        1. FORMING A TEAM
        2. TEAM MERGERS
        3. TEAM PRIZES
      18. WARRANTIES AND OBLIGATIONS
        1. PARTICIPANT WARRANTIES AND OBLIGATIONS
        2. LIMITATION OF LIABILITY
        3. RESERVATION OF RIGHTS
      19. MISCELLANEOUS
        1. SEVERABILITY
        2. LAW
      20. Rules Acceptance
  9. Statistics Essentials For Dummies
    1. About the Author
    2. Introduction
      1. About This Book
      2. Conventions Used in This Book
      3. Foolish Assumptions
      4. Icons Used in This Book
      5. Where to Go from Here
      6. The 5th Wave
    3. Chapter 1: Statistics in a Nutshell
      1. Designing Studies
      2. Surveys
        1. Experiments
      3. Collecting Data
        1. Selecting a good sample
        2. Avoiding bias in your data
      4. Describing Data
        1. Descriptive statistics
        2. Charts and graphs
      5. Analyzing Data
      6. Making Conclusions
    4. Chapter 2: Descriptive Statistics
      1. Types of Data
      2. Counts and Percents
      3. Measures of Center
      4. Measures of Variability
      5. Percentiles
      6. Interpreting percentiles
        1. Table 2-1 U.S. Household Income for 2001
      7. The Five-Number Summary
    5. Chapter 3: Charts and Graphs
    6. Chapter 4: The Binomial Distribution
    7. Chapter 5: The Normal Distribution
    8. Chapter 6: Sampling Distributions and the Central Limit Theorem
    9. Chapter 7: Confidence Intervals
    10. Chapter 8: Hypothesis Tests
    11. Chapter 9: The t-distribution
    12. Chapter 10: Correlation and Regression
      1. Picturing the Relationship with a Scatterplot
      2. Making a scatterplot
      3. Interpreting a scatterplot
        1. Figure 10-1: Scatterplot of cricket chirps versus outdoor temperature.
      4. Measuring Relationships Using the Correlation
        1. Calculating the correlation
        2. Interpreting the correlation
        3. Figure 10-2: Scatterplots with various correlations.
        4. Properties of the correlation
      5. Finding the Regression Line
        1. Which is X and which is Y?
        2. Checking the conditions
        3. Understanding the equation
        4. Finding the slope
        5. Finding the y-intercept
        6. Interpreting the slope and y-intercept
          1. Interpreting the slope
          2. Interpreting the y-intercept
        7. Table 10-2 Big-Five Statistics for the Cricket Data
      6. Making Predictions
      7. Avoid Extrapolation!
      8. Correlation Doesn’t Necessarily Mean Cause-and-Effect
    13. Chapter 11: Two-Way Tables
    14. Chapter 12: A Checklist for Samples and Surveys
      1. The Target Population Is Well Defined
      2. The Sample Matches the Target Population
      3. The Sample Is Randomly Selected
      4. The Sample Size Is Large Enough
      5. Nonresponse Is Minimized
        1. The importance of following up
        2. Anonymity versus confidentiality
      6. The Survey Is of the Right Type
      7. Questions Are Well Worded
      8. The Timing Is Appropriate
      9. Personnel Are Well Trained
      10. Proper Conclusions Are Made
    15. Chapter 13: A Checklist for Judging Experiments
      1. Experiments versus Observational Studies
      2. Criteria for a Good Experiment
      3. Inspect the Sample Size
        1. Small samples — small conclusions
        2. Original versus final sample size
      4. Examine the Subjects
      5. Check for Random Assignments
      6. Gauge the Placebo Effect
      7. Identify Confounding Variables
      8. Assess Data Quality
      9. Check Out the Analysis
      10. Scrutinize the Conclusions
        1. Overstated results
        2. Ad-hoc explanations
        3. Generalizing beyond the scope
    16. Chapter 14: Ten Common Statistical Mistakes
      1. Misleading Graphs
        1. Pie charts
        2. Bar graphs
        3. Time charts
        4. Histograms
      2. Biased Data
      3. No Margin of Error
      4. Nonrandom Samples
      5. Missing Sample Sizes
      6. Misinterpreted Correlations
      7. Confounding Variables
      8. Botched Numbers
      9. Selectively Reporting Results
      10. The Almighty Anecdote
    17. Appendix: Tables for Reference
    18. Index
  10. Statistics II For Dummies
    1. Dedication
    2. About the Author
    3. Author’s Acknowledgments
    4. Introduction
      1. About This Book
      2. Conventions Used in This Book
      3. What You’re Not to Read
      4. Foolish Assumptions
      5. How This Book Is Organized
        1. Part I: Tackling Data Analysis and Model-Building Basics
        2. Part II: Using Different Types of Regression to Make Predictions
        3. Part III: Analyzing Variance with ANOVA
        4. Part IV: Building Strong Connections with Chi-Square Tests
        5. Part V: Nonparametric Statistics: Rebels without a Distribution
        6. Part VI: The Part of Tens
      6. Icons Used in This Book
      7. Where to Go from Here
    5. Part I: Tackling Data Analysis and Model-Building Basics
      1. Chapter 1: Beyond Number Crunching: The Art and Science of Data Analysis
        1. Data Analysis: Looking before You Crunch
          1. Nothing (not even a straight line) lasts forever
          2. Data snooping isn’t cool
          3. No (data) fishing allowed
        2. Getting the Big Picture: An Overview of Stats II
          1. Population parameter
          2. Sample statistic
          3. Confidence interval
          4. Hypothesis test
          5. Analysis of variance (ANOVA)
          6. Multiple comparisons
          7. Interaction effects
          8. Correlation
          9. Linear regression
          10. Chi-square tests
          11. Nonparametrics
      2. Chapter 2: Finding the Right Analysis for the Job
        1. Categorical versus Quantitative Variables
        2. Statistics for Categorical Variables
          1. Estimating a proportion
          2. Comparing proportions
          3. Looking for relationships between categorical variables
          4. Building models to make predictions
        3. Statistics for Quantitative Variables
          1. Making estimates
          2. Making comparisons
          3. Exploring relationships
          4. Predicting y using x
        4. Avoiding Bias
        5. Measuring Precision with Margin of Error
        6. Knowing Your Limitations
      3. Chapter 3: Reviewing Confidence Intervals and Hypothesis Tests
        1. Estimating Parameters by Using Confidence Intervals
          1. Getting the basics: The general form of a confidence interval
          2. Finding the confidence interval for a population mean
          3. What changes the margin of error?
          4. Interpreting a confidence interval
        2. What’s the Hype about Hypothesis Tests?
          1. What Ho and Ha really represent
          2. Gathering your evidence into a test statistic
          3. Determining strength of evidence with a p-value
          4. False alarms and missed opportunities: Type I and II errors
          5. The power of a hypothesis test
    6. Part II: Using Different Types of Regression to Make Predictions
      1. Chapter 4: Getting in Line with Simple Linear Regression
        1. Exploring Relationships with Scatterplots and Correlations
          1. Using scatterplots to explore relationships
          2. Collating the information by using the correlation coefficient
        2. Building a Simple Linear Regression Model
          1. Finding the best-fitting line to model your data
          2. The y-intercept of the regression line
          3. The slope of the regression line
          4. Making point estimates by using the regression line
        3. No Conclusion Left Behind: Tests and Confidence Intervals for Regression
          1. Scrutinizing the slope
          2. Inspecting the y-intercept
          3. Building confidence intervals for the average response
          4. Making the band with prediction intervals
        4. Checking the Model’s Fit (The Data, Not the Clothes!)
          1. Defining the conditions
          2. Finding and exploring the residuals
          3. Using r2 to measure model fit
          4. Scoping for outliers
        5. Knowing the Limitations of Your Regression Analysis
          1. Avoiding slipping into cause-and-effect mode
          2. Extrapolation: The ultimate no-no
          3. Sometimes you need more than one variable
      2. Chapter 5: Multiple Regression with Two X Variables
        1. Getting to Know the Multiple Regression Model
          1. Discovering the uses of multiple regression
          2. Looking at the general form of the multiple regression model
          3. Stepping through the analysis
        2. Looking at x’s and y’s
        3. Collecting the Data
        4. Pinpointing Possible Relationships
          1. Making scatterplots
          2. Correlations: Examining the bond
        5. Checking for Multicollinearity
        6. Finding the Best-Fitting Model for Two x Variables
          1. Getting the multiple regression coefficients
          2. Interpreting the coefficients
          3. Testing the coefficients
        7. Predicting y by Using the x Variables
        8. Checking the Fit of the Multiple Regression Model
          1. Noting the conditions
          2. Plotting a plan to check the conditions
          3. Checking the three conditions
      3. Chapter 6: How Can I Miss You If You Won’t Leave? Regression Model Selection
        1. Getting a Kick out of Estimating Punt Distance
          1. Brainstorming variables and collecting data
          2. Examining scatterplots and correlations
        2. Just Like Buying Shoes: The Model Looks Nice, But Does It Fit?
          1. Assessing the fit of multiple regression models
          2. Model selection procedures
      4. Chapter 7: Getting Ahead of the Learning Curve with Nonlinear Regression
        1. Anticipating Nonlinear Regression
        2. Starting Out with Scatterplots
        3. Handling Curves in the Road with Polynomials
          1. Bringing back polynomials
          2. Searching for the best polynomial model
          3. Using a second-degree polynomial to pass the quiz
          4. Assessing the fit of a polynomial model
          5. Making predictions
        4. Going Up? Going Down? Go Exponential!
          1. Recollecting exponential models
          2. Searching for the best exponential model
          3. Spreading secrets at an exponential rate
      5. Chapter 8: Yes, No, Maybe So: Making Predictions by Using Logistic Regression
        1. Understanding a Logistic Regression Model
          1. How is logistic regression different from other regressions?
          2. Using an S-curve to estimate probabilities
          3. Interpreting the coefficients of the logistic regression model
          4. The logistic regression model in action
        2. Carrying Out a Logistic Regression Analysis
          1. Running the analysis in Minitab
          2. Finding the coefficients and making the model
          3. Estimating p
          4. Checking the fit of the model
          5. Fitting the Movie Model
    7. Part III: Analyzing Variance with ANOVA
      1. Chapter 9: Testing Lots of Means? Come On Over to ANOVA!
        1. Comparing Two Means with a t-Test
        2. Evaluating More Means with ANOVA
          1. Spitting seeds: A situation just waiting for ANOVA
          2. Walking through the steps of ANOVA
        3. Checking the Conditions
          1. Verifying independence
          2. Looking for what’s normal
          3. Taking note of spread
        4. Setting Up the Hypotheses
        5. Doing the F-Test
          1. Running ANOVA in Minitab
          2. Breaking down the variance into sums of squares
          3. Locating those mean sums of squares
          4. Figuring the F-statistic
          5. Making conclusions from ANOVA
          6. What’s next?
        6. Checking the Fit of the ANOVA Model
      2. Chapter 10: Sorting Out the Means with Multiple Comparisons
        1. Following Up after ANOVA
          1. Comparing cellphone minutes: An example
          2. Setting the stage for multiple comparison procedures
        2. Pinpointing Differing Means with Fisher and Tukey
          1. Fishing for differences with Fisher’s LSD
          2. Using Fisher’s new and improved LSD
          3. Separating the turkeys with Tukey’s test
        3. Examining the Output to Determine the Analysis
        4. So Many Other Procedures, So Little Time!
          1. Controlling for baloney with the Bonferroni adjustment
          2. Comparing combinations by using Scheffe’s method
          3. Finding out whodunit with Dunnett’s test
          4. Staying cool with Student Newman-Keuls
          5. Duncan’s multiple range test
          6. Going nonparametric with the Kruskal-Wallis test
      3. Chapter 11: Finding Your Way through Two-Way ANOVA
        1. Setting Up the Two-Way ANOVA Model
          1. Determining the treatments
          2. Stepping through the sums of squares
        2. Understanding Interaction Effects
          1. What is interaction, anyway?
          2. Interacting with interaction plots
        3. Testing the Terms in Two-Way ANOVA
        4. Running the Two-Way ANOVA Table
          1. Interpreting the results: Numbers and graphs
        5. Are Whites Whiter in Hot Water? Two-Way ANOVA Investigates
      4. Chapter 12: Regression and ANOVA: Surprise Relatives!
        1. Seeing Regression through the Eyes of Variation
          1. Spotting variability and finding an “x-planation”
          2. Getting results with regression
          3. Assessing the fit of the regression model
        2. Regression and ANOVA: A Meeting of the Models
          1. Comparing sums of squares
          2. Dividing up the degrees of freedom
          3. Bringing regression to the ANOVA table
          4. Relating the F- and t-statistics: The final frontier
    8. Part IV: Building Strong Connections with Chi-Square Tests
      1. Chapter 13: Forming Associations with Two-Way Tables
        1. Breaking Down a Two-Way Table
          1. Organizing data into a two-way table
          2. Filling in the cell counts
          3. Making marginal totals
        2. Breaking Down the Probabilities
          1. Marginal probabilities
          2. Joint probabilities
          3. Conditional probabilities
        3. Trying To Be Independent
          1. Checking for independence between two categories
          2. Checking for independence between two variables
        4. Demystifying Simpson’s Paradox
          1. Experiencing Simpson’s Paradox
          2. Figuring out why Simpson’s Paradox occurs
          3. Keeping one eye open for Simpson’s Paradox
      2. Chapter 14: Being Independent Enough for the Chi-Square Test
        1. The Chi-square Test for Independence
          1. Collecting and organizing the data
          2. Determining the hypotheses
          3. Figuring expected cell counts
          4. Checking the conditions for the test
          5. Calculating the Chi-square test statistic
          6. Finding your results on the Chi-square table
          7. Drawing your conclusions
          8. Putting the Chi-square to the test
        2. Comparing Two Tests for Comparing Two Proportions
          1. Getting reacquainted with the Z-test for two population proportions
          2. Equating Chi-square tests and Z-tests for a two-by-two table
      3. Chapter 15: Using Chi-Square Tests for Goodness-of-Fit (Your Data, Not Your Jeans)
        1. Finding the Goodness-of-Fit Statistic
          1. What’s observed versus what’s expected
          2. Calculating the goodness-of-fit statistic
        2. Interpreting the Goodness-of-Fit Statistic Using a Chi-Square
          1. Checking the conditions before you start
          2. The steps of the Chi-square goodness-of-fit test
    9. Part V: Nonparametric Statistics: Rebels without a Distribution
      1. Chapter 16: Going Nonparametric
        1. Arguing for Nonparametric Statistics
          1. No need to fret if conditions aren’t met
          2. The median’s in the spotlight for a change
          3. So, what’s the catch?
        2. Mastering the Basics of Nonparametric Statistics
          1. Sign
          2. Rank
          3. Signed rank
          4. Rank sum
      2. Chapter 17: All Signs Point to the Sign Test and Signed Rank Test
        1. Reading the Signs: The Sign Test
          1. Testing the median
          2. Estimating the median
          3. Testing matched pairs
        2. Going a Step Further with the Signed Rank Test
          1. A limitation of the sign test
          2. Stepping through the signed rank test
          3. Losing weight with signed ranks
      3. Chapter 18: Pulling Rank with the Rank Sum Test
        1. Conducting the Rank Sum Test
          1. Checking the conditions
          2. Stepping through the test
          3. Stepping up the sample size
        2. Performing a Rank Sum Test: Which Real Estate Agent Sells Homes Faster?
          1. Checking the conditions for this test
          2. Testing the hypotheses
      4. Chapter 19: Do the Kruskal-Wallis and Rank the Sums with the Wilcoxon
        1. Doing the Kruskal-Wallis Test to Compare More than Two Populations
          1. Checking the conditions
          2. Setting up the test
          3. Conducting the test step by step
        2. Pinpointing the Differences: The Wilcoxon Rank Sum Test
          1. Pairing off with pairwise comparisons
          2. Carrying out comparison tests to see who’s different
          3. Examining the medians to see how they’re different
      5. Chapter 20: Pointing Out Correlations with Spearman’s Rank
        1. Pickin’ On Pearson and His Precious Conditions
        2. Scoring with Spearman’s Rank Correlation
          1. Figuring Spearman’s rank correlation
          2. Watching Spearman at work: Relating aptitude to performance
    10. Part VI: The Part of Tens
      1. Chapter 21: Ten Common Errors in Statistical Conclusions
      2. Chapter 22: Ten Ways to Get Ahead by Knowing Statistics
      3. Chapter 23: Ten Cool Jobs That Use Statistics
    11. Appendix: Reference Tables
    12. Index
  11. Introduction to Random Forests for Beginners – free ebook
  12. Random Forests
    1. Salford Systems and Random Forests
    2. What are Random Forests?
      1. Prerequisites
      2. Preliminaries
      3. The Essentials
      4. More Essentials
      5. Predictions
      6. Weakness of Bagger
      7. Putting Randomness into Random Forests
      8. How Random Should a Random Forest Be?
      9. More on Random Splitting
      10. Controlling Degree of Randomness
      11. How Many Predictors in Each Node?
      12. Random Forests Predictions
      13. Out of Bag (OOB) Data
      14. Testing and Evaluating
      15. More Testing and Evaluation
      16. Testing vs. Scoring
      17. Random Forests and Segmentation
      18. Proximity Matrix
      19. Characteristics of the Proximity Matrix
      20. Proximity Insights
      21. Proximity Visualization
      22. Missing Values
      23. Proximity and Missing Values
      24. Missing Imputation
      25. Variable Importance
      26. Variable Importance Details
      27. Variable Importance Issues
      28. Variable Importance: Final Observation
      29. Bootstrap Resampling
      30. The Technical Algorithm
    3. Suited for Wide Data
      1. Wide Data
      2. Random Forests and Wide Data
      3. Wide Shallow Data
    4. Strength and Weaknesses of Random Forests
      1. Random Forests: Few Controls to Learn & Set
      2. Easily Parallelizable
      3. Random Forests Weaknesses
      4. Therefore…
    5. Simple Example
      1. Boston Housing Data
      2. Run an RF model
      3. RF Controls
      4. Results Summary
      5. Confusion Matrix OOB
      6. Variable Importance
      7. Most Likely Vs Not Likely Parallel Coordinate Plots
      8. Parallel Coordinate Plots
      9. Outliers
      10. Proximity & Clusters
    6. Real World Example
      1. Analytics On a Grand Scale
      2. Challenge
      3. Some essential questions:
      4. The solution
      5. The result
      6. Customer Support
    7. Why Salford Systems?
      1. Unique to Salford
      2. Measure the ROI of Random Forests
  13. Connecting Alaska Landscapes Into the Future
    1. Cover Page
    2. Acknowledgments
    3. Executive Summary
    4. Introduction
      1. Scope and Purpose of the Report
      2. Terms and Definitions
      3. Project Background
      4. Models Used in the Connectivity Project
        1. MODELING CLIMATE CHANGE: SNAP CLIMATE MODELS
        2. MODELING CLIMATE CHANGE ENVELOPES: RANDOM FORESTS™
        3. MODELING CONNECTIVITY: MARXAN
      5. Important Factors and Models That Were NOT Incorporated
        1. SEA LEVEL RISE
        2. PERMAFROST CHANGE
        3. ALFRESCO
    5. Part I Modeling Shifts in Ecosystem/Vegetation
      1. Landscape Classification
        1. Figure 1 Ecological groups, as defined by Nowacki et al. (2001). See Appendix C for full descriptions of each biome
        2. Figure 2 Alaska biomes and Canadian ecoregions
      2. Modeling the Effects of Climate Change on Biomes
        1. Figure 3 Conceptual illustration of how the Random Forests™ model was used to create climate envelopes for different modeling subjects
        2. Figure 4 Current biome types as predicted by SNAP climate data
        3. Figure 5 Projected potential biomes for 2030–2039
        4. Figure 6 Projected potential biomes for 2060–2069
        5. Figure 7 Projected potential biomes for 2090–2099
        6. Figure 8 Box plots comparing climate envelopes for biomes/ecoregions
    6. Part II Connectivity and Resilience and Conservation Potential
      1. Figure 9 Conservation lands in Alaska
      2. Methods
        1. CRITERION 1: BIOME REFUGIA
        2. Figure 10 Number of potential biome changes projected by 2099
        3. Figure 11 Biome refugia
        4. CRITERION 2: NDVI AS AN INDICATOR OF PRODUCTIVITY AND ENDEMISM
        5. Figure 12 Historical mean NDVI across the state of Alaska
        6. CRITERION 3: PROPORTION OF LAND AREA IN CONSERVATION STATUS
        7. MODELING STEPS
        8. Figure 13 Biome refugia on a 40 km hexagonal grid
        9. Figure 14 Marxan solution illustrating ranking by biome assuming 10% of total land area in conservation status
        10. Figure 15 Marxan solution statewide by biome assuming 25% of total land area in conservation status
      3. Marxan Modeling Results
      4. Discussion of Marxan Results
        1. Figure 16 Potential connections between Marxan solutions for lands within each biome category
    7. Part III Modeling Changes in Distribution of Indicator Species
      1. Caribou
        1. Figure 17 Alaska caribou herd ranges (2008)
        2. Figure 18 Projected caribou range for all herds combined
      2. Alaska Marmots
        1. Figure 19 Known Alaska marmot distribution and modeled current distribution
        2. Figure 20 Projected Alaska marmot distribution
      3. Trumpeter Swans
        1. Figure 21 Potential expansion of trumpeter swan habitat
      4. Reed Canary Grass
        1. Figure 22 Potential spread of reed canary grass, using climate and all-season roads as predictors
    8. Part IV Implications of Modeling for Conservation and Research and Management
        1. Figure 23 Case study area, Kenai Peninsula
      1. Supporting Evidence for our Models from the Field: Kenai Peninsula Case Study
        1. Figure 24 Landcover, existing biome, and predicted biome
        2. Figure 25 Modeled spread of reed canary grass
        3. Figure 26 Increase in likelihood of occurrence of trumpeter swans
        4. Figure 27 Little change is seen in potential caribou distribution
      2. Implications of Results for Conservation, Research, and Management
    9. Part V Lessons Learned and Next Steps
    10. Appendices and Technical Addenda
      1. Appendix A Project Participants
        1. P.I.s
        2. Alaska Department of Fish and Game
        3. Alaska Natural Resource Center, National Wildlife Federation
        4. Audubon Alaska
        5. Defenders of Wildlife
        6. The Nature Conservancy
        7. The Wilderness Society
        8. U.S. Bureau of Land Management
        9. U.S. Fish and Wildlife Service
        10. U.S. Forest Service
        11. U.S. Geological Survey
        12. U.S. National Park Service
        13. UAA Natural Heritage Program & Alaska GAP
      2. Appendix B Additional Reading and References
      3. Appendix C Alaskan Biome and Canadian Ecozone Descriptions
        1. Alpine
        2. Arctic
        3. Western Tundra
        4. Alaska Boreal
        5. Boreal Transition
        6. North Pacific Maritime
        7. Aleutian Islands
        8. Boreal Cordillera Ecozone
        9. Taiga Cordillera Ecozone
        10. Pacific Maritime Ecozone
        11. Montane Cordillera Ecozone
        12. Taiga Plains Ecozone
        13. Boreal Plains Ecozone
      4. Technical Addendum I Derivation of SNAP Climate Projections
      5. Technical Addendum II Identifying the Alpine Biome
        1. Figure II-1 Projected change in the Alpine zone over time
      6. Technical Addendum III LANDFIRE Mapping of Units 70 and 71 (ez7071)
        1. Table III-1 Grouping of the original LANDFIRE classes into 20 land cover classes for modeling
        2. Figure III-1 Extrapolating statewide, based on vegetation classifications for units 70 and 71
      7. Technical Addendum IV Defining Biomes and Modeling Biome Shifts in Random Forests™
        1. Reconciling Canadian and Alaskan data
        2. Defining climate envelopes in a continuous landscape
      8. Technical Addendum V Caribou Modeling by Herd or Season
        1. Table V-1 Seasonal ranges (updated 2008) used in modeling differences in northern and southern caribou herds in Alaska
        2. Figure V-1 Available range information for southern herds in 2008
        3. Figure V-2 Projected caribou range for all herds combined
        4. Figure V-3 Range size of all herds as a percentage of the state
      9. Technical Addendum VI Derivation of Marmot Model
        1. Figure VI-1 Known Alaska marmot occurrences overlaid with modeled Alpine zone model
        2. Figure VI-2 Terrain roughness
        3. Figure VI-3 Potential marmot habitat with terrain roughness as a factor
      10. Technical Addendum VII Derivation of Swan Model
        1. Figure VII-1 Trumpeter swan sightings, 2005
      11. Technical Addendum VIII Derivation of Reed Canary Grass Model
        1. Figure VIII-1 Grass modeling using a more extensive transportation network
    11. Notes

Data Science for Random Forests

        2. LIMITATION OF LIABILITY
        3. RESERVATION OF RIGHTS
      19. MISCELLANEOUS
        1. SEVERABILITY
        2. LAW
      20. Rules Acceptance
  9. Statistics Essentials For Dummies
    1. About the Author
    2. Introduction
      1. About This Book
      2. Conventions Used in This Book
      3. Foolish Assumptions
      4. Icons Used in This Book
      5. Where to Go from Here
      6. The 5th Wave
    3. Chapter 1: Statistics in a Nutshell
      1. Designing Studies
      2. Surveys
        1. Experiments
      3. Collecting Data
        1. Selecting a good sample
        2. Avoiding bias in your data
      4. Describing Data
        1. Descriptive statistics
        2. Charts and graphs
      5. Analyzing Data
      6. Making Conclusions
    4. Chapter 2: Descriptive Statistics
      1. Types of Data
      2. Counts and Percents
      3. Measures of Center
      4. Measures of Variability
      5. Percentiles
      6. Interpreting percentiles
        1. Table 2-1 U.S. Household Income for 2001
      7. The Five-Number Summary
    5. Chapter 3: Charts and Graphs
    6. Chapter 4: The Binomial Distribution
    7. Chapter 5: The Normal Distribution
    8. Chapter 6: Sampling Distributions and the Central Limit Theorem
    9. Chapter 7: Confidence Intervals
    10. Chapter 8: Hypothesis Tests
    11. Chapter 9: The t-distribution
    12. Chapter 10: Correlation and Regression
      1. Picturing the Relationship with a Scatterplot
      2. Making a scatterplot
      3. Interpreting a scatterplot
        1. Figure 10-1: Scatterplot of cricket chirps versus outdoor temperature.
      4. Measuring Relationships Using the Correlation
        1. Calculating the correlation
        2. Interpreting the correlation
        3. Figure 10-2: Scatterplots with various correlations.
        4. Properties of the correlation
      5. Finding the Regression Line
        1. Which is X and which is Y?
        2. Checking the conditions
        3. Understanding the equation
        4. Finding the slope
        5. Finding the y-intercept
        6. Interpreting the slope and y-intercept
          1. Interpreting the slope
          2. Interpreting the y-intercept
        7. Table 10-2 Big-Five Statistics for the Cricket Data
      6. Making Predictions
      7. Avoid Extrapolation!
      8. Correlation Doesn’t Necessarily Mean Cause-and-Effect
    13. Chapter 11: Two-Way Tables
    14. Chapter 12: A Checklist for Samples and Surveys
      1. The Target Population Is Well Defined
      2. The Sample Matches the Target Population
      3. The Sample Is Randomly Selected
      4. The Sample Size Is Large Enough
      5. Nonresponse Is Minimized
        1. The importance of following up
        2. Anonymity versus confidentiality
      6. The Survey Is of the Right Type
      7. Questions Are Well Worded
      8. The Timing Is Appropriate
      9. Personnel Are Well Trained
      10. Proper Conclusions Are Made
    15. Chapter 13: A Checklist for Judging Experiments
      1. Experiments versus Observational Studies
      2. Criteria for a Good Experiment
      3. Inspect the Sample Size
        1. Small samples — small conclusions
        2. Original versus final sample size
      4. Examine the Subjects
      5. Check for Random Assignments
      6. Gauge the Placebo Effect
      7. Identify Confounding Variables
      8. Assess Data Quality
      9. Check Out the Analysis
      10. Scrutinize the Conclusions
        1. Overstated results
        2. Ad-hoc explanations
        3. Generalizing beyond the scope
    16. Chapter 14: Ten Common Statistical Mistakes
      1. Misleading Graphs
        1. Pie charts
        2. Bar graphs
        3. Time charts
        4. Histograms
      2. Biased Data
      3. No Margin of Error
      4. Nonrandom Samples
      5. Missing Sample Sizes
      6. Misinterpreted Correlations
      7. Confounding Variables
      8. Botched Numbers
      9. Selectively Reporting Results
      10. The Almighty Anecdote
    17. Appendix: Tables for Reference
    18. Index
  10. Statistics II For Dummies
    1. Dedication
    2. About the Author
    3. Author’s Acknowledgments
    4. Introduction
      1. About This Book
      2. Conventions Used in This Book
      3. What You’re Not to Read
      4. Foolish Assumptions
      5. How This Book Is Organized
        1. Part I: Tackling Data Analysis and Model-Building Basics
        2. Part II: Using Different Types of Regression to Make Predictions
        3. Part III: Analyzing Variance with ANOVA
        4. Part IV: Building Strong Connections with Chi-Square Tests
        5. Part V: Nonparametric Statistics: Rebels without a Distribution
        6. Part VI: The Part of Tens
      6. Icons Used in This Book
      7. Where to Go from Here
    5. Part I: Tackling Data Analysis and Model-Building Basics
      1. Chapter 1: Beyond Number Crunching: The Art and Science of Data Analysis
        1. Data Analysis: Looking before You Crunch
          1. Nothing (not even a straight line) lasts forever
          2. Data snooping isn’t cool
          3. No (data) fishing allowed
        2. Getting the Big Picture: An Overview of Stats II
          1. Population parameter
          2. Sample statistic
          3. Confidence interval
          4. Hypothesis test
          5. Analysis of variance (ANOVA)
          6. Multiple comparisons
          7. Interaction effects
          8. Correlation
          9. Linear regression
          10. Chi-square tests
          11. Nonparametrics
      2. Chapter 2: Finding the Right Analysis for the Job
        1. Categorical versus Quantitative Variables
        2. Statistics for Categorical Variables
          1. Estimating a proportion
          2. Comparing proportions
          3. Looking for relationships between categorical variables
          4. Building models to make predictions
        3. Statistics for Quantitative Variables
          1. Making estimates
          2. Making comparisons
          3. Exploring relationships
          4. Predicting y using x
        4. Avoiding Bias
        5. Measuring Precision with Margin of Error
        6. Knowing Your Limitations
      3. Chapter 3: Reviewing Confidence Intervals and Hypothesis Tests
        1. Estimating Parameters by Using Confidence Intervals
          1. Getting the basics: The general form of a confidence interval
          2. Finding the confidence interval for a population mean
          3. What changes the margin of error?
          4. Interpreting a confidence interval
        2. What’s the Hype about Hypothesis Tests?
          1. What Ho and Ha really represent
          2. Gathering your evidence into a test statistic
          3. Determining strength of evidence with a p-value
          4. False alarms and missed opportunities: Type I and II errors
          5. The power of a hypothesis test
    6. Part II: Using Different Types of Regression to Make Predictions
      1. Chapter 4: Getting in Line with Simple Linear Regression
        1. Exploring Relationships with Scatterplots and Correlations
          1. Using scatterplots to explore relationships
          2. Collating the information by using the correlation coefficient
        2. Building a Simple Linear Regression Model
          1. Finding the best-fitting line to model your data
          2. The y-intercept of the regression line
          3. The slope of the regression line
          4. Making point estimates by using the regression line
        3. No Conclusion Left Behind: Tests and Confidence Intervals for Regression
          1. Scrutinizing the slope
          2. Inspecting the y-intercept
          3. Building confidence intervals for the average response
          4. Making the band with prediction intervals
        4. Checking the Model’s Fit (The Data, Not the Clothes!)
          1. Defining the conditions
          2. Finding and exploring the residuals
          3. Using r2 to measure model fit
          4. Scoping for outliers
        5. Knowing the Limitations of Your Regression Analysis
          1. Avoiding slipping into cause-and-effect mode
          2. Extrapolation: The ultimate no-no
          3. Sometimes you need more than one variable
      2. Chapter 5: Multiple Regression with Two X Variables
        1. Getting to Know the Multiple Regression Model
          1. Discovering the uses of multiple regression
          2. Looking at the general form of the multiple regression model
          3. Stepping through the analysis
        2. Looking at x’s and y’s
        3. Collecting the Data
        4. Pinpointing Possible Relationships
          1. Making scatterplots
          2. Correlations: Examining the bond
        5. Checking for Multicollinearity
        6. Finding the Best-Fitting Model for Two x Variables
          1. Getting the multiple regression coefficients
          2. Interpreting the coefficients
          3. Testing the coefficients
        7. Predicting y by Using the x Variables
        8. Checking the Fit of the Multiple Regression Model
          1. Noting the conditions
          2. Plotting a plan to check the conditions
          3. Checking the three conditions
      3. Chapter 6: How Can I Miss You If You Won’t Leave? Regression Model Selection
        1. Getting a Kick out of Estimating Punt Distance
          1. Brainstorming variables and collecting data
          2. Examining scatterplots and correlations
        2. Just Like Buying Shoes: The Model Looks Nice, But Does It Fit?
          1. Assessing the fit of multiple regression models
          2. Model selection procedures
      4. Chapter 7: Getting Ahead of the Learning Curve with Nonlinear Regression
        1. Anticipating Nonlinear Regression
        2. Starting Out with Scatterplots
        3. Handling Curves in the Road with Polynomials
          1. Bringing back polynomials
          2. Searching for the best polynomial model
          3. Using a second-degree polynomial to pass the quiz
          4. Assessing the fit of a polynomial model
          5. Making predictions
        4. Going Up? Going Down? Go Exponential!
          1. Recollecting exponential models
          2. Searching for the best exponential model
          3. Spreading secrets at an exponential rate
      5. Chapter 8: Yes, No, Maybe So: Making Predictions by Using Logistic Regression
        1. Understanding a Logistic Regression Model
          1. How is logistic regression different from other regressions?
          2. Using an S-curve to estimate probabilities
          3. Interpreting the coefficients of the logistic regression model
          4. The logistic regression model in action
        2. Carrying Out a Logistic Regression Analysis
          1. Running the analysis in Minitab
          2. Finding the coefficients and making the model
          3. Estimating p
          4. Checking the fit of the model
          5. Fitting the Movie Model
    7. Part III: Analyzing Variance with ANOVA
      1. Chapter 9: Testing Lots of Means? Come On Over to ANOVA!
        1. Comparing Two Means with a t-Test
        2. Evaluating More Means with ANOVA
          1. Spitting seeds: A situation just waiting for ANOVA
          2. Walking through the steps of ANOVA
        3. Checking the Conditions
          1. Verifying independence
          2. Looking for what’s normal
          3. Taking note of spread
        4. Setting Up the Hypotheses
        5. Doing the F-Test
          1. Running ANOVA in Minitab
          2. Breaking down the variance into sums of squares
          3. Locating those mean sums of squares
          4. Figuring the F-statistic
          5. Making conclusions from ANOVA
          6. What’s next?
        6. Checking the Fit of the ANOVA Model
      2. Chapter 10: Sorting Out the Means with Multiple Comparisons
        1. Following Up after ANOVA
          1. Comparing cellphone minutes: An example
          2. Setting the stage for multiple comparison procedures
        2. Pinpointing Differing Means with Fisher and Tukey
          1. Fishing for differences with Fisher’s LSD
          2. Using Fisher’s new and improved LSD
          3. Separating the turkeys with Tukey’s test
        3. Examining the Output to Determine the Analysis
        4. So Many Other Procedures, So Little Time!
          1. Controlling for baloney with the Bonferroni adjustment
          2. Comparing combinations by using Scheffe’s method
          3. Finding out whodunit with Dunnett’s test
          4. Staying cool with Student Newman-Keuls
          5. Duncan’s multiple range test
          6. Going nonparametric with the Kruskal-Wallis test
      3. Chapter 11: Finding Your Way through Two-Way ANOVA
        1. Setting Up the Two-Way ANOVA Model
          1. Determining the treatments
          2. Stepping through the sums of squares
        2. Understanding Interaction Effects
          1. What is interaction, anyway?
          2. Interacting with interaction plots
        3. Testing the Terms in Two-Way ANOVA
        4. Running the Two-Way ANOVA Table
          1. Interpreting the results: Numbers and graphs
        5. Are Whites Whiter in Hot Water? Two-Way ANOVA Investigates
      4. Chapter 12: Regression and ANOVA: Surprise Relatives!
        1. Seeing Regression through the Eyes of Variation
          1. Spotting variability and finding an “x-planation”
          2. Getting results with regression
          3. Assessing the fit of the regression model
        2. Regression and ANOVA: A Meeting of the Models
          1. Comparing sums of squares
          2. Dividing up the degrees of freedom
          3. Bringing regression to the ANOVA table
          4. Relating the F- and t-statistics: The final frontier
    8. Part IV: Building Strong Connections with Chi-Square Tests
      1. Chapter 13: Forming Associations with Two-Way Tables
        1. Breaking Down a Two-Way Table
          1. Organizing data into a two-way table
          2. Filling in the cell counts
          3. Making marginal totals
        2. Breaking Down the Probabilities
          1. Marginal probabilities
          2. Joint probabilities
          3. Conditional probabilities
        3. Trying To Be Independent
          1. Checking for independence between two categories
          2. Checking for independence between two variables
        4. Demystifying Simpson’s Paradox
          1. Experiencing Simpson’s Paradox
          2. Figuring out why Simpson’s Paradox occurs
          3. Keeping one eye open for Simpson’s Paradox
      2. Chapter 14: Being Independent Enough for the Chi-Square Test
        1. The Chi-square Test for Independence
          1. Collecting and organizing the data
          2. Determining the hypotheses
          3. Figuring expected cell counts
          4. Checking the conditions for the test
          5. Calculating the Chi-square test statistic
          6. Finding your results on the Chi-square table
          7. Drawing your conclusions
          8. Putting the Chi-square to the test
        2. Comparing Two Tests for Comparing Two Proportions
          1. Getting reacquainted with the Z-test for two population proportions
          2. Equating Chi-square tests and Z-tests for a two-by-two table
      3. Chapter 15: Using Chi-Square Tests for Goodness-of-Fit (Your Data, Not Your Jeans)
        1. Finding the Goodness-of-Fit Statistic
          1. What’s observed versus what’s expected
          2. Calculating the goodness-of-fit statistic
        2. Interpreting the Goodness-of-Fit Statistic Using a Chi-Square
          1. Checking the conditions before you start
          2. The steps of the Chi-square goodness-of-fit test
    9. Part V: Nonparametric Statistics: Rebels without a Distribution
      1. Chapter 16: Going Nonparametric
        1. Arguing for Nonparametric Statistics
          1. No need to fret if conditions aren’t met
          2. The median’s in the spotlight for a change
          3. So, what’s the catch?
        2. Mastering the Basics of Nonparametric Statistics
          1. Sign
          2. Rank
          3. Signed rank
          4. Rank sum
      2. Chapter 17: All Signs Point to the Sign Test and Signed Rank Test
        1. Reading the Signs: The Sign Test
          1. Testing the median
          2. Estimating the median
          3. Testing matched pairs
        2. Going a Step Further with the Signed Rank Test
          1. A limitation of the sign test
          2. Stepping through the signed rank test
          3. Losing weight with signed ranks
      3. Chapter 18: Pulling Rank with the Rank Sum Test
        1. Conducting the Rank Sum Test
          1. Checking the conditions
          2. Stepping through the test
          3. Stepping up the sample size
        2. Performing a Rank Sum Test: Which Real Estate Agent Sells Homes Faster?
          1. Checking the conditions for this test
          2. Testing the hypotheses
      4. Chapter 19: Do the Kruskal-Wallis and Rank the Sums with the Wilcoxon
        1. Doing the Kruskal-Wallis Test to Compare More than Two Populations
          1. Checking the conditions
          2. Setting up the test
          3. Conducting the test step by step
        2. Pinpointing the Differences: The Wilcoxon Rank Sum Test
          1. Pairing off with pairwise comparisons
          2. Carrying out comparison tests to see who’s different
          3. Examining the medians to see how they’re different
      5. Chapter 20: Pointing Out Correlations with Spearman’s Rank
        1. Pickin’ On Pearson and His Precious Conditions
        2. Scoring with Spearman’s Rank Correlation
          1. Figuring Spearman’s rank correlation
          2. Watching Spearman at work: Relating aptitude to performance
    10. Part VI: The Part of Tens
      1. Chapter 21: Ten Common Errors in Statistical Conclusions
      2. Chapter 22: Ten Ways to Get Ahead by Knowing Statistics
      3. Chapter 23: Ten Cool Jobs That Use Statistics
    11. Appendix: Reference Tables
    12. Index
  11. Introduction to Random Forests for Beginners – free ebook
  12. Random Forests
    1. Salford Systems and Random Forests
    2. What are Random Forests?
      1. Prerequisites
      2. Preliminaries
      3. The Essentials
      4. More Essentials
      5. Predictions
      6. Weakness of Bagger
      7. Putting Randomness into Random Forests
      8. How Random Should a Random Forest Be?
      9. More on Random Splitting
      10. Controlling Degree of Randomness
      11. How Many Predictors in Each Node?
      12. Random Forests Predictions
      13. Out of Bag (OOB) Data
      14. Testing and Evaluating
      15. More Testing and Evaluation
      16. Testing vs. Scoring
      17. Random Forests and Segmentation
      18. Proximity Matrix
      19. Characteristics of the Proximity Matrix
      20. Proximity Insights
      21. Proximity Visualization
      22. Missing Values
      23. Proximity and Missing Values
      24. Missing Imputation
      25. Variable Importance
      26. Variable Importance Details
      27. Variable Importance Issues
      28. Variable Importance: Final Observation
      29. Bootstrap Resampling
      30. The Technical Algorithm
    3. Suited for Wide Data
      1. Wide Data
      2. Random Forests and Wide Data
      3. Wide Shallow Data
    4. Strength and Weaknesses of Random Forests
      1. Random Forests: Few Controls to Learn & Set
      2. Easily Parallelizable
      3. Random Forests Weaknesses
      4. Therefore…
    5. Simple Example
      1. Boston Housing Data
      2. Run an RF model
      3. RF Controls
      4. Results Summary
      5. Confusion Matrix OOB
      6. Variable Importance
      7. Most Likely Vs Not Likely Parallel Coordinate Plots
      8. Parallel Coordinate Plots
      9. Outliers
      10. Proximity & Clusters
    6. Real World Example
      1. Analytics On a Grand Scale
      2. Challenge
      3. Some essential questions:
      4. The solution
      5. The result
      6. Customer Support
    7. Why Salford Systems?
      1. Unique to Salford
      2. Measure the ROI of Random Forests
  13. Connecting Alaska Landscapes Into the Future
    1. Cover Page
    2. Acknowledgments
    3. Executive Summary
    4. Introduction
      1. Scope and Purpose of the Report
      2. Terms and Definitions
      3. Project Background
      4. Models Used in the Connectivity Project
        1. MODELING CLIMATE CHANGE: SNAP CLIMATE MODELS
        2. MODELING CLIMATE CHANGE ENVELOPES: RANDOM FORESTS™
        3. MODELING CONNECTIVITY: MARXAN
      5. Important Factors and Models That Were NOT Incorporated
        1. SEA LEVEL RISE
        2. PERMAFROST CHANGE
        3. ALFRESCO
    5. Part I Modeling Shifts in Ecosystem/Vegetation
      1. Landscape Classification
        1. Figure 1 Ecological groups, as defined by Nowacki et al. (2001). See Appendix C for full descriptions of each biome
        2. Figure 2 Alaska biomes and Canadian ecoregions
      2. Modeling the Effects of Climate Change on Biomes
        1. Figure 3 Conceptual illustration of how the Random Forests™ model was used to create climate envelopes for different modeling subjects
        2. Figure 4 Current biome types as predicted by SNAP climate data
        3. Figure 5 Projected potential biomes for 2030–2039
        4. Figure 6 Projected potential biomes for 2060–2069
        5. Figure 7 Projected potential biomes for 2090–2099
        6. Figure 8 Box plots comparing climate envelopes for biomes/ecoregions
    6. Part II Connectivity and Resilience and Conservation Potential
      1. Figure 9 Conservation lands in Alaska
      2. Methods
        1. CRITERION 1: BIOME REFUGIA
        2. Figure 10 Number of potential biome changes projected by 2099
        3. Figure 11 Biome refugia
        4. CRITERION 2: NDVI AS AN INDICATOR OF PRODUCTIVITY AND ENDEMISM
        5. Figure 12 Historical mean NDVI across the state of Alaska
        6. CRITERION 3: PROPORTION OF LAND AREA IN CONSERVATION STATUS
        7. MODELING STEPS
        8. Figure 13 Biome refugia on a 40 km hexagonal grid
        9. Figure 14 Marxan solution illustrating ranking by biome assuming 10% of total land area in conservation status
        10. Figure 15 Marxan solution statewide by biome assuming 25% of total land area in conservation status
      3. Marxan Modeling Results
      4. Discussion of Marxan Results
        1. Figure 16 Potential connections between Marxan solutions for lands within each biome category
    7. Part III Modeling Changes in Distribution of Indicator Species
      1. Caribou
        1. Figure 17 Alaska caribou herd ranges (2008)
        2. Figure 18 Projected caribou range for all herds combined
      2. Alaska Marmots
        1. Figure 19 Known Alaska marmot distribution and modeled current distribution
        2. Figure 20 Projected Alaska marmot distribution
      3. Trumpeter Swans
        1. Figure 21 Potential expansion of trumpeter swan habitat
      4. Reed Canary Grass
        1. Figure 22 Potential spread of reed canary grass, using climate and all-season roads as predictors
    8. Part IV Implications of Modeling for Conservation and Research and Management
        1. Figure 23 Case study area, Kenai Peninsula
      1. Supporting Evidence for our Models from the Field: Kenai Peninsula Case Study
        1. Figure 24 Landcover, existing biome, and predicted biome
        2. Figure 25 Modeled spread of reed canary grass
        3. Figure 26 Increase in likelihood of occurrence of trumpeter swans
        4. Figure 27 Little change is seen in potential caribou distribution
      2. Implications of Results for Conservation, Research, and Management
    9. Part V Lessons Learned and Next Steps
    10. Appendices and Technical Addenda
      1. Appendix A Project Participants
        1. P.I.s
        2. Alaska Department of Fish and Game
        3. Alaska Natural Resource Center, National Wildlife Federation
        4. Audubon Alaska
        5. Defenders of Wildlife
        6. The Nature Conservancy
        7. The Wilderness Society
        8. U.S. Bureau of Land Management
        9. U.S. Fish and Wildlife Service
        10. U.S. Forest Service
        11. U.S. Geological Survey
        12. U.S. National Park Service
        13. UAA Natural Heritage Program & Alaska GAP
      2. Appendix B Additional Reading and References
      3. Appendix C Alaskan Biome and Canadian Ecozone Descriptions
        1. Alpine
        2. Arctic
        3. Western Tundra
        4. Alaska Boreal
        5. Boreal Transition
        6. North Pacific Maritime
        7. Aleutian Islands
        8. Boreal Cordillera Ecozone
        9. Taiga Cordillera Ecozone
        10. Pacific Maritime Ecozone
        11. Montane Cordillera Ecozone
        12. Taiga Plains Ecozone
        13. Boreal Plains Ecozone
      4. Technical Addendum I Derivation of SNAP Climate Projections
      5. Technical Addendum II Identifying the Alpine Biome
        1. Figure II-1 Projected change in the Alpine zone over time
      6. Technical Addendum III LANDFIRE Mapping of Units 70 and 71 (ez7071)
        1. Table III-1 Grouping of the original LANDFIRE classes into 20 land cover classes for modeling
        2. Figure III-1 Extrapolating statewide, based on vegetation classifications for units 70 and 71
      7. Technical Addendum IV Defining Biomes and Modeling Biome Shifts in Random Forests™
        1. Reconciling Canadian and Alaskan data
        2. Defining climate envelopes in a continuous landscape
      8. Technical Addendum V Caribou Modeling by Herd or Season
        1. Table V-1 Seasonal ranges (updated 2008) used in modeling differences in northern and southern caribou herds in Alaska
        2. Figure V-1 Available range information for southern herds in 2008
        3. Figure V-2 Projected caribou range for all herds combined
        4. Figure V-3 Range size of all herds as a percentage of the state
      9. Technical Addendum VI Derivation of Marmot Model
        1. Figure VI-1 Known Alaska marmot occurrences overlaid with modeled Alpine zone model
        2. Figure VI-2 Terrain roughness
        3. Figure VI-3 Potential marmot habitat with terrain roughness as a factor
      10. Technical Addendum VII Derivation of Swan Model
        1. Figure VII-1 Trumpeter swan sightings, 2005
      11. Technical Addendum VIII Derivation of Reed Canary Grass Model
        1. Figure VIII-1 Grass modeling using a more extensive transportation network
    11. Notes

        3. Figure 10-2: Scatterplots with various correlations.
        4. Properties of the correlation
      5. Finding the Regression Line
        1. Which is X and which is Y?
        2. Checking the conditions
        3. Understanding the equation
        4. Finding the slope
        5. Finding the y-intercept
        6. Interpreting the slope and y-intercept
          1. Interpreting the slope
          2. Interpreting the y-intercept
        7. Table 10-2 Big-Five Statistics for the Cricket Data
      6. Making Predictions
      7. Avoid Extrapolation!
      8. Correlation Doesn’t Necessarily Mean Cause-and-Effect
    13. Chapter 11: Two-Way Tables
    14. Chapter 12: A Checklist for Samples and Surveys
      1. The Target Population Is Well Defined
      2. The Sample Matches the Target Population
      3. The Sample Is Randomly Selected
      4. The Sample Size Is Large Enough
      5. Nonresponse Is Minimized
        1. The importance of following up
        2. Anonymity versus confidentiality
      6. The Survey Is of the Right Type
      7. Questions Are Well Worded
      8. The Timing Is Appropriate
      9. Personnel Are Well Trained
      10. Proper Conclusions Are Made
    15. Chapter 13: A Checklist for Judging Experiments
      1. Experiments versus Observational Studies
      2. Criteria for a Good Experiment
      3. Inspect the Sample Size
        1. Small samples — small conclusions
        2. Original versus final sample size
      4. Examine the Subjects
      5. Check for Random Assignments
      6. Gauge the Placebo Effect
      7. Identify Confounding Variables
      8. Assess Data Quality
      9. Check Out the Analysis
      10. Scrutinize the Conclusions
        1. Overstated results
        2. Ad-hoc explanations
        3. Generalizing beyond the scope
    16. Chapter 14: Ten Common Statistical Mistakes
      1. Misleading Graphs
        1. Pie charts
        2. Bar graphs
        3. Time charts
        4. Histograms
      2. Biased Data
      3. No Margin of Error
      4. Nonrandom Samples
      5. Missing Sample Sizes
      6. Misinterpreted Correlations
      7. Confounding Variables
      8. Botched Numbers
      9. Selectively Reporting Results
      10. The Almighty Anecdote
    17. Appendix: Tables for Reference
    18. Index
  10. Statistics II For Dummies
    1. Dedication
    2. About the Author
    3. Author’s Acknowledgments
    4. Introduction
      1. About This Book
      2. Conventions Used in This Book
      3. What You’re Not to Read
      4. Foolish Assumptions
      5. How This Book Is Organized
        1. Part I: Tackling Data Analysis and Model-Building Basics
        2. Part II: Using Different Types of Regression to Make Predictions
        3. Part III: Analyzing Variance with ANOVA
        4. Part IV: Building Strong Connections with Chi-Square Tests
        5. Part V: Nonparametric Statistics: Rebels without a Distribution
        6. Part VI: The Part of Tens
      6. Icons Used in This Book
      7. Where to Go from Here
    5. Part I: Tackling Data Analysis and Model-Building Basics
      1. Chapter 1: Beyond Number Crunching: The Art and Science of Data Analysis
        1. Data Analysis: Looking before You Crunch
          1. Nothing (not even a straight line) lasts forever
          2. Data snooping isn’t cool
          3. No (data) fishing allowed
        2. Getting the Big Picture: An Overview of Stats II
          1. Population parameter
          2. Sample statistic
          3. Confidence interval
          4. Hypothesis test
          5. Analysis of variance (ANOVA)
          6. Multiple comparisons
          7. Interaction effects
          8. Correlation
          9. Linear regression
          10. Chi-square tests
          11. Nonparametrics
      2. Chapter 2: Finding the Right Analysis for the Job
        1. Categorical versus Quantitative Variables
        2. Statistics for Categorical Variables
          1. Estimating a proportion
          2. Comparing proportions
          3. Looking for relationships between categorical variables
          4. Building models to make predictions
        3. Statistics for Quantitative Variables
          1. Making estimates
          2. Making comparisons
          3. Exploring relationships
          4. Predicting y using x
        4. Avoiding Bias
        5. Measuring Precision with Margin of Error
        6. Knowing Your Limitations
      3. Chapter 3: Reviewing Confidence Intervals and Hypothesis Tests
        1. Estimating Parameters by Using Confidence Intervals
          1. Getting the basics: The general form of a confidence interval
          2. Finding the confidence interval for a population mean
          3. What changes the margin of error?
          4. Interpreting a confidence interval
        2. What’s the Hype about Hypothesis Tests?
          1. What Ho and Ha really represent
          2. Gathering your evidence into a test statistic
          3. Determining strength of evidence with a p-value
          4. False alarms and missed opportunities: Type I and II errors
          5. The power of a hypothesis test
    6. Part II: Using Different Types of Regression to Make Predictions
      1. Chapter 4: Getting in Line with Simple Linear Regression
        1. Exploring Relationships with Scatterplots and Correlations
          1. Using scatterplots to explore relationships
          2. Collating the information by using the correlation coefficient
        2. Building a Simple Linear Regression Model
          1. Finding the best-fitting line to model your data
          2. The y-intercept of the regression line
          3. The slope of the regression line
          4. Making point estimates by using the regression line
        3. No Conclusion Left Behind: Tests and Confidence Intervals for Regression
          1. Scrutinizing the slope
          2. Inspecting the y-intercept
          3. Building confidence intervals for the average response
          4. Making the band with prediction intervals
        4. Checking the Model’s Fit (The Data, Not the Clothes!)
          1. Defining the conditions
          2. Finding and exploring the residuals
          3. Using r2 to measure model fit
          4. Scoping for outliers
        5. Knowing the Limitations of Your Regression Analysis
          1. Avoiding slipping into cause-and-effect mode
          2. Extrapolation: The ultimate no-no
          3. Sometimes you need more than one variable
      2. Chapter 5: Multiple Regression with Two X Variables
        1. Getting to Know the Multiple Regression Model
          1. Discovering the uses of multiple regression
          2. Looking at the general form of the multiple regression model
          3. Stepping through the analysis
        2. Looking at x’s and y’s
        3. Collecting the Data
        4. Pinpointing Possible Relationships
          1. Making scatterplots
          2. Correlations: Examining the bond
        5. Checking for Multicollinearity
        6. Finding the Best-Fitting Model for Two x Variables
          1. Getting the multiple regression coefficients
          2. Interpreting the coefficients
          3. Testing the coefficients
        7. Predicting y by Using the x Variables
        8. Checking the Fit of the Multiple Regression Model
          1. Noting the conditions
          2. Plotting a plan to check the conditions
          3. Checking the three conditions
      3. Chapter 6: How Can I Miss You If You Won’t Leave? Regression Model Selection
        1. Getting a Kick out of Estimating Punt Distance
          1. Brainstorming variables and collecting data
          2. Examining scatterplots and correlations
        2. Just Like Buying Shoes: The Model Looks Nice, But Does It Fit?
          1. Assessing the fit of multiple regression models
          2. Model selection procedures
      4. Chapter 7: Getting Ahead of the Learning Curve with Nonlinear Regression
        1. Anticipating Nonlinear Regression
        2. Starting Out with Scatterplots
        3. Handling Curves in the Road with Polynomials
          1. Bringing back polynomials
          2. Searching for the best polynomial model
          3. Using a second-degree polynomial to pass the quiz
          4. Assessing the fit of a polynomial model
          5. Making predictions
        4. Going Up? Going Down? Go Exponential!
          1. Recollecting exponential models
          2. Searching for the best exponential model
          3. Spreading secrets at an exponential rate
      5. Chapter 8: Yes, No, Maybe So: Making Predictions by Using Logistic Regression
        1. Understanding a Logistic Regression Model
          1. How is logistic regression different from other regressions?
          2. Using an S-curve to estimate probabilities
          3. Interpreting the coefficients of the logistic regression model
          4. The logistic regression model in action
        2. Carrying Out a Logistic Regression Analysis
          1. Running the analysis in Minitab
          2. Finding the coefficients and making the model
          3. Estimating p
          4. Checking the fit of the model
          5. Fitting the Movie Model
    7. Part III: Analyzing Variance with ANOVA
      1. Chapter 9: Testing Lots of Means? Come On Over to ANOVA!
        1. Comparing Two Means with a t-Test
        2. Evaluating More Means with ANOVA
          1. Spitting seeds: A situation just waiting for ANOVA
          2. Walking through the steps of ANOVA
        3. Checking the Conditions
          1. Verifying independence
          2. Looking for what’s normal
          3. Taking note of spread
        4. Setting Up the Hypotheses
        5. Doing the F-Test
          1. Running ANOVA in Minitab
          2. Breaking down the variance into sums of squares
          3. Locating those mean sums of squares
          4. Figuring the F-statistic
          5. Making conclusions from ANOVA
          6. What’s next?
        6. Checking the Fit of the ANOVA Model
      2. Chapter 10: Sorting Out the Means with Multiple Comparisons
        1. Following Up after ANOVA
          1. Comparing cellphone minutes: An example
          2. Setting the stage for multiple comparison procedures
        2. Pinpointing Differing Means with Fisher and Tukey
          1. Fishing for differences with Fisher’s LSD
          2. Using Fisher’s new and improved LSD
          3. Separating the turkeys with Tukey’s test
        3. Examining the Output to Determine the Analysis
        4. So Many Other Procedures, So Little Time!
          1. Controlling for baloney with the Bonferroni adjustment
          2. Comparing combinations by using Scheffe’s method
          3. Finding out whodunit with Dunnett’s test
          4. Staying cool with Student Newman-Keuls
          5. Duncan’s multiple range test
          6. Going nonparametric with the Kruskal-Wallis test
      3. Chapter 11: Finding Your Way through Two-Way ANOVA
        1. Setting Up the Two-Way ANOVA Model
          1. Determining the treatments
          2. Stepping through the sums of squares
        2. Understanding Interaction Effects
          1. What is interaction, anyway?
          2. Interacting with interaction plots
        3. Testing the Terms in Two-Way ANOVA
        4. Running the Two-Way ANOVA Table
          1. Interpreting the results: Numbers and graphs
        5. Are Whites Whiter in Hot Water? Two-Way ANOVA Investigates
      4. Chapter 12: Regression and ANOVA: Surprise Relatives!
        1. Seeing Regression through the Eyes of Variation
          1. Spotting variability and finding an “x-planation”
          2. Getting results with regression
          3. Assessing the fit of the regression model
        2. Regression and ANOVA: A Meeting of the Models
          1. Comparing sums of squares
          2. Dividing up the degrees of freedom
          3. Bringing regression to the ANOVA table
          4. Relating the F- and t-statistics: The final frontier
    8. Part IV: Building Strong Connections with Chi-Square Tests
      1. Chapter 13: Forming Associations with Two-Way Tables
        1. Breaking Down a Two-Way Table
          1. Organizing data into a two-way table
          2. Filling in the cell counts
          3. Making marginal totals
        2. Breaking Down the Probabilities
          1. Marginal probabilities
          2. Joint probabilities
          3. Conditional probabilities
        3. Trying To Be Independent
          1. Checking for independence between two categories
          2. Checking for independence between two variables
        4. Demystifying Simpson’s Paradox
          1. Experiencing Simpson’s Paradox
          2. Figuring out why Simpson’s Paradox occurs
          3. Keeping one eye open for Simpson’s Paradox
      2. Chapter 14: Being Independent Enough for the Chi-Square Test
        1. The Chi-square Test for Independence
          1. Collecting and organizing the data
          2. Determining the hypotheses
          3. Figuring expected cell counts
          4. Checking the conditions for the test
          5. Calculating the Chi-square test statistic
          6. Finding your results on the Chi-square table
          7. Drawing your conclusions
          8. Putting the Chi-square to the test
        2. Comparing Two Tests for Comparing Two Proportions
          1. Getting reacquainted with the Z-test for two population proportions
          2. Equating Chi-square tests and Z-tests for a two-by-two table
      3. Chapter 15: Using Chi-Square Tests for Goodness-of-Fit (Your Data, Not Your Jeans)
        1. Finding the Goodness-of-Fit Statistic
          1. What’s observed versus what’s expected
          2. Calculating the goodness-of-fit statistic
        2. Interpreting the Goodness-of-Fit Statistic Using a Chi-Square
          1. Checking the conditions before you start
          2. The steps of the Chi-square goodness-of-fit test
    9. Part V: Nonparametric Statistics: Rebels without a Distribution
      1. Chapter 16: Going Nonparametric
        1. Arguing for Nonparametric Statistics
          1. No need to fret if conditions aren’t met
          2. The median’s in the spotlight for a change
          3. So, what’s the catch?
        2. Mastering the Basics of Nonparametric Statistics
          1. Sign
          2. Rank
          3. Signed rank
          4. Rank sum
      2. Chapter 17: All Signs Point to the Sign Test and Signed Rank Test
        1. Reading the Signs: The Sign Test
          1. Testing the median
          2. Estimating the median
          3. Testing matched pairs
        2. Going a Step Further with the Signed Rank Test
          1. A limitation of the sign test
          2. Stepping through the signed rank test
          3. Losing weight with signed ranks
      3. Chapter 18: Pulling Rank with the Rank Sum Test
        1. Conducting the Rank Sum Test
          1. Checking the conditions
          2. Stepping through the test
          3. Stepping up the sample size
        2. Performing a Rank Sum Test: Which Real Estate Agent Sells Homes Faster?
          1. Checking the conditions for this test
          2. Testing the hypotheses
      4. Chapter 19: Do the Kruskal-Wallis and Rank the Sums with the Wilcoxon
        1. Doing the Kruskal-Wallis Test to Compare More than Two Populations
          1. Checking the conditions
          2. Setting up the test
          3. Conducting the test step by step
        2. Pinpointing the Differences: The Wilcoxon Rank Sum Test
          1. Pairing off with pairwise comparisons
          2. Carrying out comparison tests to see who’s different
          3. Examining the medians to see how they’re different
      5. Chapter 20: Pointing Out Correlations with Spearman’s Rank
        1. Pickin’ On Pearson and His Precious Conditions
        2. Scoring with Spearman’s Rank Correlation
          1. Figuring Spearman’s rank correlation
          2. Watching Spearman at work: Relating aptitude to performance
    10. Part VI: The Part of Tens
      1. Chapter 21: Ten Common Errors in Statistical Conclusions
      2. Chapter 22: Ten Ways to Get Ahead by Knowing Statistics
      3. Chapter 23: Ten Cool Jobs That Use Statistics
    11. Appendix: Reference Tables
    12. Index
  11. Introduction to Random Forests for Beginners – free ebook
  12. Random Forests
    1. Salford Systems and Random Forests
    2. What are Random Forests?
      1. Prerequisites
      2. Preliminaries
      3. The Essentials
      4. More Essentials
      5. Predictions
      6. Weakness of Bagger
      7. Putting Randomness into Random Forests
      8. How Random Should a Random Forest Be?
      9. More on Random Splitting
      10. Controlling Degree of Randomness
      11. How Many Predictors in Each Node?
      12. Random Forests Predictions
      13. Out of Bag (OOB) Data
      14. Testing and Evaluating
      15. More Testing and Evaluation
      16. Testing vs. Scoring
      17. Random Forests and Segmentation
      18. Proximity Matrix
      19. Characteristics of the Proximity Matrix
      20. Proximity Insights
      21. Proximity Visualization
      22. Missing Values
      23. Proximity and Missing Values
      24. Missing Imputation
      25. Variable Importance
      26. Variable Importance Details
      27. Variable Importance Issues
      28. Variable Importance: Final Observation
      29. Bootstrap Resampling
      30. The Technical Algorithm
    3. Suited for Wide Data
      1. Wide Data
      2. Random Forests and Wide Data
      3. Wide Shallow Data
    4. Strength and Weaknesses of Random Forests
      1. Random Forests: Few Controls to Learn & Set
      2. Easily Parallelizable
      3. Random Forests Weaknesses
      4. Therefore…
    5. Simple Example
      1. Boston Housing Data
      2. Run an RF model
      3. RF Controls
      4. Results Summary
      5. Confusion Matrix OOB
      6. Variable Importance
      7. Most Likely Vs Not Likely Parallel Coordinate Plots
      8. Parallel Coordinate Plots
      9. Outliers
      10. Proximity & Clusters
    6. Real World Example
      1. Analytics On a Grand Scale
      2. Challenge
      3. Some essential questions:
      4. The solution
      5. The result
      6. Customer Support
    7. Why Salford Systems?
      1. Unique to Salford
      2. Measure the ROI of Random Forests
  13. Connecting Alaska Landscapes Into the Future
    1. Cover Page
    2. Acknowledgments
    3. Executive Summary
    4. Introduction
      1. Scope and Purpose of the Report
      2. Terms and Definitions
      3. Project Background
      4. Models Used in the Connectivity Project
        1. MODELING CLIMATE CHANGE: SNAP CLIMATE MODELS
        2. MODELING CLIMATE CHANGE ENVELOPES: RANDOM FORESTS™
        3. MODELING CONNECTIVITY: MARXAN
      5. Important Factors and Models That Were NOT Incorporated
        1. SEA LEVEL RISE
        2. PERMAFROST CHANGE
        3. ALFRESCO
    5. Part I Modeling Shifts in Ecosystem/Vegetation
      1. Landscape Classification
        1. Figure 1 Ecological groups, as defined by Nowacki et al. (2001). See Appendix C for full descriptions of each biome
        2. Figure 2 Alaska biomes and Canadian ecoregions
      2. Modeling the Effects of Climate Change on Biomes
        1. Figure 3 Conceptual illustration of how the Random Forests™ model was used to create climate envelopes for different modeling subjects
        2. Figure 4 Current biome types as predicted by SNAP climate data
        3. Figure 5 Projected potential biomes for 2030–2039
        4. Figure 6 Projected potential biomes for 2060–2069
        5. Figure 7 Projected potential biomes for 2090–2099
        6. Figure 8 Box plots comparing climate envelopes for biomes/ecoregions
    6. Part II Connectivity and Resilience and Conservation Potential
      1. Figure 9 Conservation lands in Alaska
      2. Methods
        1. CRITERION 1: BIOME REFUGIA
        2. Figure 10 Number of potential biome changes projected by 2099
        3. Figure 11 Biome refugia
        4. CRITERION 2: NDVI AS AN INDICATOR OF PRODUCTIVITY AND ENDEMISM
        5. Figure 12 Historical mean NDVI across the state of Alaska
        6. CRITERION 3: PROPORTION OF LAND AREA IN CONSERVATION STATUS
        7. MODELING STEPS
        8. Figure 13 Biome refugia on a 40 km hexagonal grid
        9. Figure 14 Marxan solution illustrating ranking by biome assuming 10% of total land area in conservation status
        10. Figure 15 Marxan solution statewide by biome assuming 25% of total land area in conservation status
      3. Marxan Modeling Results
      4. Discussion of Marxan Results
        1. Figure 16 Potential connections between Marxan solutions for lands within each biome category
    7. Part III Modeling Changes in Distribution of Indicator Species
      1. Caribou
        1. Figure 17 Alaska caribou herd ranges (2008)
        2. Figure 18 Projected caribou range for all herds combined
      2. Alaska Marmots
        1. Figure 19 Known Alaska marmot distribution and modeled current distribution
        2. Figure 20 Projected Alaska marmot distribution
      3. Trumpeter Swans
        1. Figure 21 Potential expansion of trumpeter swan habitat
      4. Reed Canary Grass
        1. Figure 22 Potential spread of reed canary grass, using climate and all-season roads as predictors
    8. Part IV Implications of Modeling for Conservation and Research and Management
        1. Figure 23 Case study area, Kenai Peninsula
      1. Supporting Evidence for our Models from the Field: Kenai Peninsula Case Study
        1. Figure 24 Landcover, existing biome, and predicted biome
        2. Figure 25 Modeled spread of reed canary grass
        3. Figure 26 Increase in likelihood of occurrence of trumpeter swans
        4. Figure 27 Little change is seen in potential caribou distribution
      2. Implications of Results for Conservation, Research, and Management
    9. Part V Lessons Learned and Next Steps
    10. Appendices and Technical Addenda
      1. Appendix A Project Participants
        1. P.I.s
        2. Alaska Department of Fish and Game
        3. Alaska Natural Resource Center, National Wildlife Federation
        4. Audubon Alaska
        5. Defenders of Wildlife
        6. The Nature Conservancy
        7. The Wilderness Society
        8. U.S. Bureau of Land Management
        9. U.S. Fish and Wildlife Service
        10. U.S. Forest Service
        11. U.S. Geological Survey
        12. U.S. National Park Service
        13. UAA Natural Heritage Program & Alaska GAP
      2. Appendix B Additional Reading and References
      3. Appendix C Alaskan Biome and Canadian Ecozone Descriptions
        1. Alpine
        2. Arctic
        3. Western Tundra
        4. Alaska Boreal
        5. Boreal Transition
        6. North Pacific Maritime
        7. Aleutian Islands
        8. Boreal Cordillera Ecozone
        9. Taiga Cordillera Ecozone
        10. Pacific Maritime Ecozone
        11. Montane Cordillera Ecozone
        12. Taiga Plains Ecozone
        13. Boreal Plains Ecozone
      4. Technical Addendum I Derivation of SNAP Climate Projections
      5. Technical Addendum II Identifying the Alpine Biome
        1. Figure II-1 Projected change in the Alpine zone over time
      6. Technical Addendum III LANDFIRE Mapping of Units 70 and 71 (ez7071)
        1. Table III-1 Grouping of the original LANDFIRE classes into 20 land cover classes for modeling
        2. Figure III-1 Extrapolating statewide, based on vegetation classifications for units 70 and 71
      7. Technical Addendum IV Defining Biomes and Modeling Biome Shifts in Random Forests™
        1. Reconciling Canadian and Alaskan data
        2. Defining climate envelopes in a continuous landscape
      8. Technical Addendum V Caribou Modeling by Herd or Season
        1. Table V-1 Seasonal ranges (updated 2008) used in modeling differences in northern and southern caribou herds in Alaska
        2. Figure V-1 Available range information for southern herds in 2008
        3. Figure V-2 Projected caribou range for all herds combined
        4. Figure V-3 Range size of all herds as a percentage of the state
      9. Technical Addendum VI Derivation of Marmot Model
        1. Figure VI-1 Known Alaska marmot occurrences overlaid with modeled Alpine zone model
        2. Figure VI-2 Terrain roughness
        3. Figure VI-3 Potential marmot habitat with terrain roughness as a factor
      10. Technical Addendum VII Derivation of Swan Model
        1. Figure VII-1 Trumpeter swan sightings, 2005
      11. Technical Addendum VIII Derivation of Reed Canary Grass Model
        1. Figure VIII-1 Grass modeling using a more extensive transportation network
    11. Notes

Story

Data Science for Random Forests

Start with the video Learning Path: Data Science with R, then see the Kaggle Competition How Much Did It Rain? II, using Spotfire instead of R! See the Slides, Sample Solution, and Data Dictionary Spreadsheet.

I got a request from a new member of the Federal Big Data Working Group Meetup to "look over my shoulder" when I did my data science to help them enter a Kaggle Competition:

As a new Data Scientist, I am really struggling to keep up with all of the technology changes and how to actually apply what I've been able to learn to real world problems.

The next time that you are working on a project, would it be possible for me to look over your shoulder, watch what you do or could we talk through some Kaggle competitions, talk about what they are looking for and how to attack the problem?

Thanks for hosting the Meetup, it is really informative.

I replied:

Hello and thanks for contacting me.

Please look at: http://semanticommunity.info/Data_Science/Data_Science_for_Statistics.com

It is a tutorial I did recently on a Kaggle Training Data Set (Titanic) for just the purpose you are asking about. See Data Science for Statistics.com.

Then pick a Kaggle Competition you are interested in and send me the link and I will see if I can do a tutorial for it and then show it step-by-step in one of our Meetups and make a recording of it.

He then replied:

I am working through your tutorial, but I've run into a snag.

I am having to use the "Cloud" version because I use a Mac.  I do not see anywhere via the Web interface to create the "Data Relationships."  Do you know if this can be done via the Web version?

As an aside, can this analysis also be done in R?

I replied:

I do not think it can be done with the Web version.

Yes, but with lots of R coding. Kaggle provides the R code as I recall in the Titanic example.

He replied:

As info, I hand-jammed one of the R scripts on the website to see what it does. I just picked the first one I came to, and it was a Random Forest model.  (https://www.kaggle.com/benhamner/tit...nchmark-r/code)

I got it to run, and it showed that "Sex" was the most important factor in determining survivability (almost 2x as important as the other factors).

That said, I'm not really sure what I did.  I'm reading up right now on what a Random Forest is and how it was used in this context to spit out an accurate result.
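
The benchmark he ran is an R script, but the mechanics it performs are easy to sketch: fit a random forest, then read off the per-variable importance scores to see which factor matters most. The following Python sketch uses scikit-learn on a small fabricated data set; the column names mirror the Titanic data, but the data itself and the planted "Sex" effect are assumptions for illustration only, not the actual Kaggle script or data.

```python
# Sketch of the random-forest variable-importance idea on fabricated,
# Titanic-like data (NOT the real Kaggle data or benchmark script).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Fabricated stand-in features: Sex (0/1), Pclass (1-3), Age, Fare.
sex = rng.integers(0, 2, n)
pclass = rng.integers(1, 4, n)
age = rng.uniform(1, 70, n)
fare = rng.uniform(5, 100, n)
# Survival is planted to depend mostly on Sex, so the importance
# scores should rank Sex first, echoing the Titanic result.
survived = (2.0 * sex + 0.3 * (pclass == 1) + rng.normal(0, 0.5, n) > 1).astype(int)

X = np.column_stack([sex, pclass, age, fare])
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, survived)

# Print each feature's share of the total (Gini) importance.
for name, imp in zip(["Sex", "Pclass", "Age", "Fare"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

This is what "Sex was the most important factor" means mechanically: the forest's importance score for that column dominates the others.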

I replied:

You are welcome. The Spotfire documentation and Wikipedia have good explanations of statistical methods.

Spotfire's Data Relationships feature is a good example of the power of Spotfire without coding, because the equivalent R code would be long and complicated.

He replied:

Ok, I'll look at the Spotfire documentation.  My impression of Wikipedia is that people get on there and try to "out-do" each other, to the point that the information is so far above my head that I just end up googling "Statistics for Dummies".

I replied:

Better idea:)

This prompted me to look at:

Statistics Essentials For Dummies

Introduction to Random Forests for Beginners – free ebook

and to repurpose the free ebook below: Random Forests

by copying the text to a Notepad file and copying each header and body section into MindTouch in Edit Mode.

The chapter in the book, Suited for Wide Data, reminded me of why Spotfire was originally developed: to handle very wide data by providing "Details on Demand" in the interface and statistical methods like Data Relationships to handle many variables efficiently.
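
Spotfire's Data Relationships tool does this kind of many-variable screening interactively and with more statistical machinery than shown here. As a rough, hypothetical analogue of the underlying idea (an assumption about the concept, not Spotfire's actual algorithm), one can rank every pair of numeric columns in a wide table by correlation strength:

```python
# Rough analogue of pairwise-relationship screening over wide data:
# rank every column pair by absolute Pearson correlation.
# (Illustrative only -- not Spotfire's actual Data Relationships algorithm.)
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_cols = 200, 8            # "wide" in miniature
data = rng.normal(size=(n_rows, n_cols))
# Plant one strong relationship between columns 0 and 1.
data[:, 1] = 0.9 * data[:, 0] + rng.normal(scale=0.2, size=n_rows)

corr = np.corrcoef(data, rowvar=False)
pairs = sorted(
    ((abs(corr[i, j]), i, j) for i, j in itertools.combinations(range(n_cols), 2)),
    reverse=True,
)
# The planted pair (col0, col1) should surface at the top of the ranking.
for strength, i, j in pairs[:3]:
    print(f"col{i} ~ col{j}: |r| = {strength:.2f}")
```

With hundreds of columns the number of pairs grows quadratically, which is exactly why an automated screening step is valuable for wide data.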

We have had excellent presentations in our Federal Big Data Working Group Meetups on Random Forest (GIVE SPECIFICS).

Just recently, I analyzed the Soil Survey Geographic (SSURGO) and Gridded Soil Survey Geographic (gSSURGO) Databases, which have both many columns and many rows, using Spotfire Data Relationships. See Spotfire Dashboard.

I did this as part of my online course on Big Data Science for the Precision Farming Business, Week 5 Evaluation.

Your excellent questions have gotten me started on Data Science for Random Forests leading to a future Meetup.

I have skimmed the two books: Statistics Essentials For Dummies (in which I do not find Random Forests mentioned) and Introduction to Random Forests for Beginners – free ebook (which is all about Random Forests and the Salford Systems tool for running them).

To do data science, I need the data set used by Dr. Falk Huettmann, and so I started a Google search and found:

http://www.salford-systems.com/news/salford-systems-helps-forecast-alaska-s-ecosystem-in-the-22nd-century (no links to anything)

http://www.kdnuggets.com/2011/04/salford-forecasts-alaska-ecosystem-22nd-century.html  (with one link)

http://alaska.usgs.gov/science/biology/ecomonitoring/pdfs/ConnectivityProject_UpdateMay7_09.pdf (but no links to data)

https://www.snap.uaf.edu/attachments/SNAP-connectivity-2010-complete.pdf (Final Report)

A Google Chrome Find for "Random Forests" returns 35 hits, two of them in the Table of Contents:

Introduction: Modeling climate change envelopes: Random Forests™ (page 9)

Technical Addendum IV Defining Biomes and Modeling Biome Shifts in Random Forests™ (pages 84-85)

The addendum says: We downloaded the data in GIS format from the following websites: http://sis.agr.gc.ca/cansis/nsdb/ecostrat/gis_data.html

(The National Ecological Framework for Canada GIS data sets, providing ecozones and ecoregions as shapefiles) and

http://climate.weatheroffice.ec.gc.ca/Welcome_e.html

(Canada’s National Climate Data and Information Archive, providing historical GIS data for cities and ecoregions based on 1971–2000 normals).

So we need to see whether this data is still available, whether we can reproduce Dr. Falk Huettmann's results, and whether we can produce more useful results.

So this is a beginning to answer the four data science questions:

  • How was the data collected?
  • Where is the data stored?
  • What are the data results?
  • Why should we believe the data results?

This is one step in the Data Mining, Data Science, and Data Publication process:

Data Mining Process:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

Data Science Process:

  • Data Preparation
  • Data Ecosystem
  • Data Story

Data Science Data Publication:

  • Knowledge Base
  • Spreadsheet Index
  • Web & PDF Tables to Spreadsheet
  • Data Browser
  • Dynamically Linked Adjacent Visualizations

This is the process you asked me to teach you.

Dr. Falk Huettmann is very confident in the RandomForest software and its results for Alaska. However, the inventor, Professor Leo Breiman, described his philosophy as follows:

  • RF is an example of a tool that is useful in doing analyses of scientific data.
  • But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem.
  • Take the output of random forests not as absolute truth, but as smart, computer-generated guesses that may be helpful in leading to a deeper understanding of the problem.

We need to do an audit and see who is closer to the truth.

MORE TO FOLLOW.

So can Spotfire do Random Forests by using Data Relationships, etc.? Is the Alaskan data still available, and can we use it? Can we involve other Meetup members in this and do a Meetup? Can I successfully convert the PDF file to Word and import it into MindTouch?

Slides

Slide 1 Effective Applications of the R Language

LouBajuk-Yorgan09142015Slide1.PNG

Slide 2 Analytic Challenges for Enterprises

LouBajuk-Yorgan09142015Slide2.PNG

Slide 3 R Can Help...But Has Its Own Challenges

LouBajuk-Yorgan09142015Slide3.PNG

Slide 4 Why Embrace R?

LouBajuk-Yorgan01142013Slide4.PNG

Slide 5 TIBCO Enterprise Runtime for R (TERR)

LouBajuk-Yorgan09142015Slide5.PNG

Slide 6 TERR Performance

LouBajuk-Yorgan09142015Slide6.PNG

Slide 7 TIBCO Spotfire Visual Analytics

LouBajuk-Yorgan09142015Slide7.PNG

Slide 8 TIBCO Spotfire Predictive Analytics Ecosystem

LouBajuk-Yorgan09142015Slide8.PNG

Slide 9 Example 1: Embedded TERR in Spotfire

LouBajuk-Yorgan09142015Slide9.PNG

Slide 10 Power of Embedded Advanced Analytics

LouBajuk-Yorgan09142015Slide10.PNG

Slide 11 Advanced Analytics Applications in Spotfire

LouBajuk-Yorgan09142015Slide11.PNG

Slide 12 Example 2: TERR in TIBCO's Complex Event Processing

LouBajuk-Yorgan09142015Slide12.PNG

Slide 13 Logistics Optimization

LouBajuk-Yorgan09142015Slide13.PNG

Slide 14 Predictive Maintenance for Oil & Gas

LouBajuk-Yorgan09142015Slide14.PNG

Slide 15 TERR Ecosystem

LouBajuk-Yorgan09142015Slide15.PNG

Slide 16 TERR for Individual R Users

LouBajuk-Yorgan09142015Slide16.PNG

Slide 17 TERR is R for the Enterprise

LouBajuk-Yorgan09142015Slide17.PNG

Slide 18 Learn More and Try It Yourself

LouBajuk-Yorgan09142015Slide18.PNG

Slide 19 TIBCO's Unique History with R/S

LouBajuk-Yorgan09142015Slide19.PNG

Slides

Slide 1 Extending the Reach of R to the Enterprise: TIBCO Enterprise Runtime for R

LouBajuk-Yorgan01142013Slide1.PNG

Slide 2 Extending the Reach of R to the Enterprise

LouBajuk-Yorgan01142013Slide2.PNG

Slide 3 Our (and My) Journey to TERR

LouBajuk-Yorgan01142013Slide3.PNG

Slide 4 Why embrace R?

But large enterprises have often been challenged to realize that value
“More on that in a minute”

LouBajuk-Yorgan01142013Slide4.PNG

Slide 5 What does “Embracing R” look like?

“But we needed to do more. This wasn’t enough for Enterprise customers.”

LouBajuk-Yorgan01142013Slide5.PNG

Slide 6 Enterprise Challenges for R

What we heard from our customers

“We didn’t build R for you.”

LouBajuk-Yorgan01142013Slide6.PNG

Slide 7 TIBCO Enterprise Runtime for R (TERR)

Something only TIBCO could do
Based on Enterprise, Analytic, and S+ expertise
Twice or more the speed of R, much more efficient & robust use of memory

Clean room implementation

In terms of limitations:
We are 100% R compatible (in terms of full support for the R language), but the first release won’t have full R coverage (in terms of support for all the core R functions). We focused on implementing our own versions of the most commonly used functions and statistical methods, and support for the most commonly used R packages (so many R packages from the CRAN repository can be run in TERR without any modification).
The specific things we haven’t done yet include:
Support for R-based graphics (since our first release focuses on Spotfire integration, we use Spotfire for visualization; also, based on customer input, these seemed less important for production runs of R language scripts)
Support for some of the less-commonly-used statistical and utility functions in core R

LouBajuk-Yorgan01142013Slide7.PNG

Slide 8 Providing value for organizations who use R

LouBajuk-Yorgan01142013Slide8.PNG

Slide 9 Providing Value for individuals who use R

LouBajuk-Yorgan01142013Slide9.PNG

Slide 10 TERR Examples

LouBajuk-Yorgan01142013Slide10.PNG

Slide 11 TERR vs. R Raw Performance

Meaningful benchmarks are hard

LouBajuk-Yorgan01142013Slide11.PNG

Slide 12 TERR in Spotfire: Predictive Modeling

LouBajuk-Yorgan01142013Slide12.PNG

Slide 13 Real time Fraud Detection

LouBajuk-Yorgan01142013Slide13.PNG

Slide 14 Extending the Reach of R to the Enterprise

LouBajuk-Yorgan01142013Slide14.PNG

Spotfire Dashboard

For Internet Explorer users and those wanting full-screen display, use the Web Player, or get the Spotfire for iPad app.


Research Notes

A Google search for "Spotfire Random Forest" returned the following results:

http://www.edii.uclm.es/~useR-2013/slides/118.pdf

http://www.londonr.org/Presentations/TERR%20Extend%20Reach%20of%20R%20-%20Louis%20Bajuk-Yorgan.pptx

R struggles with Big Data

R was not built for enterprise usage and integration

Not seeking to displace R from statistician’s desktops

– Enterprise platform for the deployment and integration of your work—without having to rewrite it!

Power of predictive analytics in Spotfire

  • For non-R programmers
  • Leverage the interactive visualizations of Spotfire
  • Powered by embedded TERR engine

Example integration with TERR

  • Deploy R models on TERR engine for real-time scoring in response to complex events
  • Random forest model, scoring online transactions for fraud

https://docs.tibco.com/pub/spotfire/general/Whats_New_in_Spotfire_6.5.pdf

TIBCO® Enterprise Runtime for R (TERR)

Expanded R coverage and R package compatibility

TERR 2.5 includes expanded support for R packages and core R functionality:  

  • Splines
  • Hierarchical and k-means clustering
  • Improved compatibility with many important packages, including MASS, randomForest, boot, rpart, e1071, mgcv, earth and bioconductor

TIBCO® Enterprise Runtime for R

https://docs.tibco.com/products/tibco-enterprise-runtime-for-r-3-2-0

TIBCO® Enterprise Runtime for R 3.2.0

Find out more about the data mining and predictive analytics tools Salford Systems has to offer with free, online videos.

Watch videos on:

  • Introduction to Data Mining
  • CART (Classification and Regression Trees)
  • MARS (Multivariate Adaptive Regression Splines)
  • TreeNet (Stochastic Gradient Boosting)

Introductory data mining videos

MORE TO FOLLOW

How Much Did It Rain? II

Source: https://www.kaggle.com/c/how-much-did-it-rain-ii

Competition Details

https://www.kaggle.com/c/how-much-did-it-rain-ii

Predict hourly rainfall using data from polarimetric radars

After incorporating feedback from the Kaggle community, as well as scientific and educational partners, the Artificial Intelligence Committee of the American Meteorological Society is excited to be running a second iteration of the How Much Did It Rain? competition.

How Much Did It Rain? II is focused on solving the same core rain measurement prediction problem, but approaches it with a new and improved dataset and evaluation metric. This competition will go even further towards building a useful educational tool for universities, as well as making a meaningful contribution to continued meteorological research.

Competition Description

Rainfall is highly variable across space and time, making it notoriously tricky to measure. Rain gauges can be an effective measurement tool for a specific location, but it is impossible to have them everywhere. In order to have widespread coverage, data from weather radars is used to estimate rainfall nationwide. Unfortunately, these predictions never exactly match the measurements taken using rain gauges.

Recently, in an effort to improve their rainfall predictors, the U.S. National Weather Service upgraded their radar network to be polarimetric. These polarimetric radars are able to provide higher quality data than conventional Doppler radars because they transmit radio wave pulses with both horizontal and vertical orientations. 

Polarimetric radar. Image courtesy NOAA

Dual pulses make it easier to infer the size and type of precipitation because rain drops become flatter as they increase in size, whereas ice crystals tend to be elongated vertically.

In this competition, you are given snapshots of polarimetric radar values and asked to predict the hourly rain gauge total. A word of caution: many of the gauge values in the training dataset are implausible (gauges may get clogged, for example). More details are on the data page.

Acknowledgements

This competition is sponsored by the Artificial Intelligence Committee of the American Meteorological Society. Climate Corporation is providing the prize pool.

Started: 9:53 pm, Thursday 17 September 2015 UTC 
Ends: 11:59 pm, Monday 7 December 2015 UTC (81 total days) 
Points: this competition awards standard ranking points 
Tiers: this competition counts towards tiers

Get the Data

Source: https://www.kaggle.com/c/how-much-did-it-rain-ii/data

Data Files

Source: https://www.kaggle.com/c/how-much-did-it-rain-ii/data

File name (available format):

  • train: .zip (240.42 MB)
  • test: .zip (134.52 MB)
  • sample_solution.csv: .zip (4.40 MB)
  • sample_dask: .py (1.96 KB)

The training data consists of NEXRAD and MADIS data collected on 20 days between Apr and Aug 2014 over midwestern corn-growing states. Time and location information have been censored, and the data have been shuffled so that they are not ordered by time or place. The test data consists of data from the same radars and gauges over the remaining days in that month. Please see this page to understand more about polarimetric radar measurements.

File descriptions

  • train.zip - the training set.  This consists of radar observations at gauges in the Midwestern US over 20 days each month during the corn growing season. You are also provided the gauge observation at the end of each hour.
  • test.zip - the test set.  This consists of radar observations at gauges in the Midwestern US over the remaining 10/11 days each month of the same year(s) as the training set.  You are required to predict the gauge observation at the end of each hour.
  • sample_solution.zip - a sample submission file in the correct format
  • sample_dask.py - Example program in Python that will produce the sample submission file.  This program applies the Marshall–Palmer relationship to the radar observations to predict the gauge observation.
Marshall–Palmer relation

Source: http://glossary.ametsoc.org/wiki/Mar...almer_relation

The ZR relationship developed by J. S. Marshall and W. M. Palmer (1948) consistent with an exponential drop-size distribution.

 
The relationship is Z = 200R^1.6, where Z (mm^6 m^-3) is the reflectivity factor and R (mm h^-1) is the rainfall rate. The relationship is sometimes generalized to the form Z = aR^b, where a and b are adjustable parameters.
 
Marshall, J. S., and W. McK. Palmer, 1948: The distribution of raindrops with size. J. Meteor., 5, 165–166.
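
Since radar products report reflectivity on the logarithmic dBZ scale, applying this relation in code means converting dBZ back to Z and inverting Z = 200R^1.6 for R. A minimal Python sketch (the function name is my own, not from any source above):

```python
import math

def marshall_palmer_rate(dbz):
    """Rainfall rate R in mm/h from reflectivity in dBZ,
    via the Marshall-Palmer relation Z = 200 * R**1.6."""
    z = 10.0 ** (dbz / 10.0)           # dBZ -> Z in mm^6 m^-3
    return (z / 200.0) ** (1.0 / 1.6)  # invert Z = 200 * R**1.6

# Z = 200 corresponds to R = 1 mm/h; 10*log10(200) is about 23 dBZ.
print(round(marshall_palmer_rate(10 * math.log10(200)), 6))  # -> 1.0
```
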
drop-size distribution

Source: http://glossary.ametsoc.org/wiki/Dro...e_distribution

The frequency distribution of drop sizes (diameters, volumes) that is characteristic of a given cloud or of a given fall of rain.

 
Most natural clouds have unimodal (single maximum) distributions, but occasionally bimodal distributions are observed. In convective clouds, the drop-size distribution is found to change with time and to vary systematically with height, the modal size increasing and the number decreasing with height. For many purposes a useful single parameter representing a given distribution is the volume median diameter, that is, the diameter for which the total volume of all drops having greater diameters is just equal to the total volume of all drops having smaller diameters. The drop-size distribution is one of the primary factors involved in determining the radar reflectivity of any fall of precipitation, or of a cloud mass.
radar reflectivity factor

Source: http://glossary.ametsoc.org/wiki/Rad...ctivity_factor

A quantity determined by the drop-size distribution of precipitation, which is proportional to the radar reflectivity if the precipitation particles are spheres small compared with the radar wavelength.

 
Given the drop-size distribution of a sample of rain, the radar reflectivity factor may be computed by summing the sixth powers of the diameters of all the drops contained in a unit volume of space. Or, regarding the drop-size distribution N(D) as a continuous function of drop size, the reflectivity factor Z may be written as

Z = ∫ N(D) D^6 dD

For ice-phase precipitation, N(D) is the distribution of melted diameters. Conventional units of Z are mm^6 m^-3, and it is sometimes measured on a logarithmic scale in units of dBZ. The equivalent reflectivity factor Ze may be estimated from measurements of the radar reflectivity η of precipitation and is defined by

Ze = λ^4 η / (0.93 π^5)

where λ is the radar wavelength and 0.93 is the dielectric factor for water. Either the reflectivity factor or the equivalent reflectivity factor is frequently used to estimate rainfall rate using relationships of the form Z = aR^b, where a and b are empirical constants and R is the rainfall rate. For R in millimeters per hour and Z or Ze in mm^6 m^-3, values of a range from 200 to 600 and those of b range from 1.5 to 2. The particular combination of a = 200 and b = 1.6 defines the Marshall–Palmer relation. See radar equation.
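
As a toy illustration of the "sum of sixth powers" definition (the drop sample below is invented for the example, not measured data):

```python
import math

# Hypothetical drop diameters (mm) observed in one cubic metre of air.
drops_mm = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0]

# Reflectivity factor: sum of the sixth powers of the diameters, per m^3.
z = sum(d ** 6 for d in drops_mm)   # mm^6 m^-3
dbz = 10.0 * math.log10(z)          # same quantity on the logarithmic dBZ scale

# The single 2 mm drop contributes 64 of the roughly 80 mm^6 m^-3 total,
# showing how strongly the largest drops dominate reflectivity.
print(z, dbz)
```
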

rainfall rate

Source: http://glossary.ametsoc.org/wiki/Rainfall_rate

A measure of the intensity of rainfall by calculating the amount of rain that would fall over a given interval of time if the rainfall intensity were constant over that time period.

 
The rate is typically expressed in terms of length (depth) per unit time, for example, millimeters per hour, or inches per hour.

Data columns

To understand the data, you have to realize that there are multiple radar observations over the course of an hour, and only one gauge observation (the 'Expected'). That is why there are multiple rows with the same 'Id'.

The columns in the datasets are:

  • Id: A unique number for the set of observations over an hour at a gauge.
  • minutes_past: For each set of radar observations, the minutes past the top of the hour that the radar observations were carried out.  Radar observations are snapshots at that point in time.
  • radardist_km: Distance of gauge from the radar whose observations are being reported.
  • Ref: Radar reflectivity in dBZ
  • Ref_5x5_10th: 10th percentile of reflectivity values in 5x5 neighborhood around the gauge.
  • Ref_5x5_50th: 50th percentile
  • Ref_5x5_90th: 90th percentile
  • RefComposite: Maximum reflectivity in the vertical column above gauge. In dBZ.
  • RefComposite_5x5_10th
  • RefComposite_5x5_50th
  • RefComposite_5x5_90th
  • RhoHV: Correlation coefficient (unitless)
  • RhoHV_5x5_10th
  • RhoHV_5x5_50th
  • RhoHV_5x5_90th
  • Zdr: Differential reflectivity in dB
  • Zdr_5x5_10th
  • Zdr_5x5_50th
  • Zdr_5x5_90th
  • Kdp:  Specific differential phase (deg/km)
  • Kdp_5x5_10th
  • Kdp_5x5_50th
  • Kdp_5x5_90th
  • Expected: Actual gauge observation in mm at the end of the hour.
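
To turn the per-snapshot rows into one hourly prediction per Id, the snapshots must be grouped by Id and their rain rates weighted over the hour, which is roughly what sample_dask.py does with the Marshall–Palmer relation. A stdlib-only Python sketch of that idea (the weighting scheme and function names are my assumptions, not the competition's reference code):

```python
import math
from collections import defaultdict

def mp_rate(dbz):
    """Marshall-Palmer estimate: reflectivity in dBZ -> rain rate in mm/h."""
    return (10.0 ** (dbz / 10.0) / 200.0) ** (1.0 / 1.6)

def predict_hourly_totals(rows):
    """rows: iterable of (Id, minutes_past, Ref) tuples; Ref may be None.
    Returns {Id: predicted hourly gauge total in mm}, weighting each
    snapshot by the assumed fraction of the hour it represents."""
    by_id = defaultdict(list)
    for rid, minutes, ref in rows:
        if ref is not None:                    # skip missing reflectivity
            by_id[rid].append((minutes, ref))
    totals = {}
    for rid, obs in by_id.items():
        obs.sort()
        minutes = [m for m, _ in obs]
        # Assume each snapshot covers the gap since the previous one,
        # and the last snapshot also covers the rest of the hour.
        valid = [minutes[0]] + [b - a for a, b in zip(minutes, minutes[1:])]
        valid[-1] += 60 - minutes[-1]
        totals[rid] = sum(mp_rate(ref) * v / 60.0
                          for (_, ref), v in zip(obs, valid))
    return totals

# One gauge-hour (Id 7) with two snapshots at a constant ~1 mm/h reflectivity:
dbz_1mm = 10 * math.log10(200)
print(predict_hourly_totals([(7, 10, dbz_1mm), (7, 40, dbz_1mm)]))
```
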

Referencing this data

To reference this dataset in scientific publications, please use the following citation:

Lakshmanan, V, A. Kleeman, J. Boshard, R. Minkowsky, A. Pasch, 2015. The AMS-AI 2015-2016 Contest: Probabilistic estimate of hourly rainfall from radar. 13th Conference on Artificial Intelligence, American Meteorological Society, Phoenix, AZ

Competition Rules

Source: https://www.kaggle.com/c/how-much-di...rain-ii/submit

One account per participant

You cannot sign up to Kaggle from multiple accounts and therefore you cannot submit from multiple accounts.

No private sharing outside teams

Privately sharing code or data outside of teams is not permitted. It's okay to share code if made available to all participants on the forums.

Public dissemination of entries

Kaggle and the competition host have the right to publicly disseminate any entries or models.

Open licensing of winners

Winning solutions need to be made available under a popular OSI-approved license in order to be eligible for recognition and prize money.

Winning solutions must be posted or linked to in the forums.

Prizes will be awarded after the winners have posted their solutions to the competition forum. Winners must post or link to their solutions within fourteen (14) days of being notified of their winning status.

Team Mergers

Team mergers are allowed and can be performed by the team leader. In order to merge, the combined team must have a total submission count less than or equal to the maximum allowed as of the merge date. The maximum allowed is the number of submissions per day multiplied by the number of days the competition has been running.

Team Limits

There is no maximum team size.

Submission Limits

You may submit a maximum of 2 entries per day.

You may select up to 2 final submissions for judging.

Competition Timeline

Start Date: 9/17/2015 9:53:46 PM UTC
Merger Deadline: 11/30/2015 11:59:00 PM UTC
First Submission Deadline: 11/30/2015 11:59:00 PM UTC
End Date: 12/7/2015 11:59:00 PM UTC
  • Winners are encouraged (not required) to present at the AMS Annual Meeting in New Orleans in 2016 and asked to write a 2 page paper describing their methods.
  • Employees and immediate family members of the Climate Corporation are not eligible to participate or win prizes.
  • Use of external data is not permitted.

COMPETITION-SPECIFIC TERMS

COMPETITION NAME (the 'Competition'):  How much did it rain II

COMPETITION SPONSOR: American Meteorological Society

COMPETITION WEBSITE: https://www.kaggle.com/c/how-much-did-it-rain-ii

PRIZES: $500

WINNER LICENSE TYPE: Open Source License

COMPETITION FRAMEWORK

These are the complete, official rules for the Competition (the 'Competition Rules') and incorporate by reference the contents of the Competition Website listed above.

By downloading a dataset linked from the Competition Website, submitting an entry to this Competition, or joining a Team in this Competition, you are agreeing to be bound by these Competition Rules which constitute a binding agreement between you and the Competition Sponsor.

The Competition is sponsored by the Competition Sponsor listed above and hosted on the Sponsor's behalf by Kaggle Inc ('Kaggle').

The Competition will run from Start Date to End Date as listed on the Competition Website. Note that Competition deadlines are subject to change and additional hurdle deadlines may be introduced during the Competition. Any additional or altered deadlines not already described in these Competition Rules will be publicized to the Participants via the Competition Website. It is the Participant's responsibility in this Competition to check the website regularly throughout the Competition to stay informed of any new or updated deadlines. Neither Kaggle nor Competition Sponsor are responsible for any Participant's failure to do so.

Each registered individual or Team is referred to as a Participant. You may only participate using a single, unique Kaggle account registered at http://www.kaggle.com. Participating using more than one Kaggle account per individual Participant is a breach of these Competition Rules and Competition Sponsor reserves the right to disqualify any Participant (or Team including Participant) who is found to breach these Competition Rules.

ELIGIBILITY

The Competition is open to all individuals over the age of 18 at the time of entry and to all validly formed legal entities that have not declared or been declared in bankruptcy.

Officers, directors, employees and advisory board members (and their immediate families and members of the same household) of the Competition Sponsor, Kaggle and their respective affiliates, subsidiaries, contractors (with the express exception of Kaggle's authorized Kaggle Community Evangelists), agents, judges and advertising and promotion agencies are not eligible to participate in the Competition.

You are not eligible to receive any Prize in the Competition if you are a resident of a country designated by the United States Treasury’s Office of Foreign Assets Control (see <http://www.treasury.gov/resource-cen...s/default.aspx> for additional information).

SUBMISSIONS

'Submission' means the material submitted by you in the manner and format specified on the Website via the Submission form on the Website. You (or if you are part of a Team, your Team) may submit up to the maximum number of Submissions per day as specified above. All Submissions must be uploaded to the Website in the manner and format specified on the Website. Submissions must be received prior to the Competition deadline and adhere to the guidelines for Submissions specified on the Website.

Submissions may not use or incorporate information from hand labeling or human prediction of the validation dataset or test data records.

If the Competition is a multi-stage Competition with temporally separate training data and/or leaderboard data, one or more valid Submissions must be made and selected during each stage of the Competition in the manner described on the Competition Website.

INTELLECTUAL PROPERTY

DATA

'Data' means the Data or Datasets linked from the Competition Website for the purpose of use by Participants in the Competition. For the avoidance of doubt, Data is deemed for the purpose of these Competition Rules to include any prototype or executable code provided to Participants by Kaggle or Competition Sponsor via the Website. Participants must use the Data only as permitted by these Competition Rules and any associated data use rules specified on the Competition Website.

Unless otherwise permitted by the terms of the Competition Website, Participants must use the Data solely for the purpose and duration of the Competition, including but not limited to reading and learning from the Data, analyzing the Data, modifying the Data and generally preparing your Submission and any underlying models and participating in forum discussions on the Website. Participants agree to use suitable measures to prevent persons who have not formally agreed to these Competition Rules from gaining access to the Data and agree not to transmit, duplicate, publish, redistribute or otherwise provide or make available the Data to any party not participating in the Competition. Participants agree to notify Kaggle immediately upon learning of any possible unauthorized transmission or unauthorized access of the Data and agree to work with Kaggle to rectify any unauthorized transmission. Participants agree that participation in the Competition shall not be construed as having or being granted a license (expressly, by implication, estoppel, or otherwise) under, or any right of ownership in, any of the Data.

EXTERNAL DATA

Unless otherwise expressly stated on the Competition Website, Participants must not use data other than the Data to develop and test their models and Submissions. Competition Sponsor reserves the right in its sole discretion to disqualify any Participant who Competition Sponsor discovers has undertaken or attempted to undertake the use of data other than the Data, or who uses the Data other than as permitted according to the Competition Website and in these Competition Rules, in the course of the Competition.

CODE SHARING

Participants are prohibited from privately sharing source or executable code developed in connection with or based upon the Data, and any such sharing is a breach of these Competition Rules and may result in disqualification.

Participants are permitted to publicly share source or executable code developed in connection with or based upon the Data, or otherwise relevant to the Competition, provided that such sharing does not violate the intellectual property rights of any third party. By so sharing, the sharing Participant is thereby deemed to have licensed the shared code under the MIT License (an open source software license commonly described at <http://opensource.org/licenses/MIT>).

OPEN-SOURCE CODE

A Submission will be ineligible to win a prize if it was developed using code containing or depending on software licensed under an open source license:

* other than an Open Source Initiative-approved license (see <http://opensource.org/>); or
* an open source license that prohibits commercial use.

WINNING

DETERMINING WINNERS

This Competition is a challenge of skill and the final results are determined solely by leaderboard ranking on the private leaderboard (subject to compliance with these Competition Rules). Participants' scores and ranks on the Competition Website at any given stage of the Competition will be based on the evaluation metric described on the Competition Website, as determined by applying the predictions in the Submission to the ground truth of a validation dataset whose instances were a fixed set sampled from the Data.

The evaluation metric used for scoring and ranking Submissions will be displayed on the Competition Website.

Prize awards are subject to verification of eligibility and compliance with these Competition Rules. All decisions of the Competition Sponsor and judges will be final and binding on all matters relating to this Competition. Competition Sponsor reserves the right to examine the Submission and any associated code or documentation for compliance with these Competition Rules. In the event that the Submission demonstrates a breach of these Competition Rules, Competition Sponsor may at its discretion take either of the following actions:

* disqualify your Submission(s); or 
* require that you remediate within one week all issues identified in your Submission(s) (including, without limitation, the resolution of license conflicts, the fulfillment of all obligations required by software licenses, and the removal of any software that violates the software restrictions).

RESOLVING TIES

A tie between two or more valid and identically ranked submissions will be resolved in favour of the tied submission that was submitted first.

DECLINING PRIZES

A Participant may decline to be nominated as a Winner by notifying Kaggle directly within one week following the Competition deadline, in which case the declining Participant forgoes any prize or other features associated with winning the Competition. Kaggle reserves the right to disqualify a Participant who so declines at Kaggle's sole discretion if Kaggle deems disqualification appropriate.

WINNERS' OBLIGATIONS

DELIVERY & DOCUMENTATION

As a condition of receipt of the Prize, the Prize winner must deliver the final model’s software code as used to generate the winning Submission and associated documentation (consistent with the winning model documentation template available on the kaggle wiki at https://www.kaggle.com/wiki/WinningM...tationTemplate) to the Competition Sponsor. The delivered software code must be capable of generating the winning Submission and contain a description of resources required to build and/or run the executable code successfully.

If a potential winning Participant or Team member is a U.S. citizen, potential winner must also sign and return an IRS W-9 form, or if a foreign resident, an IRS W-8BEN form, within the stated time in order to claim the prize.

Potential winners will be disqualified and the prize may be awarded to an alternate winner (the next-ranked qualified Participant on the leaderboard) if

* the required documentation is not returned within 14 days after receipt of any request to provide documentation;
* prize notification letter/email or prize is returned as undeliverable; or
* potential winner or winning Team member is disqualified for any reason.

By accepting any Prize, each Participant receiving a Prize thereby agrees to use of his/her name, address, likeness and/or Prize information by Kaggle and Competition Sponsor for promotional purposes in any medium without additional compensation.

PARTICIPANT INTELLECTUAL PROPERTY LICENSING

As a further condition of receipt of a Prize, each winning Participant thereby licenses their winning Submission and the source code used to generate the Submission according to the Winner License Type specified above (note: if no Winner License Type is specified above, the Winner License Type is deemed to be Non-Exclusive License).

PUBLIC COMPETITIONS: NON-EXCLUSIVE LICENSE

If the Winner License Type for the Competition (see Winner License Type above) is a Non-Exclusive License then each Winner by accepting a Prize thereby:

* grants to Competition Sponsor and its designees a worldwide, non-exclusive, sub-licensable, transferable, fully paid-up, royalty-free, perpetual, irrevocable right to use, not use, reproduce, distribute, create derivative works of, publicly perform, publicly display, digitally perform, make, have made, sell, offer for sale and import their winning Submission and the source code used to generate the Submission, in any media now known or hereafter developed, for any purpose whatsoever, commercial or otherwise, without further approval by or payment to Participant; and
* represents that he/she/it has the unrestricted right to grant that license.

RESEARCH COMPETITIONS: OPEN SOURCE LICENSE

If the Winner License Type for the Competition (see Winner License Type above) is Open Source License, then each Winner by accepting a Prize thereby:

* licenses their winning Submission and the source code used to generate the Submission under the MIT License (an open source software license commonly described at <http://opensource.org/licenses/MIT>), unless another open source license is chosen explicitly from the list of those approved by the Open Source Initiative at <http://opensource.org/licenses/>; and
* represents that he/she/it has the unrestricted right to grant that license.

RECRUITING COMPETITION SUBMISSION LICENSE GRANT

If the Competition is classed above as a Recruiting Competition, Participant agrees to grant and hereby grants to Competition Sponsor a perpetual, irrevocable, worldwide, royalty-free, transferable, and sublicensable right to use, review, reproduce, and internally distribute the Submission (including any and all submitted source code) in connection with evaluating Participant's suitability for employment, provided that the Submission licensed as set forth above will not be used for Competition Host’s commercial purposes except as may otherwise be set forth in this Agreement.

CHEATING

Participating under more than one Kaggle account is deemed cheating and, if discovered, will result in disqualification from the Competition and any other affected Competitions and may result in banning or deactivation of affected Kaggle accounts.

RECEIVING PRIZES

After verification of eligibility, each Prize winner will receive the prize in the form of a check or wire transfer made out to the Prize winner (if an individual, or to the individual Team members if a Team). Allow 30 days from final confirmation for Prize delivery. Any winners who are U.S. citizens will receive an IRS 1099 form in the amount of their prize at the appropriate time. Prize winners are responsible for any taxes, fees or other liability resulting from their receipt of a Prize.

TEAMS

FORMING A TEAM

Multiple individuals or entities may collaborate as a team (“Team”). You may not participate on more than one Team. Each Team member must be a single individual operating a separate Kaggle account. You must register individually for the Competition before joining a Team. You must confirm your Team membership to make it official by responding to the Team notification message which will be sent to your Account.

Team membership may not exceed the Maximum Team Size.

TEAM MERGERS

Teams may be permitted to merge at Kaggle’s discretion, so long as the merged Team meets all requirements in these Competition Rules. Kaggle will review any merger request and will either approve or reject the request within three business days. Merger requests may be rejected if the combined number of Entries made by the merging Teams exceeds the number of Entries permissible at the date of the merger request. Team merger requests will not be permitted within seven days of any deadline listed on the Competition Website.

TEAM PRIZES

If a Team wins a monetary Prize, Competition Sponsor will allocate the Prize money in even shares between the Team members unless the Team unanimously contacts Kaggle via the Competition Website within three business days following the Submission deadline to request an alternative prize distribution.

WARRANTIES AND OBLIGATIONS

PARTICIPANT WARRANTIES AND OBLIGATIONS

By registering, you agree that (a) your Account is complete, correct and accurate and (b) your registration may be rejected or terminated and all Entries submitted by you and/or your Team may be disqualified if any of the information in your Account is (or Competition Sponsor has reasonable grounds to believe it is) incomplete, incorrect or inaccurate. You are solely responsible for your Account. All registration information is deemed collected in the United States.

Participation is subject to all federal, state and local laws and regulations. Void where prohibited or restricted by law. You are responsible for checking applicable laws and regulations in your jurisdiction before participating in the Competition to make sure that your participation is legal. You are responsible for all taxes and reporting related to any award that you may receive as part of the Competition. You are responsible for abiding by your employer's policies regarding participation in the Competition. Competition Sponsor disclaims any and all liability or responsibility for disputes arising between you and your employer related to this Competition.

Each Participant is solely responsible for all equipment, including but not necessarily limited to a computer and internet connection necessary to access the Website and to develop and upload any Submission, and any telephone, data, hosting or other service fees associated with such access, as well as all costs incurred by or on behalf of the Entrant in participating in the Competition.

By entering a Submission, you represent and warrant that all information you enter on the Website is true and complete to the best of your knowledge, that you have the right and authority to make the Submission (including any underlying code and model) on your own behalf or on behalf of the persons and entities that you specify within the Submission, and that your Submission:

* is your own original work, or is used by permission, in which case full and proper credit and identification are given and the third party contributions are clearly identified within your Submission;
* does not contain confidential information or trade secrets and is not the subject of a registered patent or pending patent application;
* does not violate or infringe upon the patent rights, industrial design rights, copyrights, trademarks, rights of privacy, publicity or other intellectual property or other rights of any person or entity;
* does not contain malicious code, such as viruses, timebombs, cancelbots, worms, Trojan horses or other potentially harmful programs or other material or information; 
* does not and will not violate any applicable law, statute, ordinance, rule or regulation;
* does not trigger any reporting or royalty obligation to any third party; and
* was not previously published and has not won any other prize/award. 
A breach of any of these warranties will result in the corresponding Submission being invalid.

LIMITATION OF LIABILITY

By participating in the Competition, each Participant agrees to release, indemnify and hold harmless Competition Sponsor, Kaggle, their respective affiliates, subsidiaries, advertising and promotions agencies, as applicable, and each of their respective agents, representatives, officers, directors, shareholders, and employees from and against any injuries, losses, damages, claims, actions and any liability of any kind resulting from or arising out of your participation in or association with the Competition. Competition Sponsor is not responsible for any miscommunications such as technical failures related to computer, telephone, cable, and unavailable network or server connections, related technical failures, or other failures related to hardware, software or virus, or incomplete, late or misdirected Submissions. Competition Sponsor reserves the right to cancel, modify or suspend the Competition should any computer virus, bug or other technical difficulty or other causes beyond the control of Competition Sponsor corrupt the administration, security or proper play of the Competition, and to determine winners from among Submissions not affected by the corruption, if any, in its sole discretion.

Neither Kaggle nor Competition Sponsor are responsible for (a) late, lost, stolen, damaged, garbled, incomplete, incorrect or misdirected Entries or other communications, (b) errors, omissions, interruptions, deletions, defects, or delays in operations or transmission of information, in each case whether arising by way of technical or other failures or malfunctions of computer hardware, software, communications devices, or transmission lines, or (c) data corruption, theft, destruction, unauthorized access to or alteration of Submission materials, loss or otherwise. Neither Kaggle nor Competition Sponsor are responsible for electronic communications or emails which are undeliverable as a result of any form of active or passive filtering of any kind, or insufficient space in any email account to receive email messages. Competition Sponsor disclaims any liability for damage to any computer system resulting from participation in, or accessing or downloading information in connection with, the Competition.

RESERVATION OF RIGHTS

Competition Sponsor reserves the right to modify the Start Date, End Date, and the right to modify, add or remove additional Competition deadlines for any reason. Any changes will be communicated through the Competition Website. Competition Sponsor also reserves the right to modify, remove or add Data to the Website upon notice via the Competition Website. NEITHER KAGGLE NOR COMPETITION SPONSOR ARE RESPONSIBLE FOR ANY FAILURE OF A PARTICIPANT TO RECEIVE DATA CHANGES.

MISCELLANEOUS

SEVERABILITY

The invalidity or unenforceability of any provision of these Competition Rules shall not affect the validity or enforceability of any other provision. In the event that any provision is determined to be invalid or otherwise unenforceable or illegal, these Competition Rules shall otherwise remain in effect and be construed in accordance with their terms as if the invalid or illegal provision was not contained herein.

LAW

You agree that these terms and the relationship between you and Competition Sponsor shall be governed by the laws of the State of California and the United States of America.

Rules Acceptance


By clicking on the "I understand and accept" button below, you are indicating that you agree to be bound to the above rules.

Statistics Essentials For Dummies

http://www.math.uni.wroc.pl/~dyba/ma...ls/dummies.pdf PDF

About the Author

Deborah Rumsey is a Statistics Education Specialist and Auxiliary Professor at The Ohio State University. Dr. Rumsey is a Fellow of the American Statistical Association and has won a Presidential Teaching Award from Kansas State University. She has served on the American Statistical Association’s Statistics Education Executive Committee and the Advisory Committee on Teacher Enhancement, and is the editor of the Teaching Bits section of the Journal of Statistics Education. She is the author of the books Statistics For Dummies, Statistics II For Dummies, Probability For Dummies, and Statistics Workbook For Dummies. Her passions, besides teaching, include her family, fishing, bird watching, getting “seat time” on her Kubota tractor, and cheering the Ohio State Buckeyes to another national championship.

Introduction

This book is designed to give you the essential, nitty-gritty information typically covered in a first-semester statistics course. It’s bottom-line information for you to use as a refresher, a resource, a quick reference, and/or a study guide. It helps you decipher statistical polls, experiments, reports, and headlines and make important decisions about them with confidence, ever aware of the ways people can mislead you with statistics and how to handle it.

Topics I work you through include graphs and charts, descriptive statistics, the binomial, normal, and t-distributions, two-way tables, simple linear regression, confidence intervals, hypothesis tests, surveys, experiments, and of course the most frustrating yet critical of all statistical topics: sampling distributions and the Central Limit Theorem.

About This Book

This book departs from traditional statistics texts and reference/supplement books and study guides in these ways:

  • Clear and concise step-by-step procedures that intuitively explain how to work through statistics problems and remember the process.
  • Focused, intuitive explanations empower you to know when you’re doing things right and to recognize when others do things wrong.
  • Nonlinear approach so you can quickly zoom in on that concept or technique you need, without having to read other material first.
  • Easy-to-follow examples reinforce your understanding and help you immediately see how to apply the concepts in practical settings.
  • Understandable language helps you remember and put into practice essential statistical concepts and techniques.

Conventions Used in This Book

I refer to statistics in two different ways: as numerical results (such as means and medians); or as a field of study (for example, “Statistics is all about data.”).

The second convention refers to the word data. I’m going to go with the plural version of the word data in this book. For example “data are collected during the experiment” — not “data is collected during the experiment.”

Foolish Assumptions

I assume you’ve had some (not necessarily a lot of) previous experience with statistics somewhere in your past. For example, you can recognize some of the basic statistics such as the mean, median, standard deviation, and perhaps correlation; you can handle some graphs; and you can remember having seen the normal distribution. If it’s been a while and you are a bit rusty, that’s okay; this book is just the thing to jog your memory.

If you have very limited or no prior experience with statistics, allow me to suggest my full-version book, Statistics for Dummies, to build up your foundational knowledge base. But if you are someone who has not seen these ideas before and either doesn’t have time for the full version, or you like to plunge into details right away, this book can work for you.

I assume you’ve had a basic algebra background and can do some of the basic mathematical operations and understand some of the basic notation used in algebra like x, y, summation signs, taking the square root, squaring a number, and so on. (If you’d like some backup on the algebra part, I suggest you consider Algebra I For Dummies and Algebra II For Dummies (Wiley)).

Icons Used in This Book

Here are the road signs you’ll encounter on your journey through this book:

  • Tips refer to helpful hints or shortcuts you can use to save time.
  • Read these to get the inside track on why a certain concept is important, what its impact will be on the results, and highlights to keep on your radar.
  • These alert you to common errors that can cause problems, so you can steer around them.
  • These point out things in the text that you should, if possible, stash away somewhere in your brain for future use.

Where to Go from Here

This book is written in a nonlinear way, so you can start anywhere and still be able to understand what’s happening. However, I can make some recommendations for those who are interested in knowing where to start.

For a quick overview of the topics to refresh your memory, check out Chapter 1. For basic number crunching and graphs, see Chapters 2 and 3. If you’re most interested in common distributions, see Chapters 4 (binomial); 5 (normal); and 9 (t-distribution). Confidence intervals and hypothesis testing are found in Chapters 7 and 8. Correlation and regression are found in Chapter 10, and two-way tables and independence are tackled in Chapter 11. If you are interested in evaluating and making sense of the results of medical studies, polls, surveys, and experiments, you’ll find all the info in Chapters 12 and 13. Common mistakes to avoid or watch for are seen in Chapter 14.

Chapter 1: Statistics in a Nutshell

In This Chapter

  • Getting the big picture of the field of statistics
  • Overviewing the steps of the scientific method
  • Seeing the role of statistics at each step

The most common description of statistics is that it’s the process of analyzing data — number crunching, in a sense. But statistics is not just about analyzing the data. It’s about the whole process of using the scientific method to answer questions and make decisions. That process involves designing studies, collecting good data, describing the data with numbers and graphs, analyzing the data, and then making conclusions. In this chapter I review each of these steps and show where statistics plays the all-important role.

Designing Studies

Once a research question is defined, the next step is designing a study in order to answer that question. This amounts to figuring out what process you’ll use to get the data you need. In this section I overview the two major types of studies: observational studies and experiments.

Surveys

An observational study is one in which data are collected on individuals in a way that doesn’t affect them. The most common observational study is the survey. Surveys are questionnaires that are presented to individuals who have been selected from a population of interest. Surveys take on many different forms: paper surveys sent through the mail; Web sites; call-in polls conducted by TV networks; and phone surveys. If conducted properly, surveys can be very useful tools for getting information. However, if not conducted properly, surveys can result in bogus information. Some problems include improper wording of questions, which can be misleading, people who were selected to participate but do not respond, or an entire group in the population who had no chance of even being selected. These potential problems mean a survey has to be well thought-out before it’s given.

A downside of surveys is that they can only report relationships between variables that are found; they cannot claim cause and effect. For example, if in a survey researchers notice that the people who drink more than one Diet Coke per day tend to sleep fewer hours each night than those who drink at most one per day, they cannot conclude that Diet Coke is causing the lack of sleep. Other variables might explain the relationship, such as number of hours worked per week. See all the information about surveys, their design, and potential problems in Chapter 12.

Experiments

An experiment imposes one or more treatments on the participants in such a way that clear comparisons can be made. Once the treatments are applied, the response is recorded. For example, to study the effect of drug dosage on blood pressure, one group might take 10 mg of the drug, and another group might take 20 mg. Typically, a control group is also involved, where subjects each receive a fake treatment (a sugar pill, for example).

Experiments take place in a controlled setting, and are designed to minimize biases that might occur. Some potential problems include: researchers knowing who got what treatment; a certain condition or characteristic wasn’t accounted for that can affect the results (such as weight of the subject when studying drug dosage); or lack of a control group. But when designed correctly, if a difference in the responses is found when the groups are compared, the researchers can conclude a cause and effect relationship. See coverage of experiments in Chapter 13.

It is perhaps most important to note that no matter what the study, it has to be designed so that the original questions can be answered in a credible way.

Collecting Data

Once a study has been designed, be it a survey or an experiment, the subjects are chosen and the data are ready to be collected. This phase of the process is also critical to producing good data.

Selecting a good sample

First, a few words about selecting individuals to participate in a study (much, much more is said about this topic in Chapter 12). In statistics, we have a saying: “Garbage in equals garbage out.” If you select your subjects in a way that is biased — that is, favoring certain individuals or groups of individuals — then your results will also be biased.

Suppose Bob wants to know the opinions of people in your city regarding a proposed casino. Bob goes to the mall with his clipboard and asks people who walk by to give their opinions. What’s wrong with that? Well, Bob is only going to get the opinions of a) people who shop at that mall; b) on that particular day; c) at that particular time; d) and who take the time to respond. That’s too restrictive — those folks don’t represent a cross-section of the city. Similarly, Bob could put up a Web site survey and ask people to use it to vote. However, only those who know about the site, have Internet access, and want to respond will give him data. Typically, only those with strong opinions will go to such trouble. So, again, these individuals don’t represent all the folks in the city.

In order to minimize bias, you need to select your sample of individuals randomly — that is, using some type of “draw names out of a hat” process. Scientists use a variety of methods to select individuals at random (more in Chapter 12), but getting a random sample is well worth the extra time and effort to get results that are legitimate.

Avoiding bias in your data

Say you’re conducting a phone survey on job satisfaction of Americans. If you call them at home during the day between 9 a.m. and 5 p.m., you’ll miss out on all those who work during the day; it could be that day workers are more satisfied than night workers, for example. Some surveys are too long — what if someone stops answering questions halfway through? Or what if they give you misinformation and tell you they make $100,000 a year instead of $45,000? What if they give you an answer that isn’t on your list of possible answers? A host of problems can occur when collecting survey data; Chapter 12 gives you tips on avoiding and spotting them.

Experiments are sometimes even more challenging when it comes to collecting data. Suppose you want to test blood pressure; what if the instrument you are using breaks during the experiment? What if someone quits the experiment halfway through? What if something happens during the experiment to distract the subjects or the researchers? Or they can’t find a vein when they have to do a blood test exactly one hour after a dose of a drug is given? These are just some of the problems in data collection that can arise with experiments; Chapter 13 helps you find and minimize them.

Describing Data

Once data are collected, the next step is to summarize them to get a handle on the big picture. Statisticians describe data in two major ways: with pictures (that is, charts and graphs) and with numbers, called descriptive statistics.

Descriptive statistics

Data are also summarized (most often in conjunction with charts and/or graphs) by using what statisticians call descriptive statistics. Descriptive statistics are numbers that describe a data set in terms of its important features.

If the data are categorical (where individuals are placed into groups, such as gender or political affiliation) they are typically summarized using the number of individuals in each group (called the frequency) or the percentage of individuals in each group (the relative frequency).

Numerical data represent measurements or counts, where the actual numbers have meaning (such as height and weight). With numerical data, more features can be summarized besides the number or percentage in each group. Some of these features include measures of center (in other words, where is the “middle” of the data?); measures of spread (how diverse or how concentrated are the data around the center?); and, if appropriate, numbers that measure the relationship between two variables (such as height and weight).

Some descriptive statistics are better than others, and some are more appropriate than others in certain situations. For example, if you use codes of 1 and 2 for males and females, respectively, when you go to analyze that data, you wouldn’t want to find the average of those numbers — an “average gender” makes no sense. Similarly, using percentages to describe the amount of time until a battery wears out is not appropriate. A host of basic descriptive statistics are presented, compared, and calculated in Chapter 2.

Charts and graphs

Data are summarized in a visual way using charts and/or graphs. Some of the basic graphs used include pie charts and bar charts, which break down variables such as gender and which applications are used on teens’ cell phones. A bar graph, for example, may display opinions on an issue using 5 bars labeled in order from “Strongly Disagree” up through “Strongly Agree.”

But not all data fit under this umbrella. Some data are numerical, such as height, weight, time, or amount. Data representing counts or measurements need a different type of graph that either keeps track of the numbers themselves or groups them into numerical groupings. One major type of graph that is used to graph numerical data is a histogram. In Chapter 3 you delve into pie charts, bar graphs, histograms and other visual summaries of data.
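The grouping step behind a histogram — sorting numerical values into equal-width bins and counting how many fall in each — can be sketched in a few lines of Python. The `bin_counts` helper and the sample exam times are illustrative inventions, not examples from the book:

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Group numerical values into equal-width bins (a histogram's raw counts)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Exam completion times in minutes, grouped into 10-minute bins
times = [42, 45, 51, 53, 55, 58, 61, 64, 70, 72]
print(bin_counts(times, 10))  # {40: 2, 50: 4, 60: 2, 70: 2}
```

Each key is the lower edge of a bin; plotting bar heights from these counts gives the histogram itself.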

Analyzing Data

After the data have been collected and described using pictures and numbers, then comes the fun part: navigating through that black box called the statistical analysis. If the study has been designed properly, the original questions can be answered using the appropriate analysis, the operative word here being appropriate. Many types of analyses exist; choosing the wrong one will lead to wrong results.

In this book I cover the major types of statistical analyses encountered in introductory statistics. Scenarios involving a fixed number of independent trials where each trial results in either success or failure use the binomial distribution, described in Chapter 4. In the case where the data follow a bell-shaped curve, the normal distribution is used to model the data, covered in Chapter 5.

Chapter 7 deals with confidence intervals, used when you want to make estimates involving one or two population means or proportions using a sample of data. Chapter 8 focuses on testing someone’s claim about one or two population means or proportions — these analyses are called hypothesis tests. If your data set is small and follows a bell shape, the t-distribution might be in order; see Chapter 9.

Chapter 10 examines relationships between two numerical variables (such as height and weight) using correlation and simple linear regression. Chapter 11 studies relationships between two categorical variables (where the data place individuals into groups, such as gender and political affiliation). You can find a fuller treatment of these topics in Statistics For Dummies (Wiley), and analyses that are more complex than that are discussed in the book Statistics II For Dummies, also published by Wiley.

Making Conclusions

Researchers perform analysis with computers, using formulas. But neither a computer nor a formula knows whether it’s being used properly, and they don’t warn you when your results are incorrect. At the end of the day, computers and formulas can’t tell you what the results mean. It’s up to you.

One of the most common mistakes made in conclusions is to overstate the results, or to generalize the results to a larger group than was actually represented by the study. For example, a professor wants to know which Super Bowl commercials viewers liked best. She gathers 100 students from her class on Super Bowl Sunday and asks them to rate each commercial as it is shown. A top 5 list is formed, and she concludes that Super Bowl viewers liked those 5 commercials the best. But she really only knows which ones her students liked best — she didn’t study any other groups, so she can’t draw conclusions about all viewers.

Statistics is about much more than numbers. It’s important to understand how to make appropriate conclusions from studying data, and that’s something I discuss throughout the book.

Chapter 2: Descriptive Statistics

In This Chapter

  • Statistics to measure center
  • Standard deviation, variance, and other measures of spread
  • Measures of relative standing

Descriptive statistics are numbers that summarize some characteristic about a set of data. They provide you with easy-to-understand information that helps answer questions. They also help researchers get a rough idea about what’s happening in their experiments so later they can do more formal and targeted analyses. Descriptive statistics make a point clearly and concisely.

In this chapter you see the essentials of calculating and evaluating common descriptive statistics for measuring center and variability in a data set, as well as statistics to measure the relative standing of a particular value within a data set.

Types of Data

Data come in a wide range of formats. For example, a survey might ask questions about gender, race, or political affiliation, while other questions might be about age, income, or the distance you drive to work each day. Different types of questions result in different types of data to be collected and analyzed. The type of data you have determines the type of descriptive statistics that can be found and interpreted.

There are two main types of data: categorical (or qualitative) data and numerical (or quantitative) data. Categorical data record qualities or characteristics about the individual, such as eye color, gender, political party, or opinion on some issue (using categories such as agree, disagree, or no opinion). Numerical data record measurements or counts regarding each individual, which may include weight, age, height, or time to take an exam; counts may include number of pets, or the number of red lights you hit on your way to work. The important difference between the two is that with categorical data, any numbers involved do not have real numerical meaning (for example, using 1 for male and 2 for female), while numerical data represent actual numbers for which math operations make sense.

A third type of data, ordinal data, falls in between, where data appear in categories, but the categories have a meaningful order, such as ratings from 1 to 5, or class ranks of freshman through senior. Ordinal data can be analyzed like categorical data, and the basic numerical data techniques also apply when categories are represented by numbers that have meaning.

Counts and Percents

Categorical data place individuals into groups. For example, male/female, own your home/don’t own, or Democrat/Republican/Independent/Other. Categorical data often come from survey data, but they can also be collected in experiments. For example, in a test of a new medical treatment, researchers may use three categories to assess the outcome: Did the patient get better, worse, or stay the same?

Categorical data are typically summarized by reporting either the number of individuals falling into each category, or the percentage of individuals falling into each category. For example, pollsters may report the percentage of Republicans, Democrats, Independents, and others who took part in a survey. To calculate the percentage of individuals in a certain category, find the number of individuals in that category, divide by the total number of people in the study, and then multiply by 100%. For example, if a survey of 2,000 teenagers included 1,200 females and 800 males, the resulting percentages would be (1,200 ÷ 2,000) * 100% = 60% female and (800 ÷ 2,000) * 100% = 40% male.
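That percentage calculation is mechanical enough to script. This minimal Python sketch (the `relative_frequencies` helper is a made-up name, not a standard function) reproduces the 60%/40% split from the teenager survey:

```python
def relative_frequencies(counts):
    """Convert category counts to percentages of the total (count / total * 100)."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}

survey = {"female": 1200, "male": 800}  # counts from a survey of 2,000 teenagers
print(relative_frequencies(survey))     # {'female': 60.0, 'male': 40.0}
```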

You can further break down categorical data by creating crosstabs. Crosstabs (also called two-way tables) are tables with rows and columns. They summarize the information from two categorical variables at once, such as gender and political party, so you can see (or easily calculate) the percentage of individuals in each combination of categories. For example, if you had data about the gender and political party of your respondents, you would be able to look at the percentage of Republican females, Democratic males, and so on. In this example, the total number of possible combinations in your table would be the total number of gender categories times the total number of party affiliation categories. The U.S. government calculates and summarizes loads of categorical data using crosstabs. (See Chapter 11 for more on two-way tables.)
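A crosstab can be built from raw data by counting each (row, column) combination. A small sketch using only the Python standard library — the respondent list is invented sample data:

```python
from collections import Counter

# Each respondent recorded as a (gender, party) pair
respondents = [
    ("female", "Republican"), ("female", "Democrat"), ("female", "Democrat"),
    ("male", "Democrat"), ("male", "Republican"), ("male", "Independent"),
]

# Count every combination of the two categorical variables
cell_counts = Counter(respondents)

# Lay the counts out as a two-way table: one row per gender, one column per party
genders = sorted({g for g, _ in respondents})
parties = sorted({p for _, p in respondents})
for g in genders:
    print(g, {p: cell_counts[(g, p)] for p in parties})
```

The number of cells is the number of gender categories times the number of party categories, exactly as described above; dividing each cell by the total number of respondents turns the counts into percentages.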

If you’re given the number of individuals in each category, you can always calculate your own percents. But if you’re only given percentages without the total number in the group, you can never retrieve the original number of individuals in each group. For example, you might hear that 80% of people surveyed prefer Cheesy cheese crackers over Crummy cheese crackers. But how many were surveyed? It could be only 10 people, for all you know, because 8 out of 10 is 80%, just as 800 out of 1,000 is 80%. These two fractions (8 out of 10 and 800 out of 1,000) have different meanings for statisticians, because the first is based on very little data, and the second is based on a lot of data. (See Chapter 7 for more information on data accuracy and margin of error.)

Measures of Center

The most common way to summarize a numerical data set is to describe where the center is. One way of thinking about what the center of a data set means is to ask, “What’s a typical value?” Or, “Where is the middle of the data?” The center of a data set can be measured in different ways, and the method chosen can greatly influence the conclusions people make about the data. In this section I present the two most common measures of center: the mean (or average) and the median.

The mean (or average) of a data set is simply the average of all the numbers. Its formula is x̄ = (Σxᵢ)/n, where n is the number of numbers in the data set. Here is what you need to do to find the mean of a data set, x̄:

1. Add up all the numbers in the data set.

2. Divide by the number of numbers in the data set, n.
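These two steps translate directly into code; here is a minimal Python sketch (the function name `mean` is mine, not the book's):

```python
def mean(data):
    # Step 1: add up all the numbers; Step 2: divide by n
    return sum(data) / len(data)

print(mean([1, 3, 5, 7]))  # 4.0
```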

When it comes to measures of center, the average doesn’t always tell the whole story and may be a bit misleading. Take NBA salaries. Every year, a few top-notch players (like Shaq) make much more money than anybody else. These are called outliers (numbers in the data set that are extremely high or low compared to the rest). Because of the way the average is calculated, high outliers drive the average upward (as Shaq’s salary did in the preceding example). Similarly, outliers that are extremely low tend to drive the average downward.

What can you report, other than the average, to show what the salary of a “typical” NBA player would be? Another statistic used to measure the center of a data set is the median. The median of a data set is the place that divides the data in half, once the data are ordered from smallest to largest. It is denoted by M or x̃ (x-tilde). To find the median of a data set:

1. Order the numbers from smallest to largest.

2. If the data set contains an odd number of numbers, the one exactly in the middle is the median.

3. If the data set contains an even number of numbers, take the two numbers that appear exactly in the middle and average them to find the median.

For example, take the data set 4, 2, 3, 1. First, order the numbers to get 1, 2, 3, 4. Then note this data has an even number of numbers, so go to Step 3. Take the two numbers in the middle — 2 and 3 — and find their average: 2.5.

Note that if the data set is odd, the median will be one of the numbers in the data set itself. However, if the data set is even, it may be one of the numbers (the data set 1, 2, 2, 3 has median 2); or it may not be, as the data set 4, 2, 3, 1 (whose median is 2.5) shows.
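The three steps can be sketched in Python like so (again, the function name is mine, not the book's):

```python
def median(data):
    s = sorted(data)            # Step 1: order smallest to largest
    n = len(s)
    if n % 2 == 1:              # Step 2: odd count -- the middle number
        return s[n // 2]
    # Step 3: even count -- average the two middle numbers
    return (s[n // 2 - 1] + s[n // 2]) / 2

print(median([4, 2, 3, 1]))  # 2.5
```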

Which measure of center should you use, the mean or the median? It depends on the situation, but reporting both is never a bad idea. Suppose you’re part of an NBA team trying to negotiate salaries. If you represent the owners, you want to show how much everyone is making and how much you’re spending, so you want to take into account those superstar players and report the average. But if you’re on the side of the players, you want to report the median, because that’s more representative of what the players in the middle are making. Fifty percent of the players make a salary above the median, and 50% make a salary below the median.

When the mean and median are not close to each other in terms of their value, it’s a good idea to report both and let the reader interpret the results from there. Also, as a general rule, be sure to ask for the median if you are only given the mean.

Measures of Variability

Variability is what the field of statistics is all about. Results vary from individual to individual, from group to group, from city to city, from moment to moment. Variation always exists in a data set, regardless of which characteristic you’re measuring, because not every individual will have the same exact value for every characteristic you measure. Without a measure of variability you can’t compare two data sets effectively. What if two sets of data have about the same average and the same median? Does that mean that the data are all the same? Not at all. For example, the data sets 199, 200, 201 and 0, 200, 400 both have the same average, which is 200, and the same median, which is also 200. Yet they have very different amounts of variability. The first data set has a very small amount of variability compared to the second.

By far the most commonly used measure of variability is the standard deviation. The standard deviation of a data set, denoted by s, represents the typical distance from any point in the data set to the center. It’s roughly the average distance from the center, and in this case, the center is the average. Most often, you don’t hear a standard deviation given just by itself; if it’s reported (and it’s not reported nearly enough) it’s usually in the fine print, in parentheses, like “(s = 2.68).”

The formula for the standard deviation of a data set is s = √( Σ(x − x̄)² / (n − 1) ). To calculate s, do the following steps:

1. Find the average of the data set, x̄. To find the average, add up all the numbers and divide by the number of numbers in the data set, n.

2. For each number, subtract the average from it.

3. Square each of the differences.

4. Add up all the results from Step 3.

5. Divide the sum of squares (Step 4) by the number of numbers in the data set, minus one (n – 1).

If you do Steps 1 through 5 only, you have found another measure of variability, called the variance.

6. Take the square root of the variance. This is the standard deviation.

Suppose you have four numbers: 1, 3, 5, and 7. The mean is 16 ÷ 4 = 4. Subtracting the mean from each number, you get (1 – 4) = –3, (3 – 4) = –1, (5 – 4) = +1, and (7 – 4) = +3. Squaring the results you get 9, 1, 1, and 9, which sum to 20. Divide 20 by 4 – 1 = 3 to get 6.67. The standard deviation is the square root of 6.67, which is 2.58.
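Steps 1 through 6 can be sketched in a short Python function (not the book's code; names are mine):

```python
def std_dev(data):
    n = len(data)
    m = sum(data) / n                             # Step 1: the mean
    squared_diffs = [(x - m) ** 2 for x in data]  # Steps 2-3: subtract and square
    variance = sum(squared_diffs) / (n - 1)       # Steps 4-5: sum, divide by n - 1
    return variance ** 0.5                        # Step 6: square root

print(round(std_dev([1, 3, 5, 7]), 2))  # 2.58
```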

Here are some properties that can help you when interpreting a standard deviation:

  • The standard deviation can never be a negative number.
  • The smallest possible value for the standard deviation is 0 (when every number in the data set is exactly the same).
  • Standard deviation is affected by outliers, as it’s based on distance from the mean, which is affected by outliers.
  • The standard deviation has the same units as the original data, while variance is in square units.

Percentiles

The most common way to report relative standing of a number within a data set is by using percentiles. A percentile is the percentage of individuals in the data set who are below where your particular number is located. If your exam score is at the 90th percentile, for example, that means 90% of the people taking the exam with you scored lower than you did (it also means that 10% scored higher than you did).

Finding a percentile

To calculate the kth percentile (where k is any number between one and one hundred), do the following steps:

1. Order all the numbers in the data set from smallest to largest.

2. Multiply k percent times the total number of numbers, n.

3a. If your result from Step 2 is a whole number, go to Step 4. If the result from Step 2 is not a whole number, round it up to the nearest whole number and go to Step 3b.

3b. Count the numbers in your data set from left to right (from the smallest to the largest number) until you reach the value from Step 3a. This corresponding number in your data set is the kth percentile.

4. Count the numbers in your data set from left to right until you reach that whole number. The kth percentile is the average of that corresponding number in your data set and the next number in your data set.

For example, suppose you have 25 test scores, in order from lowest to highest: 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99. To find the 90th percentile for these (ordered) scores start by multiplying 90% times the total number of scores, which gives 90% × 25 = 0.90 × 25 = 22.5 (Step 2). This is not a whole number; Step 3a says round up to the nearest whole number — 23 — then go to step 3b. Counting from left to right (from the smallest to the largest number in the data set), you go until you find the 23rd number in the data set. That number is 98, and it’s the 90th percentile for this data set.

If you want to find the 20th percentile, take 0.20 ∗ 25 = 5; this is a whole number so proceed to Step 4, which tells us the 20th percentile is the average of the 5th and 6th numbers in the ordered data set (62 and 66). The 20th percentile then comes to (62 + 66)/2 = 64.
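The whole procedure fits in a few lines of Python (a sketch, not the book's code; the function name is mine):

```python
import math

def percentile(data, k):
    s = sorted(data)                   # Step 1: order the data
    pos = k * len(s) / 100             # Step 2: k percent times n
    if pos != int(pos):                # Steps 3a-3b: round up, count to that spot
        return s[math.ceil(pos) - 1]
    i = int(pos)                       # Step 4: average this number and the next
    return (s[i - 1] + s[i]) / 2

scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79,
          85, 87, 88, 89, 93, 95, 96, 98, 99, 99]
print(percentile(scores, 90), percentile(scores, 20))  # 98 64.0
```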

The median is the 50th percentile, the point in the data where 50% of the data fall below that point and 50% fall above it. The median for the test scores example is the 13th number, 77.

Interpreting percentiles

The U.S. government often reports percentiles among its data summaries. For example, the U.S. Census Bureau reported the median household income for 2001 was $42,228. The Bureau also reported various percentiles for household income, including the 10th, 20th, 50th, 80th, 90th, and 95th. Table 2-1 shows the values of each of these percentiles.

Table 2-1 U.S. Household Income for 2001

Percentile   2001 Household Income
10th         $10,913
20th         $17,970
50th         $42,228
80th         $83,500
90th         $116,105
95th         $150,499

Looking at these percentiles, you can see that the bottom half of the incomes are closer together than are the top half. The difference between the 50th percentile and the 20th percentile is about $24,000, whereas the spread between the 50th percentile and the 80th percentile is more like $41,000. And the difference between the 10th and 50th percentiles is only about $31,000, whereas the difference between the 90th and the 50th percentiles is a whopping $74,000.

A percentile is not a percent; a percentile is a number that is a certain percentage of the way through the data set, when the data set is ordered. Suppose your score on the GRE was reported to be the 80th percentile. This doesn’t mean you scored 80% of the questions correctly. It means that 80% of the students’ scores were lower than yours, and 20% of the students’ scores were higher than yours.

The Five-Number Summary

The five-number summary is a set of five descriptive statistics that divide the data set into four equal sections. The five numbers in a five-number summary are:

1. The minimum (smallest) number in the data set.

2. The 25th percentile, aka the first quartile, or Q1.

3. The median (or 50th percentile).

4. The 75th percentile, aka the third quartile, or Q3.

5. The maximum (largest) number in the data set.

For example, we can find the five-number summary of the 25 (ordered) exam scores 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99. The minimum is 43, the maximum is 99, and the median is the number directly in the middle, 77.

To find Q1 and Q3, you use the steps shown in the section, “Finding a percentile,” where n = 25. Step 1 is done since the data are ordered. For Step 2, since Q1 is the 25th percentile, multiply 0.25 ∗ 25 = 6.25. This is not a whole number, so Step 3a says round it up to 7 and proceed to Step 3b. Count from left to right in the data set until you reach the 7th number, 68; this is Q1. For Q3 (the 75th percentile) multiply 0.75 ∗ 25 = 18.75; round up to 19, and the 19th number on the list is 89, or Q3. Putting it all together, the five-number summary for the test scores data is 43, 68, 77, 89, and 99.

The purpose of the five-number summary is to give descriptive statistics for center, variability, and relative standing all in one shot. The measure of center in the five-number summary is the median, and the first quartile, median, and third quartiles are measures of relative standing. To obtain a measure of variability based on the five-number summary, you can find what’s called the Interquartile Range (or IQR). The IQR equals Q3 – Q1 and reflects the distance taken up by the innermost 50% of the data. If the IQR is small, you know there is much data close to the median. If the IQR is large, you know the data are more spread out from the median. The IQR for the test scores data set is 89 – 68 = 21, which is quite large seeing as how test scores only go from 0 to 100.

Chapter 3: Charts and Graphs

Chapter 4: The Binomial Distribution

Chapter 5: The Normal Distribution

Chapter 6: Sampling Distributions and the Central Limit Theorem

Chapter 7: Confidence Intervals

Chapter 8: Hypothesis Tests

Chapter 9: The t-distribution

Chapter 10: Correlation and Regression

In This Chapter

  • Exploring statistical relationships between numerical variables
  • Distinguishing between association, correlation, and causation
  • Making predictions based on known relationships

In this chapter you analyze two numerical variables, X and Y, to look for patterns, find the correlation, and make predictions about Y from X, if appropriate, using simple linear regression.

Picturing the Relationship with a Scatterplot

A fair amount of research supports the claim that the frequency of cricket chirps is related to temperature. And this relationship is actually used at times to predict the temperature using the number of times the crickets chirp per 15 seconds. To illustrate, I’ve taken a subset of some of the data that’s been collected on this; you can see it in Table 10-1.

Table 10-1 Cricket Chirps and Temperature Data (Excerpt)

Number of Chirps (in 15 Seconds) Temperature (Fahrenheit)
18 57
20 60
21 64
23 65
27 68
30 71
34 74
39 77

 

Notice that each observation is composed of two variables that are tied together, in this case the number of times the cricket chirped in 15 seconds (the X-variable), and the temperature at the time the data was collected (the Y-variable). Statisticians call this type of two-dimensional data bivariate data. Each observation contains one pair of data collected simultaneously.

Making a scatterplot

Bivariate data are typically organized in a graph that statisticians call a scatterplot. A scatterplot has two dimensions, a horizontal dimension (called the x-axis) and a vertical dimension (called the y-axis). Both axes are numerical — each contains a number line.

The x-coordinate of bivariate data corresponds to the first piece of data in the pair; the y-coordinate corresponds to the second piece of data in the pair. If you intersect the two coordinates, you can graph the pair of data on a scatterplot. Figure 10-1 shows a scatterplot of the data from Table 10-1.

Interpreting a scatterplot

You interpret a scatterplot by looking for trends in the data as you go from left to right:

  • If the data show an uphill pattern as you move from left to right, this indicates a positive relationship between X and Y. As the x-values increase (move right), the y-values increase (move up) a certain amount.
  • If the data show a downhill pattern as you move from left to right, this indicates a negative relationship between X and Y. That means as the x-values increase (move right) the y-values decrease (move down) by a certain amount.
  • If the data don’t resemble any kind of pattern (even a vague one), then no relationship exists between X and Y.

This chapter focuses on linear relationships. A linear relationship between X and Y exists when the pattern of x- and y-values resembles a line, either uphill (with positive slope) or downhill (with negative slope).

Looking at Figure 10-1, there does appear to be a positive linear relationship between number of cricket chirps and the temperature. That is, as the cricket chirps increase, you can predict that the temperature is higher as well.

Figure 10-1: Scatterplot of cricket chirps versus outdoor temperature.

Measuring Relationships Using the Correlation

After the bivariate data have been organized, the next step is to do some statistics that can quantify or measure the extent and nature of the relationship.

Calculating the correlation

The pattern and direction of the relationship between X and Y can be seen from the scatterplot. The strength of the relationship between two numerical variables depends on how closely the data resemble a certain pattern. Although many different types of patterns can exist between two variables, this chapter examines linear patterns only.

Statisticians use the correlation coefficient to measure the strength and direction of the linear relationship between two numerical variables X and Y. The correlation coefficient for a sample of data is denoted by r.

Although the street definition of correlation applies to any two items that are related (such as gender and political affiliation), statisticians only use this term in the context of two numerical variables. The formal term for correlation is the correlation coefficient. Many different correlation measures have been created; the one in our case is the Pearson correlation coefficient (I’ll just call it the correlation).

The formula for the correlation (r) is

r = [ Σ (xᵢ − x̄)(yᵢ − ȳ) ] / [ (n − 1) sx sy ]

where n is the number of pairs of data; x̄ and ȳ are the sample means; and sx and sy are the sample standard deviations of the x- and y-values, respectively.

To calculate the correlation r from a data set:

1. Find the mean of all the x-values (x̄) and the mean of all the y-values (ȳ).

See Chapter 2 for information on the mean.

2. Find the standard deviation of all the x-values (call it sx) and the standard deviation of all the y-values (call it sy).

See Chapter 2 for information on standard deviation.

3. For each (x, y) pair in the data set, take x minus x̄ and y minus ȳ, and multiply them together.

4. Add up all the results from Step 3.

5. Divide the sum by sx ∗ sy.

6. Divide the result by n – 1, where n is the number of (x, y) pairs.

This gives you the correlation r.

For example, suppose you have the data set (3, 2), (3, 3), and (6, 4). Following the preceding steps, you can calculate the correlation coefficient r via the following steps. (Note that for this data the x-values are 3, 3, 6, and the y-values are 2, 3, 4.)

1. x̄ is 12/3 = 4, and ȳ is 9/3 = 3.

2. The standard deviations are calculated to be sx = 1.73 and sy = 1.00.

3. The differences found in Step 3 multiplied together are: (3 – 4)(2 – 3) = (–1)(–1) = 1; (3 – 4)(3 – 3) = (–1)(0) = 0; (6 – 4)(4 – 3) = (+2)(+1) = +2.

4. Adding the Step 3 results, you get 1 + 0 + 2 = 3.

5. Dividing by sx ∗ sy gives you 3/(1.73 ∗ 1.00) = 3/1.73 = 1.73.

6. Now divide the Step 5 result by 3 – 1 (which is 2) and you get the correlation r = 0.87.
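The six steps can be sketched as one Python function (not the book's code; names are mine), reproducing the worked example:

```python
def correlation(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n                         # Step 1: means
    sx = (sum((x - x_bar) ** 2 for x in xs) / (n - 1)) ** 0.5       # Step 2: std devs
    sy = (sum((y - y_bar) ** 2 for y in ys) / (n - 1)) ** 0.5
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # Steps 3-4
    return cross / (sx * sy) / (n - 1)                              # Steps 5-6

print(round(correlation([3, 3, 6], [2, 3, 4]), 2))  # 0.87
```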

Interpreting the correlation

The correlation r is always between +1 and –1. Here is how you interpret various values of r. A correlation that is:

  • Exactly –1 indicates a perfect downhill linear relationship.
  • Close to –1 indicates a strong downhill linear relationship.
  • Close to 0 means no linear relationship exists.
  • Close to +1 indicates a strong uphill linear relationship.
  • Exactly +1 indicates a perfect uphill linear relationship.

How “close” do you have to get to –1 or +1 to indicate a strong linear relationship? Most statisticians like to see correlations above +0.60 (or below –0.60) before getting too excited about them. Don’t expect a correlation to always be +0.99 or –0.99; real data aren’t perfect.

Figure 10-2 shows examples of what various correlations look like in terms of the strength and direction of the relationship.

Figure 10-2: Scatterplots with various correlations.

For my subset of the cricket chirps versus temperature data, I calculated a correlation of 0.98, which is almost unheard of in the real world (these crickets are good!).

Properties of the correlation

Here are two important properties of correlation:

  • The correlation is a unitless measure. This means that if you change the units of X or Y, the correlation doesn’t change. For example, changing the temperature (Y) from Fahrenheit to Celsius won’t affect the correlation between the frequency of chirps and the outside temperature.
  • The variables X and Y can be switched in the data set, and the correlation doesn’t change. For example, if height and weight have a correlation of 0.53, weight and height have the same correlation.

Finding the Regression Line

After you’ve found a linear pattern in the scatterplot, and the correlation between the two numerical variables is moderate to strong, you can create an equation that allows you to predict one variable using the other. This equation is called the simple linear regression line.

Which is X and which is Y?

Before moving forward with your regression analysis, you have to identify which of your two variables is X and which is Y. When doing correlations, the choice of which variable is X and which is Y doesn’t matter, as long as you’re consistent for all the data; but when fitting lines and making predictions, the choice of X and Y makes a difference. In general, X is the variable that is the predictor. Statisticians call the X-variable (here cricket chirps) the explanatory variable, because if X changes, the slope tells you (or explains) how much Y is expected to change. The Y-variable (here temperature) is called the response variable because if X changes, the response (according to the equation of the line) is a change in Y. Hence Y can be predicted by X if a strong relationship exists.

Note: In this example, I want to predict the temperature based on listening to crickets. Obviously, the real cause-and-effect is the opposite: As temperature rises, crickets chirp more.

Checking the conditions

In the case of two numerical variables, it’s possible to come up with a line that you can use to predict Y from X, if (and only if) the following two conditions we examined in the previous sections are met: 1) The scatterplot must show a linear pattern; and 2) The correlation, r, is moderate to strong (typically beyond ±0.60).

It’s not always the case that folks actually check these conditions. I’ve seen cases where researchers go ahead and make predictions when a correlation was as low as 0.20, or where the data follow a curve instead of a line when you make the scatterplot! That doesn’t make any sense.

But suppose the correlation is high; do we need to look at the scatterplot? Yes. There are situations where the data have a somewhat curved shape, yet the correlation is still strong.

Understanding the equation

For the crickets and temperature data, you see the scatterplot in Figure 10-1 shows a linear pattern. The correlation between cricket chirps and temperature was found to be very strong (r = 0.98). You now can find one line that best fits the data (in terms of having the smallest average distance to all the points). Statisticians call this technique for finding the best-fitting line a simple linear regression analysis.

Do you have to try lots of different lines to see which one fits best? Fortunately, this is not the case (although eyeballing a line on the scatterplot does help you think about what you’d expect the answer to be). The best-fitting line has a distinct slope and y-intercept that can be calculated using formulas (and, I may add, these formulas aren’t too hard to calculate).

The formula for the best-fitting line (or regression line) is y = mx + b, where m is the slope of the line and b is the y-intercept. (This is the same equation from algebra.) The slope of a line is the change in Y over the change in X. For example, a slope of 10/3 means as the x-value increases (moves right) by 3 units, the y-value moves up by 10 units on average.

The y-intercept is that place on the y-axis where the line crosses. For example, in the equation y = 2x – 6, the line crosses the y-axis at the point –6. The coordinates of this point are (0,–6); when a line crosses the y-axis, the x-value is always 0. To come up with the best-fitting line, you need to find values for m and b that fit the pattern of data the absolute best. The following sections find these values.

Finding the slope

The formula for the slope, m, of the best-fitting line is m = r (sy / sx),

where r is the correlation between X and Y, and sx and sy are the standard deviations of the x-values and the y-values. To calculate the slope, m, of the best-fitting line:

1. Divide sy by sx.

2. Multiply the result in Step 1 by r.

The correlation and the slope of the best-fitting line are not the same. The formula for slope takes the correlation (a unitless measurement) and attaches units to it. Think of sy / sx as the change in Y over the change in X, in units of X and Y; for example, change in temperature (degrees Fahrenheit) per increase of one cricket chirp (in 15 seconds).

Finding the y-intercept

The formula for the y-intercept, b, of the best-fitting line is b = ȳ − m x̄, where x̄ and ȳ are the means of the x-values and the y-values, respectively, and m is the slope (the formula for which is given in the preceding section). To calculate the y-intercept, b, of the best-fitting line:

1. Find the slope, m, of the best-fitting line using the steps listed in the preceding section.

2. Multiply x̄ by m.

3. Subtract your result from ȳ.

To save a great deal of time calculating the best-fitting line, keep in mind that five well-known summary statistics are all you need to do all the necessary calculations. I call them the “big-five statistics” (not to be confused with the five-number summary from Chapter 2):

1. The mean of the x-values (denoted x̄)

2. The mean of the y-values (denoted ȳ)

3. The standard deviation of the x-values (denoted sx)

4. The standard deviation of the y-values (denoted sy)

5. The correlation between X and Y (denoted r)

(This chapter and Chapter 2 contain formulas and step-by-step instructions for these statistics.)

Interpreting the slope and y-intercept

Even more important than being able to calculate the slope and y-intercept to form the best-fitting regression line is the ability to interpret their values.

Interpreting the slope

The slope is interpreted in algebra as “rise over run.” If the slope for example is 2, you can write this as 2/1 and say as X increases by 1, Y increases by 2, and that’s how you move along from point to point on the line. In a regression context, the slope is the heart and soul of the equation because it tells you how much you can expect Y to change as X increases.

In general, the units for slope are the units of the Y-variable per units of the X-variable. It’s a ratio of change in Y per change in X. Suppose in studying the effect of dosage level in milligrams (mg) on blood pressure, a researcher finds that the slope of the regression line is –2.5. You can write this as –2.5/1 and say blood pressure is expected to decrease by 2.5 points on average per 1 mg increase in drug dosage.

Always remember to use proper units when interpreting slope.

If using a 1 in the denominator of slope is not super-meaningful, you can multiply the top and bottom by any number (as long as it’s the same number) and interpret it that way instead. In the blood pressure example, instead of writing slope as –2.5/1 and interpreting it as a decrease of 2.5 points per 1 mg increase of the drug, we can multiply the top and bottom by ten to get –25/10 and say an increase in dosage of 10 mg results in a 25-point decrease in blood pressure.

Interpreting the y-intercept

The y-intercept is the place where the regression line y = mx + b crosses the y-axis and is denoted by b (see the earlier section “Finding the y-intercept”). Sometimes the y-intercept can be interpreted in a meaningful way, and sometimes not. This differs from slope, which is always interpretable. In fact, between the two elements of slope and intercept, the slope is the star of the show, with the y-intercept serving as the less famous but still noticeable sidekick.

There are times when the y-intercept makes no sense. For example, suppose you use rain to predict bushels per acre of corn; if the regression line crosses the y-axis somewhere below zero (and it most likely will), the y-intercept will make no sense. You can’t have negative corn production.

Another situation when it’s not okay to interpret the y-intercept is if there is no data near the point where x = 0. For example, suppose you want to use students’ scores on Midterm 1 to predict their scores on Midterm 2. The y-intercept represents a prediction for Midterm 2 when the score on Midterm 1 is zero. You don’t expect scores on a midterm to be at or near zero unless someone did not take the exam, in which case their score would not be included in the first place.

Many times, however, the y-intercept is of interest to you, it has meaning, and you have data collected in that area (where x = 0). For example, if you’re predicting coffee sales at Green Bay Packer games using temperature, some games have temperatures at or even below zero, so predicting coffee sales at these temperatures makes sense. (As you might guess, they sell more and more coffee as the temperature dips.)

The best-fitting line for the crickets

The “big-five” statistics from the subset of cricket data are shown in Table 10-2.

Table 10-2 Big-Five Statistics for the Cricket Data
Variable       Mean        Standard Deviation   Correlation
# Chirps (x)   x̄ = 26.5    sx = 7.4             r = +0.98
Temp (y)       ȳ = 67      sy = 6.8

The slope, m, for the best-fitting line for the subset of cricket chirp versus temperature data is m = r (sy / sx) = 0.98 ∗ (6.8 / 7.4) = 0.90.

So, as the number of chirps increases by 1 chirp per 15 sec- onds, the temperature is expected to increase by 0.90 degrees Fahrenheit on average. To get a more practical interpretation, you can multiply the top and bottom of the slope by 10 to get 9.0/10 and say that as chirps increase by 10 (per 15 seconds), temperature increases 9 degrees Fahrenheit.

Now, to find the y-intercept, b, you take ȳ − m ∗ x̄, or 67 – (0.90) ∗ (26.5) = 43.15. So the best-fitting line for predicting temperature from cricket chirps based on the data is y = 0.90x + 43.15, or temperature (in degrees Fahrenheit) = 0.90 ∗ (number of chirps in 15 seconds) + 43.15. The y-intercept would try to predict temperature when there is no chirping going on at all. However, no data was collected at or near this point, so we can’t make predictions for temperature in this area. You can’t predict temperature using crickets if the crickets are silent.
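You can reproduce both calculations from the big-five statistics in a few lines of Python (a sketch using the book's rounded values; rounding m to two decimals first, as the text does, gives the book's intercept of 43.15):

```python
# Big-five statistics for the cricket data (Table 10-2)
r, sx, sy = 0.98, 7.4, 6.8
x_bar, y_bar = 26.5, 67

m = round(r * (sy / sx), 2)      # slope: m = r * (sy / sx)
b = round(y_bar - m * x_bar, 2)  # intercept: b = y_bar - m * x_bar
print(m, b)  # 0.9 43.15
```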

Making Predictions

After you have a strong linear relationship, and you find the equation of the best-fitting line y = mx + b, you use that line to predict y for a given x-value. This amounts to plugging the x-value into the equation and solving for y. For example, if your equation is y = 2x + 1, and you want to predict y for x = 1, then plug 1 into the equation for x to get y = 2(1) + 1 = 3.

Remember that you choose the values of X (the explanatory variable) that you plug in; what you predict is Y, the response variable, which totally depends on X. By doing this, you are using one variable that you can easily collect data on to predict a Y variable that is difficult or not possible to measure; this works well as long as X and Y are correlated. That’s the big idea of regression.

From the previous section, the best-fitting line for the crickets is y = 0.90x + 43.15. Say you’re camping, listening to crickets, and you remember that you can predict temperature by counting chirps. You count 35 chirps in 15 seconds. You put in 35 for x and find y = 0.90(35) + 43.15 = 74.65 degrees F. (Yeah, you memorized the formula just in case you needed it.) So, because crickets chirped 35 times in 15 seconds, you figure the temperature is probably about 75 degrees Fahrenheit.
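The campsite arithmetic checks out in code, too (the line's coefficients are taken straight from the fitted equation above):

```python
m, b = 0.90, 43.15          # best-fitting line: temp = 0.90 * chirps + 43.15
chirps = 35                 # chirps counted in 15 seconds
temp = m * chirps + b
print(round(temp, 2))  # 74.65
```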

Avoid Extrapolation!

Just because you have a model doesn’t mean you can plug in any value for X and do a good job of predicting Y. For example, in the chirping data, there is no data collected for less than 18 chirps or more than 39 chirps per 15 seconds (refer back to Table 10-1). If you try to make predictions outside this range, you’re going into uncharted territory; the farther outside this range you go with your x-values, the more dubious your predictions for y will get. Who’s to say the line still works outside of the area where data were collected? Do you think crickets will chirp faster and faster without limit? At some point they would either pass out or burn up!

Making predictions using x-values that fall outside the range of your data is a no-no. Statisticians call this extrapolation; watch for researchers who try to make claims beyond the range of their data.
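If you automate predictions, one simple safeguard against extrapolation is to have the code refuse x-values outside the observed data range (18 to 39 chirps per 15 seconds, per the data described above). This is just an illustrative sketch; the function name and error handling are my own choices, not part of the text.

```python
def predict_temperature(chirps_per_15_sec):
    """Predict temperature from the fitted line, but only inside the
    range of the collected data (18 to 39 chirps per 15 seconds)."""
    if not 18 <= chirps_per_15_sec <= 39:
        raise ValueError("x is outside the data range; refusing to extrapolate")
    return 0.90 * chirps_per_15_sec + 43.15

print(round(predict_temperature(20), 2))   # prints: 61.15
```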

Correlation Doesn’t Necessarily Mean Cause-and-Effect

Scatterplots and correlations identify and quantify relationships between two variables. However, if a scatterplot shows a definite pattern, and the data are found to have a strong correlation, that doesn’t necessarily mean that a cause-and-effect relationship exists between the two variables. A cause-and-effect relationship is one where a change in X causes a change in Y. (In other words, the change in Y is not only associated with a change in X, it is directly caused by X.)

For example, suppose a well-controlled medical experiment is conducted to determine the effects of dosage of a certain drug on blood pressure. (See a total breakdown of experiments in Chapter 13.) The researchers look at their scatterplot and see a definite downhill linear pattern; they calculate the correlation and it’s strong. They conclude that increasing the dosage of this drug causes a decrease in blood pressure. This cause-and-effect conclusion is okay because they controlled for other variables that could affect blood pressure in their experiment, such as other drugs taken, age, general health, and so on.

However, if you made a scatterplot and examined the correlation between ice cream consumption and murder rates, you would also see a strong linear relationship (this time uphill). Yet no one would claim that more ice cream consumption causes more murders to occur.

What’s going on here? In the drug example, the data were collected through a well-controlled medical experiment, which minimizes the influence of other factors that might affect blood pressure changes. In the second example, the data were just based on observation, and no other factors were examined. It turns out that this strong relationship exists because increases in murder rates and ice cream sales are both related to increases in temperature. (Temperature in this case is called a confounding variable; it affects both X and Y but was not included in the study — see Chapter 13.)

Whether two variables are found to be causally associated depends on how the study was conducted. Only a well-designed experiment (see Chapter 13) or a large collection of several different observational studies can show enough evidence for cause-and-effect.

Yet, this condition is often ignored as the media gives us headlines such as “Doctors can lower malpractice lawsuits by spending more time with patients.” In reality, it was found that doctors who have fewer lawsuits are the type of doctor who spends a lot of time with patients. But that doesn’t mean taking a bad doctor and having him spend more time with his patients will reduce his malpractice suits; in fact, spending more time with him might create even more problems.

And we can’t say that crickets chirping faster will cause the temperature to increase, of course, but we do know we can count cricket chirps and do a pretty good job predicting temperature nonetheless, through simple linear regression.

Chapter 11: Two-Way Tables

Chapter 12: A Checklist for Samples and Surveys

In This Chapter

  • Defining and getting good samples from the target population
  • Crafting and administering good surveys
  • Making appropriate conclusions

Surveys are all around you — I guarantee that at some point in your life, you’ll be asked to complete a survey. You’re also likely to be inundated with the results of surveys, and before you consume their information, you need to evaluate whether they were properly designed. In this chapter, I present a checklist you can use to evaluate or plan a survey.

The survey process can be broken down into a series of ten elements that should be checked:

1. Target population is well defined.

2. Sample matches the target population.

3. Sample is randomly selected.

4. Sample size is large enough.

5. Nonresponse is minimized.

6. Type of survey is appropriate.

7. Questions are well worded.

8. Survey is properly timed.

9. Personnel are well trained.

10. Proper conclusions are made.

This list helps you carry out your own survey or critique someone else’s survey. In the following sections I address each item and discuss its role in getting a good survey done.

The Target Population Is Well Defined

The target population is the entire group of individuals that you’re interested in studying. For example, suppose you want to know what the people in Great Britain think of reality TV. The target population is all the residents of Great Britain.

Many researchers don’t do a good job of defining their target populations clearly. For example, if the American Egg Board wants to say “Eggs are good for you!” it needs to specify who the “you” is. For example, is the Egg Board prepared to say that eggs are good for people who have high cholesterol? What if one of the studies the group cites is based only on young people who are healthy and eating low-fat diets — is that who they mean by “you”?

If the target population isn’t well defined, the survey results are likely to be biased. The sample that’s actually studied may contain people outside the intended population, or the survey may exclude people who should have been included.

The Sample Matches the Target Population

When you’re conducting a survey, you typically can’t ask every single member of the target population to provide the information you’re looking for. The best you can do is select a good sample (a subset of individuals from the population) and get the information from them. A good sample represents the target population. The sample doesn’t systematically favor certain groups within the target population, and it doesn’t systematically exclude certain people, either.

The best scenario for selecting a representative sample is to obtain a sampling frame — a list of all the members of the target population — and draw randomly from that. If such a list isn’t possible, you need some mechanism that gives everyone in the population an equal opportunity to be chosen to participate in the survey. For example, if a house-to-house survey of a city is needed, an updated map including all houses in that city should be used as the sampling frame.

The Sample Is Randomly Selected

An important feature of a good study is that the sample is randomly selected from the target population. Randomly means that every member of the target population has an equal chance of being included in the sample. In other words, the process you use for selecting your sample can’t be biased.

The biggest problem to watch for is convenience samples. A convenience sample is a sample selected in a way that’s easiest on the researcher — for example, call-in polls, man-on-the-street surveys, or Internet surveys. Convenience samples are totally nonrandom, and their results are not credible.

For surveys involving people, reputable polling organizations such as the Gallup Organization use a random digit dialing procedure to telephone the members of their sample. This excludes people without phones, of course, so this kind of survey does have a bit of bias. In this case, though, most people do have phones (over 95%, according to the Gallup Organization), so the bias against people who don’t have phones is not a big problem.

The Sample Size Is Large Enough

You’ve heard the saying, “Less is more”? With surveys, the saying is, “Less good information is better than more bad information, but more good information is better.”

If you have a large sample size, and the sample is representative of the target population (meaning randomly selected), you can count on that information to be pretty accurate. Exactly how accurate depends on the sample size, but in general a bigger sample leads to more accurate information (assuming the data is well collected).

A quick-and-dirty formula to calculate the accuracy of a survey is to divide 1 by the square root of the sample size. For example, a survey of 1,000 (randomly selected) people is accurate to within plus or minus 1/√1000, which is 0.032, or 3.2%. This percentage is called the margin of error. (Note that this formula is just a rough estimate. A better estimate can be found using the formulas from Chapter 7.)
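That rule of thumb is easy to sketch in Python. (The function name is invented for illustration; the formula is the rough 1 over the square root of n estimate from the text, not the more precise Chapter 7 version.)

```python
import math

def rough_margin_of_error(sample_size):
    """Quick-and-dirty margin of error: 1 divided by the square root of n."""
    return 1 / math.sqrt(sample_size)

print(round(rough_margin_of_error(1000), 3))   # prints: 0.032 (about 3.2%)
```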

Beware of surveys that have a large sample size that isn’t randomly selected; Internet surveys are the biggest culprits. A company can say that 50,000 people logged on to its Web site to answer a survey, but that information is biased, because it represents only the opinions of those who had access to the Internet, went to the Web site, and chose to complete the survey.

Nonresponse Is Minimized

After the sample size has been chosen and the sample of individuals has been randomly selected from the target population, you have to get the information you need from the people in the sample. If you’ve ever thrown away a survey or refused to answer a few questions over the phone, you know that getting people to participate in a survey isn’t easy.

The importance of following up

If a researcher wants to minimize bias, the best way to handle nonresponse is to “hound” the people in the sample: Follow up one, two, or even three times, offering dollar bills, coupons, self-addressed stamped return envelopes, chances to win prizes, and so on. Note that offering more than a small token of incentive and appreciation for participating can create bias as well, because then people who really need the money are more likely to respond than those who don’t.

Consider what motivates you to fill out a survey. If the incentive provided by the researcher doesn’t get you, maybe the subject matter piques your interest. Unfortunately, this is where bias comes in. If only those folks who feel very strongly respond to a survey, only their opinions will count; because the other people who don’t really care about the issue don’t respond, each “I don’t care” vote doesn’t count. And when people do care but don’t take the time to complete the survey, those votes don’t count, either.

The response rate of a survey is a percentage found by taking the number of respondents divided by the total sample size and multiplying by 100%. The ideal response rate according to statisticians is anything over 70%. However, most response rates fall well short of that, unless the survey is done by a very reputable organization, such as Gallup.
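The response-rate calculation reads directly off the definition above; here it is as a minimal Python sketch (the function name and the example numbers are invented for illustration).

```python
def response_rate(respondents, sample_size):
    """Response rate: respondents divided by total sample size, times 100%."""
    return respondents / sample_size * 100

print(response_rate(350, 500))   # prints: 70.0 (just meets the 70% benchmark)
```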

Look for the response rate when examining survey results. If the response rate is too low (much less than 70%) the results may be biased and should be ignored. Selecting a smaller initial sample and following up aggressively is better than selecting a bigger sample that ends up with a low response rate. Plan several follow-up calls/mailings to reduce bias. It also helps increase the response rate to let people know up front whether their results will be shared or not.

Anonymity versus confidentiality

If you were to conduct a survey to determine the extent of personal email usage at work, the response rate would probably be low because many people are reluctant to disclose their use of personal email in the workplace. You could encourage people to respond by letting them know that their privacy would be protected during and after the survey.

When you report the results of a survey, you generally don’t tie the information collected to the names of the respondents, because doing so would violate the privacy of the respondents. You’ve probably heard the terms anonymous and confidential before, but you may not realize that they have totally different meanings in terms of privacy issues. Keeping results confidential means that I could tie your information to your name in my report, but I promise that I won’t do that. Keeping results anonymous means that I have no way of tying your information to your name in my report, even if I wanted to.

If you’re asked to participate in a survey, be sure you’re clear about what the researchers plan to do with your responses and whether or not your name can be tied to the survey. (Good surveys always make this issue very clear for you.) Then make a decision as to whether you still want to participate.

The Survey Is of the Right Type

Surveys come in many types: mail surveys, telephone surveys, Internet surveys, house-to-house interviews, and man-on-the-street surveys (in which someone comes up to you with a clipboard and asks, “Do you have a few minutes to participate in a survey?”). One very important yet sometimes overlooked criterion of a good survey is whether the type of survey being used is appropriate for the situation. For example, if the target population is the population of people who are visually impaired, sending them a survey in the mail that has a tiny font isn’t a good idea (yes, this has happened!).

When looking at the results of a survey, be sure to find out what type of survey was used and reflect on whether this type of survey was appropriate.

Questions Are Well Worded

The way in which a question is worded in a survey can affect the results. For example, while President Bill Clinton was in office and the Monica Lewinsky scandal broke, a CNN/Gallup Poll conducted August 21–23, 1998, asked respondents to judge Clinton’s favorability, and about 60% gave him a positive result. When CNN/Gallup reworded the question to ask respondents to judge Clinton’s favorability “as a person,” only about 40% gave him a positive rating. These questions were both getting at the same issue; even though they were worded only slightly differently you can see how different the results are. So question wording does matter.

One huge problem is the use of misleading questions (in other words, questions that are worded in such a way that you know how the researcher wants you to answer). An example of a misleading question is, “Do you agree that the president should have the power of a line-item veto to eliminate waste?” This question should be worded in a neutral way, such as “What is your opinion about the line-item veto ability of a president?” Then give a scale from 1 to 5 where 1 = strongly disagree and 5 = strongly agree.

When you see the results of a survey that’s important to you, ask for a copy of the questions that were asked and analyze them to ensure that they were neutral and minimized bias.

The Timing Is Appropriate

The timing of a survey is everything. Current events shape people’s opinions, and while some pollsters try to determine how people really feel, others take advantage of these situations, especially the negative ones. For example, polls regarding gun control often come out right after a shooting that is reported by the national media. Timing of any survey, regardless of the subject matter, can still cause bias. Check the date when a survey was conducted and see whether you can determine any relevant events that may have temporarily influenced the results.

Personnel Are Well Trained

The people who actually carry out surveys have tough jobs. They have to deal with hang-ups, take-us-off-your-list responses, and answering machines. After they do get a live respondent at the other end of the phone or face to face, the job becomes even harder. For example, if the respondent doesn’t understand the question and needs more information, how much can you say while still remaining neutral?

For a survey to be successful, the survey personnel must be trained to collect data in an accurate and unbiased way. The key is to think through every possible scenario that may come up, decide how each should be handled, and have this discussion well before participants are ever contacted.

You can also avoid problems by running a pilot study (a practice run with only a few respondents) to make sure the survey is clear and consistent and that the personnel are handling responses appropriately. Any problems identified can be fixed before the real survey starts.

Proper Conclusions Are Made

Even if a survey is done correctly, researchers can misinterpret or over-interpret results so that they say more than they really should. Here are some of the most common errors made in drawing conclusions from surveys:

  • Making projections to a larger population than the study actually represents
  • Claiming a difference exists between two groups when a difference isn’t really there
  • Saying that “these results aren’t scientific, but . . .” and then presenting the results as if they are scientific

To avoid common errors made when drawing conclusions:

1. Check whether the sample was selected properly and that the conclusions don’t go beyond the population presented by that sample.

2. Look for disclaimers about surveys before reading the results, if you can.

That way, you’ll be less likely to be influenced by the results if, in fact, the results aren’t based on a scientific survey. Now that you know what a scientific survey (the media’s term for an accurate and unbiased survey) actually involves, you can use those criteria to judge whether survey results are credible.

3. Be on the lookout for statistically incorrect conclusions.

If someone reports a difference between two groups based on survey results, be sure the difference is larger than the reported margin of error. If the difference is within the margin of error, you should expect the sample results to vary by that much just by chance, and the so-called “difference” can’t really be generalized to the entire population; see Chapter 7.

4. Tune out anyone who says, “These results aren’t scientific, but. . . .”

Know the limitations of any survey and be wary of any information coming from surveys in which those limitations aren’t respected. A bad survey is cheap and easy to do, but you get what you pay for. Before looking at the results of any survey, investigate how it was designed and conducted, so that you can judge the quality of the results.
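The margin-of-error check from item 3 in the list above can be sketched in Python. (The function name and the example percentages are illustrative only; this is the rough rule of thumb from the text, not a formal hypothesis test.)

```python
def difference_is_generalizable(group_a_pct, group_b_pct, margin_of_error_pct):
    """Rule of thumb: a reported difference between two groups is only worth
    generalizing to the population if it exceeds the margin of error."""
    return abs(group_a_pct - group_b_pct) > margin_of_error_pct

print(difference_is_generalizable(52, 48, 3.2))   # prints: True  (4 points > 3.2)
print(difference_is_generalizable(51, 49, 3.2))   # prints: False (within the MOE)
```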

Chapter 13: A Checklist for Judging Experiments

In This Chapter

  • The added value of experiments
  • Criteria for a good experiment
  • Action items for evaluating an experiment

In this chapter, you go behind the scenes of experiments — the driving force of medical studies and other investigations in which comparisons are made. You find out the difference between experiments and observational studies and discover what experiments can do for you, how they’re supposed to be done, and how you can spot misleading results.

Experiments versus Observational Studies

Although many different types of studies exist, you can boil them all down to basically two different types: experiments and observational studies. An observational study is just what it sounds like: a study in which the researcher merely observes the subjects and records the information. No intervention takes place, no changes are introduced, and no restrictions or controls are imposed. For example, a survey is an observational study. An experiment is a study that doesn’t simply observe subjects in their natural state, but deliberately applies treatments to them in a controlled situation and records the outcomes (for example, medical studies done in a laboratory).

Experiments are generally more powerful than observational studies; for example, an experiment can identify a cause-and-effect relationship between two variables, whereas an observational study can only point out a connection.

Criteria for a Good Experiment

To decide whether an experiment is credible, check the following items:

1. Is the sample size large enough to yield precise results?

2. Do the subjects accurately represent the intended population?

3. Are the subjects randomly assigned to the treatment and control groups?

4. Was the placebo effect measured (if applicable)?

5. Are possible confounding variables controlled for?

6. Is the potential for bias minimized?

7. Was the data analyzed correctly?

8. Are the conclusions appropriate?

In the following sections I present action items for evaluating an experiment based on each of the above criteria.

Inspect the Sample Size

The size of a sample greatly affects the accuracy of the results. The larger the sample size, the more accurate the results are, and the more powerful the statistical analysis will be at detecting real differences due to treatments.

Small samples — small conclusions

You may be surprised at the number of research headlines that were based on very small samples. If the results are important to you, ask for a copy of the research report and find out how many subjects were involved in the study.

Also be wary of research that finds significant results based on very small sample sizes (especially those much smaller than 30). It could be a sign of what statisticians call data fishing, where someone fishes around in their data set using many different kinds of analyses until they find a significant result (which is not repeatable because it was just a fluke).

Original versus final sample size

Be specific about what a researcher means by sample size. For example, ask how many subjects were selected to participate in an experiment and then ask for the number who actually completed the experiment — these two numbers can be very different. Make sure the researchers can explain any situations in which the research subjects decided to drop out or were unable (for some reason) to finish the experiment.

An article in the New York Times entitled “Marijuana Is Called an Effective Relief in Cancer Therapy” says in the opening paragraph that marijuana is “far more effective” than any other drug in relieving the side effects of chemotherapy. When you get into the details, you find out that the results are based on only 29 patients (15 on the treatment, 14 on a placebo). To add to the confusion, you find out that only 12 of the 15 patients in the treatment group actually completed the study; so what happened to the other three subjects?

Examine the Subjects

An important step in designing an experiment is selecting the sample of participants, called the research subjects. Although researchers would like for their subjects to be selected randomly from their respective populations, in most cases this just isn’t possible. For example, suppose a group of eye researchers wants to test out a new laser surgery on nearsighted people. To select their subjects, they randomly select various eye doctors from across the country and randomly select nearsighted patients from these doctors’ files. They call up each person selected and say, “We’re experimenting with a new laser surgery treatment for nearsightedness, and you’ve been selected at random to participate in our study. When can you come in for the surgery?” This may sound like a good random sampling plan, but it doesn’t make for an ethical experiment.

The point is, getting a truly random sample of people to participate in an experiment would be great, but is typically not feasible or ethical to do. Rather than select people at random, experimenters do the best they can to gather volunteers that meet certain criteria so they’re doing the experiment on an appropriate cross-section of the population. The randomness part comes in when individuals are assigned to the groups (treatment group, control group, and so forth) in a random fashion, as explained in the next section.

Check for Random Assignments

After the sample has been selected, the subjects are assigned to either a treatment group, which receives a certain level of some factor being studied, or a control group, which receives either no treatment or a fake treatment. How the subjects are assigned to their respective groups is extremely important.

Suppose a researcher wants to determine the effects of exercise on heart rate. The subjects in his treatment group run five miles and have their heart rates measured before and after the run. The subjects in his control group will sit on the couch the whole time and watch reruns of The Simpsons. If only the health nuts (who probably already have excellent heart rates) volunteer to be in the treatment group, the researcher will be looking only at the effect of the treatment (running five miles) on very healthy and active people. He won’t see the effect that running five miles has on the heart rates of couch potatoes. This nonrandom assignment of subjects to the treatment and control groups can have a huge impact on his conclusions.

To avoid bias, subjects must be assigned to treatment/control groups at random. This results in groups that are more likely to be fair and balanced, yielding more credible results.
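Random assignment is easy to sketch in code: shuffle the list of subjects, then split it. This is a minimal Python illustration under the assumption that the sample is split evenly into two groups; the function name, subject names, and seed are all made up.

```python
import random

def random_assignment(subjects, seed=None):
    """Randomly split subjects into a treatment group and a control group."""
    rng = random.Random(seed)     # seed is only here to make examples repeatable
    shuffled = list(subjects)
    rng.shuffle(shuffled)         # every subject is equally likely to land anywhere
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

treatment, control = random_assignment(["Ann", "Bob", "Cal", "Dee"], seed=42)
```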

Gauge the Placebo Effect

A fake treatment takes into account what researchers call the placebo effect. The placebo effect is a response that people have (or think they’re having) because they know they’re getting some sort of “treatment” (even if that treatment is a fake treatment, aka placebo, such as sugar pills).

If the control group is on a placebo, you may expect them not to report any side effects, but you would be wrong. Placebo groups often report side effects in percentages that seem quite high; this is because the knowledge that some treatment is being taken (even if it’s a fake treatment) can have a psychological (even a physical) effect. If you want to be fair about examining the side effects of a treatment, you have to take into account the side effects that the control group reports; that is, side effects that are due to the placebo effect.

In some situations, such as when the subjects have very serious diseases, offering a fake treatment as an option may be unethical. When ethical reasons bar the use of fake treatments, the new treatment is compared to an existing or standard treatment that is known to be effective. After researchers have enough data to see that one of the treatments is working better than the other, they will generally stop the experiment and put everyone on the better treatment, again for ethical reasons.

Identify Confounding Variables

A confounding variable is a characteristic which was not included or controlled for in the study, but can influence the results. That is, the real effects due to the treatment are confounded, or clouded, due to this variable.

For example, if you select a group of people who take vitamin C daily, and a group who don’t, and follow them all for a year’s time counting how many colds they get, you might notice the group taking vitamin C had fewer colds than the group who didn’t take vitamin C. However, you cannot conclude that vitamin C reduces colds. Because this was not a true experiment but rather an observational study, there are many confounding variables at work. One possible confounding variable is the person’s level of health consciousness; people who take vitamins daily may also wash their hands more often, thereby heading off germs.

How do researchers handle confounding variables? Control is what it’s all about. Here you could pair up people who have the same level of health-consciousness and randomly assign one person in each pair to taking vitamin C each day (the other person gets a fake pill). Any difference in number of colds found between the groups is more likely due to the vitamin C, compared to the original observational study. Good experiments control for potential confounding variables.

Assess Data Quality

To decide whether or not you’re looking at credible data from an experiment, look for these characteristics:

  • Reliability: Reliable data get repeatable results with subsequent measurements. If your doctor checks your weight once and you get right back on the scale and see it’s different, there is a reliability issue. Same with blood tests, blood pressure and temperature measurements, and the like. It’s important to use well-calibrated measurement instruments in an experiment to help ensure reliable data.
  • Unbiasedness: Unbiased data contain no systematic favoritism of certain individuals or responses. Bias is caused in many ways: by a bad measurement instrument, like a bathroom scale that’s sometimes 5 pounds over; by a bad sample, like a drug study done on adults when the drug is actually taken by children; or by researchers who have preconceived expectations for the results (“You feel better now after you took that medicine, don’t you?”).

Bias is difficult, and in some cases even impossible, to measure. The best you can do is anticipate potential problems and design your experiment to minimize them. For example, a double-blind experiment means that neither the subjects nor the researchers know who got which treatment or who is in the control group. This is one way to minimize bias by people on either side.

  • Validity: Valid data measure what they are intended to measure. For example, reporting the prevalence of crime using number of crimes in an area is not valid; the crime rate (number of crimes per capita) should be used because it factors in how many people live in the area.

Check Out the Analysis

After the data have been collected, they’re put into that mysterious box called the statistical analysis. The choice of analysis is just as important (in terms of the quality of the results) as any other aspect of a study. A proper analysis should be planned in advance, during the design phase of the experiment. That way, after the data are collected, you won’t run into any major problems during the analysis.

As part of this planning you have to make sure the analysis you choose will actually answer your question. For example, if you want to estimate the average blood pressure for the treatment group, use a confidence interval for one population mean (see Chapter 7). However, if you want to compare the average blood pressure for the treatment group versus a control group, you use a hypothesis test for two means (see Chapter 8). Each analysis has its own particular purpose; this book hits the highlights of the most commonly used analyses.

You also have to make sure that the data and your analysis are compatible. For example, if you want to compare a treatment group to a control group in terms of the amount of weight lost on a new (versus an existing) diet program, you need to collect data on how much weight each person lost (not just each person’s weight at the end of the study).

Scrutinize the Conclusions

Some of the biggest statistical mistakes are made after the data has all been collected and analyzed — when it’s time to draw conclusions, some researchers get it all wrong. The three most common errors in drawing conclusions are the following:

  • Overstating their results
  • Making connections or giving explanations that aren’t backed up by the statistics
  • Going beyond the scope of the study in terms of whom the results apply to

Overstated results

When you read a headline or hear about the big results of the latest study, be sure to look further into the details of the study — the actual results might not be as grand as what you were led to believe. For example, suppose a researcher finds a new procedure that slows down tumor growth in lab rats.

This is a great result but it doesn’t mean this procedure will work on humans, or will be a cure for cancer. The results have to be placed into perspective.

Ad-hoc explanations

Be careful when you hear researchers explaining why their results came out a certain way. Some after-the-fact (“ad-hoc”) explanations for research results are simply not backed up by the studies they came from. For example, suppose a study observes that people who drink more diet cola sleep fewer hours per night on average. Without a more in-depth study, you can’t go back and explain why this occurs. Some researchers might conclude the caffeine is causing insomnia (okay…), but could it be that diet cola lovers (including yours truly) tend to be night owls, and night owls typically sleep fewer hours than average?

Generalizing beyond the scope

You can only make conclusions about the population that’s represented by your sample. If you want to draw conclusions about the opinions of all Americans, you need a random sample of Americans. If your random sample came from a group of students in your psychology class, however, then the opinions of your psychology class are all you can draw conclusions about.

Some researchers try to draw conclusions about populations that have a broader scope than their sample, often because true representative samples are hard to get. Find out where the sample came from before you accept broad-based conclusions.

Chapter 14: Ten Common Statistical Mistakes

In This Chapter

  • Recognizing common statistical mistakes
  • How to avoid these mistakes when doing your own statistics

This book is not only about understanding statistics that you come across in your job and everyday life; it’s also about deciding whether the statistics are correct, reasonable, and fair. After all, if you don’t critique the information and ask questions about it, who will? In this chapter, I outline some common statistical mistakes made out there, and I share ways to recognize and avoid those mistakes.

Misleading Graphs

Many graphs and charts contain misinformation, mislabeled information, or misleading information, or they simply lack important information that the reader needs to make critical decisions about what is being presented.

Pie charts

Pie charts are nice for showing how categorical data is broken down, but they can be misleading. Here’s how to check a pie chart for quality:

  • Check to be sure the percentages add up to 100%, or close to it (any round-off error should be small).
  • Beware of slices labeled “Other” that are larger than the rest of the slices. This means the pie chart is too vague.
  • Watch for distortions with three-dimensional-looking pie charts, in which the slice closest to you looks larger than it really is because of the angle at which it’s presented.
  • Look for a reported total number of individuals who make up the pie chart, so you can determine “how big” the pie is, so to speak. If the sample size is too small, the results are not going to be reliable.
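These checks are easy to automate; here's a small Python sketch with hypothetical slice percentages:

```python
# A quick quality check on a (hypothetical) pie chart's slices:
slices = {"Rent": 38.0, "Food": 24.5, "Transport": 14.8, "Other": 22.5}
total = sum(slices.values())

print(round(total, 1))        # 99.8 -- should be 100, or close to it
print(abs(total - 100) <= 1)  # True: the round-off error is small

# Is "Other" bigger than every named slice? (If so, the chart is too vague.)
print(slices["Other"] < max(v for k, v in slices.items() if k != "Other"))
# True here: "Other" is smaller than the Rent slice
```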
Bar graphs

A bar graph breaks down categorical data by the number or percent in each group (see Chapter 3). When examining a bar graph:

  • Consider the units being represented by the height of the bars and what the results mean in terms of those units. For example, compare the total number of crimes versus the crime rate (the total number of crimes per capita).
  • Evaluate the appropriateness of the scale, or amount of space between units expressing the number in each group of the bar graph. Small scales (for example, going from 1 to 500 by 10s) make differences look bigger; large scales (going from 1 to 500 by 100s) make them look smaller.
Time charts

A time chart shows how some measurable quantity changes over time, for example, stock prices (see Chapter 3). Here are some issues to watch for with time charts:

  • Watch the scale on the vertical (quantity) axis as well as the horizontal (timeline) axis; results can be made to look more or less dramatic by simply changing the scale.
  • Take into account the units being portrayed by the chart and be sure they are equitable for comparison over time; for example, are dollars being adjusted for inflation?
  • Beware of people trying to explain why a trend is occurring without additional statistics to back themselves up. A time chart generally shows what is happening. Why it’s happening is another story.
  • Watch for situations in which the time axis isn’t marked with equally spaced jumps. This often happens when data are missing. For example, the time axis may have equal spacing between 1971, 1972, 1975, 1976, 1978, when it should actually show empty spaces for the years in which no data are available.
Histograms

Histograms graph numerical data in a bar-chart type of graph (see Chapter 3). Items to watch for regarding histograms:

  • Watch the scale used for the vertical (frequency/relative frequency) axis, especially for results that are exaggerated or played down through the use of inappropriate scales.
  • Check out the units on the vertical axis, whether they’re reporting frequencies or relative frequencies, when examining the information.
  • Look at the scale used for the groupings of the numerical variable on the horizontal axis. If the groups are based on small intervals (for example, 0–2, 2–4, and so on), the data may look overly volatile. If the groups are based on large intervals (0–100, 100–200, and so on), the data may give a smoother appearance than is realistic.
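A Python sketch (with made-up data and a hypothetical `histogram` helper) shows how the grouping width alone changes the picture:

```python
from collections import Counter

# Twelve made-up measurements:
data = [1, 3, 3, 4, 6, 7, 8, 12, 13, 14, 15, 18]

def histogram(values, width):
    # Count how many values fall into each interval of the given width
    return dict(sorted(Counter((v // width) * width for v in values).items()))

print(histogram(data, 2))   # narrow bins: choppy, volatile-looking counts
print(histogram(data, 10))  # wide bins: {0: 7, 10: 5} -- smooths detail away
```

Same data, two very different impressions — which is exactly why the grouping scale deserves scrutiny.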

Biased Data

Bias in statistics is the result of a systematic error that either overestimates or underestimates the true value. Here are some of the most common sources of biased data:

  • Measurement instruments that are systematically off, such as a scale that always adds 5 pounds to your weight.
  • Participants that are influenced by the data-collection process. For example, the survey question, “Have you ever disagreed with the government?” will overestimate the percentage of people unhappy with the government.
  • A sample of individuals that doesn’t represent the population of interest. For example, examining study habits by only visiting people in the campus library will create bias.
  • Researchers that aren’t objective. Researchers have a vested interest in the outcome of their studies, and rightly so, but sometimes interest becomes influence over those results. For example, knowing who got what treatment in an experiment causes bias — double-blinding the study makes it more objective.

No Margin of Error

To evaluate a statistical result, you need a measure of its precision — that is, the margin of error (for example “plus or minus 3 percentage points”). When researchers or the media fail to report the margin of error, you’re left to wonder about the accuracy of the results, or worse, you just assume that everything is fine, when in many cases it’s not. Always check the margin of error. If it’s not included, ask for it! (See Chapter 7 for all the details on margin of error.)
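For a sample proportion, the margin of error can be sketched in a couple of lines of Python; the 1.96 multiplier corresponds to 95% confidence, and the poll size is hypothetical:

```python
import math

def margin_of_error(p, n, z=1.96):
    # Approximate 95% margin of error for a sample proportion:
    # MoE = z * sqrt(p * (1 - p) / n)
    return z * math.sqrt(p * (1 - p) / n)

# A poll of about 1,067 people is where the familiar "plus or minus
# 3 percentage points" comes from (worst case, p = 0.5):
print(round(margin_of_error(0.5, 1067) * 100, 1))  # 3.0
```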

Nonrandom Samples

A random sample (as described in Chapter 12) is a subset of the population selected in such a way that each member of the population has an equal chance of being selected (like drawing names out of a hat). No systematic favoritism or exclusion is involved in a random sample. However, many studies aren’t based on random samples of individuals; for example, TV polls asking viewers to “call us with your opinion”; an Internet survey you heard about from your friends; or a person with a clipboard at the mall asking for a minute of your time.
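Drawing names out of a hat is exactly what Python's standard library can do for you; here's a minimal sketch with a hypothetical population:

```python
import random

random.seed(7)  # for a repeatable draw

# A hypothetical population of 500 names:
population = [f"person_{i}" for i in range(1, 501)]

# random.sample draws without replacement, giving every member an
# equal chance of selection -- like drawing names out of a hat:
sample = random.sample(population, 10)
print(sample)
```

Call-in polls, Internet surveys, and mall intercepts skip this step entirely, which is what makes them nonrandom.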

What’s the effect of a nonrandom sample? Oh nothing, except it just blows the lid off of any credible conclusions the researcher ever wanted to make. Nonrandom samples are biased, and their data can’t be used to represent any population beyond themselves. Check to make sure an important result is based on a random sample. If it isn’t, run — and don’t look back!

Missing Sample Sizes

Knowing how much data went into a study is critical. Sample size determines the precision (repeatability) of the results. A larger sample size means more precision, and a small sample size means less precision. Many studies (more than you would expect) are based on only a few subjects.

You might find that headlines and visual displays (such as graphs) are not exactly what they seem to be when the details reveal either a small sample size (reducing reliability in the results) or in some cases, no information at all about the sample size. For example, you’ve probably seen the chewing gum ad that says, “Four out of five dentists surveyed recommend [this gum] for their patients who chew gum.” What if they really did ask only five dentists?

Always look for the sample size before making decisions about statistical information. Larger sample sizes have more precision than small sample sizes (assuming the data is of good quality). If the sample size is missing from the article, get a copy of the full report of the study or contact the researcher or author of the article.
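Precision improves only with the square root of the sample size, which a short Python sketch makes concrete (worst-case proportion p = 0.5; the sample sizes are illustrative):

```python
import math

def moe(n, p=0.5, z=1.96):
    # 95% margin of error for a proportion, in percentage points
    return z * math.sqrt(p * (1 - p) / n) * 100

for n in (5, 100, 400, 1600):
    print(n, round(moe(n), 1))
# Quadrupling n only halves the margin of error; and at n = 5 (those
# five dentists) it's roughly plus or minus 43.8 percentage points.
```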

Misinterpreted Correlations

Correlation is one of the most misunderstood and misused statistical terms used by researchers, the media, and the general public. (You can read all about this in Chapter 10.) Here are my three major correlation pet peeves:

  • Correlation applies only to two numerical variables, such as height and weight. So, if you hear someone say, “It appears that the voting pattern is correlated with gender,” you know that’s statistically incorrect. Voting pattern and gender may be associated, but they can’t be correlated in the statistical sense.
  • Correlation measures the strength and direction of a linear relationship. If the correlation is weak, you can say there is no linear relationship; however some other type of relationship might exist, for example, a curve (such as supply and demand curves in economics).
  • Correlation doesn’t imply cause and effect. Suppose someone reports that the more people drink diet cola, the more weight they gain. If you’re a diet cola drinker, don’t panic just yet. This may be a freak of nature that someone stumbled onto. At most, it means more research needs to be done (for example, a well-designed experiment) to explore any possible connection.
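The second point is easy to demonstrate in a few lines of Python, using a hand-rolled Pearson correlation on a made-up parabola:

```python
import math

def pearson(x, y):
    # Pearson correlation: strength of the *linear* relationship only
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]            # y = x^2: a perfect curve
print(round(pearson(x, y), 4))    # 0.0 -- no linear relationship,
# even though y is completely determined by x
```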

Confounding Variables

Suppose a researcher claims that eating seaweed helps you live longer; you read interviews with the subjects and discover that they were all over 100, ate very healthy foods, slept an average of 8 hours a day, drank a lot of water, and exercised. Can we say the long life was caused by the seaweed? You can’t tell, because so many other variables exist that could also promote long life (the diet, the sleeping, the water, the exercise); these are all confounding variables.

A common error in research studies is to fail to control for confounding variables, leaving the results open to scrutiny. The best way to head off confounding variables is to do a well-designed experiment in a controlled setting.

Observational studies are great for surveys and polls, but not for showing cause-and-effect relationships, because they don’t control for confounding variables. A well-designed experiment provides much stronger evidence. (See Chapter 13.)

Botched Numbers

Just because a statistic appears in the media doesn’t mean it’s correct. Errors appear all the time (by error or design), so look for them. Here are some tips for spotting botched numbers:

  • Make sure everything adds up to what it’s reported to. With pie charts, be sure the percentages add up to 100% (or very close to it — there may be round-off error).
  • Double-check even the most basic of calculations. For example, a chart says 83% of Americans are in favor of an issue, but the report says 7 out of every 8 Americans are in favor of the issue. 7 divided by 8 is 87.5%.
  • Look for the response rate of a survey — don’t just be happy with the number of participants. (The response rate is the number of people who responded divided by the total number of people surveyed times 100%.) If the response rate is much lower than 70%, the results could be biased, because you don’t know what the nonrespondents would have said.
  • Question the type of statistic used to determine if it’s appropriate. For example, the number of crimes went up, but so did population size. Researchers should have reported crime rate (crimes per capita) instead.
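The second and third checks above come down to simple arithmetic; here's a Python sketch with hypothetical survey numbers:

```python
# The chart's 83% doesn't match the report's "7 out of every 8":
print(7 / 8 * 100)  # 87.5

# Response rate: people who responded, divided by everyone surveyed,
# times 100% (counts are made up for illustration):
responded, surveyed = 560, 1000
response_rate = responded * 100 / surveyed
print(response_rate)  # 56.0 -- well under 70%, so beware of nonresponse bias
```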

Statistics are based on formulas and calculations that don’t know any better — the people plugging in the numbers should know better, but sometimes they don’t, or they don’t want you to catch on. You, as a consumer of information (also known as a certified skeptic), must be the one to take action. The best policy is to ask questions.

Selectively Reporting Results

Another bad move is when a researcher reports a “statistically significant” result but fails to mention that he found it among 50 different statistical tests he performed — the other 49 of which were not significant. This behavior is called data fishing, and that is not allowed in statistics. If he performs each test at a significance level of 0.05, that means he should expect to “find” a result that’s not really there 5 percent of the time just by chance (see Chapter 8 for more on Type I errors). In 50 tests, he should expect two or three of these errors on average (50 × 0.05 = 2.5), and I’m betting one of them accounts for his lone “statistically significant” result.
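The arithmetic behind that bet, as a tiny Python sketch:

```python
# Each test at significance level 0.05 has a 5% chance of flagging
# a result that isn't really there (a Type I error). Across 50
# independent tests on pure noise:
alpha, tests = 0.05, 50

print(round(alpha * tests, 1))             # 2.5 false positives expected
print(round(1 - (1 - alpha) ** tests, 2))  # 0.92: odds of at least one
```

So a researcher running 50 tests is more likely than not to "discover" something that isn't there.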

How do you protect yourself against misleading results due to data fishing? Find out more details about the study: How many tests were done, how many results weren’t significant, and what was found to be significant? In other words, get the whole story if you can, so that you can put the significant results into perspective. You might also consider waiting to see whether others can verify and replicate the result.

The Almighty Anecdote

Ah, the anecdote — one of the strongest influences on public opinion and behavior ever created, and one of the least statistical. An anecdote is a story based on a single person’s experience or situation. For example:

  • The waitress who won the lottery
  • The cat that learned how to ride a bicycle
  • The woman who lost 100 pounds on a potato diet
  • The celebrity who claims to use an over-the-counter hair color for which she is a spokesperson (yeah, right)

An anecdote is basically a data set with a sample size of one, describing something that doesn’t happen to most people. With an anecdote you have no information with which to compare the story, no statistics to analyze, no possible explanations or information to go on. You have just a single story. Don’t let anecdotes have much influence over you. Rather, rely on scientific studies and statistical information based on large random samples of individuals who represent their target populations (not just a single situation).

Appendix: Tables for Reference

This appendix provides three commonly used tables for your reference: the Z-table, the t-table, and the Binomial table.



Statistics II For Dummies


Dedication

To my husband Eric: My sun rises and sets with you. To my son Clint: I love you up to the moon and back.

About the Author

Deborah Rumsey has a PhD in Statistics from The Ohio State University (1993), where she’s a Statistics Education Specialist/Auxiliary Faculty Member for the Department of Statistics. Dr. Rumsey has been given the distinction of being named a Fellow of the American Statistical Association. She has also won the Presidential Teaching Award from Kansas State University. She’s the author of Statistics For Dummies, Statistics Workbook For Dummies, and Probability For Dummies and has published numerous papers and given many professional presentations on the subject of statistics education. Her passions include being with her family, bird watching, getting more seat time on her Kubota tractor, and cheering the Ohio State Buckeyes on to another National Championship.

Author’s Acknowledgments

Thanks again to Lindsay Lefevere and Kathy Cox for giving me the opportunity to write this book; to Natalie Harris and Chrissy Guthrie for their unwavering support and perfect chiseling and molding of my words and ideas; to Kim Gilbert, University of Georgia, for a thorough technical review; and to Elizabeth Rea and Sarah Westfall for great copy-editing. Special thanks to Elizabeth Stasny for guidance and support from day one; and to Joan Garfield for constant inspiration and encouragement.

Introduction

So you’ve gone through some of the basics of statistics. Means, medians, and standard deviations all ring a bell. You know about surveys and experiments and the basic ideas of correlation and simple regression. You’ve studied probability, margin of error, and a few hypothesis tests and confidence intervals. Are you ready to load your statistical toolbox with a new level of tools? Statistics II For Dummies picks up right where Statistics For Dummies (Wiley) leaves off and keeps you moving along the road of statistical ideas and techniques in a positive, step-by-step way.

The focus of Statistics II For Dummies is on finding more ways of analyzing data. I provide step-by-step instructions for using techniques such as multiple regression, nonlinear regression, one-way and two-way analysis of variance (ANOVA), Chi-square tests, and nonparametric statistics. Using these new techniques, you estimate, investigate, correlate, and congregate even more variables based on the information at hand.

About This Book

This book is designed for those who have completed the basic concepts of statistics through confidence intervals and hypothesis testing (found in Statistics For Dummies) and are ready to plow ahead to get through the final part of Stats I, or to tackle Stats II. However, I do pepper in some brief overviews of Stats I as needed, just to remind you of what was covered and make sure you’re up to speed. For each new technique, you get an overview of when and why it’s used, how to know when you need it, step-by-step directions on how to do it, and tips and tricks from a seasoned data analyst (yours truly). Because it’s very important to be able to know which method to use when, I emphasize what makes each technique distinct and what the results say. You also see many applications of the techniques used in real life.

I also include interpretation of computer output for data analysis purposes. I show you how to use the software to get the results, but I focus more on how to interpret the results found in the output, because you’re more likely to be interpreting this kind of information rather than doing the programming specifically. And because the equations and calculations can get too involved by hand, you often use a computer to get your results. I include instructions for using Minitab to conduct many of the calculations in this book. Most statistics teachers who cover these topics hold this philosophy as well. (What a relief!)

This book is different from the other Stats II books in many ways. Notably, this book features

✓ Full explanations of Stats II concepts. Many statistics textbooks squeeze all the Stats II topics at the very end of Stats I coverage; as a result, these topics tend to get condensed and presented as if they’re optional. But no worries; I take the time to clearly and fully explain all the information you need to survive and thrive.

✓ Dissection of computer output. Throughout the book, I present many examples that use statistical software to analyze the data. In each case, I present the computer output and explain how I got it and what it means.

✓ An extensive number of examples. I include plenty of examples to cover the many different types of problems you’ll face. 

✓ Lots of tips, strategies, and warnings. I share with you some trade secrets, based on my experience teaching and supporting students and grading their papers.

✓ Understandable language. I try to keep things conversational to help you understand, remember, and put into practice statistical definitions, techniques, and processes.

✓ Clear and concise step-by-step procedures. In most chapters, you can find steps that intuitively explain how to work through Stats II problems — and remember how to do it on your own later on.

Conventions Used in This Book

Throughout this book, I’ve used several conventions that I want you to be aware of:

✓ I indicate multiplication by using a lowered asterisk, *.

✓ I indicate the null and alternative hypotheses as Ho (for the null hypothesis) and Ha (for the alternative hypothesis).

✓ The statistical software package I use and display throughout the book is Minitab 14, but I simply refer to it as Minitab.

✓ Whenever I introduce a new term, I italicize it.

✓ Keywords and numbered steps appear in boldface.

✓ Web sites and e-mail addresses appear in monofont.

What You’re Not to Read

At times I get into some of the more technical details of formulas and procedures for those individuals who may need to know about them — or just really want to get the full story. These minutiae are marked with a Technical Stuff icon. I also include sidebars as an aside to the essential text, usually in the form of a real-life statistics example or some bonus info you may find interesting. You can feel free to skip those icons and sidebars because you won’t miss any of the main information you need (but by reading them, you may just be able to impress your stats professor with your above-and-beyond knowledge of Stats II!).

Foolish Assumptions

Because this book deals with Stats II, I assume you have one previous course in introductory statistics under your belt (or at least have read Statistics For Dummies), with topics taking you up through the Central Limit Theorem and perhaps an introduction to confidence intervals and hypothesis tests (although I review these concepts briefly in Chapter 3). Prior experience with simple linear regression isn’t necessary. Only college algebra is needed for the mathematics details. And, some experience using statistical software is a plus but not required.

As a student, you may be covering these topics in one of two ways: either at the tail end of your Stats I course (perhaps in a hurried way, but in some way nonetheless); or through a two-course sequence in statistics in which the topics in this book are the focus of the second course. If so, this book provides you the information you need to do well in those courses.

You may simply be interested in Stats II from an everyday point of view, or perhaps you want to add to your understanding of studies and statistical results presented in the media. If this sounds like you, you can find plenty of real-world examples and applications of these statistical techniques in action as well as cautions for interpreting them.

How This Book Is Organized

This book is organized into five major parts that explore the main topic areas in Stats II, along with one bonus part that offers a series of quick top-ten references for you to use. Each part contains chapters that break down the part’s major objective into understandable pieces. The nonlinear setup of this book allows you to skip around and still have easy access to and understanding of any given topic.

Part I: Tackling Data Analysis and Model-Building Basics

This part goes over the big ideas of descriptive and inferential statistics and simple linear regression in the context of model-building and decision-making. Some material from Stats I receives a quick review. I also present you with the typical jargon of Stats II.

Part II: Using Different Types of Regression to Make Predictions

In this part, you can review and extend the ideas of simple linear regression to the process of using more than one predictor variable. This part presents techniques for dealing with data that follows a curve (nonlinear models) and models for yes or no data used to make predictions about whether or not an event will happen (logistic regression). It includes all you need to know about conditions, diagnostics, model-building, data-analysis techniques, and interpreting results.

Part III: Analyzing Variance with ANOVA

You may want to compare the means of more than two populations, and that requires that you use analysis of variance (ANOVA). This part discusses the basic conditions required, the F-test, one-way and two-way ANOVA, and multiple comparisons. The final goal of these analyses is to show whether the means of the given populations are different and if so, which ones are higher or lower than the rest.

Part IV: Building Strong Connections with Chi-Square Tests

This part deals with the Chi-square distribution and how you can use it to model and test categorical (qualitative) data. You find out how to test for independence of two categorical variables using a Chi-square test. (No more making speculations just by looking at the data in a two-way table!) You also see how to use a Chi-square to test how well a model for categorical data fits.

Part V: Nonparametric Statistics: Rebels without a Distribution

This part helps you with techniques used in situations where you can’t (or don’t want to) assume your data comes from a population with a certain distribution, such as when your population isn’t normal (the condition required by most other methods in Stats II).

Part VI: The Part of Tens

Reading this part can give you an edge in a major area beyond the formulas and techniques of Stats II: ending the problem right (knowing what kinds of conclusions you can and can’t make). You also get to know Stats II in the real world, namely how it can help you stand out in a crowd.

You also can find an appendix at the back of this book that contains all the tables you need to understand and complete the calculations in this book.

Icons Used in This Book

I use icons in this book to draw your attention to certain text features that occur on a regular basis. Think of the icons as road signs that you encounter on a trip. Some signs tell you about shortcuts, and others offer more information that you may need; some signs alert you to possible warnings, while others leave you with something to remember.

When you see this icon, it means I’m explaining how to carry out that particular data analysis using Minitab. I also explain the information you get in the computer output so you can interpret your results.

I use this icon to reinforce certain ideas that are critical for success in Stats II, such as things I think are important to review as you prepare for an exam.

When you see this icon, you can skip over the information if you don’t want to get into the nitty-gritty details. They exist mainly for people who have a special interest or obligation to know more about the more technical aspects of certain statistical issues.

This icon points to helpful hints, ideas, or shortcuts that you can use to save time; it also includes alternative ways to think about a particular concept.

I use warning icons to help you stay away from common misconceptions and pitfalls you may face when dealing with ideas and techniques related to Stats II.

Where to Go from Here

This book is written in a nonlinear way, so you can start anywhere and still understand what’s happening. However, I can make some recommendations if you want some direction on where to start.

If you’re thoroughly familiar with the ideas of hypothesis testing and simple linear regression, start with Chapter 5 (multiple regression). Use Chapter 1 if you need a reference for the jargon that statisticians use in Stats II.

If you’ve covered all topics up through the various types of regression (simple, multiple, nonlinear, and logistic) or a subset of those as your professor deemed important, proceed to Chapter 9, the basics of analysis of variance (ANOVA).

Chapter 14 is the place to begin if you want to tackle categorical (qualitative) variables before hitting the quantitative stuff. You can work with the Chi-square test there.

Nonparametric statistics are presented starting with Chapter 16. This area is a hot topic in today’s statistics courses, yet it’s also one that doesn’t seem to get as much space in textbooks as it should. Start here if you want the full details on the most common nonparametric procedures.

Part I: Tackling Data Analysis and Model-Building Basics

Chapter 1: Beyond Number Crunching: The Art and Science of Data Analysis

Data Analysis: Looking before You Crunch
Nothing (not even a straight line) lasts forever
Data snooping isn’t cool
No (data) fishing allowed
Getting the Big Picture: An Overview of Stats II
Population parameter
Sample statistic
Confidence interval
Hypothesis test
Analysis of variance (ANOVA)
Multiple comparisons
Interaction effects
Correlation
Linear regression
Chi-square tests
Nonparametrics

Chapter 2: Finding the Right Analysis for the Job

Categorical versus Quantitative Variables
Statistics for Categorical Variables
Estimating a proportion
Comparing proportions
Looking for relationships between categorical variables
Building models to make predictions
Statistics for Quantitative Variables
Making estimates
Making comparisons
Exploring relationships
Predicting y using x
Avoiding Bias
Measuring Precision with Margin of Error
Knowing Your Limitations

Chapter 3: Reviewing Confidence Intervals and Hypothesis Tests

Estimating Parameters by Using Confidence Intervals
Getting the basics: The general form of a confidence interval
Finding the confidence interval for a population mean
What changes the margin of error?
Interpreting a confidence interval
What’s the Hype about Hypothesis Tests?
What Ho and Ha really represent
Gathering your evidence into a test statistic
Determining strength of evidence with a p-value
False alarms and missed opportunities: Type I and II errors
The power of a hypothesis test

Part II: Using Different Types of Regression to Make Predictions

Chapter 4: Getting in Line with Simple Linear Regression

Exploring Relationships with Scatterplots and Correlations
Using scatterplots to explore relationships
Collating the information by using the correlation coefficient
Building a Simple Linear Regression Model
Finding the best-fitting line to model your data
The y-intercept of the regression line
The slope of the regression line
Making point estimates by using the regression line
No Conclusion Left Behind: Tests and Confidence Intervals for Regression
Scrutinizing the slope
Inspecting the y-intercept
Building confidence intervals for the average response
Making the band with prediction intervals
Checking the Model’s Fit (The Data, Not the Clothes!)
Defining the conditions
Finding and exploring the residuals
Using r2 to measure model fit
Scoping for outliers
Knowing the Limitations of Your Regression Analysis
Avoiding slipping into cause-and-effect mode
Extrapolation: The ultimate no-no
Sometimes you need more than one variable

Chapter 5: Multiple Regression with Two X Variables

Getting to Know the Multiple Regression Model
Discovering the uses of multiple regression
Looking at the general form of the multiple regression model
Stepping through the analysis
Looking at x’s and y’s
Collecting the Data
Pinpointing Possible Relationships
Making scatterplots
Correlations: Examining the bond
Checking for Multicollinearity
Finding the Best-Fitting Model for Two x Variables
Getting the multiple regression coefficients
Interpreting the coefficients
Testing the coefficients
Predicting y by Using the x Variables
Checking the Fit of the Multiple Regression Model
Noting the conditions
Plotting a plan to check the conditions
Checking the three conditions

Chapter 6: How Can I Miss You If You Won’t Leave? Regression Model Selection

Getting a Kick out of Estimating Punt Distance
Brainstorming variables and collecting data
Examining scatterplots and correlations
Just Like Buying Shoes: The Model Looks Nice, But Does It Fit?
Assessing the fit of multiple regression models
Model selection procedures

Chapter 7: Getting Ahead of the Learning Curve with Nonlinear Regression

Anticipating Nonlinear Regression
Starting Out with Scatterplots
Handling Curves in the Road with Polynomials
Bringing back polynomials
Searching for the best polynomial model
Using a second-degree polynomial to pass the quiz
Assessing the fit of a polynomial model
Making predictions
Going Up? Going Down? Go Exponential!
Recollecting exponential models
Searching for the best exponential model
Spreading secrets at an exponential rate

Chapter 8: Yes, No, Maybe So: Making Predictions by Using Logistic Regression

Understanding a Logistic Regression Model
How is logistic regression different from other regressions?
Using an S-curve to estimate probabilities
Interpreting the coefficients of the logistic regression model
The logistic regression model in action
Carrying Out a Logistic Regression Analysis
Running the analysis in Minitab
Finding the coefficients and making the model
Estimating p
Checking the fit of the model
Fitting the Movie Model

Part III: Analyzing Variance with ANOVA

Chapter 9: Testing Lots of Means? Come On Over to ANOVA!

Comparing Two Means with a t-Test
Evaluating More Means with ANOVA
Spitting seeds: A situation just waiting for ANOVA
Walking through the steps of ANOVA
Checking the Conditions
Verifying independence
Looking for what’s normal
Taking note of spread
Setting Up the Hypotheses
Doing the F-Test
Running ANOVA in Minitab
Breaking down the variance into sums of squares
Locating those mean sums of squares
Figuring the F-statistic
Making conclusions from ANOVA
What’s next?
Checking the Fit of the ANOVA Model

Chapter 10: Sorting Out the Means with Multiple Comparisons

Following Up after ANOVA
Comparing cellphone minutes: An example
Setting the stage for multiple comparison procedures
Pinpointing Differing Means with Fisher and Tukey
Fishing for differences with Fisher’s LSD
Using Fisher’s new and improved LSD
Separating the turkeys with Tukey’s test
Examining the Output to Determine the Analysis
So Many Other Procedures, So Little Time!
Controlling for baloney with the Bonferroni adjustment
Comparing combinations by using Scheffe’s method
Finding out whodunit with Dunnett’s test
Staying cool with Student Newman-Keuls
Duncan’s multiple range test
Going nonparametric with the Kruskal-Wallis test

Chapter 11: Finding Your Way through Two-Way ANOVA

Setting Up the Two-Way ANOVA Model
Determining the treatments
Stepping through the sums of squares
Understanding Interaction Effects
What is interaction, anyway?
Interacting with interaction plots
Testing the Terms in Two-Way ANOVA
Running the Two-Way ANOVA Table
Interpreting the results: Numbers and graphs
Are Whites Whiter in Hot Water? Two-Way ANOVA Investigates

Chapter 12: Regression and ANOVA: Surprise Relatives!

Seeing Regression through the Eyes of Variation
Spotting variability and finding an “x-planation”
Getting results with regression
Assessing the fit of the regression model
Regression and ANOVA: A Meeting of the Models
Comparing sums of squares
Dividing up the degrees of freedom
Bringing regression to the ANOVA table
Relating the F- and t-statistics: The final frontier

Part IV: Building Strong Connections with Chi-Square Tests

Chapter 13: Forming Associations with Two-Way Tables

Breaking Down a Two-Way Table
Organizing data into a two-way table
Filling in the cell counts
Making marginal totals
Breaking Down the Probabilities
Marginal probabilities
Joint probabilities
Conditional probabilities
Trying To Be Independent
Checking for independence between two categories
Checking for independence between two variables
Demystifying Simpson’s Paradox
Experiencing Simpson’s Paradox
Figuring out why Simpson’s Paradox occurs
Keeping one eye open for Simpson’s Paradox

Chapter 14: Being Independent Enough for the Chi-Square Test

The Chi-square Test for Independence
Collecting and organizing the data
Determining the hypotheses
Figuring expected cell counts
Checking the conditions for the test
Calculating the Chi-square test statistic
Finding your results on the Chi-square table
Drawing your conclusions
Putting the Chi-square to the test
Comparing Two Tests for Comparing Two Proportions
Getting reacquainted with the Z-test for two population proportions
Equating Chi-square tests and Z-tests for a two-by-two table

Chapter 15: Using Chi-Square Tests for Goodness-of-Fit (Your Data, Not Your Jeans)

Finding the Goodness-of-Fit Statistic
What’s observed versus what’s expected
Calculating the goodness-of-fit statistic
Interpreting the Goodness-of-Fit Statistic Using a Chi-Square
Checking the conditions before you start
The steps of the Chi-square goodness-of-fit test

Part V: Nonparametric Statistics: Rebels without a Distribution

Chapter 16: Going Nonparametric

Arguing for Nonparametric Statistics
No need to fret if conditions aren’t met
The median’s in the spotlight for a change
So, what’s the catch?
Mastering the Basics of Nonparametric Statistics
Sign
Rank
Signed rank
Rank sum

Chapter 17: All Signs Point to the Sign Test and Signed Rank Test

Reading the Signs: The Sign Test
Testing the median
Estimating the median
Testing matched pairs
Going a Step Further with the Signed Rank Test
A limitation of the sign test
Stepping through the signed rank test
Losing weight with signed ranks

Chapter 18: Pulling Rank with the Rank Sum Test

Conducting the Rank Sum Test
Checking the conditions
Stepping through the test
Stepping up the sample size
Performing a Rank Sum Test: Which Real Estate Agent Sells Homes Faster?
Checking the conditions for this test
Testing the hypotheses

Chapter 19: Do the Kruskal-Wallis and Rank the Sums with the Wilcoxon

Doing the Kruskal-Wallis Test to Compare More than Two Populations
Checking the conditions
Setting up the test
Conducting the test step by step
Pinpointing the Differences: The Wilcoxon Rank Sum Test
Pairing off with pairwise comparisons
Carrying out comparison tests to see who’s different
Examining the medians to see how they’re different

Chapter 20: Pointing Out Correlations with Spearman’s Rank

Pickin’ On Pearson and His Precious Conditions
Scoring with Spearman’s Rank Correlation
Figuring Spearman’s rank correlation
Watching Spearman at work: Relating aptitude to performance

Part VI: The Part of Tens

Chapter 21: Ten Common Errors in Statistical Conclusions

Chapter 22: Ten Ways to Get Ahead by Knowing Statistics

Chapter 23: Ten Cool Jobs That Use Statistics

Appendix: Reference Tables

Index

Introduction to Random Forests for Beginners – free ebook

http://www.kdnuggets.com/2014/03/int...ree-ebook.html PDF

http://info.salford-systems.com/an-i...-for-beginners

Random Forests is one of the most powerful and successful machine learning techniques. This free ebook, An Introduction to Random Forests for Beginners, will help beginners leverage the power of Random Forests.

Random Forests is one of the top 2 methods used by Kaggle competition winners. 

Random Forests is an ensemble learning method for classification and regression that builds many decision trees at training time and combines their output for the final prediction.

This ebook will help beginners leverage the power of multiple alternative analyses, randomization strategies, and ensemble learning with Random Forests. The 70-page ebook includes graphs, examples, and illustrations.

Chapters include: 
What is Random Forests?
Segment and cluster
Suited for wide data
Advantages of Random Forests
Case Study example

Random Forests

Leverage the power of multiple alternative analyses, randomization strategies, and ensemble learning.

Salford Systems and Random Forests

Salford Systems has been working with the world’s leading data mining researchers at UC Berkeley and Stanford since 1990 to deliver best-of-breed machine learning and predictive analytics software and solutions. Our powerful, easy to learn and easy to use tools have been successfully deployed in all areas of data analytics. Applications number in the thousands and include on-line targeted marketing, internet and credit card fraud detection, text analytics, credit risk and insurance risk, large scale retail sales prediction, novel segmentation methods, biological and medical research, and manufacturing quality control.

Random Forests® is a registered trademark of Leo Breiman, Adele Cutler and Salford Systems

What are Random Forests?

Random Forests are one of the most powerful, fully automated, machine learning techniques. With almost no data preparation or modeling expertise, analysts can effortlessly obtain surprisingly effective models. “Random Forests” is an essential component in the modern data scientist’s toolkit and in this brief overview we touch on the essentials of this groundbreaking methodology.

Prerequisites

Random Forests are constructed from decision trees, so we recommend that readers of this brief guide be familiar with that fundamental machine learning technology. If you are not familiar with decision trees, we suggest you visit our website to review some of our introductory materials on CART (Classification and Regression Trees). You do not need to be a master of decision trees or CART to follow this guide, but some basic understanding will make the discussion much easier to follow. You should also know more or less what a predictive model is and how data is typically organized for analysis with such models.

Random Forests was originally developed by UC Berkeley visionary Leo Breiman in a paper he published in 1999, building on a lifetime of influential contributions including the CART decision tree. As he continued to perfect Random Forests, he worked with his longtime collaborator and former Ph.D. student Adele Cutler to develop its final form, including the sophisticated graphics that permit deeper data understanding.

Random Forests is a tool that leverages the power of many decision trees, judicious randomization, and ensemble learning to produce astonishingly accurate predictive models, insightful variable importance rankings, missing value imputations, novel segmentations, and laser-sharp reporting on a record-by-record basis for deep data understanding.

Preliminaries

  • We start with a suitable collection of data including variables we would like to predict or understand and relevant predictors
  • Random Forests can be used to predict continuous variables such as sales of a product on a web site or the loss predicted if an insured files a claim
  • Random Forests can also be used to estimate the probability that a particular outcome occurs
  • Outcomes can be “yes/no” events or one of several possibilities, such as which model of cell phone a customer will buy
  • There could be many possible outcomes, but typically multi-class problems have 8 or fewer
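The two uses above (continuous targets and classification outcomes) can be sketched with scikit-learn, used here purely as an illustrative stand-in for the RandomForests product; all data and names below are made up:

```python
# Sketch: a regressor for continuous targets such as sales volume,
# and a classifier for yes/no or multi-class outcomes.
from sklearn.datasets import make_regression, make_classification
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# continuous target: predict a number
Xr, yr = make_regression(n_samples=200, n_features=5, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xr, yr)

# classification target: predict the probability of each possible outcome
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xc, yc)
print(clf.predict_proba(Xc[:1]))  # one probability per possible outcome
```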

The Essentials

  • Random Forests are collections of decision trees that together produce predictions and deep insights into the structure of data
  • The core building block of a Random Forest is a CART® inspired decision tree.
  • Leo Breiman’s earliest version of the random forest was the “bagger”
  • Imagine drawing a random sample from your main data base and building a decision tree on this random sample
  • This “sample” typically would use half of the available data although it could be a different fraction of the master data base

More Essentials

  • Now repeat the process. Draw a second different random sample and grow a second decision tree.
  • The predictions made by this second decision tree will typically be different (at least a little) than those of the first tree.
  • Continue generating more trees each built on a slightly different sample and generating at least slightly different predictions each time
  • This process could be continued indefinitely but we typically grow 200 to 500 trees
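The bagger loop described above can be sketched in a few lines of Python. This is an illustrative assumption, not the RandomForests implementation: scikit-learn's DecisionTreeClassifier stands in for a CART-style tree, and each tree is trained on a random half-sample of the rows, as in the text.

```python
# Minimal sketch of Breiman's "bagger": many trees, each grown on a
# different random half of the data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

n_trees = 200  # the text suggests 200 to 500 trees in practice
trees = []
for _ in range(n_trees):
    # draw a random half-sample of the rows (the "sample" in the text)
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# each tree gives at least slightly different predictions
preds = np.array([t.predict(X) for t in trees])
print(preds.shape)  # (200, 400): one row of predictions per tree
```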

Predictions

  • Each of our trees will generate its own specific predictions for each record in the data base
  • To combine all of these separate predictions we can use either averaging or voting
  • For predicting an item such as sales volume we would average the predictions made by our trees
  • To predict a classification outcome such as “click/no-click” we can collect counts of votes: how many trees voted “click” versus how many voted “no-click” determines the prediction.
  • For classification we can also produce a predicted probability of each possible outcome based on relative share of votes for each outcome
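Averaging and voting as described above can be shown with a tiny numeric example; the vote and estimate arrays below are invented for illustration:

```python
# Combining per-tree predictions: voting for classification,
# averaging for continuous targets.
import numpy as np

# votes from 5 trees for 3 records of a "click/no-click" problem (1 = click)
votes = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 0],
                  [1, 1, 1]])

# share of trees voting "click" doubles as a predicted probability
p_click = votes.mean(axis=0)
prediction = (p_click >= 0.5).astype(int)  # majority vote per record
print(p_click)      # [0.8 0.6 0.6]
print(prediction)   # [1 1 1]

# for a continuous target (e.g. sales volume) we average instead
tree_estimates = np.array([[102.0, 98.5, 110.2],
                           [101.5, 97.0, 108.8]])
print(tree_estimates.mean(axis=0))  # average across trees per record
```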

Weakness of Bagger

  • The process just described is known as the “bagger”. There are many details we have omitted but we have presented the essentials.
  • The bagger represented a distinct advance in machine learning when it was first introduced in 1994
  • Breiman discovered that the bagger was a good machine learning method but not as accurate as he had hoped for.
  • Analyzing the details of many models he came to the conclusion that the trees in the bagger were too similar to each other
  • His repair was to find a way to make the trees dramatically more different

Putting Randomness into Random Forests

  • Breiman’s key new idea was to introduce randomness not just into the training samples but also into the actual tree growing as well
  • In growing a decision tree we normally conduct exhaustive searches across all possible predictors to find the best possible partition of data in each node of the tree
  • Suppose that instead of always picking the best splitter we picked the splitter at random
  • This would guarantee that different trees would be quite dissimilar to each other

How Random Should a Random Forest Be?

  • At one extreme, if we pick every splitter at random we obtain randomness everywhere in the tree
  • This usually does not perform very well
  • A less extreme method is to first select a subset of candidate predictors at random and then produce the split by selecting the best splitter actually available
  • If we had 1,000 predictors we might select a random set of 30 in each node and then split using the best predictor among the 30 available instead of the best among the full 1,000

More on Random Splitting

  • Beginners often assume that we select a random subset of predictors once at the start of the analysis and then grow the whole tree using this subset
  • This is not how RandomForests work
  • In RandomForests we select a new random subset of predictors in each node of a tree
  • Completely different subset of predictors may be considered in different nodes
  • If the tree grows large then by the end of the process a rather large number of predictors have had a chance to influence the tree

Controlling Degree of Randomness

  • If we always search all predictors in every node of every tree we are building bagger models, which are typically not so impressive in their performance
  • Models will usually improve if we search fewer than all the variables in each node restricting attention to a random subset
  • How many variables to consider is a key control and we need to experiment to learn the best value
  • Breiman suggested starting with the square root of the number of available predictors
  • Allowing just one variable to be searched in each node almost always yields inferior results but often allowing 2 or 3 instead can yield impressive results

How Many Predictors in Each Node?

N Predictors    sqrt    .5*sqrt    2*sqrt    log2
100               10       5          20       6
1,000             31      15.5        62       9
10,000           100      50         200      13
100,000          316     158         632      16
1,000,000      1,000     500       2,000      19

In the table above we show some of the values Breiman and Cutler advised. They suggested four possible rules: the square root of the total number of predictors, one half the square root, twice the square root, and log base 2. We recommend experimenting with some other values. The value chosen remains in effect for the entire forest and remains the same for every node of every tree grown.
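The four rules can be reproduced numerically; this small helper (an illustration, not part of any product) assumes the log column means log base 2 rounded down, which matches the tabulated values:

```python
# Compute the four suggested per-node candidate counts for N predictors.
import math

def candidate_counts(n_predictors):
    s = math.isqrt(n_predictors)  # integer part of the square root
    return {"sqrt": s,
            ".5 sqrt": s / 2,
            "2*sqrt": 2 * s,
            "log2": int(math.log2(n_predictors))}

for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(n, candidate_counts(n))
```

For 1,000 predictors this yields 31, 15.5, 62, and 9, agreeing with the table.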

Random Forests Predictions

  • Predictions will be generated for a forest as we did for the bagger, that is, by averaging or voting
  • If you want you can obtain the prediction made by each tree and save it to a database or spreadsheet
  • You could then create your own custom weighted averages or make use of the variability of the individual tree predictions
  • For example, a record that is predicted to show a sales volume in a relatively narrow range across all trees is less uncertain than one that has the same average prediction but a large variation in individual tree predictions
  • It is easiest to just let RandomForests do the work for you and save final predictions

Out of Bag (OOB) Data

  • If we sample from our available training data before growing a tree then we automatically have holdout data available (for that tree)
  • In Random Forests this holdout data is known as “Out Of Bag” data
  • There is no need to be concerned about the rationale for this terminology at this point
  • Every tree we grow has a different holdout sample associated with it because every tree has a different training sample
  • Alternatively, every record in the master data base will be “in bag” for some trees (used to train that tree) and “out of bag” for other trees (not used to grow the tree)

Testing and Evaluating

  • Keeping track of which trees a specific record was OOB for allows us to easily and effectively evaluate forest performance
  • Suppose that a given record was in-bag for 250 trees and out-of-bag for another 250 trees
  • We could generate predictions for this specific record using just the OOB trees
  • The result would give us an honest assessment of the reliability of the forest since the record was never used to generate any of the 250 trees
  • Always having OOB data means that we can effectively work with relatively small counts of records

More Testing and Evaluation

  • We can use the OOB idea for every record in the data
  • Note that every record is evaluated on its own specific subset of OOB trees and typically no two records would share the identical pattern of in-bag versus out-of-bag trees
  • We could always reserve some additional data as a traditional holdout sample but this is not necessary for Random Forests
  • The idea of OOB testing is an essential component of Random Forest data analytics
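OOB evaluation as described above is available off the shelf in scikit-learn (used here as an assumed stand-in for the RandomForests product): with `oob_score=True`, each record is scored using only the trees for which it was out of bag.

```python
# OOB accuracy: an honest performance estimate with no separate holdout.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            bootstrap=True, random_state=0).fit(X, y)

# each record was OOB for roughly a third of the 500 trees
print(rf.oob_score_)  # OOB accuracy, typically below training accuracy
```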

Testing vs. Scoring

  • For model assessment using OOB data we use a subset of trees (the OOB trees) to make predictions for each record
  • When forecasting or scoring new data we would make use of every tree in the forest as no tree would have been built using the new data
  • Typically this means that scoring yields better performance than indicated by the internal OOB results
  • The reason is that in scoring we can leverage the full forest and thus benefit from averaging the predictions of a much larger number of trees

Random Forests and Segmentation

  • Another essential idea in Random Forests analysis is that of “proximity” or the closeness of one data record to another
  • Consider two records of data selected from our data base. We would like to know how similar these records are to each other
  • Drop the pair of records down each tree and note if the two end up in the same terminal node or not
  • Count the number of times the records “match” and divide by the number of trees tested
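The match-counting idea above can be sketched directly, assuming scikit-learn as an illustrative stand-in: `apply()` returns the terminal-node index of every record in every tree, so two records "match" in a tree when those indices are equal. (This version uses all trees for every pair, not only OOB trees.)

```python
# Build the proximity matrix by counting shared terminal nodes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

leaves = rf.apply(X)  # shape (n_records, n_trees): terminal node per tree
n = len(X)
proximity = np.zeros((n, n))
for i in range(n):
    # fraction of trees in which records i and j land in the same node
    proximity[i] = (leaves == leaves[i]).mean(axis=1)

print(proximity.shape)  # (50, 50); every diagonal entry is 1.0
```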

Proximity Matrix

  • We can count the number of matches found in this way for every pair of records in the data
  • This produces a possibly very large matrix. A 1,000 record data base would produce a 1,000 x 1,000 matrix with 1 million elements
  • Each entry in the matrix displays how close two data records are to each other
  • Need to keep the size of this matrix in mind if we wish to leverage the insights into the data it provides
  • To keep our measurements honest we can be selective in how we use the trees. Instead of using every tree for every pair of records we could use only trees in which one or both records are OOB
  • This does not affect matrix size but affects the reliability of the proximity measures

Characteristics of the Proximity Matrix

  • The Random Forests proximity matrix has some important advantages over traditional near neighbor measurements
  • Random Forests naturally handle mixtures of continuous and categorical data.
  • There is no need to come up with a measurement of nearness that applies to a specific variable. The forest works with all variables together to measure nearness or distance directly
  • Missing values are also no problem as they are handled automatically in tree construction

Proximity Insights

  • Breiman and Cutler made use of the proximity matrix in various ways
  • One use is in the identification of “outliers”
  • An outlier is a data value that is noticeably different from what we would expect given all other relevant information
  • As such, an outlier will be distant from the records to which we would expect it to be close
  • We would expect records that are “events” to be closer to other “events” than to “non-events”
  • Records that do not have any appropriate near neighbors are natural candidates to be outliers
  • RandomForests produces an “outlier score” for each record

Proximity Visualization

  • Ideally we would like to plot the records in our data to reveal clusters and outliers
  • Might also have a cluster of outliers which would be best detected visually
  • In Random Forests we do this by plotting projections of the proximity matrix to a 3D approximation
  • These graphs can suggest how many clusters appear naturally in the data (at least if there are only a few)
  • We show such a graph later in these notes

Missing Values

  • Classic Random Forests offer two approaches to missing values
  • The simple method and default is to just fill in the missing values with overall sample means or most common values for categorical predictors
  • In this approach, for example, all records with missing AGE would be filled in with the same average value
  • While crude, the simple method works surprisingly well due to the enormous amount of randomization and averaging associated with any forest

Proximity and Missing Values

  • The second “advanced” way to deal with missing values involves several repetitions of forest building
  • We start with the simple method and produce the proximity matrix
  • We then replace the simple imputations in the data with new imputations
  • Instead of using unweighted averages to calculate the imputation we weight the data by proximity
  • To impute a missing value for X for a specific record we essentially look at the good values of X among the records closest to the one with a missing value
  • Each record of data could thus obtain a unique imputed value

Missing Imputation

  • The advanced method is actually very common sense
  • Suppose we are missing the age of a specific customer
  • We use the forest to identify how close the record in question is to all other records
  • Impute by producing a weighted average of the ages of other customers but with greatest weight on those customers most “like” the one needing imputation
  • In the upcoming April 2014 version of SPM you can save these imputed values to a new data set
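The proximity-weighted imputation described above amounts to a weighted average. In this illustrative sketch (all numbers invented), `proximity` is the proximity-matrix row for the record with the missing AGE, and `age` holds the observed values:

```python
# Impute a missing age as a proximity-weighted average of observed ages.
import numpy as np

age = np.array([34.0, np.nan, 51.0, 29.0, 62.0])   # record 1 is missing
proximity = np.array([0.10, 1.00, 0.70, 0.05, 0.40])  # row for record 1

observed = ~np.isnan(age)
# closest records (here records 2 and 4) get the greatest weight
imputed = np.average(age[observed], weights=proximity[observed])
print(round(imputed, 2))  # 52.28, pulled toward the most similar records
```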

Variable Importance

  • Random Forests include an innovative method to measure the relative importance of any predictor
  • Method is based on measuring the damage that would be done to our predictive models if we lost access to true values of a given variable
  • To simulate losing access to a predictor we randomly scramble its values in the data. That is, we move the value belonging to a specific row of data to another row
  • We scramble just one predictor at a time and measure the consequential loss in predictive accuracy

Variable Importance Details

  • If we scrambled the values of a variable just once and then measured the damage done to predictive performance we would be reliant on a single randomization
  • In Random Forests we re-scramble the data anew for the predictor being tested in every tree in the forest
  • We therefore free ourselves from dependence on the luck of a single draw. If we re-scramble a predictor 500 times, once for each of 500 trees, the results should be highly reliable
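The scrambling test is close in spirit to what scikit-learn calls permutation importance; this is an assumed analogue for illustration, not the exact RandomForests implementation (which re-scrambles per tree on OOB data):

```python
# Measure importance as the accuracy lost when a predictor is scrambled.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# scramble each predictor 10 times and average the damage done
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # one damage score per predictor
```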

Variable Importance Issues

  • If our data includes several alternative measures of the same concept then scrambling just one of these at a time might result in very little damage to model performance
  • For example, if we have several credit risk scores, we might be fooled into thinking a single one of them is unimportant
  • Repeating the scrambling test separately for each credit score could yield the conclusion that each considered separately is unimportant
  • It may thus be important to eliminate this kind of redundancy in the predictors used before putting too much faith in the importance rankings

Variable Importance: Final Observation

  • The data scrambling approach to measuring variable importance is based on the impact of losing access to information on model performance
  • But a variable is not necessarily unimportant just because we can do well without it
  • Be aware that a predictor will be used by models when it is available, but when it is not, substitute variables can be used instead
  • The “gini” measure is based on the actual role of a predictor and offers an alternative importance assessment based on the role the predictor plays in the data

Bootstrap Resampling

  • In our discussion so far we have suggested that the sampling technique underpinning RandomForests is the drawing of a random 50% of the available data for each tree
  • This style of sampling is very easy to understand and is a reasonable way to develop a Random Forest
  • Technically, RandomForests uses a somewhat more complicated method known as bootstrap resampling
  • However, bootstrap sampling and random half sampling are similar enough that we do not need to delve into the details here
  • Please consult our training materials for further technical details

The Technical Algorithm

  • Let the number of training cases be N, and the number of variables in the classifier be M.
  • We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be less than M, and often much less.
  • Choose a training set for this tree by choosing n times with replacement from all N available training cases (i.e., take a bootstrap sample). Use the rest of the cases to estimate the error of the tree, by predicting their classes (OOB data).
  • For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
  • Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
  • For prediction a new sample is pushed down the tree. It is assigned the label of the training sample in the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the mode vote of all trees is reported as the random forest prediction.
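The steps above can be sketched as a compact from-scratch forest. This is a hedged illustration under stated assumptions: scikit-learn's DecisionTreeClassifier handles the per-node random choice of m variables via `max_features`, trees are left unpruned by default, and mode voting is implemented for binary 0/1 labels only.

```python
# From-scratch forest following the technical algorithm: bootstrap N of N,
# m random candidate variables per node, unpruned trees, mode vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, m=3, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # bootstrap sample: choose N times with replacement from N cases
        idx = rng.choice(len(X), size=len(X), replace=True)
        t = DecisionTreeClassifier(max_features=m,  # m candidates per node
                                   random_state=int(rng.integers(1 << 31)))
        trees.append(t.fit(X[idx], y[idx]))  # fully grown, not pruned
    return trees

def predict_forest(trees, X):
    votes = np.array([t.predict(X) for t in trees])  # (n_trees, n_records)
    # mode vote across trees (binary 0/1 labels assumed)
    return (votes.mean(axis=0) >= 0.5).astype(int)

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
forest = fit_forest(X, y)
print((predict_forest(forest, X) == y).mean())  # training accuracy
```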

Suited for Wide Data

Text analytics, online behavioral prediction, social network analysis, and biomedical research may all have access to tens of thousands of predictors. RandomForests can be ideal for analyzing such data efficiently.

Wide Data

  • Wide data is data with a large number of available predictors numbering in the tens of thousands, hundreds of thousands, or even millions
  • Wide data are often encountered in text mining where each word or phrase found in a corpus of documents is represented by a predictor in the data
  • Wide data is also encountered in social network analysis, on-line behavioral modeling, chemistry, and many types of genetic research
  • Statisticians often refer to data as “wide” if the number of predictors far exceeds the number of data records

Random Forests and Wide Data

  • Suppose we have access to 100,000 predictors and we build a Random Forest searching over 317 randomly selected predictors at each node in each tree
  • In any one node of any tree we reduce the workload of tree building by more than 99%
  • Experience shows that such forests can be predictively accurate while also generating reliable predictor importance rankings
  • Random Forests may be the ideal tool for analysis of wide data for the computational savings alone
  • Random Forests is sometimes used as a predictor selection technique to radically reduce the number of predictors we ultimately need to consider

Wide Shallow Data

  • In wide shallow data we are faced with a great many columns and relatively few rows of data
  • Imagine a data base with 2,000 rows and 500,000 columns
  • Here RandomForests can be effective not only in extracting the relevant predictors but also in clustering
  • The proximity matrix will only be 2000 x 2000 regardless of the number of predictors

Strength and Weaknesses of Random Forests

RandomForests has remarkably few controls to learn and is easily parallelizable. But the size of the model may far exceed the size of the data it is designed to analyze.

Random Forests: Few Controls to Learn & Set

  • RandomForests has very few controls
  • Most important: number of predictors to consider when splitting a node
  • Number of trees to build
  • For classification problems we obtain the best results if we grow each tree to its maximum possible size
  • For predicting continuous targets we might need to limit how small a terminal node may become, effectively limiting the sizes of the trees

Easily Parallelizable

  • Random Forests is an ensemble of independently built decision trees
  • No tree in the ensemble depends in any way on any other tree
  • Therefore, trees could be grown on different computers (just need to work with the same master data)
  • Different trees could also be grown on different cores on the same computer
  • Allows for ultra-fast analysis
  • Scoring can also be parallelized in the same way
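Because the trees are independent, libraries can parallelize both fitting and scoring; in scikit-learn (used here as an assumed illustration) this is a single parameter:

```python
# Grow and score independent trees on all available cores at once.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
# n_jobs=-1: distribute tree building across every core
rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
rf.fit(X, y)
print(rf.n_estimators)  # 300 independently grown trees
```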

Random Forests Weaknesses

  • RandomForests models perform best when the trees are grown to a very large size
  • A crude rule of thumb is that if you have N training records you can expect to grow a tree with N/2 terminal nodes
  • 1 million training records thus tend to generate trees with 500,000 terminal nodes
  • 500 such trees yields 250 million terminal nodes and 500 million nodes in total
  • Every node needs to be managed in a deployed model
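The rule-of-thumb arithmetic above can be checked directly (assumptions: N/2 terminal nodes per fully grown tree, and roughly twice as many total nodes as terminal nodes in a binary tree):

```python
# Sanity-check the model-size arithmetic for a large forest.
n_records = 1_000_000
n_trees = 500

terminal_per_tree = n_records // 2          # crude rule of thumb: N/2
total_terminal = n_trees * terminal_per_tree
total_nodes = 2 * total_terminal            # internal + terminal, approx.

print(total_terminal, total_nodes)  # 250000000 500000000
```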

Therefore…

Random Forests is well suited for the analysis of complex data structures embedded in datasets containing potentially millions of columns but only a moderate number of rows. We recommend other tools such as TreeNet for much larger data bases.

Simple Example

Boston Housing data: predicting above-average housing values

Boston Housing Data

506 census tracts in the greater Boston area, with quality-of-life data and the median housing value for each tract. Usually the subject of regression analysis, but here we create a binary indicator, with tracts whose median values are above 23 coded as 1 and the remainder coded as 0. Predictors include FBI official crime statistics, socioeconomic level of residents, air pollution, distance from major employment centers, zoning for commercial or industrial use, and a few others. We describe this data in detail in our training videos.

Run an RF model

In the screen shot below we set up the RF model selecting the target, the legal predictors and the analysis engine

RandomForestsFigure1.png

RF Controls

The essential controls govern the number of trees, the number of predictors to use in each node, and whether we will devote the potentially large computer resources to postprocess the forest

RandomForestsFigure2.png

Results Summary

Performance Overview (OOB) and access to many detailed reports and graphs

RandomForestsFigure3.png

RandomForestsFigure4.png

Confusion Matrix OOB

More performance measures

RandomForestsFigure5.png

500 trees with 3 predictors chosen at random for possible role as splitter in each node

Variable Importance

Size of the typical house and types of neighbors appear to be most important.

RandomForestsFigure6.png
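As a sketch of how such a ranking is produced, scikit-learn exposes the analogous mean-decrease-in-impurity importances. The data and the predictor names here are placeholders, not the actual Boston variables:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (506 rows, 13 predictors) with placeholder names
X, y = make_classification(n_samples=506, n_features=13,
                           n_informative=6, random_state=0)
names = [f"X{i}" for i in range(13)]

rf = RandomForestClassifier(n_estimators=500, max_features=3,
                            random_state=0).fit(X, y)

# Rank predictors by mean decrease in impurity, largest first
order = np.argsort(rf.feature_importances_)[::-1]
for i in order[:5]:
    print(names[i], round(float(rf.feature_importances_[i]), 3))
```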

Most Likely vs. Not Likely Parallel Coordinate Plots

Contrasting a typical neighborhood with one of the highest probabilities of being above average against a typical neighborhood at the other extreme.

Here we take 25 neighborhoods from each end of the predicted probability (highest 25 and lowest 25) and plot average results.

RandomForestsFigure7.png
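The group construction described above (score all records, take the 25 highest- and 25 lowest-probability cases, average each group's predictors) can be sketched as follows; the two averaged vectors are what a parallel-coordinate plot would draw as the blue and red lines. Synthetic data again stands in for the Boston tracts:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=506, n_features=13, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

proba = rf.predict_proba(X)[:, 1]        # predicted probability of class 1
order = np.argsort(proba)
low25, high25 = order[:25], order[-25:]  # 25 cases at each extreme

# Average predictor profile for each group: the "red" (low-value) and
# "blue" (high-value) lines of the parallel-coordinate plot.
red_line = X[low25].mean(axis=0)
blue_line = X[high25].mean(axis=0)
print(blue_line.shape)  # one averaged value per predictor
```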

Parallel Coordinate Plots

All we are looking for here are large gaps between the blue (high value) and red (low value) lines. The blue line is positioned low for measures of undesirable characteristics, while the red line is positioned high for those characteristics. These graphs serve to suggest the direction of the effect of any predictor. There are three variables which show essentially the same values between the two groups, meaning that on their own they cannot serve to differentiate the groups.

Outliers

A score greater than 10 is considered worthy of attention, and here we see the sorted list of scores with Record ID; 11 records look to be strange by this measure.

RandomForestsFigure8.png

Proximity & Clusters

The graph below plots all 506 data points to display relative distances using the RandomForests proximity measure. Points bunched together are very similar in this metric. Blue points are high-value neighborhoods.

RandomForestsFigure9.png
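Proximity is not exposed directly by most open-source implementations, but it can be computed from leaf memberships: two cases are "close" in proportion to the number of trees that route them into the same terminal node. A sketch (small synthetic data), followed by a 2-D projection of the resulting distances suitable for a plot like the one above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS

X, y = make_classification(n_samples=200, n_features=13, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Proximity: fraction of trees in which two cases land in the same leaf.
leaves = rf.apply(X)                 # shape (n_samples, n_trees)
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Project the 1 - proximity distances into 2-D for plotting.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(1 - prox)
print(coords.shape)  # one (x, y) point per case
```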

Real World Example

The Future of Alaska Project: Forecasting Alaska’s Ecosystem in the 22nd Century

Analytics On a Grand Scale

Alaska over the next 100 years

  • To assist long-term planning related to Alaska’s biological natural resources, researchers at the University of Alaska, led by Professor Falk Huettmann, have built models predicting the influence of climate change on many of Alaska’s plants and animals.
  • An Associate Professor of Wildlife Ecology, Dr. Huettmann runs the EWHALE (Ecological Wildlife Habitat Data Analysis for the Land and Seascape) Lab with the Institute of Arctic Biology, Biology and Wildlife Department at the University of Alaska-Fairbanks (UAF).

Challenge

Objective: forecast how climate change, human activities, natural disasters (floods, wildfires) and cataclysmic events (mass volcanic eruptions, melting polar ice cap) might affect Alaska’s ecosystem over the next 100 years. The team analyzed data on more than 400 species of animals, thousands of plant species, and diverse landscape biomes (arctic tundra, coastal tundra plains, mountain and alpine areas, deciduous forests, boreal forests, coastal rainforests, and the interior).

Some essential questions:

  • Will melting ice create new shipping lanes?
  • Will it be easier to exploit natural gas and oil fields or find new ones?
  • How will a shrinking Arctic affect polar bears, migrating animals, commercial fishing, vegetation (will new arable land appear)?
  • How will it affect global climate?

Dr. Huettmann concentrated on biomes and five key species chosen to be representative of all species. These included migratory caribou, water birds, invasive plant species, and the Alaskan marmot. The latter was selected because climate warming and the melting ice in the upper Arctic are severely constricting the marmot’s natural habitat, and it has no other place to go.

RandomForestsFigure10.png

The solution

Dr. Huettmann elected to use Salford Systems’ RandomForests predictive software. As Dr. Huettmann explains, “RandomForests is extremely well-suited to handle data based on GIS, spatial and temporal data. It delivers the high degree of accuracy and generalization required for our study, something other solutions couldn’t achieve because of the immense size of the database and the statistical interactions involved. It achieves this with amazing speed. Also important, RandomForests software works in conjunction with languages such as Java, Python and with outputs from science programs such as GCMs (General Circulation Models that project global climate change) to provide a single and automated workflow.”

Unlike more restrictive solutions, RandomForests predictive modeling software produces stronger statements, more advanced statistics and better generalizations. For predictive work, this is ideal and creates a new view and opportunities. Dr. Huettmann is not only a wildlife ecologist; he is also a professor who includes students in many of his studies worldwide. As he explains it, “I only have students for a limited time, so when we use predictive modeling software, I want my students working, not struggling to use the software in the remotest corner of the world. The sophisticated GUI interface and software support Salford Systems uses in its predictive modeling software makes their programs very easy to use while delivering superior accuracy and generalization.”

“Our study opened up a completely new approach to modeling, predictions and forecasting projects that are relevant to the wellbeing of mankind,” Dr. Huettmann states. “What we needed was a tool that could handle such incredible complexity. To give you some idea of the challenge, ordinary forecasting software provides accurate results using only a handful of predictors; however, with data mining and machine learning we can now use hundreds of variables. Using most of them and their interactions, we can get a much more accurate prediction.”

The result

Dr. Huettmann’s Future of Alaska report offers land managers, government agencies, communities, businesses, academics and nonprofits a range of possible futures and scenarios, all based on substantial and transparent data. The report provides a unique and useful way to evaluate how climate change and its contributors like carbon emissions, habitat change and consumption is impacting Alaska’s ecosystems. And it will guide those concerned with making better management and sustainability decisions for oceans, wildlife, endangered species.

Customer Support

Professor Huettmann also cites Salford Systems’ customer support. “They helped us install and set up the program to achieve progress,” Dr. Huettmann concludes. “Someone was always available to answer questions, which is vitally important when working with students…and when the professor runs out of answers. Their customer support and developer team includes some of the most knowledgeable people I’ve ever had the privilege to work with.”

Why Salford Systems?

  • With several commercial and open source implementations of Random Forests available, why should you go with Salford Systems?
  • One compelling reason is that you will obtain better results
  • Greater accuracy
  • More reliable variable importance rankings
  • Built-in modeling automation
  • Built-in parallel processing

Unique to Salford

  • Salford Systems jointly owns the Random Forests trademark and intellectual property
  • Our implementation is based on source code Leo Breiman provided only to Salford Systems
  • Salford has continued working on RandomForests with co-creator Adele Cutler to refine and perfect the methodology
  • Academic research confirms the superiority of Salford RandomForests

Measure the ROI of Random Forests

Find out whether your investment in Random Forests is paying off. Sign up for Salford Systems’ 30-day trial to access some insightful analytics. http://www.salford-systems/home/downloadspm

Connecting Alaska Landscapes Into the Future

Source: https://www.snap.uaf.edu/attachments...0-complete.pdf (PDF) (Word)

Cover Page

SNAP2010CoverPage.png

Results from an interagency climate modeling, land management and conservation project

FINAL REPORT • AUGUST 2010

Karen Murphy, USFWS

Falk Huettmann, UAF

Nancy Fresco, UA SNAP program

John Morton, USFWS

Acknowledgments

This study was made possible by the U.S. Fish and Wildlife Service.

Region 7, National Wildlife Refuges, Fisheries & Ecological Services, Migratory Birds and State Programs and Office of Subsistence Management all contributed funding to this project in FY08 and FY09. We would also like to acknowledge the following individuals not listed as project participants, as well as the following organizations:

Michael Lindgren, programmer, EWHALE
Kim Jochum, programmer, EWHALE
Scott Rupp, Director, SNAP
Todd Logan, USFWS
LaVerne Smith, USFWS
Peter Probasco, USFWS
The Alaska Department of Fish and Game
The University of Alaska
UAF School of Natural Resources & Agricultural Sciences
The U.S. Bureau of Land Management
The National Park Service
U.S. Geological Survey
U.S. Forest Service
Defenders of Wildlife
The Nature Conservancy
The Wilderness Society
Alaska Natural Heritage Program
Alaska GAP
Alaska Audubon

The University of Alaska Fairbanks is an affirmative action/equal opportunity employer and educational institution. Cover photos, left to right: marmot, photo by Bill Hickey, USF&WS; trumpeter swans, Ryan Hagerty, USF&WS; reed canary grass, photo courtesy of Cooperative Extension Service; caribou, photo by Jon Nickles, USF&WS.

SNAP2010Figure0.png

Executive Summary

The Connecting Alaska Landscapes into the Future project (Connectivity Project) was a collaborative proof-of-concept effort that used selected species to identify areas of Alaska that may become important in maintaining landscape-level connectivity, given climate change. Project results and data presented in this preliminary report are intended to serve as a framework for research and planning by stakeholders with an interest in ecosystem conditions that may change and an interest in ecological and socioeconomic sustainability.

The Connectivity Project was co-led by Karen Murphy and John Morton from the U.S. Fish and Wildlife Service (USFWS), Nancy Fresco from the University of Alaska (UA) statewide Scenarios Network for Alaska Planning (SNAP), and Falk Huettmann from the University of Alaska Fairbanks (UAF) Ecological Wildlife Habitat Data Analysis for the Land and Seascape Laboratory (EWHALE) lab. The project included a series of three workshops conducted in 2008 and 2009, with participants from state and federal agencies and non-profit organizations.

Using climate projection data from SNAP and input from project leaders and participants, we modeled projected changes in statewide biomes and potential habitat for four species selected by the group. The primary models used were SNAP climate models based on downscaled global projections, which provided baseline climate data and future projections; Random Forests™, which we used to match current climate-biome relationships with future ones; and Marxan software, which we used for examining landscape linkages. The Connectivity Project modeled projected shifts in broadly categorized species assemblages (biomes) based on existing land cover, current climatic conditions, and projected climate change. The Alaska biomes used in this project (Arctic, Western Tundra, Alaska Boreal, Boreal Transition, North Pacific Maritime, and Aleutian Islands) were adapted from the unified ecoregions of Alaska (Nowacki et al. 2001). Because climate change may introduce new biomes into the state, we also considered six western Canadian ecozones. We used Random Forests™ to model projected spatial shifts in  potential biomes, based on SNAP projections for mean temperature and precipitation for June and December for the decades 2000–2009, 2030–2039, 2060–2069, and 2090–2099, which defines three time steps for modeling connectivity from potential ecological change.

While it should be noted that “potential biomes” (species assemblages that might be expected to occur based on linkages with prevailing climate conditions) are not the same as actual biomes, results suggest that approximately 60% of Alaska is anticipated to experience a geographic shift of present biomes during the twenty-first century. Our initial models predict that by the end of the century, the Arctic and the Alaska Boreal will each diminish by approximately 69% and Western Tundra, by 54%—all but disappearing in its original location—in favor of the Montane Cordillera and Boreal Transition. In addition, much of southeast Alaska may be in the process of shifting from North Pacific Maritime to Canadian Pacific Maritime—again, as constrained by functional barriers. Western Tundra may be the most vulnerable biome, with the least resilience for conservation purposes even without considering the potential for significant loss due to sea level rise.

To assess connectivity (defined as the degree to which existing biomes are linked to their future potential locations), we identified areas that the models predicted would remain part of the same biome over the course of this century, and defined these areas as possible refugia. We identified areas of very high and very low Normalized Difference Vegetation Index (NDVI), as proxy measures of biodiversity and endemism for the proof-of-concept modeling, without the constraint of ecological mechanisms that might influence the rate of biome or species distribution change between time steps. We resampled climate forecasts from 2 km to 5 km resolution and used Marxan to assign connectivity scores based on status as refugia and NDVI values. Finally, we sketched in connections between areas that ranked highest. To the best of our knowledge, this is the first time that linkages have been developed based on both spatial and temporal analyses.

The Connectivity Project also examined potential impacts on four selected species: barren-ground caribou (Rangifer tarandus granti), Alaska marmots (Marmota broweri), trumpeter swans (Cygnus buccinator), and reed canary grass (Phalaris arundinacea). The Alaska marmot model incorporated terrain roughness (based on digital elevation model) and predicted severe decreases in the potential range of this species, a Beringia relict, although further information is needed on its ecology and distribution. Reed canary grass, an invasive exotic, is projected to continue its spread along river and road systems as climate warms and road networks grow, and may spread rapidly in the Seward Peninsula. Trumpeter swans are expected to move westward and northward as their preferred boreal habitat shifts. Modeling caribou distributions proved problematic because of their broad ecological plasticity, but it helped to elucidate new avenues for future research.

Based on our preliminary results, we recommend re-running models with broader data inputs; validation of our forecasts with field studies to document causal mechanisms; seeking further input from researchers and land managers, including Native Corporations and villages; repeating these analyses using more detailed land-cover maps; and modeling additional species.

Introduction

Scope and Purpose of the Report

The Connecting Alaska Landscapes into the Future project (Connectivity Project) was a complex partner-driven, consensus-driven effort. Our goals were to:

  • Identify lands and waters in Alaska that likely serve as landscape-level migration corridors for shifts in species distribution currently and into the future given climate change; and
  • Identify conservation strategies with our partners that will help maintain landscape-level connectivity by focusing conservation efforts, minimizing redundant research and monitoring efforts, and sharing data and information for these areas.

The results are intended to serve as a framework for a planning tool for land managers, wildlife managers, local planners, nonprofits, and businesses with an interest in future ecosystem conditions, sustainability, and landscape-level conservation. Data presented in this report are preliminary and are not intended to be prescriptive, but rather to serve as a guide for planning and as a starting point for synergy and further research.

Terms and Definitions

The following terms, used throughout the report, are defined as follows:

Biodiversity: Biodiversity refers to a composite measure of both species’ richness (total number of species) and evenness (relative abundance of all species).

Biome: A biome is a broadly defined species assemblage. For the purposes of this project, Alaska biomes were based on those defined by Nowacki et al. (2001), and Canadian biomes were based on ecozones defined by Environment Canada. Biomes are defined in detail in Appendix C.

Climate envelope: A climate envelope is the range of conditions (in this study, mean temperatures and precipitation during June and December) in which a biome, or species, is most likely to occur, based on past climate data and current biomes.

Connectivity: For this project, connectivity is the degree to which existing biomes are linked to their future potential locations, as defined by the best fit with projected future climate data.

Conservation: Conservation is characterized by land use in which the goal of maintaining ecosystem function predominates over large-scale resource extraction or landscape alterations by humans. However, in the face of climate change, conservation may not mean maintenance of the status quo, but rather successful adaptation to changing conditions.

Corridor: A corridor is an unbroken connection between habitat patches for a species or species assemblage, allowing for either seasonal migration or dispersal.

Ecoregion: An ecoregion is a smaller and more precisely defined region within an ecozone. Canadian climate data for this project were available at the level of mean historical values for ecoregions.

Ecozone: An ecozone is a Canadian region broadly defined by dominant vegetation and geophysical attributes. In this project, ecozones were considered as similar to Alaska biomes (see Appendix C).

NDVI: Normalized Difference Vegetation Index (NDVI) is a numerical index that indicates the greenness of a satellite image by comparing the reflection of photosynthetically active radiation and near-infrared radiation. NDVI can be used as a proxy measure for landscape productivity.

Novel biome: This term is used to describe future conditions, where a potential biome exists in a location outside of the current distribution for that biome.

Potential biome: A potential biome is the biome that best matches projected future climate conditions for a region, based on the linkage between past climate and existing biomes. Potential biomes may differ from actual future biomes for a number of reasons, including differences in soil type and hydrology, or lag time between climate change and seed dispersal and other mechanisms of ecosystem change.

Refugia: Refugia are areas in which the potential future biome is projected to be the same as the existing biome through all time steps from 2000 to 2100.

Resilience: In general, resilience is the ability to avoid irreversible change despite ongoing perturbations. In this report, resilience is used as a measure of the ability of a region to remain within the broad climate envelope that defined its biome over the course of this century.

Project Background

Understanding how climate change will affect our ability to sustain biodiversity and traditional subsistence into the future is a common challenge faced by federal, state, Native, and private land managers interested in conserving Alaska’s natural resources for future generations. Identifying lands that would allow a shift in distribution of renewable resources and their continued sustainable use will help prioritize future conservation efforts. These important areas for connectivity can then be considered in conjunction with ongoing statewide, national, and international conservation initiatives (e.g., the Alaska Comprehensive Wildlife Conservation Plan, Yukon to Yellowstone Conservation Initiative, Western Boreal Conservation Initiative, and Conservation of Arctic Flora and Fauna).

Work began in 2008 with the establishment of a North and West Alaska Cooperative Ecosystem Studies Unit (CESU) agreement between USFWS and researchers from SNAP and EWHALE. Scenarios Network for Alaska Planning is a statewide research group that creates and interprets projections of climate change based on downscaled global models. Its goals are to apply new or existing research results to meet stakeholders’ requests for specific information, provide timely access to scenarios of future conditions in Alaska for more effective planning, and communicate information and assumptions to stakeholders. The EWHALE lab in UAF’s Institute of Arctic Biology, Biology and Wildlife Department, specializes in spatial modeling of landscapes and populations for conservation planning and analysis.

The Connectivity Project was co-led by Karen Murphy and John Morton from USFWS, Nancy Fresco from SNAP, and Falk Huettmann from EWHALE. The project included three workshops in 2008 and 2009, with participants from state, federal, and non-profit agencies (Appendix A). The first workshop allowed participants to understand the overall question and underlying data, and jointly refine the goals of the project. At the second workshop, preliminary modeling results were discussed, and suggestions were made for changes and improvements. At the final workshop, participants helped define the final outputs of the project, including this report and its data and deliverables.

Between workshops, project leaders and participants conducted a literature review to assess similar and related projects from the U.S., Canada, and other regions (Appendix B). Preliminary models were developed and presented at the workshops for review and refinement.

To define future connectivity, we gathered data on existing conditions and linked these to models of future conditions. To assess connectivity at the statewide level, we developed a structured framework process for modeling future species distributions and connectivity, with both reactive and anticipatory adaptation approaches in mind. Using downscaled climate projection data from SNAP, and input from project leaders and participants, Dr. Huettmann created statistical models explicit in space and time for projected changes in statewide biomes that represent potential habitat for key species identified by the group. With feedback from project leaders and participants, he refined these models and used them as the basis for creating maps of potential future statewide connectivity by use of Marxan (Ardon and Possingham 2008) and other methods.

All results presented in this report are based on data that were available during 2008 and 2009. Models can be powerful tools, and can be invaluable when used to deal with situations where no direct data are available for wide areas, such as projection of future conditions in Alaska and parts of western Canada. However, it should always be remembered that uncertainty is inherent in all modeling efforts and that models are updated toward better predictions. Whenever possible, we have attempted to include the best available models and data and define both the type and magnitude of uncertainty in our results. Relevant and documented data products are available for an assessment of our modeling forecasts, which in turn require biological validation with field trials in a separate process.

Through this project, we identified modeling opportunities that are available to create decision tools, and learned by trial and error about the strengths and weaknesses of the straightforward and sometimes simple modeling methods that were employed. In this report, we identify the lessons learned and highlight areas where further research or improved data would enhance these models.

Models Used in the Connectivity Project

MODELING CLIMATE CHANGE: SNAP CLIMATE MODELS

Scenarios Network for Alaska Planning (SNAP) climate projections are based on downscaled regional Global Circulation Models (GCMs) from the Intergovernmental Panel on Climate Change (IPCC). The IPCC used fifteen different GCMs when preparing its Fourth Assessment Report released in 2007. SNAP analyzed how well each model predicted monthly mean values for three different climate variables over four overlapping northern regions for the period from 1958 to 2000. Model selection and downscaling methods are described in Walsh et al. 2008. For this project, we used mean (composite) outputs from the five models that provided the most accurate overall results (see Technical Addendum I). We relied primarily on model runs based on midrange (A1B) predictions of greenhouse gas emissions, as defined by the IPCC. It is important to note that the more pessimistic A2 scenario or even more severe scenarios are now considered likely, given recent climate and emission trends (Nakicenovic et al. 2001, Raupach et al. 2007). The GCM results were scaled down from 2.5° latitude/longitude blocks to 2 km resolution, using data from Alaskan weather stations and PRISM (Parameter-elevation Regressions on Independent Slopes Model). PRISM uses point data, a digital elevation model, and other spatial data sets to generate gridded estimates of monthly, yearly, and event-based climatic parameters, such as precipitation, temperature, and dew point (Walsh et al. 2008). SNAP models have been assessed using backcasting (http://www.snap.uaf.edu/downloads/va...climate-models) and comparison to historical conditions, and have proven to be robust in predicting overall climate trends.

In this first iteration, exploring the relationship between climate and land cover, we used climate data for June and December in order to capture estimates of seasonal climatology across Alaska. Clearly, the selection of different months of data to incorporate into the models would produce differing results, and a full application, using all 12 months of data to determine which are the greatest drivers, would be required in a robust application of these models.

For further information on SNAP data and SNAP models, please see http://www.snap.uaf.edu.

MODELING CLIMATE CHANGE ENVELOPES: RANDOM FORESTS™

We employed the Random Forests™ modeling algorithm to identify probable relationships between historic temperature and precipitation data and known distributions for species and biomes across Alaska. These relationships were then used to predict future species and biome distribution based on projected temperature and precipitation. This approach, known as ensemble modeling, takes the average of the outputs of multiple individual models, thus generally providing more robust predictions (Breiman 1998, 2001). More specifically, Random Forests™ (http://www.stat.berkeley.edu/~breima...ts/cc_home.htm) represents an averaged collection of independent, optimized Classification and Regression Trees (CARTs). This technique is used successfully in pattern recognition, in data mining, and for predictions (Hegel et al. 2010). The Tree/CART family of models, including Random Forests™, uses optimized binary recursive partitioning to calculate the best fit of each data point. Random Forests™ employs the concept of “bagging,” based on a subsample (columns and rows) of the data, where multiple trees are created for each of those data sets and each tree is grown on about 63% of the original training data, due to the bootstrap sampling process chosen. Thus, 37% of the data are available to test any single tree, which allows Random Forests™ to be self-calibrating and avoids over-fitting. Because its calculations are fast, multivariate, and programmable, Random Forests™ is convenient to use and tends to provide a better fit than linear algorithms and similar methods. Random Forests™ deals well with interactions between predictive variables, intrinsically accounting for their relative power, and ranks predictors (Breiman 2001; Hegel et al. 2010). This modeling algorithm also does well when handling “noisy” data, that is, data with high intrinsic variability or missing values (Craig and Huettmann 2008). The model algorithm, which is widely used and well tested, has previously been applied to climate change research and other studies (Climate Change Atlas 2010, Lawler 2001).
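The 63%/37% split quoted above is a property of bootstrap sampling: the chance that a given row appears in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632 as n grows. A quick simulation illustrates it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # pretend training set size

# Draw one bootstrap sample: n row indices, with replacement
sample = rng.integers(0, n, size=n)
in_bag = np.unique(sample).size / n  # fraction of distinct rows drawn

print(round(in_bag, 3))           # close to 0.632; the rest is out-of-bag
print(round(1 - np.exp(-1), 3))   # 0.632, the theoretical limit 1 - 1/e
```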

MODELING CONNECTIVITY: MARXAN

We used the Marxan model (Moilanen et al. 2009) to evaluate which regions of the state are likely to be most important in the coming decades for conservation of functioning biomes, based on our identification of areas that may be biome refugia and areas with potentially high endemism or potentially high productivity.

Marxan uses multiple interactive geographic information system (GIS) layers to provide output, based on user-defined requirements (targets and penalties). Marxan was created to aid systematic reserve design and prioritization, for conservation planning. The model uses stochastic optimization routines based on simulated annealing, a technique in which the model attempts to satisfy all requirements in a multivariate spatial context by approaching more optimal solutions stepwise, using a cost-benefit approach with weighted factors. Although Marxan’s algorithms do not generate a single ideal solution to a given optimization problem, the model can generate recommendations for spatial reserve systems that achieve particular biodiversity-representation goals with high mathematical optimality. Marxan is a commonly used model for prioritizing conservation areas.

Important Factors and Models That Were NOT Incorporated

SEA LEVEL RISE

Alaska is expected to see significant change associated with rising sea levels over the next century. Unfortunately, despite local attempts, there is no statewide sea level rise model available for Alaska at this time. Over the next century, global sea levels are projected to rise between 0.6–1.9 feet (the range of estimates across all the IPCC scenarios) (IPCC 2007). However, in some localities of southeast and southcentral Alaska, the land surface is actually rising as a result of the retreat and loss of glacial ice (isostatic rebound) and, secondarily, as a result of active tectonic deformation. Creating a sea level rise model first requires both improved digital elevation models and improved coastline maps. Because coastal areas are likely to experience change associated with ocean waters as well as potential shifts in climate, the preliminary results presented in this report should be viewed as conservative representations of change along coastal Alaska.

PERMAFROST CHANGE

Permafrost is one of the primary ecosystem drivers for much of Alaska. Researchers, as well as everyday Alaska residents, have been documenting changes in permafrost stability in recent decades. SNAP, in collaboration with Dr. Vladimir Romanovsky and Dr. Sergei Marchenko of the UAF Geophysical Institute, is currently revising the permafrost map for Alaska. They are also creating a predictive model of how permafrost responds to climate changes, including an assessment of permafrost temperatures at varying depths statewide throughout the period spanned by SNAP data. Once this model is completed, it could be incorporated into the process described by the Connectivity Project to improve the ecological representation for potential biome shifts and refugia. In this report, we refer to refugia as areas that are expected to have more stable climatic patterns through the next century.

ALFRESCO

Boreal ALFRESCO (Alaska Frame-based Ecosystem Code) was developed by Dr. Scott Rupp at UAF. The model simulates the timing and location of fire disturbance and the timing of vegetation transitions in response to fire. Climate variables are among the key  drivers  of  ALFRESCO.  Thus,  when  SNAP  climate  projections  are  used  as inputs, ALFRESCO can be used to create projections in changes in fire dynamics with changing climate. ALFRESCO operates on an annual time step, in a landscape composed of 1 × 1 km pixels. The model currently simulates four major subarctic/boreal ecosystem types: upland tundra, black spruce forest, white spruce forest, and deciduous forest.

For further information on the ALFRESCO model, see Rupp et al. (2000) in Appendix B: Additional Reading and References and http://jfsp.nifc.gov/documents/ALFRE...uide_ver03.pdf.

Part I: Modeling Shifts in Ecosystem/Vegetation

Landscape Classification

To provide a landscape-level approach to connectivity, we modeled projected shifts in potential ecosystem type (biomes) based on existing land cover, current climatic conditions, and projected climate change. Biomes provide a good “container” for biological units of a landscape. The biomes used were based on the unified ecoregions of Alaska (Nowacki et al. 2001). We also defined an additional biome, the Alpine biome, based on elevation and latitude. It should be noted that Alpine is a special biome, because it is linked (“anchored”) specifically to higher altitudes and thus cannot move freely across the landscape as boreal forest can, for instance. Due to limited statewide data on where alpine vegetation occurs, our model proved unreliable, and we did not use this biome in any of our analyses (Technical Addendum II: Identifying the Alpine Biome).

Future projections are based on potential climate-linked conditions, and do not take into account mechanistic biological explanations such as ease of seed dispersal or species migration. Projections are based solely upon mean summer (June) and winter (December) temperature and precipitation to characterize the range in site-specific climate throughout the year. However, these are powerful and parsimonious indicators, because many variables associated with vegetation type, land cover, and altitude covary with temperature and precipitation. For example, the presence or absence of permafrost, the depth of the active layer, hydrologic conditions, potential evapotranspiration, growing season length, and fire frequency are all strongly linked to climate. Thus, these variables can be expected to change as climate changes, although lag times might differ. While using climate data alone worked well in Random Forests™, we are fully aware that reality is more complex. In many cases, due to the interactions of geophysical and ecological variables, true landscape change may be more extreme or less extreme than that predicted by our models.

Alaska’s ecosystems and vegetation have been classified in many different ways, including fine-scale maps of vegetation types and more broadly defined biomes. We considered several options, including the unified ecoregions defined by Nowacki et al. (2001); the LANDFIRE classification scheme; The Nature Conservancy (TNC) biomes; and National Land Cover Data (NLCD) maps from the United States Geological Survey (USGS).

The LANDFIRE data were not yet complete for Alaska, although we did use drafts from two mapping units (70 and 71) to test the applicability of these data for this project (Technical Addendum III: LANDFIRE Mapping of Units 70 and 71). The Nature Conservancy biomes (Gonzalez et al. 2005), created for the North American continent, were deemed to have unrealistic classifications for part of the state. National Land Cover Data land cover definitions were sometimes too broad and ill-defined, relying on classifications such as “all coniferous” rather than identifying dominant species that change with latitude. Thus, the group decided to focus on the Nowacki ecoregions, with some minor modifications (Figure 1).

Figure 1 Ecological groups, as defined by Nowacki et al. (2001). See Appendix C for full descriptions of each biome

SNAP2010Figure1.png

The Nowacki classification for Alaska includes nine groups within three broad categories. Based on these groups, we made some combinations and added the Alpine category to yield seven biomes defined by their dominant vegetation and topography: Alpine, Arctic, Western Tundra, Alaska Boreal, Boreal Transition, North Pacific Maritime, and Aleutian Islands (Appendix C).

Since climate and vegetation are not constrained by political boundaries, model data included biomes from nearby regions of Canada that may shift into Alaska from the south and east because of changing climate. This inclusion required an estimation of regions similar in scale and resolution to the six Alaska biomes. We used Canadian ecozones for this purpose (Figure 2). Ecozones, which are defined in a manner similar to biomes (http://www.ec.gc.ca/soer-ree/English...ework/NarDesc/), are based on spatial differences in a combination of landscape characteristics. For the purposes of this modeling exercise, we considered six ecozones: Montane Cordillera, Taiga Cordillera, Boreal Cordillera, Taiga Plains, Boreal Plains, and Pacific Maritime (Appendix C). It should be noted that some Canadian ecozones have significant overlap with Alaska biomes.

Figure 2 Alaska biomes and Canadian ecoregions

Twelve ecological zones were classified using past climate data, allowing Canadian vegetation groups to shift into Alaska in model runs depicting future conditions. See Appendix C for full descriptions of each biome and ecoregion.

SNAP2010Figure2.png

Modeling the Effects of Climate Change on Biomes

We used Random Forests™ to model projected spatial shift in potential biomes for three future periods. Our methods are summarized below, and a detailed description of the methodology used is provided in Technical Addendum IV: Defining Biomes and Modeling Biome Shifts in Random Forests™.

Climate data inputs were all based on the midrange (A1B) emissions scenario for SNAP’s Composite GCM, and included mean monthly temperature and precipitation for the months of June and December for the decades 2000–2009, 2030–2039, 2060–2069, and 2090–2099, which define the current decade and three future time steps in the twenty-first century. Using decadal rather than annual data tends to smooth the interannual variability that occurs both in the real world and in SNAP models, and thus constrains variation in model output; for our purposes, mean values are of more interest than variability. Note, however, that decadal means also smooth out the “outer edges” of the climate data and the ranges, leading to underestimates of the potential variability.
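The smoothing effect of decadal averaging is easy to demonstrate. The sketch below uses invented annual June temperatures for a single grid cell (not actual SNAP output):

```python
from statistics import mean, pstdev

# Hypothetical annual mean June temperatures (deg C) for one grid cell.
annual = [11.2, 9.8, 12.5, 10.1, 13.0, 9.5, 11.9, 10.7, 12.2, 10.4,   # 2000-2009
          12.1, 10.9, 13.4, 11.0, 13.8, 10.2, 12.7, 11.5, 13.1, 11.3] # 2010-2019

# Decadal means collapse ten noisy annual values into one smoothed value per decade.
decadal = [mean(annual[i:i + 10]) for i in range(0, len(annual), 10)]

print(decadal)        # one value per decade
print(pstdev(annual)) # interannual spread
print(pstdev(decadal))# much smaller spread: the "outer edges" are averaged away
```

The reduced spread of the decadal series is exactly the trade-off described above: a more stable mean at the cost of underestimating potential variability.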

In our early model runs, we used only Alaska biomes. The group decided, however, that our results would be more ecologically defensible if we allowed for shifts across the international boundary with Canada. Thus, our final runs all used twelve land classifications, including the six Alaska biomes and the six Canadian ecozones described above, which allows for Canadian biomes to enter Alaska if the models find them to be better matches for future climate conditions.

Random Forests™ used the closest available approximation of existing conditions as training data. For Alaska, this meant using SNAP model outputs for the current decade, 2000–2009 (note that these data are simulated, and not directly linked to weather station data). Simulated, rather than actual, climate data were used to represent existing conditions in part because actual data for the 2000–2009 decade were not yet available during the modeling for this project in 2008 and 2009. Each 2 km grid cell on the statewide map was defined by four parameters (in addition to biome): mean decadal June temperature and precipitation, and mean decadal December temperature and precipitation (Figure 3). For the Canadian ecoregions, we used climate normals available from the Government of Canada (based on 1971–2000 weather station data) for each ecozone within the larger ecoregions (NCDIA 2009). To normalize the map resolution for purposes of using the Random Forests™ model, and to match the temperature and precipitation data that we used for Alaska, we assigned to each 2 km grid cell within each Canadian ecozone the climate data for the applicable ecoregion. For further information on the linking of Canadian and Alaskan data, see Technical Addendum IV.

Figure 3 Conceptual illustration of how the Random Forests™ model was used to create climate envelopes for different modeling subjects

SNAP2010Figure3.png

Mean decadal temperature and precipitation data were taken from the Composite A1B climate model for the 2000– 2009 decade to define the framework for predicting climate envelopes used in the future. Random Forests™ incorporates interactions in addition to the additive values from each layer included in the model.

Figure 4 Current biome types as predicted by SNAP climate data

This map shows the best fit for each 2 km pixel in Alaska for 2000–2009 climate projection data, based on climate envelopes for pre-defined biomes and ecoregions in Alaska and Canada.

SNAP2010Figure4.png

Figure 5 Projected potential biomes for 2030–2039

The Boreal Transition biome is encroaching northward and westward, and the Arctic biome has started to shrink with Taiga Cordillera coming from the east.

SNAP2010Figure5.png

Figure 6 Projected potential biomes for 2060–2069

Marked northward shifts are now observed, with some Canadian biomes moving in from the east. Note that although Montane Cordillera appears in broad regions of western Alaska, this is likely a potential rather than an actual change; it would be difficult for the species of this ecoregion to effect such a large spatial shift in this time step, or to undergo such a large change in physiography.

SNAP2010Figure6.png

Figure 7 Projected potential biomes for 2090–2099

The Arctic, Alaska Boreal, and Western Tundra biomes are all greatly diminished, in favor of the Montane Cordillera and Boreal Transition. In addition, nearly half of southeast Alaska has shifted from North Pacific Maritime to the Canadian Pacific Maritime.

SNAP2010Figure7.png

Random Forests™ used simultaneous analyses of these data to define the current climate envelope of each biome. Climate envelopes are the range of climate conditions that describe the current distribution for each biome. We used the modeled GCM data for the current decade 2000–2009 to define the current climate envelope for each biome’s distribution. We created a 5 km lattice of all climate data (resampling from the original 2 km resolution) to make our data sets more manageable in size. The model then determined which biome climate envelope was the best fit for each set of projected future climatic conditions. The model output was a new map for each decade, depicting potential biome shift based on the best climatic envelope fit for each grid cell (Figures 4–7). Note that the map of the current decade (Figure 4) redefines biomes and ecoregions under current climate conditions to show slight existing irregularities in the delineation of biomes.
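The envelope-matching step can be illustrated with a deliberately simplified stand-in for Random Forests™: assign each grid cell to the biome whose envelope centroid is nearest in the four-variable climate space. All centroids and cell values below are hypothetical, and a real envelope captures a range of conditions rather than a single centroid:

```python
import math

# Hypothetical envelope centroids: (June temp, June precip, Dec temp, Dec precip).
envelopes = {
    "Arctic":            (6.0, 20.0, -25.0, 10.0),
    "Alaska Boreal":     (13.0, 40.0, -20.0, 15.0),
    "Boreal Transition": (14.0, 55.0, -10.0, 25.0),
}

def best_fit(cell):
    """Return the biome whose envelope centroid is closest to this cell's projected climate."""
    return min(envelopes, key=lambda biome: math.dist(envelopes[biome], cell))

# A cell whose projected future climate warms toward boreal conditions:
future_cell = (12.5, 42.0, -18.0, 14.0)
print(best_fit(future_cell))  # "Alaska Boreal"
```

Repeating this assignment for every grid cell at each future decade yields a map of potential biome shift, analogous to Figures 4–7.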

Figure 8 Box plots comparing climate envelopes for biomes/ecoregions

Each plot shows mean values, one standard deviation, and maximum and minimum values for one model input. Units are millimeters of precipitation and degrees Celsius. Note that the smaller ranges for Canadian ecoregions are indicative of the fact that these data sets were at much coarser resolution than Alaska data sets. Max and min values far from the mean often represent mountainous zones or other “non-typical” outliers within the region.

SNAP2010Figure8.png

As shown in Figure 8, some biomes and ecoregions have temperature envelopes that are similar to others, while some biomes and ecoregions appear to be more distinct. This disparity in variability is important to note on two levels. First, it demonstrates that slight differences in climate are associated with radically different ecosystems (e.g., grassland in the Aleutian Islands vs. coastal rainforest in the North Pacific Maritime of southeast Alaska), and that climate might not be the only parsimonious determinant of biomes. Second, it outlines an important avenue for future research: an assessment of the similarities and differences between Canadian ecozones (e.g., Boreal Cordillera) and Alaska biomes (e.g., Alaska Boreal) that might be expected to have similar climates. Thus, some zones might be effectively combined or repartitioned in future model iterations. Such repartitioning would have ramifications in our assessment of biome refugia, landscape resilience, and landscape connectivity, as described later in this report. Why would changing (repartitioning) the initial boundaries of biomes change the results from the models? Climate envelopes are derived based on climate variables and interactions across a defined geographic area. If that geographic area is changed, the climate envelope is likely to change too. That change is carried forward into the results, as the “new” climate envelopes are matched with the modeled climate data across Alaska.

Results show marked shifts in potential biomes, with the boreal and arctic zones shifting northward and diminishing in size. By 2039, our model projects that the Boreal Transition biome climate envelope will encroach northward and westward, and the Arctic biome will shrink by 10%. By 2069, projections indicate marked northward shifts, almost complete change in western coastal regions, and some Canadian biomes moving in from the east. It is important to note that these shifts represent potential rather than actual biome shift, since in many cases it is unconfirmed that seed dispersal, soil formation, and other functional changes could occur at the same rate as climate change. For example, although Montane Cordillera appears in broad regions of western Alaska, it would be difficult for the species of this ecoregion to effect such a large spatial shift in the modeled time step. These models illustrate that the climate envelope of today’s Montane Cordillera better matches the future climate of places in western Alaska than any of the climate envelopes that describe the biomes that exist in Alaska today.

Our initial models predict that by the end of the century, the Arctic and the Alaska Boreal will each diminish by approximately 69%, and the Western Tundra by 54% (all but disappearing in its original location), in favor of the Montane Cordillera and Boreal Transition. In addition, much of southeast Alaska may be in the process of shifting from North Pacific Maritime to Canadian Pacific Maritime, again as constrained by functional barriers. The model suggests that two-thirds of Alaska will experience a potential climate-driven biome shift this century, although shifts are occurring at temporally and spatially different rates across the landscape. Even without sea level rise, Western Tundra may be the rarest and least protected biome within the next century, with less than 2% remaining unchanged by 2100. Not surprisingly, the three southernmost biomes (Boreal Transition, Aleutian Islands, and North Pacific Maritime) were the only biomes whose climate envelopes increase in distribution through the next century.

As noted before, potential biomes are not the same as actual biomes, and these results are based on only two months (June and December) of climate data. A full assessment of potential biome shifts should be done, using additional months as well as secondary data, such as soil depth and permafrost, before results are used for landscape planning. Nevertheless, our models indicate that large-scale change is likely by the end of the century. Further assessment should confirm the broad trends shown here. As will be examined and discussed in the following sections, when these changes are assessed within a conservation framework, they will pose new opportunities and challenges for land managers.

Part II Connectivity, Resilience, and Conservation Potential

To assess landscape-level connectivity, we needed to determine what regions of the state are likely to be most important in coming decades for the conservation of functioning biomes. We then needed to ascertain how well these habitat areas may be protected (based on existing conservation land status) and connected (based on corridors and landscape proximity). We decided to model connectivity by focusing on areas identified as biome refugia and areas with potentially high endemism or potentially high productivity. We then examined how these areas might be linked over the course of this century across the cumulative potential range of each biome within that time period, given constraints on the amount of land in conservation status (Figure 9).

Figure 9 Conservation lands in Alaska

Although conservation status was not used as a modeling criterion in this project, results can be viewed in light of existing land ownership, use, and protection.

SNAP2010Figure9.png

Methods

For the purposes of the Connectivity Project, participants decided on a simple suite of criteria to test the concept. It is important to recognize that these criteria were largely arbitrary and would need serious consideration in future applications. The Marxan model was used to optimize solutions for the presence of biome refugia, two different NDVI criteria, and specific percentages of the landscape. For each biome, modeling was performed within the land area defined by the presence of that potential biome at any modeling time step (2000–2009, 2030–2039, 2060–2069, 2090–2099).

Marxan models are designed to evaluate landscapes to compare how different places meet objectives set by land managers. Two important assumptions in Marxan modeling are 1) not all objectives can necessarily be met simultaneously at the most optimal level; and 2) it is not necessary (and may not even be possible) to find a single mathematically ideal solution. Instead, Marxan searches for “acceptable” solutions (solutions close to the theoretically ideal one) using a process called simulated annealing. In each step of this process, the algorithm replaces the current solution with a generally preferable “nearby” solution, while allowing some counterintuitive adjustments to avoid stopping the search at a local minimum rather than the global minimum. Simulated annealing is typically used in optimization problems with unmanageably large numbers of objects or pixels, or with noisy data. Our model was stochastic: the simulated annealing algorithm was run 50 times, and the final solutions were those selected most frequently across those runs.
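Simulated annealing itself is generic. A minimal sketch of the accept/reject loop (minimizing a toy one-dimensional cost function rather than Marxan's actual multi-criteria objective) looks like this:

```python
import math
import random

def anneal(cost, start, steps=5000, temp0=10.0, seed=42):
    """Minimize `cost` by simulated annealing: usually move downhill, but
    occasionally accept a worse solution so the search can escape local minima."""
    rng = random.Random(seed)
    x = start
    best = x
    for step in range(steps):
        temp = temp0 * (1.0 - step / steps) + 1e-9   # linear cooling schedule
        candidate = x + rng.uniform(-1.0, 1.0)       # a "nearby" solution
        delta = cost(candidate) - cost(x)
        # Always accept improvements; accept worsenings with probability e^(-delta/T).
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            x = candidate
        if cost(x) < cost(best):
            best = x
    return best

# A cost surface with a local minimum near x = 3.8 and the global minimum near x = -1.3.
bumpy = lambda x: x * x + 10 * math.sin(x)
solution = anneal(bumpy, start=4.0)
```

Starting in the basin of the local minimum, the occasional uphill moves at high temperature let the search cross the barrier toward the global minimum, which is exactly the behavior that keeps Marxan from settling on the first acceptable reserve configuration it finds.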

CRITERION 1: BIOME REFUGIA

To determine where biome function might be most resilient to climate change, we first looked for areas that our models predicted would remain part of the same biome climate envelope over the course of this century. Specifically, we assessed change in the three periods between our sample decades (2030–39, 2060–69, and 2090–99). Since we divided the twenty-first century into three time steps, each pixel on the map had the opportunity to change potential biomes up to three times (Figure 10). A point on the map undergoing two or three changes might change to a different potential biome and then change back, or might shift to several distinct biome climate envelopes in sequence. A pixel experiencing zero changes would remain in the same biome classification at every time step. We considered groups of pixels with no change to be areas most resilient to climate change, at least from the perspective of habitat conditions. We considered these areas of no change as possible refugia for species representative of each biome (Figure 11). In our Marxan model, we optimized the selection of these areas.
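The change-counting step can be sketched in a few lines; the pixel histories below are invented for illustration:

```python
# Each pixel's projected biome at the four modeled decades
# (2000s, 2030s, 2060s, 2090s); values are illustrative only.
pixels = {
    (0, 0): ["Arctic", "Arctic", "Arctic", "Arctic"],
    (0, 1): ["Arctic", "Arctic", "Boreal Transition", "Boreal Transition"],
    (1, 0): ["Arctic", "Boreal Transition", "Arctic", "Montane Cordillera"],
}

def n_changes(history):
    """Number of time steps (0-3) at which the biome differs from the previous step."""
    return sum(a != b for a, b in zip(history, history[1:]))

changes = {xy: n_changes(h) for xy, h in pixels.items()}
refugia = [xy for xy, n in changes.items() if n == 0]

print(changes)   # {(0, 0): 0, (0, 1): 1, (1, 0): 3}
print(refugia)   # [(0, 0)] -- pixels that never leave their biome classification
```

Pixels with a count of zero correspond to the dark green refugia class in Figure 10.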

Figure 10 Number of potential biome changes projected by 2099

Areas showing no change (dark green) are considered refugia for existing species assemblages, while those showing one, two, or three changes (light green, yellow, or red, respectively) are more likely to experience dynamic changes in species populations and challenging management choices.

SNAP2010Figure11.png

Figure 11 Biome refugia

Areas shaded in yellow are projected to see no change in potential biome by the end of the twenty-first century. Thus, these regions may be more ecologically resilient to climate change and may serve as refugia for species assemblages from each biome.

SNAP2010Figure11.png

CRITERION 2: NDVI AS AN INDICATOR OF PRODUCTIVITY AND ENDEMISM

The Normalized Difference Vegetation Index (NDVI) is a remote sensing indicator used to analyze the relative reflection and absorption of energy in the photosynthetically active radiation spectrum versus the near infrared spectrum. NDVI is directly related to the photosynthetic capacity and hence energy absorption of plant canopies. High NDVI is a well-accepted proxy for biological productivity and is increasingly being explored as a proxy for biological diversity (Verbyla 2008). Low NDVI would be more typical of low productivity habitats such as alpine zones, where we would expect to find at-risk endemic species that have evolved strategies to occupy marginal environments to minimize competition (Spehn and Koerner 2009). As in the case of the Alaska marmot, low NDVI may be associated with endemism in Alaska.

For the purposes of this project, we were interested in areas of both high NDVI and low NDVI. Thus, we used the cumulative averages of historic NDVI levels for each pixel (Figure 12). Marxan optimized the upper and lower deciles of these values. In the future, there may be other opportunities to incorporate NDVI in these types of models. For example, incorporating the initiation of green-up or focusing on the end of the growing season may provide more information for some modeling objectives.
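NDVI is computed per pixel from the red and near-infrared reflectances as (NIR − Red) / (NIR + Red). A minimal sketch, including the decile cut used here, with all reflectance values hypothetical:

```python
from statistics import quantiles

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red), in [-1, 1]."""
    return (nir - red) / (nir + red)

# Hypothetical mean (NIR, Red) reflectances for a handful of pixels.
cells = {"tundra": (0.30, 0.20), "boreal": (0.50, 0.08),
         "alpine": (0.22, 0.18), "wetland": (0.45, 0.10)}
values = {name: ndvi(nir, red) for name, (nir, red) in cells.items()}

# Marxan optimized the upper and lower deciles of the statewide NDVI distribution.
deciles = quantiles(values.values(), n=10)
low_cut, high_cut = deciles[0], deciles[-1]
```

Pixels above `high_cut` stand in for high-productivity areas, and pixels below `low_cut` for potential endemism hotspots; a real run would compute the cuts over every pixel in the state rather than four examples.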

Figure 12 Historical mean NDVI across the state of Alaska

Areas of highest NDVI (darkest blue) indicate areas of potentially high productivity, while areas of lowest NDVI (lightest blue) may indicate areas of high endemism.

SNAP2010Figure12.png

CRITERION 3: PROPORTION OF LAND AREA IN CONSERVATION STATUS

For the modeling results described in this report, we did not optimize each location based on its current conservation status. In preliminary model runs, we experimented with including conservation status as a ranked criterion, with levels ranging from no conservation mandate to designated Federal Wilderness. This avenue could be reopened in future model iterations.

We did consider limitations on land area imposed by variable land status, however. Marxan identified the best areas to select if one was seeking to cover 10% of Alaska or 25% of Alaska with areas that best met the refugia and NDVI criteria within each biome.

MODELING STEPS

Our first modeling step was the conversion of our 5 km gridded data sets into 40 km hexagons in order to make them compatible with Marxan and more manageable in size, and to better represent the smallest units for functional conservation at the landscape level (Figure 13). Again, the selection of this hexagon size was largely arbitrary. Marxan then found potential solutions based on 50 stochastic runs of the annealing algorithm, optimizing for the three criteria described (Figures 14 and 15). In each solution, each hexagon was assigned a quantile rank, with red hexagons being those selected in the highest proportion of runs (highest 1/3), orange being second-ranked (middle 1/3), yellow being less optimal (lowest 1/3), and gray denoting “no solution” (hexagons never selected in any runs).
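The tercile ranking of hexagons by selection frequency can be sketched as follows; the mapping of proportions to map colors is our reading of the legend, and the selection counts are invented:

```python
# Number of the 50 Marxan runs in which each hexagon was selected (hypothetical).
selected = {"hex_a": 48, "hex_b": 31, "hex_c": 12, "hex_d": 0, "hex_e": 44, "hex_f": 5}

def rank(count, runs=50):
    """Map a selection frequency to the legend classes used in Figures 14 and 15."""
    if count == 0:
        return "gray (no solution)"
    frac = count / runs
    if frac > 2 / 3:
        return "red (top third)"
    if frac > 1 / 3:
        return "orange (middle third)"
    return "yellow (bottom third)"

ranks = {h: rank(c) for h, c in selected.items()}
print(ranks)  # e.g. hex_a -> red, hex_b -> orange, hex_d -> gray
```

Hexagons never selected in any run drop out entirely, which is why gray "no solution" areas appear on the maps alongside the three ranked classes.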

Figure 13 Biome refugia on a 40 km hexagonal grid

This coarser data set was used for modeling connectivity to make the calculations manageable within Marxan. Land area is classified according to how many changes in potential biome are projected to occur in three future time steps.

SNAP2010Figure13.gif

Figure 14 Marxan solution illustrating ranking by biome assuming 10% of total land area in conservation status

Each hexagon is scored based on NDVI (with both the highest and lowest NDVI deciles being optimal), and number of projected potential biome shifts, with lower numbers preferred over higher. Solutions are constrained to the area in which each potential biome is projected to occur at one or more time steps out to 2100 (blue lines). Red is the best solution, orange is next best, then yellow.

SNAP2010Figure14.png

Figure 15 Marxan solution statewide by biome assuming 25% of total land area in conservation status

Each hexagon is scored based on NDVI (with both the highest and lowest NDVI deciles being optimal), and number of projected potential biome shifts, with lower numbers preferred over higher. Solutions are constrained to the area in which each potential biome is projected to occur at one or more time steps out to 2100 (blue lines). Red is the best solution, orange is next best, then yellow.

SNAP2010Figure15.png

Due to the properties of Marxan and the simulated annealing process upon which it is based, overlap between solutions can be expected to be imperfect. For example, 10% land cover solutions for each biome are similar to 25% solutions, but are not a perfect subset of the larger solutions. Moreover, the red pixels for the 10% solution do not add up to exactly 10% of the landscape, since each solution must balance between the competing and sometimes contradictory demands of the variables that are being optimized.

Marxan Modeling Results

In general, we find that the Boreal Transition, Aleutian, and North Pacific Maritime regions in the southeast portions of the state are more likely to be resilient to change. Regions within today’s Western Tundra, Arctic, and Alaska Boreal biomes are least likely to remain within their original climate envelopes. In the western part of the state, very little land is projected to remain unchanged, leaving little opportunity to connect between refugia and the new areas that may suit the Western Tundra biome. In this region, a more sophisticated process that incorporates changes at intermediate time steps may be warranted to identify important connectivity areas.

While Marxan can be used to identify zones to connect, we found that once each grid cell was ranked, drawing in connections by hand was more effective than additional modeling. We drew connections by hand for each biome using Marxan solutions and conservation status maps as guides, and using the intermediate time steps for the biome being modeled to help guide the selection of a connectivity path (Figure 16). If this exercise were being done as part of a true conservation planning exercise, we would bring in additional data layers such as topography and land status to guide connectivity selection. However, in this proof-of-concept approach, we did not include any of these refinements. To the best of our knowledge, this is the first time that linkages have been developed through both space and time. We believe this type of analysis is critical in considering network designs to address rapid climate change.

Discussion of Marxan Results

The biome shift and Marxan results yield some intriguing possibilities for future collaborative work, but our results are obviously based on hypothetical objectives, which need to be refined before use in a statewide adaptive planning exercise. Some of the recommendations identified through this project to improve the modeling process include:

  • Use robust projections of potential biome shifts (i.e., based on more of the available temperature and precipitation data and smoothed between the U.S. and Canada prior to establishing the initial climate envelope description for each biome).
  • Work with partners to establish landscape-scale objectives (e.g., biodiversity or sustainable harvest of game) that can be represented through spatial layers such as NDVI.
  • Consider how to prioritize the importance of areas where new potential biomes occur on the landscape and how to connect them, if at all, through space and time to their original biome locations.

Figure 16 Potential connections between Marxan solutions for lands within each biome category

These polygons were drawn by connecting the highest ranked (orange and red) hexagons from Figure 14, which showed where each biome climate envelope was predicted to occur in the future. Note that overlap occurs where the climate envelopes overlapped between multiple biome categories. Drawing these connections by hand proved more expedient than using further modeling techniques, although future refinement is possible.

SNAP2010Figure16.png

From a management perspective, the results of this type of modeling can help identify areas to prioritize for monitoring, identify areas of research, and identify areas on the landscape that may be important linkages in time and across the landscape as our climate changes. These important linkage sites, where they fall within conservation lands, should influence land management of the conservation unit. Where important linkage sites fall outside of conservation lands, the model results provide information to state and private land managers for consideration in their management. Recommendations for future research include focusing on boosting model accuracy through field validation, delineation and monitoring of potential refugia, and exploring opportunities to develop anticipatory adaptation responses in parts of Alaska. For example, we could stratify research natural areas (RNAs) by whether they are refugia or transitory and establish long-term monitoring within each category.

In some areas of Alaska, the potential future biomes indicated by our models are highly disconnected from existing biomes, or suggest extreme range expansion over a short period. For example, projections for the 2090s show Montane Cordillera, a Canadian biome currently occupying most of southern British Columbia and a portion of southwestern Alberta, as the best fit for the future climate of much of western Alaska. The Boreal Transition biome, which is currently limited to the Alaska Range and southcentral Alaska, is projected to be the best fit for areas as far north as Kotzebue and Arctic Village. In such cases, gradual species shifts are unlikely to occur. Managers may have to choose either to manage for maintenance of biomes less well suited to new climate conditions, or to consider facilitated migration.

Our criteria were selected without serious consideration of the consequences of those choices or of what they might achieve. For example, while we had a rationale for incorporating NDVI into the criteria, we could have chosen approaches other than using the top and bottom 10%. We need to consider how to prioritize connection areas for either biomes or species, and whether such an approach would be useful for multiple species. In using this type of modeling for landscape conservation planning, it is important to recognize that many management objectives could be modeled. One of the strengths of using models in landscape planning is the ability to compare the results when different model criteria are used. Marxan is designed to handle a much broader range of criteria than those used in this iteration. At the same time, clear documentation of the criteria used in each scenario will be necessary to achieve reproducible results.

Part III Modeling Changes in Distribution of Indicator Species

The Connectivity Project originally started with the idea that we could evaluate landscape connectivity characteristics for organisms with different life history traits (i.e., how easily they can migrate). To investigate this concept at a statewide scale, project participants selected four species with very different connectivity issues. Caribou were selected to represent mammal species with few migration constraints; Alaska marmot were selected to represent mammals with limited range and migration capability; trumpeter swans were selected to investigate how statewide landscape connectivity issues would apply to breeding bird populations; and reed canary grass was selected as an invasive plant species that uses our human footprint on the landscape for initial dispersal and may benefit from a warming climate.

In addition, there was interest in understanding how this process could apply to vegetation communities. Since the Connectivity Project was initiated at a time when draft LANDFIRE products were just becoming available, the group decided to use the draft land cover data for the two map zones that were completed earliest. See Technical Addendum III: LANDFIRE Mapping of Units 70 and 71 for more detail on these modeling efforts.

Caribou

SNAP2010Caribou.png

Barren-ground caribou (Rangifer tarandus granti) are an important case study because of their value as a subsistence and economic resource in Alaska and because they are a migratory species with different habitat needs during different seasons. Alaska Department of Fish and Game (ADF&G) caribou biologists created range depictions for 33 caribou herds in the state as of 2008 (Figure 17). For most herds, these data include only a combined range area over all seasons, whereas some herds have defined summer and winter ranges. A few of the better-studied herds have more detailed data, including calving grounds. Caribou exhibit wide variation in degree of migratory behavior and other ecological adaptations across North America, and recent taxonomy broadly describes the migration pattern and ecotype of herds as migratory (tundra or mountain) and sedentary (boreal forest or mountain) (Hummel and Ray 2008:51–52). In Alaska, all herds are considered barren-ground; however, the Chisana herd on the Alaska–Yukon border is officially considered the northern mountain ecotype of woodland caribou in Canada (Zittlau et al. 2000; L. Adams, pers. comm. 2010). Caribou herds have been defined largely by fidelity to distinct calving areas, although variation in spatial use patterns occurs and defies simple definitions. Herd use of seasonal ranges may vary in size and location as the size of the herd changes over time (Davis et al. 1975; R.D. Boertje, pers. comm. April 2010), which complicates inference on habitat selection in response to vegetation changes induced directly (e.g., fire) or indirectly (e.g., climate).

Figure 17 Alaska caribou herd ranges (2008)

Caribou exist in a wide range of habitat types and climates across the state, although herds appear more constrained when delimited by summer and winter ranges. For some herds, additional information is available, such as calving ground locations.

SNAP2010Figure17.png

Caribou ranges were assessed in Random Forests™ using temperature and precipitation data for June and December for the decades 2000–2009, 2030–2039, 2060–2069, and 2090–2099. We tried several modeling approaches to capture the adaptable nature of these animals and learned a great deal about how early decisions to define “existing conditions” within the model change the model results. At first, we considered southern herds separately from northern herds, since we reasoned that northern herds may have noticeable decreases in their range areas, while southern herds may expand. However, caribou biologists were adamant that caribou are so adaptable that this was an artificial distinction. We also looked at summer and winter ranges because we anticipated that seasonal changes may be more informative. We tried tying these modeling results to the results from Boreal Alfresco climate models, which raised several research questions regarding the appropriate way to link two climate-based models for the same area. Since only a few herds had specified summer or winter ranges, this constrained the available data for establishing existing climate envelopes. See Technical Addendum V: Caribou Modeling by Herd for further details on this process. In the end, we decided that the most robust approach, given the available data, was to look at total range for all herds combined (Figure 18).
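The climate-envelope step described above can be sketched with a random forest classifier. Everything in this sketch is an assumption for illustration, not the project's actual pipeline: scikit-learn stands in for the Random Forests™ product, the climate covariates are synthetic, and the 0.5 suitability cutoff is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic "current decade" covariates for grid cells: columns stand in
# for June/December mean temperature and precipitation.
n_cells = 1000
climate_now = rng.normal(size=(n_cells, 4))

# Pseudo presence/absence labels: presence where June temperature
# (column 0) falls inside an arbitrary envelope.
presence = ((climate_now[:, 0] > -0.5) & (climate_now[:, 0] < 1.0)).astype(int)

# Fit the climate envelope with a random forest.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(climate_now, presence)

# Project onto a warmer future decade by shifting June temperature upward,
# then read off suitability as the predicted presence probability.
climate_future = climate_now.copy()
climate_future[:, 0] += 1.0
suitability_future = rf.predict_proba(climate_future)[:, 1]

# Cells above an (assumed) 0.5 cutoff form the projected future range.
future_range = suitability_future > 0.5
print(int(future_range.sum()), "of", n_cells, "cells remain suitable")
```

The same fit/shift/predict pattern underlies each of the future time steps (2030–2039, 2060–2069, 2090–2099), with projected climate layers replacing the shifted columns here.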

Figure 18 Projected caribou range for all herds combined

Although climate modeling indicates that ranges may shrink, with a shift in climate-suitable ranges toward the eastern half of the state, this type of climate-envelope modeling for Alaska caribou herds is impractical at a statewide scale because of species plasticity and overlap between herd ranges.

SNAP2010Figure18.gif

Our model output consisted of predicted ranges based on the climate-derived distribution of biomes; it lacked information on specific habitat criteria, such as abundance of forage biomass or the depth and density of snow. We anticipated a substantial change in predicted range distribution in western Alaska, where future climate envelopes were expected to be poorly matched with existing climate envelopes for herds in that area. For a species as widespread as caribou, it may be best if models are run using climate envelopes for global caribou populations. This broader assessment could show that climate changes in western Alaska are not likely to affect the suitability for caribou in general; however, individual herds may respond differently to changes in climate. The genetic makeup of individual animals within a herd may make them more, or less, adaptable to changes in climate. Recent genetic research on non-migratory herds of mountain caribou in the Yukon Territory concluded that “in the face of increasing anthropogenic pressures and climate variability, maintaining the ability of caribou herds to expand in numbers and range may be more important than protecting the survival of any individual, isolated sedentary forest-dwelling herd” (Kuhn et al. 2010, p. 1312).

Inferring climate effects on distribution or abundance of barren-ground caribou is complex, because the species often encounters a wide range of habitats during migration and its dynamics are influenced by many factors. Sharma et al. (2009) reviewed the influence of biotic and abiotic conditions on the distribution of migratory (barren-ground) caribou, including snow depth, lichen cover, insect avoidance, and predator avoidance. Winter icing events, particularly early in winter, when shallow snow is melted and encases ground forage, can result in mortality of caribou (e.g., Miller and Gunn 2003) or seasonal displacement from affected ranges. These icing events on vegetation are difficult to predict spatially or temporally from broad-scale temperature and precipitation data. Icing is an abiotic event that functions independently of caribou density in a “non-equilibrium grazing system” (Miller and Gunn 2003, p. 388).

Classic corridor-type connectivity modeling for Alaska caribou herds is impractical at a statewide scale because of species plasticity and overlap between herd ranges. Species adaptability may allow caribou to utilize habitats in unexpected ways, as witnessed by the transplant of Nelchina caribou to Adak Island, where they achieved large body size on range atypical for Alaska caribou (Valkenburg et al. 2000). Finally, caribou abundance and distribution are known to have varied substantially over the last 250 years, based on records of tree scarring by hooves in barren grounds of northern Canada (Payette et al. 2004; Zalatan et al. 2006). These researchers recognized that effects of harvest, predation, fire, and climate can influence herd dynamics over long periods. Because of the complexity of caribou herd dynamics, generalizations about where caribou can and cannot thrive based on climate-driven modeling should be considered hypotheses to be tested for plausible mechanisms with empirical data.

Alaska Marmots

SNAP2010AlaskaMarmots.png

The Alaska marmot (Marmota broweri), a relic species from the Beringia Ice Age, has limited adaptability and dispersal ability (Gunderson et al. 2009) and thus makes an excellent case study for connectivity and habitat loss for endemics (native species) in arctic environments. Data for 34 known occurrences of Alaska marmots were provided by the Alaska Natural Heritage Program, based on various sources including the Gunderson collections (Gunderson et al. 2009). It cannot be presumed that the species is absent from areas for which no data exist (Figure 19). We hypothesized that shrinking alpine zones and loss of connectivity between remaining alpine zones would limit the potential habitat for this species. For a detailed explanation of our modeling methods, see Technical Addendum VI.

Figure 19 Known Alaska marmot distribution and modeled current distribution

Since no absence data exist, it cannot be assumed that marmots do not also inhabit similar habitat in which no confirmed presence data are available.

SNAP2010Figure19.png

As with models described in previous sections, we used SNAP climate data for June and December mean temperature and precipitation for 2000–2009 to develop a climate envelope and current potential distribution based on known occurrence sites. However, we also added terrain roughness as a covariate. Rockiness, steepness, and associated biophysical features are of great importance to marmot habitat, since Alaska marmots use rock piles for cover. Terrain roughness was estimated from a localized digital elevation model (DEM) as the rate of slope/aspect change. Terrain roughness is likely to be minimally affected by climate change; therefore, the same values could be used for each 30-year time step.
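A roughness covariate of this kind can be sketched from a DEM raster. One common proxy, used here for illustration, is the local standard deviation of elevation in a moving window; the window size and the std-based metric are assumptions, not the project's exact slope/aspect-change formula.

```python
import numpy as np
from scipy import ndimage

def terrain_roughness(dem, size=3):
    """Local roughness as the standard deviation of elevation in a
    size x size moving window (an illustrative proxy for the report's
    slope/aspect-change metric)."""
    dem = dem.astype(float)
    mean = ndimage.uniform_filter(dem, size=size)
    mean_sq = ndimage.uniform_filter(dem ** 2, size=size)
    # Clamp tiny negative values caused by floating-point error.
    return np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))

# Toy DEM: a flat plain on the left, a steep gradient on the right.
dem = np.zeros((20, 20))
dem[:, 10:] = np.arange(10) * 50.0
rough = terrain_roughness(dem)
print(rough[5, 2], rough[5, 15])  # plain vs. slope
```

Because roughness depends only on topography, this layer would be computed once and reused unchanged for every 30-year time step, as the text notes.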

Confirmed sightings of Alaska marmots have occurred in alpine areas in the Arctic and sub-arctic regions of the state. Initially we used presence/absence of the Alpine biome as a final covariate in our analysis, but the existing alpine area did not correlate well with known marmot occurrences, and predicting alpine zones proved difficult given the lack of pertinent data sets (see Technical Addendum II: Identifying the Alpine Biome). Terrain roughness could be measured more accurately and reliably, and was thus more likely to be a useful predictor.

Using the covariates mentioned above, results from Random Forests™ modeling showed shrinking range size for the Alaska marmot (Figure 20). Statewide, total range area shrank by 27% by 2039, 81% by 2069, and 87% by 2099, as compared with present estimated range size. In addition, previously contiguous habitat areas became disconnected.
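The percent losses above are simple to recompute from modeled range areas. A minimal sketch, with areas invented solely so the arithmetic reproduces the reported figures:

```python
# Percent range loss relative to the present estimated range area.
# The areas below are hypothetical, chosen only so the arithmetic
# reproduces the report's 27%, 81%, and 87% figures.
present_area = 100.0  # arbitrary units
projected_area = {2039: 73.0, 2069: 19.0, 2099: 13.0}

losses = {
    year: round(100 * (present_area - area) / present_area)
    for year, area in projected_area.items()
}
print(losses)
```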

Figure 20 Projected Alaska marmot distribution

Marmot range is expected to diminish sharply as climate warms and alpine habitat shrinks.

SNAP2010Figure20.png

Given the likelihood of limited dispersal ability for this species, the increasing fragmentation of potential distribution areas may present as much of a problem as their shrinking area. Further analysis would be needed to ascertain whether gene flow would become impeded under the projected conditions.

Although the predicted changes in the distribution of climate-suitable habitat for Alaska marmots are alarming, these results should be considered with caution. There were only 23 georeferenced locations for establishing training data out of 34 sighting records. The Connectivity Project participants are concerned about the validity of using so few point locations to establish a climate envelope. It would be helpful to have researchers establish guidelines on how to scale climate data appropriately. Nevertheless, using this approach to guide future efforts to locate Alaska marmots and expand the available data on this little-understood species would be very informative.

Trumpeter Swans

SNAP2010TrumpeterSwan.png

The trumpeter swan (Cygnus buccinator) was selected as a species of interest because, like many other bird species in the state, it is migratory. As such, statewide connectivity of habitat may not be an issue for this species. However, quantity and quality of habitat are pertinent to its survival. Swans are limited, in part, by the length of the summer season available to fledge their young. We hypothesized that the longer summer seasons expected with climate change could potentially expand their overall range. Furthermore, trumpeter swans were an excellent species to model because good-quality data exist for their range and distribution, including survey data for adults, young, and brood size. For a detailed explanation of our data and modeling methods, see Technical Addendum VII.

Using the same methodology and climate data described for modeling biomes, we modeled potential shifts in swan climate-linked habitat, using SNAP temperature and precipitation data for summer and winter for the current decade and three future decades (2000–2009, 2030–2039, 2060–2069, and 2090–2099). Trumpeter swans in Alaska require 138 ice-free days to fledge their young successfully (Mitchell 1994). Thus, we created an overlay of areas predicted to experience greater than or equal to 138 days between the time when the running mean temperature crosses the 0°C point in the spring and when it crosses it again in the fall. This information was derived from the SNAP climate data. Although water bodies would not be expected to be ice-free as soon as mean temperatures are above freezing, the error introduced by this lag time should effectively be canceled by a similar lag time in the fall.
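The 138-day criterion can be sketched as a season-length calculation on a daily temperature series. The 7-day running mean, the sinusoidal toy temperatures, and the assumption of a single contiguous above-freezing season are all illustrative choices, not SNAP's actual derivation.

```python
import numpy as np

def ice_free_days(daily_mean_temp_c):
    """Days between the spring and fall 0 deg C crossings of a smoothed
    daily mean temperature series. The 7-day running mean is an assumed
    smoothing choice for illustration."""
    smoothed = np.convolve(daily_mean_temp_c, np.ones(7) / 7, mode="same")
    above = np.flatnonzero(smoothed > 0.0)
    if above.size == 0:
        return 0
    # Assumes one contiguous above-freezing season per year.
    return int(above[-1] - above[0] + 1)

# Toy annual cycle: sinusoidal daily temperatures peaking in midsummer.
days = np.arange(365)
temp = -15 + 25 * np.sin(2 * np.pi * (days - 80) / 365)
season = ice_free_days(temp)
print(season, "days; meets 138-day threshold:", season >= 138)
```

Cells passing the threshold would then be intersected with the climate-envelope output to produce the overlay the text describes.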

We also overlaid lands that were predicted to be in non-forested biomes as of 2099 as a proxy for incorporating competitive exclusion between trumpeter and tundra swans (Cygnus columbianus). Non-forested biomes included Arctic, Western Tundra, and Aleutian Islands. Trumpeter swans generally occupy forested lakes, while tundra swans are associated more with non-forested lakes. This data layer was likely to be inexact, since, as noted, there is expected to be a significant time lag between potential biome shifts (based on climate suitability) and actual biome shifts (based on climate and vegetation occurrence). Hydrology was not included in our model, because wetlands and other water bodies are ubiquitous across the state and there was no available predictive model to demonstrate changes in water bodies through the century. We modeled occurrence rather than brood size because brood size turned out to be a very weak predictor, perhaps affected more by mating-pair experience and age than by climate-driven habitat quality.

Model results showed distribution expanding west and north (Figure 21), but did not predict movement into the Arctic. It should be noted that this shift might already be underway. Since biologists cannot easily distinguish tundra from trumpeter swans from the air, mixing is probably occurring at the interface between habitats along the northern and western parts of the range. Further exploration of the ability to model distributions of tundra swans will help to clarify the potential distributions of both species.

Figure 21 Potential expansion of trumpeter swan habitat

These predictions are based on a 138-day ice-free season, summer and winter climate envelopes as predicted by SNAP climate projections, and a competitive filter of non-forested biomes to represent tundra swans (not included in this figure). Trumpeter swans are predicted to shift their range northward and westward over the course of this century.

SNAP2010Figure21.png

Reed Canary Grass

SNAP2010ReedCanaryGrass.png

Reed canary grass (Phalaris arundinacea) was selected as a case study because it represents an aggressive invasive species. It is already established on the Kenai Peninsula and elsewhere, and is projected to spread along road and trail systems, and along river systems statewide. Seeds and vegetative fragments float, and seeds, which adhere readily to moist skin or fur, can be transported in clothing, equipment, and vehicles (Wisconsin 2009). Because it clogs waterways, it can have a profound effect on riparian ecosystems (Zedler and Kercher 1994).

Raw data on the occurrence of reed canary grass are available from several sources; however, it should be noted that the species is likely to be widely under-sampled. Data are based on relatively coarse georeferencing, which tends to be located along road/trail systems. We relied on the most comprehensive data set available, from the Alaska Exotic Plant Information Clearinghouse (AKEPIC Database 2005), with approximately 5,000 records through the 2008 field season. For a detailed explanation of our modeling methods, see Technical Addendum VIII.

Dispersal of reed canary grass is clearly through roads and streams, although some isolated road systems are not yet impacted. Therefore, we modeled the potential spread of the species using Department of Transportation (DOT) maps for existing all-season roads (AKDOT 2008). We were unable to model potential secondary expansion through the river systems downstream of introduction sites along roads and trails. Such modeling is possible, however, using new hydrological data processing tools within ArcGIS, and should be incorporated into future models. We also explored the possibility of spread through seasonal roads and future roads; see Technical Addendum VIII for details.

As with the other models described in preceding sections, we mapped the known occurrences of the species and linked these sites with SNAP climate data for June and December for 2000–2009. Using these training data, we extrapolated the potential habitat for the species for three future time steps (2030–2039, 2060–2069, and 2090–2099). However, in this model we also used proximity to roads as a covariate. Furthermore, the modeled climatic niche for reed canary grass is based on extant distributions of populations that are rapidly expanding, making the model a conservative one.
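A proximity-to-roads covariate of this kind is typically derived with a distance transform over a rasterized road layer. This sketch uses SciPy on an invented toy grid rather than the project's DOT data; the road layout and cell geometry are assumptions.

```python
import numpy as np
from scipy import ndimage

# Toy raster: True marks cells containing an all-season road
# (layout invented for illustration).
roads = np.zeros((10, 10), dtype=bool)
roads[4, :] = True  # one east-west road

# Euclidean distance (in cells) from every cell to the nearest road,
# usable directly as a proximity-to-roads covariate.
dist = ndimage.distance_transform_edt(~roads)
print(dist[4, 3], dist[0, 0], dist[9, 5])
```

In practice the cell distances would be multiplied by the raster resolution to get map units before being stacked with the climate covariates.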

Our results (Figure 22) show potential spread to the Seward Peninsula by the end of the century. The predicted spread north should raise concerns among managers that this species can present a threat in other high-latitude regions of the state. These predictions are very conservative, since they do not account for spread via water. In addition, our models do not take into account potential spread by airplanes (terrestrial and floatplanes).

Figure 22 Potential spread of reed canary grass, using climate and all-season roads as predictors

Inclusion of waterways, proposed roads, and trails would be likely to broaden the modeled range of this invasive species. See Technical Addendum VIII.

SNAP2010Figure22.png

Part IV Implications of Modeling for Conservation, Research, and Management

The form of predictive modeling that we have developed here has great potential for landscape-level conservation planning. As biological validation occurs with field studies, it may become useful in guiding the direction of on-the-ground management actions. In the past, management decisions have generally been directed toward maintaining or restoring species abundance and/or diversity, based on some approximation of “historic conditions.” Adaptation may now mean managing toward less-certain future conditions, rather than aiming for historical or current conditions (Choi et al. 2007; Harris et al. 2006). The Intergovernmental Panel on Climate Change recognizes that adaptation strategies can be anticipatory or reactive. Anticipatory adaptation works with climate change trajectories; reactive adaptation works against climate change, toward historic conditions. The former approach manages the system toward a new climate change-induced equilibrium; the latter abates the impact by trying to maintain the current condition despite climate change (Johnson et al. 2008).

The modeling results in this study, albeit preliminary, advance the dialogue that we need to have within our larger conservation community. We can begin assessing the relative trade-offs of doing nothing to address climate change or doing something, and whether that something should be reactive or anticipatory in nature. Our modeling suggests that over the course of this century a dramatic 63% of Alaska will shift to climate conditions associated with a biome other than the current one. Extant distributions of caribou, Alaska marmots, reed canary grass, and trumpeter swans may be very different within a few decades. The Alaska marmot, an endemic species and a Beringia relict, may be on a trajectory for extinction within the state if habitat requirements are not better described through informed research on the animal’s life history characteristics, particularly on timing of spring and den emergence. Conversely, reed canary grass, an invasive plant that is currently constrained in distribution to southeast and southcentral Alaska, will likely colonize much of the state as it is dispersed along existing and future roads. In the near future, we need better information on caribou population distributions and their ties to habitat, to anticipate how caribou may respond to changing climate patterns. These future potential patterns should be considered alongside management efforts (e.g., fire suppression to protect lichens associated with mature black spruce, recreational harvest reduction) that reduce non-climate stressors in areas where caribou are declining or expected to decline based on this type of modeling. Our modeling suggests trumpeter swan populations will likely do well under future climate scenarios, but perhaps at the expense of tundra swan populations that may decline because of competitive exclusion, a mechanistic response that is difficult to model with our approach.

The identification of biome refugia perhaps offers the most significant management opportunity in Alaska, where historic ecosystems are mostly intact and refugia can act as population sources for colonization of novel areas. Two-thirds of predicted refugia over the next century are on federal land designated for some level of conservation; however, only 18% of refugia spatially coincide with congressionally designated wilderness. Wilderness designation or expansion of the federal conservation estate requires an Act of Congress, neither of which has occurred since the passage of the Alaska National Interest Lands Conservation Act (ANILCA) in 1980. Fortunately, there are other tools and approaches for strategic land protection, including working with Alaska Native corporations and villages interested in identifying and designating refugia as special places to sustain traditional ways of life. In addition, there may be opportunities to protect landscape-level migration corridors through multi-partnered conservation agreements. In fact, this project was originally funded because of interest in potential partnering with the Yellowstone to Yukon Conservation Initiative (Y2Y). This partnership aims to protect a contiguous mountain corridor from Yellowstone to the Yukon–Alaska border. Little imagination is needed to picture a partnership that spans Yellowstone to the Yukon Delta or Beaufort Sea.

Figure 23 Case study area, Kenai Peninsula

SNAP2010Figure23.png

There may be opportunities to manage habitats to sustain or create dispersal corridors between biome refugia and sites of future biome establishment, or to protect habitats or species from anthropogenic stressors to give them time and space to adapt to a changing climate and environment. In that regard, this type of modeling can re-direct our attention to areas that may be at greatest risk. For example, results from our project imply that the Western Tundra may be our rarest and least-protected biome over the next century.

Supporting Evidence for our Models from the Field: Kenai Peninsula Case Study

The Kenai Peninsula, particularly west of the Kenai Mountains, is forecasted to be highly impacted by climate change over the next century. We examined the predictions from the statewide Random Forests™ model for a 50 km wide hexagon (2,165 km²; Figure 23), slightly larger than the hexagon used in our Marxan models. This hexagon was deliberately placed in an area of the peninsula with a relatively diverse landscape: various land ownerships, wide elevational change, and a range of aquatic and terrestrial habitats. In this section, we summarize predicted changes for biomes, reed canary grass, trumpeter swans, and caribou, and describe published and anecdotal evidence that supports (or does not support) these forecasted effects at the local scale.

Kenai’s climate: The Kenai Peninsula has a subarctic climate. Temperatures rarely rise to more than 80°F (26°C) in the summer or drop to less than 0°F (–18°C) in the winter. The average annual temperature at the Kenai airport is 33.9°F (1°C). The frost-free growing season varies from 71 to 129 days depending on the location. The Kenai Mountains create a strong rain shadow on the western Kenai lowlands. Sterling and Kenai, on the western side of the peninsula, receive 17 inches (43 cm) and 19 inches (48 cm), respectively, of precipitation per year. In contrast, Seward and Whittier, on the eastern side of the peninsula, receive 68 inches (173 cm) and 198 inches (503 cm), respectively, of precipitation per year. Annual temperatures on the Kenai Peninsula have warmed several degrees following the warming of North Pacific sea-surface temperatures in 1977. Much of this annual increase is due to warmer winters, with December and January having warmed by 9°F (5°C) and 7°F (4°C), respectively. Summers began to warm most noticeably with the drought of 1968–69, with a resultant increased rate of evapotranspiration. Similarly, the annual water balance declined from 5.8 inches (14.7 cm) to 3.0 inches (7.6 cm) per year after 1968 (Berg et al. 2009).

Biome: The statewide modeling suggests that the potential biome will shift dramatically, from 88% Boreal Transition in 2009 to 84% Aleutian Islands in 2099. A portion of the area will remain within the North Pacific Maritime biome, increasing from 12% to 16% of the landscape from 2009 to 2099. From the field: Empirical data indicate that flora within the focal area is representative of the Boreal Transition biome. A supervised classification of Landsat imagery, based on ~4,400 field plots (L. O’Brien, KENWR, Soldotna, Alaska, pers. comm. 2004), indicates that black spruce, mixed hardwood-spruce, and white/Lutz spruce comprise 60% of the landscape (Figure 24). Recent research shows that, indeed, the warming and drying climate of the past 50 years is already impacting processes that affect vegetation composition and distribution, including changed fire regimes (Berg and Anderson 2006; DeVolder 1999; Morton et al. 2006), increased spruce bark beetle outbreaks (Berg et al. 2006), accelerated glacial ablation (Adageirsdottir et al. 1998; Rice 1987), drying wetlands (Klein et al. 2005), tree line invasion into alpine tundra (Dial et al. 2007), and shrub encroachment into historical peatlands (Berg et al. 2009). While we know the landscape is changing rapidly in response to climate, it is less apparent that it is moving toward a shrub-dominated system typical of the Aleutian Islands biome. In the southern part of the peninsula, mostly outside the focal area, much of the white and Lutz spruce forest has given way to a savannah dominated by Calamagrostis canadensis in the aftermath of a 15-year spruce bark beetle outbreak. Some of this conversion is due to salvage operations on private, state, and Native corporation lands, and the parcelization and development of those lands; but the phenomenon is too widespread to be attributed only to these human actions.

Figure 24 Landcover, existing biome, and predicted biome

The study area is predicted to undergo a marked shift from climate characteristics of the Boreal Transition biome to climate characteristics of the Aleutian Islands biome.

SNAP2010Figure24.png

Reed canary grass: Reed canary grass is a relatively recent introduction to the Kenai Peninsula, but has been spreading rapidly along the road system and streams. The statewide modeling indicates that the likelihood of occurrence in the focal area increases from 0.22 (min=0.3, max=0.86, sd=0.22) in 2009 to 1.0 (min=1, max=1, sd=0) by 2099. From the field: Empirical data suggest that modeling of the current distribution approximates what we see on the ground. From 2005–09, there were 786 records of reed canary grass on the Kenai Peninsula, of which 64 occurred within the focal area. The modeling of future distribution suggests that reed canary grass will continue to spread over the entire focal area (Figure 25). At this time, the only constraint that we can envision is the ability of the native Calamagrostis canadensis to outcompete reed canary grass in upland habitats.

Figure 25 Modeled spread of reed canary grass
SNAP2010Figure25.png

Trumpeter swans: The Kenai Peninsula is a major breeding and staging area for trumpeter swans, particularly in the Kenai lowlands north of the focal area. The statewide modeling indicates that the likelihood of occurrence in the focal area increases from 0.39 (min=0.07, max=0.82, sd=0.17) in 2009 to 0.64 (min=0.1, max=0.79, sd=0.14) by 2099 (Figure 26). From the field: Indeed, aerial surveys conducted since 1957 indicate that trumpeter swans have increased from 20 to over 50 breeding pairs on the Kenai National Wildlife Refuge. Within the focal area, trumpeter swans were recorded 78 times from 2004–08: 41 pairs, 23 broods, 7 flocks, and 7 single adults. Swan productivity may continue to increase based on projections of increasing frost-free days, but available habitat may be nearing carrying capacity. Additionally, the drying of closed-basin lakes on the peninsula will only decrease aquatic habitat. Increasing human disturbance from the development of private lands within the focal area will further contribute to a reduction in available habitat.

Figure 26 Increase in likelihood of occurrence of trumpeter swans

SNAP2010Figure26.png

Caribou: Caribou were extirpated from the Kenai Peninsula circa 1917, but were subsequently reintroduced by the USFWS and ADF&G from the 1960s through the 1980s. About 1,000 caribou are now established in four herds on the peninsula. The statewide modeling indicates that the likelihood of caribou occurring in the focal area during the summer will not change significantly over the remainder of this century, ranging from 0.21 (min=0.01, max=0.71, sd=0.20) in 2009 to 0.21 (min=0.11, max=0.41, sd=0.11) in 2099 (Figure 27). From the field: Shown are 1,889 GPS locations of 4 individuals from the Killey River herd (May–August 2002), of which 833 occurred in the focal area; and 313 locations of 2 individuals from the Kenai lowlands herd (May–August 2001), of which 18 occurred in the focal area. The long-term viability of the Kenai lowlands herd is questionable despite a neutral climate prediction; it will likely be driven by local-scale issues, including harassment and mortality due to domestic dogs, habitat loss on the calving grounds due to development, and severing of its migration route by an increasing wildland-urban interface and highway traffic.

Figure 27 Little change is seen in potential caribou distribution
SNAP2010Figure27.png

Conclusion: Within the focal area, we see reasonable convergence of empirical data from 2009 with modeled projections for 2009. In some ways, this is not surprising, given that data from the Kenai Peninsula were captured in the statewide modeling effort. However, this exercise does provide support that, given the similarities between empirical data and model output in 2009, the forecasted future in 2099 is a reasonable scenario.

Implications of Results for Conservation, Research, and Management

These models provide a crucial overview of general patterns and processes in Alaska. As with all models, the Connectivity Project models are most useful when users are aware of the methods used in generating them and the potential errors associated with the model inputs. Thus, we have tried to make these data and methods as transparent as possible. Information not included in this report or in the Technical Addenda can be obtained by contacting the project P.I.s (Appendix A).

Connectivity Project results cannot be used as exact indicators of future biome locations or future species presence/absence data. However, we hope that they provide a useful springboard for planning and research, and for open discussions regarding modeling change and development of Alaska’s habitats and species.

The Connectivity Project model outputs can be used as a guide for field research design for projects intended to investigate related issues. For example, Connectivity Project results indicate that Alaska marmots and similar relic endemics at high altitudes may be at high risk. A logical follow-up would involve tracking numbers and ranges of marmots or other high-altitude species, using past and present data. Connectivity Project biome shift results indicated that the Western Tundra biome may be at the greatest risk for habitat loss statewide. This finding might inform decisions to pursue further monitoring of climate and vegetation change in this region.

Connectivity Project data can also be linked to existing knowledge about species/climate interactions, in order to select new species to track more closely. For example, any invasive species that is currently limited in its spread by cold winter temperatures may be a prime candidate for closer study.

Part V Lessons Learned and Next Steps

We recognize that the results presented here are a first but significant milestone in a sustained research effort that should expand and refine our understanding of the connections between rapid climate change, landscape change, redistribution of individual species, and effective ways to promote conservation at a landscape scale. By demonstrating proof of concept for using downscaled GCMs, existing biological and spatial data, and analytical software to model connectivity now and into the future for biomes and single species, we have laid an early foundation for the development of a conservation strategy for adapting to rapid climate change in Alaska. Perhaps the most distinctive aspect of this project is the incorporation into a conservation design not only of spatial stepping stones (refugia) across Alaska, but also of temporal stepping stones (transitional states) based on a landscape responding to rapid climate change.

We show in this study that Random Forests™ can be a robust and useful modeling approach for defining the climatic niche of biomes, vegetation, and single species. This modeling approach has simple data requirements when compared with more complex mechanistic models. We successfully modeled nodes (e.g., refugia) and other areas of conservation interest with Marxan. However, none of the modeling methods we considered, including least-cost path analysis using Circuitscape, moving-windows analysis, or landscape metrics (Fragstats), were able to connect these nodes in a meaningful way to show a complete statewide conservation reserve design. We were forced to resort to the “crayon” method, where we connected Marxan hexagons by eye to meet our design criteria.
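
The least-cost connection step discussed above can be illustrated with a minimal sketch. The project's actual tools (Circuitscape, Marxan, Fragstats) are not reproduced here; this is a generic Dijkstra least-cost path over a toy resistance raster, and the grid values and the `least_cost_path` helper are illustrative assumptions only:

```python
import heapq

def least_cost_path(resistance, start, goal):
    """Dijkstra shortest path over a resistance raster (4-neighbour moves).

    resistance: 2-D list of positive per-cell traversal costs.
    Returns (total_cost, path), where path is a list of (row, col) cells.
    """
    rows, cols = len(resistance), len(resistance[0])
    dist = {start: resistance[start[0]][start[1]]}  # cost includes start cell
    prev = {}
    frontier = [(dist[start], start)]
    while frontier:
        d, cell = heapq.heappop(frontier)
        if cell == goal:
            break
        if d > dist.get(cell, float("inf")):
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + resistance[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(frontier, (nd, (nr, nc)))
    # Walk predecessors back from the goal to recover the path.
    path, cell = [goal], goal
    while cell != start:
        cell = prev[cell]
        path.append(cell)
    return dist[goal], path[::-1]

# Toy resistance surface: a high-cost ridge (9s) separates two low-cost areas.
grid = [
    [1, 1, 9, 1, 1],
    [1, 1, 9, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 9, 1, 1],
]
cost, path = least_cost_path(grid, (0, 0), (0, 4))  # routes around the ridge
```

On real data the resistance surface would be derived from habitat-suitability rasters, and diagonal moves and cell-distance weighting would normally be added; circuit-theoretic tools such as Circuitscape instead consider all paths simultaneously rather than the single cheapest one.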

Many anticipatory adaptation options to rapid climate change, such as extinction triage, translocation of species to places they have never occurred before, or substitution of genetically modified organisms, involve high ecological and societal risks. Taking these risks demands a high degree of certainty in model outputs. However, assessing model uncertainty proved to be complex and may be difficult to standardize, particularly when multiple modeling approaches, assumptions, and data sets are used to depict a reasonable conservation reserve design that incorporates predicted future outcomes. We suspect that the convergence of multiple models may ultimately be the most meaningful measure of certainty (or the absence of it).

We deliberately chose to model climate change effects on four species with disparate life histories to demonstrate proof of concept: an ungulate important to subsistence living (caribou), a Beringia relict (Alaska marmot), a migratory bird (trumpeter swan), and an exotic, invasive plant (reed canary grass). However, one of the other significant selection criteria was that statewide spatial data not only existed but were accessible, which turned out to be a significant constraint as we shopped around for reasonable data sets with which to model species distributions. Furthermore, we found that ancillary data for modeling these species distributions either needed to be derived or were not available. For example, we recognized that reed canary grass was likely to spread rapidly downstream of road crossings, but were unable to model stream flow direction from existing hydrologic data sets and DEMs, although such data layers are now becoming available. We recognized that trumpeter swan distribution may be constrained by competition with tundra swans, but were unable to find a complete data set of the latter with which to model. Although caribou are one of the most studied animals in Alaska, we had to compile several data sets to develop a composite statewide distribution. Finally, the absence of “alpine” as a type in several land cover classifications constrained the robustness of the Alaska marmot models.

Issues of scale and uncertainty are important to consider in modeling species. The inclusion of just a few additional training points to establish the existing climate envelope for a species with very few known occurrence points, such as the Alaska marmot, may greatly expand or narrow the climate envelope used for future modeling. There is also a question about how species occurrence-point data are best linked to climate models that run on global or statewide scales.
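
The sensitivity of a climate envelope to a handful of training points can be sketched in a few lines. This is a simplified rectangular ("BIOCLIM-style") envelope, not the Random Forests™ approach the project actually used, and the occurrence values below are hypothetical:

```python
def climate_envelope(points):
    """Rectangular climate envelope: per-variable min/max across occurrences.

    points: list of (temperature, precipitation) tuples at known occurrences.
    Returns ((t_min, t_max), (p_min, p_max)).
    """
    temps = [t for t, _ in points]
    precips = [p for _, p in points]
    return (min(temps), max(temps)), (min(precips), max(precips))

def inside(envelope, t, p):
    """True if a candidate site's climate falls within the envelope."""
    (t_lo, t_hi), (p_lo, p_hi) = envelope
    return t_lo <= t <= t_hi and p_lo <= p <= p_hi

# Hypothetical occurrences (mean summer temp in deg C, annual precip in mm).
few_points = [(8.0, 300), (9.5, 340), (8.8, 320)]
env_few = climate_envelope(few_points)

# A single additional outlying occurrence expands the envelope substantially.
env_more = climate_envelope(few_points + [(12.0, 500)])

inside(env_few, 11.0, 450)   # False: site falls outside the narrow envelope
inside(env_more, 11.0, 450)  # True: the same site is now inside
```

For a species with very few records, like the Alaska marmot, one new point shifts the min/max bounds directly, which is why the text warns that a few training points can greatly expand or narrow the modeled envelope.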

The predictions generated by this effort could be assessed and/or improved using 1) more detailed land cover or biome maps; 2) additional data on species distributions, characteristics, and constraints; 3) additional climate data; or 4) different modeling methods. Based on the previous discussion, we offer these suggestions for next steps:

  • A valuable next step would be to assess our results further by approaching similar questions using new perspectives, different modeling techniques, and/or alternate data. While we can never truly assess a predictive model until the future becomes the present, we can have more confidence in the models if multiple techniques predict essentially the same response. From a technical standpoint, we also need better methods to display and assess uncertainty.
  • Further compilation of species-distribution data, and refinement of these data to reflect seasonality, movement and dispersal, and even productivity, would strengthen the predictive model results.
  • The most obvious form of assessment for existing predicted distributions would come from corroboration with individuals who have immediate familiarity with the species and landscapes in question. This would include field biologists and researchers, but would also include those living in village communities around the state, particularly individuals who are frequently engaged in subsistence activities and who are already observing changes associated with altered climate. Traditional ecological knowledge might also identify and designate biomes desired for sustaining traditional ways of life. Further work on this type of modeling for conservation will be greatly enhanced by local involvement.

  • There are several avenues for further research at the level of individual species, species communities, and biomes. The current Connectivity Project biome-shift analysis was constrained by the lack of reliable land-classification data available during the project’s time frame. Alaska’s ecosystems cannot be adequately captured in six, or even twelve, biome categories. Improved biome classifications could be developed using AVHRR land cover classes for Alaska and Canada together with cluster analysis. One promising avenue would be to compare modern modeling and clustering techniques (assessing agreement via confusion matrices, the Kappa statistic, and other approaches) to produce future predictive distribution maps for many different species.
  • The effort just described might be undertaken in conjunction with species distribution mapping done by the Alaska Gap Analysis Project (AK-GAP). The species assessed in the Connectivity Project served only as initial examples, and generated wider interest and discussion about creating a species atlas similar to avian atlases in California and the northeastern U.S. In the long term, an offshoot from this project might be a comprehensive atlas of current and potential future distributions for terrestrial vertebrate species for Alaska. Predictive models of the current distribution for 450 terrestrial vertebrate taxa are under development by the AK-GAP (www.akgap.info).
  • Another important step would be to expand all the models used in the above analysis, using a wider range of climate input data. For example, including spring and fall temperature and precipitation as well as values for summer and winter might help distinguish biomes or vegetation units that differ most in the shoulder seasons. During the last workshop of the Connectivity Project partners, there was consensus that the development of robust biome shift projections for Alaska was an important next step in creating tools for conservation planning. This step has now been funded by the U.S. Fish and Wildlife Service and is underway as of spring 2010. Re-running models has the added benefit of checking for consistency and errors. Such verification is important considering the large number of model runs that were produced during this two-year exercise.

  • The output data provided by Random Forests™ or other models include measures of uncertainty, which can be used to select areas where the model assigned the greatest confidence to its predictions. Climate envelopes from the current and future prediction areas could be evaluated to look at how significant the departures were in different regions of future predicted ranges. This would allow biologists to determine potential climatic thresholds that may be important in determining species persistence in current locations. In all instances, forecasts will still require biological validation. Use of historic field data to assess modeled backcasting with historic climate data (see section on SNAP climate models) may give insights into the credibility of forecasting.
  • Critical temperature thresholds for particular species could be used as a modeling constraint. For example, data are currently available that define temperature thresholds at which moose exhibit thermal stress. Using this type of data to define the climate envelope for species of concern would be a useful analysis in the future.
  • The culmination of developing potential biome shift models and potential shifts in species distribution would be to apply these tools to an Alaska-wide climate change adaptation plan. As we demonstrated using the Marxan models with only four landscape objectives, it is possible to use tools such as Marxan to identify areas that are likely to experience the largest ecological shifts, and conversely, those that may prove the most resilient. This assessment may prove useful to planners in multiple sectors, including agriculture, natural resources development, management of fish and wildlife, and infrastructure development.
  • Mechanistic studies are critical to advancing our understanding of climate change and dealing with its consequences appropriately. A more detailed understanding of how species interact with climate is needed if realistic models of future effects are to be achieved. Mechanistic validation of model results would help boost confidence in using long-range models as management tools. The challenge is integrating slow changes over large scales (i.e., climate, which is changing over decades to centuries) with fast changes over small scales (i.e., harvest management and predator-prey cycles, which change annually or biennially). This might include risk analysis, statistical definition of uncertainty, and analysis of the costs and benefits associated with correct and incorrect predictions and associated management decisions.
  • Despite the challenges mentioned, we believe it is worthwhile to continue discussions and plans for a statewide conservation-reserve network that allows for the natural migration of flora and fauna in response to a changing climate. This group could form the nucleus of a larger task force, appointed by the Alaska Climate Change Executive Roundtable, tasked with goals similar to the Western Governors’ Wildlife Corridors and Crucial Habitat Initiative, and in cooperation with the U.S. Department of the Interior’s development of Landscape Conservation Cooperatives.
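
The map-agreement measures mentioned in these next steps (confusion matrices and the Kappa statistic) can be computed directly from cross-tabulated pixel counts. A minimal sketch of Cohen's kappa follows; the two-class confusion matrix and its counts are purely illustrative:

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: map A, cols: map B).

    kappa = (p_observed - p_chance) / (1 - p_chance),
    where p_chance is the agreement expected from the marginal totals alone.
    """
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    p_obs = sum(confusion[i][i] for i in range(k)) / n
    row_tot = [sum(confusion[i][j] for j in range(k)) for i in range(k)]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    p_chance = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)

# Illustrative pixel counts for two biome maps (classes: tundra, boreal).
agreement = [[45, 5],
             [10, 40]]
kappa = cohens_kappa(agreement)  # 0.70: substantial agreement beyond chance
```

Because kappa discounts agreement that the class frequencies alone would produce, it is better suited than raw percent agreement for comparing predicted biome maps whose classes are very unevenly distributed.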

The participants in the Connectivity Project recognize that these results are only a first step. Continued research should expand and refine our understanding of the connections between climate change, landscape change, effects on individual species, and effective conservation efforts.

Appendices and Technical Addenda

Appendix A Project Participants

This document represents the views of the principal investigators and not necessarily all of the workshop participants. While we attempted to incorporate all of the comments and feedback received from our partners, not everyone was able to comment.

P.I.s
Alaska Department of Fish and Game

Tom Paragi*

David Tessler

Kevin White

Alaska Natural Resource Center, National Wildlife Federation

Gretchen Gary

Audubon Alaska

Melanie Smith

Defenders of Wildlife

Karla Dutton*

Noah Matson

The Nature Conservancy

Evie Witten*

Amalie Couvillion

Peter Larson

Doug Wachob

The Wilderness Society

Wendy Loya*

U.S. Bureau of Land Management

Scott Guyer*

Paul Krabacher

Ramone McCoy

U.S. Fish and Wildlife Service

Bob Platte*

Joel Reynolds*

Danielle Jerry

Bob Stehn

Charla Sterne

Jim Zelenak

U.S. Forest Service

Mary Friberg

U.S. Geological Survey

David Douglas*

Ben Jones

U.S. National Park Service

Michael Shephard

UAA Natural Heritage Program & Alaska GAP

Tracey Gotthardt*

Matt Carlson

* Denotes participant who provided comments on this document and/or other substantial contribution.

Appendix B Additional Reading and References

Adalgeirsdottir, G., K.A. Echelmeyer, and W.D. Harrison (1998). Elevation and volume changes on the Harding Icefield, Alaska. J. Glaciology 44:570–582.

AKDOT (Alaska Department of Transportation) (2008). Statewide and Area Transportation Plans. http://www.dot.state.ak.us/stwdplng/areaplans/index.shtml. Last updated 9/19/2008. Accessed May 2009.

AKEPIC Database (2005). Alaska Exotic Plant Information Clearinghouse Database. Available at: http://akweeds.uaa.alaska.edu. Retrieved April 2009.

Ardron, J., and H. Possingham (Eds.) (2008). Marxan Good Practices Handbook. Vancouver, Canada. http://www.pacmara.org/

Beaumont, L.J., et al. (2007). Where will species go? Incorporating new advances in climate modelling into projections of species distributions. Global Change Biology 13:1368–1385.

Berg, E.E., and R.S. Anderson (2006). Fire history of white and Lutz spruce forests on the Kenai Peninsula, Alaska, over the last two millennia as determined from soil charcoal. Forest Ecology and Management 227:275–283.

Berg, E.E., J.D. Henry, C.L. Fastie, A.D. DeVolder, and S.M. Matsuoka (2006). Spruce beetle outbreaks on the Kenai Peninsula, Alaska, and Kluane National Park and Reserve, Yukon Territory: Relationship to summer temperatures and regional differences in disturbance regimes. Forest Ecology and Management 227:219–232.

Berg, E.E., K. McDonnell Hillman, R. Dial, and A. DeRuwe (2009). Recent woody invasion of wetlands on the Kenai Peninsula Lowlands, south-central Alaska: A major regime shift after 18,000 years of wet Sphagnum-sedge peat recruitment. Canadian Journal of Forest Research 39:2033–2046.

Breiman, L. (1998). Classification and Regression Trees. Chapman & Hall. 368 pp.

Breiman, L. (2001). Random Forests. Machine Learning 45:5–32.

Chan, K.M.A., M.R. Shaw, D.R. Cameron, E.C. Underwood, and G.C. Daily (2006). Conservation Planning for Ecosystem Services. PLoS Biol 4(11):e379.

Choi, E., E. Polack, and G. Collender (2007). Climate Change Adaptation. IDS In Focus Issue 2, Brighton, U.K.: IDS

Climate Change Atlas (2010). USDA Forest Service. http://www.nrs.fs.fed.us/atlas/ accessed 4/2/2010.

Craig, E., and F. Huettmann (2008). Using “blackbox” algorithms such as Treenet and Random Forests for data-mining and for finding meaningful patterns, relationships and outliers in complex ecological data: An overview, an example using golden eagle satellite data and an outlook for a promising future. Chapter 4 in Intelligent Data Analysis: Developing New Methodologies through Pattern Discovery and Recovery (Hsiao-Fan Wang, Ed.). Hershey, PA: IGI Global.

Davis, J.L., P. Valkenburg, and R.D. Boertje (1986). Empirical and theoretical considerations toward a model for caribou socioecology. Rangifer, Special Issue No.  1:103–109.

Davis, J.L., P. Valkenburg, H.V. Reynolds, C. Gravogel, R.T. Shideler, and D.A. Johnson (1975). Herd identity, movements, distribution, and season patterns of habitat use of the Western Arctic caribou herd. Final report (research), Federal Aid in Wildlife Restoration, Project W-17-8 and W-17-9, Job 3.21R, 1 July 1975 to 30 June 1977, 27 pp.

DeVolder, A. (1999). Fire and climate history of lowland black spruce forest, Kenai National Wildlife Refuge, Alaska. M.S. thesis, Northern Arizona University, Flagstaff, 128 pp.

Dial, R.J., E.E. Berg, K. Timm, A. McMahon, and J. Geck (2007). Changes in the alpine forest-tundra ecotone commensurate with recent warming in southcentral Alaska: Evidence from orthophotos and field plots. J. Geophys. Res. 112, G04015.

Elith, J., C.H. Graham, R. Anderson, M. Dudik, S. Ferrier, A. Guisan, R.J. Hijmans, F. Huettmann, J.R. Leathwick, A. Lehmann, J. Li, G. Lohmann, B.A. Loiselle, G. Manion, C. Moritz, M. Nakamura, Y. Nakazawa, J. McC. Overton, A.T. Peterson, S.J. Phillips, K. Richardson, R. Scachetti-Pereira, R.E. Schapire, J. Soberon, S. Williams, M.S. Wisz, and N.E. Zimmermann (2006). Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29:129–151.

Gonzalez, P., R.P. Neilson, and R.J. Drapek (2005). Climate change vegetation shifts across global ecoregions. Ecological Society of America Annual Meeting Abstracts 90:228.

Gunderson, L.H., and C.S. Holling (Eds.) (2002). Panarchy: Understanding Transformations in Human and Natural Systems. Washington, D.C.: Island Press, 507 pp.

Gunderson, A.M., B.K. Jacobsen, and L.E. Olson (2009). Revised distribution of the Alaska marmot, Marmota broweri, and confirmation of parapatry with hoary marmots. Journal of Mammalogy 90(4):859–869.

Hannah, L., et al. (2002). Conservation of biodiversity in a changing climate. Conservation Biology 16(1):264–268.

Harris, J.A., R.J. Hobbs, E. Higgs, and J. Aronson (2006). Ecological restoration and global climate change. Restoration Ecology 14(2):170–176.

Hegel, T.M., S.A. Cushman, J. Evans, and F. Huettmann (2010). Current state of the art for statistical modelling of species distributions. Chapter 16 in Spatial Complexity, Informatics, and Wildlife Conservation, S.A. Cushman and F. Huettmann (Eds.). Tokyo: Springer.

Huettmann, F., S.E. Franklin, and G.B. Stenhouse (2005). Predictive spatial modelling of landscape change in the Foothills Model Forest. Forestry Chronicle 81(4):1–13.

Hummel, M., and J. Ray (2008). Caribou and the North: A Shared Future. Toronto: Dundurn Press, 228 pp.

IPCC (2007). Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change. S. Solomon, D. Qin, M. Manning, Z. Chen, M. Marquis, K.B. Averyt, M.Tignor, and H.L. Miller (Eds.). Cambridge, U.K., and New York: Cambridge University Press.

Johnson, K.A., J.M. Morton, G. Anderson, E. Babij, G. Cintron, V. Fellows, H. Freifeld, B. Hayum, L. Jones, M. Nagendran, J. Piehuta, C. Sterne, and P. Thomas (Ad Hoc Climate Change Working Group) (2008). Four key ideas to guide the Service’s response to climate change: A white paper. U.S. Fish & Wildlife Service, Arlington, VA, 28 pp.

Klein, E., E. Berg, and R. Dial (2005). Wetland drying and succession across the Kenai Peninsula Lowlands, south-central Alaska. Canadian J. Forest Research 35:1931–1941.

Kuhn, T.S., K.A. McFarlane, P. Groves, A.O. Mooers, and B. Shapiro (2010). Modern and ancient DNA reveal recent partial replacement of caribou in the Southwest Yukon. Molecular Ecology 19(7):1312–1323.

LANDFIRE (Landscape Fire and Resource Management Planning Tools Project) (2010). http://www.landfire.gov/.

Lawler, E.L. (2001). Combinatorial Optimization: Networks and Matroids. Courier Dover Publications, 374 pp.

Li, M., N. Kräuchi, and S. Gao (2006). Global warming: Can existing reserves really preserve current levels of biological diversity? Journal of Integrative Plant Biology 48(3):255−259.

Miller, F.L., and A. Gunn (2003). Catastrophic die-off of Peary caribou on the western Queen Elizabeth Islands, Canadian high arctic. Arctic 56:381–390.

Mitchell, C.D. (1994). Trumpeter Swan (Cygnus buccinator), the Birds of North America Online (A. Poole, Ed.). Ithaca: Cornell Lab of Ornithology; Retrieved from the Birds of North America Online: http://bna.birds.cornell.edu/bna/species/105

Moilanen, A., K.A. Wilson, and H.P. Possingham (2009). Spatial Conservation Prioritization. Oxford University Press.

Morton, J.M., E. Berg, D. Newbould, D. MacLean, and L. O’Brien (2006). Wilderness fire stewardship on the Kenai National Wildlife Refuge, Alaska. International J. Wilderness 12(1):14–17.

Nakicenovic, N. et al. (2001). The Special Report on Emissions Scenarios. In: Intergovernmental Panel on Climate Change Third Assessment Report.

NCDIA (National Climate Data and Information Archive) (2009). Canadian Climate Normals or Averages 1971–2000. Environment Canada. http://climate.weatheroffice.ec.gc.c...s/index_e.html

Nowacki, G., P. Spencer, M. Fleming, T. Brock, and T. Jorgenson (2001). Ecoregions of Alaska: 2001. U.S. Geological Survey Open-File Report 02-297 (map).

Payette, S., S. Boudreau, C. Morneau, and N. Pitre (2004). Long-term interactions between migratory caribou, wildfires and Nunavik hunters inferred from tree rings. Ambio 33:482–486.

Phillips, S.J., R.P. Anderson, and R.E. Schapire (2006). Maximum entropy modeling of species geographic distributions. Ecological Modelling 190: 231–259.

Raupach, M.R., G. Marland, P. Ciais, C. Le Quéré, J.G. Canadell, G. Klepper, and C.B. Field (2007). Global and regional drivers of accelerating CO2 emissions. Proc. Natl. Acad. Sci. USA 104(24):10288–10293.

Rice, B. (1987). Changes in the Harding Icefield, Kenai Peninsula, Alaska. M.S. thesis, School of Agriculture and Land Resources Management, University of Alaska Fairbanks, AK, 116+ pp.

Rosenberg, D.K., B.R. Noon, and E.C. Meslow (1997). Biological corridors: Form, function, and efficacy: Linear conservation areas may function as biological corridors, but they may not mitigate against additional habitat loss. BioScience 47(10):677–687.

Rupp, T.S., A.M. Starfield, et al. (2000). A frame-based spatially explicit model of subarctic vegetation response to climatic change: Comparison with a point model. Landscape Ecology 15(4):383–400.

Sharma, S., S. Couturier, and S. Cote (2009). Impact of climate change on the seasonal distribution of migratory caribou. Global Change Biology 15:2549–2562.

Spehn, E., and C. Koerner (Eds.) (2009). Data Mining for Global Trends in Mountain Biodiversity. CRC Press, Taylor & Francis.

Thuiller, W. (2003). BIOMOD – Optimizing predictions of species distributions and projecting potential future shifts under global change. Global Change Biology 9:1353–1362.

Valkenburg, P., T.H. Spraker, M.T. Hinkes, L.H. Van Daele, R.W. Tobey, and R.A. Sellers (2000). Increases in body weight and nutritional status of transplanted Alaskan caribou. Rangifer, Special Issue No. 12:133–138.

Verbyla, D. (2008). The greening and browning of Alaska based on 1982–2003 satellite data. Global Ecology and Biogeography 17:547–555.

Walsh, J., et al. (2008). Global climate model performance over Alaska and Greenland. Journal of Climate 21:6156–6174.

Wisconsin Reed Canary Grass Management Working Group (2009). Reed Canary Grass (Phalaris arundinacea) Management Guide: Recommendations for Landowners and Restoration Professionals.

Zalatan, R., A. Gunn, and G.H.R. Henry (2006). Long-term abundance patterns of barren-ground caribou using trampling scars on roots of Picea mariana in the Northwest Territories, Canada. Arctic, Antarctic, and Alpine Research 38:624–630.

Zedler, J.B., and S. Kercher (2004). Causes and consequences of invasive plants in wetlands: Opportunities, opportunists, and outcomes. Critical Reviews in Plant Sciences 23(5):431–452.

Zittlau, K., J. Coffin, R. Farnell, G. Kuzyk, and C. Strobeck (2000). Genetic relationships of the Yukon woodland caribou herds determined by DNA typing. Rangifer, Special Issue No. 12:59–62.

Appendix C Alaskan Biome and Canadian Ecozone Descriptions

The biome descriptions below are adapted from “Home is Where the Habitat is—An Ecosystem Foundation for Wildlife Distribution and Behavior”: http://www.nsf.gov/pubs/2003/nsf03021/nsf03021_2.pdf. This article was prepared by Page Spencer, Gregory Nowacki, Michael Fleming, Terry Brock, and Torre Jorgenson, and was in turn based upon “Narrative Descriptions for the Ecoregions of Alaska and Neighboring Territories,” by G. Nowacki, P. Spencer, T. Brock, M. Fleming, and Torre Jorgenson.

The Canadian ecozone descriptions are based on Narrative Descriptions of Terrestrial Ecozones and Ecoregions of Canada provided by Environment Canada: http://www.ec.gc.ca/soer-ree/English...ework/NarDesc/

Alpine

The Alpine biome was derived solely for the Connectivity Project, and was based on elevation, rather than directly from the work of Nowacki et al. (2001). As such, it occurs in scattered patches across the mountain ranges of the state. Under the influence of ongoing climate change, the Alpine biome is expected to become increasingly limited to higher elevations, but not to shift spatially across the landscape.

The biome includes steep angular summits with rubble, scree, and remnant glaciers. Where vegetation occurs, alpine tundra and barrens dominate, and soils are thin and rocky. Blueberry-rich alpine tundra gives way to the harsh conditions at the highest elevations, leaving these areas to bare rock, talus, and ice.

Alpine tundra habitats of sedges, grasses, and low shrubs support mountain goats, hoary marmots, and ptarmigan. Populations of Dall sheep and pikas can be found on mid and upper slopes. Caribou, wolves, and brown bears sometimes range into alpine areas.

Arctic

Climate conditions in this northernmost biome are cold and dry. Permafrost is nearly continuous throughout the region, contributing to saturated organic soils in the summer and a variety of freeze-thaw ground features such as pingos, ice-wedge polygons, and oriented thaw lakes. The Arctic biome includes mountain foothills, as well as coastal plains covered by a vast mosaic of lakes, braided rivers, and wetlands. High-energy stream systems cut narrow ravines in the mountains and coalesce into large braided rivers in the foothills.

Tundra and low-shrub communities predominate throughout the Arctic. Saturated soils and numerous thaw lakes support wet sedge tundra in drained lake basins, swales, and floodplains; tussock tundra and alpine tundra are dominated by sedges and Dryas on gentle ridges. Vegetation on lower hill slopes is dominated by mixed shrub-sedge tussock tundra, interspersed with willow thickets along rivers and small drainages and Dryas tundra on ridges. Lower and more southerly and easterly mountain slopes and valleys are covered with sedge tussocks and shrubs; and sparse spruce, balsam poplar, and birch forests and tall shrublands occur in larger protected valleys.

Fish are sparse in the mountains, but at lower elevations, arctic char, arctic grayling, arctic cisco, broad whitefish, least cisco, and Dolly Varden are all common. Huge herds of caribou migrate to the Arctic annually. Wolves, arctic foxes, and grizzly bears follow and prey on caribou herds, subsisting on voles, lemmings, arctic ground squirrels, or vegetation when caribou are not available. Muskoxen are re-establishing themselves from introduced animals. Dall sheep inhabit high elevations. Several species of whales migrate into the Arctic Ocean via the Bering Strait in summer, and seals and polar bears are year-round residents. Dense concentrations of lakes and ponds support many species of nesting birds, including a wide variety of shorebirds, ducks, geese, swans, and songbirds, and the rare arctic loon. Bears, snowy owls, arctic foxes, and hares are common. Millions of seabirds (cormorants, kittiwakes, murres, puffins, and auklets) and marine mammals (northern fur seals, ribbon seals, and sea lions) inhabit offshore islands during the summer. Domestic reindeer have been introduced to Nunivak Island and the Seward Peninsula.

Western Tundra

The Western Tundra biome is dominated by a moist sub-polar climate, but summers are sufficiently long and warm to allow patches of stunted trees to grow, primarily along rivers and streams. However, summer warming is tempered by the cold prevailing winds off the Bering Sea. Valleys are filled with finger lakes, while the Yukon-Kuskokwim Delta and the Bristol Bay lowlands are composed of layers of glacial, alluvial, and marine sediments that form low-lying saturated soils and a mosaic of ponds, sloughs, and wandering streams. Permafrost is nearly continuous on the Yukon-Kuskokwim Delta, becoming patchy further south in Bristol Bay. The mountain units have thin rocky soils with sporadic permafrost in the valleys.

Vegetation patterns generally follow the terrain. Conifer species are not present throughout much of this region, though white spruce and balsam poplar grow in stands along many river systems in the eastern edge of this region. Paper birch forests and tall-shrub communities of dwarf birch and alder grow on gently rolling side slopes. Black spruce can be found on some of these sites on the eastern side of this region where it transitions with the Boreal ecoregion. The higher elevations are covered with shrub tundra and lichens or barrens on the wind-scoured summits. Lowlands are covered with a rich and productive mix of emergent wetlands and sedge-tussock and sedge-moss bogs, with willows along small streams. Slight rises support low shrublands and scattered spruce.

The river systems of this division are incredibly productive for various fisheries, including Bristol Bay sockeye, king, red, and chum salmon. The lake and wetland systems, particularly of the Yukon-Kuskokwim Delta, support millions of staging and nesting waterfowl and shorebirds. Great numbers of walruses and sea lions haul out on rocky beaches, while seabirds patrol the skies. Caribou, wolves, and black and grizzly bears roam the uplands, and moose are continuing to expand their range westward, following tall shrubs and birch forests.

Alaska Boreal

The Alaska Boreal region is characterized by a continental climate, with extreme weather conditions ranging from long, cold winters to short, warm summers. The continental climate is fairly dry throughout the year, and forest fires are frequent during summer droughts. The resulting vegetation pattern is a constantly shifting mosaic of successional communities in response to wildfire and river changes. Permafrost is discontinuous in the southern regions of the Alaska Boreal biome becoming continuous in the northern zones. Soils are underlain by ice-rich permafrost and are subject to thermokarsting, where ice lenses melt or form un- der insulating moss mats. The boreal forest of Alaska is vegetated with black spruce, tama- rack, and paper birch woodlands; shrubby muskeg on permafrost-rich areas; white spruce and balsam poplar on floodplains where permafrost is missing or very deep; and aspen, birch, and shrub on upland areas of recent fires and discontinuous permafrost. The highly productive vegetation along major rivers supports vigorous stands of white spruce and bal- sam poplar. Robust wet sedge meadows and aquatic vegetation are invading sloughs and ox- bow ponds. The adjacent permafrost-dominated lowlands support black spruce woodlands, dwarf birch and low-growing ericaceous shrubs of the heath family, and sedge-tussock bogs. The climate in the boreal lowlands becomes progressively more continental the further east one travels, as the temperature ranges become greater and precipitation decreases. Permafrost is continuous to the north of this biome, and discontinuous south and west of it. Boreal lowlands are pockmarked with lakes and ponds, which support tremendous concentrations of nesting waterfowl and other migratory birds, and an abundance of moose, bears, furbearers, northern pike, and salmon. Large rivers support important runs of chinook, chum, and coho salmon, while clear tributary streams support Dolly Varden and grayling. 
These areas support large populations of moose and black bear, abundant waterfowl during breeding and molting seasons, and furbearers, including beavers, muskrats, and martens, as well as ravens and raptors, such as peregrine falcons.

Boreal uplands are primarily vegetated by white spruce, birch, and aspen on south-facing slopes, and black spruce on north-facing slopes. Uplands are subject to frequent forest fires. Caribou, moose, snowshoe hares, marten, lynx, and black and brown bears are plentiful.

Boreal Transition

The climate of the Boreal Transition biome has shorter winters than the continental interior, and warmer, drier summers than the marine-influenced coastal rainforests. However, the Alaska Range generates its own weather, as moisture-laden air rises over the massif and releases heavy snowfalls on the upper elevations. Remnants of glaciers and many glacial features define the landscape. Boreal forests are distributed in the valleys and lowlands of the division, but wildfire and permafrost have much less influence on vegetation succession and distribution.

Glacial rivers are silty and braided, with broad, gravelly floodplains. Clear streams are generally smaller with narrower floodplains. Arctic grayling are common in clear mountain streams, and all five species of Pacific salmon migrate into rivers of the Boreal Transition. Soils in the mountainous units of the Alaska Range and Lime Hills are generally thin, rocky, and cold, with scattered pockets of permafrost. The Copper River Basin floor has fine-grained saturated soils with ice-rich permafrost. Soils of the Cook Inlet Basin are a complex mixture of alluvial, glacial, volcanic, and lacustrine materials with occasional patches of permafrost. Both basins support boreal vegetation patterns, with white spruce and birch on higher ground and black spruce, low shrubs, sedges, and mosses growing in the wetlands. White spruce and balsam poplar form successional stands along the rivers. The lower slopes of the Alaska Range and Talkeetna Mountains are covered with dense thickets of alder that transition to low shrubs in the subalpine.

The wide variety of habitats supports many species of mammals and both resident and migratory birds. Golden eagles, ptarmigan, ravens, Dall sheep, mountain goats, wolverines, caribou, moose, grizzly and black bears, wolves, foxes, lynx, beavers, and various small mammals can be found within this region. Waterfowl nest in the wetlands of the basins, although not in the concentrations found in the Yukon-Kuskokwim Delta or Yukon Flats.

North Pacific Maritime

The North Pacific Maritime biome, along the north and east shores of the Gulf of Alaska, receives copious rain at lower elevations and snow at higher altitudes, coupled with relatively warm temperatures throughout the year. The warm, wet climate supports lush conifer rainforests along the coast and large ice fields and glaciers at higher elevations. The coastlands reflect their glacial heritage, with steep bedrock fjords, tidewater glaciers, and numerous rocky islands.

A few areas along this coast remained ice-free during one or more glacial advances, providing refugia for plant and animal species. Soils are exceptionally thin except in riparian zones. Relatively warm winters preclude permafrost. Five species of Pacific salmon migrate into the steep, fast-flowing streams to spawn, and in the process, cycle tremendous amounts of nutrients back to the freshwater and terrestrial systems. Dolly Varden, char, and steelhead (oceangoing rainbow) trout live in larger clear-water streams.

The warm maritime environment encourages lush moss-draped conifer forests along the coast. Old-growth forests of Sitka spruce, hemlock, and cedar blanket the lower slopes of the Alexander Archipelago. Toward the west, cedar ceases in Prince William Sound, and hemlock reaches to the end of the Kenai Peninsula. On Kodiak, Sitka spruce is expanding south across the island into new habitats. Pockets of wetlands have formed on shallow, poorly drained soils on bedrock throughout the division. Coves and rocky islands are fringed with intertidal communities of kelp, eelgrass, and barnacles. Upper forests give way to a narrow subalpine zone of alder and herbaceous meadows and then alpine tundra and bedrock or ice.

Common forest animals include black and brown bears and Sitka black-tailed deer. Offshore waters are rich with deepwater fish such as halibut and cod. Grey and humpback whales migrate along the coast. Bald eagles, common murres, Bonaparte’s gulls, Steller’s sea lions, harbor seals, and sea otters are common.

Aleutian Islands

The Aleutian Islands biome is defined by cool, moist, and harsh weather. Permafrost is absent, reflecting the relatively warm climate, dominated by oceanic influences. Soils are a mixture of volcanic materials, often reworked by glacial and alluvial agents. Areas of recent glaciations and volcanic activity are largely barren cinder plains. Other parts of the region, well watered by Pacific storms and fertilized by nesting seabirds, support lush meadow and heath vegetation communities, with willows along streams. The flora is a blend of species from two continents, grading from Asian to North American affinities from west to east.

This biome is the domain of seabirds, waterfowl, and marine mammals. Sea otter populations, which rebounded from near extirpation by Russian and American fur traders, have recently experienced population declines. Endangered Steller’s sea lions use low rocky shelves as haulouts and pupping areas, although their numbers have dropped dramatically within the past several decades. Several species of whales reside here or migrate through, en route to the Arctic Ocean. Caribou, which are native on the peninsula and Unimak Island, have been introduced to several Aleutian Islands. Foxes, introduced to many islands for fox farming, and rats, introduced accidentally from ships, have nearly decimated ground-nesting waterfowl, including the Aleutian cackling goose. Fox eradication and careful reintroduction of the Aleutian cackling goose on several islands have resulted in the recent removal of this bird from the endangered species list and increased nesting success for seabirds. New efforts to eradicate rats show promise for rebuilding seabird populations on additional islands.

Boreal Cordillera Ecozone

The Boreal Cordillera ecozone is located in the midsection of the cordilleran system. It covers sections of northern British Columbia and the southern Yukon. The climate ranges from cold, subhumid to semiarid, and is marked by long, cold winters and short, warm summers, modified by vertical zonation and aspect. Mean annual temperatures range from 1°C to °C. The coldest mean annual temperatures occur in the Yukon Plateau region. The mean summer temperatures range from 9.5°C to 11.5°C. Mean winter temperatures range from –13°C to –23°C. The Pacific maritime influence moderates temperatures over most of the ecozone. Mean annual precipitation is lowest in valleys within the rain shadow of the coastal ranges (<300 mm) and increases in the interior ranges further east, where up to 1500 mm of precipitation is received at higher elevations. Precipitation in the intermontane plateau areas ranges 300–600 mm annually.

In some parts of this ecozone (British Columbia), there are grasslands on south-facing slopes with boreal forest vegetation on the north-facing slopes, a feature unique within the boreal forests of Canada. The vegetative cover ranges from closed to open canopies over much of the plateaus and valleys. Tree species include white and black spruce, alpine fir, lodgepole pine, quaking aspen, balsam poplar, and white birch. In the northwest, the stands are generally open, and lodgepole pine and alpine fir are usually absent. At higher elevations, there are extensive areas of rolling alpine tundra characterized by sedge-dominated meadows, and lichen-colonized rock fields are common.

This ecozone is characterized by mountain ranges that contain numerous high peaks and extensive plateaus, separated by wide valleys and lowlands. The area has been modified by glaciation, erosion, solifluction, and eolian and volcanic ash deposition. Glacial drift, colluvium, and outcrops constitute the main surface materials. Only a small portion of this ecozone in the northwest was unglaciated. Permafrost and associated landscape features tend to be widespread in the more northerly areas and at higher elevations; soils are cryosolic in these regions. In the warmer, lower elevations in the southern half, brunisols, podzols, and luvisols are common.

Mammals of the Boreal Cordillera ecozone include woodland caribou, moose, Dall sheep, mountain goat, black and grizzly bear, marten, lynx, American pika, hoary marmot, and arctic ground squirrel. Representative bird species include willow, rock, and white-tailed ptarmigan, and spruce grouse, along with a range of migratory songbirds and waterfowl.

Taiga Cordillera Ecozone

The Taiga Cordillera ecozone is located along the northernmost extent of the Rocky Mountain system and covers most of the northern half of the Yukon and the southwest corner of the Northwest Territories. In this ecozone are found Canada’s largest waterfalls, deepest canyons, and wildest rivers.

Annual precipitation ranges from less than 300 mm in the north to over 700 mm in the southeast (Selwyn Mountains). Mean annual temperatures range from –10°C in the north to –4.5°C in the south. Mean summer temperatures, which range from 6.5°C to 10°C, are modified by vertical zonation and aspect. Summers are warm to cool with extended periods of daylight. Mean winter temperatures range from –25°C in the north to –19.5°C in the south. Winters are long and cold with very short daylight hours. Weather patterns from the Arctic and Alaskan coasts have a marked influence on this ecozone.

Natural vegetation ranges from arctic tundra (dwarf or low shrubs, mosses and lichens, and cottongrass) in the north, to alpine tundra (dwarf shrubs, lichens, saxifrages, and mountain avens) in higher elevations, and taiga or open woodland in the south (white spruce and white birch), mixed with medium to low shrubs (dwarf birches and willows), mosses, and lichens.

Steep, mountainous topography, consisting of repetitive, sharply etched ridges and narrow valleys, predominates, with foothills and basins present. The bedrock is largely sedimentary in origin with minor igneous bodies. Much of the area is mantled with colluvial debris, with frequent bedrock exposures and minor glacial deposits. The northwest portion of this ecozone consists of unglaciated terrain. Brunisols, regosols, and cryosols tend to be the predominant soils. Most wetlands, which in some ecoregions are extensive, are underlain by permafrost. Abundant permafrost features, such as peat hummocks, palsas, and peat plateaus, are common in peatlands. The unglaciated portions of this ecozone commonly exhibit periglacial features such as cryoplanation terraces and summits and various forms of sorted and unsorted patterned ground. Continuous permafrost underlies most of the ecozone with the exception of the western half of the Mackenzie and Selwyn Mountains ecoregions.

Wildlife in the area is diverse. Characteristic mammals include Dall sheep, woodland and barren-ground caribou, moose, mountain goat, black and grizzly bear, wolf, lynx, arctic ground squirrel, American pika, hoary marmot, and a large concentration of wolverine. Important birds include gyrfalcon, willow and rock ptarmigan, and waterfowl. Most of the area remains a wilderness. The Yukon’s Old Crow Flats is a large wetland complex that has received international recognition for its value to swans, Canada geese, and other waterfowl species that nest or stage here in the tens of thousands each year.

Pacific Maritime Ecozone

The Pacific Maritime ecozone covers the mainland Pacific coast and offshore islands of British Columbia. The wettest climates in Canada occur in this coastal ecozone, especially near the mountains on the windward slopes of Vancouver Island, the Queen Charlotte Islands, and the mainland Coast Mountains.

The climate of this ecozone ranges from relatively mild, humid maritime conditions at low elevations to cool, very humid conditions at higher elevations in the main mountain systems. The ecozone has some of the warmest and the wettest climatic conditions in Canada. Mean annual temperatures range from 4.5°C in the north to 9°C in the Georgia-Puget Basin–Lower Mainland regions. The mean summer temperature ranges from 10°C in the north to 15.5°C in the south. Mean winter temperatures range from –0.5°C to 3.5°C. Relative to the rest of Canada, there is little variation between mean monthly temperatures throughout the year. Annual precipitation ranges from as little as 600 mm in the Gulf Islands of the lower Strait of Georgia to over 4000 mm in the Coastal Gap region to the north. Overall, the zone typically receives 1500–3000 mm of precipitation per year. The Pacific maritime influence is responsible for the high level of precipitation and for the temperature moderation.

The temperate coastal forests are composed of mixtures of western red cedar, yellow cedar, western hemlock, Douglas-fir, amabilis fir, mountain hemlock, Sitka spruce, and alder. Many of these trees reach very large dimensions and grow to great ages, forming ancient or old-growth forests of this ecozone. Douglas-fir is confined largely to the extreme southern portion of the ecozone. In the north, amabilis fir becomes more common. Mountain hemlock is usually associated with higher elevations. Variations in altitude account for the presence of widely contrasting ecosystems within the ecozone, ranging from mild, humid coastal rainforest to cool boreal and alpine conditions at higher elevations.

Mountainous topography dominates, cut through by numerous fjords and glacial valleys and bordered by coastal plains along the ocean margin. Igneous and sedimentary rocks underlie most of the area. Colluvium and glacial deposits are the main surface materials. The soils are largely podzolic and brunisolic. The Queen Charlotte Islands and part of Vancouver Island that escaped glaciation are unique, because they now contain many endemic species—ones that are peculiar to those habitats. Ice-free coastal waters are associated with the narrow continental shelf and slope.

Characteristic mammals include black-tailed deer, black and grizzly bear, elk, wolf, otter, and raccoon. Bird species unique to this area of Canada include American black oystercatcher, California and mountain quail, tufted puffin, and chestnut-backed chickadee. Other representative birds are pygmy-owl, Steller’s jay, and northwestern crow. Marine environments are typified by northern sea lion, northern fur and harbor seal, and giant beaked, sperm, grey, killer, Pacific pilot, and blue whale. Salmon and associated spawning streams are located throughout this ecozone. Freshwater discharge from coastal rivers mixing with ocean waters stimulates the occurrence of abundant marine life.

Montane Cordillera Ecozone

Most of southern British Columbia and a portion of southwestern Alberta are contained within the Montane Cordillera ecozone—the most diverse of all the ecozones, ranging from alpine tundra to dense conifer forests to dry sagebrush and grasslands. There are some large, deep lakes and major river systems, including the Fraser River and the Columbia River headwaters.

The climate of the region ranges from subarid to arid and mild in southern lower valleys to humid and cold at higher elevations in the northern reaches. Moist Pacific air and the effect of orographic rainfall control the precipitation pattern such that both rain shadows and wet belts are generated within the ecozone, often in close geographic proximity to each other. The rain shadow cast by the massive Coast Mountains results in some of the driest climates in Canada in the valley bottoms of the south-central part of the ecozone. The Rocky Mountains also impede the westward flow of cold continental arctic air masses. Mean annual temperatures range from 0.5°C in the northwest (Skeena Mountains) to 7.5°C in the Okanagan area along the Canada–United States border. Mean summer temperatures range from 11°C to 16.5°C. The mean winter temperatures range from –11°C to –1°C. Annual precipitation is 1200–1500 mm in the mountains and ranges to the west, to 500–800 mm in the north and interior, rising again to 1200 mm in the mountains and ranges along the British Columbia–Alberta border. Precipitation falls below 300 mm in the arid valleys and plateaus to the south.

Vegetative cover is extremely diverse; alpine environments contain various herb, lichen, and shrub associations, whereas the subalpine environment has tree species such as lodgepole pine, alpine fir, and Engelmann spruce. With decreasing elevation, the vegetation of the mountainous slopes and rolling plains separates into three general groups: a marginal band of forests characterized by Engelmann spruce, alpine fir, and lodgepole pine; forests characterized by ponderosa pine, interior Douglas-fir, lodgepole pine, and trembling aspen in much of the southwest and central portions; and forests characterized by western hemlock, western red cedar, interior Douglas-fir, and western white pine in the southeast. Shrub vegetation found in the dry southern interior includes sagebrush, rabbitbrush, and antelope bush. Most of the natural grasslands that existed in the dry south have vanished, to be replaced by urban settlement and agriculture.

The Montane Cordillera ecozone is a rugged, mountainous unit that incorporates several major interior plains. The plains are more extensive in the north and extend as intermontane valleys toward the southern half of the ecozone. Most of these plains and valleys are covered by glacial moraine and, to some degree, fluvial and lacustrine deposits, whereas the mountains consist largely of colluvium and rock outcrops. Luvisols and brunisols are the most common soils, with podzols occurring in the mountain ranges in the wetter eastern portion of the ecozone. The soils of the lower valley floors to the south are often chernozems and support grasslands. These soils grade into arid environments in the Okanagan area toward the Canada–United States border.

Characteristic mammals include woodland caribou, mule and white-tailed deer, moose, mountain goat, California bighorn sheep, coyote, black and grizzly bear, hoary marmot, and Columbian ground squirrel. Typical bird species include blue grouse, Steller’s jay, and black-billed magpie.

Taiga Plains Ecozone

The Taiga Plains are located mainly in the southwestern corner of the Northwest Territories, northeastern British Columbia, and northern Alberta. Taiga, a Russian word, refers to the northern edge of the boreal coniferous forest—that Land of Little Sticks which spans from the subarctic of Labrador to Alaska and beyond, from Siberia to Scandinavia. The ecozone is dominated by Canada’s largest river, the mighty Mackenzie, and its tributaries. It is bordered on the west by cordilleran mountain ranges, on the east by two huge lakes (the Great Slave and Great Bear), on the north by the extensive Mackenzie Delta, and on the south by the closed forests of the Boreal Plains ecozone.

The climate is marked by short, cool summers and long, cold winters. Cold arctic air influences the area for most of the year. The mean annual temperature ranges from –10°C in the Mackenzie Delta region to –1°C in Alberta and British Columbia. From north to south, the mean summer temperature ranges from 6.5°C to 14°C. The mean winter temperature ranges from –26°C in the north to –15°C in the south of the ecozone. Snow and freshwater ice persist for six to eight months of the year. The mean annual precipitation is low, ranging from 200 to 500 mm.

The ecozone is characterized by open, generally slow-growing conifer-dominated forests of predominantly black spruce. The shrub component, which is often well developed, includes dwarf birch, Labrador tea, and willow. Bearberry, mosses, and sedges are dominant understory species. Upland and foothill areas and southerly locales tend to be better drained, are warmer, and support mixed-wood forests characterized by white and black spruce, lodgepole pine, tamarack, white birch, trembling aspen, and balsam poplar. Along the nutrient-rich alluvial flats of the larger rivers, white spruce and balsam poplar grow to sizes comparable to the largest of these species in the boreal forests to the south.

This ecozone is the northern extension of the flat Interior Plains, which dominates the Prairie and Boreal Plains ecozones to the south. The subdued relief of broad lowlands and plateaus is incised by major rivers, the largest of which can show elevational differences of several hundred meters. Underlain by horizontal sedimentary rock—limestone, shale, and sandstone—the nearly level to gently rolling plain is covered with organic deposits and, to a lesser degree, with undulating to hummocky morainal and lacustrine deposits. Alluvial deposits are common along the major river systems, including braided networks of abandoned channels. Low-lying wetlands cover 25–50% of the zone. A large portion of the area is underlain by permafrost, which acts to perch the surface water table and promote a regional overland seepage system. When combined with low-angle slopes, permafrost creates a landscape that is seasonally waterlogged over large areas. Patterned-ground features are common. The region’s widespread permafrost and poor drainage create favorable conditions for cryosolic, gleysolic, and organic soils.

Characteristic mammals include moose, woodland caribou, wood bison, muskox, wolf, black bear, marten, lynx, and arctic ground squirrel. Barren-ground caribou overwinter in the northwest corner of this ecozone. Bird species include the common redpoll, gray jay, common raven, red-throated loon, northern shrike, sharp-tailed grouse, and fox sparrow. Fish-eating raptors include the bald eagle, peregrine falcon, and osprey. The Mackenzie Valley forms one of North America’s most travelled migratory corridors for waterfowl (ducks, geese, and swans) breeding along the Arctic coast.

Boreal Plains Ecozone

The Boreal Plains ecozone extends as a wide band from the Peace River country of British Columbia in the northwest to the southeastern corner of Manitoba. Unlike the neighboring Boreal Shield, this ecozone is not bedrock-controlled; it has few bedrock outcrops and considerably fewer lakes.

The climate is typified by cold winters and moderately warm summers, and is strongly influenced by continental climatic conditions. The mean annual temperature ranges between –2°C and 2°C. Mean summer temperatures range between 13°C and 15.5°C. Mean winter temperatures range from –17.5°C to –11°C. Winter temperatures in the foothills of Alberta are a few degrees warmer. Mean annual precipitation rises from 300 mm in northern Alberta to 625 mm in southwest Manitoba. The growing season averages 1000–1250 growing degree-days above 5°C annually.

White and black spruce, jack pine, and tamarack are the main coniferous species. Broadleaf trees, particularly white birch, trembling aspen, and balsam poplar, are most numerous in the transitional section leading to the prairie grasslands. Black spruce and tamarack increase in dominance along the northerly sections of the ecozone.

Underlain by Cretaceous shales, this nearly level to gently rolling plain consists largely of hummocky to kettled glacial moraine and lacustrine deposits. The surface materials are usually deep and tend to mask the underlying topography. The soils of this ecozone are largely luvisols, which grade southward into black chernozems and northward into brunisols and organics. Wetlands, including peatlands with organic soils, cover between 25% and 50% of the ecozone.

Characteristic mammals include woodland caribou, mule and white-tailed deer, moose, wapiti (elk), coyote, black bear, marten, fisher, lynx, and chipmunk. Representative birds include boreal and great horned owl, blue jay, rose-breasted and evening grosbeak, Franklin’s gull, red-tailed hawk, and northern harrier. Pelican, cormorant, gull, heron, and tern are most prominent in this ecozone. The whooping crane, perhaps Canada’s most famous endangered species, nests in wetlands of Wood Buffalo National Park at the extreme north of the ecozone.

Technical Addendum I Derivation of SNAP Climate Projections

As described in the report, the Connectivity Project used climate model outputs provided by the Scenarios Network for Alaska Planning (SNAP). These models provided mean monthly precipitation and temperature projections for future decades. Their derivation is described below.

General Circulation Models (GCMs) are the most widely used tools for projections of global climate change over the timescale of a century. Periodic assessments by the Intergovernmental Panel on Climate Change (IPCC) have relied heavily on global model simulations of future climate driven by various emission scenarios. The IPCC uses complex coupled atmospheric and oceanic GCMs. These models integrate multiple equations, typically including surface pressure; horizontal layered components of fluid velocity and temperature; solar short-wave radiation and terrestrial infrared and long-wave radiation; convection; land surface processes; albedo; hydrology; cloud cover; and sea ice dynamics.

General Circulation Models include equations that are iterated over a series of discrete time steps as well as equations that are evaluated simultaneously. Anthropogenic inputs such as changes in atmospheric greenhouse gases can be incorporated into stepped equations. Thus, GCMs can be used to simulate changes that may occur over long time frames because of the release of excess greenhouse gases into the atmosphere.

Greenhouse-driven climate change represents a response to radiative forcing that is associated with increases of carbon dioxide, methane, water vapor, and other gases, as well as associated changes in cloudiness. The response varies widely among models because it is strongly modified by feedbacks involving clouds, the cryosphere, water vapor, and other processes whose effects are not well understood. Thus far, changes in radiative forcing associated with increasing greenhouse gases have been small relative to existing seasonal cycles. Hence, the ability of a model to accurately replicate seasonal radiative forcing is a good test of its ability to predict anthropogenic radiative forcing.

Different coupled GCMs have different strengths and weaknesses, and some can be expected to perform better than others for northern regions of the globe. John Walsh et al. (2008) evaluated the performance of a set of fifteen global climate models used in the Coupled Model Intercomparison Project. Using the outputs for the A1B (intermediate) climate change scenario, Walsh et al. calculated the degree to which each model’s output concurred with actual climate data for the years 1958–2000 for each of three climatic variables (surface air temperature, air pressure at sea level, and precipitation) for three overlapping regions (Alaska only, 60–90° north latitude, and 20–90° north latitude).

The core statistic of the validation was a root-mean-square error (RMSE) evaluation of the differences between mean model output for each grid point and calendar month, and data from the European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis, ERA-40. The ERA-40 directly assimilates observed air temperature and sea level pressure observations into a product spanning 1958–2000. Precipitation is computed by the model used in the data assimilation. The ERA-40 is one of the most consistent and accurate gridded representations of these variables available.

To facilitate GCM intercomparison and validation against the ERA-40 data, all monthly fields of GCM temperature, precipitation, and sea level pressure were interpolated to the common 2.5° × 2.5° latitude/longitude ERA-40 grid. For each model, Walsh et al. calculated RMSEs for each month, each climatic feature, and each region; then added the 108 resulting values (12 months × 3 features × 3 regions) to create a composite score for each model. A lower score indicated better model performance.
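The composite scoring scheme amounts to summing one RMSE per (month, variable, region) combination. A minimal sketch follows; the dictionary layout and key names are illustrative assumptions, not the actual code used in the Walsh et al. analysis:

```python
import numpy as np

def rmse(model, reference):
    """Root-mean-square error between a model field and a reference field."""
    model = np.asarray(model, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean((model - reference) ** 2)))

def composite_score(model_fields, era40_fields):
    """Sum of RMSEs over every (month, variable, region) combination.

    Both arguments are dicts keyed by (month, variable, region) tuples,
    each value a gridded field already interpolated to a common grid.
    With 12 months x 3 variables x 3 regions this sums 108 RMSE values;
    a lower total indicates closer agreement with the reanalysis.
    """
    return sum(rmse(model_fields[key], era40_fields[key]) for key in era40_fields)
```

Because the score is a plain sum, a model must perform reasonably well across all months, variables, and regions to rank highly; a single badly simulated field inflates the total.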

The specific models that performed best over the larger domains tended to be the ones that performed best over Alaska. Although biases in the annual mean of each model typically accounted for about half of the models’ RMSEs, the systematic errors differed considerably among the models. There was a tendency for the models with the smaller errors to simulate a larger greenhouse warming over the Arctic, as well as larger increases of arctic precipitation and decreases of arctic sea level pressure, when greenhouse gas concentrations were increased.

Since several models had substantially smaller systematic errors than the other models, the differences in greenhouse projections implied that the choice of a subset of models might offer a viable approach to narrowing the uncertainty and obtaining more robust estimates of future climate change in regions such as Alaska. Thus, SNAP selected the five best-performing models out of the fifteen: ECHAM5 (Germany), GFDL2.1 (United States), MIROC3.2 (Japan), HADLEY3 (UK), and CGCM3.1 (Canada). These five models are used to generate climate projections independently, as well as in combination, to further reduce the error associated with dependence on a single model.

Because of the enormous mathematical complexity of GCMs, they generally provide only large-scale output, with grid cells typically 1°–5° latitude and longitude. For example, the standard resolution of HadOM3 is 1.25° in latitude and longitude, with 20 vertical levels, leading to approximately 1,500,000 variables.

Finer-scale projections of future conditions are not directly available. However, local topography can have profound effects on climate at much finer scales, and almost all land management decisions are made at much finer scales. Thus, some form of downscaling is necessary to make GCMs useful tools for regional climate change planning.

Historical climate-data estimates at 2 km resolution are available from PRISM (Parameter-elevation Regressions on Independent Slopes Model), which was originally developed to address the lack of climate observations in mountainous regions or rural areas. PRISM uses point measurements of climate data and a digital elevation model to generate estimates of annual, monthly, and event-based climatic elements. Each grid cell is estimated via multiple regression, using data from many nearby climate stations. Stations are weighted based on distance, elevation, vertical layer, topographic facet, and coastal proximity.
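The per-cell estimation idea can be illustrated as a weighted least-squares regression. This is only a sketch of the concept: the operational PRISM model uses many predictors and an elaborate weighting scheme, which are collapsed here into a single station weight and elevation as the lone predictor (both assumptions for illustration):

```python
import numpy as np

def prism_cell_estimate(station_values, station_elevations,
                        station_weights, cell_elevation):
    """Weighted-regression estimate of a climate value for one grid cell.

    Each nearby station contributes one observation, weighted by factors
    such as distance, elevation difference, topographic facet, and coastal
    proximity (collapsed here into a single weight per station); a local
    regression on elevation then predicts the value at the cell.
    """
    y = np.asarray(station_values, dtype=float)
    X = np.column_stack([np.ones_like(y), np.asarray(station_elevations, float)])
    W = np.diag(np.asarray(station_weights, dtype=float))
    # weighted least squares: solve (X^T W X) beta = X^T W y
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return float(beta[0] + beta[1] * cell_elevation)
```

Running one such regression per cell, with station weights recomputed for each cell's location, is what lets the method capture fine topographic structure that a simple interpolation of station values would miss.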

PRISM offers data at a fine scale useful to land managers and communities, but it does not offer climate projections. Thus, SNAP needed to link PRISM to GCM outputs. This work was done by John Walsh, Bill Chapman, et al. They first calculated mean monthly precipitation and mean monthly surface air temperature for PRISM grid cells for 1961–1990, creating PRISM baseline values. Next, they calculated GCM baseline values for each of the five selected models using mean monthly outputs for 1961–1990. They then calculated differences between projected GCM values and baseline GCM values for each year out to 2099 and created “anomaly grids” representing these differences. Finally, they added these anomaly grids to PRISM baseline values, thus creating fine-scale (2 km) grids for monthly mean temperature and precipitation for every year out to 2099. This method effectively removed model biases while scaling down the GCM projections.
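The anomaly step described above reduces to simple grid arithmetic, sketched below. The function and variable names are illustrative; in practice the coarse GCM fields must first be interpolated onto the 2 km PRISM grid before the arithmetic is applied:

```python
import numpy as np

def downscale_month(prism_baseline, gcm_baseline, gcm_future):
    """Anomaly-method downscaling for one month's field.

    prism_baseline : fine-scale (2 km) 1961-1990 monthly climatology
    gcm_baseline   : GCM 1961-1990 monthly climatology on the same grid
    gcm_future     : GCM field for the same month in a future year

    The anomaly (future minus baseline) carries the projected change;
    adding it to the PRISM baseline cancels any constant GCM bias,
    since the bias appears in both GCM terms and subtracts out.
    """
    anomaly = np.asarray(gcm_future, float) - np.asarray(gcm_baseline, float)
    return np.asarray(prism_baseline, float) + anomaly
```

Note how the bias removal works: if a GCM runs uniformly 3°C too warm, both `gcm_future` and `gcm_baseline` carry that offset, so the anomaly, and therefore the downscaled result, is unaffected.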

Based on this small-scale grid, SNAP offers statewide maps, available as Google Earth (KML) or GIS (ASCII) files, of mean monthly temperature and precipitation based on each of the five selected models and the means of all five (composite model). All of these maps are available for three different emissions scenarios, described by the IPCC: The A2 scenario assumes a heterogeneous world with high population growth and slow economic and technological change. The B1 scenario assumes a global population that peaks in mid-century and rapid change toward a service and information economy. The A1B scenario falls between these two, and assumes rapid economic growth, a global population that peaks in mid-century, rapid introduction of technologies that are more efficient, and a balance between fossil fuels and other energy.

SNAP models were assessed by generating model runs for past time periods and then analyzing the statistical relationship between real weather patterns and model outputs. GCM output for dates in the past is not linked in any way to historical weather station data. Hence, this validation provides an excellent means for characterizing both the strengths and weaknesses of the models. However, we can expect that even the best model data will match real data only in terms of mean values, variability between seasons, variability between years, and long-term trends, not in day-to-day values.

SNAP compared GCM output to historical weather station data based on four different metrics: monthly mean values, seasonal (month-to-month) variability, annual (year-to-year) variability, and long-term climate change trends. Each of these comparisons was performed with both temperature data and precipitation data. Each of the analyses was based on data from 32 Alaska weather stations for the period 1980–2007. Data were obtained from the Western Region Climate Center. The requirement for selection of a climatic station was that no more than 5% of the monthly values could be missing.

Overall, SNAP models performed well when analyzed for concurrence with measured climate data with respect to monthly mean values, seasonal variability, annual variability, and long-term climate change trends. In general, SNAP models mimicked real weather patterns less accurately for precipitation than for temperature. This may be due to the innate variability of precipitation across space and time.

In cases where estimating mean values is more important than capturing climate variability, using a composite of all five models is likely to yield the most robust results. Of the three available emissions scenarios, true emissions appear to be tracking somewhere between the A2 (pessimistic) and A1B (midrange) scenarios. Thus, as a conservative choice, the composite model for the A1B scenario was used for the Connectivity Project.

Technical Addendum II Identifying the Alpine Biome

Given that the six Alaska biomes derived from Nowacki et al. (2001) do not capture the ecological conditions unique to the steep elevational gradients of high mountains, we created a seventh biome defined by elevation and location as well as by climate envelope. We reasoned that, unlike pixels in other biomes, alpine zones cannot shift north or south across the landscape, but are locked to the mountain ranges in which they occur. Based on the literature (Körner and Paulsen 2004), we estimated that alpine zones would shrink upward by over one meter per year, or as much as 150 meters by the end of the century.

We encountered difficulties when attempting to define accurately all current alpine zones in the state. Tree line is reached at different elevations for mountain ranges at different latitudes and different longitudes, making it impossible to easily extrapolate alpine zones from existing data sets. We expected to address this problem by using training data from all different ranges and, potentially, to incorporate a latitudinal stratification. To map the Alpine biome (in addition to the six Alaska biomes already described), we first obtained maps of known alpine regions. Unfortunately, no comprehensive map of all such areas was available, and we were limited to data for three separate regions around the state. We used maps for the Denali region, the Kenai Peninsula, and the Ez7071 region defined in the LANDFIRE project, which includes much of the Brooks Range (LANDFIRE 2010). To represent the transition of expanding tree line with time, we added elevation as a variable, using DEM data. We then simulated this movement by modeling tree line rise for each biome-shift time step. We modeled a rise of 40 m by 2039, 80 m by 2069, and 150 m by 2099 (Juday et al. 1999; Kullman 2000). While promising, the Alpine zones shown in Figure II-1 are incomplete, omitting mountain ranges in the northwestern and southeastern parts of the state, among other regions. These results were too limited to be incorporated into other modeling efforts, but they help illustrate potential change and the need for an alpine geospatial data layer for the state of Alaska.
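The tree-line time steps amount to an elevation threshold that rises with each period. The DEM cells and today's local tree-line elevation below are hypothetical; only the 40/80/150 m rises come from the text:

```python
import numpy as np

treeline_rise = {2039: 40, 2069: 80, 2099: 150}   # metres above today's tree line
dem = np.array([[600, 750],
                [830, 1200]])                      # hypothetical elevations (m)
treeline_today = 700                               # hypothetical local tree line

for year, rise in sorted(treeline_rise.items()):
    alpine = dem > treeline_today + rise           # cells still above tree line
    print(year, int(alpine.sum()))                 # alpine area shrinks over time
```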

Figure II-1 Projected change in the Alpine zone over time

The Alpine biome, which is shown in purple and overlaid on the other six Alaska biomes, is projected to rise 150 meters in elevation by 2099, thus markedly decreasing in total area.

SNAP2010FigureII-1.gif

Literature cited

Juday, G. P., Barber, V., Berg, E., and Valentine, D. (1999). Recent dynamics of white spruce tree line forests across Alaska in relation to climate. Finnish Forest Research Institute Research Papers 734:165–187.

Körner, C., and J. Paulsen (2004). A world-wide study of high altitude tree line temperatures. J. Biogeogr. 31:713–732.

Kullman, L. (2000). Tree-limit rise and recent warming: a geoecological case study from the Swedish Scandes. Norsk Geografisk Tidsskrift–Norwegian Journal of Geography 54:49–59.

Technical Addendum III LANDFIRE Mapping of Units 70 and 71 (ez7071)

Although the LANDFIRE vegetation mapping project was not complete at the time we undertook this first stage of the Connectivity Project, we tested two units for which complete land cover data were already available in draft form—units 70 and 71, located in the Brooks Range.

In this region, map legends were consistent across grouped map zones approximating our biomes. However, the classification system used by LANDFIRE in this area included 46 classes, with 6 non-vegetated classes, at a spatial resolution or pixel size of 30 × 30 meters.

Although Random Forests™ works with many variables and classes, our sample sizes (number of pixels per class) would have been too low with all 46 classes. Thus, we regrouped them into 20 classes (Table III-1). Some combinations were straightforward: for example, all seven alpine classes were grouped together, and all four wetland types were bundled. It should be noted that different class combinations would lead to different results.

In grouping these categories into a smaller number, to provide better statistical support for Random Forests™, we encountered a few difficulties. For example, recently burned areas were lumped into a single “Recently burned forest and woodland” class, regardless of which forest type they might eventually return to. White and black spruce, however, remained in distinct classes.

Next, we resampled data into a 5 km lattice and derived a climatic envelope (temperature and precipitation) for each of the 20 classes. We used climate projections, as in the biome modeling exercise, to find the best-fitting climatic envelope for each 5 km grid square for future dates. We then matched these new climatic envelopes with the corresponding vegetation types in units 70 and 71.

While Random Forests™ can be used to fit LANDFIRE vegetation classes statewide, based on this small training area, the results are obviously unrealistic, as can be seen in Figure III-1. There is too little climate variability in units 70 and 71 to provide adequate training for regions as diverse as the Aleutians, the Arctic, and Southeast Alaska. Thus, the resulting map shows large areas of single vegetation types (generally representing more generalist species and cover types), in contrast to the high spatial variability of the training area. Nevertheless, this approach may prove worthwhile when statewide training data are available and have been assessed for accuracy.

Table III-1 Grouping of the original LANDFIRE classes into 20 land cover classes for modeling

 

LANDFIRE CLASS NAME | Connectivity Combined Group #
Open Water | 1
Perennial Ice/Snow | 2
Developed, Low Intensity | 3
Developed, Medium Intensity | 3
Barren | 4
Recently Burned Forest and Woodland | 5
Western North American Boreal White Spruce Forest | 6
Western North American Boreal Tree line White Spruce Woodland | 6
Western North American Boreal Spruce-Lichen Woodland | 7
Western North American Boreal White Spruce-Hardwood Forest | 7
Western North American Boreal Mesic Black Spruce Forest | 7
Western North American Boreal Mesic Birch-Aspen Forest | 8
Western North American Boreal Dry Aspen-Steppe Bluff | 8
Western North American Boreal Subalpine Balsam Poplar-Aspen Woodland | 9
Alaska Sub-boreal Avalanche Slope Shrubland | 10
Alaska Sub-Boreal Mesic Subalpine Alder Shrubland | 10
Western North American Boreal Mesic Scrub Birch-Willow Shrubland | 11
Western North American Sub-boreal Mesic Bluejoint Meadow | 11
Western North American Boreal Active Inland Dune | 4
Western North American Boreal Montane Floodplain Forest and Shrubland | 12
Western North American Boreal Lowland Large River Floodplain Forest and Shrubland | 12
Western North American Boreal Riparian Stringer Forest and Shrubland | 13
Western North American Boreal Shrub and Herbaceous Floodplain Wetland | 13
Western North American Boreal Herbaceous Fen | 14
Western North American Boreal Sedge-Dwarf-Shrub Bog | 14
Western North American Boreal Low Shrub Peatland | 15
Western North American Boreal Black Spruce Dwarf-tree Peatland | 15
Western North American Boreal Black Spruce Wet-Mesic Slope Woodland | 16
Western North American Boreal Black Spruce-Tamarack Fen | 16
Western North American Boreal Deciduous Shrub Swamp | 17
Western North American Boreal Freshwater Emergent Marsh | 17
Western North American Boreal Wet Meadow | 17
Western North American Boreal Freshwater Aquatic Bed | 17
Western North American Boreal Low Shrub-Tussock Tundra | 18
Western North American Boreal Tussock Tundra | 18
Western North American Boreal Wet Black Spruce-Tussock Woodland | 18
Western North American Boreal Alpine Dwarf-Shrub Summit | 19
Western North American Boreal Alpine Mesic Herbaceous Meadow | 19
Western North American Boreal Alpine Dryas Dwarf-Shrubland | 19
Western North American Boreal Alpine Ericaceous Dwarf-Shrubland | 19
Western North American Boreal Alpine Dwarf-Shrub-Lichen Shrubland | 19
Western North American Boreal Alpine Floodplain | 19
Alaska Sub-boreal and Maritime Alpine Mesic Herbaceous Meadow | 19
Alaska Sub-boreal White-Lutz Spruce Forest and Woodland | 20
Alaska Sub-boreal White Spruce-Hardwood Forest | 20

 

Figure III-1 Extrapolating statewide, based on vegetation classifications for units 70 and 71

As can be seen from this figure, it is highly unrealistic to use climate envelope modeling to extrapolate vegetation shifts outside the “training zone” used by Random Forests™: where there is no good fit, a few vegetation categories dominate across broad areas.

SNAP2010FigureIII-1.png

 

Technical Addendum IV Defining Biomes and Modeling Biome Shifts in Random Forests™

Reconciling Canadian and Alaskan data

The biomes used for this project were derived from the Unified Ecoregions identified by Nowacki et al. (2001), and from the six Canadian ecozones described in detail in the Appendix. However, since the data available for these ecozones were not entirely congruent with the data available for Alaska, some adjustments had to be made to effectively model ecozones and biomes together.

The six Alaska biomes were defined by biophysical characteristics rather than by land area and, therefore, varied in size. The six Canadian ecozones also varied in size and were within roughly the same size range, so could be considered congruent in this sense (Figure 2). However, our Alaska climate data from SNAP were all available at 2 km resolution, as described in the report. Climate data for ecozones were available only as mean values for each ecoregion within an ecozone, offering much lower resolution than the Alaska data. To get around this problem, we overlaid a 2 km grid on the Canadian data, and assigned values to each grid cell based on the ecoregion in which it fell. We downloaded the data in GIS format from the following websites:

Thus, every pixel in an ecozone would be assigned climate values for the ecoregion in which it occurred, giving Canadian data the same weight as Alaskan data, from the point of view of Random Forests™’ climate-envelope modeling. This approach introduced some degree of unavoidable error, however, since the mean values for ecoregions would be far less likely to contain outliers than the individual pixel values for Alaska. In other words, the Alaska climate-envelope data for each biome had a much greater range and standard deviation than the Canadian data.
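The pixel-assignment step can be illustrated with a toy lookup; the arrays and values below are hypothetical. It also makes the variance issue visible: every pixel in an ecoregion inherits the same mean, so the Canadian "envelopes" show none of the within-region spread present in the pixel-level Alaska data.

```python
import numpy as np

# Hypothetical inputs: for each 2 km cell, the id of the ecoregion it falls
# in, plus one mean climate value per ecoregion (e.g. mean June temperature).
ecoregion_id = np.array([[0, 0, 1],
                         [0, 1, 1]])
region_means = np.array([-14.0, -9.5])

# Fancy indexing: every cell inherits its region's mean value.
cell_values = region_means[ecoregion_id]

# Within any one region the spread collapses to zero, unlike real pixel data.
print(cell_values)
```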

Since Random Forests™ assigns pixels to future climate envelopes based on mean values and best fit, outliers should not be of great concern. Pixels that are atypical for a particular biome will have little effect on the projected movement of that biome. However, the greater variability in the Alaska data may have played a role in cases where the model had trouble finding any best fit among the available biome choices.

A perhaps more important source of error implicit in this technique arises from the fact that the Canadian “baseline” data were derived from climate normals of actual weather station data for the years 1971–2000. The Alaska baseline data, on the other hand, used model projections for the decade 2000–2009. Thus, the two data sets are both temporally separated and based on different methodologies. The SNAP modeled data, when assessed by SNAP, perform similarly to weather station data for Alaska. However, the different times for the baseline may play a significant role in the definition of “current” climate envelopes, thus skewing results.

Defining climate envelopes in a continuous landscape

Although Alaska biomes and Canadian ecozones are based on identifiable ecological areas, there is invariably some overlap between the species cohorts present and other characteristics of each biome, particularly at the boundaries. Likewise, there is overlap between biomes, and high variability within biomes, in the expected climate variables used as model inputs (mean June and December temperature and precipitation). How, then, can we make landscape predictions based on the relationship between climate variables and biome classifications?

Random Forests™ is able to deal with the inherent fuzziness of the data through repeated sampling. The model grows many classification trees (many trees creating a “forest”) from training data. The model then randomly samples multiple training cases, with replacement at each run. A random sample of a subset of input variables is chosen at each node to define the best split (branch). Each tree is grown to the largest extent possible (with no pruning of data). Finally, the best classification (best match) is chosen from among all the trees grown. There are several advantages to this method. First, the model does not overfit the training data. Second, it automatically estimates the importance of input variables. Thus, if some input variables make very little difference in the outcome of the model (which match is selected as the best), they will be given very little weight. In addition, unused training data allow cross-validation for an unbiased estimate of the classification error, meaning there is no need to create separate categories of data to reserve for validation.
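SNAP used the Random Forests™ algorithm of Breiman and Cutler; the sketch below reproduces the same ingredients (bootstrap samples per tree, a random subset of variables tried at each split, out-of-bag error, automatic variable importance) using scikit-learn's analogous classifier on synthetic data. Nothing here is SNAP data; the four columns merely stand in for climate variables such as June/December temperature and precipitation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic training set: 1000 "pixels", four climate-like variables,
# and a "biome" label driven mostly by the first variable.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=500,      # many trees, each grown on a bootstrap sample
    max_features="sqrt",   # random subset of variables tried at each split
    oob_score=True,        # out-of-bag cases give a built-in error estimate
    random_state=0,
).fit(X, y)

print(round(forest.oob_score_, 3))   # accuracy estimate with no held-out set
print(forest.feature_importances_)   # unimportant variables get near-zero weight
```

The out-of-bag score is the "unused training data" cross-validation described above, and the importance vector shows variables 1 and 3 (which play no role in the synthetic label) receiving very little weight.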

As described in the body of the report, for this pilot effort we created a 5 km grid by resampling the original 2 km data, in order to reduce the number of pixels and thus the overall time for each modeling run. In all, our analysis included approximately 120,000 pixels.

Literature cited

Nowacki, G., P. Spencer, M. Fleming, T. Brock, and T. Jorgenson (2001). Ecoregions of Alaska: 2001. U.S. Geological Survey Open-File Report 02-297 (map).

Technical Addendum V Caribou Modeling by Herd or Season

At the start of the Connectivity Project, caribou biologists cautioned that generalized climate-envelope modeling would be challenging, given the highly adaptive nature of the species and their tendency to periodically alter their preferred habitat locations. Their predictions proved accurate, though we believe that the total range analysis provided in the body of the text does add value to understanding potential changes in response to global climate change. This technical addendum describes the modeling approaches attempted and the lessons learned through the process. Our initial intent in modeling caribou was to examine their ranges on a herd-by-herd and seasonal basis. However, it quickly became clear that distribution data for caribou in Alaska are variable, with many herds being represented solely by a coarse “total range” polygon. This coarse distribution data further limited our ability to develop appropriate climate-envelope metrics. Ideally, distribution for winter range, summer range, and calving areas would all be included.

Our next approach was to cluster the herds into northern and southern groups, and examine summer and winter ranges in this combined format. Northern herds tend to be large and migrate long distances in arctic habitats. About 75% of the total estimated harvest in Alaska (primarily by subsistence users) comes from the four northern herds, which are all migratory to some extent and calve on the coastal plain or the adjacent northern foothills of the Brooks Range at roughly similar latitude. In contrast, southern herds cover a wide range of habitats (interior to coastal) over a broad range of latitude. Southern herds vary from tiny sedentary groups to a few migratory herds that have irrupted (e.g., Mulchatna) and even an insular herd at Adak in the Aleutian Islands, which was transplanted from Nelchina stock. We excluded the Adak herd (non-typical habitat) from analysis to avoid strong geographic leveraging of the spatial models. Relatively few herds had winter ranges defined, compared with summer ranges (Table V-1).

We used Random Forests™ to predict future summer and winter ranges for northern and southern herds based on climate envelopes. Of the 33 herds for which there is some level of range use defined, there were two northern herds (Central Arctic and Western Arctic) and two southern herds (Mulchatna and Nelchina) where summer and winter ranges were identified independently (Table V-1). We used the contemporary ranges of these four herds to calibrate the herd range predictions associated with future climate envelopes (Figure V-1). These exercises reinforced the need for sufficient and appropriate spatial data for this type of predictive model.

The results of this mapping effort showed a decline in range size for northern and southern herds in the winter and for northern herds in the summer, but an increase in range size for southern herds in the summer. While these results were thought-provoking, a wide range of variables were left unaccounted for, and we could not readily explain the loss of summer habitat in the Arctic shown in the combined winter- versus summer-range predictive model (Figures V-2 and V-3). Because the sampling area used to define winter habitat ranges for caribou in Alaska was so small, it became increasingly apparent that the best approach, given the data available today, is to base predictions on combined herd ranges rather than on seasonal shifts (Figure 18). In the future, it might be best to derive a climate envelope from global caribou distribution, to better represent the adaptability of this species.

Table V-1 Seasonal ranges (updated 2008) used in modeling differences in northern and southern caribou herds in Alaska

 

Herd | Geography | Summer | Winter
Central Arctic | Northern | Y | Y
Porcupine | Northern | Y |
Teshekpuk | Northern | Y |
Western Arctic | Northern | Y | Y
Beaver Mountains | Southern | Y |
Chisana | Southern | Y |
Delta | Southern | Y |
Denali | Southern | Y |
Farewell-Big River | Southern | Y |
Fortymile | Southern | Y |
Fox River | Southern | | Y
Galena Mountain | Southern | Y |
Kenai Lowlands | Southern | Y |
Kenai Mountains | Southern | | Y
Macomb | Southern | Y |
Mentasta | Southern | Y |
Mulchatna | Southern | Y | Y
Nelchina | Southern | Y | Y
Northern Peninsula | Southern | Y |
Nushagak Peninsula | Southern | Y |
Rainy Pass | Southern | Y |
Ray Mountains | Southern | Y |
Southern Peninsula | Southern | Y |
Sunshine Mountains | Southern | Y |
Tonzona | Southern | Y |
Twin Lakes | Southern | | Y
Unimak | Southern | Y |
White Mountains | Southern | Y |
Wolf Mountain | Southern | Y |

 

 

Figure V-1 Available range information for southern herds in 2008

SNAP2010FigureV-1.png

Figure V-2 Projected caribou range for all herds combined

Although caribou are expected to persist statewide, climate modeling indicates that ranges may shrink, particularly summer ranges in the Arctic.

SNAP2010FigureV-2.png

 

Figure V-3 Range size of all herds as a percentage of the state

Overall, this analysis, based on only a portion of total caribou range areas, suggests a 64% decrease in summer range and a 47% decrease in winter range.

SNAP2010FigureV-3.png

How does climate directly and indirectly affect caribou movements? Temperature thresholds may impact physiology, disease, parasites, and icing events that reduce available forage, and long-term climate change may impact habitat and available forage, which in turn impacts fitness. The Boreal ALFRESCO model developed by UAF is designed to predict how fire and vegetation will change with climate. The ALFRESCO model was used to predict winter habitat for the Nelchina caribou herd (Rupp et al. 2006), so we considered combining the models. Because the two modeling approaches rely on the same climate data, it seemed reasonable to use the ALFRESCO results for good winter caribou habitat as a filter over the winter herd climate-envelope results. This exercise produced a further decrease in available habitat; however, because we determined that winter caribou-range data in the Random Forests™ approach were too narrow, none of those results is shown here.

Literature cited

Rupp, T.S., M. Olson, L.G. Adams, B.W. Dale, K. Joly, J. Henkelman, W.B. Collins, and A.M. Starfield (2006). Simulating the influences of various fire regimes on caribou winter habitat. Ecological Applications 16:1730–1743.

Technical Addendum VI Derivation of Marmot Model

Potential climate effects on Alaska marmots are largely unknown, although studies on yellow-bellied marmots that occupy high-elevation habitats in Colorado have shown marked climate change impacts on hibernation (Inouye et al. 2000). Our modeling efforts were based on a series of facts, hypotheses, and assumptions taken from the literature (Gunderson et al. 2009) and from expert participants in our workshops:

  • Alaska marmots occur in the Brooks Range as well as in more southerly and westerly mountain ranges in Alaska. In more southern regions of the state, competition occurs between the Alaska marmot and the hoary marmot (Marmota caligata), but it is unclear whether the two species have non-overlapping ranges, and if so, where the boundary between them lies.
  • Alaska marmots are generally found only in alpine areas; however, “alpine” is poorly defined in Alaska, and is not available as a distinct GIS layer.
  • Marmots are generally found only in areas of “rough” terrain, i.e., where there are steep slopes and plenty of open rocks for cover.
  • Known occurrences of Alaska marmots (Figure VI-1) are likely to under-represent the true range of the species’ habitat.
  • Warming climate would tend to cause alpine zones to shrink through vegetative encroachment.
  • Warming climate might have some positive impacts on marmot survival through improved overwinter survival, although changes in snowpack might reduce the area of denning habitat, and total habitat area generally declines as habitats shift to higher elevations.

Based on the criteria just listed, we first attempted to model shifts in marmot habitat based on our model of changes in the Alpine zone (see Technical Addendum II, Modeling the Alpine Biome). However, this proved unsatisfactory due to the uncertainties inherent in our estimated Alpine distribution. Because we were almost certainly missing large areas of alpine terrain, we were likely to be under-representing marmot habitat. In addition, as can be seen in Figure VI-1, known occurrences of marmots do not perfectly match modeled alpine zones. This may be because our Alpine biome was missing data from several mountain ranges, or it may be because “Alpine” includes large areas of snow and ice, typical at the highest elevations, as well as the type of terrain preferable to marmots (rocky and high elevation, but not ice-bound year-round).

Figure VI-1 Known Alaska marmot occurrences overlaid with the modeled Alpine zone
SNAP2010FigureVI-1.png

Thus, in later modeling efforts we shifted to a new map layer to constrain potential marmot habitat: terrain roughness (Figure VI-2). Terrain roughness was calculated based on digital elevation mapping (DEM). Terrain is considered “rough” where the change in elevation and slope from one pixel to the next is high. Therefore, roughness can be loosely equated with variable steepness and variable aspect, typical of rocky slopes and summits. When known occurrences of marmots were overlain with areas of high terrain roughness (Figure VI-3), a good match was found. Using roughness as a parameter enabled us not only to constrain the potential habitat of marmots in our future predictions, but also to expand it within current and future time periods. In other words, we recognized that the scattered points available as verified marmot sightings do not encompass all locations at which marmots are likely to be found. Coupling DEM roughness with a climate envelope allowed us to avoid grossly overestimating the species’ likely range, while still generalizing and extrapolating marmot habitat beyond the limits of these points (Figure VI-3). This approach could prove useful for inventory efforts that seek to locate Alaska marmots.
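The report does not give SNAP's exact roughness formula. One standard DEM roughness index with the behaviour described (large where elevation changes sharply between adjacent cells) is the maximum absolute elevation difference to the eight neighbours; the tiny DEM below is invented for illustration:

```python
import numpy as np

def roughness(dem):
    """Maximum absolute elevation difference between each cell and its
    8 neighbours (edge cells compare against replicated edge values).
    This is one common roughness index, not necessarily SNAP's formula."""
    padded = np.pad(dem, 1, mode="edge")
    n_rows, n_cols = dem.shape
    diffs = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == dj == 0:
                continue
            shifted = padded[1 + di:1 + di + n_rows, 1 + dj:1 + dj + n_cols]
            diffs.append(np.abs(shifted - dem))
    return np.max(diffs, axis=0)

flat = np.zeros((5, 5))
ridge = np.outer(np.ones(5), [0.0, 0.0, 100.0, 0.0, 0.0])  # a steep ridge
print(roughness(flat).max(), roughness(ridge).max())  # flat terrain 0, ridge 100
```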

Figure VI-2 Terrain roughness

Roughness was calculated from DEM data, based on change in slope between adjacent pixels. Note that this metric serves as a good proxy for estimating the presence of mountainous zones (seen in red on this map).

SNAP2010FigureVI-2.png

 

Figure VI-3 Potential marmot habitat with terrain roughness as a factor

Known occurrences of marmots are overlaid with modeled contemporary habitat.

SNAP2010FigureVI-3.png

Our model results predicted significant decrease in marmot habitat over time, as explained in the text. How such change may impact the connectivity of marmot habitat depends on total marmot populations and maximum dispersal distances, as well as the following, calculable from our modeling results:

  • Change in total habitat area
  • Change in mean patch size
  • Change in mean distance between patches
  • Change in mean distance to patch of some minimum size
  • Percentage of patches (or of total habitat) more than some maximum distance from nearest other patch (i.e., % of isolated habitat)
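Several of the metrics listed above can be computed directly from a habitat grid with connected-component labelling. The sketch below uses SciPy; the grid, the 25 km² cell area (for 5 km pixels), and the 4-neighbour connectivity rule are illustrative assumptions, not values from the report:

```python
import numpy as np
from scipy import ndimage

def patch_stats(habitat, cell_area_km2=25.0):
    """Patch count, total habitat area, and mean patch size for a boolean
    habitat grid. Assumes 5 km pixels (25 km^2) and 4-neighbour patches."""
    labels, n_patches = ndimage.label(habitat)
    if n_patches == 0:
        return 0, 0.0, 0.0
    sizes = ndimage.sum(habitat, labels, index=range(1, n_patches + 1))
    total_area = float(habitat.sum()) * cell_area_km2
    mean_patch = float(sizes.mean()) * cell_area_km2
    return n_patches, total_area, mean_patch

grid = np.array([[1, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 1, 0, 1]], dtype=bool)
print(patch_stats(grid))  # 3 patches covering 125 km^2 in total
```

Running the same function on current and projected habitat grids gives the "change in total habitat area" and "change in mean patch size" metrics directly; inter-patch distances would need an additional nearest-neighbour step.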

Further research is needed to determine distributions of existing Alaska marmot populations, the nature and boundaries of competitive exclusion with hoary marmots, dispersal ability, and reproductive capacity. In addition, data on litter size would help to elucidate which habitat types allow for the greatest reproductive success. This exercise also highlighted the need for research on how spatial scales influence uncertainties in climate envelope models. Using 34 point locations to derive a species’ climate envelope may be inappropriate; however, it is unclear what level of training data provides an adequate minimum sample for this type of modeling.

Literature cited

Gunderson, A.M., B.K. Jacobsen, and L.E. Olson (2009). Revised distribution of the Alaska marmot, Marmota broweri, and confirmation of parapatry with hoary marmots. Journal of Mammalogy 90(4):859–869.

Inouye, D.W., B. Barr, K.B. Armitage, and B.D. Inouye (2000). Climate change is affecting altitudinal migrants and hibernating species. PNAS, 2000.

Technical Addendum VII Derivation of Swan Model

The trumpeter swan data used in our model (Figure VII-1) were provided by Debbie Groves (USFWS – Juneau) via Bob Platte (USFWS – Anchorage), and were based on the 2005 Trumpeter Swan Survey, a census flown every five years in August.

These data were complete within the zone of the survey. However, trumpeter swans were distinguished from tundra swans based on habitat rather than on morphological characteristics, with all swans in forested regions assumed to be trumpeters. Tundra swans were not surveyed. Some tundra swans may have been misidentified as trumpeter swans, and vice versa. No clear information is available to indicate whether competitive exclusion or habitat limitations are responsible for this partitioning of habitat. Based on the available data, it is impossible to determine exactly where the boundary between species lies and how it is likely to shift with climate change, unless we assume that it will shift at the same rate that vegetation (forest) shifts. In our modeling of potential biome shifts, we can only show when temperature and precipitation patterns are suitable for a forested biome, not the actual movement of vegetation within each biome. As a proxy for competition with tundra swans, we used the predicted occurrence of a forested biome (Alaska Boreal, Boreal Transition, Boreal Plains, Taiga Plains, Montane Cordillera, Pacific Maritime, or North Pacific Maritime) as a limiting factor (“mask”) when modeling future habitat for this species. In today’s non-forested regions of the state, the projected expansion of trumpeter swans likely overestimates the rate at which they could occupy those areas, if a link to forested landscapes is indeed a requirement.

Based on the literature (Bellrose 1980), we determined the following:

  • Trumpeter swans re-establish territories as soon as the ice goes out in May or June; they often re-use old nest sites (high nest fidelity), so about 4 days are needed to construct or repair a nest.
  • Trumpeter swans lay an egg every other day until the clutch is complete; mean clutch is 5, so we can assume egg-laying takes 10 days.
  • Incubation varies from 33–37 days.
  • Cygnet development from hatching to flying takes 13–15 weeks (91–105 days).
  • Total time from nest initiation to flying away is 138 to 156 days.

In sum, this implies that 138 ice-free days are needed to fledge cygnets reliably and successfully. Thus, we used SNAP data on ice-free days as a limiting factor (“mask”) for delineation of swan habitat. As noted in the report, SNAP’s estimation of this parameter is likely to be inexact, but any significant skew in the presumed date of thaw due to lag time between temperatures of about zero and open water is likely to be cancelled out by an equivalent lag in autumn freeze-up.
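The 138-to-156-day window follows directly from the stage durations listed above:

```python
# Stage durations in days, taken from the report; (min, max) bounds per stage.
stages = {
    "nest building or repair": (4, 4),
    "egg laying (mean clutch of 5, one egg every other day)": (10, 10),
    "incubation": (33, 37),
    "hatching to flight (13-15 weeks)": (91, 105),
}

low = sum(lo for lo, hi in stages.values())
high = sum(hi for lo, hi in stages.values())
print(low, high)  # 138 156: hence the 138 ice-free-day requirement
```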

Figure VII-1 Trumpeter swan sightings, 2005

Purple points indicate sightings of adult swans, and blue points indicate adults with young less than one year old.

SNAP2010FigureVII-1.png

In our modeling efforts, we originally attempted to use brood size as a variable, to map not only presence/absence but also reproductive success. Our hope was to be able to find some kind of gradient of reproductive success across the range of climate envelopes in habitat types that would indicate which habitats are optimal and which are marginal. However, initial model runs evidenced no such gradient. There appeared to be no clear links between brood size, climate, and habitat. Therefore, we decided to model based merely on presence/absence.

As described in the report, our results predicted an overall spread of trumpeter swan habitat as forested biomes encroach northward and westward. Further research could help clarify the role of competitive exclusion with tundra swans, as well as the effects of shifting climate and changing habitat on brood size.

Literature cited

Bellrose, F.C. (1980). Ducks, Geese and Swans of North America. Harrisburg, Pennsylvania: Stackpole Books.

Technical Addendum VIII Derivation of Reed Canary Grass Model

Reed canary grass, which is already well established in southcentral Alaska, is rapidly spreading across the state. The species may be difficult to limit or control, since mechanical removal is labor intensive and no herbicides are selective enough to be used in wetlands without the potential for injuring native species (ANHP 2006).

Reed canary grass has an extremely broad range, and little data exist in the literature on the limitations to the range of the grass based on climate factors. Given that the grass is known to disperse along road and river corridors (WRCGMWP 2009), our modeling efforts were based on the climate envelope of existing occurrences of the grass (June and December mean temperature and precipitation) as well as distance to roads (Figure VIII-1).

Existing occurrence of reed canary grass was modeled using data from AKEPIC. Approximately 5,000 records exist for the species, although even this is likely to be a gross underrepresentation of actual occurrence. Most sightings of the grass are recorded near roads and human habitation, and it is almost impossible to determine whether this is because the grass is spread along these corridors, or because no records have been taken from locations that are more far-flung. As with other species, even the best data accounting for the presence of a species are incomplete without data on the species’ absence, which are rarely recorded.
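The climate-envelope approach described above can be sketched as follows. This is a minimal illustration under stated assumptions: the occurrence values, the 10 km road-distance cutoff, and the function names are hypothetical; the report's actual model used gridded SNAP climate projections and GIS road layers rather than these toy records.

```python
# A cell is "suitable" if each climate variable falls within the range
# observed at known occurrence points (the climate envelope), and the
# cell lies within a cutoff distance of a road.
occurrences = [
    # (june_temp_C, dec_temp_C, june_precip_mm, dec_precip_mm) - hypothetical
    (12.0, -10.0, 40.0, 25.0),
    (14.5,  -6.0, 55.0, 30.0),
    (13.0,  -8.5, 48.0, 20.0),
]

# Envelope = per-variable (min, max) across all occurrence records.
envelope = [(min(col), max(col)) for col in zip(*occurrences)]

def climate_suitable(cell):
    """True if every climate variable of the cell is inside the envelope."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(cell, envelope))

def suitable(cell, road_dist_km, cutoff_km=10.0):
    """Combine the climate envelope with the distance-to-roads factor."""
    return climate_suitable(cell) and road_dist_km <= cutoff_km

print(suitable((13.5, -7.0, 50.0, 27.0), road_dist_km=3.0))   # True
print(suitable((13.5, -7.0, 50.0, 27.0), road_dist_km=40.0))  # False: too far from roads
print(suitable((20.0, -7.0, 50.0, 27.0), road_dist_km=3.0))   # False: June temp outside envelope
```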

Initially, we planned to incorporate distance to rivers as well, as a variable analogous to distance to roads. However, this proved problematic for several reasons. First, although GIS layers exist for Alaska water bodies, at the time we were preparing model runs, they were not coded in terms of direction of flow or connectivity between rivers and tributary streams. These data are now being added to publicly available GIS files, and might be useful for inclusion in further studies. Moreover, it was unclear how to consider variables such as stream type (glacial, non-glacial, fast or slow moving) and size. Alaska is rich in waterways and wetlands, but without this crucial connectivity data, we found it challenging to incorporate seed transmission along waterways as a variable. We did consider including waterways as a static factor, since the grass occurs almost exclusively along their banks and channels. However, waterways and wetlands are so ubiquitous across the state that including this layer put us at risk of overestimating the spread of the grass.

We also struggled with how to include roads most realistically. In our first iterations of the reed canary grass model, we used the existing road network as our roads layer. Later, we created new GIS layers to include RS2477 trails (official corridors that may or may not currently be in use) and proposed roads. These roads, although mapped and clearly described in regional DOT plans available online (State of Alaska 2009), are drawn longhand in existing plans, and did not exist as GIS layers. Thus, we had to create layers using estimates from the existing maps. Since it was not always clear when the proposed roads were expected to be built, we included all new roads in the final (2090s) map projections. For example, in the final decadal step, we added a probable road to Nome, which was supported by the Governor’s office at the time of our modeling, and a few other planned roads from publicly available DOT maps. These additions expanded the already increasing range distribution from the 2060s time step, especially through the Seward Peninsula (Figure VIII-1).

Figure VIII-1 Grass modeling using a more extensive transportation network

In this model iteration, the spread of reed canary grass was modeled using roads connected to the continental highway system plus projected roads, winter trails, and RS2477 historic trails. However, many of these corridors are seasonal, disconnected, or rarely used.


In our final iteration, in order to address the variable use of trails and seasonal roads, the fact that this network may be disconnected in both space and time, and the uncertainty of proposed roads, we reverted to including only all-season existing roads. Thus, our results are likely to represent a significant underestimation of potential distribution, based on both a cautious approach to road-based spread and a complete lack of water-assisted spread.

Although the sampling points along the Dalton Highway were too few to result in a modeled spread into the Arctic, it is clear that the Dalton Highway (Haul Road) could be a bottleneck or a conduit for the spread of reed canary grass into the Arctic, since it serves as a single artery to the south for a large network of winter roads (impassible in summer) associated with oil and gas development across the North Slope. Mitigation to slow the spread of this species and other undesirable invasives might include a truck-wash site or other disinfecting process, as is practiced for contracted equipment entering Denali National Park. Of course, such a measure would require extensive discussion between industry, DOT, and land managers. It will be important to manage the spread of this species along the highway corridor so that it does not become a threat in the Arctic.

Literature cited

ANHP (Alaska Natural Heritage Program) (2006). Non-Native Plant Species of Alaska: Reed canarygrass. Environment and Natural Resources Institute University of Alaska Anchorage. http://akweeds.uaa.alaska.edu/pdfs/s...os_PHAR_ed.pdf

State of Alaska (2009). Statewide and Area Transportation Plans. http://dot.alaska.gov/stwdplng/areap...LRTPHome.shtml

WRCGMWP (Wisconsin Reed Canary Grass Management Working Group) (2009). Reed Canary Grass (Phalaris arundinacea) Management Guide: Recommendations for Landowners and Restoration Professionals.
