Statistics: The Art and Science of Learning from Data
Christine A. Franklin, Bernhard Klingenberg, Alan Agresti
For courses in introductory statistics. Statistics: The Art and Science of Learning from Data takes a conceptual approach, helping students understand what statistics is about and learn the right questions to ask when analysing data, rather than just memorising procedures. This book takes the ideas that have turned statistics into a central science in modern life and makes them accessible, without compromising the necessary rigor. Students will enjoy reading this book, and will stay engaged with its wide variety of real-world data in the examples and exercises. The authors believe that it's important for students to learn and analyse both quantitative and categorical data. As a result, the text pays greater attention to the analysis of proportions than many other introductory statistics texts. Concepts are introduced first with categorical data, and then with quantitative data.
Year: 2017
Edition: 4
Publisher: Pearson
Language: English
Pages: 817
ISBN 10: 1292164778
ISBN 13: 9781292164779
File: PDF, 14.68 MB
Most frequent terms
sample^{3001}
probability^{1769}
variable^{1336}
proportion^{1254}
interval^{1165}
regression^{1159}
sampling^{996}
variables^{920}
standard deviation^{904}
confidence interval^{844}
statistic^{707}
sampling distribution^{647}
interpret^{626}
estimate^{588}
hypothesis^{581}
significance^{553}
correlation^{535}
samples^{510}
proportions^{485}
test statistic^{476}
statistics^{451}
observations^{447}
margin^{445}
explanatory^{430}
score^{417}
comparing^{410}
statistical^{394}
variability^{385}
sample size^{381}
distributions^{378}
software^{375}
income^{372}
slope^{370}
equals^{369}
null hypothesis^{361}
sample mean^{356}
equation^{354}
plot^{353}
population proportion^{341}
sample proportion^{340}
median^{338}
response variable^{335}
standard deviations^{333}
categorical^{333}
predicted^{323}
summary^{319}
percentage^{319}
quantitative^{315}
intervals^{314}
gender^{312}
probabilities^{308}
anova^{301}
cancer^{298}
randomly^{295}
random sample^{294}
inference^{290}
scores^{284}
prediction^{283}
squared^{283}
sided^{280}
Statistics: The Art and Science of Learning from Data
Fourth Edition, Global Edition

Alan Agresti, University of Florida
Christine Franklin, University of Georgia
Bernhard Klingenberg, Williams College
With Contributions by Michael Posner, Villanova University

Harlow, England • London • New York • Boston • San Francisco • Toronto • Sydney • Dubai • Singapore • Hong Kong • Tokyo • Seoul • Taipei • New Delhi • Cape Town • Sao Paulo • Mexico City • Madrid • Amsterdam • Munich • Paris • Milan

Editorial Director: Chris Hoag
Editor in Chief: Deirdre Lynch
Acquisitions Editor: Suzanna Bainbridge
Editorial Assistant: Justin Billing
Acquisitions Editor, Global Editions: Sourabh Maheshwari
Program Manager: Danielle Simbajon
Project Manager: Rachel S. Reeve
Assistant Project Editor, Global Editions: Vikash Tiwari
Senior Manufacturing Controller, Global Editions: Kay Holman
Program Management Team Lead: Karen Wernholm
Project Management Team Lead: Christina Lepre
Media Producer: Jean Choe
Media Production Manager, Global Editions: Vikram Kumar
TestGen Content Manager: Marty Wright
MathXL Content Manager: Robert Carroll
Product Marketing Manager: Tiffany Bitzel
Field Marketing Manager: Andrew Noble
Marketing Assistant: Jennifer Myers
Senior Author Support/Technology Specialist: Joe Vetere
Rights and Permissions Project Manager: Gina M. Cheselka
Procurement Specialist: Carol Melville
Associate Director of Design: Andrea Nix
Program Design Lead: Beth Paquin
Production Coordination, Text Design, Composition, and Illustrations: Integra Software Services Pvt Ltd.
Cover Design: Lumina Datamatics
Cover Image: Vipada Kanajod/Shutterstock.com

Acknowledgements of third-party content appear on page C1, which constitutes an extension of this copyright page.

PEARSON, ALWAYS LEARNING, and MYSTATLAB are exclusive trademarks owned by Pearson Education, Inc., or its affiliates in the U.S. and/or other countries.
Pearson Education Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England
and Associated Companies throughout the world

Visit us on the World Wide Web at: www.pearsonglobaleditions.com

© Pearson Education Limited 2018

The rights of Alan Agresti, Christine Franklin, and Bernhard Klingenberg to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Authorized adaptation from the United States edition, entitled Statistics: The Art and Science of Learning from Data, Fourth Edition, ISBN 9780321997838, by Alan Agresti, Christine Franklin, and Bernhard Klingenberg, published by Pearson Education © 2017.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a license permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd, Saffron House, 6–10 Kirby Street, London EC1N 8TS.

All trademarks used herein are the property of their respective owners. The use of any trademark in this text does not vest in the author or publisher any trademark ownership rights in such trademarks, nor does the use of such trademarks imply any affiliation with or endorsement of this book by such owners.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

10 9 8 7 6 5 4 3 2 1

ISBN 10: 1292164778
ISBN 13: 9781292164779

Typeset by Integra Software Services Pvt Ltd.
Printed and bound in Malaysia

Dedication

To my wife Jacki for her extraordinary support, including making numerous suggestions and putting up with the evenings and weekends I was working on this book.
ALAN AGRESTI

To Corey and Cody, who have shown me the joys of motherhood, and to my husband, Dale, for being a dear friend and a dedicated father to our boys. You have always been my biggest supporters.
CHRIS FRANKLIN

To my wife Sophia and our children Franziska, Florentina, Maximilian, and Mattheus, who are a bunch of fun to be with, and to Jean-Luc Picard for inspiring me.
BERNHARD KLINGENBERG

Contents

Preface

Part One: Gathering and Exploring Data

Chapter 1 Statistics: The Art and Science of Learning from Data
  1.1 Using Data to Answer Statistical Questions
  1.2 Sample Versus Population
  1.3 Using Calculators and Computers
  Chapter Summary; Chapter Problems

Chapter 2 Exploring Data with Graphs and Numerical Summaries
  2.1 Different Types of Data
  2.2 Graphical Summaries of Data
  2.3 Measuring the Center of Quantitative Data
  2.4 Measuring the Variability of Quantitative Data
  2.5 Using Measures of Position to Describe Variability
  2.6 Recognizing and Avoiding Misuses of Graphical Summaries
  Chapter Summary; Chapter Problems

Chapter 3 Association: Contingency, Correlation, and Regression
  3.1 The Association Between Two Categorical Variables
  3.2 The Association Between Two Quantitative Variables
  3.3 Predicting the Outcome of a Variable
  3.4 Cautions in Analyzing Associations
  Chapter Summary; Chapter Problems

Chapter 4 Gathering Data
  4.1 Experimental and Observational Studies
  4.2 Good and Poor Ways to Sample
  4.3 Good and Poor Ways to Experiment
  4.4 Other Ways to Conduct Experimental and Nonexperimental Studies
  Chapter Summary; Chapter Problems

Part Review 1 (Online)

Part Two: Probability, Probability Distributions, and Sampling Distributions

Chapter 5 Probability in Our Daily Lives
  5.1 How Probability Quantifies Randomness
  5.2 Finding Probabilities
  5.3 Conditional Probability
  5.4 Applying the Probability Rules
  Chapter Summary; Chapter Problems

Chapter 6 Probability Distributions
  6.1 Summarizing Possible Outcomes and Their Probabilities
  6.2 Probabilities for Bell-Shaped Distributions
  6.3 Probabilities When Each Observation Has Two Possible Outcomes
  Chapter Summary; Chapter Problems

Chapter 7 Sampling Distributions
  7.1 How Sample Proportions Vary Around the Population Proportion
  7.2 How Sample Means Vary Around the Population Mean
  Chapter Summary; Chapter Problems

Part Review 2 (Online)

Part Three: Inferential Statistics

Chapter 8 Statistical Inference: Confidence Intervals
  8.1 Point and Interval Estimates of Population Parameters
  8.2 Constructing a Confidence Interval to Estimate a Population Proportion
  8.3 Constructing a Confidence Interval to Estimate a Population Mean
  8.4 Choosing the Sample Size for a Study
  8.5 Using Computers to Make New Estimation Methods Possible
  Chapter Summary; Chapter Problems

Chapter 9 Statistical Inference: Significance Tests About Hypotheses
  9.1 Steps for Performing a Significance Test
  9.2 Significance Tests About Proportions
  9.3 Significance Tests About Means
  9.4 Decisions and Types of Errors in Significance Tests
  9.5 Limitations of Significance Tests
  9.6 The Likelihood of a Type II Error and the Power of a Test
  Chapter Summary; Chapter Problems

Chapter 10 Comparing Two Groups
  10.1 Categorical Response: Comparing Two Proportions
  10.2 Quantitative Response: Comparing Two Means
  10.3 Other Ways of Comparing Means, Including a Permutation Test
  10.4 Analyzing Dependent Samples
  10.5 Adjusting for the Effects of Other Variables
  Chapter Summary; Chapter Problems

Part Review 3 (Online)

Part Four: Analyzing Association and Extended Statistical Methods

Chapter 11 Analyzing the Association Between Categorical Variables
  11.1 Independence and Dependence (Association)
  11.2 Testing Categorical Variables for Independence
  11.3 Determining the Strength of the Association
  11.4 Using Residuals to Reveal the Pattern of Association
  11.5 Fisher’s Exact and Permutation Tests
  Chapter Summary; Chapter Problems

Chapter 12 Analyzing the Association Between Quantitative Variables: Regression Analysis
  12.1 Modeling How Two Variables Are Related
  12.2 Inference About Model Parameters and the Association
  12.3 Describing the Strength of Association
  12.4 How the Data Vary Around the Regression Line
  12.5 Exponential Regression: A Model for Nonlinearity
  Chapter Summary; Chapter Problems

Chapter 13 Multiple Regression
  13.1 Using Several Variables to Predict a Response
  13.2 Extending the Correlation and R² for Multiple Regression
  13.3 Inferences Using Multiple Regression
  13.4 Checking a Regression Model Using Residual Plots
  13.5 Regression and Categorical Predictors
  13.6 Modeling a Categorical Response
  Chapter Summary; Chapter Problems

Chapter 14 Comparing Groups: Analysis of Variance Methods
  14.1 One-Way ANOVA: Comparing Several Means
  14.2 Estimating Differences in Groups for a Single Factor
  14.3 Two-Way ANOVA
  Chapter Summary; Chapter Problems

Chapter 15 Nonparametric Statistics
  15.1 Compare Two Groups by Ranking
  15.2 Nonparametric Methods for Several Groups and for Matched Pairs
  Chapter Summary; Chapter Problems

Part Review 4 (Online)

Appendix (page A1)
Answers (page A7)
Index (page I1)
Index of Applications (page I9)
Credits (page C1)
A Guide to Learning From the Art in This Text (page D1)
Dataset Files (page D2)
A Guide to Choosing a Statistical Method (page D3)
Summary of Key Notations and Formulas (page D4)

An Introduction to the Web Apps

The book’s website, www.pearsonglobaleditions.com/agresti, links to several new and interactive web-based applets (or web apps) that run in a browser. These apps are designed to help students understand a wide range of statistical concepts and carry out statistical inference.
Many of these apps are featured (often including screenshots) in Activities throughout the book. The apps allow saving output (such as graphs or tables) for potential inclusion in homework or projects.
• The Random Numbers app generates uniform random numbers (with or without replacement) from a user-defined range of integer values and simulates flipping a (potentially biased) coin.
• The Mean vs. Median app allows users to add or delete points from a dot plot as the users explore the effect of outliers or skew on these two statistics.
• The Explore Categorical Data and Explore Quantitative Data apps provide basic statistics and plots for user-supplied data.
• The Explore Linear Regression app allows users to add or delete points from a scatterplot and observe how the regression line changes for different patterns or is affected by outliers. The Fit Linear Regression app allows users to supply their own data, fit a linear regression model and explore residuals.
• The Guess the Correlation app lets users guess the correlation for a given scatterplot (and find the correlation between guesses and the true values).
• The Binomial, Normal, t, Chi-square, and F Distribution apps visually explore the meaning of parameters for these distributions. Users can also find probabilities and percentiles and check them visually on the graph.
• The various Sampling Distribution apps generate sampling distributions of the sample proportion or the sample mean. These apps let users generate samples of various sizes from a wide range of distributions such as skewed, uniform, bell-shaped, bimodal, or custom-built. The apps display the population distribution, the data distribution of a randomly generated sample, and the sampling distribution of the sample mean or proportion. With the (repeated) click on a button, one can see how the sampling distribution builds up one simulated random sample at a time and, for large sample sizes, assumes a bell shape. Users can move sliders for sample size and various population parameters to see the effect on the sampling distribution. Chapter 7 shows many screenshots of these apps.
• The Inference for a Proportion and the Inference for a Mean apps carry out statistical inference. They provide graphs, confidence intervals and results from z- or t-tests for data supplied in summary or original form.
• The Explore Coverage app uses simulation to demonstrate the concept of the confidence coefficient, both for confidence intervals for the proportion and for the mean. Different sliders for true population parameter, sample size or confidence coefficient show their effect on coverage and width of confidence intervals.
• The Errors and Power app explores Type I and Type II errors and the concept of power visually and interactively. Users can move sliders to connect these concepts to sample size, significance level, and true parameter value for one-sample tests about proportions or means.
• The Inference Comparing Proportions and the Inference Comparing Means apps construct appropriate graphs for a visual comparison and carry out two-sample inference. Confidence intervals and results of hypotheses tests for two independent (or two dependent) samples are displayed. Data can be supplied in summary or original form.
• The Bootstrap app finds a bootstrap confidence interval for a mean, median, or standard deviation.
• The Permutation Test (cont. data) app compares quantitative responses between two groups using a permutation approach. By repeatedly clicking a button, the sampling distribution using permutations is generated step-by-step, which is useful when first introducing the topic. Both the original and the (randomly) permuted datasets are shown.
• The Permutation Test for Independence app tests for independence in contingency tables using the permutation sampling distribution of the Chi-squared statistic X². It displays the original contingency table and bar chart along with the table and chart for the permuted dataset, as well as the sampling distribution of X².
• The Fisher Exact Test app can be used for exact inference in 2 × 2 contingency tables.
• The ANOVA (One-Way) app allows comparison of several means, including post-hoc pairwise multiple comparisons.

Preface

We have each taught introductory statistics for many years, and we have witnessed the welcome evolution from the traditional formula-driven mathematical statistics course to a concept-driven approach. This concept-driven approach places more emphasis on why statistics is important in the real world and places less emphasis on mathematical probability. One of our goals in writing this book was to help make the conceptual approach more interesting and more readily accessible to students. At the end of the course, we want students to look back at their statistics course and realize that they learned practical concepts that will serve them well for the rest of their lives. We also want students to come to appreciate that in practice, assumptions are not perfectly satisfied, models are not exactly correct, distributions are not exactly normally distributed, and different factors should be considered in conducting a statistical analysis. The title of our book reflects the experience of data analysts, who soon realize that statistics is an art as well as a science.

What’s New in This Edition

Our goal in writing the fourth edition of our textbook was to improve the student and instructor user experience. We have:
• Clarified terminology and streamlined writing throughout the text to improve ease of reading and facilitate comprehension.
• Used real data and real examples to illustrate almost all concepts discussed. Throughout the book, within three to five consecutive pages, an example is presented that depicts a real-world scenario to illustrate the statistical concept discussed.
• Introduced new web-based applets (referred to as web apps or apps) illustrating and helping students interact with key statistical concepts and techniques. These apps invite students to explore consequences of changing parameters and to carry out statistical inference. Among other relevant concepts and techniques, students are introduced to:
    • Sampling distributions
    • Central limit theorem
    • Bootstrapping for interval estimation (Chapter 8)
    • Randomization or permutation tests for significance testing (Chapter 10 for difference in two means and Chapter 11 for two categorical variables).
• Inserted brief overviews to set the stage for each chapter, introducing students to chapter concepts and helping them see how previous chapters’ concepts, tools, and techniques are related.
• Included computer output from the most recent versions of MINITAB and the TI calculator.
• Expanded Chapter 1, providing key terminology to establish a foundation to understand the big picture of the statistical investigative process: the importance of asking good statistical questions, designing an appropriate study, performing descriptive and inferential analysis, and making a conclusion.
• Reflected the latest trends in statistical education, including:
    • Measures of association for categorical variables in Chapter 3
    • Permutation testing in Chapters 10 and 11
    • Updated coverage of McNemar’s test in Chapter 10 (previously Chapter 11)
• Moved important coverage of risk difference and relative risk to Chapter 3 (instead of first introducing these measures in Chapter 11). We believe that understanding these two statistics is a necessary part of statistical literacy for the everyday citizen as they are pervasive in mass media and the medical literature.
• Updated or replaced over 25 percent of the exercises and examples. In addition, we have updated all General Social Survey (GSS) data with the most current data available.
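
As a rough illustration of two of the simulation-based techniques named in the list above (bootstrapping for interval estimation and a permutation test for a difference in two means), the following Python sketch may help. It is not taken from the book or its web apps: the function names, the choice of the percentile method for the interval, the number of resamples, and the example data are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_percentile_ci(data, stat=statistics.mean, level=0.95,
                            n_resamples=5000, seed=0):
    """Percentile bootstrap interval (hypothetical helper, not the book's app).

    Resamples the data with replacement, recomputes the statistic each time,
    and reads the interval endpoints off the sorted bootstrap estimates.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper


def permutation_test_mean_diff(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation P-value for a difference in two sample means.

    Group labels are shuffled at random; the P-value is the share of shuffles
    whose mean difference is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


if __name__ == "__main__":
    # Made-up measurements, used only to exercise the two functions.
    sample = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]
    print("95% bootstrap interval for the mean:",
          bootstrap_percentile_ci(sample))

    treatment = [20, 22, 19, 24, 25, 23]
    control = [10, 12, 9, 11, 13, 8]
    print("observed difference and permutation P-value:",
          permutation_test_mean_diff(treatment, control))
```

Both functions rely on random resampling rather than a formula for the standard error, which is the point of the simulation-based approach; a fixed seed is used here only so that repeated runs give the same numbers.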

Our Approach

In 2005, the American Statistical Association (ASA) endorsed guidelines and recommendations for the introductory statistics course as described in the report, “Guidelines for Assessment and Instruction in Statistics Education (GAISE) for the College Introductory Course” (www.amstat.org/education/gaise). The report states that the overarching goal of all introductory statistics courses is to produce statistically educated students, which means that students should develop statistical literacy and the ability to think statistically. The report gives six key recommendations for the college introductory course:
• Emphasize statistical literacy and develop statistical thinking.
• Use real data.
• Stress conceptual understanding rather than mere knowledge of procedures.
• Foster active learning in the classroom.
• Use technology for developing concepts and analyzing data.
• Use assessment to evaluate and improve student learning.
We wholeheartedly endorse these recommendations, and our textbook takes every opportunity to support these guidelines.

Ask and Answer Interesting Questions

In presenting concepts and methods, we encourage students to think about the data and the appropriate analyses by posing questions. Our approach, learning by framing questions, is carried out in various ways, including (1) presenting a structured approach to examples that separates the question and the analysis from the scenario presented, (2) providing homework problems that encourage students to think and write, and (3) asking questions in the figure captions that are answered in the Chapter Review.

Present Concepts Clearly

Students have told us that this book is more “readable” and interesting than other introductory statistics texts because of the wide variety of intriguing real data examples and exercises. We have simplified our prose wherever possible, without sacrificing any of the accuracy that instructors expect in a textbook.
A serious source of confusion for students is the multitude of inference methods that derive from the many combinations of confidence intervals and tests, means and proportions, large sample and small sample, variance known and unknown, two-sided and one-sided inference, independent and dependent samples, and so on. We emphasize the most important cases for practical application of inference: large sample, variance unknown, two-sided inference, and independent samples. The many other cases are also covered (except for known variances), but more briefly, with the exercises focusing mainly on the way inference is commonly conducted in practice. We present the traditional probability distribution–based inference but now also include inference using simulation through bootstrapping and permutation tests.

Connect Statistics to the Real World

We believe it’s important for students to be comfortable with analyzing a balance of both quantitative and categorical data so students can work with the data they most often see in the world around them. Every day in the media, we see and hear percentages and rates used to summarize results of opinion polls, outcomes of medical studies, and economic reports. As a result, we have increased the attention paid to the analysis of proportions. For example, we use contingency tables early in the text to illustrate the concept of association between two categorical variables and to show the potential influence of a lurking variable.

Organization of the Book

The statistical investigative process has the following components: (1) asking a statistical question; (2) designing an appropriate study to collect data; (3) analyzing the data; and (4) interpreting the data and making conclusions to answer the statistical questions. With this in mind, the book is organized into four parts. Part 1 focuses on gathering and exploring data.
This equates to components 1, 2, and 3, when the data is analyzed descriptively (both for one variable and the association between two variables). Part 2 covers probability, probability distributions, and the sampling distribution. This equates to component 3, when the student learns the underlying probability necessary to make the step from analyzing the data descriptively to analyzing the data inferentially (for example, understanding sampling distributions to develop the concept of a margin of error and a P-value). Part 3 covers inferential statistics. This equates to components 3 and 4 of the statistical investigative process. The students learn how to form confidence intervals and conduct significance tests and then make appropriate conclusions answering the statistical question of interest. Part 4 covers analyzing associations (inferentially) and looks at extended statistical methods. The chapters are written in such a way that instructors can teach out of order. For example, after Chapter 1, an instructor could easily teach Chapter 4, Chapter 2, and Chapter 3. Alternatively, an instructor may teach Chapters 5, 6, and 7 after Chapters 1 and 4.

Features of the Fourth Edition

Promoting Student Learning

To motivate students to think about the material, ask appropriate questions, and develop good problem-solving skills, we have created special features that distinguish this text.

Student Support

To draw students to important material we highlight key definitions, guidelines, procedures, “In Practice” remarks, and other summaries in boxes throughout the text. In addition, we have four types of margin notes:
• In Words: This feature explains, in plain language, the definitions and symbolic notation found in the body of the text (which, for technical accuracy, must be more formal).
• Caution: These margin boxes alert students to areas to which they need to pay special attention, particularly where they are prone to make mistakes or incorrect assumptions.
• Recall: As the student progresses through the book, concepts are presented that depend on information learned in previous chapters. The Recall margin boxes direct the reader back to a previous presentation in the text to review and reinforce concepts and methods already covered.
• Did You Know: These margin boxes provide information that helps with the contextual understanding of the statistical question under consideration.

Graphical Approach

Because many students are visual learners, we have taken extra care to make the text figures informative. We’ve annotated many of the figures with labels that clearly identify the noteworthy aspects of the illustration. Further, most figure captions include a question (answered in the Chapter Review) designed to challenge the student to interpret and think about the information being communicated by the graphic. The graphics also feature a pedagogical use of color to help students recognize patterns and distinguish between statistics and parameters. The use of color is explained on page D1 for easy reference.

Hands-On Activities and Simulations

Each chapter contains diverse and dynamic activities that allow students to become familiar with a number of statistical methodologies and tools. The instructor can elect to carry out the activities in class, outside of class, or a combination of both. The activity often involves simulation, commonly using a web app available through the book’s website or MyStatLab. Similar activities can also be found within MyStatLab. These hands-on activities and simulations encourage students to learn by doing.

Connection to History: On the Shoulders of . . .

We believe that knowledge pertaining to the evolution and history of the statistics discipline is relevant to understanding the methods we use for designing studies and analyzing data. Throughout the text, several chapters feature a spotlight on people who have made major contributions to the statistics discipline.
These spotlights are titled On the Shoulders of . . . RealWorld Connections ChapterOpening Example Each chapter begins with a highinterest example that raises key questions and establishes themes that are woven throughout the chapter. Illustrated with engaging photographs, this example is designed to grab students’ attention and draw them into the chapter. The issues discussed in the chapter’s opening example are referred to and revisited in examples within the chapter. All chapteropening examples use real data from a variety of applications. Preface 13 Statistics: In Practice We realize that there is a difference between proper academic statistics and what is actually done in practice. Data analysis in practice is an art as well as a science. Although statistical theory has foundations based on precise assumptions and conditions, in practice the real world is not so simple. In Practice boxes and text references alert students to the way statisticians actually analyze data in practice. These comments are based on our extensive consulting experience and research and by observing what welltrained statisticians do in practice. Exercises and Examples Innovative Example Format Recognizing that the worked examples are the major vehicle for engaging and teaching students, we have developed a unique structure to help students learn to model the questionposing and investigative thought process required to examine issues intelligently using statistics. The five components are as follows: • Picture the Scenario presents background information so students can visualize the situation. This step places the data to be investigated in context and often provides a link to previous examples. • Questions to Explore reference the information from the scenario and pose questions to help students focus on what is to be learned from the example and what types of questions are useful to ask about the data. • Think It Through is the heart of each example. 
Here, the questions posed are investigated and answered using appropriate statistical methods. Each solution is clearly matched to the question so students can easily find the response to each Question to Explore. • Insight clarifies the central ideas investigated in the example and places them in a broader context that often states the conclusions in less technical terms. Many of the Insights also provide connections between seemingly disparate topics in the text by referring to concepts learned previously and/or foreshadowing techniques and ideas to come. • Try Exercise: Each example concludes by directing students to an endofsection exercise that allows immediate practice of the concept or technique within the example. Concept tags are included with each example so that students can easily identify the concept demonstrated in the example. Relevant and Engaging Exercises The text contains a strong emphasis on real data in both the examples and exercises. We have updated the exercise sets in the fourth edition to ensure that students have ample opportunity to practice techniques and apply the concepts. Nearly all of the chapters contain more than 100 exercises, and more than 25 percent of the exercises are new to this edition or have been updated with current data. These exercises are realistic and ask students to provide interpretations of the data or scenario rather than merely to find a numerical solution. We show how statistics addresses a wide array of applications, including opinion polls, market research, the environment, and health and human behavior. Because we believe that most students benefit more from focusing on the underlying concepts and interpretations of data analyses than from the actual calculations, the exercises often show summary statistics and printouts and ask what can be learned from them. We have exercises in three places: • At the end of each section. 
These exercises provide immediate reinforcement and are drawn from concepts within the section.
• At the end of each chapter. This more comprehensive set of exercises draws from all concepts across all sections within the chapter.
• Part Reviews. These exercises draw connections among a part’s chapters and summarize the overarching themes and concepts. Part exercises reinforce primary learning objectives. These are all available in MyStatLab.
Each exercise has a descriptive label. Exercises for which technology is recommended (such as using software or an app to carry out the analysis) are indicated with an icon. Larger data sets used in examples and exercises are referenced in the text, listed on page D2, and made available on the book’s website. The exercises are divided into the following three categories:
• Practicing the Basics are the section exercises and the first group of end-of-chapter exercises; they reinforce basic application of the methods.
• Concepts and Investigations exercises require the student to explore real data sets and carry out investigations for mini-projects. They may ask students to explore concepts and related theory or be extensions of the chapter’s methods. This section contains some multiple-choice and true-false exercises to help students check their understanding of the basic concepts and prepare for tests. A few more difficult, optional exercises (highlighted with an icon) are included to present some additional concepts and methods. Concepts and Investigations exercises are found in the end-of-chapter exercises.
• Student Activities are designed for group work based on investigations each of the students performs on a team. Student Activities are found in the end-of-chapter exercises, and additional activities may be found within chapters as well.
Technology Integration
Up-to-Date Use of Technology
The availability of technology enables instruction that is less calculation-based and more concept-oriented.
Output from software applications and calculators is displayed throughout the textbook, and discussion focuses on interpretation of the output rather than on the keystrokes needed to create the output. Although most of our output is from MINITAB® and the TI calculators, we also show screen captures from IBM® SPSS® and Microsoft Excel® as appropriate.
Web Apps
Web apps referred to in the text are found on the book’s website (www.pearsonglobaleditions.com/agresti) and in MyStatLab. These apps have great value because they demonstrate concepts to students visually. For example, creating a sampling distribution is accomplished more readily with a dynamic and interactive web app than with a static text figure. (A description and list of the apps may be found on page 7.)
Data Sets
We use a wealth of real data sets throughout the textbook. These data sets are available on the www.pearsonglobaleditions.com/agresti website. The same data set is often used in several chapters, helping reinforce the four components of the statistical investigative process and allowing the students to see the big picture of statistical reasoning. Exercises requiring students to download the data set from the book’s website are noted with an icon.
Learning Catalytics
Learning Catalytics is a web-based engagement and assessment tool. As a “bring-your-own-device” direct response system, Learning Catalytics offers a diverse library of dynamic question types that allow students to interact with and think critically about statistical concepts. Because it is a real-time resource, instructors can take advantage of critical teaching moments both in the classroom and through assignable and gradeable homework.
UPDATED! Example-Level Videos
Select examples from the text have guided videos. These updated videos provide excellent support for students who require additional assistance or want reinforcement on topics and concepts learned in class.
MyStatLab™ Online Course (access code required)
MyStatLab is a course management system that delivers proven results in helping individual students succeed.
• MyStatLab can be successfully implemented in any environment (lab-based, hybrid, fully online, traditional) and demonstrates the quantifiable difference that integrated usage has on student retention, subsequent success, and overall achievement.
• MyStatLab’s comprehensive online gradebook automatically tracks students’ results on tests, quizzes, homework, and in the study plan. Instructors can use the gradebook to intervene if students have trouble or to provide positive feedback. Data can be easily exported to a variety of spreadsheet programs, such as Microsoft Excel.
MyStatLab provides engaging experiences that personalize, stimulate, and measure learning for each student.
• Tutorial Exercises with Multimedia Learning Aids: The homework and practice exercises in MyStatLab align with the exercises in the textbook, and they regenerate algorithmically to give students unlimited opportunity for practice and mastery. Exercises offer immediate helpful feedback, guided solutions, sample problems, animations, videos, and eText clips for extra help at point-of-use.
• Getting Ready for Statistics: A library of questions now appears within each MyStatLab course to offer the developmental math topics students need for the course. These can be assigned as a prerequisite to other assignments if desired.
• Conceptual Question Library: In addition to algorithmically regenerated questions that are aligned with your textbook, a library of 1,000 Conceptual Questions that require students to apply their statistical understanding is available in the assessment managers.
• StatCrunch: MyStatLab includes web-based statistical software, StatCrunch, within the online assessment platform so that students can easily analyze data sets from exercises and the text.
In addition, MyStatLab includes access to www.StatCrunch.com, a website where users can access more than 20,000 shared data sets, conduct online surveys, perform complex analyses using the powerful statistical software, and generate compelling reports.
• Integration of Statistical Software: Knowing that students often use external statistical software, we make it easy to copy our data sets from the MyStatLab questions into software like StatCrunch, MINITAB, Excel, and more. Students have access to a variety of support (Technology Tutorials Videos and Technology Study Cards) to learn how to use statistical software effectively.
And, MyStatLab comes from a trusted partner with educational expertise and an eye on the future. Knowing that you are using a Pearson product means knowing that you are using quality content. That means that our eTexts are accurate, that our assessment tools work, and that our questions are error-free. And whether you are just getting started with MyStatLab, or have a question along the way, we’re here to help you learn about our technologies and how to incorporate them into your course. To learn more about how MyStatLab combines proven learning applications with powerful assessment, visit www.mystatlab.com or contact your Pearson representative.
StatCrunch®
StatCrunch® is powerful web-based statistical software that allows users to perform complex analyses, share data sets, and generate compelling reports of their data. The vibrant online community offers more than 20,000 data sets for instructors to use and students to analyze.
• Collect. Users can upload their own data to StatCrunch or search a large library of publicly shared data sets, spanning almost any topic of interest. Also, an online survey tool allows users to collect data quickly through web-based surveys.
• Crunch. A full range of numerical and graphical methods allows users to analyze and gain insights from any data set.
Interactive graphics help users understand statistical concepts and are available for export to enrich reports with visual representations of data.
• Communicate. Reporting options help users create a wide variety of visually appealing representations of their data.
Full access to StatCrunch is available with a MyStatLab kit, and StatCrunch is available by itself to qualified adopters. For more information, visit our website at www.statcrunch.com or contact your Pearson representative.
An Invitation Rather Than a Conclusion
We hope that students using this textbook will gain a lasting appreciation for the vital role the art and science of statistics plays in analyzing data and helping us make decisions in our lives. Our major goals for this textbook are that students learn how to:
• Recognize that we are surrounded by data and the importance of becoming statistically literate to interpret these data and make informed decisions based on data.
• Become critical readers of studies summarized in mass media and of research papers that quote statistical results.
• Produce data that can provide answers to properly posed questions.
• Appreciate how probability helps us understand randomness in our lives and grasp the crucial concept of a sampling distribution and how it relates to inference methods.
• Choose appropriate descriptive and inferential methods for examining and analyzing data and drawing conclusions.
• Communicate the conclusions of statistical analyses clearly and effectively.
• Understand the limitations of most research, either because it was based on an observational study rather than a randomized experiment or survey or because a certain lurking variable was not measured that could have explained the observed associations.
We are excited about sharing the insights that we have learned from our experience as teachers and from our students through this text.
Many students still enter statistics classes on the first day with dread because of its reputation as a dry, sometimes difficult, course. It is our goal to inspire a classroom environment that is filled with creativity, openness, realistic applications, and learning that students find inviting and rewarding. We hope that this textbook will help the instructor and the students experience a rewarding introductory course in statistics.
Resources for Success
MyStatLab™ Online Course for Statistics: The Art and Science of Learning from Data by Agresti, Franklin, and Klingenberg (access code required)
MyStatLab is available to accompany Pearson’s market-leading text offerings. To give students a consistent tone, voice, and teaching method, each text’s flavor and approach is tightly integrated throughout the accompanying MyStatLab course, making learning the material as seamless as possible.
Technology Tutorials and Study Cards
Technology tutorials provide brief video walkthroughs and step-by-step instructional study cards on common statistical procedures for MINITAB®, Excel®, and the TI family of graphing calculators.
New! Apps: Examples, Exercises, and Simulations
Author-created web apps allow students to interact with key statistical concepts and techniques, including statistical distributions, inference for one and two samples, permutation tests, bootstrapping, and sampling distributions. Students can explore consequences of changing parameters and carry out simulations to explore coverage, or simply obtain descriptive statistics or a proper statistical graph. All of this comes in a highly user-friendly and well-designed app, where results can be downloaded for inclusion in homework or projects.
Example-Level Resources
Students looking for additional support can use the example-based videos to help solve problems, provide reinforcement on topics and concepts learned in class, and support their learning.
Figure 10.10 Screenshot from Permutation Web App.
The histogram shows the permutation sampling distribution of the difference between sample means. This sampling distribution is obtained by considering all possible ways of dividing the responses of the ... The semicircle in the right tail indicates the actually observed difference. (The dot plots above the histogram show the original data and the data after a random permutation.) Question: Is the observed difference extreme?
Instructor Resources
Additional resources can be downloaded from www.pearsonglobaleditions.com/agresti
Text-specific website, www.pearsonglobaleditions.com/agresti
New to this edition, students and instructors will have a full library of resources, including apps developed for in-text activities and data sets (.csv, TI-83/84 Plus C, and .txt).
Updated! Instructor to Instructor Videos provide an opportunity for adjuncts, part-timers, TAs, or other instructors who are new to teaching from this text or have limited class prep time to learn about the book’s approach and coverage from the authors. The videos, available through MyStatLab, focus on those topics that have proven to be most challenging to students. The authors offer suggestions, pointers, and ideas about how to present these topics and concepts effectively based on their many years of teaching introductory statistics. They also share insights on how to help students use the textbook in the most effective way to realize success in the course.
Instructor’s Solutions Manual, by James Lapp, contains fully worked solutions to every textbook exercise. Available for download from Pearson’s online catalog at www.pearsonglobaleditions.com/agresti and through MyStatLab.
Answers to the Student Laboratory Workbook are available for download from www.pearsonglobaleditions.com/agresti and through MyStatLab.
PowerPoint Lecture Slides are fully editable and printable slides that follow the textbook.
These slides can be used during lectures or posted to a website in an online course. The PowerPoint Lecture Slides are available from www.pearsonglobaleditions.com/agresti and through MyStatLab.
TestGen® (www.pearsoned.com/testgen) enables instructors to build, edit, print, and administer tests using a computerized bank of questions developed to cover all the objectives of the text. TestGen is algorithmically based, allowing instructors to create multiple but equivalent versions of the same question or test with the click of a button. Instructors can also modify test bank questions or add new questions. The test bank is available for download from www.pearsonglobaleditions.com/agresti and through MyStatLab.
The Online Test Bank is a test bank derived from TestGen®. It includes multiple choice and short answer questions for each section of the text, along with the answer keys. Available for download from www.pearsonglobaleditions.com/agresti and through MyStatLab.
Student Resources
Additional resources to help student success.
Text-specific website, www.pearsonglobaleditions.com/agresti
New to this edition, students and instructors will have a full library of resources, including apps developed for in-text activities and data sets (.csv, TI-83/84 Plus C, and .txt).
Updated! Example-level videos explain how to work examples from the text. The videos provide excellent support for students who require additional assistance or want reinforcement on topics and concepts learned in class (available in MyStatLab).
Student Laboratory Workbook, by Megan Mocko (University of Florida) and Maria Ripol (University of Florida), is a study tool for the first ten chapters of the text. This workbook provides section-by-section review and practice and additional activities that cover fundamental statistical topics (ISBN-10: 0133860892; ISBN-13: 9780133860894).
Study Cards for Statistics Software
This series of study cards, available for Excel®, MINITAB®, JMP®, SPSS®, R®, StatCrunch®, and the TI family of graphing calculators, provides students with easy, step-by-step guides to the most common statistics software. Available in MyStatLab.
Acknowledgments
We are indebted to the following individuals, who provided valuable feedback for the fourth edition:
Evelyn Bailey, Oxford College of Emory University
Erica Bernstein, University of Hawaii at Hilo
Phyllis Curtiss, Grand Valley State
Katherine C. Earles, Wichita State University
Rob Eby, Blinn College–Bryan Campus
Matthew Jones, Austin Peay State University
Ann Kalinoskii, San Jose University
Michael Roty, Mercer University
Ping-Shou Zhong, Michigan State University
We are also indebted to the many reviewers, class testers, and students who gave us invaluable feedback and advice on how to improve the quality of the book.
ARIZONA Russel Carlson, University of Arizona; Peter Flanagan-Hyde, Phoenix Country Day School ■ CALIFORNIA James Curl, Modesto Junior College; Christine Drake, University of California at Davis; Mahtash Esfandiari, UCLA; Brian Karl Finch, San Diego State University; Dawn Holmes, University of California Santa Barbara; Rob Gould, UCLA; Rebecca Head, Bakersfield College; Susan Herring, Sonoma State University; Colleen Kelly, San Diego State University; Marke Mavis, Butte Community College; Elaine McDonald, Sonoma State University; Corey Manchester, San Diego State University; Amy McElroy, San Diego State University; Helen Noble, San Diego State University; Calvin Schmall, Solano Community College ■ COLORADO David Most, Colorado State University ■ CONNECTICUT Paul Bugl, University of Hartford; Anne Doyle, University of Connecticut; Pete Johnson, Eastern Connecticut State University; Dan Miller, Central Connecticut State University; Kathleen McLaughlin, University of Connecticut; Nalini Ravishanker, University of Connecticut; John 
Vangar, Fairfield University; Stephen Sawin, Fairfield University ■ DISTRICT OF COLUMBIA Hans Engler, Georgetown University; Mary W. Gray, American University; Monica Jackson, American University ■ FLORIDA Nazanin Azarnia, Santa Fe Community College; Brett Holbrook; James Lang, Valencia Community College; Karen Kinard, Tallahassee Community College; Megan Mocko, University of Florida; Maria Ripol, University of Florida; James Smart, Tallahassee Community College; Latricia Williams, St. Petersburg Junior College, Clearwater; Doug Zahn, Florida State University ■ GEORGIA Carrie Chmielarski, University of Georgia; Ouida Dillon, Oconee County High School; Kim Gilbert, University of Georgia; Katherine Hawks, Meadowcreek High School; Todd Hendricks, Georgia Perimeter College; Charles LeMarsh, Lakeside High School; Steve Messig, Oconee County High School; Broderick Oluyede, Georgia Southern University; Chandler Pike, University of Georgia; Kim Robinson, Clayton State University; Jill Smith, University of Georgia; John Seppala, Valdosta State University; Joseph Walker, Georgia State University ■ IOWA John Cryer, University of Iowa; Kathy Rogotzke, North Iowa Community College; R. P. 
Russo, University of Iowa; William Duckworth, Iowa State University ■ ILLINOIS Linda Brant Collins, University of Chicago; Dagmar Budikova, Illinois State University; Ellen Fireman, University of Illinois; Jinadasa Gamage, Illinois State University; Richard Maher, Loyola University Chicago; Cathy Poliak, Northern Illinois University; Daniel Rowe, Heartland Community College ■ KANSAS James Higgins, Kansas State University; Michael Mosier, Washburn University ■ KENTUCKY Lisa Kay, Eastern Kentucky University ■ MASSACHUSETTS Richard Cleary, Bentley University; Katherine Halvorsen, Smith College; Xiaoli Meng, Harvard University; Daniel Weiner, Boston University ■ MICHIGAN Kirk Anderson, Grand Valley State University; Phyllis Curtiss, Grand Valley State University; Roy Erickson, Michigan State University; Jann-Huei Jinn, Grand Valley State University; Sango Otieno, Grand Valley State University; Alla Sikorskii, Michigan State University; Mark Stevenson, Oakland Community College; Todd Swanson, Hope College; Nathan Tintle, Hope College ■ MINNESOTA Bob Dobrow, Carleton College; German J. Pliego, University of St. Thomas; Peihua Qui, University of Minnesota; Engin A. Sungur, University of Minnesota–Morris ■ MISSOURI Lynda Hollingsworth, Northwest Missouri State University; Robert Paige, Missouri University of Science and Technology; Larry Ries, University of Missouri–Columbia; Suzanne Tourville, Columbia College ■ MONTANA Jeff Banfield, Montana State University ■ NEW JERSEY Harold Sackrowitz, Rutgers, The State University of New Jersey; Linda Tappan, Montclair State University ■ NEW MEXICO David Daniel, New Mexico State University ■ NEW YORK Brooke Fridley, Mohawk Valley Community College; Martin Lindquist, Columbia University; Debby Lurie, St. 
John’s University; David Mathiason, Rochester Institute of Technology; Steve Stehman, SUNY ESF; Tian Zheng, Columbia University ■ NEVADA Alison Davis, University of Nevada–Reno ■ NORTH CAROLINA Pamela Arroway, North Carolina State University; E. Jacquelin Dietz, North Carolina State University; Alan Gelfand, Duke University; Gary Kader, Appalachian State University; Scott Richter, UNC Greensboro; Roger Woodard, North Carolina State University ■ NEBRASKA Linda Young, University of Nebraska ■ OHIO Jim Albert, Bowling Green State University; John Holcomb, Cleveland State University; Jackie Miller, The Ohio State University; Stephan Pelikan, University of Cincinnati; Teri Rysz, University of Cincinnati; Deborah Rumsey, The Ohio State University; Kevin Robinson, University of Akron; Dottie Walton, Cuyahoga Community College-Eastern Campus ■ OREGON Michael Marciniak, Portland Community College; Henry Mesa, Portland Community College, Rock Creek; Qi-Man Shao, University of Oregon; Daming Xu, University of Oregon ■ PENNSYLVANIA Winston Crawley, Shippensburg University; Douglas Frank, Indiana University of Pennsylvania; Steven Gendler, Clarion University; Bonnie A. Green, East Stroudsburg University; Paul Lupinacci, Villanova University; Deborah Lurie, Saint Joseph’s University; Linda Myers, Harrisburg Area Community College; Tom Short, Villanova University; Kay Somers, Moravian College; Sister Marcella Louise Wallowicz, Holy Family University ■ SOUTH CAROLINA Beverly Diamond, College of Charleston; Martin Jones, College of Charleston; Murray Siegel, The South Carolina Governor’s School for Science and Mathematics ■ SOUTH DAKOTA Richard Gayle, Black Hills State University; Daluss Siewert, Black Hills State University; Stanley Smith, Black Hills State University ■ TENNESSEE Bonnie Daves, Christian Academy of Knoxville; T. 
Henry Jablonski, Jr., East Tennessee State University; Robert Price, East Tennessee State University; Ginger Rowell, Middle Tennessee State University; Edith Seier, East Tennessee State University ■ TEXAS Larry Ammann, University of Texas, Dallas; Tom Bratcher, Baylor University; Jianguo Liu, University of North Texas; Mary Parker, Austin Community College; Robert Paige, Texas Tech University; Walter M. Potter, Southwestern University; Therese Shelton, Southwestern University; James Surles, Texas Tech University; Diane Resnick, University of Houston-Downtown ■ UTAH Patti Collings, Brigham Young University; Carolyn Cuff, Westminster College; Lajos Horvath, University of Utah; P. Lynne Nielsen, Brigham Young University ■ VIRGINIA David Bauer, Virginia Commonwealth University; Ching-Yuan Chiang, James Madison University; Jonathan Duggins, Virginia Tech; Steven Garren, James Madison University; Hasan Hamdan, James Madison University; Debra Hydorn, Mary Washington College; Nusrat Jahan, James Madison University; D’Arcy Mays, Virginia Commonwealth University; Stephanie Pickle, Virginia Polytechnic Institute and State University ■ WASHINGTON Rich Alldredge, Washington State University; Brian T. Gill, Seattle Pacific University; June Morita, University of Washington ■ WISCONSIN Brooke Fridley, University of Wisconsin–LaCrosse; Loretta Robb Thielman, University of Wisconsin–Stout ■ WYOMING Burke Grandjean, University of Wyoming ■ CANADA Mike Kowalski, University of Alberta; David Loewen, University of Manitoba
We appreciate all of the thoughtful contributions to the fourth edition by Michael Posner.
We also thank the following individuals, who made invaluable contributions to the third edition:
Ellen Breazel, Clemson University
Linda Dawson, Washington State University, Tacoma
Bernadette Lanciaux, Rochester Institute of Technology
Scott Nickleach, Sonoma State University
The detailed assessment of the text fell to our accuracy checkers, Ann Cannon and Joan Sanuik. Thank you to James Lapp, who took on the task of revising the solutions manuals to reflect the many changes to the fourth edition. We also want to thank Jackie Miller (Ohio State University) for her contributions to the Instructor’s Notes, and our student technology manual and workbook authors, Megan Mocko (University of Florida) and Maria Ripol (University of Florida). We would like to thank the Pearson team who has given countless hours in developing this text; without their guidance and assistance, the text would not have come to completion. We thank Suzanna Bainbridge, Rachel Reeve, Danielle Simbajon, Justin Billing, Jean Choe, Tiffany Bitzel, Andrew Noble, Jennifer Myers, and especially Joe Vetere. We also thank Kristin Jobe, Senior Project Manager at Integra-Chicago, for keeping this book on track throughout production.
Alan Agresti would like to thank those who have helped us in some way, often by suggesting data sets or examples. These include Anna Gottard, Wolfgang Jank, Bernhard Klingenberg, René Lee-Pack, Jacalyn Levine, Megan Lewis, Megan Meece, Dan Nettleton, Yongyi Min, and Euijung Ryu. Many thanks also to Tom Piazza for his help with the General Social Survey. Finally, Alan Agresti would like to thank his wife Jacki Levine for her extraordinary support throughout the writing of this book. Besides putting up with the evenings and weekends he was working on this book, she offered numerous helpful suggestions for examples and for improving the writing.
Chris Franklin gives a special thank you to her husband and sons, Dale, Corey, and Cody Green.
They have patiently sacrificed spending many hours with their spouse and mom as she has worked on this book through four editions. A special thank you also to her parents Grady and Helen Franklin and her two brothers, Grady and Mark, who have always been there for their daughter and sister. Chris also appreciates the encouragement and support of her colleagues and her many students who used the book, offering practical suggestions for improvement. Finally, Chris thanks her coauthors Alan and Bernhard for the amazing journey of writing a textbook together.
Bernhard Klingenberg wants to thank his statistics teachers in Graz, Austria; Sheffield, UK; and Gainesville, Florida, who showed him all the fascinating facets of statistics throughout his education. Thanks also to the Department of Mathematics & Statistics at Williams College for being such a wonderful place to work. Finally, thanks to Chris Franklin and Alan Agresti for a wonderful and inspiring collaboration.
Alan Agresti, Gainesville, Florida
Christine Franklin, Athens, Georgia
Bernhard Klingenberg, Williamstown, Massachusetts
Acknowledgments for the Global Edition
Pearson would like to thank and acknowledge the following people for their work on the Global Edition.
Contributors
Mohammad Kacim, Holy Spirit University of Kaslik
Abhishek K. Umrawal, Delhi University
Reviewers
Ruben Garcia Berasategui, Jakarta International College
Amit Kumar Misra, Babasaheb Bhimrao Ambedkar University
Ranjita Pandey, Delhi University
Louise M. Ryan, University of Technology Sydney
C. V. Vinay, JSS Academy of Technical Education
About the Authors
Alan Agresti is Distinguished Professor Emeritus in the Department of Statistics at the University of Florida. He taught statistics there for 38 years, including the development of three courses in statistical methods for social science students and three courses in categorical data analysis. 
He is author of more than 100 refereed articles and six texts, including Statistical Methods for the Social Sciences (with Barbara Finlay, Prentice Hall, 4th edition, 2009) and Categorical Data Analysis (Wiley, 3rd edition, 2013). He is a Fellow of the American Statistical Association and recipient of an Honorary Doctor of Science from De Montfort University in the UK. He has held visiting positions at Harvard University, Boston University, the London School of Economics, and Imperial College and has taught courses or short courses for universities and companies in about 30 countries worldwide. He has also received teaching awards from the University of Florida and an excellence in writing award from John Wiley & Sons.
Christine Franklin is a Senior Lecturer and Lothar Tresp Honoratus Honors Professor in the Department of Statistics at the University of Georgia. She has been teaching statistics for more than 35 years at the college level. Chris has been actively involved at the national and state level with promoting statistical education at Pre-K–16 since the 1980s. She is a past Chief Reader for AP Statistics. Chris served as the lead writer for the ASA-endorsed Guidelines for Assessment and Instruction in Statistics Education (GAISE) Report: A Pre-K–12 Curriculum Framework and chaired the ASA Statistical Education of Teachers (SET) Report. Chris has been honored by her selection as a Fellow of the American Statistical Association, recipient of the 2006 Mu Sigma Rho National Statistical Education Award, the 2013 USCOTS Lifetime Achievement Award, the 2014 ASA Founders Award, a 2014–2015 U.S. Fulbright Scholar, and numerous teaching and advising awards at the University of Georgia. Most important for Chris is her family, who love to hike and attend baseball games together.
Bernhard Klingenberg is Associate Professor of Statistics in the Department of Mathematics & Statistics at Williams College, where he has been teaching introductory and advanced statistics classes for the past 11 years. In 2013, Bernhard was instrumental in creating an undergraduate major in statistics at Williams, one of the first for a liberal arts college. A native of Austria, Bernhard frequently returns there to hold visiting positions at universities and gives short courses on categorical data analysis in Europe and the United States. He has published several peer-reviewed articles in statistical journals and consults regularly with academia and industry. Bernhard enjoys photography (some of his pictures appear in this book), scuba diving, and time with his wife and four children.
PART 1 Gathering and Exploring Data
Chapter 1 Statistics: The Art and Science of Learning from Data
Chapter 2 Exploring Data with Graphs and Numerical Summaries
Chapter 3 Association: Contingency, Correlation, and Regression
Chapter 4 Gathering Data
CHAPTER 1 Statistics: The Art and Science of Learning from Data
1.1 Using Data to Answer Statistical Questions
1.2 Sample Versus Population
1.3 Using Calculators and Computers
Example 1 How Statistics Helps Us Learn About the World
Picture the Scenario
In this book, you will explore a wide variety of everyday scenarios. For example, you will evaluate media reports about opinion surveys, medical research studies, the state of the economy, and environmental issues. You’ll face financial decisions such as choosing between an investment with a sure return and one that could make you more money but could possibly cost you your entire investment. You’ll learn how to analyze the available information to answer necessary questions in such scenarios. One purpose of this book is to show you why an understanding of statistics is essential for making good decisions in an uncertain world.
Questions to Explore
This book will show you how to collect appropriate information and how to apply statistical methods so you can better evaluate that information and answer the questions posed. Here are some examples of questions we’ll investigate in coming chapters:
• How can you evaluate evidence about global warming?
• Are cell phones dangerous to your health?
• What’s the chance your tax return will be audited?
• How likely are you to win the lottery?
• Is there bias against women in appointing managers?
• What ‘hot streaks’ should you expect in basketball?
• How can you analyze whether a diet really works?
• How can you predict the selling price of a house?
Thinking Ahead
Each chapter uses questions like these to introduce a topic and then introduces tools for making sense of the available information. We’ll see that statistics is the art and science of designing studies and analyzing the information that those studies produce.
In the business world, managers use statistics to analyze results of marketing studies about new products, to help predict sales, and to measure employee performance. In finance, statistics is used to study stock returns and investment opportunities. Medical studies use statistics to evaluate whether new ways to treat disease are better than existing ways. In fact, most professional occupations today rely heavily on statistical methods. In a competitive job market, understanding statistics provides an important advantage. But it’s important to understand statistics even if you will never use it in your job. Understanding statistics can help you make better choices. Why? Because every day you are bombarded with statistical information from news reports, advertisements, political campaigns, and surveys. How do you know what to heed and what to ignore?
An understanding of the statistical reasoning (and in some cases statistical misconceptions) underlying these pronouncements will help. For instance, this book will enable you to evaluate claims about medical research studies more effectively so that you know when you should be skeptical. For example, does taking an aspirin daily truly lessen the chance of having a heart attack?

We realize that you are probably not reading this book in the hope of becoming a statistician. (That’s too bad, because there’s a severe shortage of statisticians: more jobs than trained people. And with the ever-increasing ways in which statistics is being applied, it’s an exciting time to be a statistician.) You may even suffer from math phobia. Please be assured that to learn the main concepts of statistics, logical thinking and perseverance are more important than high-powered math skills. Don’t be frustrated if learning comes slowly and you need to read about a topic a few times before it starts to make sense. Just as you would not expect to sit through a single foreign language class session and be able to speak that language fluently, the same is true with the language of statistics. It takes time and practice. But we promise that your hard work will be rewarded. Once you have completed even part of this text, you will understand much better how to make sense of statistical information and, hence, the world around you.

1.1 Using Data to Answer Statistical Questions

Does a low-carbohydrate diet result in significant weight loss? Are people more likely to stop at a Starbucks if they’ve seen a recent Starbucks TV commercial? Information gathering is at the heart of investigating answers to such questions. The information we gather with experiments and surveys is collectively called data.

For instance, consider an experiment designed to evaluate the effectiveness of a low-carbohydrate diet.
The data might consist of the following measurements for the people participating in the study: weight at the beginning of the study, weight at the end of the study, number of calories of food eaten per day, carbohydrate intake per day, body-mass index (BMI) at the start of the study, and gender. A marketing survey about the effectiveness of a TV ad for Starbucks could collect data on the percentage of people who went to a Starbucks since the ad aired and analyze how it compares for those who saw the ad and those who did not see it.

Defining Statistics
You already have a sense of what the word statistics means. You hear statistics quoted about sports events (number of points scored by each player on a basketball team), statistics about the economy (median income, unemployment rate), and statistics about opinions, beliefs, and behaviors (percentage of students who indulge in binge drinking). In this sense, a statistic is merely a number calculated from data. But statistics as a field is a way of thinking about data and quantifying uncertainty, not a maze of numbers and messy formulas.

Statistics
Statistics is the art and science of designing studies and analyzing the data that those studies produce. Its ultimate goal is translating data into knowledge and understanding of the world around us. In short, statistics is the art and science of learning from data.

Statistical methods help us investigate questions in an objective manner. Statistical problem solving is an investigative process that involves four components: (1) formulate a statistical question, (2) collect data, (3) analyze data, and (4) interpret results. The following examples ask questions that we’ll learn how to answer using statistical investigations.

Scenario 1: Predicting an Election Using an Exit Poll
In elections, television networks often declare the winner well before all the votes have been counted.
They do this using exit polling, interviewing voters after they leave the voting booth. Using an exit poll, a network can often predict the winner after learning how several thousand people voted, out of possibly millions of voters. The 2010 California gubernatorial race pitted Democratic candidate Jerry Brown against Republican candidate Meg Whitman. A TV exit poll used to project the outcome reported that 53.1% of a sample of 3889 voters said they had voted for Jerry Brown [1]. Was this sufficient evidence to project Brown as the winner, even though information was available from such a small portion of the more than 9.5 million voters in California? We’ll learn how to answer that question in this book.

Scenario 2: Making Conclusions in Medical Research Studies
Statistical reasoning is at the foundation of the analyses conducted in most medical research studies. Let’s consider three examples of how statistics can be relevant.

Heart disease is the most common cause of death in industrialized nations. In the United States and Canada, nearly 30% of deaths yearly are due to heart disease, mainly heart attacks. Does regular aspirin intake reduce deaths from heart attacks? Harvard Medical School conducted a landmark study to investigate. The people participating in the study regularly took either an aspirin or a placebo (a pill with no active ingredient). Of those who took aspirin, 0.9% had heart attacks during the study. Of those who took the placebo, 1.7% had heart attacks, nearly twice as many.

Can you conclude that it’s beneficial for people to take aspirin regularly? Or, could the observed difference be explained by how it was decided which people would receive aspirin and which would receive the placebo? For instance, might those who took aspirin have had better results merely because they were healthier, on average, than those who took the placebo? Or, did those taking aspirin have a better diet or exercise more regularly, on average?
For years there has been controversy about whether regular intake of large doses of vitamin C is beneficial. Some studies have suggested that it is. But some scientists have criticized those studies’ designs, claiming that the subsequent statistical analysis was meaningless. How do we know when we can trust the statistical results in a medical study that is reported in the media?

[1] Source: Data from www.cnn.com/ELECTION/2010/results/polls/.

Suppose you wanted to investigate whether, as some have suggested, heavy use of cell phones makes you more likely to get brain cancer. You could pick half the students from your school and tell them to use a cell phone each day for the next 50 years, and tell the other half never to use a cell phone. Fifty years from now you could see whether more users than nonusers of cell phones got brain cancer. Obviously it would be impractical to carry out such a study. And who wants to wait 50 years to get the answer? Years ago, a British statistician figured out how to study whether a particular type of behavior has an effect on cancer, using already available data. He did this to answer a then controversial question: Does smoking cause lung cancer? How did he do this?

This book will show you how to answer questions like these. You’ll learn when you can trust the results from studies reported in the media and when you should be skeptical.

Scenario 3: Using a Survey to Investigate People’s Beliefs
How similar are your opinions and lifestyle to those of others? It’s easy to find out. Every other year, the National Opinion Research Center at the University of Chicago conducts the General Social Survey (GSS). This survey of a few thousand adult Americans provides data about the opinions and behaviors of the American public.
You can use it to investigate how adult Americans answer a wide diversity of questions, such as, “Do you believe in life after death?” “Would you be willing to pay higher prices in order to protect the environment?” “How much TV do you watch per day?” and “How many sexual partners have you had in the past year?” Similar surveys occur in other countries, such as the Eurobarometer survey within the European Union. We’ll use data from such surveys to illustrate the proper application of statistical methods.

Reasons for Using Statistical Methods
The scenarios just presented illustrate the three main components of statistics for answering a statistical question:
• Design: Stating the goal and/or statistical question of interest and planning how to obtain data that will address them
• Description: Summarizing and analyzing the data that are obtained
• Inference: Making decisions and predictions based on the data for answering the statistical question

In Words: The verb infer means to arrive at a decision or prediction by reasoning from known evidence. Statistical inference does this using data as evidence.

Design refers to planning how to obtain data that will efficiently shed light on the statistical question of interest. How could you conduct an experiment to determine reliably whether regular large doses of vitamin C are beneficial? In marketing, how do you select the people to survey so you’ll get data that provide good predictions about future sales?

Description means exploring and summarizing patterns in the data. Files of raw data are often huge. For example, over time the General Social Survey has collected data about hundreds of characteristics on many thousands of people. Such raw data are not easy to assess; we simply get bogged down in numbers.
It is more informative to use a few numbers or a graph to summarize the data, such as an average amount of TV watched or a graph displaying how number of hours of TV watched per day relates to number of hours per week exercising.

Inference means making decisions or predictions based on the data. Usually the decision or prediction refers to a larger group of people, not merely those in the study. For instance, in the exit poll described in Scenario 1, of 3889 voters sampled, 53.1% said they voted for Jerry Brown. Using these data, we can predict (infer) that a majority of the 9.5 million voters voted for him. Stating the percentages for the sample of 3889 voters is description, whereas predicting the outcome for all 9.5 million voters is inference.

In Words: Variable refers to the characteristic being measured, such as number of hours per day that you watch TV.

Statistical description and inference are complementary ways of analyzing data. Statistical description provides useful summaries and helps you find patterns in the data; inference helps you make predictions and decide whether observed patterns are meaningful. You can use both to investigate questions that are important to society. For instance, “Has there been global warming over the past decade?” “Is having the death penalty available for punishment associated with a reduction in violent crime?” “Does student performance in school depend on the amount of money spent per student, the size of the classes, or the teachers’ salaries?”

Long before we analyze data, we need to give careful thought to posing the questions to be answered by that analysis. The nature of these questions has an impact on all stages: design, description, and inference. For example, in an exit poll, do we just want to predict which candidate won, or do we want to investigate why by analyzing how voters’ opinions about certain issues related to how they voted?
We’ll learn how questions such as these and the ones posed in the previous paragraph can be phrased in terms of statistical summaries (such as percentages and means) so that we can use data to investigate their answers.

Finally, a topic that we have not mentioned yet but that is fundamental for statistical inference is probability, which is a framework for quantifying how likely various possible outcomes are. We’ll study probability because it will help us answer questions such as, “If Brown were actually going to lose the election (that is, if he were supported by less than half of all voters), what’s the chance that an exit poll of 3889 voters would show support by 53.1% of the voters?” If the chance were extremely small, we’d feel comfortable making the inference that his election was supported by the majority of all 9.5 million voters.

Activity 1
Downloading Data from the Internet
It is simple to get descriptive summaries of data from the General Social Survey (GSS). We’ll demonstrate, using one question asked in recent surveys, “On a typical day, about how many hours do you personally watch television?”
• Go to the website sda.berkeley.edu/GSS.
• Click GSS - with NO WEIGHT as the default weight selection (SDA 4.0).
• The GSS name for the number of hours of TV watching is TVHOURS. Type TVHOURS as the row variable name. (See the output below on the left.) In the Weight menu, make sure that No Weight is selected.
• Click Run the Table.

Now you’ll see a table that shows the number of people and, in bold, the percentage who made each of the possible responses. For all the years combined in which this question was asked, the most common response was 2 hours of TV a day (about 27% made this response, as shown in the output on the right). What percentage of the people surveyed reported watching 0 hours of TV a day? How many people reported watching TV 24 hours a day?
Another question asked in the GSS is, “Taken all together, would you say that you are very happy, pretty happy, or not too happy?” The GSS name for this item is HAPPY. What percentage of people reported being very happy?

You might use the GSS to investigate what sorts of people are more likely to be very happy. Those who are happily married? Those who are in good health? Those who have lots of friends? We’ll see how to find out in this book.

Try Exercises 1.3 and 1.4

1.1 Practicing the Basics

1.1 Aspirin the wonder drug
An analysis by Professor Peter M Rothwell and his colleagues (Nuffield Department of Clinical Neuroscience, University of Oxford, UK) published in 2012 in the medical journal The Lancet (http://www.thelancet.com) assessed the effects of daily aspirin intake on cancer mortality. They looked at individual patient data from 51 randomized trials (77,000 participants) of daily intake of aspirin versus no aspirin or other antiplatelet agents. According to the authors, aspirin reduced the incidence of cancer, with maximum benefit seen when the scheduled duration of trial treatment was five years or more, and resulted in a relative reduction in cancer deaths of about 15% (562 cancer deaths in the aspirin group versus 664 cancer deaths in the control group). Specify the aspect of this study that pertains to (a) design, (b) description, and (c) inference.

1.2 Poverty and age
The Current Population Survey (CPS) is a survey conducted by the U.S. Census Bureau for the Bureau of Labor Statistics. It provides a comprehensive body of data on the labor force, unemployment, wealth, poverty, and so on. The data can be found online at www.census.gov/cps/. The 2014 CPS ASEC (Annual Social and Economic Supplement) had redesigned questions for income that were administered to a sample of approximately 30,000 addresses that were eligible to receive these.
The report indicated that 21.1% of children under 18 years, 13.5% of people between 18 and 64 years, and 10.0% of people 65 years and older were below the poverty line. Based on these results, the report concluded that the percentage of all people between the ages of 18 and 64 in poverty lies between 13.2% and 13.8%. Specify the aspect of this study that pertains to (a) description and (b) inference.

1.3 GSS and heaven
Go to the General Social Survey website, http://sda.berkeley.edu/GSS. Enter HEAVEN as the row variable and then click Run the Table. When asked whether they believed in heaven, what percentage of those surveyed said yes, definitely; yes, probably; no, probably not; and no, definitely not?

1.4 GSS and heaven and hell
Refer to the previous exercise. You can obtain data for a particular survey year such as 2008 by entering YEAR(2008) in the Selection Filter option box before you click Run the Table.
a. Do this for HEAVEN in 2008, giving the percentages for the four possible outcomes. Note that HEAVEN is not available for the 2014 data because the question wasn’t asked that year.
b. Summarize opinions in 2008 about belief in hell (row variable HELL). Was the percentage of “yes, definitely” responses higher for belief in heaven or in hell?

1.5 GSS for subject you pick
At the GSS website, click Standard Codebook under Codebooks and then click Sequential Variable List. Find a subject that interests you and look up a relevant GSS code name to enter as the row variable. Summarize the results that you obtain.

1.2 Sample Versus Population

We’ve seen that statistics consists of methods for designing investigative studies, describing (summarizing) data obtained for those studies, and making inferences (decisions and predictions) based on those data to answer a statistical question of interest.
We Observe Samples But Are Interested in Populations
The entities that we measure in a study are called the subjects. Usually subjects are people, such as the individuals interviewed in a General Social Survey. But they need not be. For instance, subjects could be schools, countries, or days. We might measure characteristics such as the following:
• For each school: the per-student expenditure, the average class size, the average score of students on an achievement test
• For each country: the percentage of residents living in poverty, the birth rate, the percentage unemployed, the percentage who are computer literate
• For each day in an Internet café: the amount spent on coffee, the amount spent on food, the amount spent on Internet access

The population is the set of all the subjects of interest. In practice, we usually have data for only some of the subjects who belong to that population. These subjects are called a sample.

Population and Sample
The population is the total set of subjects in which we are interested. A sample is the subset of the population for whom we have (or plan to have) data, often randomly selected.

In Practice: Sometimes the population is well-defined, but sometimes it is a hypothetical set of people that is not enumerated. For example, in a clinical trial to examine the impact of a new drug on lowering cholesterol, the population would be all people with high cholesterol who might have signed up for such a trial. This set of people is not specifically listed.

In the 2014 General Social Survey (GSS), the sample was the 2538 people who participated in this survey. The population was the set of all adult Americans at that time, more than 318 million people.

Example 2: Sample and population
An Exit Poll

Picture the Scenario
Scenario 1 in the previous section discussed an exit poll. The purpose was to predict the outcome of the 2010 gubernatorial election in California.
The exit poll sampled 3889 of the 9.5 million people who voted.

Question to Explore
For this exit poll, what was the population and what was the sample?

Think It Through
The population was the total set of subjects of interest, namely, the 9.5 million people who voted in this election. The sample was the 3889 voters who were interviewed in the exit poll. These are the people from whom the poll obtained data about their votes.

Insight
The ultimate goal of most studies is to learn about the population. For example, the sponsors of this exit poll wanted to make an inference (prediction) about all voters, not just the 3889 voters sampled by the poll.

Try Exercises 1.9 and 1.10

Did You Know? Examples in this book use the five parts shown in this example: Picture the Scenario introduces the context. Question to Explore states the question addressed. Think It Through shows the reasoning used to answer that question. Insight gives follow-up comments related to the example. Try Exercises direct you to a similar “Practicing the Basics” exercise at the end of the section. Also, each example title is preceded by a label highlighting the example’s concept. In this example, the concept label is “Sample and population.”

Occasionally data are available from an entire population. For instance, every ten years the U.S. Bureau of the Census gathers data from the entire U.S. population (or nearly all). But the census is an exception. Usually, it is too costly and time-consuming to obtain data from an entire population. It is more practical to get data for a sample. The General Social Survey and polling organizations such as the Gallup poll usually select samples of about 1000 to 2500 Americans to learn about opinions and beliefs of the population of all Americans. The same is true for surveys in other parts of the world, such as the Eurobarometer in Europe.
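The distinction between a population and a sample can be made concrete with a short simulation. The sketch below is only an illustration and not part of the exit poll itself: it sets up a hypothetical population of 9.5 million voters in which an assumed 54% voted for Brown (this true share is an assumption chosen for the example, since the real value is exactly what a poll tries to estimate), draws a random sample of 3889 voters, and reports the sample percentage. The names POPULATION_SIZE, ASSUMED_TRUE_SHARE, and draw_exit_poll are made up for this sketch.

```python
import random

POPULATION_SIZE = 9_500_000   # voters in the election (from the example)
SAMPLE_SIZE = 3889            # voters interviewed in the exit poll
ASSUMED_TRUE_SHARE = 0.54     # ASSUMPTION: hypothetical population share for Brown


def draw_exit_poll(rng: random.Random) -> float:
    """Draw one random sample of voters and return the sample percentage for Brown."""
    brown_voters = round(POPULATION_SIZE * ASSUMED_TRUE_SHARE)
    # Voters are numbered 0 .. POPULATION_SIZE - 1; by construction, the voters
    # with the lowest `brown_voters` id numbers are the ones who voted for Brown.
    sampled_ids = rng.sample(range(POPULATION_SIZE), SAMPLE_SIZE)
    votes_for_brown = sum(1 for voter_id in sampled_ids if voter_id < brown_voters)
    return 100 * votes_for_brown / SAMPLE_SIZE


rng = random.Random(1)  # fixed seed so the sketch is repeatable
sample_percentage = draw_exit_poll(rng)
print(f"Sample percentage for Brown: {sample_percentage:.1f}%")
```

Changing the seed gives a slightly different sample percentage each time, yet every run uses data from only 3889 of the 9.5 million simulated voters.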
Descriptive Statistics and Inferential Statistics
Using the distinction between samples and populations, we can now tell you more about the use of description and inference in statistical analyses.

Description in Statistical Analyses
Descriptive statistics refers to methods for summarizing the collected data (where the data constitute either a sample or a population). The summaries usually consist of graphs and numbers such as averages and percentages.

A descriptive statistical analysis usually combines graphical and numerical summaries. For instance, Figure 1.1 is a bar graph that shows the percentages of educational attainment in the United States in 2013. It summarizes a survey of 78,000 households by the U.S. Bureau of the Census. The main purpose of descriptive statistics is to reduce the data to simple summaries without distorting or losing much information. Graphs and numbers such as percentages and averages are easier to comprehend than the entire set of data. It’s much easier to get a sense of the data by looking at Figure 1.1 than by reading through the questionnaires filled out by the 78,000 sampled households. From this graph, it’s readily apparent that about a third of people have at least a bachelor’s degree, whereas about 11% do not have a high school diploma.

[Figure 1.1: bar graph titled “U.S. Educational Attainment Levels”. Vertical axis: Percent of US Adults. Horizontal axis: Education Level, with categories No High School Diploma; High School or Equivalent; Some College, less than a 4-year degree; Bachelor’s degree or higher.]
Figure 1.1 Educational Attainment, Based on a Sample of 78,000 Households in the 2013 Current Population Survey. (Source: Data from United States Census Bureau.)

Descriptive statistics are also useful when data are available for the entire population, such as in a census.
By contrast, inferential statistics are used when data are available for a sample only, but we want to make a decision or prediction about the entire population.

Inference in Statistical Analyses
Inferential statistics refers to methods of making decisions or predictions about a population, based on data obtained from a sample of that population.

In most surveys, we have data for a sample, not for the entire population. We use descriptive statistics to summarize the sample data and inferential statistics to make predictions about the population.

Example 3: Descriptive and inferential statistics
Polling Opinions on Handgun Control

Picture the Scenario
Suppose we’d like to know what people think about controls over the sales of handguns. Let’s consider how people feel in Florida, a state with a relatively high violent-crime rate. The population of interest is the set of more than 10 million adult residents of Florida.

Because it is impossible to discuss the issue with all these people, we can study results from a recent poll of 834 Florida residents conducted by the Institute for Public Opinion Research at Florida International University. In that poll, 54.0% of the sampled subjects said they favored controls over the sales of handguns. A newspaper article about the poll reports that the margin of error for how close this number falls to the population percentage is 3.4%. We’ll see (later in the textbook) that this means we can predict with high confidence (about 95% certainty) that the percentage of all adult Floridians favoring control over sales of handguns falls within 3.4% of the survey’s value of 54.0%, that is, between 50.6% and 57.4%.

Question to Explore
In this analysis, what is the descriptive statistical analysis and what is the inferential statistical analysis?

Think It Through
The results for the sample of 834 Florida residents are summarized by the percentage, 54.0%, who favored handgun control.
This is a descriptive statistical analysis. We’re interested, however, not just in those 834 people but in the population of all adult Florida residents. The prediction that the percentage of all adult Floridians who favor handgun control falls between 50.6% and 57.4% is an inferential statistical analysis. In summary, we describe the sample, and we make inferences about the population.

Insight
The sample size of 834 was small compared to the population size of more than 10 million. However, because the values between 50.6% and 57.4% are all above 50%, the study concluded that a slim majority of Florida residents favored handgun control.

Try Exercises 1.11, part a, and 1.12, parts a–c

An important aspect of statistical inference involves reporting the likely precision of a prediction. How close is the sample value of 54% likely to be to the true (unknown) percentage of the population favoring gun control? We’ll see (in Chapters 4 and 6) why a well-designed sample of 834 people yields a sample percentage value that is very likely to fall within about 3–4% (the so-called margin of error) of the population value. In fact, we’ll see that inferential statistical analyses can predict characteristics of entire populations quite well by selecting samples that are small relative to the population size. Surprisingly, the absolute size of the sample matters much more than the size relative to the population total. For example, the population of China is about four times that of the United States, but a random sample of 1000 people from the Chinese population and a random sample of 1000 people from the U.S. population would achieve similar levels of accuracy. That’s why most polls take samples of only about a thousand people, even if the population has millions of people. In this book, we’ll see why this works.

Sample Statistics and Population Parameters
In Example 3, the percentage of the sample favoring handgun control is an example of a sample statistic.
It is crucial to distinguish between sample statistics and the corresponding values for the population. The term parameter is used for a numerical summary of the population.

Recall: A population is the total group of individuals about whom you want to make conclusions. A sample is a subset of the population for whom you actually have data.

Parameter and Statistic
A parameter is a numerical summary of the population. A statistic is a numerical summary of a sample taken from the population.

For example, the percentage of the population of all adult Florida residents favoring handgun control is a parameter. We hope to learn about parameters so that we can better understand the population, but the true parameter values are almost always unknown. Thus, we use sample statistics to estimate the parameter values.

Randomness and Variability
Random is often thought to mean chaotic or haphazard, but randomness is an extremely powerful tool for obtaining good samples and conducting experiments. A sample tends to be a good reflection of a population when each subject in the population has the same chance of being included in that sample. That’s the basis of random sampling, which is designed to make the sample representative of the population. A simple example of random sampling is when a teacher puts each student’s name on a slip of paper, places it in a hat, and then draws names from the hat without looking.
• Random sampling allows us to make powerful inferences about populations.
• Randomness is also crucial to performing experiments well. If, as in Scenario 2, we want to compare aspirin to a placebo in terms of the percentage of people who later have a heart attack, it’s best to randomly select those in the sample who use aspirin and those who use the placebo. This approach tends to keep the groups balanced on other factors that could affect the results.
For example, suppose we allowed people to choose whether to use aspirin (instead of randomizing whether the person receives aspirin or the placebo). Then, the people who decided to use aspirin might have tended to be healthier than those who didn’t, which could produce misleading results.

People are different from each other, so, not surprisingly, the measurements we make on them vary from person to person. For the GSS question about TV watching in Activity 1, different people reported different amounts of TV watching. In the exit poll of Example 2, not all people voted the same way. If subjects did not vary, we’d need to sample only one of them. We learn more about this variability by sampling more people. If we want to predict the outcome of an election, we’re better off sampling 100 voters than one voter, and our prediction will be even more reliable if we sample 1000 voters.
• Just as people vary, so do samples vary. Suppose you take an exit poll of 1000 voters to predict the outcome of an election. Suppose the Gallup organization also takes an exit poll of 1000 voters. Your sample will have different people than Gallup’s. Consequently, the predictions will also differ. Perhaps your exit poll of 1000 voters has 480 voting for the Republican candidate, so you predict that 48% of all voters voted for that person. Perhaps Gallup’s exit poll of 1000 voters has 440 voting for the Republican candidate, so they predict that 44% of all voters voted for that person. Activity 2 at the end of the chapter shows that with random sampling, the amount of variability from sample to sample is actually quite predictable. Both of your predictions are likely to fall within 5% of the actual population percentage who voted Republican, assuming the samples are random. If, on the other hand, Republicans are more likely than Democrats to refuse to participate in the exit poll, then we would need to account for this. In the 2004 U.S.
presidential election, much controversy arose when George W. Bush won several states in which exit polling predicted that John Kerry had won. Is it likely that the way the exit polls were conducted led to these incorrect predictions?

One of the main goals of statistical inference is to make statements about large populations based on a small sample. We will see more on this in Chapters 8 and 9.

Estimation from Surveys with Random Sampling
Sample surveys are a common method of gathering data (see Chapter 4). Data from sample surveys are frequently used to estimate population percentages. For instance, a Gallup poll recently reported that 30% of Americans worried that they might not be able to pay health care costs during the next 12 months. How close is a sample estimate to the true, unknown population percentage? When you read results of surveys, you’ll often see a statement such as, “The margin of error is plus or minus 3 percentage points.” The margin of error is a measure of the expected variability from one random sample to the next random sample. As we will see in Activity 2, there is variability in samples. A second sample of the same size could yield a proportion of 29%, whereas a third might yield 32%. A margin of error of plus or minus 3 percentage points means it is very likely that the population percentage is no more than 3 percentage points lower or 3 percentage points higher than the reported sample percentage. So if Gallup reports that 30% worry about health care costs, it’s very likely that in the entire population, the percentage that worry about health care costs is between about 27% and 33% (that is, within 3 percentage points of 30%). “Very likely” typically means that about 95 times out of 100, such statements are correct. We refer to this as a 95% confidence interval. Chapter 8 will show details about margin of error and how to calculate it. For now, we’ll use a rough approximation. In statistics, we let n denote the number of subjects in the sample.
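The sample-to-sample variability behind a margin of error can be explored with a small simulation. The sketch below rests on assumptions chosen only for illustration: a population in which exactly 30% worry about health care costs, and polls of 1000 randomly chosen people each (the text does not state the size of the Gallup sample). It simulates many such polls and counts how often the sample percentage lands within 3 percentage points of the assumed population value. All names in the code are hypothetical.

```python
import random

TRUE_PERCENT = 30    # ASSUMPTION: hypothetical population percentage
SAMPLE_SIZE = 1000   # ASSUMPTION: number of people in each simulated poll
NUM_POLLS = 2000     # number of simulated polls
MARGIN = 3           # margin of error, in percentage points


def poll_percentage(rng: random.Random) -> float:
    """Simulate one random sample and return its sample percentage."""
    yes_answers = sum(
        1 for _ in range(SAMPLE_SIZE) if rng.random() < TRUE_PERCENT / 100
    )
    return 100 * yes_answers / SAMPLE_SIZE


rng = random.Random(7)  # fixed seed so the sketch is repeatable
results = [poll_percentage(rng) for _ in range(NUM_POLLS)]

# Share of simulated polls whose result is within MARGIN points of the truth.
within_margin = sum(1 for pct in results if abs(pct - TRUE_PERCENT) <= MARGIN)
share_within = within_margin / NUM_POLLS

print(f"Lowest and highest sample percentage: {min(results):.1f}%, {max(results):.1f}%")
print(f"Share of polls within {MARGIN} points of {TRUE_PERCENT}%: {share_within:.3f}")
```

Under these assumptions, individual polls differ from one another, but the share that falls within 3 points of the population value should be in the neighborhood of the “95 times out of 100” described above.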
When creating a 95% confidence interval using a simple random sample of n subjects, and finding the margin of error for the estimation of a population proportion,

approximate margin of error $= \frac{1}{\sqrt{n}} \times 100\%$

Example 4 (Margin of error): Gallup Poll

Picture the Scenario

On April 20, 2010, one of the worst environmental disasters took place in the Gulf of Mexico, the Deepwater Horizon oil spill. As a result of an explosion on an oil drilling platform, oil flowed freely into the Gulf of Mexico for nearly three months until it was finally capped on July 15, 2010. It is estimated that more than 200 million gallons of crude oil spilled, causing extensive damage to marine and wildlife habitats and crippling the Gulf’s fishing and tourism industries. In response to the spill, many activists called for an end to deepwater drilling off the U.S. coast and for increased efforts to eliminate our dependence on oil. Meanwhile, approximately nine months after the Gulf disaster, political turbulence gripped the Middle East, causing the price of gasoline in the United States to approach an all-time high. Between March 3 and 6, 2011, Gallup’s annual environmental survey [2] reported that 60% of Americans favored offshore drilling as a means to reduce U.S. dependence on foreign oil, 37% opposed offshore drilling, and the remaining 3% had no opinion. The poll was based on interviews conducted with a random sample of 1021 adults, aged 18 and older, living in the continental United States, selected using random digit dialing.

Questions to Explore

a. Find an approximate margin of error for these results reported in the environmental survey report.
b. How is the margin of error interpreted?

[2] www.gallup.com/poll/146615/OilDrillingGainsFavorAmericans.aspx

Think It Through

a. The sample size was n = 1021 U.S. adults. The Gallup poll was a random sample but not a simple random sample.
It used random digit dialing, which can give results nearly as accurate as with a simple random sample. We estimate the margin of error as approximately

$\frac{1}{\sqrt{n}} \times 100\% = \frac{1}{\sqrt{1021}} \times 100\% \approx 0.03 \times 100\% = 3\%$

b. Gallup reported that 60% of Americans support offshore drilling. This percentage refers to the sample of 1021 U.S. adults. The reported margin of error of 3% suggests that in the population of adult Americans, it’s likely that between about 57% and 63% support offshore drilling.

Insight

You may be surprised or skeptical that a sample of only 1021 people out of a huge population (such as 200 million adult Americans) can provide such a precise inference. That’s the power of random sampling. We’ll see why this works in Chapter 7. The more precise formula shown there will give a somewhat smaller margin of error value when the sample percentage is far from 50%.

Try Exercises 1.17 and 1.18

Testing and Statistical Significance

Now suppose you’ve conducted an experiment to determine whether a new form of chemotherapy is effective at keeping cancer in remission for at least five years, compared with the standard chemotherapy, whose recipients we will call the control group. You find that the cancer remission rate (the rate at which the cancer is no longer observed in the patient) is 55% in subjects taking the new form of chemotherapy, whereas the cancer remission rate is 44% in the control group. Can you conclude that the new chemotherapy was effective? Not quite yet. You must convince yourself that this difference between 44% and 55% cannot be explained by the variation that occurs naturally just by chance. Even if the effect of the chemotherapy is no different from the effect of control, the sample remission rates would not be exactly the same for the two groups. Just by ordinary chance, even if there is no effect of the treatment, the remission rate in the treatment group could be higher than the rate in the control group.
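The idea that two groups can differ by chance alone can be illustrated with a small simulation. The sketch below is not from the text: it assumes a hypothetical common remission rate of 50% in both groups, 215 subjects per group, 2000 repetitions, and a fixed random seed, and the function name `simulate_rate_differences` is purely illustrative.

```python
import random


def simulate_rate_differences(true_rate, n_per_group, n_trials, seed=0):
    """Simulate the difference in sample remission rates (in percentage
    points) between two groups that share the same true remission rate."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    differences = []
    for _ in range(n_trials):
        # Count remissions in each group; every subject has the same chance.
        treated = sum(rng.random() < true_rate for _ in range(n_per_group))
        control = sum(rng.random() < true_rate for _ in range(n_per_group))
        differences.append(100.0 * (treated - control) / n_per_group)
    return differences


# Assumed values for illustration only (not taken from the text).
diffs = simulate_rate_differences(true_rate=0.5, n_per_group=215, n_trials=2000)

# Share of simulated experiments whose two rates are unequal, and the largest
# gap seen, even though the treatment has no effect in this simulation.
share_nonzero = sum(d != 0 for d in diffs) / len(diffs)
largest_gap = max(abs(d) for d in diffs)
print(f"Share of trials with unequal rates: {share_nonzero:.2f}")
print(f"Largest gap between groups: {largest_gap:.1f} percentage points")
```

Because both simulated groups have the same true rate, every nonzero difference in the output is due to chance alone, which is the kind of variation a real difference has to be judged against.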
In a randomized experiment, the variation that could be expected to occur just by chance alone is roughly like the margin of error with simple random sampling. So if a treatment has n = 215 observations, this is about $\frac{1}{\sqrt{215}} \times 100\%$, or about 7%. If the population percentage of people in remission is 50%, the sample percentage in remission among 215 subjects is very likely to fall between 43% and 57% (that is, within 7% of 50%). The difference in a study between 44% in remission and 55% in remission could be explained by the ordinary variation expected for this size of sample.

By contrast, if each treatment had n = 1000 observations, the ordinary variation we’d expect due to chance is only about $\frac{1}{\sqrt{1000}} \times 100\%$, or 3%, for each percentage. Then, the difference between 44% and 55% could not be explained by ordinary variation: There is no plausible common population percentage for the two treatments such that 44% and 55% are both within 3% of its value. The new chemotherapy would truly seem to be better than standard chemotherapy.

The difference expected due to ordinary variation is smaller with larger samples. You can be more confident that sample results reflect a true effect when the sample size is large than when it is small. Obviously, we cannot learn much by using only n = 1 subject for each treatment.

When the difference between the results for the two treatments is so large that it would be rare to see such a difference by ordinary random variation, we say that the results are statistically significant. For example, the difference between 55% remission and 44% remission when n = 1000 for each group is statistically significant. We can then conclude that the observed difference is likely due to the effect of the treatments rather than merely due to ordinary random variation. Thus, in the population, we infer that the response truly depends on the treatment.
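The comparison of n = 215 with n = 1000 above can be written out as a short calculation. This is only a sketch of the rough rule used in this section, not the formal test developed in later chapters; the function names `approx_margin_of_error` and `explainable_by_chance` are hypothetical.

```python
import math


def approx_margin_of_error(n):
    """Rough 95% margin of error, in percentage points: (1 / sqrt(n)) * 100."""
    return 100.0 / math.sqrt(n)


def explainable_by_chance(pct_a, pct_b, n):
    """Return True if some common population percentage lies within the
    approximate margin of error of both sample percentages.

    This mirrors the informal argument in the text only; it is not a
    formal significance test.
    """
    margin = approx_margin_of_error(n)
    # The intervals [pct - margin, pct + margin] overlap exactly when the
    # two percentages are no more than two margins apart.
    return abs(pct_a - pct_b) <= 2 * margin


for n in (215, 1000):
    margin = approx_margin_of_error(n)
    verdict = explainable_by_chance(44, 55, n)
    print(f"n = {n}: margin about {margin:.1f} points, "
          f"44% vs 55% explainable by chance: {verdict}")
```

For the Gallup sample in Example 4, `approx_margin_of_error(1021)` gives roughly 3.1 points, in line with the 3% used there.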
How to determine statistical significance and how it depends on the sample size are topics we’ll study in later chapters. For now, suffice it to say the larger the sample size, the better.

The Basic Ideas of Statistics

Here is a summary of the key concepts of statistics that you’ll learn about in this book:

Chapter 2: Exploring Data with Graphs and Numerical Summaries
How can you present simple summaries of data? You replace lots of numbers with simple graphs and numerical summaries.

Chapter 3: Association: Contingency, Correlation, and Regression
How does annual income ten years after graduation correlate wi