Transcription

A.2 R AND S-PLUS EXAMPLESR is free software maintained and regularly updated by many volunteers. It is anopen-source version using the S programming language, and many S-Plus functionsalso work in R. See http://www.r-project.org, at which site you can download Rand find various documentation. Our discussion in this Appendix refers to R version2.13.0.Dr. Laura Thompson has prepared an excellent, detailed manual on the use of Rand S-Plus to conduct the analyses shown in the second edition of Categorical DataAnalysis. You can access this athttps://home.comcast.net/ lthompson221/Splusdiscrete2.pdfA good introductory resource about R functions for various basic types of categorical data analyses is material prepared by Dr. Brett Presnell at the University ofFlorida. The siteswww.stat.ufl.edu/ presnell/Courses/sta4504-2000sp/Rand in particular,www.stat.ufl.edu/ presnell/Courses/sta4504-2000sp/R/R-CDA.pdfhave details for an introductory course on categorical data analysis with many of theexamples from my books.Another useful resource is the website of Dr. Chris re the link to R has examples of the use of R for most chapters of my introductorytext, An Introduction to Categorical Data Analysis. The link to Schedule at Bilder’swebsite for Statistics 875 at the University of Nebraska has notes for a course on thistopic following that text as well as R code and output imbedded within the notes.Thanks to Dr. Bilder for this outstanding resource.Another good source of examples for Splus and R is Dr. Pat Altham’s at Cambridge, UK,www.statslab.cam.ac.uk/ patAn upcoming excellent resource of the use of R for contingency table analysis is asoon-to-appear book by Dr. Maria Kateri. That text also provides functions for manymethods not readily available in R.Texts that contain examples of the use of R for various categorical data methodsinclude Statistical Modelling in R by M. Aitkin, B. Francis, J. Hinde, and R. Darnell(Oxford 2009), Modern Applied Statistics With S-PLUS, 4th ed., by W. N. Venablesand B. D. Ripley (Springer, 2010), Analyzing Medical Data Using S-PLUS by B.Everitt and S. Rabe-Hesketh (Springer, 2001), Regression Modeling Strategies by F.E. Harrell (Springer, 2001), and Bayesian Computation with R by J. Albert (Springer,2009).1

Chapter 1: IntroductionUnivariate binomial and multinomial inferenceThe function dbinom can generate binomial probabilities, for example, dbinom(6, 10,0.5) gives the probability of 6 successes in 10 trials with “probability of success”parameter π 0.50. The function pbinom(6, 10, 0.5) would give the correspondingcumulative probability of 6 or fewer successes.The function prop.test gives the Pearson (score) test and score confidence interval for a binomial proportion, for example, prop.test(6, 10, p .5, correct FALSE),where “correct FALSE” turns off the continuity correction, which is the default. Thefunction binom.test gives a small-sample binomial test, for example binom.test(8, 12,p 0.5, alternative c(”two.sided”)) gives a two-sided test of H0 : π 0.50 with 8successes in 12 trials.The proportion package contains a great variety of confidence intervals for a binomial parameter, including Wald, likelihood-ratio, and score intervals. For instance,for 95% confidence intervals based on 0 successes in 10 trials, the package ----------------------------- library(proportion) ciAllx(0, 10, 0.05)method xLowerLimit UpperLimit1Wald 0 0.000000e 00 0.000000003 Likelihood 0 2.525061e-05 0.174818274Score 0 2.005249e-17 -------------------------------For special R functions for confidence intervals for a binomial parameter, seewww.stat.ufl.edu/ aa/cda/R/one-sample/R1/index.htmlThe confidence intervals include the score (Wilson) CI, Blaker’s exact CI, the smallsample Clopper-Pearson interval and its mid-P adaptation discussed in Section 16.6of the textbook, and the Agresti–Coull CI and its add-4 special case. Some of theseare also available in the PropCIs package prepared by Ralph Scherer at the Institutefor Biometry in Hannover, Germany. For ---------------------------------- library(PropCIs) addz2ci(9, 10, 0.95) # Agresti-Coull CI adding z 2 to success and failures95 percent confidence interval:0.5740323 1.0000000 add4ci(9, 10, 0.95) # Wald CI after add 2 successes and 2 failures95 percent confidence interval:0.5707764 1.0000000 scoreci(9, 10, 0.95) # score CI2

95 percent confidence interval:0.5958 -------------------------------The confidence interval based on the test using mid P -value is available with themidPci function in the PropCIs ----------------------------- library(PropCIs) midPci(9, 10, 0.95) # 9 successes in 10 trials95 percent confidence interval:0.5966 ---------------------------Binomial tests and confidence intervals using the mid P -value are available withthe exactci package. Other available inferences with that package include the Blakerexact confidence ------------------------------ library(exactci) binom.exact(9, 10, 0.50, alternative c("greater"), midp TRUE)number of successes 9, number of trials 10, p-value 0.005859 binom.exact(9, 10, conf.level 0.95, midp TRUE)95 percent confidence interval:0.5965206 0.9950264 binom.exact(9, 10, conf.level 0.95, tsmethod c("blaker"))95 percent confidence interval:0.5555 ---------------------------Joseph Lang, in a 2017 article in The American Statistician (vol. 71, pp. 354-368),has proposed “mean-minimum exact confidence intervals” for a proportion, for whichboth the mean and minimum coverage probability never fall below specified values.He has an R package CI.binom for this.The table function constructs contingency tables.The function chisq.test can perform the Pearson chi-squared test of goodness-of-fitof a set of multinomial probabilities. For example, with 3 categories and hypothesizedvalues (0.4, 0.3, 0.3) and observed counts (12, 8, 10), x - c(12, 8, 10) p - c(0.4, 0.3, 0.3) chisq.test(x, p p)Chi-squared test for given probabilities3

data: xX-squared 0.2222, df 2, p-value 0.8948 chisq.test(x, p p, simulate.p.value TRUE, B 10000)Chi-squared test for given probabilities withsimulated p-value (based on 10000 replicates)data: xX-squared 0.2222, df NA, p-value 0.8763The argument “simulate.p.value TRUE” requests simulation of the exact smallsample test of goodness of fit, with B replicates. So, the second run above usessimulation of 10,000 multinomials with the hypothesized probabilities and finds thesample proportion of them having X 2 value at least as large as the observed value of0.2222.Bayesian inferenceFor a Bayesian posterior interval based on a beta(α1 , α2 ) prior distribution and ysuccesses and n y failures, we find percentiles of the beta distribution with parametersα1 y and α2 n y using the quantile function qbeta. We find a 95% posterior intervalhere based on the uniform prior (α1 α2 1.0) and y 0 and n y 25. Likewise,we can use the pbeta function of cumulative probabilities to find a tail probability,such as the posterior probability that π is at least -------------------------- qbeta(0.025, 1, 26); qbeta(0.975, 1, 26)[1] 0.0009732879[1] 0.1322746 pbeta(0.50, 1, 26) # posterior beta cumulative prob. at 0.50[1] 1 1 - pbeta(0.50, 1, 26) # right-tail prob. above 0.50[1] 1.490116e-08 library(proportion} ciBAx(0, 25, 0.05, 1.0, 1.0) # y, n, CI error prob, beta prior valuesxLBAQxUBAQxLBAHxUBAHx # 2nd set are HPD1 0 0.0009732879 0.1322746 2.440083e-10 ------------------------------One can also use the proportion package for Bayesian posterior intervals based on betapriors, as shown above.See logitnorm.r-forge.r-project.org/ for utilities such as a quantile function for thelogit-normal distribution.The hpd function in the TeachingDemos library can construct HPD intervals froma posterior distribution. The package hdrcde is a more sophisticated package for suchmethods. For the informative analysis of the vegetarians example at the end of Section1.6.4:4

library("TeachingDemos")y - 0; n - 25a1 - 3.6; a2 - 41.4a - a1 y; b - a2 nh - hpd(qbeta, shape1 a, shape2 b)Chapters 2–3: Two-Way Contingency TablesFor creating mosaic plots in R, see www.datavis.ca and also the mosaic functions inthe vcd and vcdExtra libraries; see Michael Friendly’s tutorial ettes/vcd-tutorial.pdf, whichalso is useful for basic descriptive and inferential statistics for contingency tables. Toconstruct a mosaic plot for the data in Table 3.2, one can enter x - 104,293)data - matrix(x, nrow 3,ncol 6, byrow TRUE)dimnames(data) list(Degree c(" HS","HS","College"),Belief tra")library("vcdExtra")StdResid - 0.0,3.4,3.1,4.7,4.8,-0.8,1.1,-6.7)StdResid - matrix(StdResid,nrow 3,ncol 6,byrow TRUE)mosaic(data,residuals StdResid, gp shading Friendly)Chi-squared and Fisher’s exact test; ResidualsThe function chisq.test also can perform the Pearson chi-squared test of independencein a two-way contingency table. For example, for Table 3.2 of the text, using also thestdres component for providing standardized residuals, data - ,89,19,104,293),ncol 6,byrow TRUE) chisq.test(data)Pearson’s Chi-squared testdata: dataX-squared 76.1483, df 10, p-value 2.843e-12 chisq.test(data) stdres[,1][,2][,3][,4][,5][,6][1,] -0.368577 -2.227511 -1.418621 -1.481383 -1.3349600 3.590075[2,] -2.504627 -2.635335 -3.346628 1.832792 0.0169276 3.382637[3,] 3.051857 4.724326 4.839597 -0.792912 1.0794638 -6.665195Likewise, replacing the stdres component by expected would generate the expectedfrequency estimates. As shown above, you can simulate the exact conditional distribution to estimate the P -value whenever the chi-squared asymptotic approximation issuspect.5

Here is an example of using Mantel’s ordinal test for a two-way table with the M 2statistic of Section 3.4 applied to the infant birth defects -------------------------------- Malform - matrix(c(17066, 14464, 788, 126, 37, 48, 38, 5, 1, 1), ncol 2) Malform[,1] [,2][1,] 1706648[2,] 1446438[3,]7885[4,]1261[5,]371 library(vcdExtra) CMHtest(Malform, rscores c(0, 0.5, 1.5, 4.0, 7.0))Cochran-Mantel-Haenszel Statistics # test was proposed by Nathan MantelAltHypothesisChisq DfProbcorNonzero correlation 6.5699 1 --------------------------------The function fisher.test performs Fisher’s exact test. For example, for the teatasting data of Table 3.9 in the text, tea - matrix(c(3,1,1,3),ncol 2,byrow TRUE) fisher.test(tea)Fisher’s Exact Test for Count Datadata: teap-value 0.4857alternative hypothesis: true odds ratio is not equal to 195 percent confidence interval:0.2117329 621.9337505sample estimates:odds ratio6.408309 fisher.test(tea,alternative "greater")Fisher’s Exact Test for Count Datadata: teap-value 0.2429alternative hypothesis: true odds ratio is greater than 1The P -value is the sum of probabilities of tables with the given margins that have probability no greater than the observed table. The output also shows the conditional MLestimate of the odds ratio (see Sec. 16.6.4) and a corresponding exact confidence interval based on noncentral hypergeometric probabilities. Use fisher.test(tea,alternative “greater”)for the one-sided test.For an I J table called “table,” using6

fisher.test(table, simulate.p.value TRUE, B 10000)generates Monte Carlo simulation with B replicates to estimate the exact P -valuebased on the exact conditional multiple hypergeometric distribution obtained by conditioning on all row and column marginal totals (proposed by Agresti, Wackerly, andBoyett, 1979).For mid-P values, you can use the exact2x2 -------------------------------- library(exact2x2) fisher.exact(tea,midp TRUE,conf.int FALSE,alternative "greater")p-value 0.1286alternative hypothesis: true odds ratio is greater than 1 fisher.exact(tea,midp TRUE,conf.level 0.95)p-value 0.2571alternative hypothesis: true odds ratio is not equal to 195 percent confidence interval:0.3100451 ------------------------------------This also shows a confidence interval based on inverting the exact conditional nonnullhypergeometric distribution, but using the mid P -value.Confidence intervals for association measuresFor a 2 2 table, the function prop.test provides the Wald confidence interval for thedifference of proportions, where one uses the option correct FALSE to suppress thecontinuity correction.For parameters comparing two binomial proportions such as the difference of proportions, relative risk, and odds ratio, a good general-purpose method for constructingconfidence intervals is to invert the score test. Ralph Scherer at the Institute for Biometry in Hannover, Germany, has prepared a package PropCIs on CRAN incorporatingmany of these confidence interval functions for proportions and comparisons of proportions. For example, Here, we illustrate for the example on aspirin and heart ---------------------------------- library(PropCIs) diffscoreci(189, 11034, 104, 11037, 0.95) # CI for difference of proportions95 percent confidence interval:# score CI0.004716821 0.010788501 riskscoreci(189, 11034, 104, 11037, 0.95) # score CI for relative risk95 percent confidence interval:1.433904 2.304713 orscoreci(189, 11034, 104, 11037, 0.95) # score CI for odds ratio95 percent confidence interval:1.440802 ------------------------------------7

Some of these functions are taken fromwww.stat.ufl.edu/ aa/cda/R/two-sample/R2/index.htmlwhere former students of mine prepared R functions for confidence intervals comparingtwo proportions with independent samples (difference of proportions, relative risk,odds ratio), and the sitewww.stat.ufl.edu/ aa/cda/R/matched/R2 matched/index.htmlwhich has R functions for confidence intervals comparing two proportions with dependent samples. Most of these were written by my former graduate student, Yongyi Min,who also prepared the Bayesian intervals mentioned below. Please quote this site ifyou use one of these R functions for confidence intervals for association parameters.We believe these functions are dependable, but no guarantees or support are available,so use them at your own risk. Note that the score CI for the difference of proportions isbased on the formula on p. 79 of the text based on the work of Mee (1984). The slightlydifferent score interval suggested by Miettinen and Nurminen (1985) incorporates abias correction factor in the variance of (n1 n2 )/(n1 n2 1) that can result in evenbetter coverage properties according to an article by Newcombe and Nurminen (2011,Communications in Statistics, 40: 1271–1282). Bernhard Klingenberg has written anR function for both of these score intervals, with the Miettinen and Nurminen as thedefault. For details, go to http://sites.williams.edu/bklingen. Here is Bernhard’scode, with examples:# from Bernhard Klingenberg# returns score test statistic, p-value and restricted MLEs for diffof prop p1 - p2:score.test - function(delta.null, y,n, alternative "two.sided",MN TRUE) {N sum(n)C sum(y)L3 NL2 (n[1] 2*n[2])*delta.null-N-CL1 (n[2]*delta.null-N-2*y[2])*delta.null CL0 y[2]*delta.null*(1-delta.null)c L2 3/(3*L3) 3 - L1*L2/(6*L3 2) L0/(2*L3)b ifelse(c 0,1,-1)*sqrt(L2 2/(3*L3) 2 - L1/(3*L3))d - min(c/b 3,1)d - max(d, -1)a (3.14159265358979 acos(d))/3p2 2*b*cos(a)-L2/(3*L3)p2 max(p2, 0)p2 min(p2,1)p1 p2 delta.nullp1 max(p1, 0)p1 min(p1,1)se0 - sqrt(p1*(1 - p1)/n[1] p2*(1 - p2)/n[2])if (!MN) z (y[1]/n[1]-y[2]/n[2]-delta.null)/se0 #Meeelse z (y[1]/n[1]-y[2]/n[2]-delta.null)/(se0*N/(N-1)) #Mietinnen andNurminenif (se0 0){z 08

}pvalue - switch(alternative,"two.sided" 1 - pchisq(z 2, df 1),"less" pnorm(z),"greater" pnorm(z, lower.tail FALSE))return(list(test.stat z, p.value pvalue, p1.null p1, p2.null p2))}#returns lower and upper score bounds for diff of prop p1 - p2:score.int - function(y,n,conflev 0.95, type "two.sided", MN TRUE) {if (type "two.sided") c qnorm(1-(1-conflev)/2) 2 elsec qnorm(conflev) 2delta1 (y[1] 1)/(n[1] 2) - (y[2] 1)/(n[2] 2) #starting pointif (any(type "lower",type "two.sided")) {delta2 -1while( abs(delta1-delta2) 10 (-6) ) {#Bisection for LBdelta (delta1 delta2)/2z score.test(delta,y,n, MN MN) test.stat 2if (z c) delta2 delta else delta1 delta}}delta.LB delta1delta1 (y[1] 1)/(n[1] 2) - (y[2] 1)/(n[2] 2) #starting pointif (any(type "upper",type "two.sided")) {delta2 1;while( abs(delta1-delta2) 10 (-6) ) {#Bisection for UBdelta (delta1 delta2)/2z score.test(delta,y,n, MN MN) test.stat 2if (z c) delta2 delta else delta1 delta}}delta.UB delta1return(switch(type,"lower" delta.LB,"upper" delta.UB,"two.sided" cbind(delta.LB,delta.UB)))}### Example: y - c(10,5) n - c(20,20) score.int(y,n)delta.LB delta.UB[1,] -0.05793719 0.5161737 score.int(y,n, MN FALSE)delta.LB delta.UB[1,] -0.05026568 0.51046699

score.test(delta.null 0,y,n) test.stat[1] 1.592168 p.value[1] 0.1113469 p1.null[1] 0.375 p2.null[1] 0.375 score.test(delta.null 0,y,n, MN FALSE) test.stat[1] 1.632993 p.value[1] 0.1024704 p1.null[1] 0.375 p2.null[1] 0.375Here is code to obtain the profile likelihood confidence interval for the odds ratiofor Table 3.1 on seat-belt use and traffic accidents (using the fact that the log oddsratio is the parameter in a simple logistic model): yes - c(54,25)n - c(10379,51815)x - c(1,0)fit - glm(yes/n x, weights n, family binomial(link logit))summary(fit)Coefficients:Estimate Std. Error z value Pr( z )(Intercept) -7.63610.2000 -38.17 2e-16 ***x2.38270.24219.84 2e-16 *** confint(fit)Waiting for profiling to be done.2.5 %97.5 %(Intercept) -8.055554 -7.268025x1.919634 2.873473 exp(1.919634); exp(2.873473)[1] 6.818462[1] 17.69838Fay (2010a) described an R package exact2x2 that constructs a small-sample confidence interval for the odds ratio by inverting the test using the P -value (mentionedin Section 16.6.1) that was suggested by Blaker (2000), which equals the minimumone-tail probability plus an attainable probability in the other tail that is as close aspossible to, but not greater than, that one-tailed probability. Seejournal.r-project.org/archive/2010-1/RJournal 2010-1 Fay.pdfand10

lFor example, for a 2 2 table called data, the command exact2x2(data, tsmethod “blaker”) provides an exact test using Blaker’s P -value and the confidence intervalbased on inverting that test.You can construct a small-sample confidence interval for the odds ratio based onthe exact conditional test with mid P -value using the epitools itools/epitools.pdfFor the data in the tea-tasting example, we ---------------------------- install.packages("epitools") library(epitools) ormidp.test(3, 1, 1, 3, or 1)one.sided two.sided1 0.1285714 0.2571429 or.midp(c(3,1,1,3), conf.level 0.95) conf.int[1]0.3100508 ----------------------------------This is also available in