R Usage Notes


Reading and working with the exercise datasets

R datasets

If you have installed the Devore7 package, all of the data the author has provided is available in R.

However, it's not all that obvious how to access it, especially if you are new to R.

There are two documents on the companion website that illustrate how to use these (unfortunately, they both have the same filename, Devore7.pdf).

The first is written by Douglas Bates at the University of Wisconsin-Madison, and is titled Using the Devore7 package with R. This document contains "vignettes" that give a brief overview of the features of R that you need to learn, and instructions on how to duplicate the examples in the text. The apparent naming convention is that datasets containing examples from chapter 1 begin with "xmp01" followed by a period, followed by a two-digit sequence number.

The second is a 306-page PDF document that contains a directory of the datasets associated with the examples and with the exercises at the end of each chapter. The information provided for each dataset is similar to what the str (structure) function produces when applied to one of the dataset objects. The apparent naming convention here is that exercises from chapter 1 begin with "ex01" followed by a period, followed by the exercise number. Unlike the first document, it provides very little in the way of explanation and examples.
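The naming conventions described above can be sketched in a few lines of R. The helper name dataset_name is hypothetical, and the sketch assumes that exercise numbers are zero-padded to two digits in the same way as the example sequence numbers, which the documents do not state explicitly. The listing step only runs if the Devore7 package is installed.

```r
# Hypothetical helper: build a dataset name from the apparent naming convention.
# Assumption: both chapter and item numbers are zero-padded to two digits.
dataset_name <- function(chapter, number, prefix = "ex") {
  sprintf("%s%02d.%02d", prefix, chapter, number)
}

ex_name  <- dataset_name(1, 23)          # exercise 23 of chapter 1
xmp_name <- dataset_name(1, 5, "xmp")    # fifth example of chapter 1
print(c(ex_name, xmp_name))

# If the package is installed, list the chapter 1 exercise datasets it provides.
if (requireNamespace("Devore7", quietly = TRUE)) {
  items <- data(package = "Devore7")$results[, "Item"]
  print(grep("^ex01", items, value = TRUE))
}
```

Comparing the generated names against the actual listing is a quick way to check whether the assumed padding holds for low-numbered exercises.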


Assignment 1 Help

To get to the data for the exercises in assignment 1, start R and at the command prompt enter:

require(Devore7)

You can then display the structure of an exercise dataset by passing its name to the str function.

The structure (str) function will give you the name and type of the variable(s) in the exercise data, which is usually stored in a data.frame object.

Alternatively, you can look up the variable names in the Devore7.pdf document.
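As a minimal sketch of what str reports, the following builds a small made-up data frame and inspects it. The name demo and its values are invented for illustration and do not come from the Devore7 package; only the column name C1 mirrors the exercise data.

```r
# Hypothetical stand-in for an exercise dataset; the values are invented.
demo <- data.frame(C1 = c(5.2, 6.1, 4.8, 7.3, 5.9))

str(demo)                 # prints the class, dimensions, and each variable's type
var_names <- names(demo)  # just the variable names, as a character vector
print(var_names)
```

With a real exercise dataset, you would replace demo with the dataset name.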

Once you know the variable name, you can invoke the stem or hist functions to get stem and leaf plots or histograms with the following syntax. Suppose we have run str on the data for exercise 23,

str(ex01.23)

and discovered that the variable name is "C1". Now we can produce a stem-and-leaf plot (taking all defaults) with the command:

with(ex01.23,stem(C1))

You can experiment with the "scale" parameter to vary the output a bit:

with(ex01.23,stem(C1,scale=2))

To discover the parameters that are available with "stem", access the help for this function by typing:

?stem
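If the Devore7 package is not at hand, the same commands can be tried on made-up data. In this sketch the data frame demo and its values are invented; stem and its scale argument are the base R function discussed above.

```r
# Invented values for illustration only; not from the Devore7 package.
demo <- data.frame(C1 = c(12, 15, 21, 23, 23, 27, 30, 34, 38, 41, 45, 52))

with(demo, stem(C1))             # all defaults
with(demo, stem(C1, scale = 2))  # a longer plot with more stems

# args() is a quick alternative to ?stem for listing arguments and defaults.
print(args(stem))
stem_args <- names(formals(stem))
```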

The command to produce a histogram of variable C1 in data.frame ex01.23 is:

with(ex01.23,hist(C1))

As before, you can discover the optional parameters for "hist" and their defaults by typing

?hist
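Besides drawing the plot, hist returns an object describing the bins, which is useful for checking what was actually plotted. The sketch below uses an invented data frame named demo (not part of Devore7) and sets two of the optional parameters, main and xlab.

```r
# Invented values for illustration only; not from the Devore7 package.
demo <- data.frame(C1 = c(12, 15, 21, 23, 23, 27, 30, 34, 38, 41, 45, 52))

# Draw the histogram with a custom title and axis label, and keep the result.
h <- with(demo, hist(C1, main = "Histogram of C1 (made-up data)", xlab = "C1"))

print(h$breaks)  # the bar boundaries R chose
print(h$counts)  # the number of observations in each bar
```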

It is important to note that for some exercises, you may not be able to produce exactly the output requested in the problem. It is possible that none of the three software packages we have available (R, Minitab, SPSS) will produce exactly what you want. For simple exercises like stem and leaf plots for small datasets, you can produce the desired output by hand if necessary.

While it may be annoying that the problem asks for something the software may not provide, this is exactly what you often encounter in a real-world situation: you have an idea of what kind of chart or output you would like, and you have to figure out a way to get it with the software you have available, often without knowing for certain whether the software is capable of producing what you want. A certain amount of trial and error (and the accompanying frustration) is to be expected.


Histograms with Specified Breaks

You can specify the boundaries of the bars in a histogram with the breaks parameter of hist.

The breaks can be specified as a numeric vector. For example, suppose in exercise 24 of chapter 1, we want breaks starting at 4000 with 200 units between each break. We have to determine the highest value for the vector, which depends on the maximum value of the variable C1. To find this, enter:

max(ex01.24$C1)

which should return 5828, so the last entry in the vector should be 6000. The command is now:

hist(ex01.24$C1,breaks=c(4000,4200,4400,4600,4800,5000,5200,5400,5600,5800,6000))
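Typing out every break by hand is error-prone, so the break vector can also be generated. The following sketch rounds the maximum up to the next multiple of the bar width and uses seq to build the breaks. The data vector x is invented; only its maximum, 5828, is taken from the text above.

```r
# Invented values standing in for the exercise data; only the maximum (5828)
# comes from the text.
x <- c(4071, 4390, 4615, 4890, 5002, 5133, 5345, 5561, 5828)

width <- 200
lower <- 4000
upper <- ceiling(max(x) / width) * width  # round the maximum up to a multiple of 200
brks  <- seq(lower, upper, by = width)
print(brks)

# plot = FALSE computes the bins without drawing anything.
h <- hist(x, breaks = brks, plot = FALSE)
print(h$counts)
```

The breaks must cover the whole range of the data, otherwise hist stops with an error.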


Relative Frequency Histograms

If the histogram bars all have width 1, you can produce a relative frequency histogram by simply coding freq=FALSE. For example,

hist(ex01.17$C1,freq=FALSE)

will produce a relative frequency histogram for problem 17 of chapter 1.

If the bars have a width other than one, things are more complicated. Coding freq=FALSE will produce a density histogram, where the height of each bar is its relative frequency divided by its width, so that the areas of the bars sum to 1.
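The relationship between density and relative frequency can be checked directly on the object that hist returns. This sketch uses invented data and deliberately unequal bar widths; the variable names are illustrative only.

```r
# Invented data (n = 10) with bars of unequal width.
x    <- c(1, 2, 2, 3, 3, 3, 4, 6, 7, 9)
brks <- c(0, 2, 4, 10)

h <- hist(x, breaks = brks, plot = FALSE)

widths   <- diff(h$breaks)            # width of each bar
rel_freq <- h$counts / sum(h$counts)  # proportion of observations in each bar

# Density is relative frequency divided by bar width ...
print(all.equal(h$density, rel_freq / widths))
# ... so the bar areas (density times width) add up to 1.
total_area <- sum(h$density * widths)
print(total_area)
```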

If you want relative frequencies, you can write the histogram to an R object and rescale the density as described in a response on an R help forum:

https://stat.ethz.ch/pipermail/r-help/2001-October/015845.html

To write a histogram of the problem 24 data with specified breaks to an R object called t, the syntax is:

t<-hist(ex01.24$C1,breaks=c(4000,4200,4400,4600,4800,5000,5200,5400,5600,5800,6000),plot=FALSE)

This stores the information needed to draw the histogram (breaks, counts, densities, and midpoints) in the object t without plotting anything. You can enter

str(t)

to find the structure of this object, which produces the following output:

List of 7
 $ breaks     : num [1:11] 4000 4200 4400 4600 4800 5000 5200 5400 5600 5800 ...
 $ counts     : int [1:10] 1 2 9 12 19 22 20 7 7 1
 $ intensities: num [1:10] 0.00005 0.0001 0.00045 0.0006 0.00095 ...
 $ density    : num [1:10] 0.00005 0.0001 0.00045 0.0006 0.00095 ...
 $ mids       : num [1:10] 4100 4300 4500 4700 4900 5100 5300 5500 5700 5900
 $ xname      : chr "ex01.24$C1"
 $ equidist   : logi TRUE
 - attr(*, "class")= chr "histogram"

The idea is to multiply each of the entries in t$density by the bar width, 200. This converts the density values into relative frequencies.

Because R arithmetic is vectorized, no loop is needed. The R code for this is as follows:

t$density<-t$density*200

After this command is executed, str(t) produces the following output:

List of 7
 $ breaks     : num [1:11] 4000 4200 4400 4600 4800 5000 5200 5400 5600 5800 ...
 $ counts     : int [1:10] 1 2 9 12 19 22 20 7 7 1
 $ intensities: num [1:10] 0.00005 0.0001 0.00045 0.0006 0.00095 ...
 $ density    : num [1:10] 0.01 0.02 0.09 0.12 0.19 ...
 $ mids       : num [1:10] 4100 4300 4500 4700 4900 5100 5300 5500 5700 5900
 $ xname      : chr "ex01.24$C1"
 $ equidist   : logi TRUE
 - attr(*, "class")= chr "histogram"
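The arithmetic can be verified from the counts printed above without the package. In this sketch the counts are copied from the str(t) output; the variable names are illustrative only.

```r
# Counts copied from the str(t) output above; they sum to 100 observations.
counts <- c(1, 2, 9, 12, 19, 22, 20, 7, 7, 1)
width  <- 200

density  <- counts / (sum(counts) * width)  # what hist stores as density
rel_freq <- density * width                 # the rescaling step

print(density[1:5])   # matches the density values before rescaling
print(rel_freq[1:5])  # matches the density values after rescaling

# Multiplying by the width is the same as dividing the counts by their total.
print(all.equal(rel_freq, counts / sum(counts)))
```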

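Now we can produce a relative frequency histogram with the plot command. Because the vertical axis would otherwise still be labeled "Density", we also supply a label with ylab:

plot(t,freq=FALSE,ylab="Relative Frequency")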
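The steps above can be collected into one small function. This is a sketch rather than a standard tool: rel_freq_hist is a hypothetical helper that is not part of base R or Devore7, and the data vector x is invented. It computes the relative frequencies from the counts directly, which gives the same result as multiplying the density by the bar width when all bars have the same width.

```r
# Hypothetical helper: draw a relative frequency histogram for given breaks.
rel_freq_hist <- function(x, breaks, ...) {
  h <- hist(x, breaks = breaks, plot = FALSE)  # compute bins without plotting
  h$density <- h$counts / sum(h$counts)        # overwrite density with relative frequency
  plot(h, freq = FALSE, ylab = "Relative Frequency", ...)
  invisible(h)                                 # return the modified object quietly
}

# Invented values between 4000 and 6000, for illustration only.
x <- c(4150, 4420, 4580, 4610, 4790, 4830, 4950, 5010,
       5120, 5180, 5260, 5330, 5470, 5640, 5828)

h <- rel_freq_hist(x, breaks = seq(4000, 6000, by = 200),
                   main = "Relative frequency histogram (made-up data)")
print(sum(h$density))  # bar heights now sum to 1
```

With bars of unequal width, the heights would still be relative frequencies, but the bar areas would no longer be proportional to them, so a density histogram is usually the better choice in that case.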