Brought to you by molecularsciences.org.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License.
This publication may not be redistributed without this notice.

Statistics with R

Descriptive Statistics
Descriptive statistics is used to summarize a collection of data in a clear and understandable manner. Measurements from an experiment can be summarized numerically or graphically. The numerical approach computes quantities such as the mean and the standard deviation. The graphical approach uses displays such as box plots and stem-and-leaf displays. The numerical approach is generally more objective and precise, while the graphical approach is more useful for identifying patterns in the data.

Descriptive statistics is a way of looking at the data prior to formal analysis.

Inferential Statistics
Inferential statistics is used to draw inferences about a population from a sample. Statistical inferences can be made either by estimation or by hypothesis testing. In estimation, the sample is used to estimate a population parameter together with a confidence interval for that estimate. In hypothesis testing, we are interested in finding whether we can reject a null hypothesis.
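
As a minimal sketch of both approaches, the following uses R's built-in t.test() function on a small made-up sample. The times and the hypothesized mean of 12 seconds are assumptions chosen only for illustration.

```r
# Hypothetical sample: 100 m times (seconds) for 8 runners
times <- c(11.2, 12.5, 11.8, 13.1, 12.0, 11.5, 12.9, 12.3)

# One-sample t-test of the null hypothesis that the population mean is 12
result <- t.test(times, mu = 12)

# Estimation: point estimate and 95% confidence interval for the mean
result$estimate
result$conf.int

# Hypothesis testing: reject the null only if the p-value is small (e.g. < 0.05)
result$p.value
```

With this particular sample the p-value is well above 0.05, so the null hypothesis would not be rejected.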

Variable
Variables are characteristics or attributes which can vary across different individuals. Examples include age, height, and gender.

Experimental unit
Experimental units are the objects or individuals on which a variable is measured. Suppose we measure the time it took a group of runners to run 100 meters. The runners are the experimental units, and the variable is the time it took each of them to complete the 100 meter run.

Univariate, bivariate, and multivariate data
In the runners example, we are only measuring the time. This is an example of univariate data, since a single variable is measured on each experimental unit. If we measure the time and height of each runner, the data is bivariate, since we are measuring two variables per experimental unit. Multivariate data has more than two variables per experimental unit.

Population and sample
A population is the set of all measurements of an entire group. A sample is a subset of these measurements.

Categorical or qualitative:
These variables are measured on a nominal scale (category names with no ordering, e.g. black bear, polar bear, grizzly bear, etc.) or on an ordinal scale (ordered categories, e.g. good, better, best). Favorite football player, favorite singer, and favorite color are other examples of qualitative variables. Categorical variables are summarized by frequency.

Numerical or quantitative:
These variables are measured on an interval or ratio scale. They are summarized by center and spread. Numerical variables can be either discrete (exact numbers) or continuous (any value in a range). Speed and the count of something are quantitative variables. A variable of either kind can be independent or dependent.
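
A minimal sketch of how the two kinds of variable might be stored and summarized in R. The bear types and speeds below are made-up values for illustration only.

```r
# Hypothetical data: one categorical and one numerical variable
bear  <- factor(c("black", "polar", "grizzly", "black", "black", "polar"))
speed <- c(40.2, 35.5, 48.1, 42.0, 38.7, 36.4)

# A categorical variable is summarized by the frequency of each category
table(bear)

# A numerical variable is summarized by its center and spread
mean(speed)
sd(speed)
```

Storing a categorical variable as a factor tells R to treat its values as category labels rather than as text or numbers.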

Describing data with numerical measures

A graph is a visual representation of data which facilitates interpretation. However, it is difficult to convey a visual graph to someone else verbally. It is also difficult to make precise statistical inferences from graphs e.g. how do you measure the difference between two graphs?

Numerical measures can be used to solve these problems. They are a set of numbers that convey the meaning of the graph. Numerical measures of a population are called parameters, and numerical measures of a sample are called statistics.

Numerical measures can be used to measure the center, variance from the center, and relative standing of a subset of the distribution. Measures of center find the center of the data. The variance shows how data deviates from or varies with respect to the center. Relative standing shows how different sections of the data set relate to one another.
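
The three kinds of measure can be sketched with base R functions as follows. The vector x is a made-up sample used only for illustration.

```r
# Hypothetical sample of measurements
x <- c(12, 15, 11, 19, 14, 13, 22, 15)

# Measures of center
mean(x)
median(x)

# Measures of variation about the center
var(x)     # sample variance
sd(x)      # sample standard deviation
range(x)   # smallest and largest value

# Measures of relative standing
quantile(x, c(0.25, 0.5, 0.75))   # quartiles
(x - mean(x)) / sd(x)             # z-scores
```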

R Quickstart

R can be downloaded from http://cran.r-project.org. On Ubuntu Linux, type sudo apt-get install r-recommended to install R. Then type R on the command line to start R.

> 2+2
[1] 4
> 3-5
[1] -2
> 3*2
[1] 6
> 7/3
[1] 2.333333
> 8^2
[1] 64
> pi
[1] 3.141593

Storing in variables

> radius <- 24
> area <- pi*radius^2
> area
[1] 1809.557

Vectors

> math <- c(60,90,34)
> science <- c(56,98,76)
> english <- c(34,98,22)
> avg_grades <- (math + science + english) / 3
> avg_grades
[1] 50.00000 95.33333 44.00000

Graphical summaries
- For a single categorical variable, we use bar plots and dot plots
- For a single numerical variable, we use histograms and boxplots
- For two numerical variables, we use scatterplots
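
Each of these summaries has a base R plotting function. The sketch below draws them in one window; the three vectors are made-up data for illustration only.

```r
# Hypothetical data for illustration
color  <- c("red", "blue", "blue", "green", "red", "blue")
height <- c(172, 168, 181, 175, 190, 165, 178, 183)
weight <- c(68, 61, 80, 72, 95, 58, 77, 84)

par(mfrow = c(2, 2))      # arrange the plots in a 2 x 2 grid
barplot(table(color))     # bar plot for a single categorical variable
hist(height)              # histogram for a single numerical variable
boxplot(height)           # boxplot for a single numerical variable
plot(height, weight)      # scatterplot for two numerical variables
par(mfrow = c(1, 1))      # restore the single-plot layout
```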

Histogram
A histogram is a special kind of bar plot. It is used for visualizing the distribution of values of a numerical variable. When drawn with a density scale, the area of each bar is the proportion of observations in the interval. Height represents density where the total area is 100%.

Type the following for help on histogram

> ?hist

Example

> par(mfrow=c(2,2))
> simdata <- rchisq(100,8)
> hist(simdata)
> hist(simdata,breaks=2)
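
To check the statement about the density scale, the sketch below draws the histogram with freq = FALSE and adds up the bar areas. The seed is an arbitrary choice made only so that the example is reproducible.

```r
set.seed(1)                  # arbitrary seed, for reproducibility
simdata <- rchisq(100, 8)    # 100 draws from a chi-squared distribution, 8 df

# freq = FALSE draws the histogram on a density scale
h <- hist(simdata, freq = FALSE)

# Area of each bar = density * bin width
areas <- h$density * diff(h$breaks)
sum(areas)
```

The sum should equal 1 (that is, 100%) up to rounding error, whatever the bin widths are.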

The mean is an appropriate summary for distributions that are fairly symmetrical.

> mean(math)

The median is the middlemost number. Half of the values are greater than the median and the other half are smaller. The median is usually a more appropriate summary for skewed distributions.
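
The difference between the two summaries shows up clearly on skewed data. In this sketch, income is a made-up vector with one extreme value.

```r
# Hypothetical right-skewed data: one very large value
income <- c(20, 22, 25, 27, 30, 31, 250)

mean(income)     # pulled upward by the extreme value
median(income)   # stays near the bulk of the data
```

Here the median is 27, while the mean is more than twice as large because of the single value of 250.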

R Language Documentation

R organizes documentation in terms of packages and functions. Documentation comes with each package you download and install.

R Package Documentation
All R functions, datasets, and other objects are stored in packages. A package's contents, including its documentation, are available only after it is loaded. To see a list of installed R packages, type:

> library()

This should produce a list such as the following:

Packages in library 'C:/PROGRA~1/R/rw2011/library':

base        The R Base Package
boot        Bootstrap R (S-Plus) Functions (Canty)
class       Functions for Classification
cluster     Cluster Analysis Extended Rousseeuw et al.
datasets    The R Datasets Package
graphics    The R Graphics Package
...

To load a package, use library("package_name"):

> library(graphics)

To view the documentation of a package, we use library(help="package_name") or help(package="package_name"):

> library(help=graphics)

or

> help(package=graphics)

This should produce a list such as the following:

                Information on package 'graphics'

Description:

Package:       graphics
Version:       2.1.1
Priority:      base
Title:         The R Graphics Package
Maintainer:    R Core Team
Description:   R functions for base graphics
Depends:       grDevices
License:       GPL Version 2 or later.
Built:         R 2.1.1; ; 2005-06-21 08:25:36; windows

Index:

abline                  Add a Straight Line to a Plot
arrows                  Add Arrows to a Plot
assocplot               Association Plots
...

R Function Documentation
The package documentation lists all the functions of a package along with one-line descriptions of the functions. More detailed documentation is also available for each function. To access it, use help(function_name):

> help(arrows)

Navigable Help System
An HTML-based help system is also available. To use it, type the following command:

> help.start()

The help system is very user-friendly and contains all the information you can access with the help command and more.
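
When the exact name of a function is not known, two other base R functions can help narrow the search. This is a small sketch of one possible workflow, not a complete tour of the help system.

```r
# List objects in the attached packages whose names contain "hist"
matches <- apropos("hist")
matches

# Show the argument list of a function without opening its help page
args(hist)
```

Once the name is known, help(hist) or ?hist opens the full documentation as described above.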

Sensitivity, Specificity and Predictive Value

Diagnostic tests are rarely perfect or fully conclusive. Therefore, we need to be able to interpret the probability of obtaining different results and to calculate their predictive values. In medical diagnostic tests, lab technicians or testing devices look for the presence, absence, or abnormal quantities of specific substances (molecules). Suppose we wish to analyze a new device which detects the AIDS virus. There are 4 possible results:

                 Disease Present        Disease Absent
Test Positive    True Positive (TP)     False Positive (FP)
Test Negative    False Negative (FN)    True Negative (TN)

Sensitivity: is the proportion of persons infected with the AIDS virus who test positive.
P(T+|D+) = TP / (TP + FN)

Specificity: is the proportion of patients without disease who test negative.
P(T-|D-) = TN / (TN + FP)

Sensitivity and specificity describe how well a test discriminates between persons with and without the disease. Given that someone tests positive for AIDS with this new device, what is the probability that he or she actually has the disease? In other words, what is the predictive value of this test?

Predictive value of a positive test: is the proportion of persons who test positive who actually have the disease.
P(D+|T+) = TP / (TP + FP)

Predictive value of a negative test: is the proportion of persons who test negative who actually do not have the disease.
P(D-|T-) = TN / (TN + FN)
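
The four formulas above can be evaluated directly in R. The counts below are made-up numbers for a hypothetical evaluation of the device on 1000 people; they are not real data.

```r
# Hypothetical counts from evaluating a diagnostic test on 1000 people
TP <- 90    # disease present, test positive
FN <- 10    # disease present, test negative
FP <- 45    # disease absent, test positive
TN <- 855   # disease absent, test negative

sensitivity <- TP / (TP + FN)   # P(T+|D+)
specificity <- TN / (TN + FP)   # P(T-|D-)
ppv         <- TP / (TP + FP)   # P(D+|T+)
npv         <- TN / (TN + FN)   # P(D-|T-)

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv), 3)
```

With these made-up counts the test is 90% sensitive and 95% specific, yet only two thirds of the positive results are true positives, because the disease is present in only 100 of the 1000 people.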

Plotting in R

R has very powerful plotting functions. Unfortunately, they are not always intuitive and can be quite confusing for beginners. To start, here is a very simple plot.

a <- c(60,164,164,100,62,44,26,174,106,146,102,50,152,86,166)
plot(a)


The numeric values are stored in a vector a. Calling plot(a) then draws the graph using R's default settings. Now we add labels to the graph.

plot(a, ylab="Number of Nodes", xlab="Spectra", main="Window Filter Analysis")


Pretty straightforward, isn't it? Now let's plot a line instead of points on the graph.

plot(a, ylab="Number of Nodes", xlab="Spectra", main="Window Filter Analysis", type="l")


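To draw a line, all we need to do is specify type="l". Other options include "p" (points, the default), "b" (both points and lines), "c" (the lines part of "b" alone), "o" (points and lines overplotted), "h" (histogram-like vertical lines), and "s" (stair steps).

[Figure: the same data plotted with type="b", type="c", type="h", type="o", type="p", and type="s"]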
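
The six plot types can be drawn side by side with a short loop. This is only a sketch; the 3 x 2 layout is an arbitrary choice.

```r
a <- c(60,164,164,100,62,44,26,174,106,146,102,50,152,86,166)

types <- c("b", "c", "h", "o", "p", "s")
par(mfrow = c(3, 2))                   # 3 rows x 2 columns of plots
for (tp in types) {
  # the title of each panel shows the type used to draw it
  plot(a, type = tp, main = paste0('type="', tp, '"'))
}
par(mfrow = c(1, 1))                   # restore the single-plot layout
```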

Plotting multiple data sets on a graph

It is common to plot multiple data sets on one graph. Such graphs often give useful insight into how the data sets differ from or resemble each other. Unfortunately, calling plot() a second time starts a new graph instead of adding to the existing one. We therefore add the second data set with the lines() function. The points() function can be used in the same way.

a <- c(60,164,164,100,62,44,26,174,106,146,102,50,152,86,166)
b <- c(26,44,50,62,86,106,60,102,100,146,166,174,164,152,164)
plot(a, ylab="Number of Nodes", xlab="Spectra", main="Window Filter Analysis", type="o")
lines(b,col="red")
points(b,col="red")


The plot() function draws the points and the line for a. The lines() function adds a red line for b, and the points() function adds the red points for b.
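
One thing to watch for is that plot() chooses the axis limits from the first data set only, so a second data set with a wider range would be cut off. The sketch below sets ylim from both vectors and adds a legend; the legend labels "a" and "b" are placeholders chosen for illustration.

```r
a <- c(60,164,164,100,62,44,26,174,106,146,102,50,152,86,166)
b <- c(26,44,50,62,86,106,60,102,100,146,166,174,164,152,164)

# Set the y-axis limits from both series so that neither is clipped
plot(a, type = "o", ylim = range(a, b),
     ylab = "Number of Nodes", xlab = "Spectra")

# type = "o" makes lines() draw the line and the points in one call
lines(b, type = "o", col = "red")

# Placeholder series names, for illustration only
legend("topleft", legend = c("a", "b"), col = c("black", "red"),
       lty = 1, pch = 1)
```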

For more information on these functions:

help("plot")
help("points")
help("lines")

Creating ROC curves using R

INTRODUCTION
ROC (receiver operating characteristic) analysis is a technique for visualizing, organizing, and selecting classifiers on the basis of their performance. A ROC curve plots the true positive rate (sensitivity) on the y-axis against the false positive rate (1 - specificity) on the x-axis. Two features of the curve matter most: the area under the curve and the shape of the curve. The more the curve rises toward the upper left corner, the larger the area under the curve and the better the test performance (the true positive rate is high while the false positive rate is low). The area under the curve takes values between 0 and 1. A curve that stays close to the diagonal running from the lower left corner to the upper right corner has an area near 0.5, which means the test performs little better than chance, and a curve that falls below the diagonal means the test performance is bad (the false positive rate is high relative to the true positive rate). [1-3]
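
To make the construction concrete, the following base R sketch computes the true and false positive rates at each cutoff and the area under the curve by the trapezoidal rule. The scores and labels are made-up values, and this hand calculation is only an illustration, not the method used by the packages discussed in the following sections.

```r
# Hypothetical classifier scores and true labels (1 = positive, 0 = negative)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
labels <- c(1,   1,   0,   1,   1,    0,   0,   1,   0,   0)

# Use each score as a cutoff, from the strictest to the most lenient
cutoffs <- c(Inf, sort(unique(scores), decreasing = TRUE))
tpr <- sapply(cutoffs, function(k) sum(scores >= k & labels == 1) / sum(labels == 1))
fpr <- sapply(cutoffs, function(k) sum(scores >= k & labels == 0) / sum(labels == 0))

# ROC curve: false positive rate on x, true positive rate on y
plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate")
abline(0, 1, lty = 2)    # diagonal reference line (area 0.5)

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

For these made-up values the area is 0.8, which equals the fraction of positive and negative pairs in which the positive case received the higher score (20 of 25).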

MATERIAL AND METHOD
The code for the ROC curve module, which is written in the R language, was taken from a website [4]. RGui was used as the interface for running the code. First, the ROCR package and the verification package, which are needed to run the code, were installed.
To install the packages, an internet connection was enabled and the command install.packages("ROCR") was run in the RGui interface, after which the package was installed automatically. The same procedure was followed for the verification package.

ROC CURVE MODULE IN R

# allow two different plots in the same frame
par(mfrow = c(1, 2))

# plot a ROC curve for a single prediction run
# and color the curve according to cutoff
library(ROCR)
data(ROCR.simple)
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)

# plot a ROC curve for a single prediction run
# with CI by bootstrapping and fitted curve
library(verification)
roc.plot(ROCR.simple$labels, ROCR.simple$predictions,
         xlab = "False positive rate", ylab = "True positive rate",
         main = NULL, CI = TRUE, n.boot = 100, plot = "both", binormal = TRUE)

# compute and print the area under the curve
(auc <- as.numeric(performance(pred, measure = "auc", x.measure = "cutoff")@y.values))
[4]

RESULTS


DISCUSSION

The arguments and functions used in the code are as follows:

par: A function used to access and modify graphical parameters
mfrow: A graphical parameter used in the par function; it arranges multiple figures on the page, filled row by row
library(ROCR): Command for loading the ROCR package
ROCR.simple: Data set containing a set of predictions and the corresponding class labels
prediction: A function that converts the input data into a standardized format
pred <- prediction(...): The result of the prediction function is assigned to the object pred
ROCR.simple$predictions: Vector of numerical predictions
ROCR.simple$labels: Vector of true class labels
performance: A function for predictor evaluation
"tpr", "fpr": Performance measures (true positive rate and false positive rate)
colorize = TRUE: The curve is colorized according to the cutoffs
plot: A function for plotting R objects. In roc.plot, the plot argument selects which curves to draw: "emp" (default), "binorm", or "both"
roc.plot: A function that plots a ROC curve for a given model
xlab, ylab: Labels for the x and y axes, here the false positive rate and the true positive rate
main: Title for the plot
CI: Confidence intervals
n.boot: Number of bootstrap samples
binormal: If set to TRUE, the binormal ROC curve is calculated
auc: Area under the curve
as.numeric: Generic function that attempts to convert its argument to type "double"
measure = "auc": auc is a performance measure used for evaluation
x.measure = "cutoff": A second performance measure. When it differs from the default, x.measure is taken as the unit on the x-axis and measure as the unit on the y-axis, giving a two-dimensional curve that is parameterized by the cutoff
y.values: A list in which each entry contains the y values of the curve

APPLICATIONS OF ROC CURVE


REFERENCES
[1] ROC Curve (OriginPro only). OriginLab. Available at: http://www.originlab.com/www/helponline/Origin8/en/origin.htm#stats/roc_... (accessed on August 20, 2009)
[2] SIB-ROC. Available at: www.toodoc.com/ROC-curve-ppt.html (accessed on August 20, 2009)
[3] ROC Curves & Wilcoxon and Mann-Whitney Tests. Available at: www.utstat.toronto.edu/.../ROC%20Curves%20&%20Wilcoxon%20and%20Mann-Whit... (accessed on August 20, 2009)
[4] ROC curve in ROCR and Verification package. One R tip a day. Available at: http://onertipaday.blogspot.com/2007/08/receiver-operating-characteristi... (accessed on August 20, 2009)
[5] ROC curve. OriginLab. Available at: http://www.originlab.com/index.aspx?s=8&lm=322&pid=1076 (accessed on August 20, 2009)
[6] Receiver operating characteristic. Wikipedia, The Free Encyclopedia. Available at: http://en.wikipedia.org/wiki/Receiver_operating_characteristic (accessed on August 20, 2009)

Contributed by: Shahina Hayat