- Mastering Scientific Computing with R
- Paul Gerrard Radia M. Johnson
- 357字
- 2025-02-26 02:13:15
Chapter 2. Statistical Methods with R
This chapter will present an overview of how to summarize your data and get useful statistical information for downstream analysis. We will also show you how to plot and get statistical information from probability distributions and how to test the fit of your sample distribution to well-defined probability distributions. We will also go over some of the functions used to perform hypothesis testing including the Student's t-test, Wilcoxon rank-sum test, z-test, chi-squared test, Fisher's exact test, and F-test.
Before we begin, we will load the gene expression profiling data from the E-GEOD-19577 study entitled "MLL partner genes confer distinct biological and clinical signatures of pediatric AML, an AIEOP study" from the ArrayExpress website to use as a sample dataset for some of our examples. For simplicity, we will not go into the details of how the data was generated, except to mention that the study evaluates the expression level of 54,675 probes in 42 leukemia patients' samples. If you would like to learn more about the study, please consult the experiment web page at http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-19577. Here are the steps we will follow to load the data into R:
- Download the R ExpressionSet (
E-GEOD-19577.eSet.r
). - Load the dataset with the
load()
function. This command will create thestudy
object, which contains the raw experimental data. - Rename the
study
object asMLLpartner.dataset
. - Load the
Biobase
andaffy
bioconductor packages. - Normalize the data with the
rma()
function. - Inspect the data.
These steps can be implemented in R, as follows:
> load(url("http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-19577/E-GEOD-19577.eSet.r")) > MLLpartner.ds <- study > library("affy") > library("Biobase") > AEsetnorm = rma(MLLpartner.ds) Background correcting Normalizing Calculating Expression > head(exprs(AEsetnorm)) #output shown truncated GSM487973 GSM487972 GSM487971 GSM487970 GSM487969 1007_s_at 4.372242 4.293080 4.210850 4.707231 4.345101 1053_at 8.176033 8.541016 8.338475 7.935021 7.868985 117_at 5.730343 8.723568 5.172717 5.404062 5.731468 121_at 7.744487 6.951332 7.202343 7.158402 6.959318 1255_g_at 2.707115 2.717625 2.699625 2.698669 2.701679 1294_at 9.077232 7.611238 9.649630 7.911132 9.732346
Now, let's get the expression values for two probes to be used in our examples, as follows:
> probeA <- as.numeric(exprs(AEsetnorm)[1,]) > probeA <- setNames(probeA, colnames(exprs(AEsetnorm))) > probeB <- as.numeric(exprs(AEsetnorm)[2,]) > probeB <-setNames(probeB, colnames(exprs(AEsetnorm)))
Now, let's create a matrix with all the expression values for each probe evaluated in the 42 patient samples, as follows:
> MLLpartner.mx <- as.matrix(exprs(AEsetnorm)) > #Lets save the object to our session > dump("MLLpartner.mx", "MLLpartner.R") > class(MLLpartner.mx) [1] "matrix" > dim(MLLpartner.mx) [1] 54675 42