R on the ACCRE Cluster

From ACCRE Wiki

More information: R website

R is a widely used statistical analysis environment and programming language. Many versions of R are available to use on the cluster. Users typically first develop code interactively on their laptop/desktop, and then run batch processing jobs on the ACCRE cluster through the SLURM job scheduler.

Versions of R on the ACCRE Cluster

R can be added to your environment using Lmod. We encourage users to use the most recent version installed. To see a list of installed versions simply type:

[bob@vmps11 ~]$ module spider R

To see details about how to load a specific version you can then run the same command but with version information included in the package name:

[bob@vmps11 ~]$ module spider R/3.3.3-X11-20160819

The output from this command will give you information about the dependencies that first need to be loaded in order to add R to your environment. For example:

[bob@vmps11 ~]$ module load GCC OpenMPI R

Here, we are loading R version 3.3.3, built with the GCC compiler and the OpenMPI library. We will periodically install new versions of R, at which point the default version of R will change, so you may want to hard-code the version of R into your module load command (e.g. module load GCC/5.4.0-2.26 OpenMPI/1.10.3 R/3.3.3-X11-20160819) to avoid picking up a new version of R when you don’t want it. Since the full command is a handful to type, you may wish to define a shortcut using the alias command when working interactively; inside a SLURM script or bash script, write the command out in full.

Our current R installation comes with a large number of popular scientific and high-performance computing packages preinstalled (e.g. ggplot2, snow, doParallel, foreach, Rmpi). Even more packages are available in the R Bioconductor bundle, which is also available via Lmod:

[bob@vmps11 ~]$ module load R-bundle-Bioconductor
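As a sketch of the alias shortcut mentioned above, a line like the following could be added to ~/.bashrc. The alias name loadR is a hypothetical choice, and the module versions are simply the ones quoted earlier on this page; adjust them to whatever module spider reports.

```shell
# Hypothetical alias name; versions are the ones quoted earlier on this page.
alias loadR='module load GCC/5.4.0-2.26 OpenMPI/1.10.3 R/3.3.3-X11-20160819'
```

Typing loadR at an interactive prompt then loads the pinned toolchain and R version. Bash does not expand aliases in non-interactive scripts by default, which is one reason to spell out the full command in SLURM scripts.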

Checking Installed Packages

One simple way to list the installed packages is to call library() at the R command prompt. For example:

[bob@vmps11 ~]$ R -e 'library()'

R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library()
Packages in library ‘/gpfs22/easybuild/centos6/software/MPI/GCC/
                      5.4.0-2.26/OpenMPI/1.10.3/R-bundle-Bioconductor/
                      3.3-R-3.3.3’:

affy                    Methods for Affymetrix Oligonucleotide Arrays
affycoretools           Functions useful for those doing repetitive
                        analyses with Affymetrix GeneChips
affyio                  Tools for parsing Affymetrix data files
AgiMicroRna             Processing and Differential Expression Analysis
                        of Agilent microRNA chips
ALDEx2                  Analysis of differential abundance taking
                        sample variation into account
annaffy                 Annotation tools for Affymetrix biological
                        metadata
annotate                Annotation for microarrays
AnnotationDbi           Annotation Database Interface
AnnotationForge         Code for Building Annotation Database Packages
AnnotationHub           Client to access AnnotationHub resources
baySeq                  Empirical Bayesian analysis of patterns of
                        differential expression in count data
Biobase                 Biobase: Base functions for Bioconductor
BiocGenerics            S4 generic functions for Bioconductor
BiocInstaller           Install/Update Bioconductor, CRAN, and github
                        Packages
BiocParallel            Bioconductor facilities for parallel evaluation
biomaRt                 Interface to BioMart databases (e.g. Ensembl,
                        COSMIC ,Wormbase and Gramene)
biomformat              An interface package for the BIOM file format
Biostrings              String objects representing biological
                        sequences, and matching algorithms
biovizBase              Basic graphic utilities for visualization of
                        genomic data.
BSgenome                Infrastructure for Biostrings-based genome data
                        packages and support for efficient SNP
                        representation
BSgenome.Hsapiens.UCSC.hg19
                        Full genome sequences for Homo sapiens (UCSC
                        version hg19)
bumphunter              Bump Hunter
.
.
.

Note that the above output has been truncated for brevity. If you were to run this command you would see additional information about installed packages: the path to the package, version, dependencies, license information, and a few other details. To load a package into an R session, simply type library("package_name"). For example, to load the parallel package, type:

library("parallel")
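To check for a single package without scanning the full listing, a small shell helper along these lines may be convenient. The function name has_r_pkg is hypothetical; it relies only on Rscript and the base R function requireNamespace(), and it assumes R has already been loaded with module load.

```shell
# Hypothetical helper: exits with status 0 if the named R package can be loaded,
# and with status 1 otherwise.
has_r_pkg() {
    Rscript -e "if (requireNamespace('$1', quietly = TRUE)) quit(status = 0) else quit(status = 1)"
}

# Usage (after loading R with module load):
#   has_r_pkg parallel && echo "parallel is installed"
```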

Installing New Packages

If you find that a particular package you need is missing from the R version you use, you will need to install the package yourself into your home directory. There are multiple ways to install R packages. Below is an example of how you would go about installing a package from the R command prompt. To begin, create a directory in your home directory to install these packages into. In this example, the packages will be installed into a directory at ~/R/rlib-3.3.3:

[bob@vmps11 ~]$ mkdir -p ~/R/rlib-3.3.3

Notice that we are including the R version in the name of this directory. In general, when switching to a new R version you should reinstall packages to be used with the new version of R. So you might have a ~/R/rlib-3.3.3 directory and later create a ~/R/rlib-3.4.0 directory when you switch over to the new version of R.
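One way to keep the directory name in step with the R version in use is to build it from the version string. The sketch below asks Rscript for the version when it is on the PATH (that is, once an R module is loaded) and otherwise falls back to the 3.3.3 version used in this example. The variable names are illustrative.

```shell
# Ask R for its version if Rscript is available; otherwise assume 3.3.3,
# the version used in the examples on this page.
if command -v Rscript >/dev/null 2>&1; then
    R_VERSION="$(Rscript -e 'cat(as.character(getRversion()))')"
else
    R_VERSION="3.3.3"
fi

# Version-specific library directory, e.g. ~/R/rlib-3.3.3
RLIB_DIR="$HOME/R/rlib-$R_VERSION"
mkdir -p "$RLIB_DIR"
echo "$RLIB_DIR"
```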

Now load R and start up an R session from the terminal. In this example we will install the Zelig package.

[bob@vmps11 ~]$ module load GCC OpenMPI R R-bundle-Bioconductor
[bob@vmps11 ~]$ R
.
.
.
> .libPaths("~/R/rlib-3.3.3")
> install.packages("Zelig")
Installing package into ‘/gpfs22/home/bob/R/rlib-3.3.3’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
CRAN mirror 

  1: 0-Cloud                        2: Algeria                    
  3: Argentina (La Plata)           4: Australia (Canberra)       
  5: Australia (Melbourne)          6: Austria                    
  7: Belgium                        8: Brazil (BA)                
  9: Brazil (PR)                   10: Brazil (RJ)                
 11: Brazil (SP 1)                 12: Brazil (SP 2)              
 13: Canada (BC)                   14: Canada (NS)                
 15: Canada (ON)                   16: Canada (QC 1)              
 17: Canada (QC 2)                 18: Chile                      
 19: China (Beijing 1)             20: China (Beijing 2)          
 21: China (Beijing 3)             22: China (Beijing 4)          
 23: China (Hefei)                 24: China (Lanzhou)            
 25: China (Xiamen)                26: Colombia (Cali)            
 27: Czech Republic                28: Denmark                    
 29: Ecuador                       30: El Salvador                
 31: Estonia                       32: France (Lyon 1)            
 33: France (Lyon 2)               34: France (Montpellier)       
 35: France (Paris 2)              36: France (Strasbourg)        
 37: Germany (Berlin)              38: Germany (Goettingen)       
 39: Germany (Frankfurt)           40: Germany (Münster)          
 41: Greece                        42: Hungary                    
 43: Iceland                       44: India                      
 45: Indonesia (Jakarta)           46: Iran                       
 47: Ireland                       48: Italy (Milano)             
 49: Italy (Padua)                 50: Italy (Palermo)            
 51: Japan (Tokyo)                 52: Japan (Yamagata)           
 53: Korea (Seoul 1)               54: Korea (Seoul 2)            
 55: Korea (Ulsan)                 56: Lebanon                    
 57: Mexico (Mexico City)          58: Mexico (Texcoco)           
 59: Netherlands (Amsterdam)       60: Netherlands (Utrecht)      
 61: New Zealand                   62: Norway                     
 63: Philippines                   64: Poland                     
 65: Portugal                      66: Russia (Moscow 1)          
 67: Russia (Moscow 2)             68: Singapore                  
 69: Slovakia                      70: South Africa (Johannesburg)
 71: Spain (A Coruña)              72: Spain (Madrid)             
 73: Sweden                        74: Switzerland                
 75: Taiwan (Chungli)              76: Taiwan (Taipei)            
 77: Thailand                      78: Turkey                     
 79: UK (Bristol)                  80: UK (Cambridge)             
 81: UK (Hampshire)                82: UK (London)                
 83: UK (London)                   84: UK (St Andrews)            
 85: USA (CA 1)                    86: USA (CA 2)                 
 87: USA (IA)                      88: USA (IN)                   
 89: USA (KS)                      90: USA (MD)                   
 91: USA (MI 1)                    92: USA (MI 2)                 
 93: USA (MO)                      94: USA (OH 1)                 
 95: USA (OH 2)                    96: USA (OR)                   
 97: USA (PA 1)                    98: USA (PA 2)                 
 99: USA (TN)                     100: USA (TX 1)                 
101: USA (WA 1)                   102: USA (WA 2)                 
103: Venezuela                    104: Vietnam

Here we are prompted for the repository we would like to download the package from. Let’s choose the Tennessee repository (option 99):

Selection: 99
also installing the dependencies ‘zoo’, ‘sandwich’

trying URL 'http://mirrors.nics.utk.edu/cran/src/contrib/zoo_1.7-12.tar.gz'
Content type 'application/x-gzip' length 839181 bytes (819 KB)
==================================================
downloaded 819 KB

trying URL 'http://mirrors.nics.utk.edu/cran/src/contrib/sandwich_2.3-3.tar.gz'
Content type 'application/x-gzip' length 466503 bytes (455 KB)
==================================================
downloaded 455 KB

trying URL 'http://mirrors.nics.utk.edu/cran/src/contrib/Zelig_4.2-1.tar.gz'
Content type 'application/x-gzip' length 3262531 bytes (3.1 MB)
==================================================
downloaded 3.1 MB

* installing *source* package ‘zoo’ ...
** package ‘zoo’ successfully unpacked and MD5 sums checked
** libs
icc -std=gnu99 -I/usr/local/R/3.2.0/x86_64/intel14/nonet/lib64/R/include 
-DNDEBUG -I../inst/include -I/usr/local/include    -fpic  -O3 -msse3 
-funroll-loops  -funsigned-char  -c coredata.c -o coredata.o
icc -std=gnu99 -I/usr/local/R/3.2.0/x86_64/intel14/nonet/lib64/R/include 
-DNDEBUG -I../inst/include -I/usr/local/include    -fpic  -O3 -msse3 
-funroll-loops  -funsigned-char  -c init.c -o init.o
icc -std=gnu99 -I/usr/local/R/3.2.0/x86_64/intel14/nonet/lib64/R/include 
-DNDEBUG -I../inst/include -I/usr/local/include    -fpic  -O3 -msse3 
-funroll-loops  -funsigned-char  -c lag.c -o lag.o
icc -std=gnu99 -shared -L/usr/local/lib64 -o zoo.so coredata.o init.o lag.o
installing to /gpfs22/home/frenchwr/R/rlib/zoo/libs
** R
** demo
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (zoo)
* installing *source* package ‘sandwich’ ...
** package ‘sandwich’ successfully unpacked and MD5 sums checked
** R
** data
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (sandwich)
* installing *source* package ‘Zelig’ ...
** package ‘Zelig’ successfully unpacked and MD5 sums checked
** R
** data
** demo
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (Zelig)

The downloaded source packages are in
    ‘/tmp/RtmpmGSgLS/downloaded_packages’

Notice that Zelig had a few dependencies (zoo and sandwich) that were also installed along the way. It appears that the installation was successful, so let’s exit the R session to check if the packages are now in our home directory:

> quit()
Save workspace image? [y/n/c]: n
[bob@vmps65 ~]$ ls ~/R/rlib-3.3.3/
sandwich  Zelig  zoo

There they are! Finally, let’s re-start R to make sure we can load the package we’ve installed:

[bob@vmps11 ~]$ R
.
.
.
> .libPaths("~/R/rlib-3.3.3")
> library("Zelig")
Loading required package: boot
Loading required package: MASS
Loading required package: sandwich
ZELIG (Versions 4.2-1, built: 2013-09-12)

+----------------------------------------------------------------+
|  Please refer to http://gking.harvard.edu/zelig for full       |
|  documentation or help.zelig() for help with commands and      |
|  models support by Zelig.                                      |
|                                                                |
|  Zelig project citations:                                      |
|    Kosuke Imai, Gary King, and Olivia Lau.  (2009).            |
|    ``Zelig: Everyone's Statistical Software,''                 |
|    http://gking.harvard.edu/zelig                              |
|   and                                                          |
|    Kosuke Imai, Gary King, and Olivia Lau. (2008).             |
|    ``Toward A Common Framework for Statistical Analysis        |
|    and Development,'' Journal of Computational and             |
|    Graphical Statistics, Vol. 17, No. 4 (December)             |
|    pp. 892-913.                                                |
|                                                                |
|   To cite individual Zelig models, please use the citation     |
|   format printed with each model run and in the documentation. |
+----------------------------------------------------------------+

Attaching package: ‘Zelig’

The following object is masked from ‘package:utils’:

    cite

Zelig appears to load properly, confirming we have successfully installed the package. Note that we first needed to type:

> .libPaths("~/R/rlib-3.3.3")

in order to point R to the directory where our packages are installed. This command was also needed before installing the packages. Alternatively, you may add this line to your .Rprofile if you always want R to see these libraries. Note that these packages were installed for a specific version of R, so it’s unlikely that they will work for a different version. If you need information on installing a package from source code or from Bioconductor, refer to our FAQ page.
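As an alternative sketch, the library directory can also be supplied through the R_LIBS_USER environment variable, which R reads at startup, and the install can be run non-interactively by naming a CRAN mirror explicitly so that no mirror menu appears. The helper name install_r_pkg is hypothetical, and the cloud.r-project.org mirror is an assumption; any CRAN mirror should work.

```shell
# Assumption: the same version-specific directory used in the walkthrough above.
export R_LIBS_USER="$HOME/R/rlib-3.3.3"
mkdir -p "$R_LIBS_USER"

# Hypothetical helper: install one CRAN package into R_LIBS_USER
# without the interactive mirror prompt.
install_r_pkg() {
    Rscript -e "install.packages('$1', lib = Sys.getenv('R_LIBS_USER'), repos = 'https://cloud.r-project.org')"
}

# Usage (after loading R with module load):
#   install_r_pkg Zelig
```

Because the variable is exported, it can be set in ~/.bashrc or at the top of a SLURM script, and R sessions started afterwards should find the packages without an explicit .libPaths() call.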

Example Scripts

Running an R script within a SLURM job is generally straightforward. Unless you are attempting to run one of R’s multi-processing packages, you will want to request a single task, load the appropriate version of R from your SLURM script, and then run your script using the Rscript command. The --no-save flag passed to Rscript prevents R from saving the workspace, which in this example would be relatively large. The following example loads the default R module and runs a simple R script that demonstrates the utility of writing vectorized R code:

[bob@vmps11 run1]$ ls
R.slurm  vectorize.R

[bob@vmps11 run1]$ cat R.slurm 
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=R_job_slurm.out

module load GCC OpenMPI R

Rscript --no-save vectorize.R

[bob@vmps11 run1]$ cat vectorize.R 
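n <- 10^7
# populate a vector with n uniform random numbers
v <- runif(n)
# vectorized: square every element in a single operation
system.time({vv <- v*v; m <- mean(vv)}); m
# non-vectorized: square one element at a time in a loop
system.time({for (i in seq_along(v)) { vv[i] <- v[i]*v[i] }; m <- mean(vv)}); m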

Note that this example was taken from a Stack Overflow thread. We next submit the job with sbatch:

[bob@vmps11 run1]$ sbatch R.slurm 
Submitted batch job 2271536

After waiting a few minutes:

[bob@vmps11 run1]$ ls
R_job_slurm.out  R.slurm  vectorize.R

[bob@vmps11 run1]$ cat R_job_slurm.out 
   user  system elapsed 
  0.047   0.014   0.062 
[1] 0.3333861
   user  system elapsed 
 20.158   0.058  20.253 
[1] 0.3333861

The elapsed column indicates that the vectorized version of the code executed in 0.062 seconds while the non-vectorized version executed in 20.253 seconds. Both versions produced identical results (0.3333861). Moral of the story: use vectorized code in R scripts whenever possible!
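Because the default R module changes over time, a job script that needs to keep running against the same R version may pin the module versions, as recommended earlier on this page. The sketch below writes such a variant of R.slurm to disk; the file name R_pinned.slurm is hypothetical, and the versions are the ones quoted earlier and may not match what is currently installed.

```shell
# Write a variant of R.slurm that pins the toolchain and R versions.
# Assumption: the versions are the ones quoted earlier on this page.
cat > R_pinned.slurm <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=R_job_slurm.out

module load GCC/5.4.0-2.26 OpenMPI/1.10.3 R/3.3.3-X11-20160819

Rscript --no-save vectorize.R
EOF
```

The resulting file would be submitted the same way as before, with sbatch R_pinned.slurm.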

Contributing New Examples

In order to foster collaboration and develop local R expertise at Vanderbilt, we encourage users to submit examples of their own to ACCRE’s R GitHub repository. Instructions for doing this can be found on the GitHub Repositories page.