R and Rstudio

R is a language and environment for statistical computing and graphics. It is an integrated suite of software facilities for data manipulation, calculation, and graphical display. It includes

an effective data handling and storage facility,
a suite of operators for calculations on arrays, in particular matrices,
a large, coherent, integrated collection of intermediate tools for data analysis,
graphical facilities for data analysis and display either on-screen or on hardcopy, and
a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input, and output facilities

More information can be found here.

Availability and Restrictions
Usage
HOWTO: Install Local R Packages
R packages with external dependencies
- Import modules in R
renv: Package Manager
Parallel R
R batchtools
Profiling R code
Rstudio for classroom
Troubleshooting issues

Availability and Restrictions

Versions

The following versions of R are available on OSC systems:

Version	Pitzer	Ascend	Cardinal
3.5.1^#	X
3.6.0 or 3.6.0-gnu7.3	X
3.6.3 or 3.6.3-gnu9.1	X
4.0.2 or 4.0.2-gnu9.1	X
4.1.0 or 4.1.0-gnu9.1^**	X
4.2.1 or 4.2.1-gnu11.2	X*
4.3.0 or 4.3.0-gnu11.2	X
4.4.0 or 4.4.0-gnu11.2	X	X	X*^#

^# R/4.4.0 on Cardinal is compiled with gcc/12.3.0. To load R/4.4.0, please load the gcc/12.3.0 module first.

* Current default version ^# R/3.5.0 is available for both intel/16 and intel/18, but they may differ for R packages under them. R/3.6.0 and later versions are compiled with gnu and mkl. Loading R/3.6.X modules require dependencies to be preloaded whereas R/3.6.X-gnuY modules will automatically load required dependencies.

** The user state directory (session data) is stored at ~/.local/share/rstudio for the latest RStudio that we have deployed with R/4.1.0. It is located at ~/.rstudio for older versions. Users would need to delete session data from ~/.local/share/rstudio for R/4.1.0 and ~/.rstudio for older versions to clear workspace history.

Known Issue

There's a known issue loading modules in RStudio's environment after changing versions or clusters.

If you have issues using modules in the RConsole - try these remedies

restarting the terminal
restarting the RConole
logging out of the RStudio session and logging back in.
remove your ~/.local/share/rstudio

You can use module avail R to view available modules and module spider R/version to show how to load the module for a given machine. Feel free to contact OSC Help if you need other versions for your work.

Access

R is available to all OSC users. If you have any questions, please contact OSC Help.

Publisher/Vendor/Repository and License Type

R Foundation, Open source

Usage

R software can be launched two different ways; through Rstudio on OSC OnDemand and through the terminal.

Rstudio

In order to access Rstudio and OSC R workshop materials, please visit here.

Terminal Acess

In order to configure your environment for R, please run the following command:

module load R/version
#for example,
module load R/4.4.0-gnu11.2

R/3.6.0 and onwards versions use gnu compiler and intel mkl libraries for performance improvements. Loading R/3.6.X modules require dependencies to be preloaded as below whereas R/3.6.X-gnuY modules will automatically load required dependencies.

Using R

Once your environment is configured, R can be started simply by entering the following command:

For a listing of command line options, run:

R --help

Running R interactively on a login node for extended computations is not recommended and may violate OSC usage policy. Users can either request compute nodes to run R interactively or run R in batch.

Running R interactively on terminal:

Request compute node or nodes if running parallel R as,

sinteractive -A <project-account> -N 1 -n 28 -t 01:00:00

When the compute node is ready, launch R by loading modules

module load R/4.4.0-gnu11.2
R

Batch Usage

Reference the example batch script below. This script requests one full node on the Cardinal cluster for 1 hour of wall time.

#!/bin/bash
#SBATCH --job-name R_ExampleJob
#SBATCH --nodes=1 --ntasks-per-node=48
#SBATCH --time=01:00:00
#SBATCH --account <your_project_id>

module load gcc/12.3.0    
module load R/4.4.0
    
cp in.dat test.R $TMPDIR
cd $TMPDIR
    
R CMD BATCH test.R test.Rout
    
cp test.Rout $SLURM_SUBMIT_DIR

HOWTO: Install Local R Packages

R comes with a single library $R_HOME/library which contains the standard and recommended packages. This is usually in a system location.

Users can check the library path as follows after launching an R session;

> .libPaths()
[1] "/users/PZS0680/soottikkal/R/x86_64-pc-linux-gnu-library/3.6"
[2] "/usr/local/R/gnu/9.1/3.6.3/site/pkgs"
[3] "/usr/local/R/gnu/9.1/3.6.3/lib64/R/library"

Users can check the list of available packages as follows;

>installed.packages()

To install local R packages, use install.package() command. For example,

>install.packages("lattice")

For the first time local installation, it will give a warning as follows:

Installing package into ‘/usr/local/R/gnu/9.1/3.6.3/site/pkgs’
(as ‘lib’ is unspecified)
Warning in install.packages("lattice") :
'lib = "/usr/local/R/gnu/9.1/3.6.3/site/pkgs"' is not writable
Would you like to use a personal library instead? (yes/No/cancel)

Answer y , and it will create the directory and install the package there.

If you are using R older than 3.6, and if you have errors similar to

/opt/intel/18.0.3/compilers_and_libraries_2018.3.222/linux/compiler/include/complex(310): error #308: member "std::complex::_M_value" (declared at line 1346 of "/apps/gnu/7.3.0/include/c++/7.3.0/complex") is inaccessible
return __x / __y._M_value;

then create a Makevars file in your project path and add the following command to it:

CXXFLAGS = -diag-disable 308

Set the R_MAKEVARS_USER to the custom Makevars created under your project path as follows

export R_MAKEVARS_USER="/your_project_path/Makevars"

Installing Packages from GitHub

Users can install R packages directly from Github using devtools package as follows

>install.packages("devtools")
>devtools::install_github("author/package")

Installing Packages from Bioconductor

Users can install R packages directly from Bioconductor using BiocManager.

>install.packages("BiocManager")
>BiocManager::install(c("GenomicRanges", "Organism.dplyr"))

R packages with external dependencies

When installing R packages with external dependencies, users may need to import appropriate libraries into R. Sometimes using a gnu version of R can alleviate problems, e.g., try R/4.3.0-gnu11.2 if R/4.3.0 fails. One of the frequently requested R packages is sf which needs geos, gdal and PROJ libraries. We have a few versions of those packages installed and they can be loaded as modules. Another relativey common external dependency is gsl use, e.g.: module spider gsl, to find the available versions of such dependencies.

Here is an example of how to install R package sf.

module load geos/3.9.1 proj/8.1.0 gdal/3.3.1
module load R/4.4.0-gnu11.2
R
>install.packages("sf")

Now you can install other packages that depend on sf normally. Please note that if you get an error indicating the sqlite version is outdated, you can load its module along with geos, proj and gdal modules: module load sqlite/3.26.0

This is an example of the stars package installation, which has a dependency of sf package.

>install.packages("stars")
>library(stars)

When modules of external libs are not available, users can install those and link libraries to the R environment. Here is an example of how to install the sf package on Cardinal without modules.

Please note that the library paths will change to /apps/ on Pitzer instead of ./apps/spack/... on cardinal

Please note that if you get an error indicating the sqlite version is outdated, you can load its module before proceeding with the installation: module load sqlite/3.26.0

>old_ld_path <- Sys.getenv("LD_LIBRARY_PATH")
>Sys.setenv(LD_LIBRARY_PATH = paste(old_ld_path, "/usr/local/gdal/3.3.1/lib", "/usr/local/proj/8.1.0/lib","/usr/local/geos/3.9.1/",sep=":"))

>Sys.setenv("PKG_CONFIG_PATH"="/usr/local/proj/8.1.0/lib/pkgconfig")
>Sys.setenv("GDAL_DATA"="/usr/local/gdal/3.3.1/share/gdal")

>install.packages("sf", configure.args=c("--with-gdal-config=/usr/local/gdal/3.3.1/bin/gdal-config","--with-proj-include=/usr/local/proj/8.1.0/include","--with-proj-lib=/usr/local/proj/8.1.0/lib","--with-geos-config=/usr/local/geos/3.9.1/bin/geos-config"),INSTALL_opts="--no-test-load")

>dyn.load("/usr/local/gdal/3.3.1/lib/libgdal.so")
>dyn.load("/usr/local/geos/3.9.1/lib/libgeos_c.so", local=FALSE)
>library(sf)

Please note that every time before loading sf package, you have to execute the dyn.load of both libraries listed above. In addition, the first time you install an external package you should answer yes to using and creating a personal library, e.g.:

'lib = "/usr/local/R/gnu/9.1/4.0.2/site/pkgs"' is not writable. Would you like to use a personal library instead? (yes/No/cancel) yes. Would you like to create a personal library '~/R/x86_64-pc-linux-gnu-library/4.0' to install packages into? (yes/No/cancel) yes.

You can install other packages that depend on sf as follows. This is an example of terra package installation.

>install.packages("terra", configure.args=c("--with-gdal-config=/usr/local/gdal/3.3.1/bin/gdal-config","--with-proj-include=/usr/local/proj/8.1.0/include","--with-proj-lib=/usr/local/proj/8.1.0/lib","--with-geos-config=/usr/local/geos/3.9.1/bin/geos-config"),INSTALL_opts="--no-test-load")
>library(terra)

Import modules in R

Alternatively you can load modules in R for those external depedencies if they are available on system

> source(file.path(Sys.getenv("LMOD_PKG"), "init/R"))
> module("load", "geos")

You can check if an external pacakge is available

> module("avail", "geos")

renv: Package Manager

if you are using R for multiple projects, OSC recommendsrenv, an R dependency manager for R package management. Please see more information here.

The renv package helps you create reproducible environments for your R projects. Use renv to make your R projects more:

Isolated: Each project gets its own library of R packages, so you can feel free to upgrade and change package versions in one project without worrying about breaking your other projects.
Portable: Because renv captures the state of your R packages within a lockfile, you can more easily share and collaborate on projects with others, and ensure that everyone is working from a common base.
Reproducible: Use renv::snapshot() to save the state of your R library to the lockfile renv.lock. You can later use renv::restore() to restore your R library exactly as specified in the lockfile.

Users can install renv package as follows;

>install.packages("renv")

The core essence of the renv workflow is fairly simple:

After launching R, go to your project directory using R command setwd and initiate renv:
```
setwd("your/project/path")
renv::init()
```
This function forks the state of your default R libraries into a project-local library. A project-local .Rprofile is created (or amended), which is then used by new R sessions to automatically initialize renv and ensure the project-local library is used.

Work in your project as usual, installing and upgrading R packages as required as your project evolves.
Use renv::snapshot() to save the state of your project library. The project state will be serialized into a file called renv.lock under your project path.
Use renv::restore() to restore your project library from the state of your previously-created lockfile renv.lock.

In short: use renv::init() to initialize your project library, and use renv::snapshot() / renv::restore() to save and load the state of your library.

After your project has been initialized, you can work within the project as before, but without fear that installing or upgrading packages could affect other projects on your system.

Global Cache

One of renv’s primary features is the use of a global package cache, which is shared across all projects using renvWhen using renv the packages from various projects are installed to the global cache. The individual project library is instead formed as a directory of symlinks into the renv global package cache. Hence, while each renv project is isolated from other projects on your system, they can still re-use the same installed packages as required. By default, global Cache of renv is located ~/.local/share/renvUser can change the global cache location using RENV_PATHS_CACHE variable. Please see more information here.

Please note that renv does not load packages from site location (add-on packages installed by OSC) to the rsession. Users will have access to the base R packages only when using renv. All other packages required for the project should be installed by the user.

Version Control with `renv`

If you would like to version control your project, you can utilize git versioning of renv.lock file. First, initiate git for your project directory on a terminal

git init

Continue working on your R project by launching R, installing packages, saving snapshot using renv::snapshot()command. Please note that renv::snapshot() will only save packages that are used in the current project. To capture all packages within the active R libraries in the lockfile, please see the type option.

>renv::snapshot(type="simple")

If you’re using a version control system with your project, then as you call renv::snapshot() and later commit new lockfiles to your repository, you may find it necessary later to recover older versions of your lockfiles. renv provides the functions renv::history()to list previous revisions of your lockfile, and renv::revert() to recover these older lockfiles.

If you are using renvpackage for the first time, it is recommended that you check R startup files in your $HOME such as .Rprofile and .Renviron and remove any project-specific settings from these files. Please also make sure you do not have any project-specific settings in ~/.R/Makevars.

A Simple Example

First, you need to load the module for R and fire up R session

module load R/3.6.3-gnu9.1
R

Then set the working directory and initiate renv

setwd("your/project/path")
renv::init()

Let's install a package called lattice, and save the snapshot to the renv.lock

renv::install("lattice")
renv::snapshot(type="simple")

The latticepackage will be installed in global cache of renv and symlink will be saved in renv under the project path.

Restore a Project

Use renv::restore() to restore a project's dependencies from a lockfile, as previously generated by snapshot(). Let's remove the lattice package.

renv::remove("lattice")

Now let's restore the project from the previously saved snapshot so that the lattice package is restored.

renv::restore()
library(lattice)

Collaborating with `renv`

When using renv, the packages used in your project will be recorded into a lockfile, renv.lock. Because renv.lock records the exact versions of R packages used within a project, if you share that file with your collaborators, they will be able to use renv::restore() to install exactly the same R packages as recorded in the lockfile. Please find more information here.

Parallel R

Please set the environment variables OMP_NUM_THREADS and MKL_NUM_THREADS to 1 in your job scripts. This adjustment helps avoid additional internal parallel processing by libraries such as OpenMP and MKL, which can otherwise conflict with parallelism set by R’s parallel processing packages.

R provides a number of methods for parallel processing of the code. Multiple cores and nodes available on OSC clusters can be effectively deployed to run many computations in R faster through parallelism.

Consider this example, where we use a function that will generate values sampled from a normal distribution and sum the vector of those results; every call to the function is a separate simulation.

    myProc <- function(size=1000000) {
      # Load a large vector
      vec <- rnorm(size)
      # Now sum the vec values
      return(sum(vec))
    }

Serial execution with loop

Let’s first create a serial version of R code to run myProc() 100x on Owens

    tick <- proc.time()
    for(i in 1:100) {
      myProc()
    }
    tock <- proc.time() - tick
    tock

    ##    user  system elapsed
    ##   6.437   0.199   6.637

Here, we execute each trial sequentially, utilizing only one of our 28 processors on this machine. In order to apply parallelism, we need to create multiple tasks that can be dispatched to different cores. Using apply() family of R function, we can create multiple tasks. We can rewrite the above code to use apply(), which applies a function to each of the members of a list (in this case the trials we want to run):

    tick <- proc.time()
    result <- lapply(1:100, function(i) myProc())
    tock <-proc.time() - tick
    tock

    ##    user  system elapsed
    ##   6.346   0.152   6.498

parallel package

The parallellibrary can be used to dispatch tasks to different cores. The parallel::mclapply function can distributes the tasks to multiple processors.

    library(parallel)
    cores <- system("nproc", intern=TRUE)
    tick <- proc.time()
    result <- mclapply(1:100, function(i) myProc(), mc.cores=cores)
    tock <- proc.time() - tick
    tock

    ##    user  system elapsed
    ##   8.653   0.457   0.382

foreach package

The foreach package provides a looping construct for executing R code repeatedly. It uses the sequential %do% operator to indicate an expression to run.

    library(foreach)
    tick <- proc.time()
    result <-foreach(i=1:100) %do% {
       myProc()
    }
    tock <- proc.time() - tick
    tock

    ##    user  system elapsed
    ##   6.420   0.018   6.439

doParallel package

foreach supports a parallelizable operator %dopar% from the doParallel package. This allows each iteration through the loop to use different cores.

    library(doParallel, quiet = TRUE)
    library(foreach)
    cl <- makeCluster(28)
    registerDoParallel(cl)
    
    tick <- proc.time()
    result <- foreach(i=1:100, .combine=c) %dopar% {
        myProc()
    }
    tock <- proc.time() - tick
    tock
    invisible(stopCluster(cl))
    detachDoParallel()

    ##    user  system elapsed
    ##   0.085   0.013   0.446

Rmpi package

Rmpi package allows to parallelize R code across multiple nodes. Rmpi provides an interface necessary to use MPI for parallel computing using R. This allows each iteration through the loop to use different cores on different nodes. Rmpijobs cannot be run with RStudio at OSC currently, instead users can submit Rmpi jobs through terminal App. R uses openmpi as MPI interface therefor users would need to load openmpi module before installing or using Rmpi. Rmpi is installed at central location for R versions prior to 4.2.1. If it is not availbe, users can install it as follows

Rmpi Installation

   # Get source code of desired version of RMpi
wget https://cran.r-project.org/src/contrib/Rmpi_0.7-2.tar.gz

# Load modules
ml openmpi/1.10.7 R/4.4.0-gnu11.2

# Install RMpi
R CMD INSTALL --configure-vars="CPPFLAGS=-I$MPI_HOME/include LDFLAGS='-L$MPI_HOME/lib'" --configure-args="--with-Rmpi-include=$MPI_HOME/include --with-Rmpi-libpath=$MPI_HOME/lib --with-Rmpi-type=OPENMPI" Rmpi_0.7-2.tar.gz

# Test loading
library(Rmpi)

Please make sure that $MPI_HOME is defined after loading openmpi module. Newer versions of openmpi module has $OPENMPI_HOME instead of $MPI_HOME. So you would need to replace $MPI_HOME with $OPENMPI_HOME for those versions of openmpi.

Above example code can be rewritten to utilize multiple nodes with Rmpias follows;

    library(Rmpi)
    library(snow)
    workers <- as.numeric(Sys.getenv(c("PBS_NP")))-1
    cl <- makeCluster(workers, type="MPI") # MPI tasks to use
    clusterExport(cl, list('myProc'))
    tick <- proc.time()
    result <- clusterApply(cl, 1:100, function(i) myProc())
    write.table(result, file = "foo.csv", sep = ",")
    tock <- proc.time() - tick
    tock

Batch script for job submission is as follows;

    #!/bin/bash
    #SBATCH --time=10:00
    #SBATCH --nodes=2 --ntasks-per-node=28
    #SBATCH --account=<project-account>
    #SBATCH --export=OMP_NUM_THREADS=1,MKL_NUM_THREADS=1
    
    module load R/3.6.3-gnu9.1 openmpi/1.10.7
    
    # parallel R: submit job with one MPI master
    mpirun -np 1 R --slave < Rmpi.R

pbdMPI package

pbdMPI is an improved version of RMpi package that provides efficient interface to MPI by utilizing S4 classes and methods with a focus on Single Program/Multiple Data ('SPMD') parallel programming style, which is intended for batch parallel execution.

Installation of pbdMPI

Users can download latest version of pbdMPI from CRAN https://cran.r-project.org/web/packages/pbdMPI/index.html and install it as follows,

wget https://cran.r-project.org/src/contrib/pbdMPI_0.5-1.tar.gz
ml R/4.4.0-gnu11.2
ml openmpi/4.1.4-hpcx
R CMD INSTALL pbdMPI_0.5-1.tar.gz

Examples

Here are few resources that demonstrate how to use pbdMPI

https://cran.r-project.org/web/packages/pbdMPI/pbdMPI.pdf

http://hpcf-files.umbc.edu/research/papers/pbdRtara2013.pdf

R Batchtools

The R package, batchtools provides a parallel implementation of Map for high-performance computing systems managed by schedulers Slurm on OSC system. Please find more info here https://github.com/mllg/batchtools.

Users would need two files slurm.tmpl and .batch.conf.R

Slurm.tmpl is provided below. Please change "your project_ID".

    #!/bin/bash -l
    ## Job Resource Interface Definition
    ## ntasks [integer(1)]:       Number of required tasks,
    ##                            Set larger than 1 if you want to further parallelize
    ##                            with MPI within your job.
    ## ncpus [integer(1)]:        Number of required cpus per task,
    ##                            Set larger than 1 if you want to further parallelize
    ##                            with multicore/parallel within each task.
    ## walltime [integer(1)]:     Walltime for this job, in seconds.
    ##                            Must be at least 60 seconds.
    ## memory   [integer(1)]:     Memory in megabytes for each cpu.
    ##                            Must be at least 100 (when I tried lower values my
    ##                            jobs did not start at all).
    ## Default resources can be set in your .batchtools.conf.R by defining the variable
    ## 'default.resources' as a named list.
    
    <%
    # relative paths are not handled well by Slurm
    log.file = fs::path_expand(log.file)
    -%>
    
    #SBATCH --job-name=<%= job.name %>
    #SBATCH --output=<%= log.file %>
    #SBATCH --error=<%= log.file %>
    #SBATCH --time=<%= ceiling(resources$walltime / 60) %>
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=<%= resources$ncpus %>
    #SBATCH --mem-per-cpu=<%= resources$memory %>
    #SBATCH --account=your_project_id
    <%= if (!is.null(resources$partition)) sprintf(paste0("#SBATCH --partition='", resources$partition, "'")) %>
    <%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>
    
    
    ## Initialize work environment like
    ## source /etc/profile
    ## module add ...
    
    module add  R/4.0.2-gnu9.1
    
    ## Export value of DEBUGME environemnt var to slave
    export DEBUGME=<%= Sys.getenv("DEBUGME") %>
    <%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
    <%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
    <%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>
    
    
    ## Run R:
    ## we merge R output with stdout from SLURM, which gets then logged via --output option
    
    Rscript -e 'batchtools::doJobCollection("<%= uri %>")'

.batch.conf.R is provided below.

    cluster.functions = makeClusterFunctionsSlurm(template="path/to/slurm.tmpl")

A test example is provided below. Assuming the current working directory has both slurm.tmpl and .batch.conf.R files.

    ml R/4.0.2-gnu9.1
    R
    
    >install.packages("batchtools")
    >library(batchtools)
    >myFct <- function(x) {
    result <- cbind(iris[x, 1:4,],
    Node=system("hostname", intern=TRUE),
    Rversion=paste(R.Version()[6:7], collapse="."))}
    
    >reg <- makeRegistry(file.dir="myregdir", conf.file=".batchtools.conf.R")
    >Njobs <- 1:4 # Define number of jobs (here 4)
    >ids <- batchMap(fun=myFct, x=Njobs)
    >done <- submitJobs(ids, reg=reg, resources=list( walltime=60, ntasks=1, ncpus=1, memory=1024))
    >waitForJobs()
    >getStatus() # Summarize job

Profiling R code

Profiling R code helps to optimize the code by identifying bottlenecks and improve its performance. There are a number of tools that can be used to profile R code.

Grafana:

OSC jobs can be monitored for CPU and memory usage using grafana. If your job is in running status, you can get grafana metrics as follows. After log in to OSC OnDemand, select Jobs from the top tabs, then select Active Jobs and then Job that you are interested to profile. You will see grafana metrics at the bottom of the page and you can click on detailed metrics to access more information about your job at grafana.

Screen Shot of grafana metrics

Rprof:

R’s built-in tool,Rprof function can be used to profile R expressions and the summaryRprof function to summarize the result. More information can be found here.

Here is an example of profiling R code with Rprofe for data analysis on Faithful data.

Rprof("Rprof-out.prof",memory.profiling=TRUE, line.profiling=TRUE)
data(faithful)
summary(faithful)
plot(faithful)
Rprof(NULL)

To analyze profiled data, runsummaryRprof on Rprof-out.prof

summaryRprof("Rprof-out.prof")

You can read more about summaryRprofhere.

Profvis:

It provides an interactive graphical interface for visualizing data from Rprof.

library(profvis)
profvis({
    data(faithful)
    summary(faithful)
    plot(faithful)
},prof_output="profvis-out.prof")

If you are running the R code on Rstudio, it will automatically open up the visualization for the profiled data. More info can be found here.

Using Rstudio for classroom

OSC provides an isolated and custom R environment for each classroom project that requires Rstudio. More information can be found here.

Troubleshooting issues

1. If you're encountering difficulties launching the RStudio App on-demand, it's recommended to review your ~/.bashrc file for any conda/python configurations. Consider commenting out these configurations and attempting to launch the app again.

2. If your R session is taking too long to initialize, it might be due to issues from a previous session. To resolve this, consider restoring R to a fresh session by removing the previous state stored at

~/.local/share/rstudio (~/.rstudio for <R/4.1)

mv ~/.local/share/rstudio ~/.local/share/rstudio.backup

Search form

You are here