R is a language and environment for statistical computing and graphics. It is an integrated suite of software facilities for data manipulation, calculation, and graphical display. It includes
- an effective data handling and storage facility,
- a suite of operators for calculations on arrays, in particular matrices,
- a large, coherent, integrated collection of intermediate tools for data analysis,
- graphical facilities for data analysis and display either on-screen or on hardcopy, and
- a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities.
More information can be found here.
- Availability and Restrictions
- Usage
- HOWTO: Install Local R Packages
- R packages with external dependencies
- renv: Package Manager
- Parallel R
- R batchtools
- Profiling R code
- Rstudio for classroom
- Troubleshooting issues
Availability and Restrictions
Versions
The following versions of R are available on OSC systems:
Version | Owens | Pitzer | Ascend | Cardinal |
---|---|---|---|---|
3.5.0 | X | | | |
3.5.0# | X* | | | |
3.5.1# | X | | | |
3.6.0 or 3.6.0-gnu7.3 | X | | | |
3.6.1 or 3.6.1-gnu9.1 | X | | | |
3.6.3 or 3.6.3-gnu9.1 | X | X | | |
4.0.2 or 4.0.2-gnu9.1 | X | X | | |
4.1.0 or 4.1.0-gnu9.1** | X | X | | |
4.2.1 or 4.2.1-gnu11.2 | X | X* | X* | |
4.3.0 or 4.3.0-gnu11.2 | X | X | X | |
4.4.0 or 4.4.0-gnu11.2 | X | X | X | X*# |
Known Issue
There's a known issue with loading modules in RStudio's environment after changing versions or clusters.
If you have issues using modules in the R console, try these remedies:
- restarting the terminal
- restarting the R console
- logging out of the RStudio session and logging back in
- removing your ~/.local/share/rstudio directory
You can use module avail R to view available modules and module spider R/version to show how to load the module for a given machine. Feel free to contact OSC Help if you need other versions for your work.
Access
R is available to all OSC users. If you have any questions, please contact OSC Help.
Publisher/Vendor/Repository and License Type
R Foundation, Open source
Usage
R can be launched in two different ways: through RStudio on OSC OnDemand and through the terminal.
Rstudio
In order to access Rstudio and OSC R workshop materials, please visit here.
Terminal Access
In order to configure your environment for R, run the following command:
module load R/version #for example, module load R/4.4.0-gnu11.2
R versions 3.6.0 and later use the GNU compiler and Intel MKL libraries for performance improvements. Loading an R/3.6.X module requires its dependencies to be preloaded, whereas R/3.6.X-gnuY modules automatically load the required dependencies.
Using R
Once your environment is configured, R can be started simply by entering the following command:
R
For a listing of command line options, run:
R --help
Running R interactively on a login node for extended computations is not recommended and may violate OSC usage policy. Users can either request compute nodes to run R interactively or run R in batch.
Running R interactively on terminal:
Request a compute node (or several nodes, if running parallel R) as follows:
sinteractive -A <project-account> -N 1 -n 28 -t 01:00:00
When the compute node is ready, load the module and launch R:
module load R/4.4.0-gnu11.2
R
Batch Usage
Reference the example batch script below. This script requests one node with 48 tasks per node for 1 hour of wall time. Set --ntasks-per-node to the core count of the cluster you are using (for example, 28 on Owens).
#!/bin/bash
#SBATCH --job-name R_ExampleJob
#SBATCH --nodes=1 --ntasks-per-node=48
#SBATCH --time=01:00:00
#SBATCH --account <your_project_id>

module load R/4.4.0-gnu11.2
cp in.dat test.R $TMPDIR
cd $TMPDIR
R CMD BATCH test.R test.Rout
cp test.Rout $SLURM_SUBMIT_DIR
HOWTO: Install Local R Packages
R comes with a single library, $R_HOME/library, which contains the standard and recommended packages. This is usually in a system location. On Owens, it is /usr/local/R/gnu/9.1/3.6.3/lib64/R for R/3.6.3. OSC also installs popular R packages into the site library located at /usr/local/R/gnu/9.1/3.6.3/site/pkgs for R/3.6.3 on Owens.
Users can check the library path as follows after launching an R session:
> .libPaths()
[1] "/users/PZS0680/soottikkal/R/x86_64-pc-linux-gnu-library/3.6"
[2] "/usr/local/R/gnu/9.1/3.6.3/site/pkgs"
[3] "/usr/local/R/gnu/9.1/3.6.3/lib64/R/library"
Users can check the list of available packages as follows:
> installed.packages()
To install local R packages, use the install.packages() command. For example,
> install.packages("lattice")
The first time you install a package locally, R gives a warning as follows:
Installing package into ‘/usr/local/R/gnu/9.1/3.6.3/site/pkgs’
(as ‘lib’ is unspecified)
Warning in install.packages("lattice") :
  'lib = "/usr/local/R/gnu/9.1/3.6.3/site/pkgs"' is not writable
Would you like to use a personal library instead? (yes/No/cancel)
Answer y, and it will create the directory and install the package there.
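In a batch job there is no prompt to answer, so it can help to create the personal library up front. The following is a minimal sketch, not an OSC-provided tool: use_personal_library is a hypothetical helper name, and it assumes R_LIBS_USER holds the personal library path, as it normally does inside an R session.

```r
# Hypothetical helper: create a personal library directory (if needed) and
# put it first on the library search path, so that install.packages()
# installs there without asking.
use_personal_library <- function(path = Sys.getenv("R_LIBS_USER")) {
  path <- path.expand(path)
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  # .libPaths() keeps the site and system libraries after the new entry
  .libPaths(c(path, .libPaths()))
  invisible(.libPaths()[1])
}
```

After calling use_personal_library(), a call such as install.packages("lattice") should install into that directory rather than the read-only site library.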
If you are using R older than 3.6, and if you have errors similar to
/opt/intel/18.0.3/compilers_and_libraries_2018.3.222/linux/compiler/include/complex(310): error #308: member "std::complex::_M_value" (declared at line 1346 of "/apps/gnu/7.3.0/include/c++/7.3.0/complex") is inaccessible return __x / __y._M_value;
then create a Makevars file in your project path and add the following command to it:
CXXFLAGS = -diag-disable 308
Set the R_MAKEVARS_USER environment variable to the custom Makevars file created under your project path as follows:
export R_MAKEVARS_USER="/your_project_path/Makevars"
Installing Packages from GitHub
Users can install R packages directly from GitHub using the devtools package as follows:
> install.packages("devtools")
> devtools::install_github("author/package")
Installing Packages from Bioconductor
Users can install R packages directly from Bioconductor using BiocManager.
> install.packages("BiocManager")
> BiocManager::install(c("GenomicRanges", "Organism.dplyr"))
R packages with external dependencies
When installing R packages with external dependencies, users may need to import appropriate libraries into R. Sometimes using a gnu version of R can alleviate problems, e.g., try R/4.3.0-gnu11.2 if R/4.3.0 fails. One of the frequently requested R packages is sf, which needs the geos, gdal and PROJ libraries. We have a few versions of those packages installed and they can be loaded as modules. Another relatively common external dependency is gsl; use, e.g., module spider gsl to find the available versions of such dependencies.
Here is an example of how to install the R package sf.
module load geos/3.9.1 proj/8.1.0 gdal/3.3.1
module load R/4.4.0-gnu11.2
R
> install.packages("sf")
Now you can install other packages that depend on sf normally. Please note that if you get an error indicating the sqlite version is outdated, you can load its module along with the geos, proj and gdal modules: module load sqlite/3.26.0
This is an example of installing the stars package, which depends on the sf package.
> install.packages("stars")
> library(stars)
When modules of external libs are not available, users can install those and link libraries to the R environment. Here is an example of how to install the sf package on Owens without modules.
Please note that these libraries are located under /apps/ on Pitzer instead of /usr/local/ as on Owens.
module load sqlite/3.26.0
> old_ld_path <- Sys.getenv("LD_LIBRARY_PATH")
> Sys.setenv(LD_LIBRARY_PATH = paste(old_ld_path, "/usr/local/gdal/3.3.1/lib", "/usr/local/proj/8.1.0/lib", "/usr/local/geos/3.9.1/lib", sep=":"))
> Sys.setenv("PKG_CONFIG_PATH"="/usr/local/proj/8.1.0/lib/pkgconfig")
> Sys.setenv("GDAL_DATA"="/usr/local/gdal/3.3.1/share/gdal")
> install.packages("sf", configure.args=c("--with-gdal-config=/usr/local/gdal/3.3.1/bin/gdal-config","--with-proj-include=/usr/local/proj/8.1.0/include","--with-proj-lib=/usr/local/proj/8.1.0/lib","--with-geos-config=/usr/local/geos/3.9.1/bin/geos-config"), INSTALL_opts="--no-test-load")
> dyn.load("/usr/local/gdal/3.3.1/lib/libgdal.so")
> dyn.load("/usr/local/geos/3.9.1/lib/libgeos_c.so", local=FALSE)
> library(sf)
Please note that every time before loading the sf package, you have to execute dyn.load for both libraries listed above. In addition, the first time you install an external package, you should answer yes to using and creating a personal library.
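Since both dyn.load calls have to be repeated in every session, one option is to wrap them in a small function. This is only a sketch: load_sf_with_libs is a hypothetical name, the default paths are the Owens paths copied from the example above, and the up-front file check is an added assumption about how you would want failures reported.

```r
# Hypothetical helper: load the gdal and geos shared libraries, then sf.
# The default paths are the Owens locations used in the example above.
load_sf_with_libs <- function(
    libs = c(gdal = "/usr/local/gdal/3.3.1/lib/libgdal.so",
             geos = "/usr/local/geos/3.9.1/lib/libgeos_c.so")) {
  missing_libs <- libs[!file.exists(libs)]
  if (length(missing_libs) > 0) {
    stop("Shared libraries not found: ",
         paste(missing_libs, collapse = ", "))
  }
  # Same calls, in the same order, as in the installation example
  dyn.load(libs[["gdal"]])
  dyn.load(libs[["geos"]], local = FALSE)
  library(sf)
}
```

Calling load_sf_with_libs() at the top of a script then replaces the two dyn.load lines and the library(sf) call; on another cluster, pass the matching paths in libs.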
You can install other packages that depend on sf as follows. This is an example of the terra package installation.
> install.packages("terra", configure.args=c("--with-gdal-config=/usr/local/gdal/3.3.1/bin/gdal-config","--with-proj-include=/usr/local/proj/8.1.0/include","--with-proj-lib=/usr/local/proj/8.1.0/lib","--with-geos-config=/usr/local/geos/3.9.1/bin/geos-config"), INSTALL_opts="--no-test-load")
> library(terra)
Import modules in R
Alternatively, you can load modules in R for those external dependencies if they are available on the system:
> source(file.path(Sys.getenv("LMOD_PKG"), "init/R"))
> module("load", "geos")
You can check whether an external package is available:
> module("avail", "geos")
renv: Package Manager
If you are using R for multiple projects, OSC recommends renv, an R dependency manager for R package management. Please see more information here.
The renv package helps you create reproducible environments for your R projects. Use renv to make your R projects more:
- Isolated: Each project gets its own library of R packages, so you can feel free to upgrade and change package versions in one project without worrying about breaking your other projects.
- Portable: Because renv captures the state of your R packages within a lockfile, you can more easily share and collaborate on projects with others, and ensure that everyone is working from a common base.
- Reproducible: Use renv::snapshot() to save the state of your R library to the lockfile renv.lock. You can later use renv::restore() to restore your R library exactly as specified in the lockfile.
Users can install the renv package as follows:
> install.packages("renv")
The core essence of the renv workflow is fairly simple:
- After launching R, go to your project directory using the R command setwd and initiate renv:
setwd("your/project/path")
renv::init()
This function forks the state of your default R libraries into a project-local library. A project-local .Rprofile is created (or amended), which is then used by new R sessions to automatically initialize renv and ensure the project-local library is used.
- Work in your project as usual, installing and upgrading R packages as required as your project evolves.
- Use renv::snapshot() to save the state of your project library. The project state will be serialized into a file called renv.lock under your project path.
- Use renv::restore() to restore your project library from the state of your previously created lockfile renv.lock.
In short: use renv::init() to initialize your project library, and use renv::snapshot() / renv::restore() to save and load the state of your library.
After your project has been initialized, you can work within the project as before, but without fear that installing or upgrading packages could affect other projects on your system.
Global Cache
One of renv’s primary features is the use of a global package cache, which is shared across all projects using renv. When using renv, the packages from various projects are installed to the global cache. The individual project library is instead formed as a directory of symlinks into the renv global package cache. Hence, while each renv project is isolated from other projects on your system, they can still re-use the same installed packages as required. By default, the global cache of renv is located at ~/.local/share/renv. Users can change the global cache location using the RENV_PATHS_CACHE variable. Please see more information here.
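As an illustration of changing the cache location, the sketch below sets RENV_PATHS_CACHE from within R and, if renv is installed, prints the cache path renv reports. The directory under tempdir() is a placeholder chosen only so the sketch runs anywhere; in practice you would point it at a directory you own, and putting the same setting in ~/.Renviron is the usual way to make it persistent.

```r
# Placeholder cache directory; replace with a path of your own choosing
cache_dir <- file.path(tempdir(), "renv-cache")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

# Tell renv where to keep its global package cache
Sys.setenv(RENV_PATHS_CACHE = cache_dir)

# If renv is available, show the cache path it will use
if (requireNamespace("renv", quietly = TRUE)) {
  print(renv::paths$cache())
}
```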
Please note that renv does not load packages from the site location (add-on packages installed by OSC) into the R session. When using renv, users have access to the base R packages only. All other packages required for the project should be installed by the user.
Version Control with renv
If you would like to version control your project, you can track the renv.lock file with git. First, initialize git for your project directory in a terminal:
git init
Continue working on your R project by launching R, installing packages, and saving a snapshot using the renv::snapshot() command. Please note that renv::snapshot() will only save packages that are used in the current project. To capture all packages within the active R libraries in the lockfile, please see the type option.
>renv::snapshot(type="simple")
If you’re using a version control system with your project, then as you call renv::snapshot() and later commit new lockfiles to your repository, you may find it necessary later to recover older versions of your lockfiles. renv provides the functions renv::history() to list previous revisions of your lockfile, and renv::revert() to recover these older lockfiles.
If you are using the renv package for the first time, it is recommended that you check R startup files in your $HOME such as .Rprofile and .Renviron and remove any project-specific settings from these files. Please also make sure you do not have any project-specific settings in ~/.R/Makevars.
A Simple Example
First, you need to load the module for R and start an R session:
module load R/3.6.3-gnu9.1
R
Then set the working directory and initiate renv:
setwd("your/project/path")
renv::init()
Let's install a package called lattice, and save the snapshot to renv.lock:
renv::install("lattice")
renv::snapshot(type="simple")
The lattice package will be installed in the global cache of renv, and a symlink will be saved in the renv directory under the project path.
Restore a Project
Use renv::restore() to restore a project's dependencies from a lockfile, as previously generated by snapshot(). Let's remove the lattice package.
renv::remove("lattice")
Now let's restore the project from the previously saved snapshot so that the lattice package is restored.
renv::restore()
library(lattice)
Collaborating with renv
When using renv, the packages used in your project will be recorded into a lockfile, renv.lock. Because renv.lock records the exact versions of R packages used within a project, if you share that file with your collaborators, they will be able to use renv::restore() to install exactly the same R packages as recorded in the lockfile. Please find more information here.
Parallel R
When running parallel R, it is recommended to set OMP_NUM_THREADS and MKL_NUM_THREADS to 1 in your job scripts. This adjustment helps avoid additional internal parallel processing by libraries such as OpenMP and MKL, which can otherwise conflict with parallelism set by R’s parallel processing packages.
R provides a number of methods for parallel processing of the code. Multiple cores and nodes available on OSC clusters can be effectively deployed to run many computations in R faster through parallelism.
Consider this example, where we use a function that will generate values sampled from a normal distribution and sum the vector of those results; every call to the function is a separate simulation.
myProc <- function(size=1000000) {
# Load a large vector
vec <- rnorm(size)
# Now sum the vec values
return(sum(vec))
}
Serial execution with loop
Let’s first create a serial version of the R code to run myProc() 100 times on Owens:
tick <- proc.time()
for(i in 1:100) {
myProc()
}
tock <- proc.time() - tick
tock
## user system elapsed
## 6.437 0.199 6.637
Here, we execute each trial sequentially, utilizing only one of the 28 processors on this machine. In order to apply parallelism, we need to create multiple tasks that can be dispatched to different cores. The apply() family of R functions lets us express the work as multiple tasks. We can rewrite the above code to use lapply(), which applies a function to each of the members of a list (in this case the trials we want to run). This version still runs serially, but it has the form that the parallel functions expect:
tick <- proc.time()
result <- lapply(1:100, function(i) myProc())
tock <-proc.time() - tick
tock
## user system elapsed
## 6.346 0.152 6.498
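The tick/tock pattern used in these examples can also be written with system.time(), which wraps the same proc.time() measurement around a single expression. The sketch below repeats the lapply() version that way; the smaller size argument is an assumption made only to keep the run short.

```r
# Same simulation as in the text: sum of normally distributed samples
myProc <- function(size = 1000000) {
  sum(rnorm(size))
}

# system.time() evaluates the expression and returns user, system and
# elapsed times; the assignment to `result` happens in the calling scope
timing <- system.time(
  result <- lapply(1:100, function(i) myProc(size = 100000))
)

# Elapsed (wall-clock) seconds
timing[["elapsed"]]
```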
parallel package
The parallel library can be used to dispatch tasks to different cores. The parallel::mclapply function distributes the tasks to multiple processors.
library(parallel)
cores <- as.integer(system("nproc", intern=TRUE))
tick <- proc.time()
result <- mclapply(1:100, function(i) myProc(), mc.cores=cores)
tock <- proc.time() - tick
tock
## user system elapsed
## 8.653 0.457 0.382
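Inside a Slurm job, the number of usable cores can also be taken from the job environment rather than from nproc. The sketch below assumes the SLURM_CPUS_ON_NODE environment variable is set by Slurm within the job and falls back to detectCores() otherwise. It also switches to the L'Ecuyer-CMRG generator so that forked workers draw from separate random-number streams. The seed value and the reduced size are arbitrary choices for illustration.

```r
library(parallel)

# Same simulation as in the text: sum of normally distributed samples
myProc <- function(size = 1000000) {
  sum(rnorm(size))
}

# Use the CPU count Slurm reports for this job when it is set; otherwise
# fall back to detectCores(), and finally to a single core
slurm_cpus <- suppressWarnings(
  as.integer(Sys.getenv("SLURM_CPUS_ON_NODE", unset = NA))
)
cores <- if (!is.na(slurm_cpus)) slurm_cpus else detectCores()
if (is.na(cores) || cores < 1L) cores <- 1L

# L'Ecuyer-CMRG gives each forked worker its own random-number stream
RNGkind("L'Ecuyer-CMRG")
set.seed(2024)

result <- mclapply(1:100, function(i) myProc(size = 10000), mc.cores = cores)
sums <- unlist(result)
length(sums)
```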
foreach package
The foreach package provides a looping construct for executing R code repeatedly. It uses the sequential %do% operator to indicate an expression to run.
library(foreach)
tick <- proc.time()
result <-foreach(i=1:100) %do% {
myProc()
}
tock <- proc.time() - tick
tock
## user system elapsed
## 6.420 0.018 6.439
doParallel package
foreach supports a parallelizable operator, %dopar%, which runs on a parallel backend registered by the doParallel package. This allows each iteration through the loop to use different cores.
library(doParallel, quietly = TRUE)
library(foreach)
cl <- makeCluster(28)
registerDoParallel(cl)
tick <- proc.time()
result <- foreach(i=1:100, .combine=c) %dopar% {
myProc()
}
tock <- proc.time() - tick
tock
invisible(stopCluster(cl))
## user system elapsed
## 0.085 0.013 0.446
Rmpi package
The Rmpi package allows R code to be parallelized across multiple nodes. Rmpi provides the interface necessary to use MPI for parallel computing in R. This allows each iteration through the loop to use different cores on different nodes. Rmpi jobs cannot currently be run with RStudio at OSC; instead, users can submit Rmpi jobs through the terminal app. R uses openmpi as its MPI interface, therefore users need to load an openmpi module before installing or using Rmpi. Rmpi is installed at a central location for R versions prior to 4.2.1. If it is not available, users can install it as follows.
Rmpi Installation
# Get source code of desired version of Rmpi
wget https://cran.r-project.org/src/contrib/Rmpi_0.7-2.tar.gz
# Load modules
ml openmpi/1.10.7 R/4.4.0-gnu11.2
# Install Rmpi
R CMD INSTALL --configure-vars="CPPFLAGS=-I$MPI_HOME/include LDFLAGS='-L$MPI_HOME/lib'" --configure-args="--with-Rmpi-include=$MPI_HOME/include --with-Rmpi-libpath=$MPI_HOME/lib --with-Rmpi-type=OPENMPI" Rmpi_0.7-2.tar.gz
# Test loading
library(Rmpi)
Please make sure that $MPI_HOME is defined after loading the openmpi module. Newer versions of the openmpi module define $OPENMPI_HOME instead of $MPI_HOME, so you would need to replace $MPI_HOME with $OPENMPI_HOME for those versions of openmpi.
The above example code can be rewritten to utilize multiple nodes with Rmpi as follows:
library(Rmpi)
library(snow)
workers <- as.numeric(Sys.getenv(c("PBS_NP")))-1
cl <- makeCluster(workers, type="MPI") # MPI tasks to use
clusterExport(cl, list('myProc'))
tick <- proc.time()
result <- clusterApply(cl, 1:100, function(i) myProc())
write.table(result, file = "foo.csv", sep = ",")
tock <- proc.time() - tick
tock
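The script above derives the worker count from PBS_NP. If that variable is not set in your job environment, the count can be derived from Slurm's own task count instead. The following is a hedged sketch: mpi_worker_count is a hypothetical helper, and it assumes SLURM_NTASKS is set when the job requests tasks, falling back to PBS_NP and then to a single worker.

```r
# Hypothetical helper: number of worker processes to start, i.e. all MPI
# tasks except the master. SLURM_NTASKS is checked first, then PBS_NP as
# used in the script above; with neither set, one worker is assumed.
mpi_worker_count <- function() {
  ntasks <- Sys.getenv("SLURM_NTASKS",
                       unset = Sys.getenv("PBS_NP", unset = "1"))
  ntasks <- suppressWarnings(as.integer(ntasks))
  if (is.na(ntasks)) ntasks <- 1L
  max(ntasks - 1L, 1L)
}

workers <- mpi_worker_count()
```

The resulting workers value can be passed to makeCluster(workers, type="MPI") in place of the PBS_NP expression.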
Batch script for job submission is as follows:
#!/bin/bash
#SBATCH --time=10:00
#SBATCH --nodes=2 --ntasks-per-node=28
#SBATCH --account=<project-account>
#SBATCH --export=OMP_NUM_THREADS=1,MKL_NUM_THREADS=1

module load R/3.6.3-gnu9.1 openmpi/1.10.7

# parallel R: submit job with one MPI master
mpirun -np 1 R --slave < Rmpi.R
pbdMPI package
pbdMPI is an improved version of the Rmpi package that provides an efficient interface to MPI by utilizing S4 classes and methods, with a focus on the Single Program/Multiple Data ('SPMD') parallel programming style, which is intended for batch parallel execution.
Installation of pbdMPI
Users can download the latest version of pbdMPI from CRAN https://cran.r-project.org/web/packages/pbdMPI/index.html and install it as follows:
wget https://cran.r-project.org/src/contrib/pbdMPI_0.5-1.tar.gz
ml R/4.4.0-gnu11.2
ml openmpi/4.1.4-hpcx
R CMD INSTALL pbdMPI_0.5-1.tar.gz
Examples
Here are a few resources that demonstrate how to use pbdMPI:
https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=BD40B7B615DF79...
http://hpcf-files.umbc.edu/research/papers/pbdRtara2013.pdf
R Batchtools
The R package batchtools provides a parallel implementation of Map for high-performance computing systems managed by schedulers such as Slurm, which is used on OSC systems. Please find more info here https://github.com/mllg/batchtools.
Users need two files: slurm.tmpl and .batchtools.conf.R.
slurm.tmpl is provided below. Please change "your_project_id" to your own project ID.
#!/bin/bash -l

## Job Resource Interface Definition
##
## ntasks [integer(1)]: Number of required tasks,
##   Set larger than 1 if you want to further parallelize
##   with MPI within your job.
## ncpus [integer(1)]: Number of required cpus per task,
##   Set larger than 1 if you want to further parallelize
##   with multicore/parallel within each task.
## walltime [integer(1)]: Walltime for this job, in seconds.
##   Must be at least 60 seconds.
## memory [integer(1)]: Memory in megabytes for each cpu.
##   Must be at least 100 (when I tried lower values my
##   jobs did not start at all).
##
## Default resources can be set in your .batchtools.conf.R by defining the variable
## 'default.resources' as a named list.

<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>

#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= ceiling(resources$walltime / 60) %>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<%= resources$ncpus %>
#SBATCH --mem-per-cpu=<%= resources$memory %>
#SBATCH --account=your_project_id
<%= if (!is.null(resources$partition)) sprintf(paste0("#SBATCH --partition='", resources$partition, "'")) %>
<%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>

## Initialize work environment like
## source /etc/profile
## module add ...
module add R/4.0.2-gnu9.1

## Export value of DEBUGME environemnt var to slave
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

## Run R:
## we merge R output with stdout from SLURM, which gets then logged via --output option
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
.batchtools.conf.R is provided below.
cluster.functions = makeClusterFunctionsSlurm(template="path/to/slurm.tmpl")
A test example is provided below, assuming the current working directory has both the slurm.tmpl and .batchtools.conf.R files.
ml R/4.0.2-gnu9.1
R
> install.packages("batchtools")
> library(batchtools)
> myFct <- function(x) { result <- cbind(iris[x, 1:4,], Node=system("hostname", intern=TRUE), Rversion=paste(R.Version()[6:7], collapse=".")) }
> reg <- makeRegistry(file.dir="myregdir", conf.file=".batchtools.conf.R")
> Njobs <- 1:4 # Define number of jobs (here 4)
> ids <- batchMap(fun=myFct, x=Njobs)
> done <- submitJobs(ids, reg=reg, resources=list(walltime=60, ntasks=1, ncpus=1, memory=1024))
> waitForJobs()
> getStatus() # Summarize job
Profiling R code
Profiling R code helps you optimize it by identifying bottlenecks and improving its performance. There are a number of tools that can be used to profile R code.
Grafana:
OSC jobs can be monitored for CPU and memory usage using Grafana. If your job is running, you can get Grafana metrics as follows. After logging in to OSC OnDemand, select Jobs from the top tabs, then select Active Jobs, and then the job that you want to profile. You will see Grafana metrics at the bottom of the page, and you can click on detailed metrics to access more information about your job in Grafana.
Rprof:
R’s built-in Rprof function can be used to profile R expressions, and the summaryRprof function to summarize the result. More information can be found here.
Here is an example of profiling R code with Rprof for data analysis on the faithful data set.
Rprof("Rprof-out.prof", memory.profiling=TRUE, line.profiling=TRUE)
data(faithful)
summary(faithful)
plot(faithful)
Rprof(NULL)
To analyze the profiled data, run summaryRprof on Rprof-out.prof:
summaryRprof("Rprof-out.prof")
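To show what the summary contains, here is a small self-contained sketch that profiles a deliberately slow loop and looks at the by.self table returned by summaryRprof. The workload and the temporary file name are arbitrary choices for illustration; the loop is only there to run long enough for the sampling profiler to collect data.

```r
# Profile into a temporary file so nothing is left in the working directory
prof_file <- tempfile(fileext = ".prof")

Rprof(prof_file)
# Arbitrary workload: repeatedly sort large random vectors
for (i in 1:20) {
  x <- sort(rnorm(1e6))
}
Rprof(NULL)

# summaryRprof() returns a list; by.self ranks functions by the time
# spent in the function itself, excluding the functions it calls
prof <- summaryRprof(prof_file)
head(prof$by.self, 5)

# Total time covered by the profile, in seconds
prof$sampling.time
```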
You can read more about summaryRprof here.
Profvis:
It provides an interactive graphical interface for visualizing data from Rprof.
library(profvis)
profvis({
  data(faithful)
  summary(faithful)
  plot(faithful)
}, prof_output="profvis-out.prof")
If you are running the R code on Rstudio, it will automatically open up the visualization for the profiled data. More info can be found here.
Using Rstudio for classroom
OSC provides an isolated and custom R environment for each classroom project that requires Rstudio. More information can be found here.
Troubleshooting issues
1. If you're encountering difficulties launching the RStudio app on OnDemand, it's recommended to review your ~/.bashrc file for any conda/python configurations. Consider commenting out these configurations and attempting to launch the app again.
2. If your R session is taking too long to initialize, it might be due to issues from a previous session. To resolve this, consider restoring R to a fresh session by removing the previous state stored at ~/.local/share/rstudio (~/.rstudio for <R/4.1):
mv ~/.local/share/rstudio ~/.local/share/rstudio.backup