mpiBLAST

mpiBLAST is a freely available, open-source, parallel implementation of NCBI BLAST. It uses explicit MPI communication to run across distributed computational resources such as a cluster, and can therefore use all available resources. Standard NCBI BLAST, by contrast, can only take advantage of shared-memory multi-processors (SMPs).

Availability & Restrictions

mpiBLAST is available without restriction to all OSC users.

The following versions of mpiBLAST are available on OSC systems:

Version   Glenn   Oakley
1.6.0     X       X

Usage

Set-up

To load the mpiBLAST software on the Glenn system, use the following commands:

module load biosoftw
module load mpiblast

On the Oakley system, use the following command:

module load mpiblast

Using mpiBLAST

Once the mpiblast module is loaded, the following commands are available:

mpiblast 
mpiblast_cleanup 
mpiformatdb

Formatting a database

Before processing BLAST queries, the sequence database must be formatted with mpiformatdb. The command line syntax looks like this:
mpiformatdb -N 16 -i nt -o T

The above command would format the nt database into 16 fragments. Note that currently mpiformatdb does not support multiple input files.

mpiformatdb places the formatted database fragments in the same directory as the FASTA database. To specify a different target location, use the "-n" option, as with NCBI formatdb.

Querying the database

The mpiblast command line syntax is nearly identical to that of NCBI's blastall program. Running a query with 18 MPI processes would look like:
mpiexec -n 18 mpiblast -p blastn -d nt -i blast_query.fas -o blast_results.txt

The above command would query the sequences in blast_query.fas against the nt database and write the results to the blast_results.txt file in the current working directory. By default, mpiBLAST reads configuration information from ~/.ncbirc. Furthermore, mpiBLAST needs at least 3 processes to perform a search: two processes are dedicated to scheduling tasks and coordinating file output, and any additional processes perform the actual search tasks.
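The process accounting described above can be checked before launching a job. The following is a minimal sketch, assuming the default single-partition layout in which two processes are reserved for scheduling and output. The function name mpiblast_workers is hypothetical and is not part of mpiBLAST.

```shell
# Hypothetical helper (not part of mpiBLAST): given the total number of MPI
# processes, print how many of them act as search workers.
# Assumes the default layout: 2 processes coordinate, the rest search.
mpiblast_workers() {
    local nprocs=$1
    if [ "$nprocs" -lt 3 ]; then
        echo "mpiBLAST needs at least 3 processes, got $nprocs" >&2
        return 1
    fi
    echo $(( nprocs - 2 ))
}

mpiblast_workers 18   # prints 16
```

With -n 18, as in the query above, this reports 16 workers; any value below 3 is rejected.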

Extra options to mpiblast

  • --partition-size=[integer]
    Enable hierarchical scheduling with multiple masters. The partition size equals the number of workers in a partition plus 1 (the master process). For example, a partition size of 17 creates partitions consisting of 16 workers and 1 master. An individual output file will be generated for each partition. By default, mpiBLAST uses one partition. This option is only available for version 1.6 or above.
  • --replica-group-size=[integer]
    Specify how database fragments are replicated within a partition. Suppose the total number of database fragments is F, the number of MPI processes in a partition is N, and the replica-group-size is G, then in total (N-1)/G database replicas will be distributed in the partition (the master process does not host any database fragments), and each worker process will host F/G fragments. In other words, a database replica will be distributed to every G MPI processes.
  • --query-segment-size=[integer]
    Specify the number of query sequences that will be fetched from the supermaster to the master at a time. The default value is 5. This parameter controls the granularity of load balancing between different partitions. This option is only available for version 1.6 or above.
  • --use-parallel-write
    Enable the high-performance parallel output solution. Note that the current implementation of parallel-write does not require a parallel file system.
  • --use-virtual-frags
    Enable workers to cache database fragments in memory instead of local storage. This is recommended on diskless platforms where no local storage is attached to each processor. Enabled by default on Blue Gene systems.
  • --predistribute-db
    Distribute database fragments to workers before the search begins. This is especially useful for reducing data input time when multiple database replicas need to be distributed to workers.
  • --output-search-stats
    Enable output of the search statistics in the pairwise and XML output format. This could cause performance degradation on some diskless systems such as Blue Gene.
  • --removedb
    Removes the local copy of the database from each node before terminating execution.
  • --copy-via=[cp|rcp|scp|mpi|none]
    Sets the method of copying files that each worker will use. Default = "cp"
    • cp : use standard file system "cp" command. Additional option is --concurrent.
    • rcp : use rsh "rcp" command. Additional option is --concurrent.
    • scp : use ssh "scp" command. Additional option is --concurrent.
    • mpi : use MPI_Send/MPI_Recv to copy files. Additional option is --mpi-size.
    • none : do not copy files, instead use shared storage as local storage.
  • --debug[=filename]
    Produces verbose debugging output for each node and optionally logs the output to a file.
  • --time-profile=[filename]
    Reports execution time profile.
  • --version
    Print the mpiBLAST version.

Please refer to the README file in the mpiBLAST package for a performance tuning guide.
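The arithmetic in the --partition-size and --replica-group-size descriptions can be worked through with a small script. This is only a sketch of the formulas stated above, (N-1)/G replicas and F/G fragments per worker. The function name replica_layout is hypothetical, and the sketch assumes G divides both N-1 and F evenly, since the text does not say how remainders are handled.

```shell
# Hypothetical helper (not part of mpiBLAST): apply the formulas from the
# --replica-group-size description to one partition.
#   $1 = F, total number of database fragments
#   $2 = N, MPI processes in the partition (workers plus one master)
#   $3 = G, the replica group size
replica_layout() {
    local F=$1 N=$2 G=$3
    local workers=$(( N - 1 ))
    # Assumption: only the evenly divisible case is handled here.
    if [ $(( workers % G )) -ne 0 ] || [ $(( F % G )) -ne 0 ]; then
        echo "G must divide both N-1 and F in this sketch" >&2
        return 1
    fi
    echo "replicas=$(( workers / G )) fragments_per_worker=$(( F / G ))"
}

replica_layout 16 17 4   # prints replicas=4 fragments_per_worker=4
```

For a partition size of 17 (16 workers and 1 master), 16 fragments, and a replica group size of 4, the partition holds 4 replicas of the database and each worker hosts 4 fragments.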

Removing a database

The --removedb command line option will cause mpiBLAST to do all work in a temporary directory that will get removed from each node's local storage directory upon successful termination. For example:
mpiexec -n 18 mpiblast -p blastx -d yeast.aa -i ech_10k.fas -o results.txt --removedb

The above command would perform an 18-process (16-worker) search of the yeast.aa database, writing the output to results.txt. Upon completion, worker nodes would delete the yeast.aa database fragments from their local storage.

Databases can also be removed without performing a search in the following manner:
mpiexec -n 18 mpiblast_cleanup

Batch Usage

Below is a sample batch script for running an mpiBLAST job. It asks for 12 processors (one node with 12 processors per node) and 30 minutes of walltime.

#PBS -l walltime=30:00
#PBS -l nodes=1:ppn=12
#PBS -N mpiBLAST
#PBS -j oe

# PBS jobs start in the home directory, so this installs ~/.ncbirc
cp /usr/local/mpiblast/1.6.0/.ncbirc ./
module load mpiblast

# copy data over to $TMPDIR on compute node
cd $PBS_O_WORKDIR
cp query.fasta $TMPDIR
cp db/benchmark.fasta* $TMPDIR

# Break the database into 10 pieces
cd $TMPDIR
/usr/bin/time mpiformatdb -N 10 -i benchmark.fasta -o T -p T
cp benchmark.fasta* /nfs/proj01/PZS0002/biosoftw/db/

# run mpiblast
/usr/bin/time mpiexec -n 12 mpiblast -p blastp -d benchmark.fasta -i query.fasta -o blast_results.txt


# Copy output back to working directory
mkdir $PBS_O_WORKDIR/$PBS_JOBID
cp blast_results.txt $PBS_O_WORKDIR/$PBS_JOBID
cd $PBS_O_WORKDIR
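The sample script above hard-codes the process count (-n 12) and the fragment count (-N 10). As a hedged sketch, both values could instead be derived from the PBS allocation, since $PBS_NODEFILE contains one line per allocated processor slot. The helper name count_slots is hypothetical, and using one fragment per worker (processes minus 2) is an assumption that mirrors the numbers in the sample script, not a documented requirement.

```shell
# Hypothetical helper: count the processor slots listed in a PBS node file
# (one non-empty line per slot).
count_slots() {
    grep -c . "$1"
}

# Only runs inside a PBS job, where PBS_NODEFILE is set and readable.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "${PBS_NODEFILE:-}" ]; then
    NPROCS=$(count_slots "$PBS_NODEFILE")
    NFRAGS=$(( NPROCS - 2 ))   # assumption: one fragment per worker
    echo "processes: $NPROCS, database fragments: $NFRAGS"
fi
```

The resulting values could then be passed as -N $NFRAGS to mpiformatdb and -n $NPROCS to mpiexec, so that changing the nodes and ppn request does not require editing the commands.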
