RepeatMasker

"RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). On average, almost 50% of a human genomic DNA sequence currently will be masked by the program." (http://www.repeatmasker.org/)

Availability & Restrictions

RepeatMasker is available to all OSC users without restriction.

The following versions of RepeatMasker are available on OSC systems:

Version   Glenn   Oakley
2.1       X

Usage

Set-up

On the Glenn Cluster, RepeatMasker is accessed by executing the following commands:

module load biosoftw
module load RepeatMasker

RepeatMasker will be added to the user's PATH and can be run with the command:

RepeatMasker [-options] <seqfile(s) in fasta format>

Options

-h(elp)
      Detailed help
      Default settings are for masking all types of repeats in a primate sequence.
-pa(rallel) [number]
      The number of processors to use in parallel (only works for batch files or sequences over 50 kb)
-s    Slow search; 0-5% more sensitive, 2-3 times slower than default
-q    Quick search; 5-10% less sensitive, 2-5 times faster than default
-qq   Rush job; about 10% less sensitive, 4 to >10 times faster than default (quick searches are fine under most circumstances)

Repeat options

-nolow /-low
      Does not mask low_complexity DNA or simple repeats
-noint /-int
      Only masks low-complexity/simple repeats (no interspersed repeats)
-norna
      Does not mask small RNA (pseudo) genes
-alu
      Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)
-div [number]
      Masks only those repeats < x percent diverged from consensus seq
-lib [filename]
      Allows use of a custom library (e.g. from another species)
-cutoff [number]
      Sets cutoff score for masking repeats when using -lib (default 225)
-species <query species>
      Specify the species or clade of the input sequence. The species name must be a valid NCBI Taxonomy Database species name and be contained in the RepeatMasker repeat database. Some examples are:
      -species human
      -species mouse
      -species rattus
      -species "ciona savignyi"
      -species arabidopsis
      Other commonly used species: mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu, danio, "ciona intestinalis", drosophila, anopheles, elegans, diatoaea, artiodactyl, arabidopsis, rice, wheat, and maize
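
The search and repeat options above can be combined on one command line. The following is a minimal sketch, not a recommended setting: the file name query.fa is a hypothetical placeholder, and the command is only executed if RepeatMasker is on the PATH and the file exists. Otherwise the command line is just printed.

```shell
# Hypothetical input file; replace with your own FASTA file.
query="query.fa"

# Quick search (-q) against mouse repeats (-species mouse), leaving
# low-complexity DNA and simple repeats unmasked (-nolow).
set -- RepeatMasker -q -species mouse -nolow "$query"
cmdline="$*"

# Show the command that would be run.
echo "$cmdline"

# Run only when the module is loaded and the input file exists.
if command -v RepeatMasker >/dev/null 2>&1 && [ -f "$query" ]; then
    "$@"
fi
```

Swapping -q for -s or -qq trades speed against sensitivity as described in the option list.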

Contamination options

-is_only
      Only clips E. coli insertion elements out of fasta and .qual files
-is_clip
      Clips IS elements before analysis (default: IS only reported)
-no_is
      Skips bacterial insertion element check
-rodspec
      Only checks for rodent specific repeats (no repeatmasker run)
-primspec
      Only checks for primate specific repeats (no repeatmasker run)

Running options

-gc [number]
      Use matrices calculated for 'number' percentage background GC level
-gccalc
      RepeatMasker calculates the GC content even for batch files/small seqs
-frag [number]
      Maximum sequence length masked without fragmenting (default 40000, 300000 for DeCypher)
-maxsize [nr]
      Maximum length for which IS- or repeat clipped sequences can be produced (default 4000000). Memory requirements go up with higher maxsize.
-nocut
      Skips the steps in which repeats are excised
-noisy
      Prints search engine progress report to screen (defaults to .stderr file)
-nopost
      Do not postprocess the results of the run (i.e. do not call ProcessRepeats).
      NOTE: This option should only be used when ProcessRepeats will be run manually on the results.

Output options

-dir [directory name]
      Writes output to this directory (default is query file directory, "-dir ." will write to current directory).
-a(lignments)
      Writes alignments in .align output file; (not working with -wublast)
-inv
      Alignments are presented in the orientation of the repeat (with option -a)
-lcambig
      Outputs ambiguous DNA transposon fragments using a lower case name.  All other repeats are listed in upper case. Ambiguous fragments match multiple repeat elements and can only be called based on flanking repeat information.
-small
      Returns complete .masked sequence in lower case
-xsmall
      Returns repetitive regions in lowercase (rest capitals) rather than masked
-x    Returns repetitive regions masked with Xs rather than Ns
-poly
      Reports simple repeats that may be polymorphic (in file.poly)
-source
      Includes for each annotation the HSP "evidence". Currently this option is only available with the "-html" output format listed below.
-html
      Creates an additional output file in xhtml format.
-ace
      Creates an additional output file in ACeDB format
-gff
      Creates an additional Gene Feature Finding format output
-u    Creates an additional annotation file not processed by ProcessRepeats
-xm   Creates an additional output file in cross_match format (for parsing)
-fixed
      Creates an (old style) annotation file with fixed width columns
-no_id
      Leaves out final column with unique ID for each element (was default)
-e(xcln)
      Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in the query
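
As an illustration of the output options, the sketch below sends all output to a separate directory, soft-masks repeats in lower case, and requests the alignment file. The directory name rm_out is a hypothetical choice, and the output file names are an assumption based on the suffixes mentioned in the option descriptions (.masked, .align, .tbl) appended to the query file name. The directory is created up front in case RepeatMasker does not create it.

```shell
# File name taken from the example job on this page; outdir is a hypothetical name.
query="NC_008253.fna"
outdir="rm_out"

# -dir sends all output to $outdir, -xsmall lowercases repeats instead of
# replacing them with Ns, and -a also writes the alignments.
set -- RepeatMasker -dir "$outdir" -xsmall -a "$query"
cmdline="$*"
echo "$cmdline"

# Files the run is expected to leave behind (assumed naming: query name plus suffix).
base=$(basename "$query")
masked="$outdir/$base.masked"
align="$outdir/$base.align"
table="$outdir/$base.tbl"

# Run only when the module is loaded and the input file exists.
if command -v RepeatMasker >/dev/null 2>&1 && [ -f "$query" ]; then
    mkdir -p "$outdir"
    "$@"
    ls -l "$masked" "$align" "$table"
fi
```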

Example

#PBS -N RepeatMasker_test
#PBS -l walltime=4:00:00
#PBS -l nodes=1:ppn=4

module load biosoftw
module load RepeatMasker
cp /usr/local/biosoftw/bowtie-0.12.7/genomes/NC_008253.fna .
RepeatMasker -pa 4 NC_008253.fna
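
The value given to -pa should match the number of processors requested with ppn. One way to keep the two in step, sketched here under the assumption that the batch system sets PBS_NODEFILE with one line per allocated processor slot, is to count those lines inside the job script. The variable names nprocs and cmdline are illustrative only.

```shell
# Count the processor slots assigned to this job; fall back to 1 when
# the script is run outside a batch job (PBS_NODEFILE unset or unreadable).
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    nprocs=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
else
    nprocs=1
fi

# Same run as the example job, with -pa taken from the allocation.
set -- RepeatMasker -pa "$nprocs" NC_008253.fna
cmdline="$*"
echo "$cmdline"

# Run only when the module is loaded and the input file is present.
if command -v RepeatMasker >/dev/null 2>&1 && [ -f NC_008253.fna ]; then
    "$@"
fi
```

With nodes=1:ppn=4 as in the job above, this would be expected to produce the same command line as the fixed -pa 4.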

Errors

The following commands result in errors:  RepeatMasker -w, RepeatMasker -de, RepeatMasker -e.
