Supported by grant #IIS-1017621 

Important Notes

  • VNTRseek is a computational pipeline for the detection of VNTRs (variable number tandem repeats).
  • VNTRView is a separate, web-based, frontend for viewing the results of a VNTRseek run.
  • Only CentOS 6 and Ubuntu 12.10 and up have been tested.
  • Only a Linux 64-bit version of our software is available at this time.
  • Currently, the default install path is valid only on UNIX-like platforms.

Requirements

You will need a fast computer with plenty of RAM and disk space. This pipeline is very CPU and IO intensive, and will require plenty of system memory and space for output. We recommend a machine with at least 8 modern CPUs and at least 32GB RAM.

The following programs are required for the VNTRseek pipeline to run, with the minimum version shown:

  • MySQL client 5.0.95 or higher
  • Perl 5.8.8 or higher

A MySQL server is required, but can be hosted on a remote machine.

The Perl DBI and DBD::mysql modules are required for interacting with a MySQL server.
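The Perl requirements above can be checked before installing. The following is only a sketch: the function names are ours and not part of VNTRseek, and version_ge relies on GNU sort -V. The MySQL client version can be compared by hand from the output of mysql --version.

```shell
#!/bin/sh
# version_ge A B: succeeds when dotted version A is greater than or equal to B.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_perl: compares the Perl found on PATH against the 5.8.8 minimum.
check_perl() {
    perl_version=$(perl -e 'printf "%vd", $^V') || return 1
    version_ge "$perl_version" 5.8.8
}

# check_perl_modules: succeeds when both DBI and DBD::mysql can be loaded.
check_perl_modules() {
    perl -MDBI -MDBD::mysql -e 1 2>/dev/null
}

# Example:
#   check_perl         || echo "Perl 5.8.8 or higher is required" >&2
#   check_perl_modules || echo "DBI or DBD::mysql is missing" >&2
```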

VNTRView requires:

  • Apache 2.2.0 or higher
  • PHP 5.1.6 or higher
  • GD Graphics Library (http://www.libgd.org)

The GD graphics library is available with most Linux distributions. The development packages may be called libgd-dev or gd-devel.

Apart from Apache, VNTRView has been known to work with nginx, but we do not officially support it.

VNTRseek Installation

To install the VNTRseek pipeline, either download the gzipped binary archive from our download page (only a Linux 64-bit version is available at this time), or download the source archive and compile it by following the directions below.

Note: TRF is also required, but is downloaded during installation. If for some reason the download fails, you can download it manually from the TRF homepage and save it as trf407b-ngs.linux.exe in the build directory (see below).

Installation requires CMake (http://www.cmake.org/), version 2.8 or higher.

On Ubuntu, this can be installed using:

sudo apt-get install cmake

On Red Hat 6/CentOS 6, run (as root):

yum install cmake28

On Fedora (13 and up), run (as root):

yum install cmake

On Arch Linux, run:

pacman -Sy cmake

Additionally you will need GCC version 4.1.2 or higher.

To install:

tar xzvf vntrseekN.NN.tar.gz
cd vntrseekN.NNsrc
mkdir build
cd build
cmake ..     # may be cmake28 on some systems
make install # or sudo make install, if needed

Note the space and two dots after "cmake".
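Because the CMake binary may be called cmake or cmake28 depending on the system, a build script can pick whichever one is present. This is a minimal sketch; first_available is a helper name of our own.

```shell
#!/bin/sh
# first_available NAME...: prints the first listed command that exists on PATH.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example, run from the build directory:
#   CMAKE=$(first_available cmake cmake28) || { echo "CMake not found" >&2; exit 1; }
#   "$CMAKE" ..
#   make install
```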

By default, this will install the pipeline to /usr/local/vntrseekN.NN (e.g., /usr/local/vntrseek1.08).

If you would like to choose a different installation prefix, simply run:

cmake -DCMAKE_INSTALL_PREFIX=<full path> ..

For example, to install to your home directory, ${HOME}/vntrseekN.NN, use:

cmake -DCMAKE_INSTALL_PREFIX=${HOME} ..

To complete the installation, copy the three prepared hg19 reference files from the .data.tar.gz file (see the downloads page) into the installation folder. To use another genome, you will need to prepare the data files using TRDB (https://tandem.bu.edu/cgi-bin/trdb/trdb.exe). Also set the MySQL credentials in the global vs.cnf file (optionally, they can be set with command line options instead). The MySQL user must be granted CREATE DATABASE privileges.

If you installed this pipeline as root and are creating an INDIST file, you may need to run the pipeline as root unless you give your user permission to write to the installation directory.

If you installed to a non-standard location, you may need to add /path/to/prefix/bin to your PATH variable (e.g., if your prefix was /opt, you will need to have /opt/bin in your PATH).
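One way to add the directory to PATH, for example from a shell startup file, is sketched below. The prepend_path helper is our own; it skips the change when the directory is already listed.

```shell
#!/bin/sh
# prepend_path DIR: puts DIR at the front of PATH unless it is already there.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Example for an installation prefix of /opt:
#   prepend_path /opt/bin
```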

In environments where one central installation is used by a team, we suggest assigning all users to the same group. Read permission on the global vs.cnf file can then be granted to members of that group.

IMPORTANT: for correct execution, please add these lines to the [mysqld] section of the my.cnf file and restart the mysql process:

innodb_buffer_pool_size=1G
innodb_additional_mem_pool_size=20M
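To confirm that both settings are present before restarting MySQL, a quick check along these lines may help. This is a sketch only: has_innodb_settings is our own helper, the location of my.cnf varies by distribution, and the check does not verify that the lines are inside the [mysqld] section.

```shell
#!/bin/sh
# has_innodb_settings FILE: succeeds when FILE sets both InnoDB options
# (with any value). It does not check which section they appear in.
has_innodb_settings() {
    grep -Eq '^[[:space:]]*innodb_buffer_pool_size[[:space:]]*=' "$1" &&
    grep -Eq '^[[:space:]]*innodb_additional_mem_pool_size[[:space:]]*=' "$1"
}

# Example (the path is an assumption):
#   has_innodb_settings /etc/my.cnf || echo "InnoDB settings missing" >&2
```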

VNTRView Installation

To install VNTRView:

tar xzvf vntrviewN.NNsrc.tar.gz
cd vntrviewN.NNsrc
mkdir build
cd build
module load cmake # only needed on some systems
cmake -DCMAKE_INSTALL_PREFIX=/path/to/web/document/root .. # for example, /var/www/html, also may be cmake28 on some systems
make install      # Or sudo make install, if needed

To complete the installation, set the MySQL login and password in index.php and result.php. Then change the ownership of these files to the web server user and set their permissions to 600, to prevent other users on the system from learning your MySQL password.
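The ownership and permission change can be scripted as in the sketch below. The protect_credentials function is our own; the web server user name depends on your system (commonly www-data on Ubuntu and apache on CentOS), and changing ownership to another user normally requires root.

```shell
#!/bin/sh
# protect_credentials USER DIR: makes USER the owner of the two VNTRView files
# that hold the MySQL login and removes all access for everyone else.
protect_credentials() {
    owner=$1
    webdir=$2
    for f in "$webdir/index.php" "$webdir/result.php"; do
        chown "$owner" "$f" && chmod 600 "$f" || return 1
    done
}

# Example (user name and directory are assumptions):
#   sudo sh -c '. ./protect.sh; protect_credentials www-data /var/www/html/vntrview'
```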

Uninstalling

On UNIX-like systems, simply run:

xargs rm < install_manifest.txt # or sudo xargs rm < install_manifest.txt

from the build directory you created above. The directory will remain, however, so you will not lose any reference files.

Preparing your input data

VNTRseek requires that your input files be named in a particular format. It also expects them to be gzip-compressed.

For FASTA files, files should be named with the following format:

fasta_filename.gz

and FASTQ files with format:

fastq_filename.gz

where filename is any text (the original name of the file, or any string to distinguish between files).

When running on multi-core machines such as clusters, you may wish to split your input files into many smaller files (e.g., 1 million reads per file).
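Splitting and naming can be combined in one step. The sketch below is not part of VNTRseek: split_fastq is our own helper, it assumes an uncompressed FASTQ file with exactly four lines per read (no wrapped sequence lines), and it relies on the -d option of GNU split.

```shell
#!/bin/sh
# split_fastq INPUT OUTDIR READS_PER_FILE
# Writes gzip-compressed chunks named fastq_<name>_NNN.gz into OUTDIR.
# Assumes four lines per read, so each chunk gets READS_PER_FILE * 4 lines.
split_fastq() {
    input=$1
    outdir=$2
    reads=$3
    base=$(basename "$input")
    base=${base%.*}                      # drop the last extension, if any
    mkdir -p "$outdir" || return 1
    split -l $((reads * 4)) -d -a 3 "$input" "$outdir/fastq_${base}_" || return 1
    for chunk in "$outdir"/fastq_"${base}"_[0-9][0-9][0-9]; do
        gzip "$chunk" || return 1
    done
}

# Example: 1 million reads per output file
#   split_fastq reads.fq /bfdisk/myrun 1000000
```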

At the moment, FASTA and FASTQ are the only formats supported. If you require support for another format, please contact us, or try adding support yourself via our git repository (see the Downloads page).

Running the pipeline (master script header)

# MASTER SCRIPT TO RUN THE TR VARIANT SEARCH PIPELINE
#
# DO NOT USE SPACES IN PATHS AND DO NOT USE DOTS (.) OR HYPHENS (-) IN DBSUFFIX
#
# command line usage example:
#  vntrseek N K --dbsuffix dbsuffix
#       where N is the start step to execute (0 is the first step)
#       and K is the end step (19 is the last step)
#
# example:
#  vntrseek 0 19 --dbsuffix run1 --server orca.bu.edu --nprocesses 8 --html_dir /var/www/html/vntrview 
#                --fasta_dir /bfdisk/watsontest --output_root /smdisk --tmpdir /tmp &
#
# special commands:
#  vntrseek 100 --dbsuffix dbsuffix
#       clear error (because of temporary files, don't run step 3 without step 2 and step 13-17 without step 12)
#  vntrseek 99 --dbsuffix dbsuffix
#       return next step that needs to be run (this can be
#       used for multi/single processor execution flow control used with
#       advanced cluster script)
#  vntrseek 100 N --dbsuffix dbsuffix
#       clear error and set NextRunStep to N (for advanced cluster script)
#
# IMPORTANT: for correct execution, please add these lines to
# [mysqld] section of my.cnf file and restart mysql process:
#
# innodb_buffer_pool_size=1G
# innodb_additional_mem_pool_size=20M
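The naming restrictions in the header (no spaces in paths, no dots or hyphens in DBSUFFIX) can be checked in a wrapper script before the pipeline is launched. This is a sketch; both function names are ours, and rejecting an empty suffix is an extra sanity check that the header does not mention.

```shell
#!/bin/sh
# valid_dbsuffix NAME: fails when NAME is empty or contains a dot or a hyphen.
valid_dbsuffix() {
    case $1 in
        ''|*.*|*-*) return 1 ;;
    esac
    return 0
}

# path_has_no_spaces PATH: fails when PATH contains a space character.
path_has_no_spaces() {
    case $1 in
        *' '*) return 1 ;;
    esac
    return 0
}

# Example:
#   valid_dbsuffix run1 && path_has_no_spaces /smdisk || exit 1
```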

Options

Usage: vntrseek startstep endstep [OPTIONS]

Example:

#To tell the master script what step to execute. The first step is 0, last step is 19.
vntrseek 0 19

Options:
  --HELP                        prints this help message
  --LOGIN                       mysql login
  --PASS                        mysql pass
  --HOST                        mysql host (default localhost)
  --NPROCESSES                  number of processors on your system
  --MIN_FLANK_REQUIRED          minimum required flank on both sides for a read TR to be considered (default 10)
  --MAX_FLANK_CONSIDERED        maximum flank length used in flank alignments, set to big number to use full flank (default 50)
  --MIN_SUPPORT_REQUIRED        minimum number of mapped reads which agree on copy number to call an allele (default 2)
  --DBSUFFIX                    suffix for database name
  --SERVER                      server name, used for html generating links
  --STRIP_454_KEYTAGS           for 454 platform, strip leading 'TCAG', 0/1 (default 0)
  --IS_PAIRED_READS             data is paired reads, 0/1 (default 0)
  --HTML_DIR                    html directory (must be writable and executable!)
  --FASTA_DIR                   input data directory (plain or gzipped fasta/fastq files)
  --OUTPUT_ROOT                 output directory (must be writable and executable!)
  --TMPDIR                      temp (scratch) directory (must be writable!)
  --REFERENCE_FILE              reference profile file (default set in global config file)
  --REFERENCE_SEQ               reference sequence file (default set in global config file)
  --REFERENCE_INDIST            reference indistinguishables file (default set in global config file)
  --REFERENCE_INDIST_PRODUCE    generate a file of indistinguishable references, 0/1 (default 0)
  --REFS_TOTAL                  total number of reference TRs prior to filtering (default set in global config file)


ADDITIONAL USAGE:

  vntrseek 100                  clear error (because of temporary files, don't run step 3 without step 2 and step 13-17 without step 12)
  vntrseek 100 N                clear error and set NextRunStep to N (0-19, this is only when running on a cluster 
                                using the advanced cluster script that checks for NextRunStep)

Step-by-step running instructions

DON'T FORGET TO ADD THE 2 LINES (about innodb, mentioned above) into the mysql configuration!

  1. Make sure there is enough space in the output folder (at least half as much as the compressed reads take)
  2. Although mysql credentials could be set with command line parameters, we suggest setting them in the global vs.cnf (in the installation folder)
  3. Running example:
vntrseek 0 19 --dbsuffix run1 --server orca.bu.edu --nprocesses 8 --html_dir /var/www/html/vntrview --fasta_dir /bfdisk/watsontest --output_root /smdisk --tmpdir /tmp

This will run all steps of the pipeline (0-19).
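The disk space rule from step 1 (at least half the size of the compressed reads) can be checked with standard tools. A minimal sketch follows; all function names are ours, and free_kb assumes the filesystem name reported by df contains no spaces.

```shell
#!/bin/sh
# space_ok READS_KB FREE_KB: succeeds when FREE_KB is at least half of READS_KB.
space_ok() {
    [ "$2" -ge $(( $1 / 2 )) ]
}

# dir_size_kb DIR: total size of DIR in kilobytes.
dir_size_kb() {
    du -sk "$1" | cut -f1
}

# free_kb DIR: kilobytes available on the filesystem that holds DIR.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# enough_output_space FASTA_DIR OUTPUT_ROOT: applies the rule to real directories.
enough_output_space() {
    space_ok "$(dir_size_kb "$1")" "$(free_kb "$2")"
}

# Example:
#   enough_output_space /bfdisk/watsontest /smdisk || echo "Not enough space" >&2
```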

Note that a configuration file will be created for this run at ~/vs.[DBSUFFIX].cnf. If some steps need to be rerun in the future, only --dbsuffix needs to be specified.

Restarting: If your job was killed or died due to errors, fix the problem, clear the error, and then restart from the failed step. Because of temporary files, don't run step 3 without step 2, or steps 13-17 without step 12. To clear the error, run:

vntrseek 100 --dbsuffix run1
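The temporary-file rule can be captured in a small helper that maps a failed step to the step from which it is safe to restart. This reflects our reading of the rule (restart at 2 instead of 3, and at 12 instead of any of 13-17); safe_restart_step is a hypothetical name, not a VNTRseek command.

```shell
#!/bin/sh
# safe_restart_step N: prints the step to restart from when step N failed.
# Step 3 depends on temporary files from step 2, and steps 13-17 depend on
# temporary files from step 12, so those cases are moved back.
safe_restart_step() {
    step=$1
    if [ "$step" -eq 3 ]; then
        echo 2
    elif [ "$step" -ge 13 ] && [ "$step" -le 17 ]; then
        echo 12
    else
        echo "$step"
    fi
}

# Example, after a failure in step 14:
#   vntrseek 100 --dbsuffix run1
#   vntrseek "$(safe_restart_step 14)" 19 --dbsuffix run1
```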

Viewing results

After successful completion of all steps of the pipeline (0-19), a number of output files are produced in /OUTPUT_ROOT/VNTR_DBSUFFIX/data_out_clean/result, most notably the spanN VCF file (N can be adjusted in the parameters; the default is 2). The spanNALL VCF file contains all TRs and their observed alleles. Additionally, PHP output can be viewed with the optional VNTRView package installed in the web root directory. The VNTRView package is also available on the download page.

If VNTRView is installed, you can monitor pipeline progress at http://yourserver/vntrview/result.php?db=VNTRPIPE_yourrun (click the + sign to expand statistics panel and look at the step completion times and dates).

The VCF file also contains links to the viewer, but these point to individual references, e.g., http://orca.bu.edu/vntrview/index.php?db=VNTRPIPE_watson_ref230306&ref=-175343809&isref=1&istab=1&ispng=1&rank=3

VNTRView's expanded statistics page contains links to VCF files, distribution, and LaTeX, as well as some others.

Testing with control data

The file samplereads.tar.gz is meant to be used as a control test for the VNTRseek pipeline. Inside, the watsontest folder contains 66 selected reads from the Watson genome. Use this folder as the FASTA_DIR input. After running the pipeline, run diff on data_out_clean/result/report.span2.vcf from your output and the report.span2.vcf included in samplereads.tar.gz; the results should be identical (other than the date). There should be 10 VNTRs found.
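Since the two files are expected to differ in the date, the comparison can filter that line out first. The sketch below assumes the date is stored in a standard VCF ##fileDate header line, which we have not verified for VNTRseek output, and that each non-header line is one record. Both function names are ours.

```shell
#!/bin/sh
# vcf_same_ignoring_date A B: runs diff on two VCF files after dropping
# '##fileDate' header lines (assumed to be where the date is stored).
vcf_same_ignoring_date() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
    grep -v '^##fileDate' "$1" > "$tmp_a"
    grep -v '^##fileDate' "$2" > "$tmp_b"
    diff "$tmp_a" "$tmp_b"
    status=$?
    rm -f "$tmp_a" "$tmp_b"
    return $status
}

# count_records FILE: number of lines in FILE that are not header lines.
count_records() {
    grep -vc '^#' "$1"
}

# Example:
#   vcf_same_ignoring_date data_out_clean/result/report.span2.vcf report.span2.vcf
#   count_records data_out_clean/result/report.span2.vcf
```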

Running on a Cluster

These instructions are for vntrseek1.08 and higher to be used with clusters that use the Open Grid Scheduler/Grid Engine batch-queuing system. This requires files "qsub_test_advanced.sh" and "master_for_qstub_test_advanced.sh". These files could possibly be adapted for use with other scheduling systems.

The advantage of the advanced cluster script is that it runs database insertion steps in single-processor mode and multi-processor steps in multi-processor mode, so that CPU time is not wasted. If this is not a concern, use the simple scripts (qsub_test.sh and master_for_qstub_test.sh).


  • FIRST make sure pipeline is installed and runs correctly by following instructions here: http://orca.bu.edu/vntrseek/manual.php

  • DON'T FORGET TO ADD THE 2 LINES into the mysql configuration (about innodb, mentioned above)!

  1. Make sure there is enough space in the output folder (at least half as much as the compressed reads take).
  2. Although MySQL credentials can be set with command line parameters, we suggest setting them in the global vs.cnf (in the installation folder).
  3. Run

    vntrseek 0 --dbsuffix run1 --server orca.bu.edu --nprocesses 8 --html_dir /var/www/html/vntrview --fasta_dir /bfdisk/watsontest --output_root /smdisk --tmpdir /tmp

    This must be run first to create the config file with all the variables.

  4. Run

     qsub_test_advanced.sh DBSUFFIX NPROCESSORS

Note that qsub_test_advanced.sh requests 40 hours of run time (in 2 places). If you are sure your run will be short, you may want to lower this value, which might get the job scheduled sooner. However, if the run goes over the requested time, your job will be killed.

Checking status:

qstat -u yourname

Also there should be .e[pid] files for each executed step in the folder.

Also, if vntrview is installed, monitor at http://yourserver/vntrview/result.php?db=VNTRPIPE_yourrun (click the + sign to expand statistics panel and look at the step completion times and dates).

Restarting:

If your job was killed or died for some reason, clear the error (because of temporary files, don't run step 3 without step 2, or steps 13-17 without step 12):

vntrseek 100 --dbsuffix run1

and restart by running

qsub_test_advanced.sh DBSUFFIX NPROCESSORS

It should automatically resume from the correct step.

If you are using the advanced script and need to rerun from a certain (already completed) step, NextRunStep needs to be reset before restarting. Use

vntrseek 100 N --dbsuffix run1

where N is the step to be set as the next step (0-19).


Send any questions or comments to: Yozen Hernandez.

Last updated: July 19, 2015