DNA tandem repeats (TRs) are ubiquitous genome features which consist of two or more
adjacent copies of an underlying pattern sequence. The copies may be identical or approximate.
TRs can be highly mutable with respect to the number of adjacent copies (copy number), with
mutation rates orders of magnitude higher than those of SNPs. TRs are known to directly cause
more than a dozen human neurological syndromes, are associated with other diseases, may have
regulatory functions, and are useful genetic markers due to their high mutation rates.
Whole genome sequencing provides the raw material for identifying and studying, for the first
time on a genome-wide scale, those TRs which are inherently variable. In this paper we present
the VNTRseek pipeline for discovery of minisatellite TR variants, and its application to
454 high-throughput sequencing data from the Watson and Khoisan genomes. A minisatellite
TR, for the purposes of this study, has a pattern size of at least 7 nucleotides. A VNTR, or
Variable Number of Tandem Repeats, is a TR locus for which more than one allele has been observed,
with each allele having a distinct copy number. Our pipeline maps reads to a set of reference
TRs and then identifies those references which appear to be VNTRs.
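
The VNTR definition above can be illustrated with a minimal sketch. It is not the VNTRseek implementation: the function names, the input layout (a reference copy number plus the copy numbers seen in reads mapped to that reference), and the min_support threshold are all assumptions made for illustration.

```python
from collections import Counter


def variant_copy_numbers(reference_copies, read_copies, min_support=2):
    """Return the non-reference copy numbers seen in at least `min_support` reads.

    `min_support` is a hypothetical noise filter, not a documented
    VNTRseek parameter.
    """
    counts = Counter(read_copies)
    return sorted(
        copies
        for copies, n_reads in counts.items()
        if copies != reference_copies and n_reads >= min_support
    )


def is_vntr(reference_copies, read_copies, min_support=2):
    """A reference TR looks like a VNTR if any supported variant allele exists."""
    return bool(variant_copy_numbers(reference_copies, read_copies, min_support))


# Hypothetical data: reference TR id -> (reference copy number, read copy numbers).
mapped = {
    "TR1": (3, [3, 3, 4, 4, 3]),  # two reads support a 4-copy allele
    "TR2": (5, [5, 5, 5, 6]),     # a single 6-copy read falls below the threshold
}

vntrs = [tr_id for tr_id, (ref, reads) in mapped.items() if is_vntr(ref, reads)]
```

Requiring more than one supporting read is one simple way to keep a single misaligned or error-containing read from producing a VNTR call; a real pipeline would likely apply further filters.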