Polymorphisms are differences in genomic DNA sequence that occur naturally in a population. A single nucleotide substitution is called a single nucleotide polymorphism (SNP). SNPs are common but minute variations that occur in human DNA at a frequency of about one every 1,000 bases. SNPs are established genetic markers that aid in the identification of loci affecting quantitative traits and/or disease in a wide variety of eukaryotic species. The recent completion of a single version of the human genome has provided the substrate for direct comparison of individuals in both health and disease. Ideally, to better understand the genetic contributions to severe diseases, one would obtain the entire genome sequence of every affected individual for comparison with unaffected control groups. In reality, a strategy that is feasible with today's resources is the re-sequencing of a large set of appropriate candidate genes in individuals with a given disease to screen for causative mutations. Such an approach has proved fruitful in the investigation of different diseases [2].
In addition, SNPs have been used extensively to study the evolution of microbial populations. Such efforts have largely been confined to multi-locus sequence typing of clinical isolates of species such as Neisseria meningitidis and Staphylococcus aureus [3]. However, the recent application of random shotgun sequencing to environmental samples [4,5,6] makes possible more extensive SNP analysis of co-occurring and co-evolving microbial populations. An intriguing finding of the Tyson et al. study [4] was the mosaic nature of the genomes of an archaeal population, inferred to be the result of extensive homologous recombination among three ancestral strains. This observation was based on a manual analysis of a small subset of the data (ca. 40,000 base pairs) and remains to be verified across the whole genome. Tools to analyze this type of data are in their infancy.
Manipulation, cross-referencing, and haplotype viewing of SNP data are essential for quality assessment and identification of variants associated with genetic disease. The display and interpretation of large genotype data sets can be simplified by using a graphical display.
Several software tools have been developed to assist researchers in carrying out this task. A visual genotype (VG2) display [7,2] has proved useful for presenting raw datasets of individuals' genotype data. This format presents all data as an array of samples (rows) x polymorphic sites (columns) and encodes each diallelic polymorphism according to a general color scheme. The array format allows one to inspect the data visually across both individuals' diplotypes and polymorphic sites to make comparisons.
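The samples x sites array described above can be sketched in a few lines. This is a minimal illustration, not the VG2 implementation: the three-state encoding (homozygous for the common allele, heterozygous, homozygous for the rare allele), the letter codes standing in for colors, and the sample records are all assumptions introduced here.

```python
from collections import Counter

# Hypothetical genotype records: (site position, sample ID, allele 1, allele 2).
RECORDS = [
    (101, "S1", "A", "A"),
    (101, "S2", "A", "G"),
    (101, "S3", "G", "G"),
    (101, "S4", "A", "A"),
    (250, "S1", "C", "T"),
    (250, "S2", "C", "C"),
    (250, "S3", "C", "C"),
    # S4 has no genotype call at site 250.
]


def genotype_matrix(records):
    """Build a samples (rows) x sites (columns) array of genotype codes.

    Codes (assumed stand-ins for colors): "C" homozygous common,
    "H" heterozygous, "R" homozygous rare, "." missing call.
    """
    sites = sorted({site for site, _, _, _ in records})
    samples = sorted({sample for _, sample, _, _ in records})

    # Count alleles per site to decide which allele is the common one.
    allele_counts = {site: Counter() for site in sites}
    for site, _, a1, a2 in records:
        allele_counts[site].update([a1, a2])
    common = {site: allele_counts[site].most_common(1)[0][0] for site in sites}

    calls = {(site, sample): (a1, a2) for site, sample, a1, a2 in records}

    def encode(site, sample):
        if (site, sample) not in calls:
            return "."
        a1, a2 = calls[(site, sample)]
        if a1 != a2:
            return "H"
        return "C" if a1 == common[site] else "R"

    matrix = [[encode(site, sample) for site in sites] for sample in samples]
    return samples, sites, matrix


samples, sites, matrix = genotype_matrix(RECORDS)
for sample, row in zip(samples, matrix):
    print(sample, " ".join(row))
```

A graphical front end would map each code to a color; keeping the encoding separate from the rendering makes it easy to reorder rows, for example after clustering.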
Another program, ViewGene [8], was developed as a flexible tool that constructs an assembly reference scaffold that can be viewed through a simple graphical interface. Polymorphisms generated from many sources can be added to this scaffold, with a variety of options to control what is displayed. Large amounts of polymorphism data can be organized so that patterns and haplotypes can be readily discerned. A further software system for automated and visual analysis of functionally annotated haplotypes, HapScope [9], displays genomic structure with haplotype information in an integrated environment, providing alternative views for assessing genetic and functional correlation.
Although these tools provide a number of valuable options for the scientist, some needs have not been addressed. VG2 uses simple but effective representations to show genotype data with SNP classification and organizes the data using hierarchical clustering. The major drawbacks of this tool are its static display, its lack of details on demand, and its inability to map SNPs to genomic structure. ViewGene provides a simple interface for analyzing sequence data to locate regions favorable to re-sequencing but is limited in its capabilities for post-processing of SNP data. HapScope combines valuable haplotype analysis methods with interactive visualization, but its major focus is the presentation of results from haplotype analysis. Our goal was to develop exploration tools for the discovery of disease-related mutations from re-sequencing data.
Most experiments in SNP research are exploratory in nature, and it has become essential to provide the scientific community with advanced SNP exploration tools. With SNP data growing as a result of large-scale gene re-sequencing and ecogenomics projects, there is a need to overcome the limitations of current SNP analysis tools. We present an interactive visualization tool that aids scientists in generating hypotheses from large-scale SNP data.
SNP-VISTA is implemented as a stand-alone Java application using JBuilder (http://www.borland.com/us/products/jbuilder/index.html [10]) as a development environment. SNP-VISTA uses clustering software, Levenshtein (http://odur.let.rug.nl/~kleiweg/levenshtein/index.html [11]) which is bundled with the package. Automatic recombination points are calculated using a C++ program that can be invoked from the Java application.
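For readers unfamiliar with the measure that gives the bundled clustering package its name, the following is a minimal sketch of the Levenshtein (edit) distance and of a pairwise distance table of the kind a clustering step could consume. It is not the RuG/L04 code, and how SNP-VISTA encodes its data for that package is not described here; the per-sample allele strings and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (unit cost for insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def distance_table(profiles):
    """Pairwise distances between named allele strings, keyed by sorted name pairs."""
    names = sorted(profiles)
    return {
        (x, y): levenshtein(profiles[x], profiles[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


# Hypothetical allele strings, one character per polymorphic site.
PROFILES = {"S1": "AACT", "S2": "AGCT", "S3": "GGTT"}
table = distance_table(PROFILES)
for pair, dist in table.items():
    print(pair, dist)
```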
SNP-VISTA is available in two versions, GeneSNP-VISTA and EcoSNP-VISTA, each tailored to a specific application. We describe the two versions in the next two sections.
We use the ABO blood group gene (transferase A, alpha 1-3-N-acetylgalactosaminyltransferase; transferase B, alpha 1-3-galactosyltransferase) from the finished gene lists of SeattleSNPs (http://pga.mbt.washington.edu/ [12]) to demonstrate our tool.
Figure 1. GeneSNP-VISTA screenshot for the ABO blood group gene (transferase A, alpha 1-3-N-acetylgalactosaminyltransferase; transferase B, alpha 1-3-galactosyltransferase).
Our tool requires the following files as input:
<exon/cds><tab><start><tab><end>
If the coding sequence is not specified explicitly, then exons are merged to obtain the coding sequence.
<Site Position><tab><Sample ID><tab><Allele 1><tab><Allele 2>
Sample input files are available on the website http://genome.lbl.gov/vista/GeneSNP-VISTA/ (see [14]).
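The two tab-delimited line formats listed above can be read with a few lines of code. The sketch below is an illustration under stated assumptions, not the SNP-VISTA parser: it assumes the first column holds the literal label "exon" or "cds", that coordinates are inclusive integers, and that "merging" exons means combining overlapping or adjacent exon intervals. All function names are hypothetical.

```python
def parse_features(text):
    """Parse '<exon/cds><tab><start><tab><end>' lines into (kind, start, end)."""
    features = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        kind, start, end = line.split("\t")
        features.append((kind.strip().lower(), int(start), int(end)))
    return features


def coding_intervals(features):
    """Return explicit CDS intervals, or merged exon intervals if no CDS is given."""
    cds = sorted((s, e) for kind, s, e in features if kind == "cds")
    if cds:
        return cds
    merged = []
    for s, e in sorted((s, e) for kind, s, e in features if kind == "exon"):
        if merged and s <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def parse_genotypes(text):
    """Parse '<Site Position><tab><Sample ID><tab><Allele 1><tab><Allele 2>' lines."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        position, sample, allele1, allele2 = line.split("\t")
        records.append((int(position), sample, allele1, allele2))
    return records


# Hypothetical input, for illustration only.
exons_only = "exon\t100\t200\nexon\t150\t300\nexon\t500\t600\n"
print(coding_intervals(parse_features(exons_only)))
print(parse_genotypes("101\tS1\tA\tG\n"))
```

The sample files on the project website remain the authoritative reference for the exact labels and coordinate conventions.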
SNP-VISTA supports the following applications:
We have used the acid mine drainage [4] dataset that is publicly available at http://durian.jgi-psf.org/~eszeto/metag-web/pub/ [18].
Figure 2. EcoSNP-VISTA screenshot of scaffold 1 of the microbial genome of Ferroplasma II (Tyson et al., 2004).
The following files are needed as input:
<exon/cds><tab><start><tab><end>
<Read name><tab><Position>
Sample input files are available at http://genome.lbl.gov/vista/EcoSNP-VISTA/ [19].
The following modifications are made to GeneSNP-VISTA to handle ecogenomics data:
The majority of SNPs obtained from re-sequencing of disease-related genes do not have damaging effects on the structure and function of a protein. It is important to separate such SNPs from causative mutations. GeneSNP-VISTA is an interactive visual tool for efficient analysis of large amounts of SNP data to determine a set of potentially causative mutations. As shown in Figure 1, the information presented about a SNP (type, location on genomic structure, frequency of occurrence, the amino acid change it causes, and conservation of the changed amino acid) allows a scientist to determine whether the SNP is a possible causative mutation. By providing a visually integrated representation of SNP data with genomic structure and protein conservation, GeneSNP-VISTA facilitates the screening of causative mutations from re-sequencing of a large set of appropriate candidate genes in individuals with a given disease.
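One piece of the per-SNP information mentioned above, the amino acid change a coding SNP causes, can be derived from the coding sequence alone. The following is a minimal sketch assuming the standard genetic code, a coding sequence given on the coding strand, and a 0-based offset within it; the function name and the labels "synonymous", "missense" and "nonsense" are introduced here for illustration and are not taken from GeneSNP-VISTA.

```python
# Standard genetic code, enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(coding_seq, offset, alt_base):
    """Classify a substitution at a 0-based offset of a coding sequence.

    Returns (kind, reference amino acid, altered amino acid).
    """
    codon_start = offset - offset % 3
    ref_codon = coding_seq[codon_start:codon_start + 3].upper()
    if len(ref_codon) < 3:
        raise ValueError("offset falls in an incomplete trailing codon")
    within = offset % 3
    alt_codon = ref_codon[:within] + alt_base.upper() + ref_codon[within + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"  # the substitution introduces a stop codon
    else:
        kind = "missense"
    return kind, ref_aa, alt_aa


# Hypothetical coding sequence ATG GCT TGG (Met, Ala, Trp).
SEQ = "ATGGCTTGG"
print(classify_coding_snp(SEQ, 5, "C"))  # third position of GCT
print(classify_coding_snp(SEQ, 3, "A"))  # first position of GCT
print(classify_coding_snp(SEQ, 8, "A"))  # third position of TGG
```

Conservation of the changed amino acid, which the tool also displays, requires a protein alignment and is outside the scope of this sketch.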
Adapting existing computational methods and developing new ones for effective SNP analysis of co-occurring and co-evolving microbial populations from ecogenomics data pose new challenges. Manual analysis (Tyson et al., 2004) led to interesting results, but such an analysis is time-intensive and becomes prohibitive at whole-genome scale. Automatic methods are not yet available for such an analysis. As an alternative, EcoSNP-VISTA provides a visual interface for semi-automatic analysis of SNP data from ecogenomics projects. As shown in Figure 2, a compact color-coded representation of SNP data allows a scientist to detect recombination points manually and to verify automatically calculated recombination points visually. EcoSNP-VISTA provides insight into homologous recombination in microbial populations and has the potential to guide the development of computational methods for such analysis.
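To make the notion of a recombination point concrete, the sketch below scans the alleles of one read across SNP sites and reports the sites at which no single reference strain can explain the read any longer. It is a simple greedy illustration under strong assumptions (known reference strains, one allele per site, no sequencing errors) and is not the algorithm of the C++ program bundled with SNP-VISTA, which the text does not describe. The strain names, allele strings and function name are hypothetical.

```python
def recombination_points(read, strains):
    """Return indices of SNP sites where the read must switch reference strain.

    read    -- string with the read's allele at each SNP site
    strains -- dict mapping strain name to its allele string (same length)
    """
    breakpoints = []
    candidates = set(strains)  # strains consistent with the current segment
    for i, allele in enumerate(read):
        matching = {name for name, seq in strains.items() if seq[i] == allele}
        if not matching:
            continue  # allele matches no strain; ignored in this sketch
        if candidates & matching:
            candidates &= matching  # still explainable by one strain
        else:
            breakpoints.append(i)   # a new segment starts at site i
            candidates = matching
    return breakpoints


# Three hypothetical ancestral strains, one character per SNP site.
STRAINS = {"strain1": "AAAAAA", "strain2": "CCCCCC", "strain3": "GGGGGG"}
print(recombination_points("AAACCC", STRAINS))
print(recombination_points("AAGGCC", STRAINS))
```

Because each segment is extended as far as any single strain allows, the greedy scan yields the fewest strain switches for the read under these assumptions; real data would also need to tolerate sequencing errors and missing calls.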
We have developed SNP-VISTA, a publicly available interactive visualization tool that assists scientists in analyzing re-sequencing data from disease-related genes to discover associated and/or causative alleles, and ecogenomics data to study homologous recombination in microbial populations.
1. SNP-VISTA: An Interactive SNPs Visualization Tool. [http://genome.lbl.gov/vista/snpvista]
2. Rieder M. J., Taylor S. L., Clark A. G., Nickerson D. A.: Sequence variation in the human angiotensin converting enzyme. Nature Genetics, 22, 59-62, 1999.
3. Spratt B. G., Zhang Q., Jones D. M., Hutchison A., Brannigan J. A., Dowson C. G.: Recruitment of a Penicillin-Binding Protein Gene from Neisseria flavescens during the Emergence of Penicillin Resistance in Neisseria meningitidis. PNAS, 86(22), 8988-8992, 1989.
4. Tyson et al.: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, 428, 37-43, 2004.
5. Venter et al.: Environmental genome shotgun sequencing of the
6. Tringe S. G., von Mering C., Kobayashi A., Salamov A. A., Chen K., Chang H. W., Podar M., Short J. M., Mathur E. J., Detter J. C., Bork P., Hugenholtz P., Rubin E. M.: Comparative metagenomics of microbial communities. Science, 308, 554-7, 2005.
7. Nickerson et al.: DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nature Genetics, 19, 233-240, 1998.
8. Kashuk C., SenGupta S., Eichler E., Chakravarti A.: ViewGene: A graphical tool for polymorphism visualization and characterization. Genome Research, 12(2), 333-8, 2002.
9. Zhang J., Rowe W. L., Struewing J. P., Buetow K. H.: HapScope: A software system for automated and visual analysis of functionally annotated haplotypes. Nucleic Acids Research, 30(23), 5213-21, 2002.
10. Borland: JBuilder. [http://www.borland.com/us/products/jbuilder/index.html]
11. RuG/L04 - dialectometrics & cartography. [http://odur.let.rug.nl/~kleiweg/levenshtein/index.html]
12. SeattleSNPs. [http://pga.mbt.washington.edu/]
13. EBI help. [http://www.ebi.ac.uk/help/formats_frame.html]
14. GeneSNP-VISTA. [http://genome.lbl.gov/vista/GeneSNP-VISTA/]
15. Ng P. C., Henikoff S.: Accounting for human polymorphisms predicted to affect protein function. Genome Research, 12, 436-446, 2002.
16. Ramensky V., Bork P., Sunyaev S.: Human non-synonymous SNPs: server and survey. Nucleic Acids Research, 30(17), 3894-3900, 2002.
17. [http://pga.gs.washington.edu/data/abo/abobg.pph-sift.txt]
18. Metagenomics Prototype Web Tools. [http://durian.jgi-psf.org/~eszeto/metag-web/pub/]
19. EcoSNP-VISTA. [http://genome.lbl.gov/vista/EcoSNP-VISTA/]
20. Huber T., Faulkner G., Hugenholtz P.: Bellerophon: A program to detect chimeric sequences in multiple sequence alignments. Bioinformatics, 20(14), 2317-2319, 2004.