The latest version of Varant supports the following features:
Features | Descriptions |
Variant types that the tool annotates | SNPs, Indels & MNPs |
Input file format | VCF |
Output file format | VCF |
VCF Parser | Provides a parser for Varant-annotated VCF files |
Following are the 22 annotations provided by Varant:
1. dbSNP, 1000Genome Minor Allele Frequency (MAF) & ESP (MAF)
2. Clinically significant variants from ClinVar DB
3. GWAS Phenotype
4. Genomic region - Intergenic, Intronic, Exonic & UTR
5. Downstream and upstream gene for intergenic variants
6. Splice Site (Donor/Acceptor)
7. Mutation Type - NonSyn, Syn, StartGain, StartLoss, StopGain, StopLoss, SynStop
8. Codon Usage in Human
9. Exonic splice enhancer / silencer site
10. Flag for variants that span a boundary region, such as Intron-Exon or UTR-CDS
11. Distance of intronic variants from splice sites
12. UTR Functional Motifs
13. miRNA Binding Site
14. Polyphen2, SIFT & CADD prediction
15. Gene-Disease association - OMIM, NCBI-GAD
16. Position Conservation - Gerp++ Score
17. Interpro Domain
18. TFBS
19. eQTL
20. Low complexity region
21. Pseudo autosomal region
22. Capture region
Varant annotations are written to the INFO field of the VCF file in compliance with the VCF file format,
so they can be parsed by any VCF parser.
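Because the annotations live in the INFO column, they can be read with a few lines of code even without a dedicated VCF library. The following is a minimal sketch, not Varant's own parser: the function name `parse_info` and the sample INFO string are made up for illustration, and it assumes that presence-style tags such as DB138 or KGDB appear as value-less flags.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; keys without '=' are stored as True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue  # "." marks an empty INFO column
        # Split at the first '=' only, so values that contain '=' stay intact.
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Hypothetical INFO string, invented for this example.
info = "DB138;KGDB;KGAF=0.002;GerpConserve;GerpRSScore=4.2"
parsed = parse_info(info)
print(parsed["KGAF"])      # values are kept as strings
print("ESPDB" in parsed)   # tags that are absent are simply missing keys
```

Splitting at the first '=' matters here because some Varant values, such as VARANT_INTERGENIC, contain '=' themselves.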
The format of each annotation group is detailed in the following sections:
dbSNP, 1000Genome or ESP variants
Clinically significant variants from ClinVar DB
GWAS variants
Intergenic variants
Genic variants
Conserved Regions
CADD prediction
Interpro Domain
TFBS and eQTL
Low complexity regions
Pseudo Autosomal regions
Capture regions
dbSNP, 1000Genome or ESP variants
Annotations in INFO field of VCF file | Description |
DB138 | Variant present in dbSNP138 |
dbSNPBuildID | dbSNP build ID in which the variant was first reported |
KGDB | Variant present in 1000Genome |
KGAF | Minor allele frequency (as a fraction) reported in 1000Genome, e.g. KGAF=0.002 |
ESPDB | Variant present in Exome Sequencing Project |
ESPAF | Minor allele frequency (as a fraction) reported in Exome Sequencing Project, e.g. ESPAF=0.01 |
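A common use of KGAF and ESPAF is to keep only variants that are rare in both populations. The sketch below assumes the INFO field has already been turned into a dictionary of strings and that each tag holds a single fraction. The function names and the 0.01 cutoff are illustrative choices, not part of Varant.

```python
def max_population_af(info_fields):
    """Return the larger of KGAF and ESPAF, or None if neither tag is present."""
    values = [float(info_fields[tag]) for tag in ("KGAF", "ESPAF") if tag in info_fields]
    return max(values) if values else None


def is_rare(info_fields, threshold=0.01):
    """Treat a variant as rare if it is unreported or below the threshold in both sources."""
    af = max_population_af(info_fields)
    return af is None or af < threshold


# Hypothetical records, invented for this example.
print(is_rare({"KGAF": "0.002"}))
print(is_rare({"KGAF": "0.002", "ESPAF": "0.05"}))
```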
Example:
Clinically significant variants from ClinVar DB
Annotations in INFO field of VCF file | Description |
CLNDSDB | Variant disease database name |
CLNACC | Variant Accession and Versions |
CLNDBN | Variant disease name |
CLNSRC | Variant Clinical Channels |
CLNSIG | Variant Clinical Significance, 0 - unknown, 1 - untested, 2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 5 - pathogenic, 6 - drug-response, 7 - histocompatibility, 255 - other |
CLNORIGIN | Allele Origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other |
CLN | SNP is Clinical(LSDB,OMIM,TPA,Diagnostic) |
CLNDSDBID | Variant disease database ID |
CLNHGVS | Variant names from HGVS. The order of these variants corresponds to the order of the info in the other clinical INFO tags. |
CLNSRCID | Variant Clinical Channel IDs |
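CLNSIG is a plain code, whereas CLNORIGIN is a sum of powers of two and therefore has to be decoded bit by bit. The following sketch shows one way to turn both into readable labels using the values listed above. The function names are hypothetical, and it assumes a single integer per tag (records that carry several values would need to be split first).

```python
CLNSIG_LABELS = {
    0: "unknown", 1: "untested", 2: "non-pathogenic",
    3: "probable-non-pathogenic", 4: "probable-pathogenic",
    5: "pathogenic", 6: "drug-response", 7: "histocompatibility",
    255: "other",
}

CLNORIGIN_FLAGS = {
    1: "germline", 2: "somatic", 4: "inherited", 8: "paternal",
    16: "maternal", 32: "de-novo", 64: "biparental", 128: "uniparental",
    256: "not-tested", 512: "tested-inconclusive", 1073741824: "other",
}


def decode_clnsig(code):
    """Map a single CLNSIG code to its label."""
    return CLNSIG_LABELS.get(int(code), "unrecognised")


def decode_clnorigin(value):
    """Expand a summed CLNORIGIN value into the list of origins it encodes."""
    value = int(value)
    if value == 0:
        return ["unknown"]
    return [label for bit, label in CLNORIGIN_FLAGS.items() if value & bit]


print(decode_clnsig("5"))
print(decode_clnorigin("3"))  # 3 = 1 + 2
```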
Example:
GWAS variants
Annotations in INFO field of VCF file | Description |
GWASPhenotype | NHGRI-GWAS phenotypes associated with the variant |
Example:
Intergenic variants
For variants in an intergenic region, the upstream and downstream genes, along with the distance from the variant to each gene's TSS/TES, are
reported in the following format:
VARANT_INTERGENIC=UpstreamGene(dist=XYZ), DownstreamGene(dist=XYZ)
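The two `Gene(dist=...)` entries in this value can be pulled apart with a regular expression. This is a sketch under the assumption that exactly two entries are present and that gene names contain no parentheses, commas or whitespace. The function name and the sample value are invented, and the distance is left as a string because its exact form is not specified here.

```python
import re

# One "Gene(dist=...)" entry; tolerant of an optional space after the comma.
_GENE_DIST = re.compile(r"([^(),\s]+)\(dist=([^)]*)\)")


def parse_intergenic(value):
    """Return the upstream and downstream (gene, distance) pairs of VARANT_INTERGENIC."""
    matches = _GENE_DIST.findall(value)
    if len(matches) != 2:
        raise ValueError("expected two Gene(dist=...) entries: %r" % value)
    upstream, downstream = matches
    return {"upstream": upstream, "downstream": downstream}


# Hypothetical value, invented for this example.
print(parse_intergenic("GENE1(dist=1200), GENE2(dist=35000)"))
```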
Example:
Genic variants
A variant in a genic region is annotated in the following format:
VARANT_GENIC=Gene(Transcript_ID|Region|Exon_Number|AltID|cDNAPos|SpliceSite|UTRSignal|Mutation|Codon_Change|Amino_Acid_Change|Ref_Protein_Length|Codon_Usage|SIFT(pred_score)|Polyphen2(pred_score)|Warning:OMIM_Disease:OMIM_Ids:GAD_Disease)
If a gene has more than one transcript, the annotations for each additional transcript are appended after those of the first transcript, separated by ':'. The last three ':'-separated annotations are always the OMIM phenotypes, OMIM phenotype IDs and GAD phenotypes associated with the gene.
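The layout described above (a gene name, then ':'-separated transcript blocks of '|'-separated fields, then three gene-level values) can be unpacked as follows. This is only a sketch: `parse_genic` is a hypothetical helper, it handles a single gene entry, and it assumes that no individual field contains ':', '|' or parentheses. The sample value is invented.

```python
TRANSCRIPT_FIELDS = [
    "Transcript_ID", "Region", "Exon_Number", "AltID", "cDNAPos",
    "SpliceSite", "UTRSignal", "Mutation", "Codon_Change",
    "Amino_Acid_Change", "Ref_Protein_Length", "Codon_Usage",
    "SIFT", "Polyphen2", "Warning",
]


def parse_genic(value):
    """Parse one Gene(...) entry of VARANT_GENIC into a nested dict."""
    gene, sep, rest = value.partition("(")
    if not sep or not rest.endswith(")"):
        raise ValueError("not in Gene(...) form: %r" % value)
    parts = rest[:-1].split(":")
    if len(parts) < 4:
        raise ValueError("expected at least one transcript plus three gene-level values")
    # The last three ':'-separated items belong to the gene, the rest are transcripts.
    *transcripts, omim_disease, omim_ids, gad_disease = parts
    return {
        "Gene": gene,
        "Transcripts": [dict(zip(TRANSCRIPT_FIELDS, t.split("|"))) for t in transcripts],
        "OMIM_Disease": omim_disease,
        "OMIM_Ids": omim_ids,
        "GAD_Disease": gad_disease,
    }


# Hypothetical value built field by field; empty strings stand for unset fields.
transcript = "|".join([
    "TX_0001", "CodingExonic", "3", "1", "250", "", "", "NonSyn",
    "", "", "120", "", "D_0.02", "PP2D_0.991", "",
])
record = parse_genic("GENE1(" + transcript + ":DiseaseA:123456:DiseaseB)")
print(record["Transcripts"][0]["Mutation"], record["OMIM_Ids"])
```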
The fields in VARANT_GENIC are described below:
Fields | Description |
Gene | Gene Name |
Transcript_ID | Transcript Accession number |
Region | Genomic region where the variant is present. Possible values: |
| CodingExonic - Variant present in CDS |
| CodingIntronic - Variant present in intron of coding transcript |
| NonCodingExonic - Variant present in exon of non-coding transcript |
| NonCodingIntronic - Variant present in intron of non-coding transcript |
| UTR5 - Variant present in 5'UTR of transcript |
| UTR3 - Variant present in 3'UTR of transcript |
| Intergenic_UTR5_boundary - Variant spanning Intergenic and 5'UTR region |
| CodingExonic_CodingIntronic_boundary - Variant spanning CDS and intronic region of coding transcript |
| UTR5_CodingExonic_boundary - Variant spanning 5'UTR and CDS of coding transcript |
| CodingExonic_UTR3_boundary - Variant spanning CDS and 3'UTR of coding transcript |
| UTR3_Intergenic_boundary - Variant spanning 3'UTR and Intergenic region of coding transcript |
| NonCodingExonic_NonCodingIntronic_boundary - Variant spanning exon and intron of non-coding transcript |
Exon_Number | Exon which hosts the variant. If the variant spans more than one exon, the exon numbers are reported '__' delimited. |
AltID | Alternate allele ID number (1, 2, 3 etc.) to which the annotations correspond |
cDNAPos | cDNA position of the variant |
SpliceSite | Splice site or splice regulation site annotations. Possible values: |
| ESE - Variant present in predicted Exonic Splice Enhancer Site |
| ESS - Variant present in predicted Exonic Splice Silencer Site |
| SpliceDonor - Variant present at 5' splice site |
| SpliceAcceptor - Variant present at 3' splice site |
| ExonX_SpliceDonor_dist1__ExonY_SpliceAcceptor_dist2 - Reported for intronic variants that are not at splice sites. ExonX and ExonY are the flanking exons, with X and Y as exon numbers; dist1 is the distance of the variant from the SpliceDonor site and dist2 is its distance from the SpliceAcceptor site. E.g. Exon7_SpliceDonor_5__Exon8_SpliceAcceptor_750 indicates that the variant lies between Exon7 and Exon8, 5 bp away from the SpliceDonor and 750 bp away from the SpliceAcceptor |
UTRSignal | Functional motifs in the untranslated region of the transcript |
Mutation | Mutation type caused by the variant. Possible values: |
| NonSyn, Syn, StartGain, StartLoss, StopGain, StopLoss, FrameShiftInsert, FrameShiftDelete, NonFrameShiftInsert, NonFrameShiftDelete, NoCDSChange |
Codon_Change | Reports the wild type codon and mutant type codon that hosts the variant |
Amino_Acid_Change | Change in the amino acid due to the mutation. Reports the wild type and mutant type amino acids and the protein position |
Ref_Protein_Length | The original protein length |
Codon_Usage | Change in codon usage. Possible values: |
| CodonUsageDown - The mutant type codon usage is lower than the wild type codon usage |
| CodonUsageUp - The mutant type codon usage is higher than the wild type codon usage |
SIFT(pred_score) | The SIFT prediction and score for NonSyn mutations. Possible values: |
| T_Score - 'T' stands for Tolerated effect, followed by the SIFT score. E.g. T_0.28 |
| D_Score - 'D' stands for Damaging effect, followed by the SIFT score. E.g. D_0.02 |
Polyphen2(pred_score) | The Polyphen2 prediction and score for NonSyn mutations. Possible values: |
| PP2B_Score - 'PP2B' stands for Benign effect, followed by the PolyPhen2 score. E.g. PP2B_0.018 |
| PP2PD_Score - 'PP2PD' stands for Possibly Damaging effect, followed by the PolyPhen2 score. E.g. PP2PD_0.789 |
| PP2D_Score - 'PP2D' stands for Probably Damaging effect, followed by the PolyPhen2 score. E.g. PP2D_0.991 |
Warning | Annotation warnings. Possible values: |
| NOT_ACTUAL_STOP_CODON__TRANSCRIPT_WITH_MULTIPLE_STOP_CODON - Reported when there is a StopLoss mutation and the stop codon is not present at the end of the CDS. |
| CDS_NOT_MULTIPLE_OF_3 - Reported when the transcript is incomplete. |
OMIM_Disease | Gene associated OMIM Phenotypes. Multiple values are '__' delimited. |
OMIM_Ids | Gene associated OMIM Phenotype IDs. Multiple values are '__' delimited. |
GAD_Disease | Gene associated Phenotypes from NCBI-GAD database. Multiple values are '__' delimited. |
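Several VARANT_GENIC fields pack a label and a number into one token, for example T_0.28, PP2D_0.991 or Exon7_SpliceDonor_5__Exon8_SpliceAcceptor_750. The sketch below splits such tokens into their parts using the prefixes listed in the table. The helper names are hypothetical, and other SpliceSite values (ESE, ESS, SpliceDonor, SpliceAcceptor) are simply reported as having no distance information.

```python
import re

PREDICTION_LABELS = {
    "T": "Tolerated", "D": "Damaging",            # SIFT
    "PP2B": "Benign", "PP2PD": "Possibly Damaging",
    "PP2D": "Probably Damaging",                  # PolyPhen2
}

_SPLICE_DIST = re.compile(
    r"^Exon(\d+)_SpliceDonor_(\d+)__Exon(\d+)_SpliceAcceptor_(\d+)$"
)


def parse_prediction(value):
    """Split a SIFT or PolyPhen2 token such as 'T_0.28' into (label, score)."""
    code, sep, score = value.rpartition("_")
    if not sep or code not in PREDICTION_LABELS:
        raise ValueError("unrecognised prediction token: %r" % value)
    return PREDICTION_LABELS[code], float(score)


def parse_splice_distance(value):
    """Return flanking exons and distances for an intronic variant, else None."""
    match = _SPLICE_DIST.match(value)
    if match is None:
        return None
    donor_exon, donor_dist, acceptor_exon, acceptor_dist = map(int, match.groups())
    return {
        "donor_exon": donor_exon, "donor_dist": donor_dist,
        "acceptor_exon": acceptor_exon, "acceptor_dist": acceptor_dist,
    }


print(parse_prediction("PP2D_0.991"))
print(parse_splice_distance("Exon7_SpliceDonor_5__Exon8_SpliceAcceptor_750"))
```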
Example:
Variant in the CDS of a gene, causing a NonSyn mutation predicted as damaging. The gene is also associated with an OMIM phenotype.
Variant that spans boundary region.
Variant not at splice site but very close to 5' splice site.
Variant on overlapping genes.
Conserved Regions
Annotations in INFO field of VCF file | Description |
GerpConserve | Flags the variant as present in a conserved region |
GerpRSScore | Gerp Rejected Substitutions Score. Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element. |
GerpPValue | Gerp P-Value |
Example:
CADD prediction
Annotations in INFO field of VCF file | Description |
CADD_raw | CADD raw score for functional prediction of a SNP. The larger the score, the more likely the SNP has a damaging effect. Scores are reported in the order in which ALT alleles are reported. |
CADD_phred | CADD phred-like score. This is a phred-like rank score based on whole genome CADD raw scores. The larger the score, the more likely the SNP has a damaging effect. Scores are reported in the order in which ALT alleles are reported. |
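Since the scores follow the order of the ALT alleles, each score can be paired with its allele by position. The sketch below assumes that multiple scores are comma-separated, as is usual for per-allele INFO values in VCF, and that a missing score is written as '.'. Both points are assumptions rather than documented Varant behaviour, and the function name is hypothetical.

```python
def cadd_by_allele(alt_field, cadd_value):
    """Pair each ALT allele with its CADD score, keeping the reported order."""
    alts = alt_field.split(",")
    scores = [None if s == "." else float(s) for s in cadd_value.split(",")]
    if len(alts) != len(scores):
        raise ValueError("number of scores does not match number of ALT alleles")
    return dict(zip(alts, scores))


# Hypothetical ALT column and CADD_phred value, invented for this example.
print(cadd_by_allele("A,T", "1.5,12.3"))
```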
Example:
Interpro Domain
Annotations in INFO field of VCF file | Description |
Interpro_domains | Variant position is part of a domain or conserved site |
Example:
TFBS and eQTL
Regulome DB is used for this annotation.
Annotations in INFO field of VCF file | Description |
RegulomeScore | Regulome score, as defined by Regulome DB. Possible values: |
| 1a or 1b or 1c or 1d or 1e or 1f - Represents eQTL |
| 2a or 2b or 2c or 3a or 3b or 4 - Represents TFBS |
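The grouping of RegulomeScore values into eQTL and TFBS can be expressed as a small lookup. This is an illustrative sketch that follows only the grouping given in the table above. The function name is hypothetical, and any other score is returned as unclassified.

```python
EQTL_SCORES = {"1a", "1b", "1c", "1d", "1e", "1f"}
TFBS_SCORES = {"2a", "2b", "2c", "3a", "3b", "4"}


def classify_regulome(score):
    """Return 'eQTL', 'TFBS' or None for a RegulomeScore value."""
    if score in EQTL_SCORES:
        return "eQTL"
    if score in TFBS_SCORES:
        return "TFBS"
    return None


for value in ("1f", "2b", "4", "6"):
    print(value, classify_regulome(value))
```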
Example:
Variant upstream of a gene and at TFBS
Variant causing a Syn mutation that is also an eQTL
Low complexity regions
The LCR data was distributed as supplementary material for the paper
"Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples" by Heng Li.
The file used for annotation was LCR-hs37d5.bed.gz.
Annotations in INFO field of VCF file | Description |
LCR | Variant position is part of a low-complexity region |
Example: An insertion variant in low complexity region
Pseudo Autosomal regions
The genes in PAR regions were compiled from the following two sources:
1. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2435358/
2. http://en.wikipedia.org/wiki/Pseudoautosomal_region
Annotations in INFO field of VCF file | Description |
PAR | Variant is in Pseudoautosomal Region |
Example:
Capture regions
The current version is set up to annotate using a Nimblegen capture BED file, but this feature can be extended to use other capture arrays, such as those from Illumina or Agilent.
Annotations in INFO field of VCF file | Description |
CaptureCore | Variant is present in the capture BED file |
Capture5p | Number of bases by which the variant position lies upstream (5p) of the capture region |
Capture3p | Number of bases by which the variant position lies downstream (3p) of the capture region |
Example:
Variant is part of capture region
Variant position is 20 bases upstream of capture region
Variant position is 28 bases downstream of capture region