Varant Features

The latest version of Varant supports the following features:

Feature - Description
Variant types annotated - SNPs, Indels & MNPs
Input file format - VCF
Output file format - VCF
VCF parser - A parser is provided for Varant-annotated VCF files

Varant Annotations

Following are the 22 annotations provided by Varant:

1. dbSNP, 1000Genome Minor Allele Frequency (MAF) & ESP (MAF)
2. Clinically significant variants from ClinVar DB
3. GWAS Phenotype
4. Genomic region - Intergenic, Intronic, Exonic & UTR
5. Downstream and upstream gene for intergenic variants
6. Splice Site (Donor/Acceptor)
7. Mutation Type - NonSyn, Syn, StartGain, StartLoss, StopGain, StopLoss, SynStop
8. Codon Usage in Human
9. Exonic splice enhancer / silencer site
10. Flags variants that span a boundary region such as Intron-Exon or UTR-CDS
11. Distance of intronic variants from splice sites
12. UTR Functional Motifs
13. miRNA Binding Site
14. Polyphen2, SIFT & CADD prediction
15. Gene-Disease association - OMIM, NCBI-GAD
16. Position Conservation - Gerp++ Score
17. Interpro Domain
18. TFBS
19. eQTL
20. Low complexity region
21. Pseudo autosomal region
22. Capture region

Varant Annotation Output Format

Varant annotations are written to the INFO field of the VCF file in compliance with the VCF file format, so they can be parsed by any VCF parser.
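As a minimal sketch of what parsing these annotations could look like without a dedicated library, the snippet below splits an INFO string into a dictionary. The function name `parse_info` and the sample INFO string are illustrative assumptions, not part of Varant. The sketch also assumes that presence-style tags such as DB138 or LCR are written as valueless flags.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict.

    Key=value entries keep their value as a string; entries without '='
    are treated as flags and map to True.
    """
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


# Hypothetical INFO string assembled from tags documented on this page.
info = "DB138;KGDB;KGAF=0.002;ESPDB;ESPAF=0.01;LCR"
parsed = parse_info(info)
print(parsed["KGAF"])
print("LCR" in parsed)
```

Values are left as strings so that the caller can decide how to interpret each tag.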

The format of each annotation group is described in the following sections:

dbSNP, 1000Genome or ESP variants
Clinically significant variants from ClinVar DB
GWAS variants
Intergenic variants
Genic variants
Conserved Regions
CADD prediction
Interpro Domain
TFBS and eQTL
Low complexity regions
Pseudo Autosomal regions
Capture regions

dbSNP, 1000Genome or ESP variants

Annotation in INFO field - Description
DB138 - Variant present in dbSNP138
dbSNPBuildID - dbSNP build ID in which the variant was first reported
KGDB - Variant present in 1000Genome
KGAF - Minor allele frequency (fraction) reported in 1000Genome. E.g. KGAF=0.002
ESPDB - Variant present in Exome Sequencing Project
ESPAF - Minor allele frequency (fraction) reported in Exome Sequencing Project. E.g. ESPAF=0.01
Example:

Clinically significant variants from ClinVar DB

Annotation in INFO field - Description
CLNDSDB - Variant disease database name
CLNACC - Variant accession and versions
CLNDBN - Variant disease name
CLNSRC - Variant clinical channels
CLNSIG - Variant clinical significance: 0 - unknown, 1 - untested, 2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 5 - pathogenic, 6 - drug-response, 7 - histocompatibility, 255 - other
CLNORIGIN - Allele origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other
CLN - SNP is clinical (LSDB, OMIM, TPA, Diagnostic)
CLNDSDBID - Variant disease database ID
CLNHGVS - Variant names from HGVS. The order of these variants corresponds to the order of the info in the other clinical INFO tags.
CLNSRCID - Variant clinical channel IDs
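Because CLNORIGIN values may be added together, a combined value can be decoded by testing each bit. The sketch below is illustrative only: the function names are assumptions, and it handles a single numeric code per tag, so multi-valued entries would need to be split first.

```python
# Code tables copied from the CLNSIG and CLNORIGIN descriptions on this page.
CLNSIG_LABELS = {
    0: "unknown", 1: "untested", 2: "non-pathogenic",
    3: "probable-non-pathogenic", 4: "probable-pathogenic",
    5: "pathogenic", 6: "drug-response", 7: "histocompatibility",
    255: "other",
}

CLNORIGIN_FLAGS = {
    1: "germline", 2: "somatic", 4: "inherited", 8: "paternal",
    16: "maternal", 32: "de-novo", 64: "biparental", 128: "uniparental",
    256: "not-tested", 512: "tested-inconclusive", 1073741824: "other",
}


def decode_clnsig(code):
    """Return the label for one CLNSIG code (hypothetical helper)."""
    return CLNSIG_LABELS.get(int(code), "unrecognized")


def decode_clnorigin(value):
    """Return the origin labels whose bits are set in a summed CLNORIGIN value."""
    value = int(value)
    if value == 0:
        return ["unknown"]
    return [label for bit, label in CLNORIGIN_FLAGS.items() if value & bit]


print(decode_clnsig("5"))
print(decode_clnorigin("5"))  # 5 = 1 (germline) + 4 (inherited)
```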
Example:

GWAS variants

Annotation in INFO field - Description
GWASPhenotype - NHGRI-GWAS phenotypes associated with the variant
Example:

Intergenic variants

For variants in an intergenic region, the upstream and downstream genes are reported along with the distance from the variant to each gene's TSS/TES, in the following format:

   VARANT_INTERGENIC=UpstreamGene(dist=XYZ), DownstreamGene(dist=XYZ)
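A value in this format could be unpacked with a small regular expression, as sketched below. The function name, the gene names in the sample, and the assumption that distances are always non-negative integers are illustrative and not confirmed by Varant.

```python
import re

# Matches "GeneName(dist=123)"; assumes gene names contain no spaces, commas or parentheses.
_GENE_DIST = re.compile(r"([^\s,()]+)\(dist=(\d+)\)")


def parse_intergenic(value):
    """Parse a VARANT_INTERGENIC value into upstream and downstream (gene, distance) pairs."""
    matches = _GENE_DIST.findall(value)
    if len(matches) != 2:
        raise ValueError("expected exactly two Gene(dist=N) entries: %r" % value)
    (up_gene, up_dist), (down_gene, down_dist) = matches
    return {
        "upstream": (up_gene, int(up_dist)),
        "downstream": (down_gene, int(down_dist)),
    }


# Hypothetical value; GENE_A and GENE_B are placeholder gene names.
print(parse_intergenic("GENE_A(dist=1200), GENE_B(dist=35000)"))
```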
Example:

Genic variants

A variant in a genic region is annotated in the following format:

 VARANT_GENIC=Gene(Transcript_ID|Region|Exon_Number|AltID|cDNAPos|SpliceSite|UTRSignal|Mutation|Codon_Change|Amino_Acid_Change|Ref_Protein_Length|Codon_Usage|SIFT(pred_score)|Polyphen2(pred_score)|Warning:OMIM_Disease:OMIM_Ids:GAD_Disease)

If a gene has more than one transcript, the annotations for each additional transcript are appended after those of the first transcript, separated by ':'. The last three ':'-separated annotations are always the OMIM phenotypes, OMIM phenotype IDs and GAD phenotypes associated with the gene.

The fields in VARANT_GENIC are described below:

Field - Description
Gene - Gene name
Transcript_ID - Transcript accession number
Region - Genomic region where the variant is present. Possible values:
  CodingExonic - Variant present in CDS
  CodingIntronic - Variant present in intron of coding transcript
  NonCodingExonic - Variant present in exon of non-coding transcript
  NonCodingIntronic - Variant present in intron of non-coding transcript
  UTR5 - Variant present in 5'UTR of transcript
  UTR3 - Variant present in 3'UTR of transcript
  Intergenic_UTR5_boundary - Variant spanning Intergenic and 5'UTR region
  CodingExonic_CodingIntronic_boundary - Variant spanning CDS and intronic region of coding transcript
  UTR5_CodingExonic_boundary - Variant spanning 5'UTR and CDS of coding transcript
  CodingExonic_UTR3_boundary - Variant spanning CDS and 3'UTR of coding transcript
  UTR3_Intergenic_boundary - Variant spanning 3'UTR and Intergenic region of coding transcript
  NonCodingExonic_NonCodingIntronic_boundary - Variant spanning exon and intron of non-coding transcript
Exon_Number - Exon that hosts the variant. If the variant spans more than one exon, the exon numbers are reported '__' delimited.
AltID - Alternate allele ID number (1, 2, 3 etc.) to which the annotations correspond
cDNAPos - cDNA position of the variant
SpliceSite - Splice site or splice regulation site annotations. Possible values:
  ESE - Variant present in predicted Exonic Splice Enhancer site
  ESS - Variant present in predicted Exonic Splice Silencer site
  SpliceDonor - Variant present at 5' splice site
  SpliceAcceptor - Variant present at 3' splice site
  ExonX_SpliceDonor_dist1__ExonY_SpliceAcceptor_dist2 - Reported for intronic variants that are not at splice sites. ExonX and ExonY are the flanking exons, with X and Y as exon numbers; dist1 is the distance of the variant from the SpliceDonor site and dist2 is its distance from the SpliceAcceptor site. E.g. Exon7_SpliceDonor_5__Exon8_SpliceAcceptor_750 indicates that the variant lies between Exon7 and Exon8, 5 bp from the SpliceDonor site and 750 bp from the SpliceAcceptor site.
UTRSignal - Functional motifs in the untranslated region of the transcript
Mutation - Mutation type caused by the variant. Possible values:
  NonSyn, Syn, StartGain, StartLoss, StopGain, StopLoss, FrameShiftInsert, FrameShiftDelete, NonFrameShiftInsert, NonFrameShiftDelete, NoCDSChange
Codon_Change - Wild type codon and mutant type codon that host the variant
Amino_Acid_Change - Change in the amino acid due to the mutation. Reports the wild type and mutant type amino acids and the protein position
Ref_Protein_Length - The original protein length
Codon_Usage - Change in codon usage. Possible values:
  CodonUsageDown - The mutant type codon usage is lower than the wild type codon usage
  CodonUsageUp - The mutant type codon usage is higher than the wild type codon usage
SIFT(pred_score) - The SIFT prediction and score for NonSyn mutations. Possible values:
  T_Score - 'T' stands for Tolerated effect, followed by the SIFT score. E.g. T_0.28
  D_Score - 'D' stands for Damaging effect, followed by the SIFT score. E.g. D_0.02
Polyphen2(pred_score) - The Polyphen2 prediction and score for NonSyn mutations. Possible values:
  PP2B_Score - 'PP2B' stands for Benign effect, followed by the PolyPhen2 score. E.g. PP2B_0.018
  PP2PD_Score - 'PP2PD' stands for Possibly Damaging effect, followed by the PolyPhen2 score. E.g. PP2PD_0.789
  PP2D_Score - 'PP2D' stands for Probably Damaging effect, followed by the PolyPhen2 score. E.g. PP2D_0.991
Warning - Annotation warnings. Possible values:
  NOT_ACTUAL_STOP_CODON__TRANSCRIPT_WITH_MULTIPLE_STOP_CODON - Reported when there is a StopLoss mutation and the stop codon is not present at the end of the CDS.
  CDS_NOT_MULTIPLE_OF_3 - Reported when the transcript is incomplete.
OMIM_Disease - Gene-associated OMIM phenotypes. Multiple values are '__' delimited.
OMIM_Ids - Gene-associated OMIM phenotype IDs. Multiple values are '__' delimited.
GAD_Disease - Gene-associated phenotypes from the NCBI-GAD database. Multiple values are '__' delimited.
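The layout described above could be unpacked roughly as follows. This is a sketch under several assumptions: it handles a single gene entry, it assumes that field values never contain ':' or '|', and the function names and the sample value (including the Codon_Change and Amino_Acid_Change spellings) are hypothetical rather than taken from real Varant output.

```python
import re

# Per-transcript fields, in the order given by the VARANT_GENIC format string.
TRANSCRIPT_FIELDS = [
    "Transcript_ID", "Region", "Exon_Number", "AltID", "cDNAPos",
    "SpliceSite", "UTRSignal", "Mutation", "Codon_Change",
    "Amino_Acid_Change", "Ref_Protein_Length", "Codon_Usage",
    "SIFT", "Polyphen2", "Warning",
]

_GENIC = re.compile(r"([^()]+)\((.*)\)")
_SPLICE_DIST = re.compile(
    r"Exon(\d+)_SpliceDonor_(\d+)__Exon(\d+)_SpliceAcceptor_(\d+)"
)


def _split_multi(text):
    """Split a '__' delimited field; an empty field yields an empty list."""
    return text.split("__") if text else []


def parse_genic(value):
    """Parse one Gene(...) entry into gene-level and per-transcript annotations."""
    match = _GENIC.fullmatch(value)
    if match is None:
        raise ValueError("not in Gene(...) form: %r" % value)
    gene, body = match.groups()
    parts = body.split(":")
    if len(parts) < 4:
        raise ValueError("expected at least one transcript plus three gene-level fields")
    # The last three ':' separated items are gene-level; the rest are transcripts.
    *transcripts, omim_disease, omim_ids, gad_disease = parts
    return {
        "gene": gene,
        "transcripts": [dict(zip(TRANSCRIPT_FIELDS, t.split("|"))) for t in transcripts],
        "OMIM_Disease": _split_multi(omim_disease),
        "OMIM_Ids": _split_multi(omim_ids),
        "GAD_Disease": _split_multi(gad_disease),
    }


def parse_splice_distance(splice_site):
    """Extract flanking exons and distances from an intronic SpliceSite value, else None."""
    match = _SPLICE_DIST.fullmatch(splice_site)
    if match is None:
        return None
    donor_exon, donor_dist, acceptor_exon, acceptor_dist = map(int, match.groups())
    return {
        "donor_exon": donor_exon, "donor_dist": donor_dist,
        "acceptor_exon": acceptor_exon, "acceptor_dist": acceptor_dist,
    }


# Hypothetical value with placeholder gene, transcript and phenotype names.
sample = ("GENE1(NM_000001|CodingExonic|3|1|250|||NonSyn|GAG/GTG|E84V|300|"
          "CodonUsageDown|D_0.02|PP2D_0.991|:ExampleDisease:000000:ExampleTrait)")
record = parse_genic(sample)
print(record["transcripts"][0]["Mutation"])
print(parse_splice_distance("Exon7_SpliceDonor_5__Exon8_SpliceAcceptor_750"))
```

Splitting off the last three ':' separated items first keeps the transcript count flexible, which matches the rule that the gene-level phenotypes always come last.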
Example:
Variant in the CDS of a gene, causing a NonSyn mutation and predicted as damaging. The gene is also associated with an OMIM phenotype.

Variant that spans boundary region.

Variant not at splice site but very close to 5' splice site.

Variant on overlapping genes.


Conserved Regions

Annotation in INFO field - Description
GerpConserve - Flags that the variant is present in a conserved region
GerpRSScore - Gerp Rejected Substitutions score. Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element.
GerpPValue - Gerp P-value
Example:

CADD prediction

Annotation in INFO field - Description
CADD_raw - CADD raw score for functional prediction of a SNP. The larger the score, the more likely the SNP has a damaging effect. Scores are reported in the order in which the ALT alleles are reported.
CADD_phred - CADD phred-like score. This is a phred-like rank score based on whole-genome CADD raw scores. The larger the score, the more likely the SNP has a damaging effect. Scores are reported in the order in which the ALT alleles are reported.
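Since the scores follow the order of the ALT alleles, each allele can be paired with its score by position. The sketch below assumes that multiple scores are comma separated, as is conventional for per-allele VCF values. This page does not state the delimiter, and the function name is illustrative.

```python
def cadd_by_allele(alt, cadd_scores):
    """Pair each ALT allele with its CADD score by position.

    Assumes both the ALT column and the score list are comma separated.
    """
    alleles = alt.split(",")
    scores = [float(score) for score in cadd_scores.split(",")]
    if len(alleles) != len(scores):
        raise ValueError("number of scores does not match number of ALT alleles")
    return dict(zip(alleles, scores))


# Hypothetical site with two ALT alleles and two CADD_phred scores.
print(cadd_by_allele("A,T", "12.5,3.1"))
```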
Example:

Interpro Domain

Annotation in INFO field - Description
Interpro_domains - Variant position is part of a domain or conserved site
Example:

TFBS and eQTL

Regulome DB is used for this annotation.
Annotation in INFO field - Description
RegulomeScore - Regulome score as defined by Regulome DB. Possible values:
  1a, 1b, 1c, 1d, 1e or 1f - Represents eQTL
  2a, 2b, 2c, 3a, 3b or 4 - Represents TFBS
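The mapping from score to category can be expressed as a simple lookup. The sketch below encodes only the two groups listed above; the function name is an assumption, and any score outside those groups is returned as None rather than guessed.

```python
# Score groups as listed on this page.
EQTL_SCORES = {"1a", "1b", "1c", "1d", "1e", "1f"}
TFBS_SCORES = {"2a", "2b", "2c", "3a", "3b", "4"}


def classify_regulome(score):
    """Return 'eQTL' or 'TFBS' for a RegulomeScore value, or None if not listed."""
    if score in EQTL_SCORES:
        return "eQTL"
    if score in TFBS_SCORES:
        return "TFBS"
    return None


print(classify_regulome("1f"))
print(classify_regulome("2b"))
```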
Example:
Variant upstream of a gene and at TFBS

Variant causing a Syn mutation that is also an eQTL


Low complexity regions

The LCR data used were distributed as supplementary material for the paper "Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples" by Heng Li.
The file used for annotation was LCR-hs37d5.bed.gz.
Annotation in INFO field - Description
LCR - Variant position is part of a low-complexity region
Example: An insertion variant in a low-complexity region


Pseudo Autosomal regions

The genes in PAR regions were compiled from the following two sources:
1. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2435358/
2. http://en.wikipedia.org/wiki/Pseudoautosomal_region
Annotation in INFO field - Description
PAR - Variant is in a pseudoautosomal region
Example:

Capture regions

The current version is set up to annotate using a Nimblegen capture BED file, but this feature can be extended to use other capture arrays from Illumina or Agilent.
Annotation in INFO field - Description
CaptureCore - Variant is present in the capture BED file
Capture5p - Number of bases by which the variant position lies 5p upstream of the capture region
Capture3p - Number of bases by which the variant position lies 3p downstream of the capture region
Example:
Variant is part of capture region
Variant position is 20 bases upstream of capture region
Variant position is 28 bases downstream of capture region