VCF Bulk Export

This form filters existing VCF files and exports the result into common formats. Most of the filter criteria and many of the formats are provided by VCFtools+.
1. Choose your VCF File.
The following table contains all the available VCF Files. Choose the one you would like to filter and export by selecting the circle at the beginning of the appropriate row.
Name | Assembly | Number of SNPs
AGILE LDP Exome Capture SNP Set | Lc2.0 | 578,890

More information on AGILE LDP Exome Capture SNP Set: This is the AGILE LDP Exome Capture SNP set with 324 germplasm lines. Due to space constraints, it has been minimally filtered for minor allele frequency of 5%, maximum missing frequency of 90% across all individuals, and minimum SNP quality score of 30. This set also contains indels.

The AGILE Lentil Diversity Panel exome capture set is being made publicly available to accompany the germplasm set so others can conduct analyses without having to re-generate the genotypic data. Through use of these data, you are bound to the following principles:

  • That this data as accessed is pre-competitive and is not patentable in its present state.
  • To identify https://knowpulse.usask.ca/AGILE/2 as the source of the data as well as cite:
    Teketel A. Haile, Taryn Heidecker, Derek Wright, Sandesh Neupane, Larissa Ramsay, Albert Vandenberg, and Kirstin E. Bett. 2020. Genomic Selection for Lentil Breeding: Empirical Evidence. The Plant Genome. 13:e20002. DOI: 10.1002/tpg2.20002
  • To credit “The Application of Genomic Innovation in the Lentil Economy (AGILE) Project” and its Project Leaders (KE Bett and A Vandenberg), and the researcher(s) responsible for generating the data.
  • It would be appreciated if you contact AGILE Project Leader (k.bett@usask.ca) and data producers to discuss any analyses that make use of this data to avoid the overlap of any analyses that are already underway or completed.
  • The AGILE Project Leaders and data producers provide these data in good faith, but make no warranty, expressed or implied, nor assume any legal liability or responsibility for any purpose for which the data are used.

Please specify some filter criteria for this set! Downloading it unfiltered may take up to 1 hour!

Statistical Summary | Min | Max | Median | Average | SD
Minor Allele Frequency | 0.05 | 0.381 | 0.063 | 0.0719 | 0.0244
Missing Frequency | 0 | 0.099 | 0.012 | 0.0262 | 0.0303

2. Specify filter criteria.
Restrict dataset to specific germplasm or regions
Select a VCF file from above to see specific germplasm.
Only include sites in specific regions. For example, if you want to include all genotypic data for a QTL on Chr1 from 5555 to 6666, you would enter Chr1:5555..6666.
Region input format: Chrom:position-lower-bound..position-upper-bound (e.g., Lcu.2RBY.Chr1:111111..222222).
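The region format above can be checked before submitting a request. The following is a minimal sketch, not the form's own code: `Region`, `parse_region` and `to_vcftools_args` are hypothetical names, and the mapping onto the VCFtools options `--chr`, `--from-bp` and `--to-bp` is an assumption about how such a region could be passed along.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


# Chrom:lower..upper. Chromosome names may contain dots (e.g. Lcu.2RBY.Chr1),
# so everything before the colon is taken as the name.
_REGION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)\.\.(?P<end>\d+)$")


def parse_region(text: str) -> Region:
    """Parse 'Chrom:lower..upper' into a Region, or raise ValueError."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Expected Chrom:lower..upper, got {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"Lower bound {start} is greater than upper bound {end}")
    return Region(match["chrom"], start, end)


def to_vcftools_args(region: Region) -> list:
    """Assumed mapping of a region onto VCFtools position options."""
    return ["--chr", region.chrom,
            "--from-bp", str(region.start),
            "--to-bp", str(region.end)]


print(parse_region("Lcu.2RBY.Chr1:111111..222222"))
print(to_vcftools_args(parse_region("Chr1:5555..6666")))
```

Inputs that use a dash or a single dot between the bounds (for example Chr1:5555-6666) are rejected by this sketch, since the format above only describes the double-dot form.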
All germplasm (individuals) in the selected file. Copy the germplasm you want to keep into the text area below.
You must copy the germplasm you want to keep into the "Keep these Germplasm" field, or your file will contain all germplasm.
Only include these germplasm (individuals) in the export file. Each germplasm name should be on its own line and must exactly match a name in the chosen file (names are shown above).
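Because names must match exactly, it can help to validate a keep list before pasting it in. This is a small sketch under assumptions: `parse_keep_list` is a hypothetical helper, the germplasm names are made up for illustration, and trimming surrounding whitespace and dropping blank lines is a choice made here, not documented behaviour of the form.

```python
from __future__ import annotations


def parse_keep_list(textarea: str, available: list[str]) -> list[str]:
    """Return the requested germplasm names, one per line, in input order.

    Raises ValueError if any name is not an exact (case-sensitive) match
    for a name in `available`.
    """
    requested = [line.strip() for line in textarea.splitlines() if line.strip()]
    known = set(available)
    unknown = [name for name in requested if name not in known]
    if unknown:
        raise ValueError(f"Names not found in the VCF file: {unknown}")
    # dict.fromkeys keeps the first occurrence of each name and preserves order.
    return list(dict.fromkeys(requested))


# Illustrative names only; real names come from the selected VCF file.
names_in_file = ["Germ1", "Germ2", "Germ3"]
print(parse_keep_list("Germ1\n\nGerm3\nGerm1\n", names_in_file))
```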
If you check this checkbox, only SNPs with 2 alleles across all individuals will be kept. For example, in the example data below, SNP Chr3p34567 would be removed.
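As an illustration of the two-allele check, the sketch below counts the distinct alleles per SNP in the example table. The function names are hypothetical and the code is not taken from the form or from VCFtools. Missing calls are written as None; which germplasm columns hold the missing cells is an assumption and does not affect the allele count.

```python
# Genotype calls per SNP from the example table (read depths omitted).
# None marks a missing call; the column position of each None is assumed.
EXAMPLE_CALLS = {
    "Chr1p12344": ["AA", "TT", "TT", "AT", "TT", None],
    "Chr2p25678": ["GG", "GG", "GG", "TT", "GG", "GT"],
    "Chr3p34567": ["AA", "CC", "AC", "TT", "CC", "TC"],
    "Chr4p48765": ["CC", "AC", "CC", "AA", None, None],
}


def distinct_alleles(calls):
    """Return the set of alleles seen across all non-missing calls."""
    alleles = set()
    for call in calls:
        if call is not None:
            alleles.update(call)  # "AC" contributes both "A" and "C"
    return alleles


def is_biallelic(calls):
    """True if exactly two distinct alleles are observed at the site."""
    return len(distinct_alleles(calls)) == 2


kept = [snp for snp, calls in EXAMPLE_CALLS.items() if is_biallelic(calls)]
print(kept)  # ['Chr1p12344', 'Chr2p25678', 'Chr4p48765']
```

Chr3p34567 carries A, C and T, so it is the only SNP dropped.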
Only include SNP calls that have at least the specified number of reads to support the call. For example, if you specify 5 for this filter then for SNP Chr2p25678 in the example table below, only the call for Germ4 will be set to missing data.
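The read-depth filter works per call and not per SNP. A minimal sketch, assuming each cell is held as a (call, depth) pair and that a masked call is represented as None (`mask_low_depth` is a hypothetical name):

```python
def mask_low_depth(cells, min_depth):
    """Set calls supported by fewer than `min_depth` reads to missing (None)."""
    return [
        cell if cell is not None and cell[1] >= min_depth else None
        for cell in cells
    ]


# SNP Chr2p25678 from the example table, Germ1 to Germ6.
chr2p25678 = [("GG", 7), ("GG", 13), ("GG", 5), ("TT", 2), ("GG", 22), ("GT", 24)]

masked = mask_low_depth(chr2p25678, min_depth=5)
print(masked)
# [('GG', 7), ('GG', 13), ('GG', 5), None, ('GG', 22), ('GT', 24)]
```

The call for Germ3 (depth 5) is kept because the threshold is "at least" 5, while the call for Germ4 (depth 2) becomes missing.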
Only include SNP positions with a minor allele frequency greater than or equal to this value. Allele frequency is defined as the number of times an allele appears over all individuals at that site, divided by the total number of non-missing alleles at that site. For example, if you enter 45% in this filter then SNPs with a minor allele frequency lower than 45% would be removed (e.g., SNP Chr1p12344 in the example data below).
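Applying that definition to SNP Chr1p12344 in the example table gives 3 A alleles out of 10 non-missing alleles, a minor allele frequency of 30%. The sketch below works through the same arithmetic. It is an illustration only: `minor_allele_frequency` is a hypothetical name, and taking the rarest allele at sites with more than two alleles is an assumption made here.

```python
import math
from collections import Counter


def minor_allele_frequency(calls):
    """Frequency of the least common allele among non-missing alleles.

    Returns None if every call is missing and 0.0 if only one allele is seen.
    """
    counts = Counter(allele for call in calls if call is not None for allele in call)
    total = sum(counts.values())
    if total == 0:
        return None
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / total


# SNP Chr1p12344: five calls and one missing cell. The missing cell's column
# is assumed, which has no effect on the frequency.
chr1p12344 = ["AA", "TT", "TT", "AT", "TT", None]

maf = minor_allele_frequency(chr1p12344)
print(maf)          # A appears 3 times out of 10 alleles
print(maf >= 0.45)  # False, so the SNP fails a 45% threshold
```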
Exclude SNPs with more than this number of missing genotypes over all individuals/germplasm. For example, if you enter 1 for this filter then SNPs with more than 1 missing genotype would be removed (SNP Chr4p48765 in the example data below).
Exclude SNPs based on the proportion of missing data. For example, if you enter 25% for this filter then SNPs with a missing data frequency higher than 25% would be removed (SNP Chr4p48765 in the example data below).
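The two missing-data filters differ only in whether the threshold is a count or a proportion. A small sketch of both, using hypothetical names and treating a missing call as None:

```python
def missing_count(calls):
    """Number of missing genotypes at a site."""
    return sum(1 for call in calls if call is None)


def passes_missing_filters(calls, max_missing_count=None, max_missing_fraction=None):
    """True if the site does not exceed either missing-data threshold.

    A threshold left as None is not applied.
    """
    n_missing = missing_count(calls)
    if max_missing_count is not None and n_missing > max_missing_count:
        return False
    if max_missing_fraction is not None and n_missing / len(calls) > max_missing_fraction:
        return False
    return True


# Rows from the example table; the column position of each None is assumed
# and does not change the counts.
chr1p12344 = ["AA", "TT", "TT", "AT", "TT", None]   # 1 of 6 missing
chr4p48765 = ["CC", "AC", "CC", "AA", None, None]   # 2 of 6 missing

print(passes_missing_filters(chr4p48765, max_missing_count=1))        # False
print(passes_missing_filters(chr4p48765, max_missing_fraction=0.25))  # False
print(passes_missing_filters(chr1p12344, max_missing_count=1))        # True
print(passes_missing_filters(chr1p12344, max_missing_fraction=0.25))  # True
```

Chr4p48765 has 2 of 6 genotypes missing (about 33%), so it fails both thresholds, while Chr1p12344 has 1 of 6 missing (about 17%) and passes both.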
Example Table: Example Data for Filter Explanation.
SNP Name | SNP Backbone | SNP Position | Germ1 | Germ2 | Germ3 | Germ4 | Germ5 | Germ6
Chr1p12344 | Chr1 | 12344 | AA:5 | TT:12 | TT:15 | AT:19 | TT:15
Chr2p25678 | Chr2 | 25678 | GG:7 | GG:13 | GG:5 | TT:2 | GG:22 | GT:24
Chr3p34567 | Chr3 | 34567 | AA:5 | CC:12 | AC:7 | TT:15 | CC:19 | TC:23
Chr4p48765 | Chr4 | 48765 | CC:12 | AC:7 | CC:19 | AA:23

* The above example is referred to in the description of each filter criterion to help explain how it will affect your data. NOTE: the cell for each SNP by germplasm combination contains the call and the read depth separated by a colon (:). For example, AA:5 means a call of AA with a read depth of 5.
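The call:depth notation used in the example table can be split programmatically. This is a minimal sketch for the notation as described here, not a VCF parser; `parse_cell` is a hypothetical name, and treating an empty cell as missing is an assumption.

```python
def parse_cell(cell):
    """Split 'AA:5' into ('AA', 5). An empty cell is returned as None."""
    cell = cell.strip()
    if not cell:
        return None
    call, sep, depth = cell.partition(":")
    if not sep or not call or not depth.isdigit():
        raise ValueError(f"Expected call:depth, got {cell!r}")
    return call, int(depth)


print(parse_cell("AA:5"))   # ('AA', 5)
print(parse_cell("GT:24"))  # ('GT', 24)
print(parse_cell(""))       # None
```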

3. Pick your Export format.
Select one of the formats listed below and the filtered VCF will be converted accordingly. Keep in mind that if you choose a format with no quality information, you should have been stringent with your filtering criteria to ensure you are working with good data.
Format | Has Quality Info? | Description
Genotype Matrix | No | Variant by germplasm matrix of genotype per call, in tab-separated values (TSV) format.
Quality Matrix | Yes | Variant by germplasm matrix of read depth per call.
Variant Call Format (VCF) | Yes | A variant by germplasm matrix with each cell containing a combination of SNP call and quality information. See the Specification for more information.
Haplotype Map (Hapmap) | No | A Hapmap file is a tab-separated values (TSV) format for storing genotypic data. Hapmap format is easier to edit and handle but less informative than VCF format. NOTE: This format is only suitable for SNPs and any INDELS will be removed.
Bgzipped VCF | Yes | An archive containing a bgzipped VCF file and a Tabix file. This combination is required by various programs such as the R package VariantAnnotation. See the tabix manual for more information.
ABH Format | No | Alleles are coded as A if they match the maternal parent, B if they match the paternal parent, H for a heterozygous call, and "-" for missing data. NOTE: This format is only suitable for biparental crosses and any SNPs in which the parents are missing, heterozygous, or the same genotypic call will be excluded!
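The ABH coding rules can be illustrated with a short sketch. It is written from the description above and is not the exporter's actual code: `abh_code` is a hypothetical name, the genotypes are made up for illustration, and coding a call that matches neither parent and is not a heterozygote of the two parental alleles as "-" is an assumption.

```python
def abh_code(maternal, paternal, calls):
    """Code one SNP in ABH format.

    Returns None when the SNP should be excluded: a parent is missing,
    a parent is heterozygous, or both parents have the same call.
    """
    def is_het(genotype):
        return genotype[0] != genotype[1]

    if maternal is None or paternal is None:
        return None
    if is_het(maternal) or is_het(paternal) or maternal == paternal:
        return None

    het_alleles = {maternal[0], paternal[0]}
    coded = []
    for call in calls:
        if call is None:
            coded.append("-")   # missing data
        elif call == maternal:
            coded.append("A")   # matches the maternal parent
        elif call == paternal:
            coded.append("B")   # matches the paternal parent
        elif set(call) == het_alleles:
            coded.append("H")   # heterozygous for the two parental alleles
        else:
            coded.append("-")   # assumption: unexpected calls become missing
    return coded


# Illustrative genotypes only.
print(abh_code("AA", "TT", ["AA", "TT", "AT", "TA", None]))  # ['A', 'B', 'H', 'H', '-']
print(abh_code("AA", "AA", ["AA", "AA"]))                    # None (parents identical)
```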
+ The Variant Call Format and VCFtools, Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert Handsaker, Gerton Lunter, Gabor Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin and 1000 Genomes Project Analysis Group, Bioinformatics, 2011.