A: Yes! SnpHub can be applied to any species (plant or animal) as long as the variation data is provided in
VCF format or
Q: Does SnpHub support SNP array data (usually in
.hapmap format) as input instead of VCF format?
A: Yes. SnpHub accepts variation data in both VCF format (generated by re-sequencing, genotyping-by-sequencing, or whole-genome exome capture sequencing) and hapmap format (generated by microarrays). To use hapmap input, set the option
data_type in the configuration file to
hapmap. Also, provide the path to your
.hapmap files for the
Q: Can I skip the pre-processing step? I already have a prepared data set, and I believe I can handle this.
A: In fact, our demo is an example of skipping the pre-processing step. In the default
advanced_config.R, we assigned the file paths directly to the variables. That is also why users need to delete
advanced_config.R, and then rename
advanced_config.R the first time they set up SnpHub with their own data. Your data set should be processed as follows:
- FASTA: The
fasta file should be indexed using samtools:
    # Indexing fasta
    samtools faidx test.fasta
- GFF3: The
GFF3 file should be sorted, zipped by bgzip, and indexed by tabix:
    # Sort
    sort -k1,1 -k4,4n test.gff3 > test.sorted.gff3
    # Zip
    bgzip < test.sorted.gff3 > test.sorted.gff3.gz
    # Index
    tabix -C -p gff test.sorted.gff3.gz
- VCF: The VCF file should be zipped by bgzip, indexed by bcftools, converted into BCF format, and then annotated by SnpEff. The annotated BCF file also needs to be indexed.
    # Zip VCF
    bgzip < test.vcf > test.vcf.gz
    # Index VCF
    bcftools index test.vcf.gz
    # Convert into BCF
    bcftools concat test.vcf.gz -Ob -o unannotated.bcf.gz
    # Index BCF
    bcftools index unannotated.bcf.gz
    # Annotate BCF
    # Assuming that the SnpEff database "SnphubBuilding" has already been built.
    bcftools view unannotated.bcf.gz --threads 4 \
        | java -jar snpEff.jar -t SnphubBuilding - \
        | bcftools view --threads 4 -o output.ann.bcf.gz -Ob
    # Index annotated BCF
    bcftools index output.ann.bcf.gz
- GeneIndex: The file should contain 4 columns, without a header, separated by
\t. Each line describes one gene. The first column contains the chromosome, the second the start position, the third the end position, and the last the name of the gene. You can also check our demo data for a concrete example.
- Other files: All files other than those listed above are as described in the "File Formats" section.
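As an illustration of the GeneIndex layout described above, the following is a minimal sketch of one way such a file could be derived from a GFF3 annotation. It is not part of SnpHub: the function names `make_gene_index` and `check_gene_index` are hypothetical, and the sketch assumes that the gene name should come from the `ID=` attribute of `gene` features and that the GFF3 start and end coordinates can be used unchanged. Adjust both assumptions to match your annotation and the demo data.

```shell
# Hypothetical helper (not part of SnpHub): write a 4-column, tab-separated,
# header-less gene index (chromosome, start, end, gene name) from a GFF3 file.
# Assumption: the gene name is the ID= attribute of features of type "gene".
make_gene_index() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        $0 !~ /^#/ && $3 == "gene" {
            name = "."                      # placeholder if no ID= attribute is found
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/) { name = substr(attrs[i], 4) }
            }
            print $1, $4, $5, name
        }' "$1"
}

# Hypothetical check: succeeds only if every line has exactly 4 tab-separated fields.
check_gene_index() {
    awk -F'\t' 'NF != 4 { bad = 1 } END { exit (bad ? 1 : 0) }' "$1"
}

# Example usage (file names are placeholders):
#   make_gene_index test.gff3 > test.geneindex.tsv
#   check_gene_index test.geneindex.tsv && echo "gene index looks well-formed"
```

The check only verifies the column count; it does not confirm that chromosome names match those in the FASTA and BCF files, which is worth verifying separately.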