5. Questions and Answers

5.1. Application field

Q: Can SnpHub be applied to other species?

A: Yes! SnpHub can be applied to any species (plant or animal), as long as the variation data is provided in VCF format or hapmap format.

5.2. Input/Output formats

Q: Does SnpHub support SNP array data (usually in .hapmap format) as input, instead of VCF?

A: Yes. SnpHub accepts variation data in both VCF format (generated by re-sequencing, genotyping-by-sequencing or whole-genome exome capture sequencing) and hapmap format (generated by microarrays). To use hapmap input, set the option data_type to hapmap in the configuration file setup.conf, and set the option path_hapmap to the path of your .hapmap files.
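As a quick sanity check after editing the configuration, the two options named above can be printed from the file. This is only a minimal sketch: the helper name show_hapmap_options is hypothetical, and it assumes that each option sits on its own line beginning with the option name, which may not match the actual layout of setup.conf.

```shell
# Hypothetical helper (not part of SnpHub): print the lines of a configuration
# file that set data_type or path_hapmap.
# Assumption: each option is on its own line and the line starts with its name.
show_hapmap_options() {
    # Default to setup.conf in the current directory when no file is given.
    grep -E '^[[:space:]]*(data_type|path_hapmap)' "${1:-setup.conf}"
}
```

Running show_hapmap_options setup.conf should list both options; if only one line appears, or none, the other option is probably still missing or commented out.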

5.3. Configuration of SnpHub

Q: Can I skip the pre-processing step? I already have a prepared data set, and I believe I can handle this.

A: Yes. In fact, our demo is an example of skipping the pre-processing step. In the default advanced_config.R, the file paths are assigned directly to the variables. That is also why users need to delete advanced_config.R and then rename advanced_config_O.R to advanced_config.R the first time they set up SnpHub with their own data. Your data set should be prepared as follows:

  • FASTA: The FASTA file should be indexed using samtools.
# Indexing fasta
samtools faidx test.fasta
  • GFF3: The GFF3 file should be sorted, compressed with bgzip and indexed with tabix.
# Sort
sort -k1,1 -k4,4n test.gff3 > test.sorted.gff3
# Zip
bgzip < test.sorted.gff3 > test.sorted.gff3.gz
# Index
tabix -C -p gff test.sorted.gff3.gz
  • VCF: The VCF file should be compressed with bgzip, indexed with bcftools, converted into BCF format, and then annotated with SnpEff. Both the unannotated and the annotated BCF files also need to be indexed.
# Zip VCF
bgzip < test.vcf > test.vcf.gz
# Index VCF
bcftools index test.vcf.gz
# Convert into BCF
bcftools concat test.vcf.gz -Ob -o unannotated.bcf.gz
# Index BCF
bcftools index unannotated.bcf.gz
# Annotate BCF
# Assuming that the SnpEff database "SnphubBuilding" has already been built.
bcftools view unannotated.bcf.gz --threads 4 \
    | java -jar snpEff.jar -t SnphubBuilding - \
    | bcftools view --threads 4 -o output.ann.bcf.gz -Ob
# Index annotated BCF
bcftools index output.ann.bcf.gz
  • GeneIndex: The file should have 4 columns separated by \t and no header. Each line describes one gene: the first column contains the chromosome, the second the start position, the third the end position, and the last the name of the gene. Our demo data provides a working example.

  • Other files: All other files are as described in the "File Formats" section.
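
The GeneIndex layout described above can be derived from a GFF3 file with a short awk sketch like the one below. Both helper names (make_geneindex, check_geneindex) are hypothetical and not part of SnpHub. The sketch assumes that genes are annotated as "gene" features, that the gene name is stored in the ID= attribute, and that the GeneIndex uses the same coordinates as GFF3 columns 4 and 5; adjust these to your annotation if they differ.

```shell
# Hypothetical helper: build a 4-column, tab-separated, header-less GeneIndex
# (chromosome, start, end, gene name) from a GFF3 file.
# Assumption: gene names are taken from the ID= attribute of "gene" features.
make_geneindex() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                       # skip GFF3 comments and directives
        $3 == "gene" {
            name = ""
            n = split($9, attrs, ";")       # column 9 holds key=value attributes
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/) { name = substr(attrs[i], 4) }
            }
            if (name != "") { print $1, $4, $5, name }
        }
    ' "$1"
}

# Hypothetical helper: check that every line has exactly 4 tab-separated
# fields and that the start and end columns are non-negative integers.
# Returns a non-zero status if any line is malformed.
check_geneindex() {
    awk -F'\t' '
        NF != 4 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf("line %d: expected chrom<TAB>start<TAB>end<TAB>gene\n", NR) > "/dev/stderr"
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, make_geneindex test.gff3 > geneindex.tsv followed by check_geneindex geneindex.tsv would write and then validate the index; the output file name geneindex.tsv is only a placeholder.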