5. Building a custom database¶
5.1. Building a custom database with user-provided data¶
Put all your genomes in a separate directory
Create a phylogenetic/taxonomic tree in the newick format, with leaves named the same way as the FASTA files (e.g., a leaf named genome1 will correspond to genome1.fa). Ensure that all leaves have their respective FASTA files.
- Build a ProPhyle index by
prophyle index -A -k <kmer_length> \ -g [<dir_with_genomes>] <tree_1.nw> <index_dir>
The -A parameter will ensure that your tree gets autocompleted, which includes creating names for internal nodes.
5.2. Building a custom database with RefSeq genomes and the NCBI taxonomy¶
Download sequences from NCBI¶
Create a subdirectory in ProPhyle’s main directory, and change the working directory to the newly created one
Download an
assembly_summary.txt
file from NCBI’s FTP server. For RefSeq’s bacterial genomes, follow this link. More information about these files can be found hereSelect genomes of interest; the following command will select all complete genomes, but it is sufficient to add more conditions to
awk
to select a subset. Fields 1, 6 and 20 contain respectively the accession number, taxonomic identifier and ftp directory of each genome.acc2taxid.tsv
will therefore be a tab separated file containing the taxid corresponding to each sequence’s accession number. Please refer to NCBI’s genomes download FAQ for further information). Since strain taxids are deprecated, we are working on a solution to create artificial nodes for each genome. For the moment, you can associate the sequences to the species node, by selecting field 7 instead of 6 in the assembly summary, which corresponds to the species taxid.
awk -F "\t" -v OFS="\t" '$12=="Complete Genome" && $11=="latest"\
{split($1, a, "."); print a[1], $7, $20}' assembly_summary.txt >ftpselection.tsv
cut -f 3 ftpselection.tsv | awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}\
{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' >ftpfilepaths.tsv
cut -f 1,2 ftpselection.tsv >acc2taxid.tsv
Download selected sequences using
parallel --gnu -j 32 -a ftpfilepaths.tsv wget
. Number of jobs (-j
option) should be changed according to the number of available cores and bandwidth.Extract them using
parallel --gnu -j 32 -a ftpfilepaths.tsv gunzip {/}
Create a fasta index for each file using
find . -name '*.fna' | parallel -j 32 samtools faidx {}
Build a tree¶
Build a taxonomic tree for the downloaded sequences using
prophyle_ncbi_tree.py <library_name> <library_main_dir>\
<output_file> acc2taxid.tsv -l <log_file>
library_name
is the name of the subdirectory where the sequences are downloaded (e.g. bacteria, archaea, viral for the standard DBs, or any name of your choice), which will be prepended to the filenames in the tree. library_main_dir
is the directory where all the sequences are downloaded, which usually is the parent of library_name. This should be used at indexing time (the -g parameter of prophyle index), unless the tree is placed in the very same directory.
Taxonomic identifiers are assigned to the sequences first, and then the tree is
built using ETE Toolkit and saved with newick format
1. Necessary node attributes are:
name
: unique node name (for RefSeq DB: the taxid of the node)path
: paths of the sequences’ fasta files, separated by @ (relative paths from ProPhyle’s home directory)
Other optional attributes are taxid
, sci_name
, named_lineage
, lineage
, rank
(more info
here
). Using ETE library or modifying the prophyle_ncbi_tree.py
script it is
easy to adapt any tree to match the requirements above.
Build the index¶
Run the standard command to build ProPhyle index
prophyle index -k <kmer_length> [-g <library_main_dir>] <tree_1.nw> [<tree_2.nw> ...] <index_dir>