4. Building a standard database¶
ProPhyle comes with several genome libraries containing RefSeq genomes, augmented with the NCBI taxonomy.
4.1. Downloading genomes¶
These libraries can be downloaded using prophyle download <library> [<library> ...]
,
where <library>
should be replaced by bacteria
, viruses
, or plasmids
.
The command also copies a prebuild Newick/NHX tree for the specified library.
If the -d parameter is not specified, all files are placed to ~/prophyle.
To download all viral and bacterial genomes from RefSeq, execute
prophyle download bacteria viruses
4.2. Index construction¶
Once a library is downloaded, a ProPhyle index can be constructed using
prophyle index [-g DIR] [-j INT] [-k INT] [-M] [-P] [-K] <tree.nw> [<tree.nw> ...] <index.dir>
<tree.nw> is a Newick/NHX for the index. The trees from the previous command are placed in ~/prophyle and they are called bacteria.nw, viruses.nw, etc. <index.dir> is the directory directory where your index files are going to be placed.
There are multiple other parameters that can be used. -j can be used to specify the number of CPU cores used for index construction (all cores are used otherwise). -k serves to set the k-mer length (31 in default). -M activates low complexity regions filtering using DustMasker. Please, ensure that the program is install (try to run dustmasker). If multiple trees are used, they are going to be merged. Therefore, a name collision can appear. To prevent such a situation, ProPhyle prepends numerical prefixes to the node names (unless -P is used). The -K parameter can be used to deactivate k-LCP array construction. The resulting index would be slightly smaller, but querying would become much slower.
So the entire command for index construction can look, for instance, like this:
prophyle index -k 25 ~/prophyle/bacteria.nw ~/prophyle/viruses.nw my_BV_index
Index construction might take several hours, based on the database size, k and the number of used cores.