9. File formats

9.1. Input data formats

Node names should not contain the @ character, and they should not be equal to A, 0, or root (reserved names).
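The naming rule above can be checked before building an index. The following is only a minimal sketch; the function name is_valid_node_name is hypothetical and not part of ProPhyle.

```python
# Reserved names listed in the rule above.
RESERVED_NAMES = {"A", "0", "root"}


def is_valid_node_name(name):
    """Return True if `name` has no '@' and is not a reserved name.

    Hypothetical helper, written only to illustrate the rule.
    """
    return "@" not in name and name not in RESERVED_NAMES


for candidate in ["n1", "root", "0", "A", "n1@c3"]:
    print(candidate, is_valid_node_name(candidate))
```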

Newick trees

Introduction

The Newick format can be used for index construction in combination with the -A parameter. Names of files with sequences will be inferred from the names of leaves as [node_name].fa. If names of internal nodes are not specified in the original tree, they will be assigned automatically: each internal node takes the lexicographically minimal name among its children, with an incremented ID. Branch lengths are ignored.

Specification

See specifications of Newick on the Phylip website or on Wikipedia.

Examples

A Newick tree with named leaves:

((n1,n2,n3),(n5,n6));

A Newick tree with named nodes:

((n1,n2)o1,(n3,n4,n5)o2)p1;

A Newick tree with automatically assigned names of internal nodes:

((n1,n2,n3)n1-up1,(n4,n5)n4-up1)n1-up2;
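The automatic naming can be illustrated with a short sketch. The rule coded here is inferred only from the example above (take the lexicographically smallest child name and append -up1, or increment an existing -upN suffix); the actual ProPhyle implementation may differ in details. The tree is given as nested Python lists instead of parsed Newick to keep the sketch self-contained, and all function names are hypothetical.

```python
import re


def auto_name(children_names):
    """Derive a parent name from its children's names (rule inferred from the example)."""
    smallest = min(children_names)
    match = re.fullmatch(r"(.*)-up(\d+)", smallest)
    if match:
        # The smallest child was itself auto-named: increment its ID.
        return f"{match.group(1)}-up{int(match.group(2)) + 1}"
    return f"{smallest}-up1"


def name_tree(node):
    """Return (name, newick) for a node given as a leaf name or a list of children."""
    if isinstance(node, str):
        return node, node
    named_children = [name_tree(child) for child in node]
    name = auto_name([child_name for child_name, _ in named_children])
    newick = "(" + ",".join(text for _, text in named_children) + ")" + name
    return name, newick


tree = [["n1", "n2", "n3"], ["n4", "n5"]]
root_name, newick = name_tree(tree)
print(newick + ";")  # ((n1,n2,n3)n1-up1,(n4,n5)n4-up1)n1-up2;
```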

NHX trees

Introduction

The New Hampshire X (NHX) format is parsed using the ETE3 library (see the specification of Format 1).

Specification

NHX attributes
Attribute Type Description
(name) string Node name (typically the TaxID of the node). The names should be unique and must not contain @.
fastapath string Files with genomic sequences, separated by @ (relative paths from the directory of the tree). Only for leaves.
rank string/int Taxonomic rank.
dist float To be ignored (an internal parameter of ETE3).
support float To be ignored (an internal parameter of ETE3).
kmers_full integer Number of k-mers associated with this node. Added automatically during index construction.
kmers_reduced integer Number of k-mers represented by this node. Added automatically during index construction.

Example

Previous tree after autocompleting to NHX:

(((n1:1[&&NHX:dist=1.0:fastapath=n1.fa:support=1.0],n2:1[&&NHX:dist=1.0:fastapath=n2.fa:support=1.0])o1:1[&&NHX:dist=1.0:support=1.0],(n3:1[&&NHX:dist=1.0:fastapath=n3.fa:support=1.0],n4:1[&&NHX:dist=1.0:fastapath=n4.fa:support=1.0],n5:1[&&NHX:dist=1.0:fastapath=n5.fa:support=1.0])o2:1[&&NHX:dist=1.0:support=1.0])p1:0[&&NHX:dist=0.0:support=1.0])merge_root:1[&&NHX:dist=1.0:support=1.0];
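To show how the attributes are laid out inside an NHX comment, here is a minimal standard-library sketch that extracts them from a single node. It is not how ProPhyle reads trees (that is done through ETE3); the function name parse_nhx_comment is hypothetical, and the sketch assumes that attribute values contain no colon.

```python
import re

# Matches the body of an NHX comment such as [&&NHX:dist=1.0:support=1.0].
NHX_COMMENT = re.compile(r"\[&&NHX:([^\]]*)\]")


def parse_nhx_comment(comment_body):
    """Split 'key=value:key=value' into a dict; fastapath becomes a list of paths."""
    attrs = dict(item.split("=", 1) for item in comment_body.split(":") if item)
    if "fastapath" in attrs:
        # Several files for one leaf are separated by '@'.
        attrs["fastapath"] = attrs["fastapath"].split("@")
    return attrs


node_text = "n1:1[&&NHX:dist=1.0:fastapath=n1.fa:support=1.0]"
attrs = parse_nhx_comment(NHX_COMMENT.search(node_text).group(1))
print(attrs)
```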

Sequences

Input sequences can be provided in the FASTA or FASTQ formats. Any non-ACGT characters are treated as unknown nucleotides, and k-mers containing them are therefore discarded. Sequence names are ignored.
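The effect of unknown nucleotides on k-mers can be sketched as follows. This is an illustration only, not ProPhyle's code; the function name valid_kmers is hypothetical, and treating lower-case bases as regular nucleotides is an assumption of this sketch.

```python
def valid_kmers(sequence, k):
    """Return the k-mers of `sequence` that consist of A, C, G, T only."""
    sequence = sequence.upper()  # assumption: lower-case bases count as nucleotides
    kmers = []
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= set("ACGT"):
            kmers.append(kmer)
    return kmers


kmers = valid_kmers("CTTNGTTT", 3)
print(kmers)  # ['CTT', 'GTT', 'TTT']
```

A single N removes every k-mer that overlaps it, so here three of the six 3-mers are discarded.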

9.2. Assignments

Read assignments in SAM/BAM

Introduction

ProPhyle uses SAM/BAM as the main format for reporting the final assignments, i.e., the output of classification.

Specification

ProPhyle SAM headers
Tag Description
HD Version of SAM.
PG Version of ProPhyle.
SQ Description of a leaf. SN: Name of the node. LN: a fake length. UR: Name of the original FASTA file. SP: Name of the species (if present in the tree).

ProPhyle SAM fields
Column Name Description
1 QNAME Query name.
2 FLAG 0 if assigned, 4 otherwise.
3 RNAME Node name.
4 POS 1 if assigned, unused (0) otherwise.
5 MAPQ 60 if assigned, unused (0) otherwise.
6 CIGAR Coverage bit-mask encoded as a CIGAR string if assigned, unused (*) otherwise. For instance, 7=3X3= means 1111111000111.
7 RNEXT Unused (*).
8 PNEXT Unused (0).
9 TLEN Unused (0).
10 SEQ Sequence of bases if -P, unused (*) otherwise.
11 QUAL Base qualities if -P, unused (*) otherwise.
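The field layout above can be read with a few lines of code. The sketch below splits one tab-separated SAM record and interprets FLAG as described in the table; the record itself is a made-up illustration (not real ProPhyle output), and the function name parse_assignment is hypothetical. Optional fields follow the standard SAM TAG:TYPE:VALUE layout.

```python
def parse_assignment(line):
    """Parse one SAM line into a small dict (illustrative sketch)."""
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname, _pos, _mapq, cigar = fields[:6]
    assigned = int(flag) == 0  # 0 if assigned, 4 otherwise
    record = {
        "query": qname,
        "assigned": assigned,
        "node": rname if assigned else None,
        "coverage_cigar": cigar if cigar != "*" else None,
        "tags": {},
    }
    for tag in fields[11:]:
        key, value_type, value = tag.split(":", 2)
        if value_type == "i":
            value = int(value)
        elif value_type == "f":
            value = float(value)
        record["tags"][key] = value
    return record


# Hypothetical record, constructed only to exercise the parser.
example = "read1\t0\tn1\t1\t60\t7=3X3=\t*\t0\t0\t*\t*\tc1:i:10\tii:i:1"
record = parse_assignment(example)
print(record)
```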

ProPhyle SAM tags
Tag Type Description Range
h1 integer Number of shared k-mers. {1, ..., |query|-k+1}
h2 float Proportion of hits in the reference. (0,1]
hf float Proportion of hits in the query. (0,1]
c1 integer Number of covered positions in the query. {k, ..., |query|}
c2 float Normalized number of covered positions in the query. (0,1]
cf float Proportion of covered positions in the query. (0,1]
is integer Number of reported assignments (nodes) for the query. {1, ..., |leaves|}
ii integer ID of the current assignment. {1, ..., is}
hc string Hit bit-mask encoded as a CIGAR string. For instance, 7=1X3= means 11111110111.
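The CIGAR-encoded bit-masks used in the CIGAR column and in the hc tag can be converted to and from plain bit strings. The sketch below assumes that only the operations = (bit 1) and X (bit 0) occur, as in the two examples in this section; the function names are hypothetical.

```python
import re
from itertools import groupby


def cigar_to_mask(cigar):
    """Expand a CIGAR string such as '7=3X3=' into a bit string."""
    return "".join(
        ("1" if op == "=" else "0") * int(count)
        for count, op in re.findall(r"(\d+)([=X])", cigar)
    )


def mask_to_cigar(mask):
    """Run-length encode a bit string back into a CIGAR string."""
    return "".join(
        f"{len(list(run))}{'=' if bit == '1' else 'X'}" for bit, run in groupby(mask)
    )


coverage = cigar_to_mask("7=3X3=")
hits = cigar_to_mask("7=1X3=")
print(coverage)  # 1111111000111
print(hits)      # 11111110111
```

Counting the 1 bits of a decoded mask gives the number of set positions, which is a convenient sanity check against the reported tag values.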

Read assignments in a Kraken-like format

Introduction

ProPhyle uses a format similar to the Kraken output for reporting k-mer matches by ProPhyle Index. It can also use this format for reporting the final assignments.

Specification

Kraken-like format
Column Description
1 C / U (classified / unclassified)
2 Query name
3 Final assignments: a comma-separated list of node names
4 Query length
5 K-mer mappings: a space-delimited list of mappings. A single mapping is of the form comma_delimited_list_of_nodes:length. Pseudo-nodes A and 0 are used for k-mers with a non-ACGT nucleotide and without any mapping, respectively.

Examples

Assigned k-mers, no sequences:

U       read3   0       8       left,right:1 A:3 0:1 right:1

Assigned k-mers, version with sequences and base qualities:

U       read3   0       8       left,right:1 A:3 0:1 right:1    CTTNGTTT        IGIIIIHI
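A record in this format can be parsed as sketched below. The sketch assumes that columns are separated by tabs (the examples above are shown with expanded whitespace) and that the mappings cover every k-mer of the query, in which case k can be recovered as query length minus the number of k-mers plus 1. The function name parse_kraken_like is hypothetical.

```python
def parse_kraken_like(line):
    """Parse one tab-separated Kraken-like record (illustrative sketch)."""
    columns = line.rstrip("\n").split("\t")
    status, query, assignments, length, mappings = columns[:5]
    parsed = []
    for item in mappings.split():
        nodes, count = item.rsplit(":", 1)
        parsed.append((nodes.split(","), int(count)))
    return {
        "classified": status == "C",
        "query": query,
        "assignments": assignments.split(",") if status == "C" else [],
        "length": int(length),
        "mappings": parsed,
        "extra": columns[5:],  # sequence and base qualities, if present
    }


line = "U\tread3\t0\t8\tleft,right:1 A:3 0:1 right:1\tCTTNGTTT\tIGIIIIHI"
record = parse_kraken_like(line)
kmer_count = sum(count for _, count in record["mappings"])
k = record["length"] - kmer_count + 1
print(record["mappings"], k)
```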

9.3. Abundance estimates (experimental)

Abundances in the Kraken report format

Introduction

Specification

kraken-report format:

Kraken report format
Column Description
1 Percentage of reads covered by the clade rooted at this taxon
2 Number of reads covered by the clade rooted at this taxon
3 Number of reads assigned directly to this taxon
4 A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'.
5 NCBI taxonomy ID
6 Indented scientific name

Abundances in the MetaPhlAn2 report format

Introduction

MetaPhlAn2 is a computational tool for profiling the composition of microbial communities from metagenomic sequencing data.

Specification

MetaPhlAn2 report format

MetaPhlAn2 report format
Column Description
1 Clades, ranging from taxonomic kingdoms (Bacteria, Archaea, etc.) through species
2 The taxonomic level of each clade is prefixed to indicate its level: Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__

Since sequence-based profiling is relative and does not provide absolute cellular abundance measures, clades are hierarchically summed. Each level will sum to 100%; that is, the sum of all kingdom-level clades is 100%, the sum of all genus-level clades (including unclassified) is also 100%, and so forth. OTU equivalents can be extracted by using only the species-level s__ clades from this file (again, making sure to include clades unclassified at this level).
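The hierarchical summing can be checked with a short sketch. It assumes that lineages are written with | between levels, as in MetaPhlAn2 output, and that each row pairs a clade with its relative abundance in percent. The rows below are hypothetical values chosen only to illustrate the per-level totals, and the function names are not part of any tool.

```python
def clade_level(clade):
    """Return the level prefix (k, p, c, o, f, g, s) of the deepest clade in a lineage."""
    return clade.split("|")[-1].split("__", 1)[0]


def level_totals(rows):
    """Sum abundances per taxonomic level."""
    totals = {}
    for clade, abundance in rows:
        level = clade_level(clade)
        totals[level] = totals.get(level, 0.0) + abundance
    return totals


# Hypothetical profile, for illustration only.
rows = [
    ("k__Bacteria", 100.0),
    ("k__Bacteria|p__Firmicutes", 60.0),
    ("k__Bacteria|p__Proteobacteria", 40.0),
]
totals = level_totals(rows)
print(totals)  # {'k': 100.0, 'p': 100.0}
```

Filtering rows with clade_level(clade) == "s" would give the species-level clades mentioned above.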

Abundances in the Centrifuge report format

Introduction

Centrifuge format.

Specification

Centrifuge format
Column Description
1 name of a genome, or the name corresponding to a taxonomic ID (the second column) at a rank higher than the strain (e.g., Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis).
2 taxonomic ID (e.g., 36870).
3 taxonomic rank (e.g., leaf).
4 number of k-mers propagated up to the node (e.g., 703004).
5 number of reads classified to this node including multi-classified reads (divided by the number of assignments, e.g., 5981.37)
6 number of reads uniquely classified to this genomic sequence (e.g., 5964)
7 unused

Example

#name                                                           taxID   taxRank    kmerCount   numReads   numUniqueReads   abundance
Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis 36870   leaf       703004      5981.37    5964             0
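Because the name column may contain spaces, a reader of this format has to split from the right. The following sketch parses the example row above under that assumption; the function name parse_centrifuge_row and the dictionary keys are hypothetical.

```python
def parse_centrifuge_row(line):
    """Parse one report row; the name may contain spaces, so split from the right."""
    name, tax_id, rank, kmers, reads, unique_reads, abundance = (
        line.rstrip("\n").rsplit(None, 6)
    )
    return {
        "name": name.strip(),
        "tax_id": tax_id,
        "rank": rank,
        "kmer_count": int(kmers),
        "num_reads": float(reads),
        "num_unique_reads": int(unique_reads),
        "abundance": abundance,  # unused column, kept as text
    }


row = parse_centrifuge_row(
    "Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis "
    "36870   leaf       703004      5981.37    5964             0"
)
print(row)
```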

9.4. Internal ProPhyle formats

ProPhyle Index

Introduction

A ProPhyle index directory contains a BWA index, a k-LCP array, and several auxiliary files.

Specification

ProPhyle index
File name Description
index.fa Assembled contigs; names of sequences have the following format: [node_name]@c[contig_id]
index.fa.amb List of ambiguous nucleotides, no values
index.fa.ann List of contigs and their starting positions in the master string
index.fa.[k].klcp k-LCP array
index.fa.bwt Burrows-Wheeler Transform of the master string (merged sequences + reverse complement) + OCC table (BWA format)
index.fa.kmers.tsv k-mer statistics, format: [node_name].[full|reduced].fa        [#kmers], where full refers to all associated k-mers and reduced to represented k-mers
index.fa.pac Packed sequences (BWA format)
index.fa.sa Sampled suffix array (BWA format)
index.json Index parameters: k-mer size (k), ProPhyle version (prophyle-version, prophyle-revision, prophyle-commit)
log.txt Log
tree.nw Phylogenetic tree adjusted for classification
tree.preliminary.nw Phylogenetic tree before adjusting
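Two small helpers illustrate the conventions in the table: building the list of expected file names for a given k, and splitting a contig name back into node name and contig ID. This is a sketch with hypothetical function names. It relies on node names not containing @, and it keeps the contig ID as text because its exact form is not specified here.

```python
import os


def expected_index_files(k):
    """File names listed in the table above, for k-mer size `k`."""
    return [
        "index.fa", "index.fa.amb", "index.fa.ann", f"index.fa.{k}.klcp",
        "index.fa.bwt", "index.fa.kmers.tsv", "index.fa.pac", "index.fa.sa",
        "index.json", "log.txt", "tree.nw", "tree.preliminary.nw",
    ]


def missing_index_files(index_dir, k):
    """Return the expected files that are not present in `index_dir`."""
    return [
        name for name in expected_index_files(k)
        if not os.path.isfile(os.path.join(index_dir, name))
    ]


def split_contig_name(contig_name):
    """Split '[node_name]@c[contig_id]' into (node_name, contig_id)."""
    node_name, separator, contig_id = contig_name.rpartition("@c")
    if not separator or not node_name:
        raise ValueError(f"not a contig name: {contig_name!r}")
    return node_name, contig_id


print(split_contig_name("n1@c12"))  # ('n1', '12')
```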

Compressed ProPhyle index for transmission

Introduction

ProPhyle can create a .tar.gz archive with a subset of the index files from which the original index can be derived.

Specification

The archive contains the following subset of the original index files:

Compressed ProPhyle index
File name Description
index.fa.amb Identical
index.fa.ann Identical
index.fa.bwt Burrows-Wheeler Transform without the OCC table (BWA format, before bwa bwtupdate)
index.fa.kmers.tsv Identical
index.json Identical
tree.nw Identical
tree.preliminary.nw Identical
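An archive can be compared against the list above with the standard tarfile module. The sketch below is an illustration with hypothetical function names. It compares base names only, because the directory layout inside the archive is not specified here, and the demo builds a throwaway archive with empty members that deliberately omits tree.nw.

```python
import io
import os
import tarfile
import tempfile

TRANSMITTED_FILES = {
    "index.fa.amb", "index.fa.ann", "index.fa.bwt", "index.fa.kmers.tsv",
    "index.json", "tree.nw", "tree.preliminary.nw",
}


def check_archive(path):
    """Return (missing, unexpected) file names, compared by base name."""
    with tarfile.open(path, "r:gz") as archive:
        names = {
            member.name.rsplit("/", 1)[-1]
            for member in archive.getmembers() if member.isfile()
        }
    return sorted(TRANSMITTED_FILES - names), sorted(names - TRANSMITTED_FILES)


def demo():
    """Build a throwaway archive that lacks tree.nw and check it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.tar.gz")
        with tarfile.open(path, "w:gz") as archive:
            for name in sorted(TRANSMITTED_FILES - {"tree.nw"}):
                info = tarfile.TarInfo(name="index/" + name)
                archive.addfile(info, io.BytesIO(b""))
        return check_archive(path)


missing, unexpected = demo()
print(missing, unexpected)  # ['tree.nw'] []
```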