8. File formats

8.1. Input data formats

Node names must not contain the @ character and must not be equal to A, 0, or root (reserved names).
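
The two naming rules can be checked before building an index. The following is a minimal sketch, not ProPhyle code; the names RESERVED_NAMES and is_valid_node_name are hypothetical and introduced only for illustration.

```python
# Reserved names and the forbidden character come from the rule stated above.
RESERVED_NAMES = {"A", "0", "root"}


def is_valid_node_name(name: str) -> bool:
    """Return True if the name has no '@' and is not a reserved name."""
    return "@" not in name and name not in RESERVED_NAMES


if __name__ == "__main__":
    for candidate in ["n1", "root", "n1@c1"]:
        print(candidate, is_valid_node_name(candidate))
```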

Newick trees

Introduction

The Newick format can be used for index construction in combination with the -A parameter. Names of files with sequences will be inferred from the names of leaves as [node_name].fa. If names of internal nodes are not specified in the original tree, they will be assigned automatically as the lexicographically minimal name among the children's names with an incremented ID. Branch lengths are ignored.

Specification

See specifications of Newick on the Phylip website or on Wikipedia.

Examples

A Newick tree with named leaves:

((n1,n2,n3),(n5,n6));

A Newick tree with named nodes:

((n1,n2)o1,(n3,n4,n5)o2)p1;

A Newick tree with automatically assigned names of internal nodes:

((n1,n2,n3)n1-up1,(n4,n5)n4-up1)n1-up2;
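
The automatic naming can be sketched as follows. This is an interpretation of the rule inferred from the example above (take the lexicographically smallest child name and append -up1, or increment the number if the name already ends in -upN); it is not the ProPhyle implementation. The tree is given as nested Python lists, and the names auto_name and to_newick are hypothetical.

```python
import re

# Matches names that already carry an "-upN" suffix, e.g. "n1-up1".
UP_SUFFIX = re.compile(r"^(.*)-up(\d+)$")


def auto_name(child_names):
    """Derive a parent name from the lexicographically smallest child name."""
    smallest = min(child_names)
    match = UP_SUFFIX.match(smallest)
    if match:
        return f"{match.group(1)}-up{int(match.group(2)) + 1}"
    return f"{smallest}-up1"


def to_newick(node):
    """Return (newick_fragment, node_name); a leaf is a str, an internal node a list."""
    if isinstance(node, str):
        return node, node
    parts = [to_newick(child) for child in node]
    name = auto_name([child_name for _, child_name in parts])
    fragment = "(" + ",".join(text for text, _ in parts) + ")" + name
    return fragment, name


tree = [["n1", "n2", "n3"], ["n4", "n5"]]
newick = to_newick(tree)[0] + ";"
print(newick)
```

For the tree used in the examples, this reproduces the names n1-up1, n4-up1, and n1-up2.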

NHX trees

Introduction

New Hampshire X Format is parsed using the ETE3 library (see specification of Format 1).

Specification

NHX attributes

Attribute       Type         Description
(name)          string       Node name (typically the TaxID of the node). The names should be unique and must not contain @.
path            string       Files with genomic sequences, separated by @ (relative paths from the directory of the tree). Only for leaves.
fastapath       string       Deprecated (use path instead).
rank            string/int   Taxonomic rank.
dist            float        To be ignored (an internal parameter of ETE3).
support         float        To be ignored (an internal parameter of ETE3).
kmers_full      integer      Number of k-mers associated with this node. Added automatically during index construction.
kmers_reduced   integer      Number of k-mers represented by this node. Added automatically during index construction.

Example

Previous tree after autocompleting to NHX:

(((n1:1[&&NHX:dist=1.0:path=n1.fa:support=1.0],n2:1[&&NHX:dist=1.0:path=n2.fa:support=1.0])o1:1[&&NHX:dist=1.0:support=1.0],(n3:1[&&NHX:dist=1.0:path=n3.fa:support=1.0],n4:1[&&NHX:dist=1.0:path=n4.fa:support=1.0],n5:1[&&NHX:dist=1.0:path=n5.fa:support=1.0])o2:1[&&NHX:dist=1.0:support=1.0])p1:0[&&NHX:dist=0.0:support=1.0])merge_root:1[&&NHX:dist=1.0:support=1.0];
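
To illustrate how the attributes are laid out, the following sketch extracts the NHX attributes of each node with a regular expression and splits path on @. ProPhyle itself parses NHX with ETE3; this standard-library version is only an approximation that assumes every node carries a [&&NHX:...] block and that attribute values contain no colons. The function name nhx_attributes is hypothetical.

```python
import re

# node_name ":" branch_length "[&&NHX:" key=value pairs separated by ":" "]"
NODE_RE = re.compile(r"([^(),:\[\];]+):[^\[\](),]*\[&&NHX:([^\]]*)\]")


def nhx_attributes(nhx):
    """Map each node name to its NHX attributes; path becomes a list of files."""
    result = {}
    for name, body in NODE_RE.findall(nhx):
        attrs = dict(item.split("=", 1) for item in body.split(":") if item)
        if "path" in attrs:
            attrs["path"] = attrs["path"].split("@")
        result[name] = attrs
    return result


# A fragment of the example tree above (subtree o1 only).
fragment = (
    "(n1:1[&&NHX:dist=1.0:path=n1.fa:support=1.0],"
    "n2:1[&&NHX:dist=1.0:path=n2.fa:support=1.0])"
    "o1:1[&&NHX:dist=1.0:support=1.0];"
)
nodes = nhx_attributes(fragment)
leaf_files = {name: a["path"] for name, a in nodes.items() if "path" in a}
print(leaf_files)
```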

Sequences

Input sequences can be provided in the FASTA or FASTQ formats. Any non-ACGT characters are treated as unknown nucleotides, and k-mers containing them are therefore discarded. Sequence names are ignored.
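
The effect of unknown nucleotides on k-mer extraction can be sketched as follows. This is an illustration rather than ProPhyle's implementation; the function name valid_kmers is hypothetical, and converting the sequence to upper case first is an assumption.

```python
def valid_kmers(sequence, k):
    """Return the k-mers of the sequence that consist only of A, C, G, T."""
    sequence = sequence.upper()  # assumption: lower-case bases are accepted
    kmers = []
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= set("ACGT"):
            kmers.append(kmer)
    return kmers


# With k = 3, the three k-mers overlapping the N are dropped.
print(valid_kmers("CTTNGTTT", 3))
```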

8.2. Assignments

Read assignments in SAM/BAM

Introduction

ProPhyle uses SAM/BAM as the main format for reporting the final assignments, i.e., the output of classification.

Specification

ProPhyle SAM headers

Tag   Description
HD    Version of SAM.
PG    Version of ProPhyle.
SQ    Description of a leaf. SN: Name of the node. LN: a fake length. UR: Name of the original FASTA file. SP: Name of the species (if present in the tree).


ProPhyle SAM fields

Column   Name    Description
1        QNAME   Query name.
2        FLAG    0 if assigned, 4 otherwise.
3        RNAME   Node name.
4        POS     1 if assigned, unused (0) otherwise.
5        MAPQ    60 if assigned, unused (0) otherwise.
6        CIGAR   Coverage bit-mask encoded as a CIGAR string if assigned, unused (*) otherwise. For instance, 7=3X3= means 1111111000111.
7        RNEXT   Unused (*).
8        PNEXT   Unused (0).
9        TLEN    Unused (0).
10       SEQ     Sequence of bases if -P, unused (*) otherwise.
11       QUAL    Base qualities if -P, unused (*) otherwise.
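
The CIGAR-style encoding of a bit-mask is a run-length encoding in which = stands for a run of ones and X for a run of zeros. The following sketch converts in both directions; the function names are hypothetical and the code is not taken from ProPhyle.

```python
import re
from itertools import groupby


def cigar_to_mask(cigar):
    """Expand e.g. '7=3X3=' into '1111111000111'."""
    runs = re.findall(r"(\d+)([=X])", cigar)
    return "".join(("1" if op == "=" else "0") * int(length) for length, op in runs)


def mask_to_cigar(mask):
    """Run-length encode a string of 0/1 characters into =/X operations."""
    return "".join(
        f"{len(list(run))}{'=' if bit == '1' else 'X'}" for bit, run in groupby(mask)
    )


print(cigar_to_mask("7=3X3="))
print(mask_to_cigar("1111111000111"))
```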


ProPhyle SAM tags

Tag   Type      Description                                             Range
ln    integer   Read length.
h1    integer   Number of shared k-mers.                                {1, ..., |query|-k+1}
h2    float     Proportion of hits in the reference.                    (0,1]
hf    float     Proportion of hits in the query.                        (0,1]
c1    integer   Number of covered positions in the query.               {k, ..., |query|}
c2    float     Normalized number of covered positions in the query.    (0,1]
cf    float     Proportion of covered positions in the query.           (0,1]
is    integer   Number of reported assignments (nodes) for the query.   {1, ..., |leaves|}
ii    integer   ID of the current assignment.                           {1, ..., is}
hc    string    Hit bit-mask encoded as a CIGAR string. For instance, 7=1X3= means 11111110111.
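
A record in this format can be read with a few lines of code. The sketch below relies only on the general SAM convention that optional fields follow column 11 and are written as TAG:TYPE:VALUE, with i, f, and Z denoting integers, floats, and strings. The record itself is invented to exercise the parser and is not real ProPhyle output; the function name parse_assignment is hypothetical.

```python
# SAM type codes for optional fields and the Python type used for each.
CASTS = {"i": int, "f": float, "Z": str}


def parse_assignment(sam_line):
    """Split one SAM record into the columns used by ProPhyle and its tags."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for item in fields[11:]:
        tag, type_code, value = item.split(":", 2)
        tags[tag] = CASTS.get(type_code, str)(value)
    return {
        "query": fields[0],
        "assigned": (int(fields[1]) & 4) == 0,  # FLAG 0 if assigned, 4 otherwise
        "node": fields[2],
        "coverage_cigar": fields[5],
        "tags": tags,
    }


# Hypothetical record; all values are made up for illustration.
record = "\t".join([
    "read1", "0", "n1", "1", "60", "7=6X", "*", "0", "0", "*", "*",
    "ln:i:13", "h1:i:5", "c1:i:7", "is:i:1", "ii:i:1", "hc:Z:5=6X",
])
parsed = parse_assignment(record)
print(parsed["node"], parsed["tags"])
```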


Read assignments in a Kraken-like format

Introduction

ProPhyle uses a format similar to the Kraken output for reporting k-mer matches by ProPhyle Index. It can also use this format for reporting the final assignments.

Specification

Kraken-like format

Column   Description
1        C / U (classified / unclassified)
2        Query name
3        Final assignments: a comma-separated list of node names
4        Query length
5        K-mer mappings: a space-delimited list of mappings. A single mapping is of the form comma_delimited_list_of_nodes:length. Pseudo-nodes A and 0 are used for k-mers with a non-ACGT nucleotide and without any mapping, respectively.

Examples

Assigned k-mers, no sequences:

U       read3   0       8       left,right:1 A:3 0:1 right:1

Assigned k-mers, version with sequences and base qualities:

U       read3   0       8       left,right:1 A:3 0:1 right:1    CTTNGTTT        IGIIIIHI
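
A line in this format can be decomposed as in the following sketch. It assumes that the columns are separated by tabs (the examples above are shown with runs of spaces) and uses a hypothetical function name, parse_kraken_like. Since the block lengths add up to the number of k-mers in the query, |query|-k+1, the example read corresponds to k = 3.

```python
def parse_kraken_like(line):
    """Parse one line of the Kraken-like format into a dictionary."""
    cols = line.rstrip("\n").split("\t")  # assumption: tab-separated columns
    status, query, assignments, length, mappings = cols[:5]
    blocks = []
    for item in mappings.split():
        nodes, run_length = item.rsplit(":", 1)
        blocks.append((nodes.split(","), int(run_length)))
    return {
        "classified": status == "C",
        "query": query,
        "assignments": assignments.split(","),
        "length": int(length),
        "blocks": blocks,
        "extra": cols[5:],  # sequence and base qualities, if present
    }


line = "U\tread3\t0\t8\tleft,right:1 A:3 0:1 right:1"
parsed = parse_kraken_like(line)
kmer_count = sum(run_length for _, run_length in parsed["blocks"])
print(parsed["blocks"], kmer_count)
```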

8.3. Abundance estimates (experimental)

Abundances in the Kraken report format

Introduction

Specification

Kraken report format

Column   Description
1        Percentage of reads covered by the clade rooted at this taxon
2        Number of reads covered by the clade rooted at this taxon
3        Number of reads assigned directly to this taxon
4        A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'.
5        NCBI taxonomy ID
6        Indented scientific name

Abundances in the MetaPhlAn2 report format

Introduction

MetaPhlAn2 is a computational tool for profiling the composition of microbial communities from metagenomic sequencing data.

Specification

MetaPhlAn2 report format

Column   Description
1        Clades, ranging from taxonomic kingdoms (Bacteria, Archaea, etc.) through species
2        The taxonomic level of each clade is prefixed to indicate its level: Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__

Since sequence-based profiling is relative and does not provide absolute cellular abundance measures, clades are hierarchically summed. Each level will sum to 100%; that is, the sum of all kingdom-level clades is 100%, the sum of all genus-level clades (including unclassified) is also 100%, and so forth. OTU equivalents can be extracted by using only the species-level s__ clades from this file (again, making sure to include clades unclassified at this level).
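
The hierarchical summation can be checked with a short script. The sketch below assumes the usual MetaPhlAn2 convention that the levels of a clade name are joined by | and groups abundances by the prefix of the last level. The profile is a toy example with invented values, and the function name level_sums is hypothetical.

```python
from collections import defaultdict


def level_sums(profile):
    """Sum relative abundances (in percent) per taxonomic level prefix."""
    totals = defaultdict(float)
    for clade, abundance in profile.items():
        level = clade.split("|")[-1][:3]  # e.g. "k__", "p__", "s__"
        totals[level] += abundance
    return dict(totals)


# Toy profile with invented values, for illustration only.
profile = {
    "k__Bacteria": 100.0,
    "k__Bacteria|p__Firmicutes": 60.0,
    "k__Bacteria|p__Proteobacteria": 40.0,
}
sums = level_sums(profile)
print(sums)
```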

Abundances in the Centrifuge report format

Introduction

Centrifuge format.

Specification

Centrifuge format

Column   Description
1        name of a genome, or the name corresponding to a taxonomic ID (the second column) at a rank higher than the strain (e.g., Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis).
2        taxonomic ID (e.g., 36870).
3        taxonomic rank (e.g., leaf).
4        number of k-mers propagated up to the node (e.g., 703004).
5        number of reads classified to this node including multi-classified reads (divided by the number of assignments, e.g., 5981.37)
6        number of reads uniquely classified to this genomic sequence (e.g., 5964)
7        unused

Example

#name                                                           taxID   taxRank    kmerCount   numReads   numUniqueReads   abundance
Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis 36870   leaf       703004      5981.37    5964             0
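
Because the name in the first column may contain spaces, a row is easiest to split from the right. The following sketch does so for the example row; it assumes that none of the last six columns contains whitespace, and the function name parse_centrifuge_row is hypothetical.

```python
def parse_centrifuge_row(line):
    """Split a report row into its seven columns, splitting from the right."""
    name, tax_id, rank, kmers, reads, unique, abundance = (
        line.rstrip("\n").rsplit(None, 6)
    )
    return {
        "name": name,
        "taxID": int(tax_id),
        "taxRank": rank,
        "kmerCount": int(kmers),
        "numReads": float(reads),
        "numUniqueReads": int(unique),
        "abundance": float(abundance),  # unused column
    }


row = (
    "Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis "
    "36870   leaf       703004      5981.37    5964             0"
)
parsed_row = parse_centrifuge_row(row)
print(parsed_row["name"], parsed_row["taxID"])
```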

8.4. Internal ProPhyle formats

ProPhyle Index

Introduction

A ProPhyle index directory contains a BWA index, a k-LCP array, and several auxiliary files.

Specification

ProPhyle index

File name             Description
index.fa              Assembled contigs; names of sequences have the following format: [node_name]@c[contig_id]
index.fa.amb          List of ambiguous nucleotides, no values
index.fa.ann          List of contigs and their starting positions in the master string
index.fa.[k].klcp     k-LCP array
index.fa.bwt          Burrows-Wheeler Transform of the master string (merged sequences + reverse complement) + OCC table (BWA format)
index.fa.kmers.tsv    k-mer statistics, format: [node_name].[full|reduced].fa        [#kmers], where full refers to all associated k-mers and reduced to represented k-mers
index.fa.pac          Packed sequences (BWA format)
index.fa.sa           Sampled suffix array (BWA format)
index.json            Index parameters: k-mer size (k), ProPhyle version (prophyle-version, prophyle-revision, prophyle-commit)
log.txt               Log
tree.nw               Phylogenetic tree adjusted for classification
tree.preliminary.nw   Phylogenetic tree before adjusting
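
The two naming patterns used inside the index can be taken apart as sketched below. The code assumes that the two columns of index.fa.kmers.tsv are separated by a tab; the function names and the example values are hypothetical.

```python
def parse_contig_name(name):
    """Split '[node_name]@c[contig_id]' into (node_name, contig_id)."""
    node, _, contig_id = name.rpartition("@c")
    return node, contig_id


def parse_kmers_row(line):
    """Split '[node_name].[full|reduced].fa<TAB>[#kmers]' into its parts."""
    file_name, count = line.rstrip("\n").split("\t")
    if not file_name.endswith(".fa"):
        raise ValueError(f"unexpected file name: {file_name}")
    node, kind = file_name[: -len(".fa")].rsplit(".", 1)
    return node, kind, int(count)


# Hypothetical values, for illustration only.
print(parse_contig_name("n1@c12"))
print(parse_kmers_row("n1.reduced.fa\t1234"))
```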

Compressed ProPhyle index for transmission

Introduction

ProPhyle can create a .tar.gz archive with a subset of the index files so that the original index can be derived.

Specification

The archive contains the following subset of the original index files:

Compressed ProPhyle index

File name             Description
index.fa.amb          Identical
index.fa.ann          Identical
index.fa.bwt          Burrows-Wheeler Transform without the OCC table (BWA format, before bwa bwtupdate)
index.fa.kmers.tsv    Identical
index.json            Identical
tree.nw               Identical
tree.preliminary.nw   Identical
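
Whether an archive contains all files listed above can be verified with the standard tarfile module. The sketch below is not part of ProPhyle; it compares base names only, because the directory layout inside the archive is not specified here, and the function name missing_files is hypothetical.

```python
import os
import tarfile

# File names taken from the table above.
EXPECTED = {
    "index.fa.amb",
    "index.fa.ann",
    "index.fa.bwt",
    "index.fa.kmers.tsv",
    "index.json",
    "tree.nw",
    "tree.preliminary.nw",
}


def missing_files(archive):
    """Return expected file names absent from a .tar.gz given as a file object."""
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        present = {
            os.path.basename(member.name)
            for member in tar.getmembers()
            if member.isfile()
        }
    return EXPECTED - present
```

A typical call would open the archive in binary mode, for instance with open(path, "rb"), and pass the resulting file object; an empty result means that all listed files are present.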