Filtering

architeuthis is also capable of scoring and filtering reads on several metrics that go a little father than the already available confidence scoring in Kraken2.

Post-filtering reads

You can use the mapping filter to keep only those reads in a Kraken run that have high quality mappings and drop the rest. This can sometimes avoid common issues such as cross-domain mapping.

A diverse Kraken DB

This will work best when using the most diverse Kraken DB available, meaning a database that includes as many of the organisms in the habitat as feasible.

The supported metrics are:

Consistency: Consistency denotes the fraction of kmer-level taxonomy assignments that are contained in the final read classification. So it qunatifies how many of the alternative classifications are contained in a single phylogentic path.
Multiplicity: Multiplicity is the number of unique kmers assignments at the same rank r as the read assignment. So a read classified on the species level with a multiplicity of 2 matches 2 unique species.
Entropy: Entropy is the Shannon index of the kmer assignments on the same rank as the final read assignment. So it measures how surprising or distinct the alternative classifications are. One can see it as an abundance weighted multiplicity.

Usage

To filter reads on one or more metric use

architeuthis mapping filter \
    --min-consistency 0.9 \
    --max-multiplicity 2 \
    --max-entropy 0.1 \
    --out filtered.k2 \
    my_sample.k2

This example is using the default parameters which should lead to fairly high quality read classifications. The output is a valid Kraken output and a strict subset of the input file. mapping filter supports the --data-dir option (see below).

Scoring reads

It is also possible to only output the metrics for all classified reads in a sample.

Usage

architeuthis mapping score my_sample.k2 --out scores.csv

Output

sample_id,read_id,taxid,name,rank,n_kmers,consistency,confidence,multiplicity,entropy
testdata/negative,165179_NZ_CP102288.1_598818_598628_1_0_0_0_0:0:0_0:0:0_f59,165179,s__Segatella copri,s,153,1,1,1,0
testdata/negative,47678_NZ_CP081920.1_2131436_2131626_0_1_0_0_0:0:0_0:0:0_4749,816,g__Bacteroides,g,145,1,1,1,0
testdata/negative,821_NZ_CP103067.1_1529728_1529923_0_1_0_0_2:0:0_1:0:0_4c09,909656,g__Phocaeicola,g,68,1,1,1,0
testdata/negative,299767_NZ_CP099310.1_741506_741350_1_0_0_0_1:0:0_0:0:0_162,547,g__Enterobacter,g,82,0.975609756097561,0.96,2,0.167944147734173
testdata/negative,328813_NZ_AP019738.1_1486487_1486329_1_0_0_0_0:0:0_2:0:0_2ffc,328813,s__Alistipes onderdonkii,s,116,1,1,1,0
testdata/negative,418240_NZ_CP102267.1_4677848_4677952_0_1_0_0_0:0:0_3:0:0_66a6,1121115,s__Blautia wexlerae,s,185,1,1,1,0
testdata/negative,562_NZ_CP038408.1_5034459_5034717_0_1_0_0_0:0:0_0:0:0_f,543,f__Enterobacteriaceae,f,183,1,1,1,0
testdata/negative,821_NZ_CP103067.1_3126223_3126146_1_0_0_0_0:0:0_2:0:0_a0a1,909656,g__Phocaeicola,g,161,1,1,1,0
testdata/negative,821_NZ_CP043529.1_3737695_3737718_0_1_0_0_0:0:0_1:0:0_247c,821,s__Phocaeicola vulgatus,s,135,1,1,1,0
testdata/negative,46503_NZ_CP085927.1_2378513_2378638_0_1_0_0_1:0:0_1:0:0_8897,46503,s__Parabacteroides merdae,s,137,1,1,1,0
testdata/negative,39486_NZ_CP102279.1_1096816_1096614_1_0_0_0_1:0:0_1:0:0_2649,186803,f__Lachnospiraceae,f,108,1,1,1,0
testdata/negative,820_NZ_CP072255.1_61761_61514_1_0_0_0_0:0:0_1:0:0_1e488,820,s__Bacteroides uniformis,s,176,1,1,1,0
[...]

This also reports the Kraken confidence score using the provided taxonomy dump.

Specifying the NCBI Taxonomy dump

You can use any downloaded NCBI Taxonomy dump for lineage annotation by specifying the --data-dir option, for instance:

architeuthis mapping score --data-dir /my/taxdump/ my_sample.k2 --out scores.csv

Specifying the lineage format

You can specify the lineage format using the taxonkit syntax. The defualt lineage format uses all ranks from phylum to species with a prepended modified domain rank which will either be Bacteria, Archaea, Virus or Eukaryotes and with rank prefixes (like d__Bacteria). However, you could change this. For instance, to only keep genus and species without prefixes:

architeuthis mapping score --format "{genus};{species}" my_sample.k2 --out scores.csv