MetaCache is a classification system for mapping (short or long) reads from metagenomic samples to their most likely taxon of origin. It uses locality sensitive hashing to quickly identify candidate regions within one or multiple reference genomes. A read is then classified based on the similarity to those regions.
Clone MetaCache’s Github repository and compile the executable:
git clone https://github.com/muellan/metacache.git
cd metacache
make MACROS="-DMC_TARGET_ID_TYPE=uint32_t"
The data type configuration in the last line makes sure that MetaCache can handle up to 4,294,967,295 reference sequences per database.
This is necessary because many eukaryotic reference genome files contain a lot of genomic sequences.
If you know that your database will include less than 65,535 sequences you can safe memory by just calling make
without the MACROS=
part.
The following snippet downloads the NCBI taxonomic hierarchy and builds a database afs
from all genomes in genomes_folder
. The reference sequence to taxon id mapping is supplied in the file taxa.accession2taxid
which
conforms to the NCBI’s accession2taxid
format.
download-ncbi-taxonomy ncbi_tax
metacache build afs genomes_folder \
-taxonomy ncbi_tax -taxpostmap taxa.accession2taxid \
-remove-overpopulated-features
It is important that you supply the option -remove-overpopulated-features
if you add large eukaryotic genomes. This will improve classification accuracy and runtime performance.
This queries all reads in all FASTA and FASTQ files in myreads_folder
against database afs
. The individual read mappings are written to read_mappings.txt
and the abundance analysis to abund.txt
.
metacache query afs myreads_folder -pairfiles \
-max-cand 4 -hitmin 4 -hitdiff 80 \
-out read_mappings.txt -split-out -abundances abund.txt -abundance-per species
The option -pairfiles
means that each mate of a read pair will be read from a separate file (e.g., fileA_1.fa
+ fileA_2.fa
, fileB_1.fa
+ fileB_2.fa
, …). If both mates of your paired-end reads are in one file (1+2, 3+4, …) then use the option -pairseq
instead.
If the option -split-out
is given, mapping and abundance results will be written to a separate file for each input file(-pair).
Important: For abundance analysis tasks involving large eukaryotic genomes, make sure to supply options
-max-cand 4 -hitmin 4 -hitdiff 80
for optimal results. This tells MetaCache to consider the 4 best matching candidates per read (the default is 2, which is fine for bacteria). It also makes sure that the best matching candidate wins over the lowest common ancestor of all candidates if the input read has at least 80% features more in common with this best candidate than it has with the 2nd best.
build
: build database(s) from reference genomesquery
: query reads against databasesmerge
: merge results of independent queriesmodify
: add reference genomes to database or update taxonomyinfo
: get information about a database