SYNOPSIS

    metacache query <database>

    metacache query <database> <sequence file/directory>... [OPTION]...

    metacache query <database> [OPTION]... <sequence file/directory>...


DESCRIPTION

    Map sequences (short reads, long reads, genome fragments, ...)
    to their most likely taxon of origin.


BASIC PARAMETERS

    <database>        Database file name.
                      A MetaCache database contains taxonomic information and
                      min-hash signatures of reference sequences (complete
                      genomes, scaffolds, contigs, ...).

    <sequence file/directory>...
                      FASTA or FASTQ files containing genomic sequences (short
                      reads, long reads, contigs, complete genomes, ...) that
                      shall be classified.
                      * If directory names are given, they will be searched
                        for sequence files (at most 10 levels deep).
                      * If no input filenames or directories are given,
                        MetaCache will run in interactive query mode. This can
                        be used to load the database into memory only once and
                        then query it multiple times with different query
                        options.


MAPPING RESULTS OUTPUT

    -out <file>       Redirect output to file <file>.
                      If not specified, output will be written to stdout.
                      If more than one input file was given, all output will
                      be concatenated into one file.

    -split-out <file> Generate output and statistics for each input file
                      separately. For each input file <in> an output file with
                      name <file>_<in> will be written.


PAIRED-END READ HANDLING

    -pairfiles        Interleave paired-end reads from two consecutive files,
                      so that the nth read from file m and the nth read from
                      file m+1 will be treated as a pair. If more than two
                      files are provided, their names will be sorted before
                      processing. Thus, the order defined by the filenames
                      determines the pairing, not the order in which they
                      were given on the command line.

    -pairseq          Two consecutive sequences (1+2, 3+4, ...) from each file
                      will be treated as paired-end reads.

    -insertsize <#>   Maximum insert size to consider.
                      default: sum of lengths of the individual reads


READ FILTERS

    -min-readlen <#>  Minimum length of reads (in bp). Reads shorter than the
                      given number will not be classified.
                      default: no limit

    -max-readlen <#>  Maximum length of reads (in bp).
                      Reads longer than the given number will not be
                      classified.
                      default: no limit


CLASSIFICATION

    -lowest <rank>    Do not classify on ranks below <rank>.
                      (Valid values: sequence, form, variety, subspecies,
                      species, subgenus, genus, subtribe, tribe, subfamily,
                      family, suborder, order, subclass, class, subphylum,
                      phylum, subkingdom, kingdom, domain)
                      default: sequence

    -highest <rank>   Do not classify on ranks above <rank>.
                      (Valid values: sequence, form, variety, subspecies,
                      species, subgenus, genus, subtribe, tribe, subfamily,
                      family, suborder, order, subclass, class, subphylum,
                      phylum, subkingdom, kingdom, domain)
                      default: domain

    -hitmin <t>       Sets classification threshold to <t>.
                      A read will not be classified if fewer than t features
                      from the database match. Higher values will increase
                      precision at the expense of sensitivity.
                      default: 0

    -hitdiff <t>      Sets classification threshold to <t>.
                      A read will not be classified if fewer than t features
                      from the database match. Higher values will increase
                      precision at the expense of sensitivity.
                      default: 0

    -maxcand <#>      Maximum number of reference taxon candidates to consider
                      for each query. A large value can significantly
                      decrease the querying speed!
                      default: 2

    -cov-percentile <p>
                      Remove the p-th percentile of hit reference sequences
                      with the lowest coverage. Classification is done using
                      only the remaining reference sequences. This can help to
                      reduce false positives, especially when your input data
                      has a high sequencing coverage.
                      This feature decreases the querying speed!
                      default: off


GENERAL OUTPUT FORMATTING

    -no-summary       Don't show result summary & mapping statistics at the
                      end of the mapping output.
                      default: off

    -no-query-params  Don't show query settings at the beginning of the
                      mapping output.
                      default: off

    -no-err           Suppress all error messages.
                      default: off


CLASSIFICATION RESULT FORMATTING

    -no-map           Don't report classification for each individual query
                      sequence; show summaries only (useful for quick tests).
                      default: off

    -mapped-only      Don't list unclassified reads/read pairs.
                      default: off

    -taxids           Print taxon ids in addition to taxon names.
                      default: off

    -taxids-only      Print taxon ids instead of taxon names.
                      default: off

    -omit-ranks       Do not print taxon rank names.
                      default: off

    -separate-cols    Prints *all* mapping information (rank, taxon name,
                      taxon ids) in separate columns (see option
                      '-separator').
                      default: off

    -separator <text> Sets string that separates output columns.
                      default: '\t|\t'

    -comment <text>   Sets string that precedes comment (non-mapping) lines.
                      default: '# '

    -queryids         Show a unique id for each query.
                      Note that in paired-end mode a query is a pair of two
                      read sequences. This option will always be activated if
                      option '-hits-per-ref' is given.
                      default: off

    -lineage          Report complete lineage for per-read classification
                      starting with the lowest rank found/allowed and ending
                      with the highest rank allowed. See also options
                      '-lowest' and '-highest'.
                      default: off


ANALYSIS: ABUNDANCES

    -abundances <file>
                      Show absolute and relative abundance of each taxon.
                      If a valid filename is given, the list will be written
                      to this file.
                      default: off

    -abundance-per <rank>
                      Show absolute and relative abundances for each taxon on
                      one specific rank.
                      Classifications on higher ranks will be estimated by
                      distributing them down according to the relative
                      abundances of classifications on or below the given
                      rank.
                      (Valid values: sequence, form, variety, subspecies,
                      species, subgenus, genus, subtribe, tribe, subfamily,
                      family, suborder, order, subclass, class, subphylum,
                      phylum, subkingdom, kingdom, domain)
                      If '-abundances <file>' was given, this list will be
                      printed to the same file.
                      default: off


ANALYSIS: RAW DATABASE HITS

    -tophits          For each query, print top feature hits in database.
                      default: off

    -allhits          For each query, print all feature hits in database.
                      default: off

    -locations        Show locations in candidate reference sequences.
                      Activates option '-tophits'.
                      default: off

    -hits-per-ref <file>
                      Shows a list of all hits for each reference sequence.
                      If this condensed list is all you need, you should
                      deactivate the per-read mapping output with '-no-map'.
                      If a valid filename is given after '-hits-per-ref', the
                      list will be written to a separate file.
                      Option '-queryids' will be activated and the lowest
                      classification rank will be set to 'sequence'.
                      default: off


ANALYSIS: ALIGNMENTS

    -align            Show semi-global alignment to best candidate reference
                      sequence.
                      Original files of reference sequences must be available.
                      This feature decreases the querying speed!
                      default: off


ADVANCED: GROUND TRUTH BASED EVALUATION

    -ground-truth     Report correct query taxa if known.
                      Queries need to have either a 'taxid|<number>' entry in
                      their header or a sequence id that is also present in
                      the database.
                      This feature decreases the querying speed!
                      default: off

    -precision        Report precision & sensitivity by comparing query taxa
                      (ground truth) and mapped taxa.
                      Queries need to have either a 'taxid|<number>' entry in
                      their header or a sequence id that is also found in the
                      database.
                      This feature decreases the querying speed!
                      default: off

    -taxon-coverage   Report true/false positives and true/false negatives.
                      This option turns on '-precision', so ground truth data
                      needs to be available.
                      This feature decreases the querying speed!
                      default: off


ADVANCED: CUSTOM QUERY SKETCHING (SUBSAMPLING)

    -kmerlen <k>      Number of nucleotides/characters in a k-mer.
                      default: determined by database

    -sketchlen <s>    Number of features (k-mer hashes) per sampling window.
                      default: determined by database

    -winlen <w>       Number of letters in each sampling window.
                      default: determined by database

    -winstride <l>    Distance between window starting positions.
                      default: determined by database


ADVANCED: DATABASE MODIFICATION

    -max-locations-per-feature <#>
                      Maximum number of reference sequence locations to be
                      stored per feature. If the value is too high, it will
                      significantly impact querying speed. Note that an upper
                      hard limit is always imposed by the data type used for
                      the hash table bucket size (set with compilation macro
                      '-DMC_LOCATION_LIST_SIZE_TYPE').
                      default: 254

    -remove-overpopulated-features
                      Removes all features that have reached the maximum
                      allowed number of locations per feature. This can
                      improve querying speed and can be used to remove
                      non-discriminative features.
                      default: off
                      Not available in the GPU version.

    -remove-ambig-features <rank>
                      Removes all features that have more distinct reference
                      sequence taxa on the given taxonomic rank than set by
                      '-max-ambig-per-feature'. This can decrease the database
                      size significantly at the expense of sensitivity. Note
                      that the lower the given taxonomic rank is, the more
                      pronounced the effect will be.
                      (Valid values: sequence, form, variety, subspecies,
                      species, subgenus, genus, subtribe, tribe, subfamily,
                      family, suborder, order, subclass, class, subphylum,
                      phylum, subkingdom, kingdom, domain)
                      default: off
                      Not available in the GPU version.

    -max-ambig-per-feature <#>
                      Maximum number of allowed different reference sequence
                      taxa per feature if option '-remove-ambig-features' is
                      used.
                      Not available in the GPU version.

    -max-load-fac <factor>
                      Maximum hash table load factor. This can be used to
                      trade off larger memory consumption for speed and vice
                      versa.
                      A lower load factor will improve speed, a larger one
                      will improve memory efficiency.
                      default: 0.800000
                      Not available in the GPU version.


ADVANCED: PERFORMANCE TUNING / TESTING

    -threads <#>      Sets the maximum number of parallel threads to use.
                      default (on this machine): 32

    -batch-size <#>   Process <#> many queries (reads or read pairs) per
                      thread at once.
                      default (on this machine): 4096

    -query-limit <#>  Classify at most <#> queries (reads or read pairs) per
                      input file.
                      default: no limit


EXAMPLES

    Query all sequences in 'myreads.fna' against pre-built database 'refseq':
        metacache query refseq myreads.fna -out results.txt

    Query all sequences in multiple files against database 'refseq':
        metacache query refseq reads1.fna reads2.fna reads3.fna

    Query all sequence files in folder 'test' against database 'refseq':
        metacache query refseq test

    Query multiple files and folder contents against database 'refseq':
        metacache query refseq file1.fna folder1 file2.fna file3.fna folder2

    Perform a precision test and show all ranks for each classification
    result:
        metacache query refseq reads.fna -precision -allranks -out results.txt

    Load database in interactive query mode, then query multiple read
    batches:
        metacache query refseq
        reads1.fa reads2.fa -pairfiles -insertsize 400
        reads3.fa -pairseq -insertsize 300
        reads4.fa -lineage


OUTPUT FORMAT

    MetaCache's default read mapping output format is:
    read_header | rank:taxon_name

    This will not be changed in the future to avoid breaking anyone's
    pipelines. Command line options won't change in the near future for the
    same reason.

    The following table shows some of the possible mapping layouts with their
    associated command line arguments:

    read mapping layout                          command line arguments
    ------------------------------------------   ----------------------------------
    read_header | taxon_id                       -taxids-only -omit-ranks
    read_header | taxon_name                     -omit-ranks
    read_header | taxon_name(taxon_id)           -taxids -omit-ranks
    read_header | taxon_name | taxon_id          -taxids -omit-ranks -separate-cols
    read_header | rank:taxon_id                  -taxids-only
    read_header | rank:taxon_name
    read_header | rank:taxon_name(taxon_id)      -taxids
    read_header | rank | taxon_id                -taxids-only -separate-cols
    read_header | rank | taxon_name              -separate-cols
    read_header | rank | taxon_name | taxon_id   -taxids -separate-cols

    Note that the separator '\t|\t' can be changed to something else with the
    command line option '-separator <text>'.

    Note that the default lowest taxon rank is 'sequence'. Sequence-level
    taxon ids have negative numbers in order to not interfere with NCBI taxon
    ids. Each reference sequence is added as its own taxon below the lowest
    known NCBI taxon for that sequence. If you do not want to classify at
    sequence level, you can set a higher rank as lowest classification rank
    with the '-lowest' command line option: '-lowest species' or
    '-lowest subspecies' or '-lowest genus', etc.
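
    For downstream scripts, the default layout (read_header, then
    rank:taxon_name, separated by '\t|\t', with comment lines starting with
    '# ') can be split with a few lines of Python. The following is only a
    minimal sketch and not part of MetaCache: the names 'Mapping' and
    'parse_mapping_line' are hypothetical, the sample read header and taxon are
    made up, and treating a field without a 'rank:' prefix as unclassified is
    an assumption. It only covers the default layout, so with '-taxids' the
    taxon id would stay attached to the taxon name.

```python
from typing import NamedTuple, Optional

SEPARATOR = "\t|\t"  # default of '-separator'
COMMENT = "# "       # default of '-comment'


class Mapping(NamedTuple):
    """One parsed mapping line (hypothetical helper type, not a MetaCache API)."""
    read_header: str
    rank: Optional[str]  # None if the field carries no 'rank:' prefix
    taxon: str


def parse_mapping_line(line: str) -> Optional[Mapping]:
    """Parse one line of the default layout 'read_header | rank:taxon_name'.

    Returns None for blank lines and comment (non-mapping) lines.
    """
    line = line.rstrip("\n")
    if not line or line.startswith(COMMENT):
        return None
    # Split at the last separator so the read header is kept in one piece.
    header, sep, field = line.rpartition(SEPARATOR)
    if not sep:
        raise ValueError(f"no column separator found in: {line!r}")
    # Rank names contain no ':', so split at the first one.
    rank, colon, taxon = field.partition(":")
    if not colon:
        # Assumption: a field without 'rank:' marks an unclassified read.
        return Mapping(header, None, field)
    return Mapping(header, rank, taxon)


if __name__ == "__main__":
    # Made-up sample line in the default layout.
    print(parse_mapping_line("read_1\t|\tspecies:Escherichia coli"))
```

    If '-separator' or '-comment' were changed on the command line, the two
    constants at the top would need to be adjusted to match.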