SYNOPSIS

    metacache build <database> <sequence file/directory>... [OPTION]...

    metacache build [OPTION]... <database> <sequence file/directory>...


DESCRIPTION

    Create a new database of reference sequences (usually genomic sequences).


REQUIRED PARAMETERS

    <database>        database file name;
                      A MetaCache database contains taxonomic information and
                      min-hash signatures of reference sequences (complete
                      genomes, scaffolds, contigs, ...).

    <sequence file/directory>...
                      FASTA or FASTQ files containing genomic sequences
                      (complete genomes, scaffolds, contigs, ...) that shall
                      be used as representatives of an organism/taxon.
                      If directory names are given, they will be searched for
                      sequence files (at most 10 levels deep).


BASIC OPTIONS

    -taxonomy         directory with taxonomic hierarchy data
                      (see NCBI's taxonomic data files)

    -taxpostmap       Files with sequence to taxon id mappings that are used
                      as alternative source in a post processing step.
                      default: 'nucl_(gb|wgs|est|gss).accession2taxid'

    -sequence-id-format (smart|ncbi|gi|filename|leadingword)
                      Method used for extracting sequence IDs from filenames
                      and sequence headers. Sequence IDs are also used to
                      assign taxa to reference sequences.
                      Available types are:
                        smart       : try NCBI > genbank > filename
                        ncbi        : NCBI-style accession/accession.version
                        gi          : genbank identifier
                        filename    : filename without extension
                        leadingword : first stretch of non-whitespace
                                      characters
                      default: smart

    -silent|-verbose  information level during build:
                      silent => none / verbose => most detailed
                      default: neither => only errors/important info


SKETCHING (SUBSAMPLING)

    -kmerlen          number of nucleotides/characters in a k-mer
                      default: 16

    -sketchlen        number of features (k-mer hashes) per sampling window
                      default: 16

    -winlen           number of letters in each sampling window
                      default: 127

    -winstride        distance between window starting positions
                      default: 112 (w-k+1)


ADVANCED OPTIONS

    -reset-taxa       Attempts to re-rank all sequences after the main build
                      phase using '.accession2taxid' files. This will reset
                      the taxon id of a reference sequence even if a taxon id
                      could be obtained from other sources during the build
                      phase.
                      default: off

    -max-locations-per-feature <#>
                      maximum number of reference sequence locations to be
                      stored per feature;
                      If the value is too high it will significantly impact
                      querying speed. Note that an upper hard limit is always
                      imposed by the data type used for the hash table bucket
                      size (set with compilation macro
                      '-DMC_LOCATION_LIST_SIZE_TYPE').
                      default: 254

    -remove-overpopulated-features
                      Removes all features that have reached the maximum
                      allowed number of locations per feature. This can
                      improve querying speed and can be used to remove
                      non-discriminative features.
                      default: off
                      Not available in the GPU version.

    -remove-ambig-features
                      Removes all features that have more distinct reference
                      sequences on the given taxonomic rank than set by
                      '-max-ambig-per-feature'. This can decrease the database
                      size significantly at the expense of sensitivity. Note
                      that the lower the given taxonomic rank is, the more
                      pronounced the effect will be.
                      Valid values: sequence, form, variety, subspecies,
                      species, subgenus, genus, subtribe, tribe, subfamily,
                      family, suborder, order, subclass, class, subphylum,
                      phylum, subkingdom, kingdom, domain
                      default: off
                      Not available in the GPU version.

    -max-ambig-per-feature <#>
                      Maximum number of allowed different reference sequence
                      taxa per feature if option '-remove-ambig-features' is
                      used.
                      Not available in the GPU version.

    -max-load-fac     maximum hash table load factor;
                      This can be used to trade off larger memory consumption
                      for speed and vice versa. A lower load factor will
                      improve speed, a larger one will improve memory
                      efficiency.
                      default: 0.800000
                      Not available in the GPU version.

    -parts <#>        Splits the database into multiple parts. Each part
                      contains a separate hash table.
                      default: 1


EXAMPLES

    Build database 'mydb' from sequence file 'genomes.fna':
        metacache build mydb genomes.fna

    Build database with latest complete genomes from the NCBI RefSeq:
        download-ncbi-genomes refseq/bacteria myfolder
        download-ncbi-genomes refseq/viruses myfolder
        download-ncbi-taxonomy myfolder
        metacache build myRefseq myfolder -taxonomy myfolder

    Build database 'mydb' from two sequence files:
        metacache build mydb mrsa.fna ecoli.fna

    Build database 'myBacteria' from folder containing sequence files:
        metacache build myBacteria all_bacteria
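
    The default window stride listed under SKETCHING is given as w-k+1, i.e.
    window length minus k-mer length plus one. With that stride, consecutive
    windows overlap by k-1 characters, so every k-mer lies completely inside
    one window. The sketch below only reproduces this arithmetic and prints
    (does not run) a build command with the values spelled out. The helper
    name 'default_winstride' is hypothetical, and passing each value as the
    argument following its option is an assumption modeled on
    '-max-locations-per-feature <#>':

```shell
#!/bin/sh
# Hypothetical helper: default window stride as stated in the help text,
# i.e. winlen - kmerlen + 1.
default_winstride() {
    winlen=$1
    kmerlen=$2
    echo $(( winlen - kmerlen + 1 ))
}

kmerlen=16    # documented default of -kmerlen
winlen=127    # documented default of -winlen
winstride=$(default_winstride "$winlen" "$kmerlen")

# Assumption: option values are given as the next argument.
# The command is printed only, so metacache need not be installed.
build_cmd="metacache build mydb genomes.fna -kmerlen $kmerlen -winlen $winlen -winstride $winstride"
echo "$build_cmd"
```

    If -kmerlen or -winlen is changed, recomputing the stride the same way
    keeps the k-1 overlap between neighbouring windows.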