SYNOPSIS

    metacache build <database> <sequence file/directory>... [OPTION]...

    metacache build <database> [OPTION]... <sequence file/directory>...


DESCRIPTION

    Create a new database of reference sequences (usually genomic sequences).


REQUIRED PARAMETERS

    <database>        database file name;
                      A MetaCache database contains taxonomic information and
                      min-hash signatures of reference sequences (complete
                      genomes, scaffolds, contigs, ...).

    <sequence file/directory>...
                      FASTA or FASTQ files containing genomic sequences
                      (complete genomes, scaffolds, contigs, ...) that shall
                      be used as representatives of an organism/taxon.
                      If directory names are given, they will be searched for
                      sequence files (at most 10 levels deep).
                      The input files can also be compressed if MetaCache was
                      built with the zlib compression library.


BASIC OPTIONS

    -taxonomy <path>  directory with taxonomic hierarchy data (see NCBI's
                      taxonomic data files)

    -taxpostmap <file>
                      Files with sequence-to-taxon-id mappings that are used as
                      an alternative source in a post-processing step.
                      default: 'nucl_(gb|wgs|est|gss).accession2taxid'

    -sequence-id-format (smart|ncbi|gi|filename|leadingword)
                      Method used for extracting sequence IDs from filenames and
                      sequence headers. Sequence IDs are also used to assign taxa
                      to reference sequences.
                      Available types are:
                      smart       : try NCBI > GenBank > filename
                      ncbi        : NCBI-style accession/accession.version
                      gi          : GenBank identifier
                      filename    : filename without extension
                      leadingword : first stretch of non-whitespace characters
                      default: smart
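
    The two simplest ID formats above can be illustrated with a short Python
    sketch. This is not MetaCache's implementation; the function names are
    hypothetical, and the handling of the leading '>' or '@' header character
    and of double extensions such as '.fna.gz' is an assumption.

```python
import os


def id_from_filename(path):
    """'filename' format: the file name without its (last) extension."""
    return os.path.splitext(os.path.basename(path))[0]


def id_from_leading_word(header):
    """'leadingword' format: first stretch of non-whitespace characters.

    Assumption: a leading FASTA '>' or FASTQ '@' marker is not part of the ID.
    """
    text = header.lstrip(">@").strip()
    return text.split()[0] if text else ""


if __name__ == "__main__":
    print(id_from_filename("refs/ecoli.fna"))
    print(id_from_leading_word(">NC_000913.3 Escherichia coli K-12"))
```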

    -silent|-verbose  information level during build:
                      silent => none / verbose => most detailed
                      default: neither => only errors/important info


SKETCHING (SUBSAMPLING)

    -kmerlen <k>      number of nucleotides/characters in a k-mer
                      default: 16

    -sketchlen <s>    number of features (k-mer hashes) per sampling window
                      default: 16

    -winlen <w>       number of letters in each sampling window
                      default: 127

    -winstride <l>    distance between window starting positions
                      default: 112 (w-k+1)
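
    The relationship between the four sketching parameters can be sketched in
    Python as follows. This is an illustration under stated assumptions, not
    MetaCache's code: the hash function (BLAKE2b from hashlib) is a stand-in,
    the function names are hypothetical, reverse complements are ignored, and
    the last window is simply allowed to be shorter than <w>.

```python
import hashlib


def window_starts(seq_len, winstride=112):
    """Start offsets of the sampling windows (assumed to begin at 0)."""
    return list(range(0, seq_len, winstride))


def kmers_in_window(seq, start, k=16, winlen=127):
    """All k-mers that lie entirely inside the window starting at 'start'."""
    window = seq[start:start + winlen]
    return [window[i:i + k] for i in range(len(window) - k + 1)]


def sketch(seq, start, k=16, winlen=127, sketchlen=16):
    """Min-hash style sketch: the 'sketchlen' smallest distinct k-mer hashes."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big"
        )
        for kmer in kmers_in_window(seq, start, k, winlen)
    }
    return sorted(hashes)[:sketchlen]


if __name__ == "__main__":
    seq = "ACGT" * 75  # toy sequence of 300 letters
    starts = window_starts(len(seq))
    print(starts)
    print(sum(len(kmers_in_window(seq, s)) for s in starts))
```

    With a stride of w-k+1, consecutive windows overlap by k-1 letters, so
    every k-mer of the sequence falls into exactly one window. A smaller
    stride would sample some k-mers twice; a larger one would skip k-mers.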


ADVANCED OPTIONS

    -reset-taxa       Attempts to re-rank all sequences after the main build
                      phase using '.accession2taxid' files. This will reset the
                      taxon id of a reference sequence even if a taxon id could
                      be obtained from other sources during the build phase.
                      default: off

    -max-locations-per-feature <#>
                      maximum number of reference sequence locations to be
                      stored per feature;
                      If the value is too high, it will significantly reduce
                      querying speed. Note that a hard upper limit is always
                      imposed by the data type used for the hash table bucket
                      size (set with compilation macro
                      '-DMC_LOCATION_LIST_SIZE_TYPE').
                      default: 254

    -remove-overpopulated-features
                      Removes all features that have reached the maximum allowed
                      number of locations per feature. This can improve querying
                      speed and can be used to remove non-discriminative
                      features.
                      default: off
                      Not available in the GPU version.

    -remove-ambig-features <rank>
                      Removes all features that have more distinct reference
                      sequence taxa on the given taxonomic rank than set by
                      '-max-ambig-per-feature'. This can decrease the database
                      size significantly at the expense of sensitivity. Note
                      that the lower the given taxonomic rank is, the more
                      pronounced the effect will be.
                      Valid values: sequence, form, variety, subspecies,
                      species, subgenus, genus, subtribe, tribe, subfamily,
                      family, suborder, order, subclass, class, subphylum,
                      phylum, subkingdom, kingdom, domain
                      default: off
                      Not available in the GPU version.

    -max-ambig-per-feature <#>
                      Maximum number of allowed different reference sequence
                      taxa per feature if option '-remove-ambig-features' is
                      used.
                      Not available in the GPU version.
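
    The interplay of '-remove-ambig-features' and '-max-ambig-per-feature'
    can be pictured with the following Python sketch. It is a simplified
    model, not MetaCache's implementation: the function name and the
    dictionary layout (feature -> taxon ids at the chosen rank) are
    hypothetical.

```python
def remove_ambiguous(features, max_ambig):
    """Keep only features with at most 'max_ambig' distinct taxa.

    features: dict mapping a feature (k-mer hash) to the taxon ids, at the
    chosen rank, of the reference sequences in which it occurs.
    """
    return {
        feature: taxa
        for feature, taxa in features.items()
        if len(set(taxa)) <= max_ambig
    }


if __name__ == "__main__":
    # Hypothetical features with taxon ids at some fixed rank.
    features = {
        "f1": [562, 562, 1280],   # 2 distinct taxa
        "f2": [562, 1280, 1313],  # 3 distinct taxa
        "f3": [562],              # 1 distinct taxon
    }
    print(sorted(remove_ambiguous(features, max_ambig=2)))
```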

    -max-load-fac <factor>
                      maximum hash table load factor;
                      This can be used to trade memory consumption for speed
                      and vice versa. A lower load factor improves speed; a
                      higher one improves memory efficiency.
                      default: 0.8
                      Not available in the GPU version.
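
    The memory side of this trade-off follows from the definition of the
    load factor (stored entries divided by available slots). A minimal
    Python sketch of that arithmetic, with a hypothetical function name and
    no claim about how MetaCache actually sizes its hash table:

```python
import math


def min_slots(num_features, max_load_fac=0.8):
    """Smallest slot count that keeps the load factor at or below the limit."""
    if not 0 < max_load_fac <= 1:
        raise ValueError("load factor must be in (0, 1]")
    return math.ceil(num_features / max_load_fac)


if __name__ == "__main__":
    for factor in (0.5, 0.8, 1.0):
        print(factor, min_slots(1_000_000, factor))
```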

    -parts <#>        Splits the database into multiple parts. Each part
                      contains a separate hash table.
                      default: 1

EXAMPLES

    Build database 'mydb' from sequence file 'genomes.fna':
        metacache build mydb genomes.fna

    Build database with the latest complete genomes from NCBI RefSeq:
        download-ncbi-genomes refseq/bacteria myfolder
        download-ncbi-genomes refseq/viruses myfolder
        download-ncbi-taxonomy myfolder
        metacache build myRefseq myfolder -taxonomy myfolder

    Build database 'mydb' from two sequence files:
        metacache build mydb mrsa.fna ecoli.fna

    Build database 'myBacteria' from folder containing sequence files:
        metacache build myBacteria all_bacteria