Database merging

Using BroodDBMerge

Getting started

The BROOD distribution includes a utility, BROODDBMERGE, that lets users create a single database by merging two databases, such as BROOD’s default database and their own corporate database. BROODDBMERGE takes three required parameters: -in1, -in2 and -out. It combines the two databases in the directories specified by -in1 and -in2 and writes a new database to the directory specified by -out. By default, BROODDBMERGE will not overwrite an existing database.

BROOD databases are complex structures with many precalculated parameters and indices, organized into multiple binary files so that many fragments can be accessed quickly. Because of this structure, it is not possible to merge databases by simply concatenating their files. Instead, BROODDBMERGE reorganizes the molecular fragments and recalculates the database indices.

By default, BROODDBMERGE removes duplicate fragments from the new database (see -removeDuplicates). The first time BROODDBMERGE encounters a fragment, it is written to the new database. Any later copy of the same fragment is discarded. Thus, the database passed in using the -in1 parameter takes precedence in duplicate removal: fragments in the database specified by -in2 that duplicate fragments from -in1 will be discarded. In addition, if a fragment appears twice within either database, the later instance will be discarded. The user can use the database input order to control which duplicate fragments will be present in the final database.
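The precedence rule above can be illustrated with a minimal Python sketch. This is not how BROODDBMERGE is implemented: the function name merge_fragments is hypothetical, fragments are represented as plain strings, and string equality stands in for whatever criterion the program actually uses to decide that two fragments are duplicates.

```python
def merge_fragments(in1, in2, remove_duplicates=True):
    """Return fragments in the order they would be written to the merged database.

    Hypothetical illustration only: fragments are plain strings here and
    "duplicate" means string equality, which is an assumption.
    """
    merged = []
    seen = set()
    # The first database is read before the second, so its copy of a
    # fragment is always the one that is kept.
    for fragment in list(in1) + list(in2):
        if remove_duplicates and fragment in seen:
            continue  # a later copy is discarded
        seen.add(fragment)
        merged.append(fragment)
    return merged


corporate = ["frag_A", "frag_B", "frag_A"]   # frag_A appears twice internally
default_db = ["frag_B", "frag_C"]            # frag_B duplicates the first database

print(merge_fragments(corporate, default_db))
# ['frag_A', 'frag_B', 'frag_C']
print(merge_fragments(default_db, corporate))
# ['frag_B', 'frag_C', 'frag_A']
```

Swapping the two inputs changes which copy of frag_B survives, which is the sense in which the input order controls the contents of the final database.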

Help

Executing BROODDBMERGE with no arguments will result in:

prompt> brooddbmerge
BroodDBMerge version 3.0.0.1, 20150619
  OEChem version 2.0.1, 20150619
  Platform: osx-10.8-g++4.2-x64
  OpenEye Scientific Software, Inc.
=======================================
No arguments specified on the command line
Required parameters:
    -in1 : First input database directory
    -in2 : Second input database directory
    -out : Merged output database directory
For more help type:
  brooddbmerge --help

A description of the command-line interface can be obtained by executing BROODDBMERGE with the --help option. Like most OpenEye command-line programs, BROODDBMERGE will also return appropriate output with the --help simple and --help all options. Since BROODDBMERGE is such a simple utility, all the options are presented in each of the outputs. In other words, BROODDBMERGE does not have any complex or advanced parameters.

prompt> brooddbmerge --help
prompt> brooddbmerge --help simple
prompt> brooddbmerge --help all

Each of these commands generates the following output:

=======================================
Simple parameter list
  Parameters
    -in1 : First input database directory
    -in2 : Second input database directory
    -verbose : Give more detailed output
    -overwrite : Overwrite output database (useful for repetitive scripting)
    -removeDuplicates : Don't include fragments from the second database that
                        were in the first database


Additional help functions:
  brooddbmerge --help simple      : Get a list of simple parameters (as seen above)
  brooddbmerge --help all         : Get a complete list of parameters
  brooddbmerge --help defaults    : List the defaults for all parameters
  brooddbmerge --help <parameter> : Get detailed help on a parameter
  brooddbmerge --help html        : Create an html help file for this program
  brooddbmerge --help versions    : List the toolkits and versions used in the application

These represent the simplest set of parameters and are the best place to start learning to use BROODDBMERGE.

Default help

The defaults for each command-line parameter can be examined with the --help defaults flag.

prompt> brooddbmerge --help defaults

This results in the following output:

#Default settings
  #Parameters
      #-in1 : (no default setting)
      #-in2 : (no default setting)
      #-out : (no default setting)
      -verbose false
      -overwrite false
      -removeDuplicates true

Command-line parameters

Input parameters

-in1 (-i1)

This option is for loading the first BROOD database. The value passed with the flag should point at the directory of the database rather than any files inside the database.

-in2 (-i2)

This option is for loading the second BROOD database to be merged with the database specified by the -in1 flag.

Output parameters

-out (-o)

This option indicates the new database to create based on merging the databases specified by the -in1 and -in2 flags. By default BROODDBMERGE will require you to specify a new directory for the new database (see -overwrite for details).

Control parameters

-verbose (-v)

This flag controls the amount of output generated by the program. By default, all essential information is presented in a summary. When the flag is set, a large amount of detailed information is also written to stdout. [default = false]

-overwrite (-over)

By default, BROODDBMERGE will only write a new database; it will not overwrite an existing file or directory. BROOD databases can be quite expensive to generate, so this default protects against accidental erasure. To overwrite an existing database, set the -overwrite flag. [default = false]
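When BROODDBMERGE is driven from a script, the same safeguard can be mirrored before the program is launched. The following is a minimal sketch, not part of BROOD: check_output_dir is a hypothetical helper, and it only tests whether the output path already exists, which is an assumption about what the program itself checks.

```python
import tempfile
from pathlib import Path


def check_output_dir(out_dir, overwrite=False):
    """Refuse an output path that already exists unless overwrite is requested.

    Hypothetical helper that mimics the default described for -overwrite.
    """
    path = Path(out_dir)
    if path.exists() and not overwrite:
        raise FileExistsError(
            f"{path} already exists; pass overwrite=True to replace it"
        )
    return path


with tempfile.TemporaryDirectory() as existing:
    # A path that does not exist yet is accepted.
    print(check_output_dir(Path(existing) / "dbAB").name)
    # An existing directory is rejected by default.
    try:
        check_output_dir(existing)
    except FileExistsError as err:
        print("refused:", err)
```

Failing early in the script avoids starting a long merge that the program would stop anyway.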

-removeDuplicates (-duplicates)

When merging two databases, users can choose whether to remove duplicates. By default, any fragment that is present more than once in the combined databases will only be written one time. No fragment-level merging occurs; only the first instance of a fragment is written. [default = true]

Example executions

This section has a brief series of examples of BROODDBMERGE command-line executions. Each example is followed by a brief description of its behavior.

prompt> brooddbmerge -in1 dbA -in2 dbB -out dbAB

This is the most basic merge of the two databases dbA and dbB into the new database dbAB. By default, the new database dbAB will not contain any duplicate fragments, but will contain all of the unique fragments that were present in either dbA or dbB.

prompt> brooddbmerge -in1 dbA -in2 dbB -out dbB

In this execution, the user has most likely specified dbB as the output by mistake. Rather than allow this mistake to erase or corrupt the dbB database, the execution will stop. This behavior can be overridden with the -overwrite flag.

prompt> brooddbmerge -in1 dbA -in2 dbB -out dbAB -removeDuplicates false

This execution of BROODDBMERGE will not include duplicate removal. All of the fragments that are in dbA and dbB will occur in dbAB no matter how many times any fragment may appear in either or both databases. When BROOD searches the resulting database, each instance will be treated separately.
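For repetitive scripting, the command lines above can be assembled programmatically. The sketch below only builds the argument list and does not run anything. The function name build_merge_command is hypothetical, and passing an explicit true value to -overwrite and -verbose is an assumption modeled on the -removeDuplicates false form shown above.

```python
def build_merge_command(in1, in2, out, remove_duplicates=True,
                        overwrite=False, verbose=False):
    """Build an argument list for brooddbmerge (hypothetical helper).

    Only non-default settings are added, so the basic merge stays minimal.
    """
    cmd = ["brooddbmerge", "-in1", str(in1), "-in2", str(in2), "-out", str(out)]
    if not remove_duplicates:
        cmd += ["-removeDuplicates", "false"]
    if overwrite:
        # Assumption: boolean flags accept an explicit true/false value.
        cmd += ["-overwrite", "true"]
    if verbose:
        cmd += ["-verbose", "true"]
    return cmd


print(" ".join(build_merge_command("dbA", "dbB", "dbAB")))
# brooddbmerge -in1 dbA -in2 dbB -out dbAB
print(" ".join(build_merge_command("dbA", "dbB", "dbAB", remove_duplicates=False)))
# brooddbmerge -in1 dbA -in2 dbB -out dbAB -removeDuplicates false
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a machine where brooddbmerge is installed and on the PATH.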