DDB help and data formatting guide

Users are highly encouraged to upload Descriptor Extraction Software, Protein-Ligand Molecules & Complexes, Descriptors, and Descriptor Filters to enrich DDB. Please prepare your data according to the following formatting rules so that DDB services can work correctly on your submissions.

To deposit a descriptor extraction tool, the donor is advised to follow these naming and file structure guidelines:

  • The descriptor extraction tool must be a command-line executable program that can run on Linux Ubuntu 12.04 or later (amd64 architecture).
  • The descriptor tool (the program & accompanying files) must be self-contained and functional without relying on any other utilities outside the descriptor tool directory.
  • The command-line descriptor tool must accept at least one input, and it must write (and subsequently append) its output to a CSV file.
  • The first line of the program’s output must be the names of the descriptors it calculates (header line). Each line written to the output file after that must contain the descriptors for the provided molecule(s). Every line must contain the same number of descriptors, separated by commas.
  • The mandatory input(s) to the program is (are) the file path(s) to the molecule(s) (i.e. protein, ligand, or both) for which the descriptors need to be extracted.
  • The mandatory output is the output file path to which the descriptors are written.
  • The name of the program must be the same as the name of the descriptor set. For example, if you are donating a descriptor set type called userscore, then your program must be named userscore as well.
  • Your program must be saved in a directory called bin that lives in a directory with the same name as the descriptor extraction tool. In our example, the directory name would be userscore/ and the program’s path inside it would be: userscore/bin/userscore
  • You must also include an interface file along with the bin subdirectory. The interface file must be named interface.txt. In our hypothetical example, the path of this file would be: userscore/interface.txt
  • The interface file provides details about the command-line arguments. It tells the DDB descriptor extraction engine how to run the program and in what order the inputs to and outputs from the program should be presented. For example, if the command-line arguments of userscore are “-r receptorFileName.pdb -l /path/to/ligandFileName.mol2 -o userscore.csv -O userscore_extendedDescriptors.csv”, its interface.txt file must then have the following lines:
    • 1          -r         {Receptor}.pdb
    • 2          -l         {Ligand}.mol2
    • 3          -c         12
    • 4          -o         {OutputPrefix}/userscore.csv
    • 5          -O         {OutputPrefix}/userscore_extendedDescriptors.csv
    • 6          -x         Null
    • 7          -e         extra
  • Each line in the interface file contains three space-separated fields. These fields are “Order”, “Flag”, and “Value”.
  • The numbers above denote the Order according to which the arguments are introduced to the program upon its execution. The numbers must be unique and sequential even if the arguments can be given in any arbitrary order.
  • Flags denote the different options that the program takes. The snippet above shows a list of ‘short’ flags or options such as ‘-r’, ‘-l’, etc. Other programs use ‘long’ options such as ‘--receptor’ or ‘--accuracy’. In such cases, the short options above must be replaced with ‘--receptor’, ‘--ligand’, etc. There is also another category of command-line programs that recognize their arguments by their order from left to right without relying on flags. For this category of programs, Null must be given for the “Flag” field. As an example:
    • 1          Null       {Receptor}.pdb
    • 2          Null       {Ligand}.mol2
    • 3          Null       {OutputPrefix}/userscore.csv
  • The third field in an interface.txt line is the value of the argument. The value of an argument can be an integer (e.g., 4), a floating-point number (e.g., -2.12), a string (e.g., /path/to/file), or Null if the program interprets the presence or absence of the flag as a switch between two things (e.g., True/False, Large/Small, etc.).
  • The words between the curly brackets (the placeholders) above will be replaced by the paths of the receptor, ligand, and output files respectively. The file extensions (.pdb & .mol2) tell DDB in what formats the molecules must be fed to the program.
  • Another placeholder that can be used in the interface file is {Tool}. DDB scripts replace {Tool} with the path at which the tool is stored in our file system. In our hypothetical case, if userscore were saved in the directory /ddbPath/to/, then /ddbPath/to/userscore would substitute {Tool}. If the program uses a parameter file and a directory of libraries that live in the tool’s directory, it can point to them in the interface.txt file using these extra lines:
    • 8          -p         {Tool}/parameterFileName.txt
    • 9          -L         {Tool}/path/to/library
  • Some programs require certain libraries and packages be added or exported to the system environment. The DDB descriptor extraction engine allows access to the Linux environment variables PATH, PYTHONPATH, and LD_LIBRARY_PATH via the interface.txt flags PATH, PYTHONPATH, and LD_LIBRARY_PATH respectively. Custom environment variables are also allowed in the interface.txt file (e.g., USERSCORE_HOME or USERSCOREBIN).
  • The order field (the first token in an interface line) for environment variables must be 0 (zero) since environment variables require no specific order.
  • Environment variables can be combined with the {Tool} placeholder in the interface file so that the program knows where certain files and/or directories are located. For example, if environment variables must be set for the program to work properly, the lines that define them must precede all argument-based lines in the interface.txt file:
    • 0          USERSCORE_HOME     {Tool}
    • 0          USERSCOREBIN       {Tool}/bin
    • 0          PATH               {Tool}/libraries/utilityX
    • 0          PYTHONPATH         {Tool}/somePythonModules
    • 1          -r                 {Receptor}.pdb
  • After all the necessary files have been placed in the descriptor tool directory, the directory must be compressed into a zip file with the same name as the descriptor tool itself.
  • The zip file can then be uploaded.
  • We will then test the program and make it available to the community for descriptor extraction if it passes our security and system usage criteria.
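
The interface.txt rules above can be illustrated with a short Python sketch. It is not the DDB engine's actual implementation; it only shows one plausible way the three fields could be turned into environment settings and an argument list. The function names parse_interface and build_command, the example paths, and the assumption that each placeholder is replaced by a path without its file extension are all hypothetical.

```python
def parse_interface(text):
    """Split interface.txt content into (order, flag, value) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
        order, flag, value = fields
        entries.append((int(order), flag, value))
    return entries


def build_command(entries, program, paths):
    """Return (env, argv) for one run of the tool.

    `paths` maps placeholder names (Tool, Receptor, ...) to replacement
    strings; how DDB really substitutes them is an assumption here.
    """
    def fill(value):
        for name, path in paths.items():
            value = value.replace("{" + name + "}", path)
        return value

    # Order 0 marks environment variables; the flag field holds the name.
    env = {flag: fill(value) for order, flag, value in entries if order == 0}

    arguments = sorted((e for e in entries if e[0] > 0), key=lambda e: e[0])
    orders = [order for order, _, _ in arguments]
    if orders != list(range(1, len(orders) + 1)):
        raise ValueError(f"orders must be unique and sequential, got {orders}")

    argv = [program]
    for _, flag, value in arguments:
        if flag != "Null":   # Null flag: purely positional argument
            argv.append(flag)
        if value != "Null":  # Null value: the flag is a bare switch
            argv.append(fill(value))
    return env, argv


INTERFACE = """\
0 USERSCORE_HOME {Tool}
0 PATH {Tool}/libraries/utilityX
1 -r {Receptor}.pdb
2 -l {Ligand}.mol2
3 -c 12
4 -o {OutputPrefix}/userscore.csv
5 -x Null
"""

# Hypothetical locations, chosen only for this example.
PATHS = {
    "Tool": "/ddbPath/to/userscore",
    "Receptor": "/data/4tim/4tim_protein",
    "Ligand": "/data/4tim/4tim_ligand",
    "OutputPrefix": "/tmp/out",
}

env, argv = build_command(
    parse_interface(INTERFACE), "/ddbPath/to/userscore/bin/userscore", PATHS
)
print(env)
print(" ".join(argv))
```

A list such as argv could then be handed to subprocess.run together with the environment settings. Whether DDB appends the PATH-like values to the existing variables or replaces them is not stated in this guide, so the sketch simply returns them as given.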

To deposit a set of pre-extracted descriptors, the donor is advised to follow these naming and file structure guidelines:

  • Descriptors must be submitted in a zipped folder that contains csv files corresponding to the descriptor tools used to calculate them.
  • Every file must contain the same number of records. The records must correspond to the molecules for which the descriptors are calculated and must appear in the same order in every file.
  • The first line in any descriptor (csv) file must contain a comma-separated list of descriptor names. The following lines must contain the values for these descriptors in the same order.
  • In addition to the csv files for each descriptor type, the uploaded zipped directory must contain a csv file with the name “molNames_and_responseValues.csv”.
  • As the name implies, this file must contain the names of the molecules for which the descriptors are extracted and the response values in the same order.
  • The header line of “molNames_and_responseValues.csv” must contain the fields “Protein_name, Ligand_name, Response_value” separated by commas and without quotation marks.
  • If any of these three fields is unavailable or inapplicable, sensible values must be given instead, for example protein1, protein2, etc. for the protein names. Zeros can be given if the response values are unknown.
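
Before zipping, the rules above can be checked locally. The following Python sketch is only an illustration of such a pre-upload check, not part of DDB; the function names, the descriptor names d1 and d2, and the error messages are assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path

NAMES_FILE = "molNames_and_responseValues.csv"
NAMES_HEADER = ["Protein_name", "Ligand_name", "Response_value"]


def read_rows(path):
    """Read a CSV file into a list of rows with whitespace stripped."""
    with open(path, newline="") as handle:
        return [[cell.strip() for cell in row] for row in csv.reader(handle) if row]


def check_descriptor_dir(directory):
    """Return a list of problems; an empty list means these checks passed."""
    directory = Path(directory)
    names_path = directory / NAMES_FILE
    if not names_path.is_file():
        return [f"missing {NAMES_FILE}"]

    problems = []
    names = read_rows(names_path)
    if not names or names[0] != NAMES_HEADER:
        problems.append(f"{NAMES_FILE}: header must be {','.join(NAMES_HEADER)}")
    n_records = max(len(names) - 1, 0)

    for path in sorted(directory.glob("*.csv")):
        if path.name == NAMES_FILE:
            continue
        rows = read_rows(path)
        if not rows:
            problems.append(f"{path.name}: file is empty")
            continue
        header, records = rows[0], rows[1:]
        if len(records) != n_records:
            problems.append(
                f"{path.name}: {len(records)} records, expected {n_records}"
            )
        if any(len(record) != len(header) for record in records):
            problems.append(f"{path.name}: a record does not match the header length")
    return problems


# Small demonstration with made-up descriptor values.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / NAMES_FILE).write_text(
        "Protein_name, Ligand_name, Response_value\n"
        "4tim_protein, 4tim_ligand, 2.16\n"
        "1drk_protein, 1drk_ligand, 6.82\n"
    )
    (folder / "userscore.csv").write_text("d1,d2\n0.5,1.0\n0.7,2.0\n")
    good = check_descriptor_dir(folder)

    # Drop one record to show how a mismatch is reported.
    (folder / "userscore.csv").write_text("d1,d2\n0.5,1.0\n")
    bad = check_descriptor_dir(folder)

print(good)
print(bad)
```

The sketch cannot verify that records appear in the same molecule order across files, since descriptor files carry no molecule names; that remains the donor's responsibility.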

To deposit a descriptor filter, the donor is advised to follow these instructions:

  • The filter must be submitted as a CSV (comma-separated values) file.
  • The file must contain two lines. The first line is for the names of the descriptors separated by commas.
  • The second line is a binary vector (string) of ones and zeros separated by commas. The number of ones and zeros must be the same as the number of descriptors. The value of one implies that the corresponding descriptor (in the first line) is being selected by the filter as a valid descriptor. The value of zero, on the other hand, means that the corresponding descriptor is either noisy or irrelevant for the given prediction task.
  • Filters can be generated automatically by DDB or manually by the user. In either case, filters created for certain descriptor types can only be applied to the same types of descriptors.
  • In DDB, filters can be applied to a data set to select certain descriptors before fitting and using a machine-learning scoring function.
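
The two-line filter format can be illustrated as follows. This Python sketch is not DDB's own code; the function names and the descriptor names d1, d2, and d3 are hypothetical, and it assumes that "the same types of descriptors" means the filter names match the table header exactly and in the same order.

```python
import csv
import io


def read_filter(text):
    """Parse a two-line filter: descriptor names, then a 0/1 mask."""
    rows = [
        [cell.strip() for cell in row]
        for row in csv.reader(io.StringIO(text))
        if row
    ]
    if len(rows) != 2:
        raise ValueError("a filter file must contain exactly two lines")
    names, bits = rows
    if len(names) != len(bits):
        raise ValueError("number of ones and zeros must equal number of descriptors")
    if any(bit not in ("0", "1") for bit in bits):
        raise ValueError("the second line may contain only 0 and 1")
    return names, [bit == "1" for bit in bits]


def apply_filter(header, records, names, mask):
    """Keep only the columns whose mask entry is True."""
    if list(header) != list(names):
        raise ValueError("filter was built for a different descriptor set")
    keep = [index for index, selected in enumerate(mask) if selected]
    kept_header = [header[index] for index in keep]
    kept_records = [[row[index] for index in keep] for row in records]
    return kept_header, kept_records


FILTER_TEXT = "d1,d2,d3\n1,0,1\n"
header = ["d1", "d2", "d3"]
records = [["0.5", "1.0", "3"], ["0.7", "2.0", "4"]]

names, mask = read_filter(FILTER_TEXT)
new_header, new_records = apply_filter(header, records, names, mask)
print(new_header)   # ['d1', 'd3']
print(new_records)  # [['0.5', '3'], ['0.7', '4']]
```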

To deposit proteins, ligands, or protein-ligand complexes, the donor is encouraged to follow these guidelines:

  • The donor must store all files in a single directory and compress it into a zip file before uploading. In the instructions below, we will call this directory plcUserData.
  • It is highly recommended that donors assign good descriptive names to their data directories.
  • The names of the molecules, their paths inside the plcUserData directory, and response values (if applicable) must be saved in a CSV file named after the data directory itself. In our example, this information must be saved in the file plcUserData.csv directly inside plcUserData/; otherwise DDB will not detect it.
  • If the submitted data are protein-ligand complexes, then the information file plcUserData.csv should look like the following:
    • Protein_file_path, Ligand_file_path, Protein_name, Ligand_name, BA
    • 4tim/4tim_protein.pdb, 4tim/4tim_ligand.mol2, 4tim_protein, 4tim_ligand, 2.16
    • 1drk/1drk_protein.pdb, 1drk/1drk_ligand.mol2, 1drk_protein, 1drk_ligand, 6.82
  • The names of proteins and ligands can be dropped if they are unavailable, as seen below. In such a case, the names of the files will be used as names for the molecules (e.g., 4tim_protein & 4tim_ligand would be used for the first complex).
    • Protein_file_path, Ligand_file_path, BA
    • 4tim/4tim_protein.pdb, 4tim/4tim_ligand.mol2, 2.16
    • 1drk/1drk_protein.pdb, 1drk/1drk_ligand.mol2, 6.82
  • This implies that the data directory (e.g., plcUserData/) contains the information file and two protein-ligand complexes saved in their own subdirectories. The structure of plcUserData is as follows:
    • plcUserData/
      • plcUserData.csv
      • 4tim/
        • 4tim_protein.pdb
        • 4tim_ligand.mol2
      • 1drk/
        • 1drk_protein.pdb
        • 1drk_ligand.mol2
  • The ligand must be saved in MOL2 or SDF format.
  • Each ligand file must contain a single ligand.
  • The ligand molecule must contain only carbon, nitrogen, oxygen, phosphorus, halogen, and hydrogen atoms.
  • The molecular weight of the ligand molecule must be lower than 1000.
  • The protein file must be in the PDB format and must not contain any ligands.
  • The ligand and protein files must be saved in the same coordinate system.
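
The directory layout and information file described above can be checked with a short Python sketch before uploading. It is an illustration under stated assumptions rather than DDB's own loader: the function name read_info_file and the returned dictionary keys are hypothetical. It checks only file names, extensions, and file presence; it does not inspect atom types, molecular weight, or coordinate systems.

```python
import csv
import tempfile
from pathlib import Path

LIGAND_SUFFIXES = {".mol2", ".sdf"}


def read_info_file(data_dir):
    """Read <data_dir>/<data_dir name>.csv and check the files it lists."""
    data_dir = Path(data_dir)
    info_path = data_dir / f"{data_dir.name}.csv"
    with open(info_path, newline="") as handle:
        # skipinitialspace drops the blank after each comma in the example files
        rows = list(csv.DictReader(handle, skipinitialspace=True))

    complexes = []
    for row in rows:
        protein = Path(row["Protein_file_path"].strip())
        ligand = Path(row["Ligand_file_path"].strip())
        if protein.suffix.lower() != ".pdb":
            raise ValueError(f"{protein}: protein files must be in PDB format")
        if ligand.suffix.lower() not in LIGAND_SUFFIXES:
            raise ValueError(f"{ligand}: ligand files must be MOL2 or SDF")
        for relative in (protein, ligand):
            if not (data_dir / relative).is_file():
                raise FileNotFoundError(f"{relative} not found in {data_dir.name}/")
        complexes.append({
            # Fall back to the file name when the name columns are absent.
            "protein_name": (row.get("Protein_name") or protein.stem).strip(),
            "ligand_name": (row.get("Ligand_name") or ligand.stem).strip(),
            "protein_path": data_dir / protein,
            "ligand_path": data_dir / ligand,
            "ba": float(row["BA"]),
        })
    return complexes


# Rebuild the example layout with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "plcUserData"
    for pdb_id in ("4tim", "1drk"):
        (data_dir / pdb_id).mkdir(parents=True)
        (data_dir / pdb_id / f"{pdb_id}_protein.pdb").touch()
        (data_dir / pdb_id / f"{pdb_id}_ligand.mol2").touch()
    (data_dir / "plcUserData.csv").write_text(
        "Protein_file_path, Ligand_file_path, BA\n"
        "4tim/4tim_protein.pdb, 4tim/4tim_ligand.mol2, 2.16\n"
        "1drk/1drk_protein.pdb, 1drk/1drk_ligand.mol2, 6.82\n"
    )
    complexes = read_info_file(data_dir)

for entry in complexes:
    print(entry["protein_name"], entry["ligand_name"], entry["ba"])
```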