Reordering CSV File

Problem

You want to reorder the columns of a CSV file to match the OEChem TK CSV File Format.

Ingredients

  • csv (Python Standard Library)

  • OEChem TK - cheminformatics toolkit

Difficulty Level

../_images/chilly.png

Download

Download code

reordercsv.py with with-title.csv and no-header.csv supporting data files

See also the Usage subsection.

Solution

The OEChem TK CSV File Format expects the first column to be a SMILES string representing the molecule and the second column to be the molecule title. The CSV file can be with or without a header line.

The get_header function handles both kinds of input. If the input CSV file has a header line, the SMILES and title fields are identified by case-insensitive substring matching on the column names. If there is no header line, at least one field in the first line must be interpretable as a valid SMILES string.

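import csv

from openeye import oechem


def get_header(iname, hasheader):

    # read only the first line: the header, or the first record if there is none
    with open(iname, 'r', newline='') as ifs:
        header = csv.DictReader(ifs).fieldnames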

    if hasheader:
        if not any('smiles' in h.lower() for h in header):
            oechem.OEThrow.Fatal("Input CSV file must have a SMILES column!")

        # find the SMILES column and move it to the front
        smiles = header.pop(next(i for i, h in enumerate(header) if 'smiles' in h.lower()))
        header.insert(0, smiles)

        # move the TITLE column to be right after the SMILES
        if any('title' in h.lower() for h in header):
            title = header.pop(next(i for i, h in enumerate(header) if 'title' in h.lower()))
            header.insert(1, title)
        else:
            header.insert(1, "TITLE")

    else:
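        # find column that can be interpreted as SMILES
        # (is_valid_SMILES is not shown in this excerpt)
        for i, h in enumerate(header):
            if is_valid_SMILES(h):
                smiles = header.pop(i)
                break
        else:
            # no field in the first line is a valid SMILES string
            oechem.OEThrow.Fatal("Input CSV file must have a SMILES column!")
        header.insert(0, smiles)
        header.insert(1, '')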

    return header
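
The header branch can be tried without OEChem TK installed. The sketch below is a standalone rewrite of the same substring-matching logic. The name reorder_header is hypothetical and not part of reordercsv.py, and a plain ValueError stands in for the OEChem TK fatal error.

```python
def reorder_header(header):
    """Return a copy of header with the SMILES column first and the title second.

    Hypothetical helper for illustration; it mirrors the hasheader branch.
    """
    header = list(header)  # work on a copy, leave the caller's list untouched

    # case-insensitive substring match for the SMILES column
    smiles_idx = next((i for i, h in enumerate(header) if 'smiles' in h.lower()), None)
    if smiles_idx is None:
        raise ValueError("Input CSV file must have a SMILES column!")
    header.insert(0, header.pop(smiles_idx))

    # same matching for the title column; add a new column if none exists
    title_idx = next((i for i, h in enumerate(header) if 'title' in h.lower()), None)
    if title_idx is None:
        header.insert(1, "TITLE")
    else:
        header.insert(1, header.pop(title_idx))

    return header


print(reorder_header(['Case', 'xxSMILESxx', 'MTP', 'diameter', 'petitjean', 'title']))
# prints ['xxSMILESxx', 'title', 'Case', 'MTP', 'diameter', 'petitjean']
```

Because the match is by substring, a column named xxSMILESxx is accepted as the SMILES column.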

After the header of the file is determined, the script reorders every record accordingly.
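
The reordering step itself is not listed here. One way it could be done with the standard csv module is sketched below, assuming the input has a header line. The name reorder_rows is hypothetical, and the in-memory text streams stand in for the opened input and output files.

```python
import csv
import io


def reorder_rows(ifs, ofs, header):
    """Copy CSV records from ifs to ofs with columns in the order given by header.

    Hypothetical helper: header is the reordered list of column names.
    """
    reader = csv.DictReader(ifs)  # column names are read from the first line
    # restval='' fills a column that is in header but not in the input,
    # such as a TITLE column added because the input had none
    writer = csv.DictWriter(ofs, fieldnames=header, restval='', lineterminator='\n')
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


ifs = io.StringIO(
    "Case,xxSMILESxx,MTP,diameter,petitjean,title\n"
    "0,O=C1Cc2ccccc21,14,5,0.40000001,mol1\n"
)
ofs = io.StringIO()
reorder_rows(ifs, ofs, ['xxSMILESxx', 'title', 'Case', 'MTP', 'diameter', 'petitjean'])
print(ofs.getvalue(), end='')
```

Reading each record as a dictionary keyed by column name means the output order depends only on the header list passed to the writer.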

Usage

The first example parses and reorders a CSV file that has a header line to match the OEChem TK CSV file format.

The input file, with-title.csv, has columns for both the SMILES string and the molecule title, but they are not in the order that OEChem TK expects:

Case,xxSMILESxx,MTP,diameter,petitjean,title
0,O=C1Cc2ccccc21,14,5,0.40000001,mol1
0,Clc1ccc(cc1)C1c2c(OC(N)=C1C#N)[nH][nH0]c2C(F)(F)F,20.5,9,0.44444445,mol2
0,O=C(OC)C(=Cc1ccccc1)Cc1ccccc1,27.5,10,0.5,mol3
0,FC(F)(F)c1[nH0]cc2ccccc2c1,30.5,7,0.42857143,mol4

prompt > python3 reordercsv.py -in with-title.csv -out out_withTitle.csv

Running the above command will generate the following CSV file that matches the OEChem TK CSV file format:

xxSMILESxx,title,Case,MTP,diameter,petitjean
O=C1Cc2ccccc21,mol1,0,14,5,0.40000001
Clc1ccc(cc1)C1c2c(OC(N)=C1C#N)[nH][nH0]c2C(F)(F)F,mol2,0,20.5,9,0.44444445
O=C(OC)C(=Cc1ccccc1)Cc1ccccc1,mol3,0,27.5,10,0.5
FC(F)(F)c1[nH0]cc2ccccc2c1,mol4,0,30.5,7,0.42857143

In the following example, the input CSV file has no header line, so the -hasheader flag has to be set to false. The input CSV file, no-header.csv, is shown below.

0,O=C1Cc2ccccc21,14,5,0.40000001
0,Clc1ccc(cc1)C1c2c(OC(N)=C1C#N)[nH][nH0]c2C(F)(F)F,20.5,9,0.44444445
0,O=C(OC)C(=Cc1ccccc1)Cc1ccccc1,27.5,10,0.5
0,FC(F)(F)c1[nH0]cc2ccccc2c1,30.5,7,0.42857143

prompt > python3 reordercsv.py -in no-header.csv -out out.csv -hasheader false

Running the above command will generate the following CSV file. Note that an empty column is inserted after the first column to represent the molecule title. Since the input file has no header, the column holding the title cannot be determined.

O=C1Cc2ccccc21,,0,14,5,0.40000001
Clc1ccc(cc1)C1c2c(OC(N)=C1C#N)[nH][nH0]c2C(F)(F)F,,0,20.5,9,0.44444445
O=C(OC)C(=Cc1ccccc1)Cc1ccccc1,,0,27.5,10,0.5
FC(F)(F)c1[nH0]cc2ccccc2c1,,0,30.5,7,0.42857143
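
The no-header case can be sketched in the same spirit. In the example below, reorder_headerless is a hypothetical name, and looks_like_smiles is a deliberately crude stand-in for the script's is_valid_SMILES check (it only tests for a letter and does no real SMILES validation).

```python
import csv
import io


def looks_like_smiles(field):
    # Crude stand-in (an assumption for this sketch): any field containing a
    # letter is treated as SMILES. The real script validates with OEChem TK.
    return any(c.isalpha() for c in field)


def reorder_headerless(ifs, ofs):
    """Move the SMILES column to the front and add an empty title column."""
    writer = csv.writer(ofs, lineterminator='\n')
    smiles_idx = None
    for row in csv.reader(ifs):
        if not row:
            continue  # skip blank lines
        if smiles_idx is None:
            # locate the SMILES column once, using the first line
            smiles_idx = next((i for i, f in enumerate(row) if looks_like_smiles(f)), None)
            if smiles_idx is None:
                raise ValueError("No field in the first line looks like SMILES")
        rest = row[:smiles_idx] + row[smiles_idx + 1:]
        writer.writerow([row[smiles_idx], ''] + rest)


ifs = io.StringIO(
    "0,O=C1Cc2ccccc21,14,5,0.40000001\n"
    "0,FC(F)(F)c1[nH0]cc2ccccc2c1,30.5,7,0.42857143\n"
)
ofs = io.StringIO()
reorder_headerless(ifs, ofs)
print(ofs.getvalue(), end='')
```

The column index is taken from the first line only, so every record is assumed to have the same layout.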

Command Line Parameters

Simple parameter list
    input/output options
      -hasheader : false if the CSV does not have a header
      -in : Input CSV filename
      -out : Output CSV filename
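
A parameter list like the one above could be reproduced with the standard argparse module, as sketched below. This is an assumption for illustration only: the listing does not show how reordercsv.py parses its options. The parse_bool helper, the dest names, the required flags and the default of true for -hasheader (inferred from the first usage example, which omits the flag) are all hypothetical.

```python
import argparse


def parse_bool(text):
    # hypothetical helper: accept 'true' or 'false' in any letter case
    value = text.lower()
    if value not in ('true', 'false'):
        raise argparse.ArgumentTypeError("expected true or false, got %r" % text)
    return value == 'true'


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Reorder a CSV file")
    # 'in' is a Python keyword, so the value is stored under another name
    parser.add_argument('-in', dest='iname', required=True, help="Input CSV filename")
    parser.add_argument('-out', dest='oname', required=True, help="Output CSV filename")
    parser.add_argument('-hasheader', type=parse_bool, default=True,
                        help="false if the CSV does not have a header")
    return parser.parse_args(argv)


args = parse_args(['-in', 'no-header.csv', '-out', 'out.csv', '-hasheader', 'false'])
print(args.iname, args.oname, args.hasheader)
# prints no-header.csv out.csv False
```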

See also in OEChem TK manual

Theory

API