Reordering CSV File
Problem
You want to reorder a CSV file to match the OEChem TK CSV file format.
Difficulty Level
Download
Download code
reordercsv.py with with-title.csv and no-header.csv supporting data files
See also the Usage subsection.
Solution
The OEChem TK CSV file format expects the first column to be a SMILES string representing the molecule and the second column to be the molecule title. The CSV file may or may not have a header line.
The get_header function handles both kinds of input. If the input CSV file has a header line, the SMILES and title fields are identified by case-insensitive substring matching on the column names. If there is no header line, at least one field in the first line must be interpretable as a valid SMILES string.
import csv
from openeye import oechem


def get_header(iname, hasheader):
    # the first line of the file is read as the list of field names
    with open(iname, 'r') as ifs:
        header = csv.DictReader(ifs).fieldnames

    if hasheader:
        if not any('smiles' in h.lower() for h in header):
            oechem.OEThrow.Fatal("Input CSV file must have a SMILES column!")
        # find the SMILES column and move it to the front
        smiles = header.pop(next(i for i, h in enumerate(header) if 'smiles' in h.lower()))
        header.insert(0, smiles)
        # move the TITLE column to be right after the SMILES
        if any('title' in h.lower() for h in header):
            title = header.pop(next(i for i, h in enumerate(header) if 'title' in h.lower()))
            header.insert(1, title)
        else:
            header.insert(1, "TITLE")
    else:
        # find the first column that can be interpreted as SMILES
        for i, h in enumerate(header):
            if is_valid_SMILES(h):
                smiles = header.pop(i)
                break
        else:
            # no field parsed as SMILES, so the file cannot be reordered
            oechem.OEThrow.Fatal("Input CSV file must have a SMILES column!")
        header.insert(0, smiles)
        # empty placeholder for the unknown title column
        header.insert(1, '')
    return header
After the header of the file is determined, the script reorders the file accordingly.
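The reordering step itself is not listed here. As a rough sketch of how it could be done with the standard csv module alone, the rows can be read as dictionaries keyed by column name and written back in the new column order. The function name reorder_rows and the in-memory test data are illustrative assumptions, not part of reordercsv.py.

```python
import csv
import io


def reorder_rows(src, dst, header):
    """Copy CSV rows from src to dst with columns in the order given by header.

    Hypothetical helper: src and dst are open text streams, header is the
    reordered list of column names (for example, the result of get_header).
    """
    reader = csv.DictReader(src)
    # restval fills columns that are missing from the input,
    # such as a TITLE column that had to be added
    writer = csv.DictWriter(dst, fieldnames=header, restval="", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


source = io.StringIO(
    "Case,xxSMILESxx,MTP,title\n"
    "0,O=C1Cc2ccccc21,14,mol1\n"
)
target = io.StringIO()
reorder_rows(source, target, ["xxSMILESxx", "title", "Case", "MTP"])
print(target.getvalue())
```

Keying rows by column name means the writer only needs the new header order; no index arithmetic is required.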
Usage
The first example parses and reorders a CSV file that has a header line to match the OEChem TK CSV file format.
The input file, with-title.csv, has columns for both the SMILES string and the molecule title, but they are not in the order OEChem TK expects. The input file is shown below:
Case,xxSMILESxx,MTP,diameter,petitjean,title
0,O=C1Cc2ccccc21,14,5,0.40000001,mol1
0,Clc1ccc(cc1)C1c2c(OC(N)=C1C#N)[nH][nH0]c2C(F)(F)F,20.5,9,0.44444445,mol2
0,O=C(OC)C(=Cc1ccccc1)Cc1ccccc1,27.5,10,0.5,mol3
0,FC(F)(F)c1[nH0]cc2ccccc2c1,30.5,7,0.42857143,mol4
prompt > python3 reordercsv.py -in with-title.csv -out out_withTitle.csv
Running the above command will generate the following CSV file that matches the OEChem TK CSV file format:
xxSMILESxx,title,Case,MTP,diameter,petitjean
O=C1Cc2ccccc21,mol1,0,14,5,0.40000001
Clc1ccc(cc1)C1c2c(OC(N)=C1C#N)[nH][nH0]c2C(F)(F)F,mol2,0,20.5,9,0.44444445
O=C(OC)C(=Cc1ccccc1)Cc1ccccc1,mol3,0,27.5,10,0.5
FC(F)(F)c1[nH0]cc2ccccc2c1,mol4,0,30.5,7,0.42857143
In the following example, the input CSV file has no header line, so the -hasheader flag must be set to false. The input CSV file, no-header.csv, is shown below.
0,O=C1Cc2ccccc21,14,5,0.40000001
0,Clc1ccc(cc1)C1c2c(OC(N)=C1C#N)[nH][nH0]c2C(F)(F)F,20.5,9,0.44444445
0,O=C(OC)C(=Cc1ccccc1)Cc1ccccc1,27.5,10,0.5
0,FC(F)(F)c1[nH0]cc2ccccc2c1,30.5,7,0.42857143
prompt > python3 reordercsv.py -in no-header.csv -out out.csv -hasheader false
Running the above command will generate the following CSV file. Note that an empty column is inserted after the first column to represent the molecule title: since the input file has no header, the column that holds the title cannot be determined.
O=C1Cc2ccccc21,,0,14,5,0.40000001
Clc1ccc(cc1)C1c2c(OC(N)=C1C#N)[nH][nH0]c2C(F)(F)F,,0,20.5,9,0.44444445
O=C(OC)C(=Cc1ccccc1)Cc1ccccc1,,0,27.5,10,0.5
FC(F)(F)c1[nH0]cc2ccccc2c1,,0,30.5,7,0.42857143
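For the header-less case there are no column names to key on, so one way to picture the transformation is index-based: move the SMILES field to the front and insert an empty title field after it. The sketch below is an illustration only, not the code in reordercsv.py. The function name reorder_headerless and its smiles_idx argument are hypothetical, and the index is assumed to be known already (for example, from the SMILES check that get_header performs on the first line).

```python
import csv
import io


def reorder_headerless(src, dst, smiles_idx):
    """Move column smiles_idx to the front and add an empty title column.

    Hypothetical helper: src and dst are open text streams.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        if not row:
            continue  # skip blank lines
        smiles = row[smiles_idx]
        rest = row[:smiles_idx] + row[smiles_idx + 1:]
        # the empty string stands in for the unknown molecule title
        writer.writerow([smiles, ""] + rest)


source = io.StringIO(
    "0,O=C1Cc2ccccc21,14,5,0.40000001\n"
    "0,O=C(OC)C(=Cc1ccccc1)Cc1ccccc1,27.5,10,0.5\n"
)
target = io.StringIO()
reorder_headerless(source, target, smiles_idx=1)
print(target.getvalue())
```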
Command Line Parameters
Simple parameter list
input/output options
-hasheader : false if the CSV does not have a header
-in : Input CSV filename
-out : Output CSV filename
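For readers who want to experiment with the same three options without OEChem TK's own command line handling, they can be approximated with Python's standard argparse module. This is a stand-in under stated assumptions, not how reordercsv.py parses its arguments: the names parse_bool and build_parser are hypothetical, and the default of true for -hasheader is inferred from the first usage example, which omits the flag for a file with a header.

```python
import argparse


def parse_bool(text):
    """Interpret 'true' or 'false' (case insensitive) as a boolean."""
    value = text.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {text!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Reorder a CSV file (argparse stand-in)")
    # "in" is a Python keyword, so the value is stored under a different name
    parser.add_argument("-in", dest="infile", required=True, help="Input CSV filename")
    parser.add_argument("-out", dest="outfile", required=True, help="Output CSV filename")
    # assumed default: the file has a header unless told otherwise
    parser.add_argument("-hasheader", type=parse_bool, default=True,
                        help="false if the CSV does not have a header")
    return parser


args = build_parser().parse_args(
    ["-in", "no-header.csv", "-out", "out.csv", "-hasheader", "false"]
)
print(args.infile, args.outfile, args.hasheader)
```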