Depicting CSV or SDF in XLSX (Excel)
Problem
You want to depict molecules along with their associated data read from a
CSV file in an xlsx
Excel file.
See example in drugs.xlsx
and in Figure 1.
Ingredients
|
Difficulty Level
Solution
The CSV file format is a text file format containing comma-separated values. In OEChem TK this file format is implemented to enable data exchange with a wide variety of other software. Each line of a CSV file stores data for a molecule that is represented by a SMILES string.
See also
CSV File Format section of the OEChem TK documentation about the layout of the CSV file format.
When reading a CSV file, the fields of the file are attached to each molecule as SD data. This data can be accessed by the OEGetSDDataIter function that returns an iterator over all the SD data (tag - value) pairs of a molecule. The CollectDataTags function iterates over a list of molecules and returns the unique tags of the data attached to the molecules.
1def CollectDataTags(mollist):
2
3 tags = []
4 for mol in mollist:
5 for dp in oechem.OEGetSDDataIter(mol):
6 if not dp.GetTag() in tags:
7 tags.append(dp.GetTag())
8
9 return tags
The WriteXLSXFile function takes a list of molecules read from a CSV file along with the data tags returned by the CollectDataTags function.
First a ‘worksheet’ is created and styles are added by calling the AddHeadFormat and AddDataFormats functions (lines 5-9). Then the width of each column of the spreadsheet is estimated using the OEEstimateTextWidth function (lines 13-21). This is followed by generating the header of the the table by adding the input filename into the first row (lines 25-27) and each of the data tags into the second (lines 29-37). Then iterating over the molecules, each molecule is depicted in a new row along with the corresponding data with alternating styles. In order to add the images, a temporary image file has to be generated for each molecule. These files can be removed after the workbook is closed (lines 66-71).
1def WriteXLSXFile(oname, mollist, iname, tags, opts):
2
3 # create an Excel file and add formating
4
5 workbook = xlsxwriter.Workbook(oname)
6 worksheet = workbook.add_worksheet()
7
8 headfont, headformat = AddHeadFormat(workbook)
9 datafont, dataformat_even, dataformat_odd = AddDataFormats(workbook)
10
11 # estimate width of columns
12
13 maxwidths = []
14 maxwidths.append(opts.GetWidth() * 0.15)
15
16 for tag in tags:
17 maxwidth = oedepict.OEEstimateTextWidth(tag, headfont) * 2.0
18 for mol in mollist:
19 if oechem.OEHasSDData(mol, tag):
20 value = oechem.OEGetSDData(mol, tag)
21 estimatedwidth = oedepict.OEEstimateTextWidth(value, datafont)
22 maxwidth = max(maxwidth, estimatedwidth)
23 maxwidths.append(maxwidth * 0.12)
24
25 # generate header
26
27 row, col = 0, 0
28 worksheet.set_row(row, None, headformat)
29 worksheet.merge_range('A1:D1', iname)
30
31 row, col = 1, 0
32 worksheet.set_row(row, None, headformat)
33 worksheet.set_column(col, col, maxwidths[col])
34 worksheet.write(row, col, "Molecule")
35
36 for tag in tags:
37 col += 1
38 worksheet.set_column(col, col, maxwidths[col])
39 worksheet.write(row, col, tag)
40
41 # generate rows
42
43 tmpfnames = []
44
45 for mol in mollist:
46 row += 1
47
48 if row % 2 == 0:
49 dataformat = dataformat_even
50 else:
51 dataformat = dataformat_odd
52
53 worksheet.set_row(row, opts.GetHeight() * 0.75, dataformat)
54
55 col = 0
56 fname = "tmp%d.png" % row
57 WriteImageToFile(fname, mol, opts)
58 worksheet.insert_image(row, col, fname)
59 tmpfnames.append(fname)
60
61 for tag in tags:
62 col += 1
63 value = "N/A"
64 if oechem.OEHasSDData(mol, tag):
65 value = oechem.OEGetSDData(mol, tag)
66 worksheet.write(row, col, value)
67
68 workbook.close()
69
70 # remove temporary image files
71
72 for fname in tmpfnames:
73 os.remove(fname)
The WriteImageToFile function generates a molecule depiction and writes it into an image file.
1def WriteImageToFile(fname, mol, opts):
2
3 oedepict.OEPrepareDepiction(mol)
4 disp = oedepict.OE2DMolDisplay(mol, opts)
5 oedepict.OERenderMolecule(fname, disp, False)
The AddHeadFormat function adds a style to the ‘workbook’ that is used in the header of the table.
1def AddHeadFormat(workbook):
2
3 font = oedepict.OEFont(oedepict.OEFontFamily_Default, oedepict.OEFontStyle_Bold, 18,
4 oedepict.OEAlignment_Center, oechem.OEBlack)
5 format = workbook.add_format({'bold': True, 'align': 'center',
6 'valign': 'vcenter', 'size': 18})
7 format.set_bg_color('#F4F4F4')
8 format.set_border_color('#DDDDDD')
9 format.set_border()
10
11 return font, format
The AddDataFormats function adds two formats to the ‘workbook’ one for the even and one for the odd rows of the table.
1def AddDataFormats(workbook):
2
3 font = oedepict.OEFont(oedepict.OEFontFamily_Default, oedepict.OEFontStyle_Default, 12,
4 oedepict.OEAlignment_Center, oechem.OEBlack)
5
6 format_even = workbook.add_format({'bold': False, 'align': 'center',
7 'valign': 'vcenter', 'size': 12})
8 format_even.set_shrink()
9 format_even.set_bg_color('#FFFFF4')
10 format_even.set_border_color('#DDDDDD')
11 format_even.set_border()
12
13 format_odd = workbook.add_format({'bold': False, 'align': 'center',
14 'valign': 'vcenter', 'size': 12})
15 format_odd.set_shrink()
16 format_odd.set_bg_color('#FFF4FF')
17 format_odd.set_border_color('#DDDDDD')
18 format_odd.set_border()
19
20 return font, format_even, format_odd
Download code
csv2xlsx.py
and drugs.csv
supporting data
Usage
Running the above command will generate the
drugs.xlxs
file.
prompt > python3 csv2xlsx.py drugs.csv drugs.xlsx
Discussion
Reading the columns of an CSV file into SD data fields, means
that the OEChem TK provides a meta-data interchange between sdf
files and
CSV files.
Consequently, the same Python script can be used to generate a xlsx
file
reading an sdf
file.
Usage
After downloading drugs.sdf
supporting
data file, the above command will generate the same
drugs.xlsx
file (apart from the input filename on the top).
prompt > python3 csv2xlsx.py drugs.sdf drugs.xlsx
See also
See also in OEChem TK manual
Theory
SD Tagged Data Manipulation section
CSV File Format section
API
OEGetSDDataPairs function
See also in OEDepict TK manual
Theory
Molecule Depiction chapter
API
OE2DMolDisplay class
OE2DMolDisplayOptions class
OEPrepareDepiction function