Data Mining PubChem in Search of Compound Families with Similar NMR Data
Abstract
We are interested in understanding the diversity of NMR data for known small organic molecules. While chemical structure diversity can be measured by enumerating all possible structures for a given molecular formula, the corresponding NMR data are very difficult to obtain, either experimentally or computationally.
To overcome this obstacle, we used the MolSimNMR similarity calculation method to cluster a subset of 2 million small organic molecules from PubChem into compound families expected to present very high NMR data similarity for 1H-1D and 1H-13C HSQC experiments.
Here, we present the results of this work and discuss possible applications, including database search, compound classification, and NMR prediction.
Methods
MolSimNMR
To design negative control structures in a consistent and objective way, a molecular similarity coefficient was developed. The method, termed MolSimNMR, takes two chemical structures as input and outputs the expected similarity between their NMR data. This value was previously shown to be predictive of the similarity between the NMR data of the input structures.
Database Design
An initial subset of about 20 million compounds was obtained from PubChem by selecting compounds that met the following criteria:
- Atoms in {H, B, C, N, O, Si, P, S, F, Cl, Br, I}
- MW <= 500 Da
- Neutral molecules
- # Heavy atoms >= 10
- Isotopic count = 0
- Single component
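The selection criteria above can be expressed as a single predicate. The following is a minimal sketch, not the code used in this work: the `CompoundRecord` class and its field names are hypothetical, the properties are assumed to be precomputed for each compound, and "neutral" is assumed to mean a net formal charge of zero.

```python
from dataclasses import dataclass

# Elements allowed by the first criterion.
ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "Si", "P", "S", "F", "Cl", "Br", "I"}


@dataclass
class CompoundRecord:
    """Hypothetical record of precomputed properties for one compound."""
    elements: frozenset        # element symbols present in the compound
    molecular_weight: float    # in Da
    net_charge: int            # assumption: "neutral" means net charge of zero
    heavy_atom_count: int      # number of non-hydrogen atoms
    isotope_atom_count: int    # atoms with an explicit isotope label
    component_count: int       # covalently bonded units in the record


def passes_initial_filter(compound: CompoundRecord) -> bool:
    """Return True if the compound meets all six selection criteria."""
    return (
        compound.elements <= ALLOWED_ELEMENTS      # only allowed elements
        and compound.molecular_weight <= 500.0     # MW <= 500 Da
        and compound.net_charge == 0               # neutral molecule
        and compound.heavy_atom_count >= 10        # at least 10 heavy atoms
        and compound.isotope_atom_count == 0       # no isotopic labels
        and compound.component_count == 1          # single component
    )
```

Keeping every criterion in one boolean expression makes it easy to compare the code line by line against the list above.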
These roughly 20 million compounds were still considered too many for further analysis. To reduce the size of the database, a few additional metrics were developed, keeping in mind the overall goal of finding compounds that represent the variability in NMR data. For that reason, compounds with only a few NMR-active nuclei were discarded, as were compounds with reactive groups. The following metrics were designed and used as filters to bring the final number of compounds to around 2 million:
Metric | min | max | ave | std dev |
Hload | 0.00 | 2.19 | 0.83 | 0.24 |
MW | 24.49 | 500.00 | 364.95 | 79.30 |
RBHA | 0.00 | 21.69 | 0.22 | 0.10 |
Cload | 0.00 | 1.00 | 0.74 | 0.08 |
DOU | -7.63 | 33.39 | 10.32 | 3.66 |
The molecular weight distribution of the 20 million compound database shows three distinct populations, with peaks at around 210, 330 and 410 Da.

Structural Clusters
Below are some of the chemical structures that were clustered together based on the MolSimNMR similarity coefficient.

Cluster Methodology
Any clustering procedure tries to group together objects with similar descriptions. Usually these objects are represented by a set of descriptors that can be calculated computationally.
In this case our objective is to group together chemical compounds based on the expected similarity of their NMR data. For that reason, we used the MolSimNMR similarity method, developed expressly for that purpose, as an objective measure of NMR data similarity.
Around 700 descriptors were used to represent each molecule.
Given the large size of the database, and after many other clustering methods failed, we were able to cluster the whole database using Sequential K-Means Clustering. Total computing time was 37 hours on an Intel Core i7 with 8 GB of RAM, running Windows 7.
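To illustrate why a sequential approach scales to a database of this size, here is a minimal sketch of one common form of sequential (online) k-means. It is not the implementation used in this work. It assumes that the first k vectors seed the centroids and that squared Euclidean distance between descriptor vectors decides the nearest centroid; the function names are illustrative.

```python
from itertools import islice


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sequential_kmeans(vectors, k):
    """Cluster a stream of descriptor vectors in a single pass.

    Assumption: the first k vectors are used as initial centroids.
    Each later vector is assigned to its nearest centroid, which is
    then moved toward the vector by 1/n, where n is the number of
    vectors assigned to that centroid so far (a running mean).
    """
    stream = iter(vectors)
    centroids = [list(v) for v in islice(stream, k)]
    counts = [1] * len(centroids)
    for vector in stream:
        # Find the index of the nearest centroid.
        nearest = min(
            range(len(centroids)),
            key=lambda i: squared_distance(vector, centroids[i]),
        )
        counts[nearest] += 1
        rate = 1.0 / counts[nearest]
        # Incremental mean update: c <- c + (x - c) / n
        centroids[nearest] = [
            c + rate * (x - c) for c, x in zip(centroids[nearest], vector)
        ]
    return centroids, counts
```

Because each vector is read once and only the centroids and their counts are kept in memory, the memory footprint depends on the number of clusters and descriptors rather than on the number of compounds.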

Cluster Centroid Visualization
The star plots below represent the different clusters obtained in this work. Each circle is divided into 180 bars, each one representing a single chemical descriptor. Since not all descriptors are populated for a given cluster, it is possible to use this representation without too much overlap.
The diagrams below show the coordinates of each centroid in polar coordinates. The angle maps the type of descriptor and the radius maps the count of that descriptor in the centroid.
A total of 825 clusters were finally obtained, each represented by its centroid and containing anywhere from a single compound to more than 50,000.
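The polar mapping used in the star plots can be sketched as follows. This is an illustrative example only: it assumes the bars are spaced evenly around the circle in descriptor order, starting at angle zero, and the function name `star_plot_bars` is hypothetical.

```python
import math


def star_plot_bars(centroid):
    """Map a centroid vector to star-plot bars.

    Each descriptor i of n gets the angle 2*pi*i/n (assumption: even
    spacing, starting at zero), and its value in the centroid becomes
    the bar length (radius). Returns (angle, radius, x, y) tuples,
    where (x, y) is the tip of the bar in Cartesian coordinates.
    """
    n = len(centroid)
    bars = []
    for i, value in enumerate(centroid):
        angle = 2.0 * math.pi * i / n
        bars.append((angle, value, value * math.cos(angle), value * math.sin(angle)))
    return bars
```

For a centroid with 180 descriptors, consecutive bars are 2 degrees apart, and descriptors with a value of zero collapse to the origin, which is why sparsely populated centroids remain readable.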
Applications
The information unveiled in this work can be used in many applications. Here are just a few:
- The clusters can be used for quick classification of compounds
- Fingerprint calculation can be used for similarity search: what other known structures would give a similar spectrum?
- NMR prediction database optimization. By knowing which types of environments are the most prevalent, it should be possible to optimize a given database to cover that space as efficiently as possible
- Optimization of ASV algorithms. Searching for challenging negative control structures would be very efficient
- Cluster analysis of large chemical databases is possible with common PC equipment, though it is very time consuming
- Fingerprint calculation can be used for structure similarity search based on NMR data similarity
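As an illustration of the fingerprint-based similarity search mentioned above, the sketch below ranks a small library against a query fingerprint. It does not reproduce the MolSimNMR coefficient. As a stand-in, it uses a generalized (min/max) Tanimoto similarity on count-based fingerprints, and the function names and the dictionary layout of the library are assumptions made for this example.

```python
def count_tanimoto(a, b):
    """Generalized Tanimoto similarity between two count vectors.

    Stand-in measure (not MolSimNMR): sum of element-wise minima
    divided by sum of element-wise maxima. Two all-zero vectors are
    treated as identical.
    """
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    return numerator / denominator if denominator else 1.0


def rank_by_similarity(query, library, top_n=3):
    """Return the top_n (similarity, name) pairs, most similar first.

    `library` is assumed to map compound names to fingerprints of the
    same length as `query`. Ties are broken alphabetically by name.
    """
    scored = [(count_tanimoto(query, fp), name) for name, fp in library.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]
```

The same ranking function could be pointed at cluster centroids instead of individual compounds, which would give a quick classification of a new compound into its nearest family.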