Data Mining PubChem in Search of Compound Families with Similar NMR Data


We are interested in understanding the diversity of NMR data from known small organic molecules. While chemical structure diversity can be measured by enumerating all possible structures for a given molecular formula, the corresponding NMR data are very difficult to obtain, either experimentally or computationally.

To overcome this obstacle, we used the MolSimNMR similarity calculation method to cluster a subset of 2 million small organic molecules from PubChem into compound families expected to present very high NMR data similarity for 1H-1D and 1H-13C HSQC experiments.

Here, we present the results of this work and discuss possible applications ranging from database search and compound classification to NMR prediction.



To design negative control structures in a consistent and objective way, a molecular similarity coefficient was developed. The method, termed MolSimNMR, takes two chemical structures as input and outputs the expected NMR data similarity value between them. It has previously been shown that this value is predictive of the similarity between the NMR data of the input structures.

Database Design

An initial subset of about 20 million compounds was obtained from PubChem by selecting compounds that met the following criteria:

  • Atoms in {H, B, C, N, O, Si, P, S, F, Cl, Br, I}
  • MW <= 500 Da
  • Neutral molecules
  • # Heavy atoms >= 10
  • Isotopic count = 0
  • Single component
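
The criteria above amount to a simple predicate applied to each record. The following is a minimal sketch only: the `CompoundRecord` class and its field names are hypothetical and do not correspond to PubChem's schema or to the tooling actually used, and "neutral" is assumed here to mean a net formal charge of zero.

```python
from dataclasses import dataclass

# Element set taken from the first criterion in the list above.
ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "Si", "P", "S", "F", "Cl", "Br", "I"}


@dataclass
class CompoundRecord:
    """Hypothetical per-compound record; field names are illustrative only."""
    elements: frozenset        # element symbols present in the compound
    molecular_weight: float    # in Da
    formal_charge: int         # net charge (assumed definition of "neutral")
    heavy_atom_count: int      # number of non-hydrogen atoms
    isotope_atom_count: int    # atoms with an explicitly specified isotope
    component_count: int       # number of disconnected components


def passes_initial_filter(rec: CompoundRecord) -> bool:
    """Return True if the record satisfies all six selection criteria."""
    return (
        rec.elements <= ALLOWED_ELEMENTS   # only allowed elements present
        and rec.molecular_weight <= 500.0  # MW <= 500 Da
        and rec.formal_charge == 0         # neutral molecule
        and rec.heavy_atom_count >= 10     # at least 10 heavy atoms
        and rec.isotope_atom_count == 0    # no isotopic labels
        and rec.component_count == 1       # single component
    )
```

Keeping the criteria in one predicate makes it easy to stream a large download through the filter one record at a time.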

That number of compounds was still considered too large for further analysis. To reduce the size of the database, a few other metrics were developed, keeping in mind the overall goal of finding compounds that represent the variability in NMR data. For that reason, compounds that present only a few NMR-active nuclei were discarded, as well as compounds with reactive groups. The following metrics were designed and used as filters to bring the final number of compounds to around 2 million:

Metric      Min      Max      Ave   Std dev
Hload      0.00     2.19     0.83      0.24
MW        24.49   500.00   364.95     79.30
RBHA       0.00    21.69     0.22      0.10
Cload      0.00     1.00     0.74      0.08
DOU       -7.63    33.39    10.32      3.66

Inspecting the molecular weight distribution of the 20-million-compound database, it was possible to identify three different distributions, with peaks at around 210, 330 and 410 Da.

Structural Clusters

Below are some of the chemical structures that were clustered together based on the MolSimNMR similarity coefficient.


Cluster Methodology

Any clustering procedure tries to group together objects with similar descriptions. Usually these objects are described by a set of descriptors that can be calculated computationally.

In this case, our objective is to group together chemical compounds based on the expected similarity of their NMR data. For that reason, we used the MolSimNMR similarity method, developed expressly for that purpose, as an objective measure of NMR data similarity.

Around 700 descriptors were used to represent each molecule.

Given the large size of the database, and after many other clustering methods had failed, we were able to cluster the whole database using Sequential K-Means Clustering. Total computing time was 37 hours on an Intel Core i7 with 8 GB of RAM, running Windows 7.
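
For readers unfamiliar with the technique, a sequential (online) k-means pass can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this work: the first k points are taken as seeds, squared Euclidean distance stands in for the MolSimNMR-based measure, and the function names are hypothetical. Each point is seen once and only the centroids and their counts are kept in memory, which is what makes the approach practical for millions of descriptor vectors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (a stand-in for the actual similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sequential_kmeans(points, k):
    """Single-pass k-means: each point updates its nearest centroid once.

    `points` may be any iterable (e.g. a generator reading from disk).
    Returns (centroids, counts), where counts[j] is the size of cluster j.
    """
    centroids = []
    counts = []
    for x in points:
        # Assumption: the first k points seed the k centroids.
        if len(centroids) < k:
            centroids.append([float(v) for v in x])
            counts.append(1)
            continue
        # Assign the point to its nearest centroid.
        j = min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], x))
        counts[j] += 1
        # Running-mean update: move the centroid toward x by 1/n of the gap.
        rate = 1.0 / counts[j]
        c = centroids[j]
        for d, xd in enumerate(x):
            c[d] += rate * (xd - c[d])
    return centroids, counts
```

After the pass, each centroid is the mean of the points that were assigned to it; the result depends on the order in which points arrive and on the choice of seeds.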

Cluster Centroid Visualization

The star plots below represent the different clusters obtained in this work. Each circle is divided into 180 bars, each one representing a single chemical descriptor. Since not all descriptors are populated for a given cluster, it is possible to use this representation without too much overlap.

The diagrams below represent the coordinates of each centroid in polar coordinates. The angle maps the type of descriptor and the radius maps the count of that descriptor in the centroid.
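
The polar mapping described above can be sketched in a few lines. This is an illustrative assumption about the layout: descriptors are taken to be spaced evenly around the circle in index order, and the function names are hypothetical.

```python
import math


def centroid_to_polar(centroid):
    """Map a centroid vector to (angle, radius) pairs, one per descriptor.

    Assumption: descriptor i sits at angle 2*pi*i/n, and its radius is the
    centroid's value (count) for that descriptor.
    """
    n = len(centroid)
    return [(2.0 * math.pi * i / n, value) for i, value in enumerate(centroid)]


def polar_to_xy(theta, r):
    """Convert one (angle, radius) pair to Cartesian coordinates for drawing."""
    return (r * math.cos(theta), r * math.sin(theta))
```

Descriptors with a zero value collapse to the origin, which is why sparsely populated centroids remain readable in this representation.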

A total of 825 clusters were finally determined, ranging in size from a single compound to more than 50,000.



The information unveiled by this work can be used in many applications. Here are just a few:

The clusters can be used for quick classification of compounds.

Fingerprint calculation can be used for similarity search: what other known structures would give a similar spectrum?

NMR prediction database optimization. By knowing what types of environments are the most prevalent, it should be possible to optimize a given database to cover that space as efficiently as possible.

Optimization of ASV algorithms. Searching for challenging negative control structures would be very efficient.
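
As an illustration of the similarity-search application, the sketch below ranks database entries against a query fingerprint. The MolSimNMR formula is not reproduced here; a generalized Tanimoto (min/max) similarity on non-negative descriptor counts is used as an assumed stand-in, and the function names and the toy data are hypothetical.

```python
def minmax_similarity(a, b):
    """Generalized Tanimoto similarity for non-negative count vectors.

    Assumption: this is a stand-in measure, not the MolSimNMR coefficient.
    Returns a value in [0, 1]; two all-zero vectors are treated as identical.
    """
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(max(x, y) for x, y in zip(a, b))
    return 1.0 if total == 0 else shared / total


def rank_by_similarity(query, database, top_n=5):
    """Return the top_n (name, similarity) pairs, most similar first."""
    scored = [(name, minmax_similarity(query, fp)) for name, fp in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Toy usage with made-up three-descriptor fingerprints.
example_db = {"a": [1, 2, 0], "b": [0, 2, 1], "c": [0, 0, 3]}
example_hits = rank_by_similarity([1, 2, 0], example_db)
```

The same scoring function could be applied against the cluster centroids first, so that only members of the best-matching clusters need to be compared individually.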



Cluster analysis of large chemical databases is possible with common PC equipment, though it is very time-consuming.

Fingerprint calculation can be used for structure similarity search based on NMR data similarity.