PubChem Data Mining

In this post we’re going to explore how we can use the MolSimNMR method to find compounds with similar NMR data in large databases. We can envision at least two scenarios where this can be useful. The first is the case where we already have a structure in hand and want to know how many other known structures would present similar NMR data. The second is one in which we want to group or cluster a given structural database into families, based on the similarity of their NMR data.

The first idea that comes to mind is to take the NMR data itself and apply one of the many spectral matching algorithms in the literature, or to come up with a novel one. This approach is usually employed with in-house built databases. Unfortunately, the amount of NMR data accessible to us is a tiny fraction of the chemical diversity available in nature, and publicly available data is even scarcer. The next option would be to predict the NMR data from the structure and then apply a spectral matching algorithm to the predicted spectra. This is an even worse proposition, for at least two reasons: the time involved in predicting the NMR data for every structure in the database, and the error involved in predicting nuclei with unknown environments or with minimal underlying data.

The question we have to ask ourselves, then, is: can we come up with a way of predicting NMR data similarity based solely on chemical structural information? The answer is obviously yes, otherwise I wouldn’t be writing this article. The more interesting question is how, but before we delve into the inner workings of MolSimNMR we need to know a little bit about two key concepts: molecular fingerprints and similarity calculations.

Molecular fingerprints

A molecular fingerprint is a collection of descriptors used to characterize a molecule. For example, the molecular formula is a type of descriptor that tells us the elemental composition of a compound. This knowledge decreases the number of possible compounds immensely, yet a huge number of possible compounds remains for a given formula. We can further describe our molecule by adding more descriptors, and with each addition the number of possible structures diminishes. Physical properties such as pKa or logP can also serve as descriptors. These descriptors are then assembled into an array, which is in turn used to calculate a similarity coefficient. The most common fingerprints are: Public MDL keys, fcp4, fragment-based, etc.
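To make the idea of "descriptors assembled into an array" concrete, here is a minimal sketch in Python. It is not how MolSimNMR or any real fingerprint is built: the element order in `ELEMENTS`, the function names, and the extra "hydroxyl count" descriptor are all assumptions chosen for illustration.

```python
import re

# Hypothetical fixed element order; each position in the array is one descriptor.
ELEMENTS = ("C", "H", "N", "O", "S")

def formula_counts(formula):
    """Parse a simple molecular formula such as 'C2H6O' into element counts."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts

def descriptor_array(formula, extra=()):
    """Assemble the formula descriptors, plus any extra ones, into one array."""
    counts = formula_counts(formula)
    return [counts.get(element, 0) for element in ELEMENTS] + list(extra)

# Ethanol and dimethyl ether share the formula C2H6O, so the formula alone
# cannot tell them apart.
ethanol = descriptor_array("C2H6O")
dimethyl_ether = descriptor_array("C2H6O")

# Adding one more (illustrative) descriptor, the number of hydroxyl groups,
# separates the two isomers.
ethanol_plus = descriptor_array("C2H6O", extra=(1,))
dimethyl_ether_plus = descriptor_array("C2H6O", extra=(0,))
```

With the formula alone the two isomers produce identical arrays; once the extra descriptor is appended they differ, which is the narrowing effect described above.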


Similarity Calculations

A great number of metrics can be calculated between two fingerprints in order to measure their similarity, or how close they are in the multidimensional fingerprint space.
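As an illustration, two widely used metrics for binary fingerprints are the Tanimoto (Jaccard) and Dice coefficients. The sketch below assumes each fingerprint is represented as the set of positions of its "on" bits; the example fingerprints are made up, and returning 1.0 for two empty fingerprints is just one possible convention.

```python
def tanimoto(a, b):
    """Tanimoto coefficient: shared on-bits divided by on-bits in either fingerprint."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Convention chosen here: two empty fingerprints count as identical.
    return len(a & b) / union if union else 1.0

def dice(a, b):
    """Dice coefficient: twice the shared on-bits over the total on-bits of both."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0

# Made-up fingerprints, given as the indices of the bits that are set.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8, 9, 12}

print(tanimoto(fp_a, fp_b))  # 3 shared bits / 6 bits in the union = 0.5
print(dice(fp_a, fp_b))      # 2 * 3 / (4 + 5), roughly 0.667
```

Both coefficients range from 0 (no bits in common) to 1 (identical fingerprints), but Dice weights the shared bits more heavily, so the same pair usually scores higher with Dice than with Tanimoto.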