Activity landscape: a representation that integrates the analyses of structural similarity and potency differences between compounds sharing the same biological activity.
Activity cliff: a pair of structurally “similar” compounds with “large” differences in potency. This is an intuitive concept for a medicinal chemist and corresponds to the exceptions to the “similarity principle”, or neighbourhood behaviour, which assumes that similar structures have similar properties.
Activity landscapes are characterized by visual exploration with the help of SAS maps or network graphs, or by quantifying the relationship between chemical similarity and activity similarity. Activity similarity is usually defined by the absolute difference between activities, or by the absolute difference normalized by the activity range, for example the SAR Index and the SALI index. While it has been argued that the activity cliff concept is not applicable to properties beyond receptor interaction, the techniques for detecting discontinuities in SAR landscapes are potentially useful in modelling any chemical property, even though the reason for the cliffs' existence may be different.
In the years following the development of ToxMatch 1.x, it became widely recognized that “neighbourhood behaviour”, although a useful hypothesis, does not hold everywhere in chemical space. The multitude of approaches to characterizing activity landscapes shows that this is now mainstream knowledge, but the unambiguous characterization of activity landscapes remains an unsolved scientific problem.
Events | `s` (high similarity) | `!s` (low similarity) |
---|---|---|
`!t` (small activity difference) | `c ~ P(s|!t)` | `d ~ P(!s|!t)` |
`t` (large activity difference) | `a ~ P(s|t)` | `b ~ P(!s|t)` |
The `G^2` statistic [2,3] estimates the likelihood of an event `t` taking place when another event `s` is also observed. The 2x2 contingency table (Table 1) defines the conditional probabilities of the event `t`, given that the event `s` or its opposite (`!s`) was observed.
`G^2 = a log((a(c+d))/(c(a+b))) + b log((b(c+d))/(d(a+b)))`
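As a rough illustration, the formula can be evaluated directly from the four counts. The sketch below assumes natural logarithms and, as an assumption that is not stated in the text, replaces zero counts with a small constant `eps` so that the logarithms stay finite. The function name `g2` and the default value of `eps` are hypothetical.

```python
import math


def g2(a, b, c, d, eps=1e-7):
    """Evaluate G^2 = a*log(a(c+d)/(c(a+b))) + b*log(b(c+d)/(d(a+b))).

    a, b, c, d are the counts of the 2x2 contingency table.
    eps is an assumed floor for zero counts (not taken from the text).
    """
    # Replace zero counts with eps to avoid log(0) and division by zero.
    a, b, c, d = (max(float(x), eps) for x in (a, b, c, d))
    term_similar = a * math.log(a * (c + d) / (c * (a + b)))
    term_dissimilar = b * math.log(b * (c + d) / (d * (a + b)))
    return term_similar + term_dissimilar


print(round(g2(1, 310, 1, 216), 2))
```

Under these assumptions, the counts `a = 1`, `b = 310`, `c = 1`, `d = 216` give a value of about 0.07. Without some treatment of zero counts, a table with `c = 0` would leave the first term undefined.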
The `G^2` statistic is used in natural language processing as a measure of word co-occurrence. In our case, `G^2` represents the likelihood of a compound forming an activity cliff, which is defined by a large difference in activity (event `t`) with other compounds in the dataset, given high similarity (event `s`). To calculate the activity cliff likelihood, one has to define what is considered a large difference in activity (i.e. an activity threshold) and what is considered a high similarity (i.e. a similarity threshold). Once the thresholds are defined, the 2x2 contingency table (Table 1) is prepared by comparing the compound with all other compounds in the analysed dataset and incrementing the relevant count. When only the structure pairs between a given compound and all other compounds in the analysed dataset are taken into account, `G^2` characterizes the likelihood of this particular compound forming activity cliffs with the compounds in the dataset. By estimating `G^2` for all structures in the dataset, a ranking can be established, thus identifying the most prominent activity cliffs.
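The counting step can be sketched as follows. This is a minimal illustration, not the ToxMatch implementation: the function name `contingency_counts`, the toy data, and the choice of strict (`>`) versus inclusive (`>=`) comparisons against the thresholds are all assumptions.

```python
def contingency_counts(index, activities, similarity,
                       activity_threshold, similarity_threshold):
    """Count the pairs (index, j) that fall into each cell of the 2x2 table.

    a: high similarity, large activity difference
    b: low similarity,  large activity difference
    c: high similarity, small activity difference
    d: low similarity,  small activity difference
    """
    a = b = c = d = 0
    for j, other in enumerate(activities):
        if j == index:
            continue  # a compound is not compared with itself
        t = abs(activities[index] - other) > activity_threshold   # event t
        s = similarity[index][j] >= similarity_threshold          # event s
        if t and s:
            a += 1
        elif t:
            b += 1
        elif s:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Toy data (hypothetical): four compounds with a symmetric similarity matrix.
activities = [5.0, 9.0, 5.2, 8.8]
similarity = [
    [1.00, 0.90, 0.80, 0.20],
    [0.90, 1.00, 0.30, 0.85],
    [0.80, 0.30, 1.00, 0.10],
    [0.20, 0.85, 0.10, 1.00],
]
tables = [contingency_counts(i, activities, similarity, 2.0, 0.7)
          for i in range(len(activities))]
print(tables)
```

Feeding each tuple `(a, b, c, d)` into the `G^2` formula and sorting the compounds by the resulting value gives the ranking described above.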
The following examples use the PubChem Thrombin inhibitors assay AID 1215. The dataset is imported into an Ambit instance and is available at https://apps.ideaconsult.net/toxmatch/dataset/112. The Tanimoto similarity is calculated from 1024-bit hashed fingerprints generated by the CDK library.
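For readers unfamiliar with the Tanimoto coefficient, the sketch below shows how it is computed for two binary fingerprints, each represented here as the set of its "on" bit positions. It does not call the CDK; the function name `tanimoto`, the example bit positions, and the convention of returning 0.0 for two empty fingerprints are assumptions made for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: |A intersect B| / |A union B| for two sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention when both fingerprints are empty
    return len(a & b) / union


# Hypothetical on-bit positions within a 1024-bit fingerprint.
fp1 = {1, 5, 9, 512}
fp2 = {5, 9, 12, 512}
print(tanimoto(fp1, fp2))
```

A pair of compounds counts as "highly similar" (event `s`) when this value reaches the chosen similarity threshold.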
`G^2` rank | ID | `a` | `b` | `c` | `d` | Activity | `G^2` |
---|---|---|---|---|---|---|---|
1 | | 2 | 216 | 0 | 310 | 50 (inactive) | 32.34 |
2 | | 1 | 310 | 1 | 216 | 5.84 | 0.07 |
3 | | 1 | 308 | 1 | 218 | 10.90 | 0.07 |
The bubble chart is space efficient and can represent a large number of values in a small space.
Generated from the Sutherland DHFR dataset DOI: 10.1021/ci034143r