We have recently proposed a new and efficient method for identifying activity cliffs and visualizing activity landscapes.
The method ranks the activity cliffs by a probabilistic measure: the likelihood of a compound having a large activity difference compared to other compounds, while being highly similar to them.
Table 1. Conditional probability of event co-occurrence
|Events||`s` (high similarity)||`!s` (low similarity)|
|`t` (large activity difference)||`a ~ P(s|t)`||`b ~ P(!s|t)`|
|`!t` (small activity difference)||`c ~ P(s|!t)`||`d ~ P(!s|!t)`|
The `G^2` statistic [2] estimates the likelihood of an event `t` taking place when another event `s` is also observed. The 2x2 contingency table (Table 1) defines the conditional probabilities of the event `t`, provided that the event `s` or its opposite (`!s`) was observed.
`G^2 = a log((a(c+d))/(c(a+b))) + b log((b(c+d))/(d(a+b)))`
The `G^2` statistic is used in natural language processing as a measure of word co-occurrence.
In our case, `G^2` represents the likelihood of a compound forming an activity cliff,
which is defined by a large difference in activity (event `t`) with other compounds in the dataset,
given high similarity (event `s`).
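As a minimal sketch, the formula above can be evaluated directly from the four counts of Table 1. The function name `g2`, the use of the natural logarithm, and the handling of zero counts (a zero numerator count contributes nothing, a zero denominator count yields infinity) are assumptions made for illustration; the text does not specify them.

```python
import math


def g2(a, b, c, d):
    """Evaluate G^2 = a*log(a(c+d)/(c(a+b))) + b*log(b(c+d)/(d(a+b))).

    Assumptions (not specified in the text): natural logarithm;
    a term with a zero numerator count contributes 0; a term whose
    denominator count is zero evaluates to +infinity.
    """

    def term(x, y):
        # x * log( x*(c+d) / (y*(a+b)) )
        if x == 0:
            return 0.0
        if y == 0:
            return math.inf
        return x * math.log(x * (c + d) / (y * (a + b)))

    return term(a, c) + term(b, d)


# Toy counts, not taken from any dataset.
print(g2(2, 1, 1, 2))
```

When the two rows of the table have the same proportions (for example all four counts equal), both logarithms vanish and the sketch returns zero.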
To calculate the activity cliff likelihood, one has to define what is considered a large difference
in activity (i.e. an activity threshold), and what is considered a high similarity
(i.e. a similarity threshold).
Once the thresholds are defined, the 2x2 contingency table
(Table 1) is prepared by comparing the compound with all other compounds in the analyzed dataset
and incrementing the relevant count:
- `a` is the number of pairs with activity difference above the activity threshold and similarity above the similarity threshold
- `b` is the number of pairs with activity difference above the activity threshold and similarity below the similarity threshold
- `c` is the number of pairs with activity difference below the activity threshold and similarity above the similarity threshold
- `d` is the number of pairs with activity difference below the activity threshold and similarity below the similarity threshold
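The counting described above can be sketched as follows. The function name `cliff_counts`, the list-of-lists similarity matrix, the treatment of values exactly equal to a threshold as "below", and the toy data are all assumptions for illustration, not part of the original method description.

```python
def cliff_counts(index, activities, similarity,
                 activity_threshold, similarity_threshold):
    """Return the counts (a, b, c, d) for the compound at `index`.

    activities: list of activity values, one per compound.
    similarity: square matrix (list of lists) of pairwise similarities.
    Assumption: a value exactly equal to a threshold counts as "below".
    """
    a = b = c = d = 0
    for j in range(len(activities)):
        if j == index:
            continue  # a compound is not compared with itself
        large_difference = abs(activities[index] - activities[j]) > activity_threshold
        high_similarity = similarity[index][j] > similarity_threshold
        if large_difference and high_similarity:
            a += 1
        elif large_difference:
            b += 1
        elif high_similarity:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Made-up toy data for four compounds (not from AID 1215).
activities = [5.0, 8.0, 5.2, 9.0]
similarity = [
    [1.0, 0.9, 0.8, 0.2],
    [0.9, 1.0, 0.3, 0.85],
    [0.8, 0.3, 1.0, 0.1],
    [0.2, 0.85, 0.1, 1.0],
]
print(cliff_counts(0, activities, similarity, 2.0, 0.7))
```

The four counts always sum to the number of other compounds in the dataset, which is a convenient sanity check.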
The likelihood `G^2` is effectively a quantification of a SAS Map (Structure-Activity-Similarity Map)
with defined thresholds. It can be calculated for the entire dataset, for a selected set of compounds, or for an individual compound.
When only the pairs between a given compound and all other compounds in the analyzed dataset are taken into account, `G^2` characterizes the likelihood that this particular compound forms activity cliffs with the other compounds in the dataset.
By estimating `G^2` for all structures in the dataset, a ranking
can be established, thus identifying the most prominent activity cliffs.
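A possible sketch of the ranking step, assuming the four counts have already been collected for every compound. The names `g2` and `rank_by_g2`, the dictionary layout, the natural logarithm and the zero-count conventions are illustrative assumptions.

```python
import math


def g2(a, b, c, d):
    """G^2 from the four counts (natural log; zero counts handled by assumption)."""

    def term(x, y):
        if x == 0:
            return 0.0       # assumption: 0 * log(0) is taken as 0
        if y == 0:
            return math.inf  # assumption: a zero denominator count gives infinity
        return x * math.log(x * (c + d) / (y * (a + b)))

    return term(a, c) + term(b, d)


def rank_by_g2(counts):
    """Sort compounds by decreasing G^2.

    `counts` maps a compound identifier to its (a, b, c, d) tuple.
    Returns a list of (identifier, G^2) pairs, highest value first.
    """
    scored = [(cid, g2(*abcd)) for cid, abcd in counts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up counts for three hypothetical compounds.
toy_counts = {"A": (3, 1, 1, 5), "B": (1, 1, 1, 1), "C": (2, 2, 1, 3)}
for compound, score in rank_by_g2(toy_counts):
    print(compound, round(score, 3))
```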
The following examples use the PubChem thrombin inhibitors assay AID 1215.
The dataset is imported into an Ambit instance and is available at
The Tanimoto similarity is calculated by the CDK library, using 1024-bit hashed fingerprints.
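For readers unfamiliar with the coefficient, the following sketch shows how a Tanimoto similarity is obtained from two bit fingerprints. It does not use CDK; the fingerprints are represented as plain Python sets of on-bit positions, and returning 0.0 for two empty fingerprints is an assumption.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints.

    Each fingerprint is given as a collection of on-bit positions
    (for a 1024-bit fingerprint, integers in the range 0..1023).
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints are treated as dissimilar
    return len(a & b) / union


# Made-up bit positions, not real CDK fingerprints.
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

The coefficient is the number of bits set in both fingerprints divided by the number of bits set in at least one of them, so it ranges from 0 to 1.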
Note that this is a ranking of individual structures, not pairs of structures.
This is a significant advantage, especially when processing large datasets,
as only the likelihood (or the four counts) needs to be stored per compound,
instead of the entire pairwise matrix.
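To give a feeling for the difference in storage, the sketch below compares the number of values kept in each case. The function name `storage_cells` and the dataset size used in the example are hypothetical. Every pair still has to be compared once to fill the counts; the saving is in what is retained afterwards.

```python
def storage_cells(n_compounds):
    """Return (pairwise_cells, per_compound_cells) for n compounds.

    pairwise_cells: entries in the upper triangle of the pairwise matrix.
    per_compound_cells: four counts (a, b, c, d) stored for each compound.
    """
    pairwise = n_compounds * (n_compounds - 1) // 2
    per_compound = 4 * n_compounds
    return pairwise, per_compound


# Hypothetical dataset size, chosen only for illustration.
print(storage_cells(10000))
```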
The column `a` gives the number of pairs that form activity cliffs with the compound.
The paired structures can be easily retrieved by a standard similarity query.
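The retrieval of cliff partners can be sketched as a filter over the same two thresholds. This is an in-memory illustration, not the similarity query of any particular system; the function name `cliff_partners` and the data layout are assumptions.

```python
def cliff_partners(index, activities, similarity,
                   activity_threshold, similarity_threshold):
    """Return the indices of compounds forming an activity cliff with `index`.

    These are exactly the pairs counted in `a`: similarity above the
    similarity threshold and activity difference above the activity threshold.
    """
    partners = []
    for j in range(len(activities)):
        if j == index:
            continue  # skip the query compound itself
        if (similarity[index][j] > similarity_threshold
                and abs(activities[index] - activities[j]) > activity_threshold):
            partners.append(j)
    return partners
```

The length of the returned list equals the count `a` for the query compound.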
A graph arrangement emerges naturally from the set of top-ranked compounds, as they are usually interconnected as activity cliff pairs.
The method goes beyond finding structure pairs only.
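One way to picture the graph mentioned above is as an adjacency structure over the top-ranked compounds, with an edge wherever two of them form an activity cliff pair. The sketch below is only an illustration; the function name `cliff_graph` and the dictionary-of-sets representation are assumptions.

```python
def cliff_graph(top, activities, similarity,
                activity_threshold, similarity_threshold):
    """Build an undirected graph over the compounds listed in `top`.

    `top` is a list of compound indices (for example the best-ranked ones).
    An edge joins two compounds when their activity difference is above
    the activity threshold and their similarity is above the similarity
    threshold. Returns a dict mapping each index to the set of its neighbours.
    """
    graph = {i: set() for i in top}
    for pos, i in enumerate(top):
        for j in top[pos + 1:]:
            if (abs(activities[i] - activities[j]) > activity_threshold
                    and similarity[i][j] > similarity_threshold):
                graph[i].add(j)
                graph[j].add(i)
    return graph
```

Compounds with an empty neighbour set are top-ranked but form their cliffs with compounds outside the selected set.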