From SMILES to Graphs: The Next Frontier in ML-Driven Cheminformatics

March 5, 2025

Quantori’s Cheminformatics department, in partnership with Nebius, recently embarked on a research initiative to develop a molecular generation pipeline leveraging stable diffusion and our own recent developments in graph-convolutional networks.

Since the early days of computing, scientists and engineers have been inventing ways to represent molecules in a language that machines can understand. Whether it’s for searching chemical databases, running simulations, or applying the latest machine learning (ML) methods, these digital representations form the backbone of modern computational chemistry.

Early on, molecular file formats served a simple purpose: store chemical structures so that researchers could exchange and analyze data without losing important details. Over the past 30 years, numerous formats and conventions have emerged, each aiming to tackle a particular challenge, such as storing stereochemistry, 3D coordinates, or the connectivity between atoms. As machine learning becomes a critical tool in cheminformatics, these file formats have taken on a new level of importance: they’re no longer just data containers but fuel for powerful neural networks that learn to predict properties, optimize geometries, and even design brand-new molecules.

A Quick Look at SMILES

Among the most popular ways to describe a molecule in text form is SMILES (Simplified Molecular Input Line Entry System). SMILES encodes the structure of a molecule as a string, typically generated by performing a depth-first traversal of the molecular graph.

[Figure: SMILES notation example]

Image source: Anderson E, Veith GD, Weininger D (1987). SMILES: A line notation and computerized interpreter for chemical structures (PDF). Duluth, MN: U.S. EPA, Environmental Research Laboratory-Duluth. Report No. EPA/600/M-87/021.
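To make the depth-first idea concrete, here is a minimal sketch in Python. It assumes a toy molecule with no rings, only single bonds, and implicit hydrogens; the function name `to_smiles` and the atom/bond list layout are illustrative choices, not part of any SMILES toolkit.

```python
def to_smiles(atoms, bonds, start=0):
    """Write a SMILES-like string for an acyclic, single-bonded molecule."""
    # Build an adjacency list from the list of bonded atom pairs.
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def visit(atom, parent):
        # Depth-first traversal: never walk back to the atom we came from.
        children = [n for n in neighbors[atom] if n != parent]
        out = atoms[atom]
        # Every child except the last is written as a parenthesised branch.
        for child in children[:-1]:
            out += "(" + visit(child, atom) + ")"
        # The last child continues the main chain.
        if children:
            out += visit(children[-1], atom)
        return out

    return visit(start, None)


# Isopropanol, heavy atoms only: C0 - C1(-O2) - C3
atoms = ["C", "C", "O", "C"]
bonds = [(0, 1), (1, 2), (1, 3)]

smiles_from_carbon = to_smiles(atoms, bonds, start=0)  # "CC(O)C"
smiles_from_oxygen = to_smiles(atoms, bonds, start=2)  # "OC(C)C"
```

Both strings describe the same molecule; the only difference is the atom where the traversal starts. Real SMILES writers also handle rings, bond orders, aromaticity, and stereo markers, which this sketch leaves out.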

Why SMILES became popular:

Simplicity: It’s concise, human-readable, and easy to copy-paste between software.

NLP Synergy: Its string-based format tempted researchers to apply Natural Language Processing (NLP) techniques (like Transformers) developed for text to molecular problems.

Despite these advantages, SMILES has some well-known drawbacks:

Multiple Dialects: Different software may produce different SMILES for the same molecule (though canonicalization tools exist).

Lack of Rich 3D Detail: SMILES focuses on connectivity; capturing rotamers, subtle chirality, and conformers is limited or requires extra layers of complexity.

Sensitivity to Atom Order: The SMILES string you get depends on the chosen starting atom. While you can canonicalize, this step is itself non-trivial and sometimes unreliable for large or unusual molecules.

As a result, when it comes to complex tasks such as generating 3D structures directly, ensuring correct stereochemistry, or exploring diverse conformations, SMILES-based approaches can struggle.

Graph-Based Representations

Graph-based formats describe a molecule in terms of nodes (atoms) and edges (bonds). In the most basic version, each atom is a node tagged with attributes like element type or charge while edges store bond orders (single, double, triple and others) and other relevant information. More advanced versions can also include 3D coordinates for each atom.

[Figure: molecular graph representations]

Image source: Wang, Y., Li, Z., Barati Farimani, A. (2023). Graph Neural Networks for Molecules. In: Qu, C., Liu, H. (eds) Machine Learning in Molecular Sciences. Challenges and Advances in Computational Chemistry and Physics, vol 36. Springer, Cham. https://doi.org/10.1007/978-3-031-37196-7_2
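As a rough illustration of the node-and-edge description above, the sketch below stores atoms with their attributes and bonds with their orders. The class names `Atom` and `MolGraph` and their methods are hypothetical, chosen for this example only; production toolkits use richer data models.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Atom:
    element: str
    charge: int = 0
    xyz: Optional[Tuple[float, float, float]] = None  # optional 3D position


@dataclass
class MolGraph:
    atoms: List[Atom] = field(default_factory=list)
    # Bonds are keyed by an ordered index pair (i < j) and store the bond order.
    bonds: Dict[Tuple[int, int], int] = field(default_factory=dict)

    def add_atom(self, element, charge=0, xyz=None):
        self.atoms.append(Atom(element, charge, xyz))
        return len(self.atoms) - 1  # index of the new node

    def add_bond(self, i, j, order=1):
        self.bonds[(min(i, j), max(i, j))] = order

    def neighbors(self, i):
        return sorted(b if a == i else a for a, b in self.bonds if i in (a, b))

    def bond_order_sum(self, i):
        return sum(order for pair, order in self.bonds.items() if i in pair)


# Acetate anion, heavy atoms only: CH3-C(=O)-O(-)
acetate = MolGraph()
c_methyl = acetate.add_atom("C")
c_carboxyl = acetate.add_atom("C")
o_double = acetate.add_atom("O")
o_minus = acetate.add_atom("O", charge=-1)
acetate.add_bond(c_methyl, c_carboxyl)
acetate.add_bond(c_carboxyl, o_double, order=2)
acetate.add_bond(c_carboxyl, o_minus)
```

Because the structure is indexed by atom rather than by position in a string, adding per-atom features such as coordinates does not change how connectivity is stored.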

Why graphs shine in cheminformatics:

Flexibility: They naturally encode connectivity and stereochemistry, and can even incorporate 3D coordinates for each node.

Rich Data Structure: Graphs are better at handling ring systems, branching, and complex topologies without losing crucial information.

GNNs: Graph Neural Networks (GNNs) are designed to process graph-structured data, making them a more direct match for molecular tasks.

Beyond 2D: The Importance of 3D

Many chemical phenomena — like intermolecular interactions or conformational stability — are inherently three-dimensional. Graph-based approaches that include 3D coordinates (sometimes referred to as “molecular graphs with geometry”) enable ML models to:

Predict detailed 3D conformations, which is vital for tasks such as structure-based drug design.
Capture stereochemical nuances and subtle conformational preferences.
Generate brand-new 3D structures that meet specific property constraints.
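One way to see what connectivity alone misses: two conformers can share exactly the same atoms and bonds yet differ in a torsion (dihedral) angle. The sketch below computes that angle from four 3D points using the standard atan2 formulation; the coordinates are made-up illustrative values, not real molecular geometries.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Torsion angle (degrees) around the p1-p2 axis for four bonded points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Same four atoms, same three bonds, two different arrangements in space.
eclipsed = [(1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)]
anti = [(1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)]

angle_eclipsed = dihedral_degrees(*eclipsed)
angle_anti = dihedral_degrees(*anti)
```

A connectivity-only representation treats both arrangements as identical, whereas a graph with coordinates can tell them apart.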

When SMILES Isn’t Enough

In some scenarios, simple connectivity just doesn’t cut it. This is especially true in generative chemistry, where you might want to:

Design molecules that fit a particular 3D shape or binding pocket.
Optimize properties like solubility, toxicity, or reactivity, which can be strongly influenced by 3D conformation.
Ensure correct stereochemistry in drug-like molecules, where the wrong chirality can make or break a compound’s efficacy.

Using SMILES in generative tasks can lead to problems like invalid or duplicated molecular structures due to the complexities of string-based generation. Graph-based models handle these challenges more gracefully because they operate directly on the underlying molecular graph — so they can keep track of valid bonding patterns and 3D geometry constraints throughout the generation process.
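To illustrate why string-based generation can produce invalid output, the sketch below checks two purely syntactic conditions that a character-by-character generator can easily violate: unbalanced branch parentheses and ring-closure digits that are opened but never closed. The function name `has_balanced_syntax` is hypothetical, and the check is deliberately incomplete: it says nothing about valence or aromaticity, which a full SMILES parser would also verify.

```python
def has_balanced_syntax(smiles):
    """Return True if branches and ring closures are paired up."""
    depth = 0            # current parenthesis nesting level
    open_rings = set()   # ring-closure digits seen an odd number of times
    in_bracket = False   # inside [...], digits are isotope/charge/H counts
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a branch was closed before it was opened
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first occurrence opens, second closes
    return depth == 0 and not open_rings and not in_bracket


candidates = ["c1ccccc1", "c1ccccc", "CC(O)C", "CC(O", "[13CH4]"]
report = {s: has_balanced_syntax(s) for s in candidates}
```

Here the benzene string missing its closing `1` and the string with an unclosed branch are both rejected, even though each differs from a valid string by a single character.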

A Glimpse into the Future: EDM and Beyond

One of the most exciting frontiers in graph-based molecular modeling is the use of diffusion models, such as the Equivariant Diffusion Model (EDM). These models are trained to reverse a gradual noising process: at generation time, they start with random “noise” in 3D space and iteratively “denoise” it until a coherent molecular structure emerges, much like how image diffusion models gradually refine a picture out of noise.

[Figure: EDM illustration]

*f(M_{t-1} | M_t) represents a denoising function from step t to step t-1. Inspired by: Hoogeboom E, Garcia Satorras V, Vignac C, Welling M (2022). Equivariant Diffusion for Molecule Generation in 3D. https://doi.org/10.48550/arXiv.2203.17003
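The noising half of this process can be sketched in a few lines. The example below mixes atom coordinates with Gaussian noise so that the signal fades as the step index t grows. The cosine-shaped schedule, the function names, and the coordinates are assumptions made for illustration, not necessarily what the EDM paper or this project uses. The learned denoising network, which is the hard part, is not shown.

```python
import math
import random


def schedule(t, num_steps):
    """Signal and noise scales (alpha, sigma) with alpha^2 + sigma^2 = 1."""
    frac = t / num_steps
    return math.cos(0.5 * math.pi * frac), math.sin(0.5 * math.pi * frac)


def center(coords):
    """Shift coordinates so their mean (centre of geometry) is the origin."""
    n = len(coords)
    mean = [sum(p[k] for p in coords) / n for k in range(3)]
    return [[p[k] - mean[k] for k in range(3)] for p in coords]


def add_noise(coords, t, num_steps, rng):
    """Forward process: blend clean coordinates with centred Gaussian noise."""
    alpha, sigma = schedule(t, num_steps)
    noise = center([[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in coords])
    return [[alpha * p[k] + sigma * e[k] for k in range(3)]
            for p, e in zip(coords, noise)]


rng = random.Random(0)
molecule = center([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.2, 1.3, 0.0]])
slightly_noisy = add_noise(molecule, t=100, num_steps=1000, rng=rng)
pure_noise = add_noise(molecule, t=1000, num_steps=1000, rng=rng)
```

Keeping both the molecule and the noise centred at the origin means a translated copy of the input would be treated identically. That is one ingredient of the symmetry handling that equivariant models rely on.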

In partnership with Nebius, Quantori has taken on a project to develop an EDM framework that generates molecules similar in shape to a given reference. By using shape descriptors as context during both training and generation, the equivariant graph diffusion network learns to generate molecules that fit the same 3D space as the target structures. The atom coordinates generated by the EDM model are then passed to a pre-trained Structure Seer model, which predicts where bonds are located and what type they are.

This approach allows researchers to rapidly ideate on potential candidate molecules — whether for binding to a specific site or to fit into a certain space.
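The last step of the pipeline described above turns a set of atom positions into bonds. Structure Seer is a trained model and is not reproduced here. As a stand-in that shows the shape of the problem, the sketch below uses a simple distance rule: two atoms are treated as bonded when they are closer than the sum of their covalent radii plus a tolerance. The radii are approximate literature values and the 0.4 Å tolerance is an assumption; the rule yields connectivity only, not bond types.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (illustrative subset).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def guess_bonds(elements, coords, tolerance=0.4):
    """Distance heuristic (NOT Structure Seer): list index pairs that look bonded."""
    bonds = []
    for i, j in combinations(range(len(elements)), 2):
        cutoff = COVALENT_RADIUS[elements[i]] + COVALENT_RADIUS[elements[j]] + tolerance
        if math.dist(coords[i], coords[j]) <= cutoff:
            bonds.append((i, j))
    return bonds


# Water with approximate geometry: O at the origin, two H atoms about 0.96 angstroms away.
elements = ["O", "H", "H"]
coords = [(0.0, 0.0, 0.0), (0.957, 0.0, 0.0), (-0.240, 0.927, 0.0)]
water_bonds = guess_bonds(elements, coords)
```

A heuristic like this breaks down for strained or noisy geometries and cannot distinguish single from double bonds, which is one motivation for using a learned bond predictor instead.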

Our EDM model was trained on a set of small molecules from the ChEMBL database, using a cluster on the Nebius cloud equipped with 8 NVIDIA H200 GPUs. The quick provisioning and easy configuration of these resources enabled us to set up training and evaluation with minimal effort, while the GPU capacity allowed training on a dataset containing 1.6 million compounds, ultimately leading to better model quality. During training, we generated valid random conformers for each molecule, supplied them to the model, and guided it toward geometrically similar outputs using shape descriptors.

Imagine a molecule being formed from a cloud of seemingly random points. Step by step, these points move into place, bonding with each other to produce a real, chemically sound 3D structure. The final result: a new molecule ready to be tested for a given property or shape constraint.

[Figure: sampling chain for one generated molecule]

This visually striking process showcases how diffusion models and graph-based representations bring molecule generation to life — literally building chemistry from noise into a meaningful design.

After 1,500 epochs (about 13.5 days) of training, the results proved promising, showing that the selected shape descriptors captured the information needed to drive the generation of valid 3D conformers with a shape similar to a reference. Generating 50 samples on the GPU without parallelisation or additional optimisation takes around 90-120 seconds, which works out to roughly 1.8-2.4 seconds per molecule (about 2.1 seconds on average).

Below, you can see some examples of generated molecules alongside a reference structure, together with a metric that quantifies their similarity. The metric chosen to estimate shape similarity is the shape Tanimoto score, which ranges from 0 to 1 and measures the volume overlap of the reference and candidate structures. If the molecules are very similar in shape, the score approaches 1; if they are dissimilar, it is close to 0. The Tanimoto similarity was calculated as described here. A preliminary evaluation of a set of 10,000 generated molecules showed that more than 99.8% are unique with respect to the training dataset, with an average shape similarity to the reference of 0.54.

[Figure: generated samples alongside a reference structure, with shape Tanimoto scores]
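For intuition about the shape Tanimoto score, the sketch below estimates it as overlap volume divided by union volume, T = V(A∩B) / (V(A) + V(B) − V(A∩B)), on a regular grid. This is a simplified hard-sphere approximation, not the procedure used in this project. The uniform 1.7 Å atom radius, the 0.5 Å grid spacing, and the function names are all assumptions, and the two structures are assumed to be already aligned.

```python
import itertools


def occupied_cells(atoms, radius=1.7, spacing=0.5):
    """Grid cells whose centre lies inside at least one atom sphere."""
    cells = set()
    reach = int(radius / spacing) + 1  # how many cells to scan per axis
    r2 = radius * radius
    for x, y, z in atoms:
        ci, cj, ck = round(x / spacing), round(y / spacing), round(z / spacing)
        for di, dj, dk in itertools.product(range(-reach, reach + 1), repeat=3):
            i, j, k = ci + di, cj + dj, ck + dk
            dx, dy, dz = i * spacing - x, j * spacing - y, k * spacing - z
            if dx * dx + dy * dy + dz * dz <= r2:
                cells.add((i, j, k))
    return cells


def shape_tanimoto(atoms_a, atoms_b, radius=1.7, spacing=0.5):
    """Overlap volume divided by union volume, both counted in grid cells."""
    a = occupied_cells(atoms_a, radius, spacing)
    b = occupied_cells(atoms_b, radius, spacing)
    overlap = len(a & b)
    union = len(a) + len(b) - overlap
    return overlap / union if union else 0.0


reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
shifted = [(0.0, 1.0, 0.0), (1.5, 1.0, 0.0)]      # same shape, moved 1 angstrom
far_away = [(50.0, 0.0, 0.0), (51.5, 0.0, 0.0)]   # no overlap at all

score_same = shape_tanimoto(reference, reference)
score_shifted = shape_tanimoto(reference, shifted)
score_far = shape_tanimoto(reference, far_away)
```

Identical, perfectly overlaid shapes score 1, shapes with no common volume score 0, and partial overlap falls in between, which is the behaviour described for the metric above.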

Conclusion

While SMILES revolutionized molecular representation by offering a simple, linear encoding well-suited for certain cheminformatics tasks, it falls short for many cutting-edge applications — especially those that demand a precise grasp of stereochemistry and 3D structure. Graph-based formats offer a richer, more flexible approach, opening the door to advanced generative modeling techniques like diffusion models.

From molecular property prediction to in silico drug design, the future of AI-driven chemistry increasingly relies on these more sophisticated, geometry-aware representations. By capturing the full complexity of chemical reality, graph-based models are poised to unlock new horizons in molecule discovery and design — offering not just strings of atoms, but powerful, data-driven insights into the structure and function of the molecules that shape our world.

Quantori blog