Representing Molecules with Algebraic Data Types: Beyond SMILES and SELFIES

2026-01-14 20:00 GMT

arXiv:2501.13633v4 Announce Type: replace-cross
Abstract: Algebraic data types (ADTs) let a representation specify, at the type level, which molecular values are valid and which transformations are meaningful. We propose a molecular representation as a family of typed ADTs that separates (i) constitution (Dietz-style bonding systems), (ii) 3D coordinates and stereochemistry, and (iii) electronic-structure annotations. This separation makes invariants explicit, supports deterministic local edits, and provides hooks for symmetry-aware and Bayesian modeling. It also lets us reason about how the representation constrains the operations that may be performed on molecules: types make invalid manipulations unrepresentable and make it easier to define meaningful priors and likelihoods for generative models (programs with sample and score operations). Unlike string-based formats, the ADT exposes chemical structure directly; validity conditions (e.g., valence and symmetry constraints) can be enforced by construction and checked deterministically during transformations. Electronic-structure annotations (shell, subshell, and orbital metadata) can optionally be attached to atoms when such information is available; we do not attempt to compute orbitals in this work. We sketch Bayesian probabilistic programming via an integration with LazyPPL, a lazy probabilistic programming library. Molecules can be made instances of a group under rotation, which supports geometric learning settings where molecular properties are invariant under rigid motions and relabellings. We demonstrate the framework’s flexibility through an extension that represents chemical reactions. We provide a Haskell library implementing the representation, released under an OSI-approved open-source license and archived with a DOI.
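To make the three-layer separation concrete, here is a minimal Haskell sketch of how such a representation might look. It is not the paper's library: every type, field, and function name below (`Molecule`, `BondingSystem`, `mkMolecule`, `translate`, and so on) is a hypothetical stand-in. The valence rule is a deliberately crude assumption in which each bonding system's electrons are split evenly among its member atoms. The sketch shows a smart constructor that rejects ill-formed molecules and a geometric edit that can only touch the coordinate layer.

```haskell
-- Hypothetical sketch; none of these names come from the paper's library.

data Element = H | C | N | O deriving (Show, Eq)

newtype AtomId = AtomId Int deriving (Show, Eq, Ord)

-- (ii) Geometry layer: Cartesian coordinates.
data Coord = Coord Double Double Double deriving (Show, Eq)

-- (iii) Optional electronic-structure metadata; attached, never computed.
data Shell = Shell { principal :: Int, electrons :: Int } deriving (Show, Eq)

data Atom = Atom
  { atomId   :: AtomId
  , element  :: Element
  , position :: Coord
  , shells   :: Maybe [Shell]
  } deriving (Show, Eq)

-- (i) Constitution layer: a Dietz-style bonding system is a set of atoms
-- sharing some number of electrons (2 atoms / 2 electrons = single bond).
data BondingSystem = BondingSystem
  { members         :: [AtomId]
  , sharedElectrons :: Int
  } deriving (Show, Eq)

data Molecule = Molecule
  { atoms   :: [Atom]
  , bonding :: [BondingSystem]
  } deriving (Show, Eq)

data MoleculeError
  = EmptyBondingSystem
  | UnknownAtom AtomId
  | ValenceExceeded AtomId
  deriving (Show, Eq)

-- Assumed, simplified valence table for illustration only.
maxValence :: Element -> Int
maxValence H = 1
maxValence C = 4
maxValence N = 3
maxValence O = 2

-- Smart constructor: in a real module the Molecule constructor would be
-- hidden, so this would be the only way to obtain a Molecule value.
mkMolecule :: [Atom] -> [BondingSystem] -> Either MoleculeError Molecule
mkMolecule atomList systems = do
  mapM_ checkSystem systems
  mapM_ checkValence atomList
  pure (Molecule atomList systems)
  where
    ids = map atomId atomList
    checkSystem b
      | null (members b) = Left EmptyBondingSystem
      | otherwise = case filter (`notElem` ids) (members b) of
          (missing : _) -> Left (UnknownAtom missing)
          []            -> Right ()
    -- Crude assumption: electrons are split evenly among member atoms.
    share :: BondingSystem -> Rational
    share b = fromIntegral (sharedElectrons b) / fromIntegral (length (members b))
    used a = sum [share b | b <- systems, atomId a `elem` members b]
    checkValence a
      | used a > fromIntegral (maxValence (element a)) = Left (ValenceExceeded (atomId a))
      | otherwise = Right ()

-- A local edit on the geometry layer only; constitution is untouched,
-- so a valid molecule stays valid without re-checking.
translate :: Coord -> Molecule -> Molecule
translate (Coord dx dy dz) m = m { atoms = map shift (atoms m) }
  where
    shift a =
      let Coord x y z = position a
      in a { position = Coord (x + dx) (y + dy) (z + dz) }

exampleAtoms :: [Atom]
exampleAtoms =
  [ Atom (AtomId 1) H (Coord 0.0 0 0) Nothing
  , Atom (AtomId 2) H (Coord 0.74 0 0) Nothing
  ]

main :: IO ()
main = do
  -- One 2-centre, 2-electron bonding system between two hydrogens: accepted.
  print (fmap (length . atoms) (mkMolecule exampleAtoms [BondingSystem [AtomId 1, AtomId 2] 2]))
  -- Four shared electrons would exceed hydrogen's valence: rejected.
  print (mkMolecule exampleAtoms [BondingSystem [AtomId 1, AtomId 2] 4])
```

Because `translate` has no access to anything that could alter `bonding`, the separation of layers does part of the validity checking by construction; only edits to the constitution layer would need to go back through `mkMolecule`.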