Chapter IV: Multiscale Equivariant Score-based Generative Modeling for
4.2 Method
We assume the model inputs are a receptor protein backbone template containing the amino acid sequence $\mathbf{s}$ and (N, Cα, C) atomic coordinates $\tilde{\mathbf{x}} \in \mathbb{R}^{n_{\mathrm{res}} \times 3 \times 3}$, and a set of ligand molecular graphs $\{\mathcal{G}_k\}_{k=1}^{K}$ containing atom/bond types and stereochemistry labels (e.g., tetrahedral or E/Z isomerism [208]). We aim to sample $(\mathbf{x}, \mathbf{y}) \sim q_\phi(\cdot \mid \mathbf{s}, \tilde{\mathbf{x}}, \{\mathcal{G}\})$ from a generative model $q_\phi$ with predicted 3D heavy-atom coordinates of the protein $\mathbf{x} \in \mathbb{R}^{n \times 3}$ and that of the ligands $\mathbf{y} \in \mathbb{R}^{m \times 3}$. It can be understood as a conditional generative modeling problem for partially-observed systems.
NeuralPLexer adopts a two-stage architecture for protein-ligand structure prediction (Figure 4.1a). The input protein backbone template and molecule graphs are first encoded and passed into a contact predictor that iteratively samples binding interface spatial proximity distributions for each ligand in $\{\mathcal{G}\}$; the output contact map parameterizes the geometry prior, a finite-time marginal of a designed SDE that progressively injects structured noise into the data distribution. An equivariant structure diffusion module (ESDM) then jointly generates 3D protein and ligand structures by denoising the atomic coordinates sampled from the geometry prior through a learned reverse-time SDE (Figure 4.1b).
Figure 4.1: NeuralPLexer enables protein-ligand complex structure prediction with full receptor flexibility. (a) Method overview. (b) Sampling from NeuralPLexer. The protein (colored as red-blue from N- to C-terminus) and ligand (colored as grey) 3D structures are jointly generated from a learned SDE, with a partially-diffused initial state $q_{T^*}$ approximated by the protein backbone template and predicted interface contact maps. (c-e) Key elements of the NeuralPLexer technical design. (c) Ligand molecules and monomeric entities are encoded as the collection of atoms, local coordinate frames (depicted as semi-transparent triangles), and stereospecific pairwise embeddings (depicted as dashed lines) representing their interactions. (d) The forward-time SDE introduces relative drift terms among protein Cα atoms, non-Cα atoms and ligand atoms, such that the SDE erases local-scale details at $t = T^*$ to enable resampling from a noise distribution. (e) Information flow in the equivariant structure diffusion module (ESDM). ESDM operates on a heterogeneous graph formed by protein atoms (P), ligand atoms (L), protein backbone frames (B) and ligand local frames (F) to predict clean atomic coordinates $\hat{\mathbf{x}}_0, \hat{\mathbf{y}}_0$ using the coordinates at a finite diffusion time $t > 0$.
Protein-ligand structure generation with biophysics-informed diffusion processes
Diffusion models [201] introduce a forward SDE that diffuses data into a noised distribution and a neural-network-parameterized reverse-time SDE that generates data by reverting the noising process. To motivate the design principles for our biomolecular structure generator, we first consider a general class of linear SDEs known as the multivariate Ornstein–Uhlenbeck (OU) process [209] for a point cloud $\mathbf{Z} \in \mathbb{R}^{N \times 3}$:
$$ d\mathbf{Z}_t = -\boldsymbol{\Theta} \mathbf{Z}_t \, dt + \sigma \, d\mathbf{W}_t \tag{4.1} $$
where $\boldsymbol{\Theta} \in \mathbb{R}^{N \times N}$ is an invertible matrix of affine drift coefficients and $\mathbf{W}_t$ is a standard $3N$-dimensional Wiener process. The forward noising SDEs used in standard diffusion models [210, 211] can be recovered by setting $\boldsymbol{\Theta} = \theta \mathbf{I}$, converging to an isotropic Gaussian prior distribution in the $t \to \infty$ limit (often expressed as $t \to 1$ with a reparameterized $t$ [212]). In contrast, we design a multivariate SDE with a data-dependent drift matrix $\boldsymbol{\Theta}(\mathbf{Z}_0)$ and truncate the SDE at $t = T^* < \infty$ such that the final state of the forward noising process is a partially-diffused, structured distribution $q_{T^*}$ that can be well approximated by a coarse-scale model. We propose a set of SDEs depicted in Figure 4.1d and detailed in Table 4.1, with separated lengthscale parameters $\sigma_1, \sigma_2$ such that the forward diffusion process erases residue-scale local details but retains global information about protein domain packing and ligand binding interfaces, yielding the following time-dependent transition kernels:
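$$ q_t\big(\mathbf{x}_{\mathrm{C}\alpha}(t) \mid \mathbf{x}(0), \mathbf{y}(0)\big) = \mathcal{N}\big(\mathbf{x}_{\mathrm{C}\alpha}(0);\ \sigma_1^2 \tilde{\tau}\, \mathbf{I}\big) \tag{4.2} $$
$$ q_t\big(\mathbf{x}_{\mathrm{nonC}\alpha}(t) - \mathbf{x}_{\mathrm{C}\alpha}(t) \mid \mathbf{x}(0), \mathbf{y}(0)\big) = \mathcal{N}\big(e^{-\tilde{\tau}} (\mathbf{x}_{\mathrm{nonC}\alpha}(0) - \mathbf{x}_{\mathrm{C}\alpha}(0));\ 2\sigma_1^2 (1 - e^{-2\tilde{\tau}})\, \mathbf{I}\big) \tag{4.3} $$
$$ q_t\big(\mathbf{y}(t) - \mathbf{c}^{T} \mathbf{x}_{\mathrm{C}\alpha}(t) \mid \mathbf{x}(0), \mathbf{y}(0)\big) = \mathcal{N}\big(e^{-\tilde{\tau}} (\mathbf{y}(0) - \mathbf{c}^{T} \mathbf{x}_{\mathrm{C}\alpha}(0));\ \sigma_1^2 (1 - e^{-2\tilde{\tau}})\, (\mathbf{I} + \mathbf{c}^{T} \mathbf{c})\big) \tag{4.4} $$
where we use an exponential schedule $\tilde{\tau} = (\sigma_{\min}^2 / \sigma_1^2)\, e^{t}$ with truncation $T^* = 2 \log(\sigma_2 / \sigma_{\min})$. The matrix $\mathbf{c}$ is a softmax-transformed contact map as detailed in Sec. 4.2, which attracts the diffused ligand coordinates $\mathbf{y}(t)$ towards binding interface Cα atoms while preserving SE(3)-equivariance. We choose $\sigma_1 = 2.0$ Å to match the average radius of standard amino acids, with a task-specific $\sigma_2 > \sigma_1$, such that at $t = T^*$: (a) the terms involving $\mathbf{x}_{\mathrm{nonC}\alpha}(0)$ and $\mathbf{y}(0)$ approximately vanish and are thus set to zero to initialize the reverse-time SDE, and (b) the Cα-atom coordinate marginal $q_{T^*}(\mathbf{x}_{\mathrm{C}\alpha}(t) \mid \mathbf{x}(0))$ is sufficiently close to the one approximated by the backbone template, $q_{T^*}(\mathbf{x}_{\mathrm{C}\alpha}(t) \mid \tilde{\mathbf{x}})$, guided by the theoretical result proposed in [213]. Proofs regarding SE(3)-equivariance are stated in Appendix 4.5.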
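The kernels (4.2)-(4.4) can be sampled in closed form, which the following minimal NumPy sketch illustrates. It is not the reference implementation: the function names, the `parent` index array mapping each non-Cα atom to its residue's Cα atom, and the choice to draw the relative-displacement noise independently of the Cα noise are all assumptions made for illustration. The numerical values used when calling it (for example $\sigma_2$ and $\sigma_{\min}$) are likewise hypothetical, since the text only fixes $\sigma_1 = 2.0$ Å.

```python
import numpy as np


def tau_schedule(t, sigma_1, sigma_min):
    """Exponential schedule: tau(t) = (sigma_min^2 / sigma_1^2) * exp(t)."""
    return (sigma_min ** 2 / sigma_1 ** 2) * np.exp(t)


def truncation_time(sigma_2, sigma_min):
    """Truncation time T* = 2 log(sigma_2 / sigma_min)."""
    return 2.0 * np.log(sigma_2 / sigma_min)


def sample_forward(x_ca0, x_non0, parent, y0, c, t, sigma_1, sigma_min, rng):
    """Draw (x_ca(t), x_nonca(t), y(t)) following the kernels (4.2)-(4.4).

    x_ca0:  (n_res, 3) clean C-alpha coordinates
    x_non0: (n_non, 3) clean non-C-alpha protein coordinates
    parent: (n_non,)   hypothetical index of the C-alpha atom for each non-C-alpha atom
    y0:     (m, 3)     clean ligand coordinates
    c:      (n_res, m) contact weights, each column summing to one
    """
    tau = tau_schedule(t, sigma_1, sigma_min)
    decay = np.exp(-tau)
    var = sigma_1 ** 2 * (1.0 - np.exp(-2.0 * tau))

    # (4.2): C-alpha atoms keep their mean and gain variance sigma_1^2 * tau.
    x_ca = x_ca0 + np.sqrt(sigma_1 ** 2 * tau) * rng.standard_normal(x_ca0.shape)

    # (4.3): displacement from the parent C-alpha decays; variance is 2 * var.
    # Assumption: this noise is drawn independently of the C-alpha noise.
    rel_x = decay * (x_non0 - x_ca0[parent])
    x_non = x_ca[parent] + rel_x + np.sqrt(2.0 * var) * rng.standard_normal(x_non0.shape)

    # (4.4): displacement from the contact-weighted C-alpha position decays,
    # with covariance var * (I + c^T c) across ligand atoms.
    rel_y = decay * (y0 - c.T @ x_ca0)
    chol = np.linalg.cholesky(var * (np.eye(y0.shape[0]) + c.T @ c))
    y = c.T @ x_ca + rel_y + chol @ rng.standard_normal(y0.shape)
    return x_ca, x_non, y
```

With this schedule, $\sigma_1^2 \tilde{\tau}(T^*) = \sigma_{\min}^2 e^{T^*} = \sigma_2^2$, so the Cα kernel at the truncation time has standard deviation $\sigma_2$, while the factor $e^{-\tilde{\tau}(T^*)} = e^{-\sigma_2^2/\sigma_1^2}$ multiplying the clean non-Cα and ligand displacements becomes small whenever $\sigma_2$ is well above $\sigma_1$.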
Contact map prediction and sampling from the truncated reverse-time SDE
Given protein-ligand coordinates $(\mathbf{x}, \mathbf{y})$, we define the contact map $\mathbf{L} \in \mathbb{R}^{n_{\mathrm{res}} \times m}$ with matrix elements
$$ L_{Ai} = \log\left( \frac{\sum_{j \in \{A\}} e^{-2\alpha \lVert \mathbf{x}_j - \mathbf{y}_i \rVert_2}}{\sum_{j \in \{A\}} e^{-\alpha \lVert \mathbf{x}_j - \mathbf{y}_i \rVert_2}} \right) $$
where $j$ runs over all protein atoms in amino acid residue $A$ and $\alpha = 0.2$ Å$^{-1}$. The term $\mathbf{c}$ in (4.4) is then defined as $c_{Ai}(\mathbf{L}) = \exp(L_{Ai}) / \sum_{A} \exp(L_{Ai})$. To sample from the reverse-time SDE, we use the contact predictor to generate inferred contact maps $\hat{\mathbf{L}}$ and parameterize the geometry prior $q_{T^*}(\cdot \mid \tilde{\mathbf{x}}, \hat{\mathbf{L}})$ (the initial condition of the reverse-time SDE) by replacing $\mathbf{x}(0)$ in $q_{T^*}$ with the backbone template $\tilde{\mathbf{x}}$ and the ligand-Cα relative drift coefficient $\mathbf{c}$ with the predicted $\mathbf{c}(\hat{\mathbf{L}})$. Note that in the general multivariate OU formulation, this corresponds to replacing the clean-data-dependent drift coefficients $\boldsymbol{\Theta}(\mathbf{Z}_0)$ by a model estimate $\hat{\boldsymbol{\Theta}}$. To account for the multimodal nature of protein-ligand contact distributions, the contact predictor models $\mathbf{L}$ as the logits of a categorical posterior distribution over a sequence of one-hot observations $\{\mathbf{l}_k\}_{k=1}^{K}$ sampled for individual molecules in $\{\mathcal{G}\}$. The forward pass of the contact predictor $\psi$ takes an iterative form:
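$$ \hat{\mathbf{L}}_k = \psi\Big( \sum_{r=1}^{k} \mathbf{l}_r;\ \mathbf{s}, \tilde{\mathbf{x}}, \{\mathcal{G}\} \Big); \quad \mathbf{l}_k = \mathrm{OneHot}(A_k, i_k); \quad (A_k, i_k) \sim \mathrm{Categorical}_{n_{\mathrm{res}} \times m}(\hat{\mathbf{L}}_{k-1}), \quad i_k \in \mathcal{G}_k \tag{4.5} $$
where $k \in \{1, \cdots, K\}$ and we set $\hat{\mathbf{L}} := \hat{\mathbf{L}}_K$. All results reported in this study are obtained with $K = 1$ due to the curation scheme of standard annotated protein-ligand datasets, but we note that the model can be readily trained on more diverse structural databases with multi-ligand samples.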
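The contact-map definition, its softmax transform, and the iterative scheme (4.5) can be illustrated with a short NumPy sketch. Several pieces are assumptions rather than details given in the text: `psi` is a hypothetical stand-in callable for the trained contact predictor (the real one also consumes $\mathbf{s}$, $\tilde{\mathbf{x}}$ and $\{\mathcal{G}\}$), $\hat{\mathbf{L}}_0$ is taken to be the predictor output with no observations, and the categorical draw for ligand $k$ is restricted to the atoms of $\mathcal{G}_k$. The sketch also assumes every residue has at least one atom and that distances are small enough for the exponentials not to underflow.

```python
import numpy as np


def contact_logits(x, residue, n_res, y, alpha=0.2):
    """L[A, i] = log(sum_j exp(-2*alpha*d_ji) / sum_j exp(-alpha*d_ji)), j in residue A.

    x: (n, 3) protein atoms; residue: (n,) residue index of each protein atom;
    y: (m, 3) ligand atoms; alpha is in inverse Angstrom.
    """
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # (n, m) distances
    num = np.zeros((n_res, y.shape[0]))
    den = np.zeros((n_res, y.shape[0]))
    np.add.at(num, residue, np.exp(-2.0 * alpha * d))  # accumulate atoms per residue
    np.add.at(den, residue, np.exp(-alpha * d))
    return np.log(num) - np.log(den)


def contact_weights(L):
    """c[A, i] = exp(L[A, i]) / sum_A exp(L[A, i]), a softmax over residues."""
    e = np.exp(L - L.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)


def sample_contacts(psi, ligands, n_res, m, rng):
    """Iterate (4.5): sample one (residue, atom) contact per ligand, then re-predict.

    psi:     hypothetical callable mapping the (n_res, m) sum of one-hot
             observations to (n_res, m) contact logits
    ligands: list of K integer arrays, the ligand-atom indices of each molecule
    """
    observed = np.zeros((n_res, m))
    L_hat = psi(observed)  # assumption: L_hat_0 is predicted from no observations
    for atoms in ligands:
        logits = L_hat[:, atoms]
        p = np.exp(logits - logits.max())
        p = (p / p.sum()).ravel()
        A, j = np.unravel_index(rng.choice(p.size, p=p), logits.shape)
        observed[A, atoms[j]] = 1.0  # accumulate the one-hot observation l_k
        L_hat = psi(observed)        # L_hat_k from the sum of l_1..l_k
    return L_hat, observed
```

For a residue with a single atom at distance $d$ from ligand atom $i$, the definition reduces to $L_{Ai} = -\alpha d$, so nearby residues receive larger logits and hence larger weights $c_{Ai}$.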
Architecture overview
Here we outline the key neural network design ideas and defer the featurization, architecture, and training details to the Appendix. To enable stereospecific molecular geometry generation and explicit reasoning about long-range geometrical correlations, NeuralPLexer hybridizes two types of elementary molecular representations (Figure 4.1c): (a) atomic nodes and (b) rigid-body nodes representing coordinate frames formed by two adjacent chemical bonds. For small-molecule ligand encoding, we introduce a graph transformer with learnable chirality-aware pairwise embeddings that are constructed through graph-diffusion-kernel-like transformations [214]; such pairwise embeddings are pretrained to align with the intra-molecular 3D coordinate distributions from experimental and computed molecular conformers. The protein backbone template encoding module and the contact predictor are built upon a sparsified version of invariant point attention (IPA) adapted from AlphaFold2 [5] and are combined with standard graph attention layers [187, 215] and edge update blocks.
The architecture of ESDM (Figure 4.1e) is inspired by prior works on 3D graph and attentional neural networks for point clouds [216, 217], rigid-body simulations [218] and biopolymer representation learning [5, 219–221]. In ESDM, each node is associated with a stack of standard scalar features $\mathbf{f}_s \in \mathbb{R}^{c}$ and Cartesian vector features $\mathbf{f}_v \in \mathbb{R}^{3 \times c}$ representing the displacements of a virtual point set relative to the node's Euclidean coordinate $\mathbf{t} \in \mathbb{R}^{3}$. A rotation matrix $\mathbf{R} \in \mathrm{SO}(3)$ is additionally attached to each rigid-body node. Geometry-aware messages are synchronously propagated among all nodes by encoding the pairwise distances among virtual point sets into graph transformer blocks. Explicit non-linear transformation of the vector features $\mathbf{f}_v$ is performed solely on rigid-body nodes through a coordinate-frame-inversion mechanism, such that the node update blocks are sufficiently expressive without sacrificing equivariance or computational efficiency. In contrast, 3D coordinates are updated solely for atomic nodes, while the rigid-body frames $(\mathbf{t}, \mathbf{R})$ are passively reconstructed from the updated atomic coordinates, circumventing numerical issues in fitting quaternion or axis-angle variables when manipulating rigid-body objects. The nontrivial action of a parity inversion operation on rigid-body nodes ensures that ESDM can capture the correct chiral-symmetry-breaking behavior that adheres to the molecular stereochemistry constraints.
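To make the passive frame reconstruction concrete, the sketch below builds a frame $(\mathbf{t}, \mathbf{R})$ from three atoms joined by two adjacent bonds using Gram-Schmidt orthogonalization. This particular construction (in the spirit of the backbone frames in AlphaFold2) and the function name `frame_from_atoms` are assumptions for illustration; the text does not specify how ESDM rebuilds its frames. The sketch requires the three atoms not to be collinear.

```python
import numpy as np


def frame_from_atoms(a, b, c):
    """Return (t, R): origin t = b and a right-handed orthonormal basis R in SO(3).

    a and c are the two neighbors bonded to the central atom b, each of shape (3,).
    """
    e1 = (a - b) / np.linalg.norm(a - b)
    u = (c - b) - np.dot(c - b, e1) * e1       # remove the component along e1
    e2 = u / np.linalg.norm(u)
    e3 = np.cross(e1, e2)                      # completes a right-handed basis
    return b, np.stack([e1, e2, e3], axis=-1)  # basis vectors stored as columns
```

Under a rotation $\mathbf{Q}$ and translation $\mathbf{s}$ of the atoms, this frame transforms as $(\mathbf{Q}\mathbf{t} + \mathbf{s}, \mathbf{Q}\mathbf{R})$. Under a reflection $\mathbf{P}$ with $\det \mathbf{P} = -1$, the cross product picks up a sign, so the frame becomes $\mathbf{P}\mathbf{R}\,\mathrm{diag}(1, 1, -1)$ and stays in SO(3). This is one simple way in which a parity inversion acts nontrivially on rigid-body frames, which lets features expressed in the frame distinguish mirror images.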