• Tidak ada hasil yang ditemukan


4.5 Appendix

The forward-time and reverse-time SDEs

The forward-time SDEs in NeuralPLexer are summarized in Table 4.1. For generality, we introduce an effective time stamp Ο„ such that the drift and diffusion coefficients are constant, ΞΈ(Ο„) = ΞΈ and Οƒ(Ο„) = Οƒ. The symbolic conventions are as follows:

β€’ x_CΞ± ∈ R^{n_res Γ— 3} denotes the collection of alpha-carbon coordinates in the protein, following the standard nomenclature for amino acid atom types;

β€’ x_nonCΞ± ∈ R^{(n βˆ’ n_res) Γ— 3} denotes the set of coordinates for all non-alpha-carbon protein atoms (backbone N, C, O, and all side-chain heavy atoms);

β€’ y ∈ R^{m Γ— 3} denotes all ligand heavy-atom coordinates. Note that m := Ξ£_{k=1}^{K} m_k, with m_k being the number of heavy atoms in each ligand molecule G_k.

Transition kernel densities and sampling

Following the general result for Ornstein–Uhlenbeck processes [231],

$$q_{0:t}(\mathbf{x}_t) = \mathcal{N}\!\left(\exp(-\boldsymbol{\Theta}t)\,\mathbf{x}_0;\ \int_0^t e^{\boldsymbol{\Theta}(s-t)}\,\boldsymbol{\sigma}\boldsymbol{\sigma}^{\mathrm{T}}\,e^{\boldsymbol{\Theta}^{\mathrm{T}}(s-t)}\,ds\right) \tag{4.6}$$

given the effective time-homogeneous diffusion process described in Table 4.1. For the internal coordinates x_nonCΞ± βˆ’ x_CΞ±,

$$d(\mathbf{x}_{\mathrm{nonC}\alpha} - \mathbf{x}_{\mathrm{C}\alpha}) = -\theta\,(\mathbf{x}_{\mathrm{nonC}\alpha} - \mathbf{x}_{\mathrm{C}\alpha})\,d\tau + \sigma\,d\mathbf{w}_2 - \sigma\,d\mathbf{w}_1. \tag{4.7}$$

Since the Brownian motions w_1, w_2 are independent, we obtain the transition kernel over a finite time interval s:

π‘ž(xnonC𝛼(𝜏+𝑠) βˆ’xC𝛼(𝜏+𝑠) |xnonC𝛼(𝑑) βˆ’xC𝛼(𝜏)) (4.8)

=N π‘’βˆ’πœƒ 𝑠(xnonC𝛼(𝜏) βˆ’xC𝛼(𝜏));(1βˆ’π‘’βˆ’2πœƒ 𝑠)𝜎2 πœƒ2

I

Similarly, for the ligand degrees of freedom,

$$d(\mathbf{y} - \mathbf{c}^{\mathrm{T}}\mathbf{x}_{\mathrm{C}\alpha}) = -\theta\,(\mathbf{y} - \mathbf{c}^{\mathrm{T}}\mathbf{x}_{\mathrm{C}\alpha})\,d\tau + \sigma\,d\mathbf{w}_3 - \sigma\,\mathbf{c}^{\mathrm{T}}d\mathbf{w}_1, \tag{4.9}$$

the transition kernel is

π‘ž(y(𝜏+𝑠) βˆ’cTxC𝛼(𝜏+𝑠) |y(𝜏) βˆ’cTxC𝛼(𝜏)) (4.10)

=N π‘’βˆ’πœƒ 𝑠(y(𝜏) βˆ’cTxC𝛼(𝜏));(1βˆ’π‘’βˆ’2πœƒ 𝑠) 𝜎2 2πœƒ2

(I+cTc) The transition kernel for alpha-carbon atoms is a standard Gaussian

π‘ž(xC𝛼(𝜏+𝑠) |xC𝛼(𝜏)) =N xC𝛼(𝜏);𝜎2𝑠I

. (4.11)

Defining σ₁² = σ²/θ², Οƒβ‚‚Β² = σ²·τ(T*), and τ̃ = 2ΞΈΟ„, we recover Eqs. (2)–(4). For model training in practice, we use an exponential noise schedule defined by Ο„ = Ο„β‚€e^t and Ο„β‚€ = σ²_min/σ², with Οƒ_min being a minimum perturbation scale as commonly adopted in variance-exploding (VE) SDEs [201]. For completeness, the SDEs defined on the transformed time horizon t ∈ [0, T*] are obtained by replacing the drift coefficient ΞΈ and the diffusion coefficient Οƒ with the following time-dependent counterparts:

πœƒ(𝑑) =πœƒΒ· 𝑑 𝜏 𝑑 𝑑

= 𝜎2

min

2𝜎2

1

𝑒𝑑 (4.12)

and

$$\sigma(t) = \sqrt{\sigma^2\cdot\frac{d\tau}{dt}} = \sigma_{\min}\,e^{\frac{1}{2}t}. \tag{4.13}$$

To sample from the marginal distribution q_t := p_data βˆ— q_{0:t} derived from the forward SDEs:

$$\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3 \sim \mathcal{N}(0;\,\mathbf{I}) \tag{4.14a}$$
$$(\mathbf{x}, \mathbf{y}) \sim p_{\mathrm{data}} \tag{4.14b}$$
$$\mathbf{x}_{\mathrm{C}\alpha}(t) = \mathbf{x}_{\mathrm{C}\alpha} + \sigma\sqrt{\tau(t)}\,\mathbf{z}_1 \tag{4.14c}$$
$$\mathbf{x}_{\mathrm{nonC}\alpha}(t) = \mathbf{x}_{\mathrm{C}\alpha}(t) + \sqrt{\alpha(t)}\,(\mathbf{x}_{\mathrm{nonC}\alpha} - \mathbf{x}_{\mathrm{C}\alpha}) + \sqrt{1-\alpha(t)}\,\sigma_1(\mathbf{z}_2 - \mathbf{z}_1) \tag{4.14d}$$
$$\mathbf{y}(t) = \mathbf{c}^{\mathrm{T}}\mathbf{x}_{\mathrm{C}\alpha}(t) + \sqrt{\alpha(t)}\,(\mathbf{y} - \mathbf{c}^{\mathrm{T}}\mathbf{x}_{\mathrm{C}\alpha}) + \sqrt{1-\alpha(t)}\,\sigma_1(\mathbf{z}_3 - \mathbf{c}^{\mathrm{T}}\mathbf{z}_1) \tag{4.14e}$$

where𝛼(𝑑) =π‘’βˆ’2πœƒ 𝜏(𝑑). For the reverse-time SDE

𝑑Z𝑑 =[βˆ’Ξ˜(𝑑)Z𝑑 βˆ’πœŽ2(𝑑)βˆ‡Z𝑑logπ‘žπ‘‘(Z𝑑)]𝑑 𝑑+𝜎(𝑑)𝑑W𝑑 (4.15) the ESDMπœ™predicts the denoised observations Λ†x(0),y(0)Λ† using Λ†x(𝑑),y(Λ† 𝑑) which is formally equivalent to estimating the score functionβˆ‡Zlogπ‘žπ‘‘(Z)[232]. Given a time discretization schedule with interval𝑠, we obtain the expression for the predicted observation mean Β―Z(πœ™, π‘‘βˆ’π‘ ) in one denoising stepZ(𝑑) ↦→ Z(π‘‘βˆ’π‘ ):

$$\bar{\mathbf{x}}_{\mathrm{C}\alpha}(\phi, t-s) = -\big(\mathbf{x}_{\mathrm{C}\alpha}(t) - \hat{\mathbf{x}}_{\mathrm{C}\alpha}(0)\big)\,\frac{\sigma(t-s)}{\sigma(t)} + \mathbf{x}_{\mathrm{C}\alpha}(t) \tag{4.16a}$$

$$\bar{\mathbf{x}}_{\mathrm{nonC}\alpha}(\phi, t-s) = -\frac{\big(\mathbf{x}_{\mathrm{nonC}\alpha}(t) - \mathbf{x}_{\mathrm{C}\alpha}(t)\big)\cdot\sqrt{\alpha(t)} - \big(\hat{\mathbf{x}}_{\mathrm{nonC}\alpha}(0) - \hat{\mathbf{x}}_{\mathrm{C}\alpha}(0)\big)}{\sqrt{1-\alpha(t)}}\,\sqrt{1-\alpha(t-s)} + \bar{\mathbf{x}}_{\mathrm{C}\alpha}(t-s) + \sqrt{\alpha(t-s)}\,\big(\hat{\mathbf{x}}_{\mathrm{nonC}\alpha}(0) - \hat{\mathbf{x}}_{\mathrm{C}\alpha}(0)\big) \tag{4.16b}$$

$$\bar{\mathbf{y}}(\phi, t-s) = -\frac{\big(\mathbf{y}(t) - \mathbf{c}^{\mathrm{T}}\mathbf{x}_{\mathrm{C}\alpha}(t)\big)\cdot\sqrt{\alpha(t)} - \big(\hat{\mathbf{y}}(0) - \mathbf{c}^{\mathrm{T}}\hat{\mathbf{x}}_{\mathrm{C}\alpha}(0)\big)}{\sqrt{1-\alpha(t)}}\,\sqrt{1-\alpha(t-s)} + \mathbf{c}^{\mathrm{T}}\bar{\mathbf{x}}_{\mathrm{C}\alpha}(t-s) + \sqrt{\alpha(t-s)}\,\big(\hat{\mathbf{y}}(0) - \mathbf{c}^{\mathrm{T}}\hat{\mathbf{x}}_{\mathrm{C}\alpha}(0)\big) \tag{4.16c}$$

Standard ODE-based or SDE-based integrators can then be adapted to sample from (4.15).
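To make the forward perturbation concrete, below is a minimal NumPy sketch of the sampling procedure in Eqs. (4.14a)–(4.14e). The schedule constants, the per-atom residue index map `res_index`, and the internal-coordinate scale relation used for σ₁ are illustrative assumptions of this sketch, not the exact training configuration.

```python
import numpy as np

def sample_forward_marginal(x_ca, x_nonca, res_index, y, c, t,
                            theta=1.0, sigma=1.0, sigma_min=0.01, rng=None):
    """Sketch of Eqs. (4.14a)-(4.14e): perturb clean coordinates to diffusion time t.

    x_ca:      (n_res, 3) alpha-carbon coordinates
    x_nonca:   (n_nonca, 3) non-alpha-carbon protein atom coordinates
    res_index: (n_nonca,) index of the parent residue of each non-C-alpha atom (assumed mapping)
    y:         (m, 3) ligand heavy-atom coordinates
    c:         (n_res, m) softmax-normalized contact map, assumed such that c.T @ 1 = 1
    """
    rng = np.random.default_rng() if rng is None else rng
    tau = (sigma_min**2 / sigma**2) * np.exp(t)      # exponential schedule tau(t) = tau_0 * e^t
    alpha = np.exp(-2.0 * theta * tau)               # alpha(t) = exp(-2 * theta * tau(t))
    sigma1 = sigma / theta                           # assumed internal-coordinate scale sigma_1

    z1 = rng.standard_normal(x_ca.shape)             # (4.14a)
    z2 = rng.standard_normal(x_nonca.shape)
    z3 = rng.standard_normal(y.shape)

    x_ca_t = x_ca + sigma * np.sqrt(tau) * z1        # (4.14c): variance-exploding C-alpha coordinates
    x_nonca_t = (x_ca_t[res_index]                   # (4.14d): OU process on internal coordinates
                 + np.sqrt(alpha) * (x_nonca - x_ca[res_index])
                 + np.sqrt(1.0 - alpha) * sigma1 * (z2 - z1[res_index]))
    y_t = (c.T @ x_ca_t                              # (4.14e): ligand atoms tethered to contact-weighted C-alphas
           + np.sqrt(alpha) * (y - c.T @ x_ca)
           + np.sqrt(1.0 - alpha) * sigma1 * (z3 - c.T @ z1))
    return x_ca_t, x_nonca_t, y_t
```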

Euclidean equivariance

Given a group G, a function f : X β†’ Y is said to be equivariant if for all g ∈ G and x ∈ X, f(Ο†_X(g)Β·x) = Ο†_Y(g)Β·f(x). In particular, f is said to be invariant if f(Ο†_X(g)Β·x) = f(x). We are interested in the special Euclidean group G = SE(3), which consists of all global rigid translation and rotation operations gΒ·Z := t + ZΒ·R, where t ∈ R^3 and R ∈ SO(3). To adhere to the physical constraint that p_data is always SE(3)-invariant, the transition kernels of the forward-time SDE should satisfy SE(3)-equivariance, q(Z_{t+s} | Z_t) = q(gΒ·Z_{t+s} | gΒ·Z_t), so that the marginals are invariant, q_t(Z_t) = q_t(gΒ·Z_t), for any time t. The proofs are straightforward:

For the receptor CΞ± degrees of freedom,

$$\begin{aligned}
& q\big(\mathbf{t} + \mathbf{x}_{\mathrm{C}\alpha}(\tau+s)\cdot\mathbf{R}\,\big|\,\mathbf{t} + \mathbf{x}_{\mathrm{C}\alpha}(\tau)\cdot\mathbf{R}\big) \\
&\quad = \mathcal{N}\big(\mathbf{t} + \mathbf{x}_{\mathrm{C}\alpha}(\tau+s)\cdot\mathbf{R};\ \mathbf{t} + \mathbf{x}_{\mathrm{C}\alpha}(\tau)\cdot\mathbf{R},\ \sigma^2 s\,\mathbf{I}\big) \\
&\quad = \mathcal{N}\big((\mathbf{x}_{\mathrm{C}\alpha}(\tau+s) - \mathbf{x}_{\mathrm{C}\alpha}(\tau))\cdot\mathbf{R}\mathbf{R}^{\mathrm{T}};\ \mathbf{0},\ \sigma^2 s\,\mathbf{R}\cdot\mathbf{I}\cdot\mathbf{R}^{\mathrm{T}}\big) \\
&\quad = \mathcal{N}\big(\mathbf{x}_{\mathrm{C}\alpha}(\tau+s) - \mathbf{x}_{\mathrm{C}\alpha}(\tau);\ \mathbf{0},\ \sigma^2 s\,\mathbf{I}\big) \\
&\quad = q\big(\mathbf{x}_{\mathrm{C}\alpha}(\tau+s)\,\big|\,\mathbf{x}_{\mathrm{C}\alpha}(\tau)\big).
\end{aligned}$$

For the receptor non-CΞ± degrees of freedom,

π‘ž( (t+xnonC𝛼(𝜏+𝑠) Β·Rβˆ’tβˆ’xC𝛼(𝜏+𝑠) Β·R) | (t+xnonC𝛼(𝜏) Β·Rβˆ’tβˆ’xC𝛼(𝜏) Β·R))

=N (xnonC𝛼(𝜏+𝑠) Β·Rβˆ’xC𝛼(𝜏+𝑠) Β·R);π‘’βˆ’πœƒ 𝑠(xnonC𝛼(𝜏) Β·Rβˆ’xC𝛼(𝜏) Β·R),(1βˆ’π‘’βˆ’2πœƒ 𝑠)𝜎2 πœƒ2

I

=N (xnonC𝛼(𝜏+𝑠) βˆ’xC𝛼(𝜏+𝑠));π‘’βˆ’πœƒ 𝑠(xnonC𝛼(𝜏) βˆ’xC𝛼(𝜏)),(1βˆ’π‘’βˆ’2πœƒ 𝑠)𝜎2 πœƒ2

RΒ·IΒ·RT

=π‘ž( (xnonC𝛼(𝜏+𝑠) βˆ’xC𝛼(𝜏+𝑠) | (xnonC𝛼(𝜏) βˆ’xC𝛼(𝜏))). For ligand degrees of freedom

π‘ž(t+y(𝜏+𝑠) Β·Rβˆ’cT(t+xC𝛼(𝜏+𝑠) Β·R) |t+y(𝜏) Β·Rβˆ’cT(t+xC𝛼(𝜏) Β·R))

=π‘ž(t+y(𝜏+𝑠) Β·Rβˆ’cTtβˆ’cTxC𝛼(𝜏+𝑠) Β·R|t+y(𝜏) Β·Rβˆ’cTtβˆ’cTxC𝛼(𝜏) Β·R)

=π‘ž(y(𝜏+𝑠) Β·Rβˆ’cTxC𝛼(𝜏+𝑠) Β·R|y(𝜏) Β·Rβˆ’cTxC𝛼(𝜏) Β·R)

=N π‘’βˆ’πœƒ 𝑠(y(𝜏) βˆ’cTxC𝛼(𝜏));(1βˆ’π‘’βˆ’2πœƒ 𝑠) 𝜎2 2πœƒ2

RΒ· (I+cTc) Β·RT

=π‘ž(y(𝜏+𝑠) βˆ’cTxC𝛼(𝜏+𝑠) |y(𝜏) βˆ’cTxC𝛼(𝜏))

where we have usedcTt=tup to a column-wise broadcasting operation based on the row-wise normalization property of the softmax-transformed contact mapc.

Since all transition kernels are SE(3)-equivariant, it follows that the score βˆ‡_Z log q_t(Z) is also SE(3)-equivariant, βˆ‡_{Z'} log q_t(Z') = βˆ‡_Z log q_t(Z)Β·R with Z' = t + ZΒ·R, and thus the reverse-time SDE is equivariant. While the forward SDE is also E(3)-equivariant, since the noising process satisfies q(βˆ’Z(Ο„+s) | βˆ’Z(Ο„)) = q(Z(Ο„+s) | Z(Ο„)), it is worth noting that the reverse SDE is only SE(3)-equivariant: parity-inversion transformations i : Z ↦ βˆ’Z of the data distribution p_data are physically forbidden, so the score βˆ‡_Z log q_t(Z) has broken chiral symmetry in general, i.e., there exists Z such that βˆ‡_{βˆ’Z} log q_t(βˆ’Z) β‰  βˆ’βˆ‡_Z log q_t(Z).

Small-molecule featurization and encoding

We consider two types of nodes to construct a graph-based molecular representation: (a) heavy atoms i ∈ {1, 2, Β·Β·Β·, N_atom} and (b) local coordinate frames u ∈ {1, 2, Β·Β·Β·, N_frame}, u := u(ijk), formed by atom triplets (i, j, k) that are connected by bonds (ij) and (jk). We introduce the Path-integral Graph Transformer (PiFormer), an attentional neural network with edge-level operations inspired by the path-integral formulation of quantum mechanics, to infer long-range interatomic geometrical correlations for small molecules from their graph-topological properties. PiFormer operates on the following classes of embeddings:

β€’ Atom representations H ∈ R^{N_atom Γ— c}. The input atom representation is a concatenation of one-hot encodings of the element group index and period index of the given atom, which is embedded by a linear projection layer R^{18+7} β†’ R^c;

β€’ Frame representations F ∈ R^{N_frame Γ— c}. For a given frame u, F_u is initialized by a 2-layer MLP R^{4Γ—2+18+7} β†’ R^c that embeds the bond-type encodings (defined as [is_single, is_double, is_triple, is_aromatic]) of the "incoming" bond (i(u), j(u)) and the "outgoing" bond (j(u), k(u)), together with the atom-type encoding of the center atom j(u);

β€’ Stereochemistry encodings S ∈ R^{N_frame Γ— N_frame Γ— c_s}. S is a sparse tensor whose element S_uv is nonzero only if the pair of frames (u, v) is adjacent, i.e., u and v share a common incoming or outgoing bond;

β€’ Pair representations G ∈ R^{N_frame Γ— N_atom Γ— c_p}. G is initialized by an outer sum of H and F, which is added to a linear projection of S and passed to a 2-layer MLP.

Elements of the stereochemistry encoding tensor S are defined as

$$\begin{aligned}
\mathbf{S}_{uv,0} &:= (\mathrm{common\_bond}(u,v) = \mathrm{incoming\_bond}(u)) && (4.17a) \\
\mathbf{S}_{uv,1} &:= (\mathrm{common\_bond}(u,v) = \mathrm{incoming\_bond}(v)) && (4.17b) \\
\mathbf{S}_{uv,2} &:= (\mathrm{common\_bond}(u,v) = \mathrm{outgoing\_bond}(u)) && (4.17c) \\
\mathbf{S}_{uv,3} &:= (\mathrm{common\_bond}(u,v) = \mathrm{outgoing\_bond}(v)) && (4.17d) \\
\mathbf{S}_{uv,4} &:= i(v) \in \{i(u), j(u), k(u)\} && (4.17e) \\
\mathbf{S}_{uv,5} &:= j(v) \in \{i(u), j(u), k(u)\} && (4.17f) \\
\mathbf{S}_{uv,6} &:= k(v) \in \{i(u), j(u), k(u)\} && (4.17g) \\
\mathbf{S}_{uv,7} &:= (j(u) = j(v)) \land \mathrm{is\_above\_plane}(u,v) && (4.17h) \\
\mathbf{S}_{uv,8} &:= (j(u) = j(v)) \land \mathrm{is\_below\_plane}(u,v) && (4.17i) \\
\mathbf{S}_{uv,9} &:= \mathrm{is\_double\_or\_aromatic}(\mathrm{common\_bond}(u,v)) \lor \mathrm{is\_same\_side}(u,v) && (4.17j) \\
\mathbf{S}_{uv,10} &:= \mathrm{is\_double\_or\_aromatic}(\mathrm{common\_bond}(u,v)) \lor \mathrm{not\_same\_side}(u,v) && (4.17k)
\end{aligned}$$

is_above_plane(u, v) is defined as whether one of the three atoms of frame v lies above the plane formed by frame u, with normal vector

$$\mathbf{v}_u = \frac{(\mathbf{r}_{j(u)} - \mathbf{r}_{i(u)})\times(\mathbf{r}_{k(u)} - \mathbf{r}_{j(u)})}{\|\mathbf{r}_{j(u)} - \mathbf{r}_{i(u)}\|\,\|\mathbf{r}_{k(u)} - \mathbf{r}_{j(u)}\|};$$

is_same_side(u, v) is defined as whether the two bonds not shared between u and v lie on the same side of the common bond, which is equivalent to v_uΒ·v_v > 0, and vice versa for not_same_side. Our current technical implementation of is_above_plane and is_same_side is based on computing the normal vectors and dot products using coordinates from an auxiliary conformer, but we note that in principle all stereochemistry encodings can be generated from cheminformatic rules without explicit coordinate generation.
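For illustration, a minimal NumPy sketch of these coordinate-based predicates is given below; anchoring the plane of frame u at its center atom j(u) and the small norm stabilizer are assumptions of the sketch, and the coordinates would come from an auxiliary conformer as described above.

```python
import numpy as np

def frame_normal(r_i, r_j, r_k, eps=1e-8):
    """Normal vector v_u of frame u = (i, j, k), as defined above."""
    b1, b2 = r_j - r_i, r_k - r_j
    return np.cross(b1, b2) / (np.linalg.norm(b1) * np.linalg.norm(b2) + eps)

def is_above_plane(frame_u, frame_v):
    """True if one of the three atoms of frame v lies above the plane of frame u.

    frame_u, frame_v: (3, 3) arrays holding the coordinates of atoms (i, j, k).
    """
    v_u = frame_normal(*frame_u)
    center = frame_u[1]                               # center atom j(u), assumed plane anchor
    return bool(np.any((frame_v - center) @ v_u > 0.0))

def is_same_side(frame_u, frame_v):
    """True if the two non-shared bonds lie on the same side of the common bond, i.e. v_u . v_v > 0."""
    return float(frame_normal(*frame_u) @ frame_normal(*frame_v)) > 0.0
```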

We additionally denote MASK_S as an N_frame Γ— N_frame logical matrix defined as the adjacency matrix over frame pairs (u, v).

The notion of "frames" in a coordinate-free topological molecular graph is justified by the inductive bias that most bending and stretching modes in molecular vibrations are of high frequency, i.e., most bond lengths and bond angles fall into a small range as predicted by valence bond theory, such that the local frames forms a consistent molecular representation without prior knowledge on 3D coordinates. PiFormer operates solely on the molecular representation defined by the input graph, and the frame coordinates(t,R)are initialized right before the ESDM blocks.

Table 4.2: Composition of the dataset used for pretraining the small-molecule encoder.

| Data source | Num. samples collected | Sampling weight | L_3D | L_CC | L_MLM |
| --- | --- | --- | --- | --- | --- |
| BioLip [233] ligands (deposited date < 2019.1.1) | 160k | 2.0 | + | βˆ’ | + |
| GEOM [234] | 450k Γ— 5 | 0.4 | + | βˆ’ | + |
| DES370k [235] | 370k | 1.0 | + | βˆ’ | + |
| PEPCONF [236] | 3775 | 5.0 | + | βˆ’ | + |
| PCQM4Mv2 [237, 238] | 3.4M | 0.1 | + | βˆ’ | + |
| Chemical Checker [239] | 800k | 1.0 | βˆ’ | + | + |

The forward pass of a single PiFormer block is expressed as:

$$\mathbf{U}^{l} = \mathrm{Softmax}_{\mathrm{row\text{-}wise}}\!\left(\frac{(\mathbf{F}\cdot\mathbf{W}_{K,l})\cdot(\mathbf{F}\cdot\mathbf{W}_{Q,l})^{\mathrm{T}} + \mathbf{S}\cdot\mathbf{W}_{S,l}}{\sqrt{c_{\mathrm{P}}}} + \mathrm{Inf}\cdot\mathrm{MASK}_S\right) \tag{4.18a}$$

$$\mathbf{G}_{\mathrm{out}} = \Big(1 + \tfrac{1}{K}\mathbf{U}^{l}\Big)^{K}\cdot(\mathbf{G}^{l}\cdot\mathbf{W}_{G,l}), \qquad \mathbf{G}^{l+1} = \mathrm{MLP}\big([\mathbf{G}_{\mathrm{out}}\,\|\,(\mathbf{F}^{l})^{\mathrm{T}}\cdot\mathbf{H}^{l}\,\|\,\mathbf{G}^{l}]\big) + \mathbf{G}^{l} \tag{4.18b}$$

$$\mathbf{F}_{\mathrm{out}} = \mathrm{MHAwithEdgeBias}\big(\mathbf{F}^{l}, \mathbf{H}^{l}, (\mathbf{G}^{l+1})^{\mathrm{T}}\big), \qquad \mathbf{F}^{l+1} = \mathrm{MLP}(\mathbf{F}_{\mathrm{out}} + \mathbf{F}^{l}) + \mathbf{F}^{l} \tag{4.18c}$$

$$\mathbf{H}_{\mathrm{out}} = \mathrm{MHAwithEdgeBias}\big(\mathbf{H}^{l}, \mathbf{F}^{l+1}, \mathbf{G}^{l+1}\big), \qquad \mathbf{H}^{l+1} = \mathrm{MLP}(\mathbf{H}_{\mathrm{out}} + \mathbf{H}^{l}) + \mathbf{H}^{l} \tag{4.18d}$$

where K denotes the propagation-length truncation for the learnable graph kernel exp(U^l) β‰ˆ (1 + U^l/K)^K in a single PiFormer block, and MLP denotes a 3-layer multilayer perceptron combined with layer normalization [158]. W_K, W_Q, W_S, W_G are trainable linear weight matrices. MHAwithEdgeBias(X_1, X_2, X_edge) denotes a multi-head cross-attention layer between source node embeddings X_1 and target node embeddings X_2, with the edge embeddings X_edge entering the attention computation as a relative positional encoding term, as in the relation-aware transformer introduced in [187]. For all models described in this study, we set l_max = 6 and K = 8.
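The path-integral propagation in Eqs. (4.18a)–(4.18b) can be sketched as follows; the masking convention (setting non-adjacent logits to βˆ’βˆž before the row-wise softmax) and the toy tensor shapes are assumptions of this illustration rather than the exact implementation.

```python
import numpy as np

def truncated_graph_kernel(U, K=8):
    """Approximate the kernel exp(U) by (I + U/K)^K, the propagation-length-truncated form used above."""
    return np.linalg.matrix_power(np.eye(U.shape[0]) + U / K, K)

def piformer_pair_update(F, S_bias, mask, G, W_K, W_Q, W_G, K=8, c_p=128.0):
    """Sketch of Eqs. (4.18a)-(4.18b) for one block.

    F:      (N_frame, c) frame representations
    S_bias: (N_frame, N_frame) projected stereochemistry bias (S . W_S)
    mask:   (N_frame, N_frame) boolean frame adjacency (MASK_S)
    G:      (N_frame, N_atom, c_p) pair representations
    """
    logits = ((F @ W_K) @ (F @ W_Q).T + S_bias) / np.sqrt(c_p)
    logits = np.where(mask, logits, -np.inf)          # restrict attention to adjacent frame pairs
    U = np.exp(logits - logits.max(axis=-1, keepdims=True))
    U = U / U.sum(axis=-1, keepdims=True)             # row-wise softmax
    kernel = truncated_graph_kernel(U, K=K)           # multi-hop propagation over the frame graph
    return np.einsum('uv,vac->uac', kernel, G @ W_G)  # G_out in Eq. (4.18b)
```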

PiFormer model pretraining

In Table 4.2 we summarize the small-molecule datasets used to pretrain the PiFormer encoder in the reported NeuralPLexer model. The loss function used in PiFormer pretraining is the following:

$$\mathcal{L}_{\mathrm{lig\text{-}pretraining}} = \mathcal{L}_{\mathrm{3D\text{-}marginal}} + \mathcal{L}_{\mathrm{3D\text{-}DSM}} + \mathcal{L}_{\mathrm{CC\text{-}regression}} + 0.01\cdot\mathcal{L}_{\mathrm{CC\text{-}ismask}} + 0.1\cdot\mathcal{L}_{\mathrm{MLM}} \tag{4.19}$$

We use a mixture density network head to encourage alignment between the learned last-layer pair representations G and the intra-molecular 3D coordinate marginals. For a single training sample with 3D coordinate observation y:

$$\mathcal{L}_{\mathrm{3D\text{-}marginal}} = -\sum_{u}^{N_{\mathrm{frame}}}\sum_{i}^{N_{\mathrm{atom}}} \log \frac{\sum_{l}^{N_{\mathrm{modes}}} \exp(w_{iul})\, q_{\mathrm{3D}}\big(T_u^{-1}\circ\mathbf{y}_i \,\big|\, \mathbf{m}_{iul}\big)}{\sum_{l}^{N_{\mathrm{modes}}} \exp(w_{iul})} \tag{4.20}$$

where T_u := (R_u, t_u) and T_u^{βˆ’1} ∘ y_i := (y_i βˆ’ t_u)Β·R_u^T. Here t_u ∈ R^3 and R_u ∈ SO(3) are given by

$$(\mathbf{R}_u, \mathbf{t}_u) = \mathrm{rigidFrom3Points}\big(\mathbf{y}_{i(u)}, \mathbf{y}_{j(u)}, \mathbf{y}_{k(u)}\big), \tag{4.21}$$

where rigidFrom3Points is the Gram–Schmidt-based frame construction operation described in Ref. [5], Alg. 21; we additionally add a numerical stability factor of 0.01 Γ… to the vector-norm calculations to handle edge cases when computing rotation matrices from perturbed coordinates. Each component of the 3D distance–angle distribution q_3D is parameterized by

π‘ž3D(t|πœ‡, 𝜎,v) =Gaussian( βˆ₯tβˆ₯2|πœ‡, 𝜎) Γ—PowerSpherical( t

βˆ₯tβˆ₯2|v, 𝑑 =3) (4.22) where PowerSpherical is a power spherical distribution introduced in [240];m𝑖𝑒𝑙 :=

(πœ‡, 𝜎,v)𝑖𝑒𝑙, and

[w𝑖𝑒,m𝑖𝑒] =3DMixtureDensityHead G𝑙max

𝑖𝑒. (4.23)

where 3DMixtureDensityHead is a 3-layer MLP.
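A minimal sketch of the frame construction with the stated 0.01 Γ… stabilization is shown below; the exact axis ordering and the placement of the stabilizer follow AlphaFold2's Alg. 21 only approximately and should be read as assumptions of this illustration.

```python
import numpy as np

def rigid_from_3_points(x1, x2, x3, eps=0.01):
    """Gram-Schmidt frame construction from three points, with x2 as the frame origin.

    eps (in Angstrom) stabilizes the norm divisions for nearly degenerate perturbed coordinates.
    """
    v1 = x3 - x2
    v2 = x1 - x2
    e1 = v1 / (np.linalg.norm(v1) + eps)
    u2 = v2 - e1 * (e1 @ v2)
    e2 = u2 / (np.linalg.norm(u2) + eps)
    e3 = np.cross(e1, e2)                   # third axis is a pseudovector (right-handed completion)
    R = np.stack([e1, e2, e3], axis=0)      # rows are the local axes
    return R, x2

# local coordinates of a point y in frame (R_u, t_u): (y - t_u) @ R_u.T, matching T_u^{-1} o y above
```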

Using an equivariant graph transformer similar to ESDM (see Sec. 4.5) but with all receptor nodes dropped, we construct a geometry prediction head to perform global molecular 3D structure denoising. We sample noised coordinates y(t) from a VP-SDE [201] and introduce an SE(3)-invariant denoising score matching loss based on the Frame Aligned Point Error (FAPE) [5]:

$$\mathcal{L}_{\mathrm{3D\text{-}DSM}} = \mathbb{E}_{t\sim(0,1],\,\mathbf{y}_t\sim q_{0:t}(\cdot\,|\,\mathbf{y})}\left[\operatorname{mean}_{u,i}\,\min\big(\|T_u^{-1}\circ\mathbf{y}_i - \hat{T}_u^{-1}\circ\hat{\mathbf{y}}_i\|_2,\ 10\ \text{Γ…}\big)\cdot\sqrt{\alpha_t}\right] \tag{4.24}$$

where

$$\hat{\mathbf{y}} = \mathrm{GeometryPredictionHead}\big(\mathbf{y}_t;\ \mathbf{H}^{l_{\max}}, \mathbf{F}^{l_{\max}}, \mathbf{G}^{l_{\max}}\big). \tag{4.25}$$

L_CC-regression is a mean squared loss for fitting the "level 1" Chemical Checker (CC) [239] embeddings, which represent harmonized and integrated bioactivity data, and L_CC-ismask is an auxiliary binary cross-entropy loss for classifying whether a specific CC entry is available for any molecule in the Chemical Checker dataset. The model is trained for 20 epochs with a 15% masking ratio for atom and bond encodings, a 40% masking ratio for stereochemistry encodings, and dropout = 0.1; L_MLM is a standard cross-entropy loss for predicting the masked tokens, added to encourage learning of molecular graph topology distributions, but empirically we found that L_MLM converged within the first two epochs and did not influence the learning dynamics of the other tasks.

Figure 4.4: Network architecture schematics for the encoders and contact prediction modules.

Protein sequence and backbone encoding

The inputs to the protein encoder are (i) the one-hot amino-acid type (20 standard residues + 1 "unknown" token) encoding of the 1D sequence s, (ii) the backbone (N, CΞ±, C) coordinates of a perturbed protein structure x(t) sampled from the forward SDEs described in Table 4.1, and (iii) a random Fourier encoding of the diffusion time step t. To reduce memory cost, the protein backbone is represented as a sparse graph with each node mapped to one amino acid residue and randomized edges drawn according to the inclusion probability p(add_edge(i, j)) = exp(βˆ’β€–x_i(t) βˆ’ x_j(t)β€–/10.0 Γ…) for all residue pairs (i, j). The edge representations are initialized as a random Fourier encoding of the signed sequence distance between two residues (i, j) if i and j are located on the same chain, and are initialized as zeros if (i, j) are located on different chains.
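A short sketch of this randomized sparse residue graph is given below; treating the signed sequence distance as a simple index difference within a chain is an assumption of the sketch.

```python
import numpy as np

def sample_protein_edges(x_ca_t, chain_id, scale=10.0, rng=None):
    """Draw residue-residue edges with inclusion probability exp(-||x_i(t) - x_j(t)|| / 10 A)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x_ca_t)
    dist = np.linalg.norm(x_ca_t[:, None, :] - x_ca_t[None, :, :], axis=-1)
    keep = (rng.random((n, n)) < np.exp(-dist / scale)) & ~np.eye(n, dtype=bool)
    src, dst = np.nonzero(keep)
    same_chain = chain_id[src] == chain_id[dst]
    seq_dist = np.where(same_chain, dst - src, 0)     # signed sequence distance; zeroed across chains
    return np.stack([src, dst]), seq_dist
```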

The protein encoder is composed of 4 stacks of invariant point attention (IPA) [5] blocks with two technical modifications:

β€’ The attention scores are computed on the sparsified protein graph, instead of the densely-connected graph as in standard self-attention layers;

β€’ Each node i is associated with n_head replicas of coordinate frames {T}_i, instead of a single frame as in a static structure representation. {T}_i is initialized as n_head copies of the backbone frame constructed by rigidFrom3Points(x_{N,i}, x_{CΞ±,i}, x_{C,i}). The layer output is n_head Γ— 7 scalars representing the translation vector and the quaternion variable used to update the frame associated with each attention head (see the sketch below).

The multi-replica design is found to moderately improve model convergence at a fixed network size. For conciseness, we refer to the modified invariant point attention as GraphIPA.
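To illustrate the per-head frame update, the sketch below converts the n_head Γ— 7 outputs (3 translation components followed by 4 quaternion components, an assumed ordering) into rigid updates of the frame replicas; the composition convention is likewise an assumption.

```python
import numpy as np

def quaternion_to_rotation(q):
    """Convert a (possibly unnormalized) quaternion (w, x, y, z) to a rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def update_frame_replicas(frames, head_outputs):
    """Apply per-head (translation, quaternion) updates to the n_head frame replicas of one node.

    frames:       list of (R, t) tuples, one per attention head
    head_outputs: (n_head, 7) array of 3 translation + 4 quaternion components
    """
    updated = []
    for (R, t), out in zip(frames, head_outputs):
        dt, dq = out[:3], out[3:]
        updated.append((quaternion_to_rotation(dq) @ R, t + dt))   # compose update with current frame
    return updated
```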

Contact predictor

As illustrated in Figure 4.4, the embeddings from the protein and small-molecule ligand graph encoders are passed to the contact predictor to estimate the contact map L. A protein–ligand graph is created before the contact predictor forward pass, with pairwise intermolecular edges connecting all protein residues and ligand atoms.

The contact predictor is composed of 4 modules, each comprising an intra-protein GraphIPA block, a bidirectional intra-ligand-graph self-attention layer, a bidirectional self-attention layer on the protein–ligand intermolecular edges, and an MLP that updates the protein–ligand edge representations using the attention maps and previous-layer edge representations. The final edge representations are used to predict L as described by Equation 4.5. The contact predictor weights are shared across all one-hot contact matrix sampling iterations.

All-atom graph featurization

All protein heavy-atom nodes (features and 3D coordinates) and the ligand 3D coordinates sampled from the geometry prior q_{T*} are added to the network inputs right before the ESDM block forward pass. Each protein atom representation is initialized as the concatenation of:

β€’ The residue-wise representation from the protein backbone encoder;

β€’ A one-hot encoding of its atom type as defined by the 37 standard amino acid heavy-atom symbols in the PDB format [241];

β€’ A random Fourier encoding of the diffusion time step t.

A random Fourier encoding of the diffusion time step t is also concatenated to the ligand atom representations from the ligand graph encoder, and the concatenated representations are transformed by a 2-layer MLP.

Given the noised all-atom protein coordinates at diffusion time t, the following edges are added to the protein–ligand graph:

β€’ Edges connecting a protein atom node and the residue node that the protein atom belongs to;

β€’ Edges connecting two protein atom nodes that are within the same residue;

β€’ Edges connecting two protein atom nodes that are within 6.0 Γ… distance;

β€’ Edges connecting a protein atom node and a ligand atom node that are within 8.0 Γ… distance;

The protein-atom-involving edges are initialized as a concatenation of the following features:

β€’ A boolean code indicating whether the source node and target node belong to the same residue or the same ligand molecule;

β€’ A boolean code indicating whether there is a covalent bond between the source and target nodes. The covalent bonding information for protein–ligand edges is resolved based on the reference protein–ligand complex structure, where an atom pair (i, j) is considered covalently bonded if the distance satisfies d_ij < 1.2 Οƒ_ij, where Οƒ_ij = (Οƒ_i + Οƒ_j)/2 is the average van der Waals (VdW) radius of the atom pair (see the sketch below).
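The covalent-bond rule in the last bullet can be sketched as follows; the radii table is a hypothetical placeholder with typical Bondi values and is not necessarily the set used in the actual featurization pipeline.

```python
# Hypothetical van der Waals radii (Angstrom) for illustration only.
VDW_RADIUS = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def is_covalent(elem_i, elem_j, d_ij, factor=1.2):
    """Apply the rule d_ij < 1.2 * sigma_ij with sigma_ij = (sigma_i + sigma_j) / 2."""
    sigma_ij = 0.5 * (VDW_RADIUS[elem_i] + VDW_RADIUS[elem_j])
    return d_ij < factor * sigma_ij

# e.g. is_covalent("C", "O", 1.43) -> True; is_covalent("C", "O", 3.2) -> False
```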

To focus the learning problem on the binding-site parts of the protein–ligand complex structure, the following native contact encoding features are added to the part of the protein sub-graph that does not involve residues within 6.0 Γ… of any ligand heavy atom: given two amino acid residues, we define the native contact encoding as the concatenation of the clean-protein-structure N–N, CΞ±–CΞ±, and C–C distances discretized into [2.0 Γ…, 4.0 Γ…, 6.0 Γ…, 8.0 Γ…] bins (a minimal sketch of this discretization follows). These features are embedded by a 2-layer MLP and added to the residue–residue edge representations. Note that at training time the native contact encodings are computed from the protein structure in the ground-truth protein–ligand complex, while at sampling time they are computed from the input backbone template.
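The exact bin-edge convention (here, one extra bin for out-of-range distances) is an assumption of this sketch.

```python
import numpy as np

def native_contact_encoding(d_nn, d_caca, d_cc, bins=(2.0, 4.0, 6.0, 8.0)):
    """Concatenate one-hot discretizations of the N-N, Ca-Ca, and C-C distances into the listed bins."""
    def one_hot(d):
        idx = int(np.searchsorted(np.asarray(bins), d))   # 0 for d < 2 A, ..., len(bins) for d >= 8 A
        vec = np.zeros(len(bins) + 1)
        vec[idx] = 1.0
        return vec
    return np.concatenate([one_hot(d_nn), one_hot(d_caca), one_hot(d_cc)])

# e.g. native_contact_encoding(3.1, 5.4, 7.9) returns a 15-dimensional feature vector
```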

The ESDM architecture

Figure 4.5: Network architecture of a single block in the equivariant structure diffusion module (ESDM). Arrows indicate information flow directions, and "+" indicates an element-wise tensor summation.

The neural network architecture of the proposed equivariant structure diffusion module (ESDM) is summarized in Figure 4.5. The forward-pass expressions of the trainable modules PointSetAttentionwithEdgeBias, LocalUpdateUsingChannelWiseGating, LocalUpdateUsingReferenceRotation, and PredictDrift are defined as:

$$\mathbf{f}'_s, \mathbf{f}'_v, \mathbf{e}' = \mathrm{PointSetAttentionwithEdgeBias}(\mathbf{f}_s, \mathbf{f}_v, \mathbf{e}, \mathbf{t}),\ \text{where} \tag{4.26a}$$
$$\mathbf{f}_Q, \mathbf{f}_K, \mathbf{f}_V = \mathbf{W}_s\cdot\mathbf{f}_s, \qquad \mathbf{t}_Q, \mathbf{t}_K, \mathbf{t}_V = (\mathbf{t}/10\,\text{Γ…} + \mathbf{f}_v\cdot\mathbf{W}_v) \tag{4.26b}$$
$$\mathbf{z}_{ij} = \frac{1}{\sqrt{c_{\mathrm{head}}}}\big(\mathbf{f}_{Q,i}^{\mathrm{T}}\cdot\mathbf{f}_{K,j}\big) + \mathbf{W}_e\cdot\mathbf{e}_{ij} - \frac{w_{ij}}{\sqrt{18\,c_{\mathrm{head}}}}\,\|\mathbf{t}_Q - \mathbf{t}_K\|_2^2 \tag{4.26c}$$
$$\boldsymbol{\alpha}_{ij} = \mathrm{Softmax}_{j\in\{i\}}(\mathbf{z}_{ij}), \qquad \mathbf{e}' = \mathrm{MLP}(\mathbf{z}_{ij}) \tag{4.26d}$$
$$\mathbf{f}'_s = \sum_{j\in\{i\}}\boldsymbol{\alpha}_{ij}\odot\mathbf{f}_V, \qquad \mathbf{f}'_v = \Big(\sum_{j\in\{i\}}\boldsymbol{\alpha}_{ij}\odot\mathbf{t}_V\Big) - \mathbf{t}/10\,\text{Γ…} \tag{4.26e}$$

where f_s ∈ R^{N_nodes Γ— c}, f_v ∈ R^{N_nodes Γ— 3 Γ— c}, e ∈ R^{N_edges Γ— c}, and t ∈ R^{N_nodes Γ— 3}. Note that the expression for computing the attention weights z is directly adapted from IPA.

$$\mathbf{f}'_s, \mathbf{f}'_v = \mathrm{LocalUpdateUsingChannelWiseGating}(\mathbf{f}_s, \mathbf{f}_v),\ \text{where} \tag{4.27a}$$
$$\mathbf{f}'_s, \mathbf{f}_{\mathrm{gate}} = \mathrm{MLP}(\mathbf{f}_s \oplus \|\mathbf{f}_v\|_2) \tag{4.27b}$$
$$\mathbf{f}'_v = (\mathbf{f}_v\cdot\mathbf{W}_v)\odot\mathbf{f}_{\mathrm{gate}} \tag{4.27c}$$

As only linear layers and vector scaling operations are used to update the vector representations f_v, LocalUpdateUsingChannelWiseGating is E(3)-equivariant.
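A compact sketch of this channel-wise gating update (Eq. 4.27) for a single node is shown below; the toy `mlp` callable and tensor shapes are assumptions used only to make the equivariance argument explicit.

```python
import numpy as np

def channelwise_gating_update(f_s, f_v, W_v, mlp):
    """Sketch of Eq. (4.27): gate per-channel vector features with rotation-invariant scalars.

    f_s: (c,) scalar features;  f_v: (3, c) vector features;  W_v: (c, c) channel-mixing weights;
    mlp: callable mapping the invariant inputs to (f_s_new, f_gate), each of size c.
    """
    v_norm = np.linalg.norm(f_v, axis=0)          # per-channel vector norms, invariant to rotations of f_v
    f_s_new, f_gate = mlp(np.concatenate([f_s, v_norm]))
    f_v_new = (f_v @ W_v) * f_gate                # linear channel mixing followed by invariant gating
    return f_s_new, f_v_new

# Rotating the input vectors, R @ f_v, rotates f_v_new identically while leaving f_s_new unchanged,
# which is the equivariance property noted above.
```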

$$\mathbf{f}'_s, \mathbf{f}'_v = \mathrm{LocalUpdateUsingReferenceRotation}(\mathbf{f}_s, \mathbf{f}_v, \mathbf{R}\in\mathrm{SO}(3)),\ \text{where} \tag{4.28a}$$
$$\mathbf{f}'_s, \mathbf{f}_v^{\mathrm{loc}} = \mathrm{MLP}(\mathbf{f}_s \oplus \mathbf{R}^{\mathrm{T}}\cdot\mathbf{f}_v \oplus \|\mathbf{f}_v\|_2) \tag{4.28b}$$
$$\mathbf{f}'_v = \mathbf{R}\cdot\mathbf{f}_v^{\mathrm{loc}} \tag{4.28c}$$

Since the third row of R is a pseudovector, as described in rigidFrom3Points, the determinant of the rotation matrix R is unchanged under parity-inversion transformations i : x ↦ βˆ’x; the intermediate quantity f_v^loc is therefore SE(3)-invariant but in general not invariant under parity inversion i. This property ensures that ESDM can learn the correct chiral-symmetry-breaking behavior in molecular 3D conformation distributions.

$$\Delta\mathbf{t} = \mathrm{PredictDrift}(\mathbf{f}_s, \mathbf{f}_v),\ \text{where} \tag{4.29a}$$
$$\mathbf{o}_{\mathrm{scale}} = \mathrm{Softplus}(\mathrm{MLP}(\mathbf{f}_s)) \tag{4.29b}$$
$$\Delta\mathbf{t} = (\mathbf{f}_v\cdot\mathbf{W}_{\mathrm{drift}})\odot\mathbf{o}_{\mathrm{scale}}. \tag{4.29c}$$

The predicted drift vectors Ξ”t are added to the input node coordinates; the final coordinate outputs are taken as the predicted denoised observations xΜ‚(0), Ε·(0).

Model training and hyperparameters

The loss function for NeuralPLexer training is:

$$\mathcal{L}_{\mathrm{training}} = \mathbb{E}_{t\sim(0,1]}\Big[\mathcal{L}_{\mathrm{contact}}(t) + \mathcal{L}_{\mathrm{gp\text{-}mean}}(t) + \mathcal{L}_{\mathrm{DSM\text{-}prot}}(t) + \mathcal{L}_{\mathrm{DSM\text{-}ligand}}(t) + \mathcal{L}_{\mathrm{DSM\text{-}site}}(t)\Big] \tag{4.30}$$

We train the contact predictor ψ to match the posterior distribution defined by the observed contact map q_L := Categorical_{n_res Γ— m}(L), where L := β¨_k L_k, with intermediate ligand-wise one-hot matrices l_k sampled from q_{L_k}:

$$\mathcal{L}_{\mathrm{contact}}(t) = \mathrm{KL}\big(q_{\mathbf{L}}\,\big\|\,q_{\psi}(\cdot\,|\,\mathbf{0}, \mathbf{s}, \tilde{\mathbf{x}}(t), \mathcal{G})\big) + \sum_{k=1}^{K}\mathbb{E}_{\mathbf{l}_k\sim q_{\mathbf{L}_k}}\,\mathrm{JS}\Big(q_{\mathbf{L}_k}\,\Big\|\,q_{\psi,k}\Big(\cdot\,\Big|\,\sum_{r=1}^{k}\mathbf{l}_r, \mathbf{s}, \tilde{\mathbf{x}}(t), \mathcal{G}\Big)\Big) \tag{4.31}$$

where KL denotes the Kullback–Leibler divergence and JS denotes the Jensen–Shannon divergence. An auxiliary loss is added to the mean term in the predicted geometry prior:

$$\mathcal{L}_{\mathrm{gp\text{-}mean}}(t) = \mathbb{E}_{\mathbf{l}_k\sim q_{\mathbf{L}_k}}\left[\Big\|\mathbf{c}^{\mathrm{T}}_{\psi,k}\Big(\sum_{r=1}^{k}\mathbf{l}_r, \mathbf{s}, \tilde{\mathbf{x}}(t), \mathcal{G}\Big)\cdot\tilde{\mathbf{x}}(t) - \mathbf{c}\cdot\tilde{\mathbf{x}}(t)\Big\|\right] \tag{4.32}$$

The denoising score matching (DSM) loss expressions are given by

$$\mathcal{L}_{\mathrm{DSM\text{-}prot}} = \mathbb{E}_{\mathbf{x}(t),\mathbf{y}(t)\sim q_{0:t}(\cdot\,|\,\mathbf{x}(0),\mathbf{y}(0))}\left[\frac{1}{n}\sum_{i}\|\mathbf{x}_i(0) - \hat{\mathbf{x}}_i(0)\|_2/\sigma(t)\right] \tag{4.33}$$

L_DSM-site is defined analogously but averaged over residues that are within 6.0 Γ… of the ligand in the ground-truth structure. Lastly,

$$\mathcal{L}_{\mathrm{DSM\text{-}ligand}} = \mathbb{E}_{\mathbf{x}(t),\mathbf{y}(t)\sim q_{0:t}(\cdot\,|\,\mathbf{x}(0),\mathbf{y}(0))}\left[\frac{1}{m}\sum_{i}\|\mathbf{y}_i(0) - \hat{\mathbf{y}}_i(0)\|_2/\sigma(t)\right]. \tag{4.34}$$

For the ligand graph encoder, we use 6 PiFormer blocks with an embedding dimension of 512 for the atom and frame representations, and a dimension of 128 for the pair representations. For the protein encoder, we use 4 GraphIPA blocks with a node embedding dimension of 256 and an edge embedding dimension of 64. For the contact predictor, we use 4 blocks with the same embedding sizes (256, 64) as in the protein encoder; linear layers are added to project the ligand representations to the length of the protein representations before they are passed to the contact predictor. For ESDM, we use a stack of 4 blocks with an embedding dimension of 64 for both node and edge representations; that is, each node i is associated with scalar representations f_{s,i} of size 64 and vector representations f_{v,i} of size [3, 64].
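For reference, the structure of the coordinate denoising losses (4.33)–(4.34) can be sketched as below; the site mask argument is an assumed illustrative input.

```python
import numpy as np

def dsm_coordinate_loss(x0, x0_hat, sigma_t):
    """Mean per-atom L2 error between clean and predicted coordinates, scaled by 1/sigma(t) (Eqs. 4.33-4.34)."""
    return np.mean(np.linalg.norm(x0 - x0_hat, axis=-1)) / sigma_t

def dsm_site_loss(x0, x0_hat, sigma_t, site_mask):
    """Same loss restricted to binding-site residues (within 6.0 A of the ligand)."""
    err = np.linalg.norm(x0 - x0_hat, axis=-1)
    return err[site_mask].mean() / sigma_t
```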

The pretrained small-molecule encoder weights are frozen during training. The model is trained with a batch size of 8 for 40 epochs, using dropout = 0.05 and an initial learning rate of 3E-4 with 1000 warmup steps followed by a cosine annealing learning-rate decay schedule. On the PDBBind 2020 training set (170k samples), the training run took 20 hours on a single NVIDIA Tesla V100-SXM2-32GB GPU.

Task-specific fine-tuning

The model used for fixed-backbone protein–ligand docking is fine-tuned on the original PDBBind training dataset, with all backbone atoms (N, CΞ±, C, O) and CΞ² atoms set to the ground-truth coordinates. Fine-tuning is performed for 20 epochs with a batch size of 8 without teacher forcing for the geometry prior (i.e., sampling the one-hot matrix l from the observed contact map q_L = Categorical_{n_res Γ— m}(L), using the predicted contact map ψ(l, s, xΜƒ, G) to parameterize the finite-time transition kernels q_t(Z(t) | Z(0)) during the model forward pass, and then backpropagating through the model end-to-end), using a cosine annealing schedule with an initial learning rate of 1E-4.

The model used for binding-site inpainting is fine-tuned on all split-chain samples from the original PDBBind training dataset. A protein-chain/ligand pair is included in the fine-tuning dataset if any heavy atom of the ligand is within 10 Γ… of any heavy atom of the protein chain. All receptor residues that are not within 6.0 Γ… of the ligand are set to the ground-truth coordinates, with the residue-wise and protein-atom-wise time-step encodings set to zeros. Fine-tuning is performed for 40 epochs with a batch size of 10 without teacher forcing for the geometry prior, using a cosine annealing schedule with an initial learning rate of 1E-4.

Computational details

Test datasets and post-processing

While the time-split-based PDBBind 2020 dataset has been used in previous works to study model generalization to novel protein–ligand pairs, we noticed that the 363-sample test set curated by [222] contains samples with improperly removed alternative ligand conformation ground truths or deleted adjacent chains that strongly interact with the ligand molecule in the full structure (e.g., binding sites near protein–protein interfaces). To ensure a reasonable comparison to docking-based methods, for the test dataset used in the fixed-backbone ligand conformation prediction experiments we keep all protein chains that are within 10 Γ… of the ligand from the original PDB file instead of using the receptor PDB files curated by PDBBind; we further removed all covalent ligands and peptide binders from the test set, as such cases are usually tackled by specialized algorithms [242, 243], resulting in 275 test samples in total used to produce the results presented in Figure 4.2a-d.

The AlphaFold2 structures used in the ligand-coupled binding site repacking task are predicted using ColabFold [244] with default MSA, recycling, and AMBER relaxation settings, and without using templates in order to best reflect the prediction fidelity of AlphaFold2 on novel targets (since all PDBBind test set samples are deposited before year 2021). The input sequences for all protein chains are obtained from https://www.ebi.ac.uk/pdbe/api/pdb/entry/molecules/ to avoid issues related to unresolved residues and to represent a realistic testing scenario where the protein backbone models are obtained from the full sequence.

Baseline method configurations

We run CB-Dock [223] with a heuristic low-sampling-intensity configuration (exhaustiveness = 1, number of clustered binding sites to start local docking = 1) such that the execution time (43 seconds per ligand on average on a single core of an Intel(R) Xeon(R) E5-2698 v4 @ 2.20GHz CPU) is comparable to deep-learning-based methods.