• Tidak ada hasil yang ditemukan


4.5 Appendix

The forward-time and reverse-time SDEs

The forward-time SDEs in NeuralPLexer are summarized in Table 4.1. For generality, we introduce an effective time stamp Ο„ such that the drift and diffusion coefficients are constant, ΞΈ(Ο„) = ΞΈ and Οƒ(Ο„) = Οƒ. The symbolic conventions are as follows:

β€’ x_CΞ± ∈ R^{n_res Γ— 3} denotes the collection of alpha-carbon coordinates in the protein, following the standard nomenclature for amino acid atom types;

β€’ x_nonCΞ± ∈ R^{(n βˆ’ n_res) Γ— 3} denotes the set of coordinates for all non-alpha-carbon protein atoms (backbone N, C, O, and all side-chain heavy atoms);

β€’ y ∈ R^{m Γ— 3} denotes all ligand heavy-atom coordinates. Note that m := Ξ£_{k=1}^{K} m_k, with m_k being the number of heavy atoms in each ligand molecule G_k.

Transition kernel densities and sampling

Following the general result for Ornstein–Uhlenbeck processes [231],

$$q_{0:t}(\mathbf{x}_t) = \mathcal{N}\!\left(\exp(-\boldsymbol{\Theta}t)\,\mathbf{x}_0;\ \int_0^t e^{\boldsymbol{\Theta}(s-t)}\,\boldsymbol{\sigma}\boldsymbol{\sigma}^{\mathrm{T}}\,e^{\boldsymbol{\Theta}^{\mathrm{T}}(s-t)}\,ds\right) \tag{4.6}$$

given the effective time-homogeneous diffusion process described in Table 4.1. For the internal coordinates x_nonCΞ± βˆ’ x_CΞ±,

$$d(\mathbf{x}_{\mathrm{nonC}\alpha} - \mathbf{x}_{\mathrm{C}\alpha}) = -\theta\,(\mathbf{x}_{\mathrm{nonC}\alpha} - \mathbf{x}_{\mathrm{C}\alpha})\,d\tau + \sigma\,d\mathbf{w}_2 - \sigma\,d\mathbf{w}_1. \tag{4.7}$$

Since the Brownian motions w_1, w_2 are independent, we obtain the transition kernel over a finite time interval s:

π‘ž(xnonC𝛼(𝜏+𝑠) βˆ’xC𝛼(𝜏+𝑠) |xnonC𝛼(𝑑) βˆ’xC𝛼(𝜏)) (4.8)

=N π‘’βˆ’πœƒ 𝑠(xnonC𝛼(𝜏) βˆ’xC𝛼(𝜏));(1βˆ’π‘’βˆ’2πœƒ 𝑠)𝜎2 πœƒ2

I

Similarly, for the ligand degrees of freedom,

$$d(\mathbf{y} - \mathbf{c}^{\mathrm{T}}\mathbf{x}_{\mathrm{C}\alpha}) = -\theta\,(\mathbf{y} - \mathbf{c}^{\mathrm{T}}\mathbf{x}_{\mathrm{C}\alpha})\,d\tau + \sigma\,d\mathbf{w}_3 - \sigma\,\mathbf{c}^{\mathrm{T}}d\mathbf{w}_1, \tag{4.9}$$

the transition kernel is

π‘ž(y(𝜏+𝑠) βˆ’cTxC𝛼(𝜏+𝑠) |y(𝜏) βˆ’cTxC𝛼(𝜏)) (4.10)

=N π‘’βˆ’πœƒ 𝑠(y(𝜏) βˆ’cTxC𝛼(𝜏));(1βˆ’π‘’βˆ’2πœƒ 𝑠) 𝜎2 2πœƒ2

(I+cTc) The transition kernel for alpha-carbon atoms is a standard Gaussian

π‘ž(xC𝛼(𝜏+𝑠) |xC𝛼(𝜏)) =N xC𝛼(𝜏);𝜎2𝑠I

. (4.11)

Defining σ₁² = σ²/θ², Οƒβ‚‚Β² = σ²·τ(T*), and τ̃ = 2ΞΈΟ„, we recover Eqs. (2)–(4). For model training in practice, we use an exponential noise schedule defined by Ο„ = Ο„β‚€e^t and Ο„β‚€ = σ²_min/σ², with Οƒ_min being a minimum perturbation scale as commonly adopted in variance-exploding (VE) SDEs [201]. For completeness, the SDEs defined on the transformed time horizon t ∈ [0, T*] are obtained by replacing the drift coefficient ΞΈ and the diffusion coefficient Οƒ with the following time-dependent counterparts:

πœƒ(𝑑) =πœƒΒ· 𝑑 𝜏 𝑑 𝑑

= 𝜎2

min

2𝜎2

1

𝑒𝑑 (4.12)

and

$$\sigma(t) = \sqrt{\sigma^2\cdot\frac{d\tau}{dt}} = \sigma_{\min}\,e^{\frac{1}{2}t}. \tag{4.13}$$

To sample from the marginal distribution q_t := p_data βˆ— q_{0:t} derived from the forward SDEs:

$$\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3 \sim \mathcal{N}(0;\,\mathbf{I}) \tag{4.14a}$$
$$(\mathbf{x}, \mathbf{y}) \sim p_{\mathrm{data}} \tag{4.14b}$$
$$\mathbf{x}_{\mathrm{C}\alpha}(t) = \mathbf{x}_{\mathrm{C}\alpha} + \sigma\sqrt{\tau(t)}\,\mathbf{z}_1 \tag{4.14c}$$
$$\mathbf{x}_{\mathrm{nonC}\alpha}(t) = \mathbf{x}_{\mathrm{C}\alpha}(t) + \sqrt{\alpha(t)}\,(\mathbf{x}_{\mathrm{nonC}\alpha} - \mathbf{x}_{\mathrm{C}\alpha}) + \sqrt{1-\alpha(t)}\,\sigma_1(\mathbf{z}_2 - \mathbf{z}_1) \tag{4.14d}$$
$$\mathbf{y}(t) = \mathbf{c}^{\mathrm{T}}\mathbf{x}_{\mathrm{C}\alpha}(t) + \sqrt{\alpha(t)}\,(\mathbf{y} - \mathbf{c}^{\mathrm{T}}\mathbf{x}_{\mathrm{C}\alpha}) + \sqrt{1-\alpha(t)}\,\sigma_1(\mathbf{z}_3 - \mathbf{c}^{\mathrm{T}}\mathbf{z}_1) \tag{4.14e}$$

where𝛼(𝑑) =π‘’βˆ’2πœƒ 𝜏(𝑑). For the reverse-time SDE

𝑑Z𝑑 =[βˆ’Ξ˜(𝑑)Z𝑑 βˆ’πœŽ2(𝑑)βˆ‡Z𝑑logπ‘žπ‘‘(Z𝑑)]𝑑 𝑑+𝜎(𝑑)𝑑W𝑑 (4.15) the ESDMπœ™predicts the denoised observations Λ†x(0),y(0)Λ† using Λ†x(𝑑),y(Λ† 𝑑) which is formally equivalent to estimating the score functionβˆ‡Zlogπ‘žπ‘‘(Z)[232]. Given a time discretization schedule with interval𝑠, we obtain the expression for the predicted observation mean Β―Z(πœ™, π‘‘βˆ’π‘ ) in one denoising stepZ(𝑑) ↦→ Z(π‘‘βˆ’π‘ ):

$$\bar{\mathbf{x}}_{\mathrm{C}\alpha}(\phi, t-s) = -\big(\mathbf{x}_{\mathrm{C}\alpha}(t) - \hat{\mathbf{x}}_{\mathrm{C}\alpha}(0)\big)\,\frac{\sigma(t-s)}{\sigma(t)} + \mathbf{x}_{\mathrm{C}\alpha}(t) \tag{4.16a}$$

$$\bar{\mathbf{x}}_{\mathrm{nonC}\alpha}(\phi, t-s) = -\frac{\big(\mathbf{x}_{\mathrm{nonC}\alpha}(t) - \mathbf{x}_{\mathrm{C}\alpha}(t)\big)\cdot\sqrt{\alpha(t)} - \big(\hat{\mathbf{x}}_{\mathrm{nonC}\alpha}(0) - \hat{\mathbf{x}}_{\mathrm{C}\alpha}(0)\big)}{\sqrt{1-\alpha(t)}}\,\sqrt{1-\alpha(t-s)} + \bar{\mathbf{x}}_{\mathrm{C}\alpha}(t-s) + \sqrt{\alpha(t-s)}\,\big(\hat{\mathbf{x}}_{\mathrm{nonC}\alpha}(0) - \hat{\mathbf{x}}_{\mathrm{C}\alpha}(0)\big) \tag{4.16b}$$

$$\bar{\mathbf{y}}(\phi, t-s) = -\frac{\big(\mathbf{y}(t) - \mathbf{c}^{\mathrm{T}}\mathbf{x}_{\mathrm{C}\alpha}(t)\big)\cdot\sqrt{\alpha(t)} - \big(\hat{\mathbf{y}}(0) - \mathbf{c}^{\mathrm{T}}\hat{\mathbf{x}}_{\mathrm{C}\alpha}(0)\big)}{\sqrt{1-\alpha(t)}}\,\sqrt{1-\alpha(t-s)} + \mathbf{c}^{\mathrm{T}}\bar{\mathbf{x}}_{\mathrm{C}\alpha}(t-s) + \sqrt{\alpha(t-s)}\,\big(\hat{\mathbf{y}}(0) - \mathbf{c}^{\mathrm{T}}\hat{\mathbf{x}}_{\mathrm{C}\alpha}(0)\big) \tag{4.16c}$$

Standard ODE-based or SDE-based integrators can then be adapted to sample from (4.15).
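To make the forward perturbation concrete, below is a minimal NumPy sketch of the sampling procedure in Eqs. (4.14a)–(4.14e). The schedule constants, the per-atom residue index map `res_index`, and the internal-coordinate scale relation used for σ₁ are illustrative assumptions of this sketch, not the exact training configuration.

```python
import numpy as np

def sample_forward_marginal(x_ca, x_nonca, res_index, y, c, t,
                            theta=1.0, sigma=1.0, sigma_min=0.01, rng=None):
    """Sketch of Eqs. (4.14a)-(4.14e): perturb clean coordinates to diffusion time t.

    x_ca:      (n_res, 3) alpha-carbon coordinates
    x_nonca:   (n_nonca, 3) non-alpha-carbon protein atom coordinates
    res_index: (n_nonca,) index of the parent residue of each non-C-alpha atom (assumed mapping)
    y:         (m, 3) ligand heavy-atom coordinates
    c:         (n_res, m) softmax-normalized contact map, assumed such that c.T @ 1 = 1
    """
    rng = np.random.default_rng() if rng is None else rng
    tau = (sigma_min**2 / sigma**2) * np.exp(t)      # exponential schedule tau(t) = tau_0 * e^t
    alpha = np.exp(-2.0 * theta * tau)               # alpha(t) = exp(-2 * theta * tau(t))
    sigma1 = sigma / theta                           # assumed internal-coordinate scale sigma_1

    z1 = rng.standard_normal(x_ca.shape)             # (4.14a)
    z2 = rng.standard_normal(x_nonca.shape)
    z3 = rng.standard_normal(y.shape)

    x_ca_t = x_ca + sigma * np.sqrt(tau) * z1        # (4.14c): variance-exploding C-alpha coordinates
    x_nonca_t = (x_ca_t[res_index]                   # (4.14d): OU process on internal coordinates
                 + np.sqrt(alpha) * (x_nonca - x_ca[res_index])
                 + np.sqrt(1.0 - alpha) * sigma1 * (z2 - z1[res_index]))
    y_t = (c.T @ x_ca_t                              # (4.14e): ligand atoms tethered to contact-weighted C-alphas
           + np.sqrt(alpha) * (y - c.T @ x_ca)
           + np.sqrt(1.0 - alpha) * sigma1 * (z3 - c.T @ z1))
    return x_ca_t, x_nonca_t, y_t
```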

Euclidean equivariance

Given a group G, a function f : X β†’ Y is said to be equivariant if for all g ∈ G and x ∈ X, f(Ο†_X(g)Β·x) = Ο†_Y(g)Β·f(x). In particular, f is said to be invariant if f(Ο†_X(g)Β·x) = f(x). We are interested in the special Euclidean group G = SE(3), which consists of all global rigid translation and rotation operations gΒ·Z := t + ZΒ·R, where t ∈ R^3 and R ∈ SO(3). To adhere to the physical constraint that p_data is always SE(3)-invariant, the transition kernels of the forward-time SDE should satisfy SE(3)-equivariance, q(Z_{t+s} | Z_t) = q(gΒ·Z_{t+s} | gΒ·Z_t), so that the marginals are invariant, q_t(Z_t) = q_t(gΒ·Z_t), for any time t. The proofs are straightforward:

For the receptor CΞ± degrees of freedom,

$$\begin{aligned}
& q\big(\mathbf{t} + \mathbf{x}_{\mathrm{C}\alpha}(\tau+s)\cdot\mathbf{R}\,\big|\,\mathbf{t} + \mathbf{x}_{\mathrm{C}\alpha}(\tau)\cdot\mathbf{R}\big) \\
&\quad = \mathcal{N}\big(\mathbf{t} + \mathbf{x}_{\mathrm{C}\alpha}(\tau+s)\cdot\mathbf{R};\ \mathbf{t} + \mathbf{x}_{\mathrm{C}\alpha}(\tau)\cdot\mathbf{R},\ \sigma^2 s\,\mathbf{I}\big) \\
&\quad = \mathcal{N}\big((\mathbf{x}_{\mathrm{C}\alpha}(\tau+s) - \mathbf{x}_{\mathrm{C}\alpha}(\tau))\cdot\mathbf{R}\mathbf{R}^{\mathrm{T}};\ \mathbf{0},\ \sigma^2 s\,\mathbf{R}\cdot\mathbf{I}\cdot\mathbf{R}^{\mathrm{T}}\big) \\
&\quad = \mathcal{N}\big(\mathbf{x}_{\mathrm{C}\alpha}(\tau+s) - \mathbf{x}_{\mathrm{C}\alpha}(\tau);\ \mathbf{0},\ \sigma^2 s\,\mathbf{I}\big) \\
&\quad = q\big(\mathbf{x}_{\mathrm{C}\alpha}(\tau+s)\,\big|\,\mathbf{x}_{\mathrm{C}\alpha}(\tau)\big).
\end{aligned}$$

For the receptor non-CΞ± degrees of freedom,

π‘ž( (t+xnonC𝛼(𝜏+𝑠) Β·Rβˆ’tβˆ’xC𝛼(𝜏+𝑠) Β·R) | (t+xnonC𝛼(𝜏) Β·Rβˆ’tβˆ’xC𝛼(𝜏) Β·R))

=N (xnonC𝛼(𝜏+𝑠) Β·Rβˆ’xC𝛼(𝜏+𝑠) Β·R);π‘’βˆ’πœƒ 𝑠(xnonC𝛼(𝜏) Β·Rβˆ’xC𝛼(𝜏) Β·R),(1βˆ’π‘’βˆ’2πœƒ 𝑠)𝜎2 πœƒ2

I

=N (xnonC𝛼(𝜏+𝑠) βˆ’xC𝛼(𝜏+𝑠));π‘’βˆ’πœƒ 𝑠(xnonC𝛼(𝜏) βˆ’xC𝛼(𝜏)),(1βˆ’π‘’βˆ’2πœƒ 𝑠)𝜎2 πœƒ2

RΒ·IΒ·RT

=π‘ž( (xnonC𝛼(𝜏+𝑠) βˆ’xC𝛼(𝜏+𝑠) | (xnonC𝛼(𝜏) βˆ’xC𝛼(𝜏))). For ligand degrees of freedom

π‘ž(t+y(𝜏+𝑠) Β·Rβˆ’cT(t+xC𝛼(𝜏+𝑠) Β·R) |t+y(𝜏) Β·Rβˆ’cT(t+xC𝛼(𝜏) Β·R))

=π‘ž(t+y(𝜏+𝑠) Β·Rβˆ’cTtβˆ’cTxC𝛼(𝜏+𝑠) Β·R|t+y(𝜏) Β·Rβˆ’cTtβˆ’cTxC𝛼(𝜏) Β·R)

=π‘ž(y(𝜏+𝑠) Β·Rβˆ’cTxC𝛼(𝜏+𝑠) Β·R|y(𝜏) Β·Rβˆ’cTxC𝛼(𝜏) Β·R)

=N π‘’βˆ’πœƒ 𝑠(y(𝜏) βˆ’cTxC𝛼(𝜏));(1βˆ’π‘’βˆ’2πœƒ 𝑠) 𝜎2 2πœƒ2

RΒ· (I+cTc) Β·RT

=π‘ž(y(𝜏+𝑠) βˆ’cTxC𝛼(𝜏+𝑠) |y(𝜏) βˆ’cTxC𝛼(𝜏))

where we have usedcTt=tup to a column-wise broadcasting operation based on the row-wise normalization property of the softmax-transformed contact mapc.

Since all transition kernels are SE(3)-equivariant, it follows that the score βˆ‡_Z log q_t(Z) is also SE(3)-equivariant, βˆ‡_{Z'} log q_t(Z') = βˆ‡_Z log q_t(Z)Β·R with Z' = t + ZΒ·R, and thus the reverse-time SDE is equivariant. While the forward SDE is also E(3)-equivariant, since the noising process satisfies q(βˆ’Z(Ο„+s) | βˆ’Z(Ο„)) = q(Z(Ο„+s) | Z(Ο„)), it is worth noting that the reverse SDE is only SE(3)-equivariant: parity-inversion transformations i : Z ↦ βˆ’Z of the data distribution p_data are physically forbidden, so the score βˆ‡_Z log q_t(Z) has broken chiral symmetry in general, i.e., there exists Z such that βˆ‡_{βˆ’Z} log q_t(βˆ’Z) β‰  βˆ’βˆ‡_Z log q_t(Z).

Small-molecule featurization and encoding

We consider two types of nodes to construct a graph-based molecular representation: (a) heavy atoms i ∈ {1, 2, Β·Β·Β·, N_atom} and (b) local coordinate frames u ∈ {1, 2, Β·Β·Β·, N_frame}, u := u(ijk), formed by atom triplets (i, j, k) that are connected by bonds (ij) and (jk). We introduce the Path-integral Graph Transformer (PiFormer), an attentional neural network with edge-level operations inspired by the path-integral formulation of quantum mechanics, to infer long-range interatomic geometrical correlations for small molecules from their graph-topological properties. PiFormer operates on the following classes of embeddings:

β€’ Atom representations H ∈ R^{N_atom Γ— c}. The input atom representation is a concatenation of one-hot encodings of the element group index and period index of the given atom, which is embedded by a linear projection layer R^{18+7} β†’ R^c;

β€’ Frame representations F ∈ R^{N_frame Γ— c}. For a given frame u, F_u is initialized by a 2-layer MLP R^{4Γ—2+18+7} β†’ R^c that embeds the bond-type encodings (defined as [is_single, is_double, is_triple, is_aromatic]) of the "incoming" bond (i(u), j(u)) and the "outgoing" bond (j(u), k(u)), together with the atom-type encoding of the center atom j(u);

β€’ Stereochemistry encodings S ∈ R^{N_frame Γ— N_frame Γ— c_s}. S is a sparse tensor whose element S_uv is nonzero only if the pair of frames (u, v) is adjacent, i.e., u and v share a common incoming or outgoing bond;

β€’ Pair representations G ∈ R^{N_frame Γ— N_atom Γ— c_p}. G is initialized by an outer sum of H and F, which is added to a linear projection of S and passed to a 2-layer MLP.

Elements of the stereochemistry encoding tensor S are defined as

$$\begin{aligned}
\mathbf{S}_{uv,0} &:= (\mathrm{common\_bond}(u,v) = \mathrm{incoming\_bond}(u)) && (4.17a) \\
\mathbf{S}_{uv,1} &:= (\mathrm{common\_bond}(u,v) = \mathrm{incoming\_bond}(v)) && (4.17b) \\
\mathbf{S}_{uv,2} &:= (\mathrm{common\_bond}(u,v) = \mathrm{outgoing\_bond}(u)) && (4.17c) \\
\mathbf{S}_{uv,3} &:= (\mathrm{common\_bond}(u,v) = \mathrm{outgoing\_bond}(v)) && (4.17d) \\
\mathbf{S}_{uv,4} &:= i(v) \in \{i(u), j(u), k(u)\} && (4.17e) \\
\mathbf{S}_{uv,5} &:= j(v) \in \{i(u), j(u), k(u)\} && (4.17f) \\
\mathbf{S}_{uv,6} &:= k(v) \in \{i(u), j(u), k(u)\} && (4.17g) \\
\mathbf{S}_{uv,7} &:= (j(u) = j(v)) \land \mathrm{is\_above\_plane}(u,v) && (4.17h) \\
\mathbf{S}_{uv,8} &:= (j(u) = j(v)) \land \mathrm{is\_below\_plane}(u,v) && (4.17i) \\
\mathbf{S}_{uv,9} &:= \mathrm{is\_double\_or\_aromatic}(\mathrm{common\_bond}(u,v)) \lor \mathrm{is\_same\_side}(u,v) && (4.17j) \\
\mathbf{S}_{uv,10} &:= \mathrm{is\_double\_or\_aromatic}(\mathrm{common\_bond}(u,v)) \lor \mathrm{not\_same\_side}(u,v) && (4.17k)
\end{aligned}$$

is_above_plane(u, v) is defined as whether one of the three atoms of frame v lies above the plane formed by frame u, with normal vector

$$\mathbf{v}_u = \frac{(\mathbf{r}_{j(u)} - \mathbf{r}_{i(u)})\times(\mathbf{r}_{k(u)} - \mathbf{r}_{j(u)})}{\|\mathbf{r}_{j(u)} - \mathbf{r}_{i(u)}\|\,\|\mathbf{r}_{k(u)} - \mathbf{r}_{j(u)}\|};$$

is_same_side(u, v) is defined as whether the two bonds not shared between u and v lie on the same side of the common bond, which is equivalent to v_uΒ·v_v > 0, and vice versa for not_same_side. Our current technical implementation of is_above_plane and is_same_side is based on computing the normal vectors and dot products using coordinates from an auxiliary conformer, but we note that in principle all stereochemistry encodings can be generated from cheminformatic rules without explicit coordinate generation.
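For illustration, a minimal NumPy sketch of these coordinate-based predicates is given below; anchoring the plane of frame u at its center atom j(u) and the small norm stabilizer are assumptions of the sketch, and the coordinates would come from an auxiliary conformer as described above.

```python
import numpy as np

def frame_normal(r_i, r_j, r_k, eps=1e-8):
    """Normal vector v_u of frame u = (i, j, k), as defined above."""
    b1, b2 = r_j - r_i, r_k - r_j
    return np.cross(b1, b2) / (np.linalg.norm(b1) * np.linalg.norm(b2) + eps)

def is_above_plane(frame_u, frame_v):
    """True if one of the three atoms of frame v lies above the plane of frame u.

    frame_u, frame_v: (3, 3) arrays holding the coordinates of atoms (i, j, k).
    """
    v_u = frame_normal(*frame_u)
    center = frame_u[1]                               # center atom j(u), assumed plane anchor
    return bool(np.any((frame_v - center) @ v_u > 0.0))

def is_same_side(frame_u, frame_v):
    """True if the two non-shared bonds lie on the same side of the common bond, i.e. v_u . v_v > 0."""
    return float(frame_normal(*frame_u) @ frame_normal(*frame_v)) > 0.0
```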

We additionally denote MASK_S as an N_frame Γ— N_frame logical matrix defined as the adjacency matrix over frame pairs (u, v).

The notion of "frames" in a coordinate-free topological molecular graph is justified by the inductive bias that most bending and stretching modes in molecular vibrations are of high frequency, i.e., most bond lengths and bond angles fall into a small range as predicted by valence bond theory, such that the local frames forms a consistent molecular representation without prior knowledge on 3D coordinates. PiFormer operates solely on the molecular representation defined by the input graph, and the frame coordinates(t,R)are initialized right before the ESDM blocks.

Table 4.2: Composition of the dataset used for pretraining the small-molecule encoder.

| Data source | Num. samples collected | Sampling weight | L_3D | L_CC | L_MLM |
| --- | --- | --- | --- | --- | --- |
| BioLip [233] ligands (deposited date < 2019.1.1) | 160k | 2.0 | + | βˆ’ | + |
| GEOM [234] | 450k Γ— 5 | 0.4 | + | βˆ’ | + |
| DES370k [235] | 370k | 1.0 | + | βˆ’ | + |
| PEPCONF [236] | 3775 | 5.0 | + | βˆ’ | + |
| PCQM4Mv2 [237, 238] | 3.4M | 0.1 | + | βˆ’ | + |
| Chemical Checker [239] | 800k | 1.0 | βˆ’ | + | + |

The forward pass of a single PiFormer block is expressed as:

$$\mathbf{U}^{l} = \mathrm{Softmax}_{\mathrm{row\text{-}wise}}\!\left(\frac{(\mathbf{F}\cdot\mathbf{W}_{K,l})\cdot(\mathbf{F}\cdot\mathbf{W}_{Q,l})^{\mathrm{T}} + \mathbf{S}\cdot\mathbf{W}_{S,l}}{\sqrt{c_{\mathrm{P}}}} + \mathrm{Inf}\cdot\mathrm{MASK}_S\right) \tag{4.18a}$$

$$\mathbf{G}_{\mathrm{out}} = \Big(1 + \tfrac{1}{K}\mathbf{U}^{l}\Big)^{K}\cdot(\mathbf{G}^{l}\cdot\mathbf{W}_{G,l}), \qquad \mathbf{G}^{l+1} = \mathrm{MLP}\big([\mathbf{G}_{\mathrm{out}}\,\|\,(\mathbf{F}^{l})^{\mathrm{T}}\cdot\mathbf{H}^{l}\,\|\,\mathbf{G}^{l}]\big) + \mathbf{G}^{l} \tag{4.18b}$$

$$\mathbf{F}_{\mathrm{out}} = \mathrm{MHAwithEdgeBias}\big(\mathbf{F}^{l}, \mathbf{H}^{l}, (\mathbf{G}^{l+1})^{\mathrm{T}}\big), \qquad \mathbf{F}^{l+1} = \mathrm{MLP}(\mathbf{F}_{\mathrm{out}} + \mathbf{F}^{l}) + \mathbf{F}^{l} \tag{4.18c}$$

$$\mathbf{H}_{\mathrm{out}} = \mathrm{MHAwithEdgeBias}\big(\mathbf{H}^{l}, \mathbf{F}^{l+1}, \mathbf{G}^{l+1}\big), \qquad \mathbf{H}^{l+1} = \mathrm{MLP}(\mathbf{H}_{\mathrm{out}} + \mathbf{H}^{l}) + \mathbf{H}^{l} \tag{4.18d}$$

where K denotes the propagation-length truncation for the learnable graph kernel exp(U^l) β‰ˆ (1 + U^l/K)^K in a single PiFormer block, and MLP denotes a 3-layer multilayer perceptron combined with layer normalization [158]. W_K, W_Q, W_S, W_G are trainable linear weight matrices. MHAwithEdgeBias(X_1, X_2, X_edge) denotes a multi-head cross-attention layer between source node embeddings X_1 and target node embeddings X_2, with the edge embeddings X_edge entering the attention computation as a relative positional encoding term, as in the relation-aware transformer introduced in [187]. For all models described in this study, we set l_max = 6 and K = 8.
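The path-integral propagation in Eqs. (4.18a)–(4.18b) can be sketched as follows; the masking convention (setting non-adjacent logits to βˆ’βˆž before the row-wise softmax) and the toy tensor shapes are assumptions of this illustration rather than the exact implementation.

```python
import numpy as np

def truncated_graph_kernel(U, K=8):
    """Approximate the kernel exp(U) by (I + U/K)^K, the propagation-length-truncated form used above."""
    return np.linalg.matrix_power(np.eye(U.shape[0]) + U / K, K)

def piformer_pair_update(F, S_bias, mask, G, W_K, W_Q, W_G, K=8, c_p=128.0):
    """Sketch of Eqs. (4.18a)-(4.18b) for one block.

    F:      (N_frame, c) frame representations
    S_bias: (N_frame, N_frame) projected stereochemistry bias (S . W_S)
    mask:   (N_frame, N_frame) boolean frame adjacency (MASK_S)
    G:      (N_frame, N_atom, c_p) pair representations
    """
    logits = ((F @ W_K) @ (F @ W_Q).T + S_bias) / np.sqrt(c_p)
    logits = np.where(mask, logits, -np.inf)          # restrict attention to adjacent frame pairs
    U = np.exp(logits - logits.max(axis=-1, keepdims=True))
    U = U / U.sum(axis=-1, keepdims=True)             # row-wise softmax
    kernel = truncated_graph_kernel(U, K=K)           # multi-hop propagation over the frame graph
    return np.einsum('uv,vac->uac', kernel, G @ W_G)  # G_out in Eq. (4.18b)
```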

PiFormer model pretraining

In Table 4.2 we summarize the small-molecule datasets used to pretrain the PiFormer encoder in the reported NeuralPLexer model. The loss function used in PiFormer pretraining is the following:

$$\mathcal{L}_{\mathrm{lig\text{-}pretraining}} = \mathcal{L}_{\mathrm{3D\text{-}marginal}} + \mathcal{L}_{\mathrm{3D\text{-}DSM}} + \mathcal{L}_{\mathrm{CC\text{-}regression}} + 0.01\cdot\mathcal{L}_{\mathrm{CC\text{-}ismask}} + 0.1\cdot\mathcal{L}_{\mathrm{MLM}} \tag{4.19}$$

We use a mixture density network head to encourage alignment between the learned last-layer pair representations G and the intra-molecular 3D coordinate marginals. For a single training sample with 3D coordinate observation y:

$$\mathcal{L}_{\mathrm{3D\text{-}marginal}} = -\sum_{u}^{N_{\mathrm{frame}}}\sum_{i}^{N_{\mathrm{atom}}} \log \frac{\sum_{l}^{N_{\mathrm{modes}}} \exp(w_{iul})\, q_{\mathrm{3D}}\big(T_u^{-1}\circ\mathbf{y}_i \,\big|\, \mathbf{m}_{iul}\big)}{\sum_{l}^{N_{\mathrm{modes}}} \exp(w_{iul})} \tag{4.20}$$

where T_u := (R_u, t_u) and T_u^{βˆ’1} ∘ y_i := (y_i βˆ’ t_u)Β·R_u^T. Here t_u ∈ R^3 and R_u ∈ SO(3) are given by

$$(\mathbf{R}_u, \mathbf{t}_u) = \mathrm{rigidFrom3Points}\big(\mathbf{y}_{i(u)}, \mathbf{y}_{j(u)}, \mathbf{y}_{k(u)}\big), \tag{4.21}$$

where rigidFrom3Points is the Gram–Schmidt-based frame construction operation described in Ref. [5], Alg. 21; we additionally add a numerical stability factor of 0.01 Γ… to the vector-norm calculations to handle edge cases when computing rotation matrices from perturbed coordinates. Each component of the 3D distance–angle distribution q_3D is parameterized by

π‘ž3D(t|πœ‡, 𝜎,v) =Gaussian( βˆ₯tβˆ₯2|πœ‡, 𝜎) Γ—PowerSpherical( t

βˆ₯tβˆ₯2|v, 𝑑 =3) (4.22) where PowerSpherical is a power spherical distribution introduced in [240];m𝑖𝑒𝑙 :=

(πœ‡, 𝜎,v)𝑖𝑒𝑙, and

[w𝑖𝑒,m𝑖𝑒] =3DMixtureDensityHead G𝑙max

𝑖𝑒. (4.23)

where 3DMixtureDensityHead is a 3-layer MLP.
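A minimal sketch of the frame construction with the stated 0.01 Γ… stabilization is shown below; the exact axis ordering and the placement of the stabilizer follow AlphaFold2's Alg. 21 only approximately and should be read as assumptions of this illustration.

```python
import numpy as np

def rigid_from_3_points(x1, x2, x3, eps=0.01):
    """Gram-Schmidt frame construction from three points, with x2 as the frame origin.

    eps (in Angstrom) stabilizes the norm divisions for nearly degenerate perturbed coordinates.
    """
    v1 = x3 - x2
    v2 = x1 - x2
    e1 = v1 / (np.linalg.norm(v1) + eps)
    u2 = v2 - e1 * (e1 @ v2)
    e2 = u2 / (np.linalg.norm(u2) + eps)
    e3 = np.cross(e1, e2)                   # third axis is a pseudovector (right-handed completion)
    R = np.stack([e1, e2, e3], axis=0)      # rows are the local axes
    return R, x2

# local coordinates of a point y in frame (R_u, t_u): (y - t_u) @ R_u.T, matching T_u^{-1} o y above
```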

Using an equivariant graph transformer similar to ESDM (see Sec. 4.5) but with all receptor nodes dropped, we construct a geometry prediction head to perform global molecular 3D structure denoising. We sample noised coordinates y(t) from a VP-SDE [201] and introduce an SE(3)-invariant denoising score matching loss based on the Frame Aligned Point Error (FAPE) [5]:

$$\mathcal{L}_{\mathrm{3D\text{-}DSM}} = \mathbb{E}_{t\sim(0,1],\,\mathbf{y}_t\sim q_{0:t}(\cdot\,|\,\mathbf{y})}\left[\operatorname{mean}_{u,i}\,\min\big(\|T_u^{-1}\circ\mathbf{y}_i - \hat{T}_u^{-1}\circ\hat{\mathbf{y}}_i\|_2,\ 10\ \text{Γ…}\big)\cdot\sqrt{\alpha_t}\right] \tag{4.24}$$

where

$$\hat{\mathbf{y}} = \mathrm{GeometryPredictionHead}\big(\mathbf{y}_t;\ \mathbf{H}^{l_{\max}}, \mathbf{F}^{l_{\max}}, \mathbf{G}^{l_{\max}}\big). \tag{4.25}$$

L_CC-regression is a mean squared loss for fitting the "level 1" Chemical Checker (CC) [239] embeddings, which represent harmonized and integrated bioactivity data, and L_CC-ismask is an auxiliary binary cross-entropy loss for classifying whether a specific CC entry is available for any molecule in the Chemical Checker dataset. The model is trained for 20 epochs with a 15% masking ratio for atom and bond encodings, a 40% masking ratio for stereochemistry encodings, and dropout = 0.1; L_MLM is a standard cross-entropy loss for predicting the masked tokens, added to encourage learning of molecular graph topology distributions, but empirically we found that L_MLM converged within the first two epochs and did not influence the learning dynamics of the other tasks.

Figure 4.4: Network architecture schematics for the encoders and contact prediction modules.

Protein sequence and backbone encoding

The inputs to the protein encoder are (i) the one-hot amino-acid type (20 standard residues + 1 "unknown" token) encoding of the 1D sequence s, (ii) the backbone (N, CΞ±, C) coordinates of a perturbed protein structure x(t) sampled from the forward SDEs described in Table 4.1, and (iii) a random Fourier encoding of the diffusion time step t. To reduce memory cost, the protein backbone is represented as a sparse graph with each node mapped to one amino acid residue and randomized edges drawn according to the inclusion probability p(add_edge(i, j)) = exp(βˆ’β€–x_i(t) βˆ’ x_j(t)β€–/10.0 Γ…) for all residue pairs (i, j). The edge representations are initialized as a random Fourier encoding of the signed sequence distance between two residues (i, j) if i and j are located on the same chain, and are initialized as zeros if (i, j) are located on different chains.
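A short sketch of this randomized sparse residue graph is given below; treating the signed sequence distance as a simple index difference within a chain is an assumption of the sketch.

```python
import numpy as np

def sample_protein_edges(x_ca_t, chain_id, scale=10.0, rng=None):
    """Draw residue-residue edges with inclusion probability exp(-||x_i(t) - x_j(t)|| / 10 A)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x_ca_t)
    dist = np.linalg.norm(x_ca_t[:, None, :] - x_ca_t[None, :, :], axis=-1)
    keep = (rng.random((n, n)) < np.exp(-dist / scale)) & ~np.eye(n, dtype=bool)
    src, dst = np.nonzero(keep)
    same_chain = chain_id[src] == chain_id[dst]
    seq_dist = np.where(same_chain, dst - src, 0)     # signed sequence distance; zeroed across chains
    return np.stack([src, dst]), seq_dist
```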

The protein encoder is composed of 4 stacks of invariant point attention (IPA) [5] blocks with two technical modifications:

β€’ The attention scores are computed on the sparsified protein graph, instead of the densely-connected graph as in standard self-attention layers;

β€’ Each node i is associated with n_head replicas of coordinate frames {T}_i, instead of a single frame as in a static structure representation. {T}_i is initialized as n_head copies of the backbone frame constructed by rigidFrom3Points(x_{N,i}, x_{CΞ±,i}, x_{C,i}). The layer output is n_head Γ— 7 scalars representing the translation vector and the quaternion variable used to update the frame associated with each attention head (see the sketch below).

The multi-replica design is found to moderately improve model convergence at a fixed network size. For conciseness, we refer to the modified invariant point attention as GraphIPA.
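To illustrate the per-head frame update, the sketch below converts the n_head Γ— 7 outputs (3 translation components followed by 4 quaternion components, an assumed ordering) into rigid updates of the frame replicas; the composition convention is likewise an assumption.

```python
import numpy as np

def quaternion_to_rotation(q):
    """Convert a (possibly unnormalized) quaternion (w, x, y, z) to a rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def update_frame_replicas(frames, head_outputs):
    """Apply per-head (translation, quaternion) updates to the n_head frame replicas of one node.

    frames:       list of (R, t) tuples, one per attention head
    head_outputs: (n_head, 7) array of 3 translation + 4 quaternion components
    """
    updated = []
    for (R, t), out in zip(frames, head_outputs):
        dt, dq = out[:3], out[3:]
        updated.append((quaternion_to_rotation(dq) @ R, t + dt))   # compose update with current frame
    return updated
```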

Contact predictor

As illustrated in Figure 4.4, the embeddings from the protein and small-molecule ligand graph encoders are passed to the contact predictor to estimate the contact map L. A protein–ligand graph is created before the contact predictor forward pass, with pairwise intermolecular edges connecting all protein residues and ligand atoms.

The contact predictor is composed of 4 modules, each comprising an intra-protein GraphIPA block, a bidirectional intra-ligand-graph self-attention layer, a bidirectional self-attention layer on the protein–ligand intermolecular edges, and an MLP that updates the protein–ligand edge representations using the attention maps and previous-layer edge representations. The final edge representations are used to predict L as described by Equation 4.5. The contact predictor weights are shared across all one-hot contact matrix sampling iterations.

All-atom graph featurization

All protein heavy-atom nodes (features and 3D coordinates) and the ligand 3D coordinates sampled from the geometry prior q_{T*} are added to the network inputs right before the ESDM block forward pass. Each protein atom representation is initialized as the concatenation of:

β€’ The residue-wise representation from the protein backbone encoder;

β€’ A one-hot encoding of its atom type as defined by the 37 standard amino acid heavy-atom symbols in the PDB format [241];

β€’ A random Fourier encoding of the diffusion time step t.

A random Fourier encoding of the diffusion time step t is also concatenated to the ligand atom representations from the ligand graph encoder, and the concatenated representations are transformed by a 2-layer MLP.

Given the noised all-atom protein coordinates at diffusion time t, the following edges are added to the protein–ligand graph:

β€’ Edges connecting a protein atom node and the residue node that the protein atom belongs to;

β€’ Edges connecting two protein atom nodes that are within the same residue;

β€’ Edges connecting two protein atom nodes that are within 6.0 Γ… distance;

β€’ Edges connecting a protein atom node and a ligand atom node that are within 8.0 Γ… distance;

The protein-atom-involving edges are initialized as a concatenation of the following features:

β€’ A boolean code indicating whether the source node and target node belong to the same residue or the same ligand molecule;

β€’ A boolean code indicating whether there is a covalent bond between the source and target nodes. The covalent bonding information for protein–ligand edges is resolved based on the reference protein–ligand complex structure, where an atom pair (i, j) is considered covalently bonded if the distance satisfies d_ij < 1.2 Οƒ_ij, where Οƒ_ij = (Οƒ_i + Οƒ_j)/2 is the average van der Waals (VdW) radius of the atom pair (see the sketch below).
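The covalent-bond rule in the last bullet can be sketched as follows; the radii table is a hypothetical placeholder with typical Bondi values and is not necessarily the set used in the actual featurization pipeline.

```python
# Hypothetical van der Waals radii (Angstrom) for illustration only.
VDW_RADIUS = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def is_covalent(elem_i, elem_j, d_ij, factor=1.2):
    """Apply the rule d_ij < 1.2 * sigma_ij with sigma_ij = (sigma_i + sigma_j) / 2."""
    sigma_ij = 0.5 * (VDW_RADIUS[elem_i] + VDW_RADIUS[elem_j])
    return d_ij < factor * sigma_ij

# e.g. is_covalent("C", "O", 1.43) -> True; is_covalent("C", "O", 3.2) -> False
```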

To focus the learning problem on the binding-site parts of the protein–ligand complex structure, the following native contact encoding features are added to the part of the protein sub-graph that does not involve residues within 6.0 Γ… of any ligand heavy atom: given two amino acid residues, we define the native contact encoding as the concatenation of the clean-protein-structure N–N, CΞ±–CΞ±, and C–C distances discretized into [2.0 Γ…, 4.0 Γ…, 6.0 Γ…, 8.0 Γ…] bins (a minimal sketch of this discretization follows). These features are embedded by a 2-layer MLP and added to the residue–residue edge representations. Note that at training time the native contact encodings are computed from the protein structure in the ground-truth protein–ligand complex, while at sampling time they are computed from the input backbone template.
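The exact bin-edge convention (here, one extra bin for out-of-range distances) is an assumption of this sketch.

```python
import numpy as np

def native_contact_encoding(d_nn, d_caca, d_cc, bins=(2.0, 4.0, 6.0, 8.0)):
    """Concatenate one-hot discretizations of the N-N, Ca-Ca, and C-C distances into the listed bins."""
    def one_hot(d):
        idx = int(np.searchsorted(np.asarray(bins), d))   # 0 for d < 2 A, ..., len(bins) for d >= 8 A
        vec = np.zeros(len(bins) + 1)
        vec[idx] = 1.0
        return vec
    return np.concatenate([one_hot(d_nn), one_hot(d_caca), one_hot(d_cc)])

# e.g. native_contact_encoding(3.1, 5.4, 7.9) returns a 15-dimensional feature vector
```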

The ESDM architecture

Figure 4.5: Network architecture of a single block in the equivariant structure diffusion module (ESDM). Arrows indicate information flow directions, and "+" indicates an element-wise tensor summation.

The neural network architecture of the proposed equivariant structure diffusion module (ESDM) is summarized in Figure 4.5. The forward-pass expressions of the trainable modules PointSetAttentionwithEdgeBias, LocalUpdateUsingChannelWiseGating, LocalUpdateUsingReferenceRotation, and PredictDrift are defined as:

$$\mathbf{f}'_s, \mathbf{f}'_v, \mathbf{e}' = \mathrm{PointSetAttentionwithEdgeBias}(\mathbf{f}_s, \mathbf{f}_v, \mathbf{e}, \mathbf{t}),\ \text{where} \tag{4.26a}$$
$$\mathbf{f}_Q, \mathbf{f}_K, \mathbf{f}_V = \mathbf{W}_s\cdot\mathbf{f}_s, \qquad \mathbf{t}_Q, \mathbf{t}_K, \mathbf{t}_V = (\mathbf{t}/10\,\text{Γ…} + \mathbf{f}_v\cdot\mathbf{W}_v) \tag{4.26b}$$
$$\mathbf{z}_{ij} = \frac{1}{\sqrt{c_{\mathrm{head}}}}\big(\mathbf{f}_{Q,i}^{\mathrm{T}}\cdot\mathbf{f}_{K,j}\big) + \mathbf{W}_e\cdot\mathbf{e}_{ij} - \frac{w_{ij}}{\sqrt{18\,c_{\mathrm{head}}}}\,\|\mathbf{t}_Q - \mathbf{t}_K\|_2^2 \tag{4.26c}$$
$$\boldsymbol{\alpha}_{ij} = \mathrm{Softmax}_{j\in\{i\}}(\mathbf{z}_{ij}), \qquad \mathbf{e}' = \mathrm{MLP}(\mathbf{z}_{ij}) \tag{4.26d}$$
$$\mathbf{f}'_s = \sum_{j\in\{i\}}\boldsymbol{\alpha}_{ij}\odot\mathbf{f}_V, \qquad \mathbf{f}'_v = \Big(\sum_{j\in\{i\}}\boldsymbol{\alpha}_{ij}\odot\mathbf{t}_V\Big) - \mathbf{t}/10\,\text{Γ…} \tag{4.26e}$$

where f_s ∈ R^{N_nodes Γ— c}, f_v ∈ R^{N_nodes Γ— 3 Γ— c}, e ∈ R^{N_edges Γ— c}, and t ∈ R^{N_nodes Γ— 3}. Note that the expression for computing the attention weights z is directly adapted from IPA.

$$\mathbf{f}'_s, \mathbf{f}'_v = \mathrm{LocalUpdateUsingChannelWiseGating}(\mathbf{f}_s, \mathbf{f}_v),\ \text{where} \tag{4.27a}$$
$$\mathbf{f}'_s, \mathbf{f}_{\mathrm{gate}} = \mathrm{MLP}(\mathbf{f}_s \oplus \|\mathbf{f}_v\|_2) \tag{4.27b}$$
$$\mathbf{f}'_v = (\mathbf{f}_v\cdot\mathbf{W}_v)\odot\mathbf{f}_{\mathrm{gate}} \tag{4.27c}$$

As only linear layers and vector scaling operations are used to update the vector representations f_v, LocalUpdateUsingChannelWiseGating is E(3)-equivariant.
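A compact sketch of this channel-wise gating update (Eq. 4.27) for a single node is shown below; the toy `mlp` callable and tensor shapes are assumptions used only to make the equivariance argument explicit.

```python
import numpy as np

def channelwise_gating_update(f_s, f_v, W_v, mlp):
    """Sketch of Eq. (4.27): gate per-channel vector features with rotation-invariant scalars.

    f_s: (c,) scalar features;  f_v: (3, c) vector features;  W_v: (c, c) channel-mixing weights;
    mlp: callable mapping the invariant inputs to (f_s_new, f_gate), each of size c.
    """
    v_norm = np.linalg.norm(f_v, axis=0)          # per-channel vector norms, invariant to rotations of f_v
    f_s_new, f_gate = mlp(np.concatenate([f_s, v_norm]))
    f_v_new = (f_v @ W_v) * f_gate                # linear channel mixing followed by invariant gating
    return f_s_new, f_v_new

# Rotating the input vectors, R @ f_v, rotates f_v_new identically while leaving f_s_new unchanged,
# which is the equivariance property noted above.
```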

$$\mathbf{f}'_s, \mathbf{f}'_v = \mathrm{LocalUpdateUsingReferenceRotation}(\mathbf{f}_s, \mathbf{f}_v, \mathbf{R}\in\mathrm{SO}(3)),\ \text{where} \tag{4.28a}$$
$$\mathbf{f}'_s, \mathbf{f}_v^{\mathrm{loc}} = \mathrm{MLP}(\mathbf{f}_s \oplus \mathbf{R}^{\mathrm{T}}\cdot\mathbf{f}_v \oplus \|\mathbf{f}_v\|_2) \tag{4.28b}$$
$$\mathbf{f}'_v = \mathbf{R}\cdot\mathbf{f}_v^{\mathrm{loc}} \tag{4.28c}$$

Since the third row of R is a pseudovector, as described in rigidFrom3Points, the determinant of the rotation matrix R is unchanged under parity-inversion transformations i : x ↦ βˆ’x; the intermediate quantity f_v^loc is therefore SE(3)-invariant but in general not invariant under parity inversion i. This property ensures that ESDM can learn the correct chiral-symmetry-breaking behavior in molecular 3D conformation distributions.

$$\Delta\mathbf{t} = \mathrm{PredictDrift}(\mathbf{f}_s, \mathbf{f}_v),\ \text{where} \tag{4.29a}$$
$$\mathbf{o}_{\mathrm{scale}} = \mathrm{Softplus}(\mathrm{MLP}(\mathbf{f}_s)) \tag{4.29b}$$
$$\Delta\mathbf{t} = (\mathbf{f}_v\cdot\mathbf{W}_{\mathrm{drift}})\odot\mathbf{o}_{\mathrm{scale}}. \tag{4.29c}$$

The predicted drift vectors Ξ”t are added to the input node coordinates; the final coordinate outputs are taken as the predicted denoised observations xΜ‚(0), Ε·(0).

Model training and hyperparameters

The loss function for NeuralPLexer training is:

$$\mathcal{L}_{\mathrm{training}} = \mathbb{E}_{t\sim(0,1]}\Big[\mathcal{L}_{\mathrm{contact}}(t) + \mathcal{L}_{\mathrm{gp\text{-}mean}}(t) + \mathcal{L}_{\mathrm{DSM\text{-}prot}}(t) + \mathcal{L}_{\mathrm{DSM\text{-}ligand}}(t) + \mathcal{L}_{\mathrm{DSM\text{-}site}}(t)\Big] \tag{4.30}$$

We train the contact predictor ψ to match the posterior distribution defined by the observed contact map q_L := Categorical_{n_res Γ— m}(L), where L := β¨_k L_k, with intermediate ligand-wise one-hot matrices l_k sampled from q_{L_k}:

$$\mathcal{L}_{\mathrm{contact}}(t) = \mathrm{KL}\big(q_{\mathbf{L}}\,\big\|\,q_{\psi}(\cdot\,|\,\mathbf{0}, \mathbf{s}, \tilde{\mathbf{x}}(t), \mathcal{G})\big) + \sum_{k=1}^{K}\mathbb{E}_{\mathbf{l}_k\sim q_{\mathbf{L}_k}}\,\mathrm{JS}\Big(q_{\mathbf{L}_k}\,\Big\|\,q_{\psi,k}\Big(\cdot\,\Big|\,\sum_{r=1}^{k}\mathbf{l}_r, \mathbf{s}, \tilde{\mathbf{x}}(t), \mathcal{G}\Big)\Big) \tag{4.31}$$

where KL denotes the Kullback–Leibler divergence and JS denotes the Jensen–Shannon divergence. An auxiliary loss is added to the mean term in the predicted geometry prior:

$$\mathcal{L}_{\mathrm{gp\text{-}mean}}(t) = \mathbb{E}_{\mathbf{l}_k\sim q_{\mathbf{L}_k}}\left[\Big\|\mathbf{c}^{\mathrm{T}}_{\psi,k}\Big(\sum_{r=1}^{k}\mathbf{l}_r, \mathbf{s}, \tilde{\mathbf{x}}(t), \mathcal{G}\Big)\cdot\tilde{\mathbf{x}}(t) - \mathbf{c}\cdot\tilde{\mathbf{x}}(t)\Big\|\right] \tag{4.32}$$

The denoising score matching (DSM) loss expressions are given by

$$\mathcal{L}_{\mathrm{DSM\text{-}prot}} = \mathbb{E}_{\mathbf{x}(t),\mathbf{y}(t)\sim q_{0:t}(\cdot\,|\,\mathbf{x}(0),\mathbf{y}(0))}\left[\frac{1}{n}\sum_{i}\|\mathbf{x}_i(0) - \hat{\mathbf{x}}_i(0)\|_2/\sigma(t)\right] \tag{4.33}$$

L_DSM-site is defined analogously but averaged over residues that are within 6.0 Γ… of the ligand in the ground-truth structure. Lastly,

$$\mathcal{L}_{\mathrm{DSM\text{-}ligand}} = \mathbb{E}_{\mathbf{x}(t),\mathbf{y}(t)\sim q_{0:t}(\cdot\,|\,\mathbf{x}(0),\mathbf{y}(0))}\left[\frac{1}{m}\sum_{i}\|\mathbf{y}_i(0) - \hat{\mathbf{y}}_i(0)\|_2/\sigma(t)\right]. \tag{4.34}$$

For the ligand graph encoder, we use 6 PiFormer blocks with an embedding dimension of 512 for the atom and frame representations, and a dimension of 128 for the pair representations. For the protein encoder, we use 4 GraphIPA blocks with a node embedding dimension of 256 and an edge embedding dimension of 64. For the contact predictor, we use 4 blocks with the same embedding sizes (256, 64) as in the protein encoder; linear layers are added to project the ligand representations to the length of the protein representations before they are passed to the contact predictor. For ESDM, we use a stack of 4 blocks with an embedding dimension of 64 for both node and edge representations; that is, each node i is associated with scalar representations f_{s,i} of size 64 and vector representations f_{v,i} of size [3, 64].
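For reference, the structure of the coordinate denoising losses (4.33)–(4.34) can be sketched as below; the site mask argument is an assumed illustrative input.

```python
import numpy as np

def dsm_coordinate_loss(x0, x0_hat, sigma_t):
    """Mean per-atom L2 error between clean and predicted coordinates, scaled by 1/sigma(t) (Eqs. 4.33-4.34)."""
    return np.mean(np.linalg.norm(x0 - x0_hat, axis=-1)) / sigma_t

def dsm_site_loss(x0, x0_hat, sigma_t, site_mask):
    """Same loss restricted to binding-site residues (within 6.0 A of the ligand)."""
    err = np.linalg.norm(x0 - x0_hat, axis=-1)
    return err[site_mask].mean() / sigma_t
```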

The pretrained small-molecule encoder weights are frozen during training. The model is trained with a batch size of 8 for 40 epochs, using dropout = 0.05 and an initial learning rate of 3E-4 with 1000 warmup steps followed by a cosine annealing learning-rate decay schedule. On the PDBBind 2020 training set (170k samples), the training run took 20 hours on a single NVIDIA Tesla V100-SXM2-32GB GPU.

Task-specific fine-tuning

The model used for fixed-backbone protein–ligand docking is fine-tuned on the original PDBBind training dataset, with all backbone atoms (N, CΞ±, C, O) and CΞ² atoms set to the ground-truth coordinates. Fine-tuning is performed for 20 epochs with a batch size of 8 without teacher forcing for the geometry prior (i.e., sampling the one-hot matrix l from the observed contact map q_L = Categorical_{n_res Γ— m}(L), using the predicted contact map ψ(l, s, xΜƒ, G) to parameterize the finite-time transition kernels q_t(Z(t) | Z(0)) during the model forward pass, and then backpropagating through the model end-to-end), using a cosine annealing schedule with an initial learning rate of 1E-4.

The model used for binding-site inpainting is fine-tuned on all split-chain samples from the original PDBBind training dataset. A protein-chain/ligand pair is included in the fine-tuning dataset if any heavy atom of the ligand is within 10 Γ… of any heavy atom of the protein chain. All receptor residues that are not within 6.0 Γ… of the ligand are set to the ground-truth coordinates, with the residue-wise and protein-atom-wise time-step encodings set to zeros. Fine-tuning is performed for 40 epochs with a batch size of 10 without teacher forcing for the geometry prior, using a cosine annealing schedule with an initial learning rate of 1E-4.

Computational details

Test datasets and post-processing

While the time-split-based PDBBind 2020 dataset has been used in previous works to study model generalization to novel protein–ligand pairs, we noticed that the 363-sample test set curated by [222] contains samples with improperly removed alternative ligand conformation ground truths or deleted adjacent chains that strongly interact with the ligand molecule in the full structure (e.g., binding sites near protein–protein interfaces). To ensure a reasonable comparison to docking-based methods, for the test dataset used in the fixed-backbone ligand conformation prediction experiments we keep all protein chains that are within 10 Γ… of the ligand from the original PDB file instead of using the receptor PDB files curated by PDBBind; we further removed all covalent ligands and peptide binders from the test set, as such cases are usually tackled by specialized algorithms [242, 243], resulting in 275 test samples in total used to produce the results presented in Figure 4.2a-d.

The AlphaFold2 structures used in the ligand-coupled binding site repacking task are predicted using ColabFold [244] with default MSA, recycling, and AMBER relaxation settings, and without using templates in order to best reflect the prediction fidelity of AlphaFold2 on novel targets (since all PDBBind test set samples are deposited before year 2021). The input sequences for all protein chains are obtained from https://www.ebi.ac.uk/pdbe/api/pdb/entry/molecules/ to avoid issues related to unresolved residues and to represent a realistic testing scenario where the protein backbone models are obtained from the full sequence.

Baseline method configurations

We run CB-Dock [223] with a heuristic low-sampling-intensity configuration (exhaustiveness = 1, number of clustered binding sites to start local docking = 1) such that the execution time (43 seconds per ligand on average on a single core of an Intel(R) Xeon(R) E5-2698 v4 @ 2.20GHz CPU) is comparable to deep-learning-based methods.