In addition to discovering new binding sites, we have discovered additional functions of known binding sites. In particular, BdcR, the repressor of the divergently transcribed gene bdcA (Partridge et al., 2009), is also shown to repress bdcR in Figure B.8(A). Similarly, in Figure B.8(B), IlvY is shown to repress ilvC in the absence of inducer. Divergently transcribed operons that share regulatory regions are plentiful in E. coli, and although there are already many known examples of transcription factor binding sites regulating several different operons, there are almost certainly many examples of this type of regulation that have yet to be discovered.
[Figure B.8 panels: information footprints (information in bits versus position, -95 to 25) for the divergent bdcR/bdcA and ilvC/ilvY promoters, with bars marked by whether a mutation increases or decreases expression, and the BdcR, NsrR, IlvY, and RNAP binding sites annotated.]
Figure B.8: Multipurpose binding sites. Two cases in which a transcription factor binding site is found to regulate both of a pair of divergently transcribed genes.
Gene           Location relative to TSS (bp)   σ factor
acuI           -105                            σ70
acuI           -25                             σ70
aegA           -9                              σ70
adiY           19                              σ70
aphA           -11                             σ70
araAB          -13                             σ70
araC           -32                             σ70
arcA           -99                             σ38
arcA           -12                             σ70
arcB           -11                             σ70
asnA           -10                             σ70
bdcR           -10                             σ70
coaA           -12                             σ70
cra            -12                             σ70
dicC           -11                             σ70
dinJ           -11                             σ70
dnaE           -11                             σ24
dpiBA          -24                             σ70
dusC           -23                             σ70
ecnB           -11                             σ70
fdhE           -12                             σ70
ftsK           -11                             σ38
groSL          -15                             σ32
groSL          19                              σ70
hicB           -11                             σ70
holC           -14                             σ32
hslU           -10                             σ32
htrB           -10                             σ38
iap            -12                             σ38
iap            33                              σ38
ilvC           -8                              σ70
maoP           -21                             σ70
minC           -12                             σ70
modE           -26                             σ70
mscK           -14                             σ38
mscL           -13                             σ38
mscM           35                              σ54
ompR           -11                             σ70
pcm            -11                             σ70
pit            -13                             σ70
poxB           -12                             σ38
rapA           -10                             σ70
rapA           -103                            σ38
rcsF           -62                             σ38
rcsF           -11                             σ70
rlmA           -9                              σ70
rlmA           -64                             σ70
rspA           5                               σ38
rumB           -11                             σ70
sbcB           -11                             σ70
sdaB           -19                             σ70
sdiA           -54                             σ70
tff-rpsB-tsf   -12                             σ70
tff-rpsB-tsf   -84                             σ70
thiM           -11                             σ70
thrLABC        -14                             σ70
tig            -11                             σ70
uvrD           -12                             σ70
waaA-coaD      -13                             σ70
xylA           -8                              σ70
xylF           -12                             σ70
ybdG           -11                             σ70
ybeZ           29                              σ70
ybeZ           -14                             σ32
ybiO           19                              σ38
ybiO           2                               σ70
ybjL           -11                             σ38
ybjL           -58                             σ70
ybjL           20                              σ70
ybjT           -12                             σ70
ybjT           10                              σ70
ycbZ           -8                              σ70
ycbZ           -11                             σ70
ycbZ           -25                             σ70
ycgB           -11                             σ38
ydhO           12                              σ70
ydjA           -13                             σ70
ydjA           17                              σ24
yecE           -1                              σ70
yecE           -33                             σ70
yedJ           -30                             σ70
yedJ           -11                             σ70
yedK           13                              σ70
yedK           25                              σ38
yehS           8                               σ70
yehT           -11                             σ70
yehT           12                              σ38
yehU           -8                              σ70
yehU           36                              σ70
yeiQ           -12                             σ70
yfhG           32                              σ70
ygdH           -12                             σ70
ygeR           -10                             σ70
yggW           -14                             σ32
ygjP           -23                             σ70
yicI           5                               σ70
yjjJ           12                              σ70
ykgE           -41                             σ70
ykgE           25                              σ70
ymgG           -12                             σ70
ynaI           -11                             σ70
yqhC           40                              σ70
zapB           -13                             σ70
znuA           36                              σ70
znuCB          -11                             σ70
znuCB          -88                             σ70

Table B.2: Identification of the σ factors used for each RNAP binding site.
Multi-purpose binding sites allow for more genes to be regulated with fewer binding sites. However, they can also serve to sharpen the promoter's response to environmental cues. In the case of ilvC, IlvY is known to activate ilvC in the presence of inducer; we now see that it also represses the promoter in the absence of that inducer. The production of ilvC is known to increase by approximately a factor of 100 in the presence of inducer (Rhee, Senear, and Hatfield, 1998). The magnitude of the change is attributed to the cooperative binding of two IlvY binding sites, but the lowered expression of the promoter due to IlvY repression in the absence of inducer is also a factor.
Comparison of Reg-Seq results to RegulonDB

B.9 Neural network fitting
Although neural network models are still in development, they are a more efficient fitting method than the Markov Chain Monte Carlo methods used previously. Additionally, because this method fits thermodynamic models, the results are in units of $k_B T$ rather than arbitrary units. The training set for one neural network fit comprises DNA sequences from one promoter concatenated across all different experimental conditions. A categorical variable represents the experimental condition (e.g., heat, M9, etc.).
[Figure B.9 bar chart: percentage of promoters versus promoter architecture class ((0, 1), (1, 0), (0, 2), (2, 0), (1, 1), (2, 1), (1, 2), (2, 2)), comparing Reg-Seq to RegulonDB.]
Figure B.9: Comparison of Reg-Seq architectures to RegulonDB. A comparison of the types of architectures found in RegulonDB (Santos-Zavaleta et al., 2019) to the architectures with newly discovered binding sites found in the Reg-Seq study.
The training data are split as follows: 70% training, 20% validation, 10% test.
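As a minimal illustration of this bookkeeping, the split can be done with a shuffled index array; the sketch below is ours, and the name `split_data` is illustrative rather than taken from the Reg-Seq codebase.

```python
import numpy as np

def split_data(n_samples, seed=0):
    """Return index arrays for a 70/20/10 train/validation/test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(0.7 * n_samples)
    n_val = int(0.2 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_data(10_000)
```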
A schematic of the architecture of the neural network is shown in Fig. B.10 (schematic adapted from (Tareen and Kinney, 2019)). The sequence dependence of $\Delta G_{RF}$ and $\Delta G_P$ is given by

$$\Delta G_{RF} = \vec{\theta}_{RF} \cdot \vec{x}_{RF} + \vec{\omega}_{RF} \cdot \vec{z} + \mu_{RF}, \tag{B.21}$$

$$\Delta G_{P} = \vec{\theta}_{P} \cdot \vec{x}_{P} + \vec{\omega}_{P} \cdot \vec{z} + \mu_{P}. \tag{B.22}$$
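Concretely, each energy is an affine function of the one-hot sequence and condition vectors. The numpy sketch below illustrates Eqs. B.21 and B.22; the shapes and parameter values are illustrative assumptions, not those of the fitted models.

```python
import numpy as np

def binding_energy(theta, x, omega, z, mu):
    """Eqs. B.21/B.22: Delta G = theta . x + omega . z + mu (in kBT)."""
    return theta @ x + omega @ z + mu

# Assumed shapes: a 20-bp binding site one-hot encoded over 4 bases,
# and 3 experimental conditions.
L, n_bases, n_cond = 20, 4, 3
rng = np.random.default_rng(1)
theta_P = rng.normal(size=L * n_bases)   # flattened RNAP position weight matrix
x_P = np.zeros(L * n_bases)
x_P[::n_bases] = 1.0                     # one-hot sequence: base "A" at every position
omega_P = rng.normal(size=n_cond)        # condition-dependent correction
z = np.array([1.0, 0.0, 0.0])            # one-hot condition vector
dG_P = binding_energy(theta_P, x_P, omega_P, z, mu=2.0)
```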
These Gibbs free energies are represented by the values of nodes in the first hidden layer. $\vec{x}$ is a one-hot encoding of the input DNA sequence and $\vec{z}$ is the condition categorical variable. One-hot encoding is a method to represent categorical variables. For example, if you were considering 3 growth conditions (M9, LB, Xylose), instead of labeling the categories M9=1, LB=2, Xylose=3, one-hot encoding labels M9=(1,0,0), LB=(0,1,0), and Xylose=(0,0,1). Without this encoding scheme, the "Xylose" growth condition would be weighted more heavily than the "M9" growth condition, because the magnitude of its original label, 3, is larger than the magnitude of M9's label, 1. With one-hot encoding, the magnitude of each label is the same.
$\vec{\omega}$ represents the condition-dependent part of the position weight matrix, and $\mu$ represents an overall bias/chemical potential. $\vec{\theta}$ represents the PWMs of the RNAP and the transcription factor, and $\vec{x}_P$ represents the one-hot encoded sequence of the RNAP binding site (and similarly for $\vec{x}_{RF}$). The microstates of the thermodynamic model (see Fig. B.11), and equivalently the softmin activations of the second hidden layer, are given by
$$p_s = \frac{e^{-\Delta G_s}}{\sum_{s'} e^{-\Delta G_{s'}}}. \tag{B.23}$$

The nodes from the second hidden layer feed into a single, linearly activated node representing the transcription rate. A dense feed-forward network, with a ReLU-activated hidden layer and a softmin-activated output layer, maps the transcription rate $t$ to counts in bins; this network represents the error model $p(\mathrm{bin} \mid t)$. The promoter activity $t$ is given by Eq. B.24:
$$t = t_{\mathrm{sat}}\, \frac{e^{-\Delta G_P} + e^{-\Delta G_P - \Delta G_{RF} - \Delta G_I}}{1 + e^{-\Delta G_{RF}} + e^{-\Delta G_P} + e^{-\Delta G_P - \Delta G_{RF} - \Delta G_I}}. \tag{B.24}$$
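For intuition, Eqs. B.23 and B.24 together describe a four-state promoter (empty, transcription factor bound, RNAP bound, both bound). The sketch below evaluates them directly in numpy; the state ordering, the saturation rate `t_sat`, and the example energies are assumptions for illustration.

```python
import numpy as np

def promoter_activity(dG_P, dG_RF, dG_I, t_sat=1.0):
    """Eqs. B.23/B.24: average transcription rate of a four-state model."""
    neg_energies = np.array([
        0.0,                   # empty promoter (reference state)
        -dG_RF,                # transcription factor bound
        -dG_P,                 # RNAP bound
        -dG_P - dG_RF - dG_I,  # both bound, with interaction energy dG_I
    ])
    p = np.exp(neg_energies) / np.exp(neg_energies).sum()  # Eq. B.23
    return t_sat * (p[2] + p[3])  # transcription from the RNAP-bound states

print(promoter_activity(dG_P=-2.0, dG_RF=-1.0, dG_I=3.0))
```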
To fit the network, we minimize the negative log-likelihood, given by Eq. B.25:
$$\mathrm{Loss} = -\frac{1}{N_{\mathrm{seqs}}} \sum_{i=1}^{N} \sum_{b=1}^{B} c_{ib} \log p(\mathrm{bin}_b \mid t(\vec{x}_i)). \tag{B.25}$$
Here $c_{ib}$ represents the counts of sequence $i$ in bin $b$ ($B$ bins and $N$ sequences). Eq. B.25 represents a log-Poisson loss; minimizing it is equivalent, in the large-data limit, to maximizing the mutual information $I[t, \mathrm{bin}]$. We use stochastic gradient descent, specifically the Adam optimizer, to backpropagate losses.
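A direct numpy transcription of Eq. B.25 is shown below as a sketch; in the actual fits the same quantity would be computed with framework tensor ops so Adam can backpropagate through it, and the array names here are ours.

```python
import numpy as np

def neg_log_likelihood(counts, bin_probs):
    """Eq. B.25: counts is (N, B) with c_ib reads of sequence i in bin b;
    bin_probs is (N, B) with p(bin_b | t(x_i)) from the error model."""
    n_seqs = counts.shape[0]
    return -np.sum(counts * np.log(bin_probs)) / n_seqs

# Toy example: 2 sequences, 4 bins, uniform error model.
counts = np.array([[5., 3., 1., 0.], [0., 2., 4., 6.]])
bin_probs = np.full((2, 4), 0.25)
print(neg_log_likelihood(counts, bin_probs))
```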
For each promoter, the neural network model was fit 100 times; the two models with the lowest losses on a held-out test set were chosen as the best models.
The total procedure takes less than 15 minutes, a significant improvement over the several hours that fitting models with Markov Chain Monte Carlo can take.
For future endeavors, where thousands of models will need to be fit, this is a crucial advance.
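The restart-and-select procedure is simple to express; the sketch below assumes a hypothetical `fit_model(seed)` routine returning a `(test_loss, model)` pair, which is not part of the actual codebase.

```python
def select_best_models(fit_model, n_restarts=100, n_keep=2):
    """Refit from n_restarts random initializations; keep the n_keep
    fits with the lowest held-out test loss."""
    fits = [fit_model(seed=s) for s in range(n_restarts)]
    fits.sort(key=lambda pair: pair[0])  # sort by ascending test loss
    return [model for _, model in fits[:n_keep]]
```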
Figure B.10: Architecture of the neural network used to fit data. $\vec{x}$ represents a one-hot encoding of the input sequence. "condition" is a categorical variable representing the experimental condition under which each sequence was measured. The condition variable feeds into the energy nodes of the first hidden layer, and also into the dense non-linear sub-network mapping $t$ to bins; this latter skipped connection is drawn with reduced opacity only to reduce visual clutter and does not represent any constraint on the skipped weights. Gray lines connecting first-hidden-layer weights to second-hidden-layer weights are fixed at 0. The weights linking nodes $p_3$ and $p_4$ to node $t$ are constrained to have the same value, as this is a diffeomorphic mode (Atwal, 2016).