In addition to discovering new binding sites, we have discovered additional functions of known binding sites. In particular, BdcR, the repressor of the divergently transcribed gene bdcA (Partridge et al., 2009), is also shown to repress bdcR in Figure B.8(A). Similarly, in Figure B.8(B), IlvY is shown to repress ilvC in the absence of inducer. Divergently transcribed operons that share regulatory regions are plentiful in E. coli, and although there are already many known examples of transcription factor binding sites regulating several different operons, there are almost certainly many examples of this type of regulation that have yet to be discovered.
[Figure B.8 panels: information footprints (information in bits versus position, -95 to 25) for the divergent bdcR/bdcA and ilvC/ilvY promoters, with bars marked by whether a mutation increases or decreases expression, and the BdcR, NsrR, IlvY, and RNAP binding sites annotated.]
Figure B.8: Multipurpose binding sites. Two cases in which a transcription factor binding site is found to regulate both of a pair of divergently transcribed genes.
Gene           Location relative to TSS (bp)   σ factor
acuI           -105                            σ70
acuI           -25                             σ70
aegA           -9                              σ70
adiY           19                              σ70
aphA           -11                             σ70
araAB          -13                             σ70
araC           -32                             σ70
arcA           -99                             σ38
arcA           -12                             σ70
arcB           -11                             σ70
asnA           -10                             σ70
bdcR           -10                             σ70
coaA           -12                             σ70
cra            -12                             σ70
dicC           -11                             σ70
dinJ           -11                             σ70
dnaE           -11                             σ24
dpiBA          -24                             σ70
dusC           -23                             σ70
ecnB           -11                             σ70
fdhE           -12                             σ70
ftsK           -11                             σ38
groSL          -15                             σ32
groSL          19                              σ70
hicB           -11                             σ70
holC           -14                             σ32
hslU           -10                             σ32
htrB           -10                             σ38
iap            -12                             σ38
iap            33                              σ38
ilvC           -8                              σ70
maoP           -21                             σ70
minC           -12                             σ70
modE           -26                             σ70
mscK           -14                             σ38
mscL           -13                             σ38
mscM           35                              σ54
ompR           -11                             σ70
pcm            -11                             σ70
pit            -13                             σ70
poxB           -12                             σ38
rapA           -10                             σ70
rapA           -103                            σ38
rcsF           -62                             σ38
rcsF           -11                             σ70
rlmA           -9                              σ70
rlmA           -64                             σ70
rspA           5                               σ38
rumB           -11                             σ70
sbcB           -11                             σ70
sdaB           -19                             σ70
sdiA           -54                             σ70
tff-rpsB-tsf   -12                             σ70
tff-rpsB-tsf   -84                             σ70
thiM           -11                             σ70
thrLABC        -14                             σ70
tig            -11                             σ70
uvrD           -12                             σ70
waaA-coaD      -13                             σ70
xylA           -8                              σ70
xylF           -12                             σ70
ybdG           -11                             σ70
ybeZ           29                              σ70
ybeZ           -14                             σ32
ybiO           19                              σ38
ybiO           2                               σ70
ybjL           -11                             σ38
ybjL           -58                             σ70
ybjL           20                              σ70
ybjT           -12                             σ70
ybjT           10                              σ70
ycbZ           -8                              σ70
ycbZ           -11                             σ70
ycbZ           -25                             σ70
ycgB           -11                             σ38
ydhO           12                              σ70
ydjA           -13                             σ70
ydjA           17                              σ24
yecE           -1                              σ70
yecE           -33                             σ70
yedJ           -30                             σ70
yedJ           -11                             σ70
yedK           13                              σ70
yedK           25                              σ38
yehS           8                               σ70
yehT           -11                             σ70
yehT           12                              σ38
yehU           -8                              σ70
yehU           36                              σ70
yeiQ           -12                             σ70
yfhG           32                              σ70
ygdH           -12                             σ70
ygeR           -10                             σ70
yggW           -14                             σ32
ygjP           -23                             σ70
yicI           5                               σ70
yjjJ           12                              σ70
ykgE           -41                             σ70
ykgE           25                              σ70
ymgG           -12                             σ70
ynaI           -11                             σ70
yqhC           40                              σ70
zapB           -13                             σ70
znuA           36                              σ70
znuCB          -11                             σ70
znuCB          -88                             σ70

Table B.2: Identification of the σ factors used for each RNAP binding site.
Multi-purpose binding sites allow for more genes to be regulated with fewer binding sites. However, they can also serve to sharpen the promoter's response to environmental cues. In the case of ilvC, IlvY is known to activate ilvC in the presence of inducer; we now see that it also represses the promoter in the absence of that inducer. The production of ilvC is known to increase by approximately a factor of 100 in the presence of inducer (Rhee, Senear, and Hatfield, 1998). The magnitude of the change is attributed to the cooperative binding of two IlvY binding sites, but the lowered expression of the promoter due to IlvY repression in the absence of inducer is also a factor.
Comparison of Reg-Seq results to RegulonDB

B.9 Neural network fitting
Although neural network models are still in development, they are a more efficient fitting method than the Markov Chain Monte Carlo methods used previously. Additionally, because this method fits thermodynamic models, the results are in units of $k_B T$ rather than arbitrary units. The training set for one neural network fit comprises DNA sequences from one promoter concatenated across all different experimental conditions. A categorical variable represents the experimental condition (e.g., heat, M9, etc.).
[Figure B.9 bar chart: percentage of promoters versus promoter architecture class ((0, 1), (1, 0), (0, 2), (2, 0), (1, 1), (2, 1), (1, 2), (2, 2)), comparing Reg-Seq to RegulonDB.]
Figure B.9: Comparison of Reg-Seq architectures to RegulonDB. A comparison of the types of architectures found in RegulonDB (Santos-Zavaleta et al., 2019) to the architectures with newly discovered binding sites found in the Reg-Seq study.
The training data are split as follows: 70% training, 20% validation, 10% test.
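As a minimal illustration of this bookkeeping, the split can be done with a shuffled index array; the sketch below is ours, and the name `split_data` is illustrative rather than taken from the Reg-Seq codebase.

```python
import numpy as np

def split_data(n_samples, seed=0):
    """Return index arrays for a 70/20/10 train/validation/test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(0.7 * n_samples)
    n_val = int(0.2 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_data(10_000)
```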
A schematic of the architecture of the neural network is shown in Fig. B.10 (schematic adapted from (Tareen and Kinney, 2019)). The sequence dependence of $\Delta G_{RF}$ and $\Delta G_P$ is given by

$$\Delta G_{RF} = \vec{\theta}_{RF} \cdot \vec{x}_{RF} + \vec{\omega}_{RF} \cdot \vec{z} + \mu_{RF}, \tag{B.21}$$

$$\Delta G_{P} = \vec{\theta}_{P} \cdot \vec{x}_{P} + \vec{\omega}_{P} \cdot \vec{z} + \mu_{P}. \tag{B.22}$$
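Concretely, each energy is an affine function of the one-hot sequence and condition vectors. The numpy sketch below illustrates Eqs. B.21 and B.22; the shapes and parameter values are illustrative assumptions, not those of the fitted models.

```python
import numpy as np

def binding_energy(theta, x, omega, z, mu):
    """Eqs. B.21/B.22: Delta G = theta . x + omega . z + mu (in kBT)."""
    return theta @ x + omega @ z + mu

# Assumed shapes: a 20-bp binding site one-hot encoded over 4 bases,
# and 3 experimental conditions.
L, n_bases, n_cond = 20, 4, 3
rng = np.random.default_rng(1)
theta_P = rng.normal(size=L * n_bases)   # flattened RNAP position weight matrix
x_P = np.zeros(L * n_bases)
x_P[::n_bases] = 1.0                     # one-hot sequence: base "A" at every position
omega_P = rng.normal(size=n_cond)        # condition-dependent correction
z = np.array([1.0, 0.0, 0.0])            # one-hot condition vector
dG_P = binding_energy(theta_P, x_P, omega_P, z, mu=2.0)
```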
These Gibbs free energies are represented by the values of nodes in the first hidden layer. $\vec{x}$ is a one-hot encoding of the input DNA sequence and $\vec{z}$ is the condition categorical variable. One-hot encoding is a method to represent categorical variables. For example, if you were considering 3 growth conditions (M9, LB, Xylose), instead of labeling the categories M9=1, LB=2, Xylose=3, one-hot encoding labels M9=(1,0,0), LB=(0,1,0), and Xylose=(0,0,1). Without this encoding scheme, the "Xylose" growth condition would be weighted more heavily than the "M9" growth condition, because the magnitude of its original label, 3, is larger than the magnitude of M9's label, 1. With one-hot encoding, the magnitude of each label is the same.
$\vec{\omega}$ represents the condition-dependent part of the position weight matrix, and $\mu$ represents an overall bias/chemical potential. $\vec{\theta}$ represents the PWMs of the RNAP and the transcription factor, and $\vec{x}_P$ represents the one-hot encoded sequence of the RNAP binding site (and similarly for $\vec{x}_{RF}$). The microstates of the thermodynamic model (see Fig. B.11), and equivalently the softmin activations of the second hidden layer, are given by
$$p_s = \frac{e^{-\Delta G_s}}{\sum_{s'} e^{-\Delta G_{s'}}}. \tag{B.23}$$

The nodes from the second hidden layer feed into a single, linearly activated node representing the transcription rate. A dense feed-forward network, with a ReLU-activated hidden layer and a softmin-activated output layer, maps the transcription rate $t$ to counts in bins; this network represents the error model $p(\mathrm{bin} \mid t)$. The promoter activity $t$ is given by Eq. B.24:
$$t = t_{\mathrm{sat}}\, \frac{e^{-\Delta G_P} + e^{-\Delta G_P - \Delta G_{RF} - \Delta G_I}}{1 + e^{-\Delta G_{RF}} + e^{-\Delta G_P} + e^{-\Delta G_P - \Delta G_{RF} - \Delta G_I}}. \tag{B.24}$$
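For intuition, Eqs. B.23 and B.24 together describe a four-state promoter (empty, transcription factor bound, RNAP bound, both bound). The sketch below evaluates them directly in numpy; the state ordering, the saturation rate `t_sat`, and the example energies are assumptions for illustration.

```python
import numpy as np

def promoter_activity(dG_P, dG_RF, dG_I, t_sat=1.0):
    """Eqs. B.23/B.24: average transcription rate of a four-state model."""
    neg_energies = np.array([
        0.0,                   # empty promoter (reference state)
        -dG_RF,                # transcription factor bound
        -dG_P,                 # RNAP bound
        -dG_P - dG_RF - dG_I,  # both bound, with interaction energy dG_I
    ])
    p = np.exp(neg_energies) / np.exp(neg_energies).sum()  # Eq. B.23
    return t_sat * (p[2] + p[3])  # transcription from the RNAP-bound states

print(promoter_activity(dG_P=-2.0, dG_RF=-1.0, dG_I=3.0))
```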
To fit the network, we minimize the negative log-likelihood, given by Eq. B.25:
$$\mathrm{Loss} = -\frac{1}{N_{\mathrm{seqs}}} \sum_{i=1}^{N} \sum_{b=1}^{B} c_{ib} \log p(\mathrm{bin}_b \mid t(\vec{x}_i)). \tag{B.25}$$
Here $c_{ib}$ represents the counts of sequence $i$ in bin $b$ ($B$ bins and $N$ sequences). Eq. B.25 represents a log-Poisson loss; minimizing it is equivalent, in the large-data limit, to maximizing the mutual information $I[t, \mathrm{bin}]$. We use stochastic gradient descent, specifically the Adam optimizer, to backpropagate losses.
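A direct numpy transcription of Eq. B.25 is shown below as a sketch; in the actual fits the same quantity would be computed with framework tensor ops so Adam can backpropagate through it, and the array names here are ours.

```python
import numpy as np

def neg_log_likelihood(counts, bin_probs):
    """Eq. B.25: counts is (N, B) with c_ib reads of sequence i in bin b;
    bin_probs is (N, B) with p(bin_b | t(x_i)) from the error model."""
    n_seqs = counts.shape[0]
    return -np.sum(counts * np.log(bin_probs)) / n_seqs

# Toy example: 2 sequences, 4 bins, uniform error model.
counts = np.array([[5., 3., 1., 0.], [0., 2., 4., 6.]])
bin_probs = np.full((2, 4), 0.25)
print(neg_log_likelihood(counts, bin_probs))
```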
For each promoter, the neural network model was fit 100 times; the two models with the lowest losses on a held-out test set were chosen as the best models.
The total procedure takes less than 15 minutes, a significant improvement over the several hours that fitting models with Markov Chain Monte Carlo can take.
For future endeavors, where thousands of models will need to be fit, this is a crucial advance.
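The restart-and-select procedure is simple to express; the sketch below assumes a hypothetical `fit_model(seed)` routine returning a `(test_loss, model)` pair, which is not part of the actual codebase.

```python
def select_best_models(fit_model, n_restarts=100, n_keep=2):
    """Refit from n_restarts random initializations; keep the n_keep
    fits with the lowest held-out test loss."""
    fits = [fit_model(seed=s) for s in range(n_restarts)]
    fits.sort(key=lambda pair: pair[0])  # sort by ascending test loss
    return [model for _, model in fits[:n_keep]]
```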
Figure B.10: Architecture of the neural network used to fit data. $\vec{x}$ represents a one-hot encoding of the input sequence. "condition" is a categorical variable representing the experimental condition under which each sequence was measured. The condition variable feeds into the energy nodes of the first hidden layer, and also into the dense non-linear sub-network mapping $t$ to bins; this latter skipped connection is drawn with reduced opacity only to reduce visual clutter and does not represent any constraint on the skipped weights. Gray lines connecting first-hidden-layer weights to second-hidden-layer weights are fixed at 0. The weights linking nodes $p_3$ and $p_4$ to node $t$ are constrained to have the same value, as this is a diffeomorphic mode (Atwal, 2016).