Design and Analysis of Nucleic Acid Reaction Pathways

Building on previous algorithms to characterize the ensemble of dilute solutions of nucleic acids, we present a design algorithm that enables optimization of structural features and binding energy of a test tube of interacting nucleic acid strands. These diverse nucleic acid systems are feasible to design because nucleic acids are typically dominated by the energetics of local interactions.

Nucleic acids

Here we describe the types of systems we aim to design and analyze, the underlying free-energy model and kinetic model, and previous work to design nucleic acids and coarse-grain their kinetics. By taking advantage of the specificity of base pairing, researchers have been able to rationally design molecules, systems, and devices from nucleic acids [57, 86].

Rationally designed nucleic acid systems

Starting with the unstructured initiator, each hairpin adds to the free end of the complex, performs a branch migration, and reveals a new unstructured tail. After the third hairpin is added, the initiator, which is identical to the first four domains of the now free tail of the third hairpin, is regenerated via a final branch migration step.

Table 2.1: IUPAC notation for degenerate nucleotides. The symbol on the left represents any of the possible nucleotides listed on the right.
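
The degenerate-nucleotide convention described in this caption can be sketched in a few lines of Python. The mapping below is the standard IUPAC code for RNA (substitute T for U for DNA); the dictionary name `IUPAC_RNA` and the helper `satisfies` are illustrative names chosen here, not part of the thesis.

```python
# Standard IUPAC degenerate nucleotide codes for RNA (replace "U" with "T" for DNA).
# Each symbol on the left stands for any one of the nucleotides on the right.
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "M": "AC", "R": "AG", "W": "AU", "S": "CG", "Y": "CU", "K": "GU",
    "V": "ACG", "H": "ACU", "D": "AGU", "B": "CGU",
    "N": "ACGU",
}


def satisfies(sequence: str, constraint: str) -> bool:
    """Return True if each nucleotide is allowed by the degenerate code at its position.

    `satisfies` is a hypothetical helper used only for this illustration.
    """
    if len(sequence) != len(constraint):
        return False
    return all(base in IUPAC_RNA[code] for base, code in zip(sequence, constraint))


if __name__ == "__main__":
    print(satisfies("GCAU", "SSWN"))  # G and C are strong (S), A is weak (W), U matches N
    print(satisfies("ACAU", "SSWN"))  # A is not allowed by S
```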

Physical model

Analyzing equilibrium properties

For a test tube containing the set of strands, Ψ0, the total number of complexes that can be formed up to size Lmax equals the number of distinct necklaces of length 1 to Lmax over the alphabet Ψ0. The FKM algorithm [60] can be used to efficiently enumerate the set of complexes (see Appendix, Section A.1).

Analyzing kinetic properties

The overall time complexity to analyze the test tube is O(|Ψ| |φ|max^3), where |φ|max is the size of the largest complex. The set of possible complexes is equal to the set of all possible necklaces over the alphabet Ψ0 from length 1 to Lmax.
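
Because the complexes correspond to necklaces, their number can be computed in closed form. The sketch below counts necklaces with Burnside's lemma over rotations; the function names are illustrative. It assumes every strand type is distinct and that each on-target dimer is built from two distinct strand types, under which assumption it reproduces the off-target counts quoted elsewhere in this document (14, 106, 1260, and 17976 for 1, 2, 4, and 8 on-target dimers with Lmax = 4).

```python
from math import gcd


def count_necklaces(k: int, n: int) -> int:
    """Number of necklaces of length n over k symbols (Burnside's lemma over rotations)."""
    return sum(k ** gcd(i, n) for i in range(1, n + 1)) // n


def count_complexes(num_strand_types: int, l_max: int) -> int:
    """Number of distinct complexes of 1..l_max strands, treating rotations as equivalent."""
    return sum(count_necklaces(num_strand_types, n) for n in range(1, l_max + 1))


if __name__ == "__main__":
    for num_dimers in (1, 2, 4, 8):
        strands = 2 * num_dimers  # assumption: two distinct strand types per dimer
        off_targets = count_complexes(strands, 4) - num_dimers
        print(num_dimers, off_targets)
```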

Previous design algorithms

Single-complex design

A weakness of this approach is its assumption that the MFE structure dominates the base-pairing properties of the ensemble. The advantages of using the more physically relevant ensemble defect instead of other objective functions have been discussed in detail by Dirks and Zadeh [15, 83].

Pathway design

Previous coarse-graining algorithms

Test tube design problem specification

Unfortunately, subsequent thermodynamic analysis in a test tube context reveals that the desired heterodimer occurs in negligible concentration relative to other undesired monomers and homodimers (right). A sequence design formulated in the context of a test tube (left) ensures that at equilibrium the desired on-target complex is achieved.

Figure 3.1: Complex design versus test tube design. (a) Complex design. Sequence design formulated in the context of a complex (left) ensures that at equilibrium the target structure dominates the structural ensemble of the complex (center).

Test tube ensemble defect objective function

In this case, optimization of the test tube ensemble defect, C(φ1, s1, y1), is equivalent to optimization of the complex ensemble defect, n(φ1, s1). Therefore, the time complexity to evaluate the test tube ensemble defect is the same as the time complexity to analyze the equilibrium base pairing in a test tube.
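
The relationship between the two objective functions can be illustrated with a small sketch. It assumes the usual definitions: the complex ensemble defect is the expected number of incorrectly paired nucleotides, n = N − Σ P[i][j]·S[i][j], with an extra column for unpaired probabilities, and each complex contributes a structural term n·min(x, y) plus a concentration term |φ|·max(y − x, 0) to the tube defect, where x is the equilibrium concentration and y the target concentration. The data layout and function names are hypothetical, and the pair probabilities are toy values rather than the output of a partition function calculation.

```python
def complex_ensemble_defect(pair_probs, target_pairs):
    """Expected number of incorrectly paired nucleotides for one complex.

    pair_probs[i][j] is the equilibrium probability that nucleotide i pairs with j;
    column N (the last column) holds the probability that i is unpaired.
    target_pairs[i] is the partner of i in the target structure, or None if unpaired.
    """
    n_nt = len(pair_probs)
    correct = 0.0
    for i in range(n_nt):
        partner = target_pairs[i]
        column = n_nt if partner is None else partner
        correct += pair_probs[i][column]
    return n_nt - correct


def tube_ensemble_defect(complexes):
    """Sum structural and concentration defects over the complexes in a tube.

    Each entry is a dict with keys "defect" (nt), "length" (nt),
    "x" (equilibrium concentration), and "y" (target concentration).
    Off-targets have y = 0 and therefore contribute nothing directly.
    """
    total = 0.0
    for c in complexes:
        structural = c["defect"] * min(c["x"], c["y"])
        concentration = c["length"] * max(c["y"] - c["x"], 0.0)
        total += structural + concentration
    return total


if __name__ == "__main__":
    probs = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.1]]  # toy two-nucleotide example
    n = complex_ensemble_defect(probs, [1, 0])
    # Single on-target at its target concentration: C reduces to n * y.
    print(n, tube_ensemble_defect([{"defect": n, "length": 2, "x": 1e-6, "y": 1e-6}]))
```

With a single on-target complex that forms at its full target concentration, the tube defect is the complex ensemble defect scaled by the target concentration, which is why the two optimizations coincide in that case.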

Algorithm

Overview
Test tube ensemble focusing
Hierarchical decomposition of on-target structures
Test tube ensemble defect estimation from nodal contributions
Sequence optimization at the leaves of the decomposition forest
Subsequence merging, redecomposition, and reoptimization
Test tube evaluation, refocusing, and reoptimization
Hierarchical ensemble decomposition using multiple exclusive split-points
Test tube ensemble defect estimation using multiple exclusive split points

This estimate becomes accurate in the limit as the equilibrium probabilities of the base pairs sandwiching the decomposition split-points approach unity. A candidate sequence, φ̂ΛD, is evaluated via calculation of the test tube ensemble defect estimate, C̃D, if the candidate mutation, ξ, is not in the set of previously rejected mutations, γunfavorable.
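
The role of the rejected-mutation set can be sketched as a simple local search. This is a minimal illustration under stated assumptions, not the thesis implementation: mutations are sampled uniformly (the real algorithm may weight positions by their defect contribution), the rejected set is cleared after each accepted mutation, the search stops once `m_unfavorable` distinct mutations have been rejected, and `defect_estimate` is a stand-in for the defect estimate C̃D.

```python
import random


def leaf_mutation(sequence, defect_estimate, m_unfavorable=50, alphabet="ACGU", seed=0):
    """Greedy single-point mutation search that never re-evaluates a rejected mutation."""
    rng = random.Random(seed)
    best = list(sequence)
    best_defect = defect_estimate("".join(best))
    unfavorable = set()  # plays the role of gamma_unfavorable: (position, nucleotide)
    # Cap the stop condition at the number of distinct single-point mutations.
    limit = min(m_unfavorable, len(best) * (len(alphabet) - 1))
    while len(unfavorable) < limit:
        position = rng.randrange(len(best))
        nucleotide = rng.choice([a for a in alphabet if a != best[position]])
        if (position, nucleotide) in unfavorable:
            continue  # previously rejected: skip without evaluating the defect
        candidate = best.copy()
        candidate[position] = nucleotide
        candidate_defect = defect_estimate("".join(candidate))
        if candidate_defect < best_defect:
            best, best_defect = candidate, candidate_defect
            unfavorable.clear()  # the sequence changed, so old rejections no longer apply
        else:
            unfavorable.add((position, nucleotide))
    return "".join(best), best_defect


if __name__ == "__main__":
    # Toy objective (assumption): Hamming distance to a fixed sequence.
    toy = lambda s: sum(a != b for a, b in zip(s, "GGCC"))
    print(leaf_mutation("AAAA", toy))
```

Skipping rejected mutations matters because each defect evaluation is the expensive step; the set lets the search revisit a position without paying for the same evaluation twice.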

Figure 3.2: Ensemble decomposition of a parent node using one or more split-points sandwiched between base pairs

Methods

Implementation

Together, the complex partition function estimate and pair probability estimates enable the calculation of complex concentration estimates (Section 3.2.4.2) and complex ensemble defect estimates (3.15), from which the test tube ensemble defect estimate at a given level can be calculated (3.18).

Target test tubes

Within each target tube, there are two target dimers (each with a target concentration of 1 µM) and 106 off-target monomers, dimers, trimers, and tetramers (each with vanishing target concentration), representing all complexes up to Lmax = 4 strands (excluding the two target dimers). The structural properties of the target structures in the engineered and random test sets are summarized in Figure 3.5.

Sequence design trials

Structures shown for the engineered test set (solid lines) and random test set (dashed lines). RNA designs at 37 °C for the engineered test set (solid lines) and random test set (dashed lines).

Results and discussion

Algorithm performance for test tube design
Importance of designing against off-targets
Contributions of algorithmic ingredients
Robustness to model perturbations
Designing competing on-target complexes
Test tube design with large numbers of on- and off-target complexes

Design quality is evaluated by calculating the test tube ensemble defect for a reference ensemble containing all off-targets up to size Lmax = 5 (that is, up to pentamers).

Figure 3.7: The importance of designing against off-target complexes. Comparison of design quality for test tube design performed using an ensemble containing all off-targets up to size Lmax = 0 (dotted line; |Ψoff| = 0), Lmax = 2 (dashed line; |Ψoff

Conclusion

The initial GC content is depicted as a dashed line. d) The cost of sequence design relative to a single evaluation of the test tube ensemble defect for a test tube containing all complexes up to size Lmax = 4. Optimization of the test tube ensemble defect applies a positive design paradigm (stabilization of on-targets) and a negative design paradigm (destabilization of off-targets) at two levels: a) design for the on-target structure and against off-target structures within the structural ensemble of each on-target complex [15, 83], and b) design for on-target complexes and against off-target complexes within the test tube ensemble.

Figure 3.12: Test tube design with large numbers of on- and off-target complexes. Target test tubes contain |Ψon| = 1, 2, 4, or 8 on-target dimers and all off-target complexes up to size Lmax = 4 (corresponding to |Ψoff| = 14, 106, 1260, or 17976 off-tar

Appendix

Multistate design problem specification

$$C(\phi_{\Psi_h}, s_{\Psi_h}, y_{\Psi_h}) \le C_h^{\mathrm{stop}} \qquad (4.1)$$
where $C_h^{\mathrm{stop}}$ is the allowed concentration of incorrectly paired nucleotides in test tube h. If the design problem contains a single tube, this is equivalent to optimizing the test tube ensemble defect as described in the previous chapter.

Constraints

Identity constraints are specified implicitly through the use of sequence domains and explicitly through the use of the identical statement. Complementarity constraints are specified implicitly through the use of sequence domains and the base pairing of the target structures, and explicitly through the use of the complement statement.
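
A minimal sketch of how such constraints could be checked on a set of domain sequences follows. It assumes Watson-Crick complementarity only (no wobble pairs) and treats a complement constraint as requiring one domain to be the reverse complement of the other; the function names and the dictionary layout are illustrative and do not reflect the design language or solver described in the thesis.

```python
# Watson-Crick partners for RNA; wobble pairs are deliberately ignored (assumption).
WATSON_CRICK = {"A": "U", "U": "A", "C": "G", "G": "C"}


def reverse_complement(sequence: str) -> str:
    """Return the Watson-Crick reverse complement of an RNA sequence."""
    return "".join(WATSON_CRICK[base] for base in reversed(sequence))


def check_constraints(domains, identical=(), complement=()):
    """Return a list of violated constraints.

    domains maps a domain name to its sequence; identical and complement are
    iterables of (name_a, name_b) pairs. All names here are hypothetical.
    """
    violations = []
    for a, b in identical:
        if domains[a] != domains[b]:
            violations.append(("identical", a, b))
    for a, b in complement:
        if domains[a] != reverse_complement(domains[b]):
            violations.append(("complement", a, b))
    return violations


if __name__ == "__main__":
    domains = {"a": "GGAC", "a*": "GUCC", "b": "GGAC"}
    print(check_constraints(domains, identical=[("a", "b")], complement=[("a", "a*")]))
```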

Algorithm

Ensemble focusing
Hierarchical ensemble decomposition
Multistate defect estimate from nodal contributions
Leaf mutation
Leaf reoptimization
Subsequence merging, redecomposition, and reoptimization
Test tube evaluation, refocusing, and reoptimization
Constraint solving
Structural defect weighting

As in test tube design, let Λ denote the set of all nodes in the forest. The result of leaf mutation is the set of leaf sequences, φΛD, corresponding to the lowest C̃D found.

Methods

Single complex and single test tube designs

The weighted complex ensemble defect and the weighted complex ensemble defect estimate replace their unweighted versions everywhere in the optimization algorithm.

Pathway designs

Each "Cross-target" tube optimizes against the binding of a hairpin from one system, X, to the input strands and the exposed tails of intermediates from the remaining systems, Y. The "Step 1" tube optimizes the first step of the reaction: binding of A-B to the input window Xs to form an intermediate B and waste Xs-A.

Figure 4.1: Design objectives for HCR [17, 11]. This design takes advantage of the symmetry of the addition steps, designing the end of the polymer after each addition step by creating a virtual initiator I2 in addition to the normal initiator I1

Implementation

The number of tubes, on-targets and off-targets for each design type is shown in Table 4.1. The number of instantiations is the number of orthogonal instantiations of the specified design type being designed.

Results

Special-case comparisons to previous algorithms
Design of nucleic acid reaction pathways
Preventing sequence patterns
Constraining content
Weighting structural defects

The effect of these sequence content constraints on the quality and cost of design is shown in Figure 4.11. The effect of using structural weights is shown in Figure 4.13 for the Detect 1 target structure from a designed HCR system.

Figure 4.6: Multistate algorithm performance for single complex design. Algorithm performance for test tube design

Conclusion

Panel b shows the target structure designed with nucleotide defect weights of 0.25 on the exposed branch migration region b∗. The complex designed with all weights equal to unity shows reduced base pairing in the exposed b∗ region, while the structure designed with reduced weights there maintains low nucleotide defects in the stem and in the toehold but permits some pairing in the exposed b∗ domain.

Appendix and archive content

Overview
Algorithm
Methods
Results

Base pair distance between full master equation and coarse master equation simulations. The coarse-grained algorithm was tested on HCR sequences designed with the multistate sequence design algorithm presented in Chapter 4. The macrostates, stages, and master equation simulations for the HCR system are shown in Figure 5.6.

Figure 5.1: The algorithm is initialized with a set of sequences and an initial secondary structure.

Large box coarse-graining

Overview
Algorithm
Methods
Results

A unimolecular reaction to reacting species A (the previously detected reacting species) was detected, and a bimolecular self-reaction was detected that gives rise to a new reacting species C (detected using the local equilibration criterion). Reacting species are selected for investigation based on their concentration during coarse-grained large box simulations.

Figure 5.7: Diagram of large box reacting species exploration. a) The algorithm is initialized with the set of complexes, each with an initial structure

Limitations

Only the enumerated reacting species that reach concentrations above 1 pM during the simulation time frame are shown. The mass performance results were clipped at 100 s during the coarse-grained algorithm (shown by the dashed black line in panel c).

Figure 5.10: Large box results for HCR. a) Reacting species discovered. b) Reaction simulation with a linear time axis

Conclusions

Appendix and archive content

This algorithm uses the same components as test tube design to efficiently optimize the pathways of interacting nucleic acid strands by designing initial, intermediate, and final states. A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of Computational Chemistry, October 2003.

Engineered test set generation

This appendix contains some algorithms that were either used as subroutines for the main part of the thesis, or were used as inspiration for this work. The computational cost of either method is low, but this algorithm is systematic and elegant.

Branch and propagate

Given the number of strand types k and a length n, this algorithm returns the set of all sequences of length n, treating all circular permutations as equivalent. The variable is instantiated to that value and all other values are removed from its domain.
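
The enumeration of sequences up to circular permutation can be sketched with the textbook FKM (Fredricksen, Kessler, Maiorana) necklace algorithm. This is a generic implementation written for illustration, with strand types represented by the integers 0 to k − 1; it is not taken from the thesis code, and the wrapper `enumerate_complexes` is a hypothetical name.

```python
def necklaces(k: int, n: int):
    """Yield every necklace of length n over {0, ..., k-1} in lexicographic order.

    Each necklace is reported as its lexicographically smallest rotation.
    """
    a = [0] * n
    yield tuple(a)
    while True:
        # Find the rightmost position that can still be incremented.
        i = n
        while i > 0 and a[i - 1] == k - 1:
            i -= 1
        if i == 0:
            return
        a[i - 1] += 1
        # Extend the prefix of length i periodically to fill the remaining positions.
        for j in range(n - i):
            a[i + j] = a[j]
        # The resulting prenecklace is a necklace exactly when i divides n.
        if n % i == 0:
            yield tuple(a)


def enumerate_complexes(num_strand_types: int, l_max: int):
    """List the strand orderings of all complexes containing 1..l_max strands."""
    return [c for n in range(1, l_max + 1) for c in necklaces(num_strand_types, n)]


if __name__ == "__main__":
    print(list(necklaces(2, 4)))
    print(len(enumerate_complexes(4, 4)))
```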

Iterated local search

Typically, the random test set contains on-target structures with a lower fraction of paired nucleotides (panel a), more stems (panel b), and shorter stems (panel c) and a higher minimum cutoff (panel d). The target structures for the engineered test set, random test set, and large tube test set used for Figure 3.12 are provided as text files in the test tube directory of the supplementary archive.

Selection and use of multiple exclusive split-points

Ensemble decomposition using multiple exclusive split-points allows low-cost estimation of ensemble properties in some situations where individual split-points do not. From these graphs, we see that only a small proportion of parents use multiple exclusive split-points (panel a), but almost 30% of complexes (panel b) and 80% of models (panel c) include at least one parent with four children (and thus two split-points) for target lengths of 200 nt.

Figure B.1: Loop composition of the (dimer) on-target structures in the engineered (solid lines) and random (dashed lines) test sets

Algorithm performance for complex design

Sequence initialization

The initial GC content is depicted as a dashed line. d) The cost of the sequence design with respect to a single evaluation of the objective function.

Figure B.3: Algorithm performance for complex design using the on-target structures from the engineered test set

RNA vs DNA design

The GC content used for seeding/reseeding is shown as dotted lines. d) Cost of sequence design relative to a single evaluation of the objective function. RNA design at 37 °C for the subset of the engineered test set with 100 nt on-targets.

Sensitivity of algorithm performance to design parameters

Single complex and test tube designs: random test set

The multistate design algorithm allows the specification of multiple target structures for a single complex, but the stopping conditions must be chosen carefully in this case. The two target structures are nearly isoenergetic for the designed sequence, and both ensemble defects are nearly 8 nt (panel b).

Language definition

Structure definitions
Sequence definitions
Tube definitions
Advanced sequence constraints
Global parameters and options

The following productions define a single target structure using DU-plus notation or dot-parens-plus notation. Explicit external sequence constraints, complementary and identical domains, percent matches and library definitions can be defined using the following productions.

Figure C.2: Sequence designed with two target structures. a) The target structures with each nucleotide shaded according to its identity

Example design scripts

Simple examples

Python-generated scripts

Extending the design algorithm

In the current algorithm, we would first select molecule A, explore all unimolecular reactions and self-reactions, and find none. If fstop = 0.001, the coarse-grained algorithm would then stop, regardless of τmax, because there would be no current in Ju.

Archive content

Small box input files

This input file was used to compare the coarse-grained solution with the full master equation solution.

Large box input files

Exhaustive enumeration

Loop types and polymer graph
DNA AND gate mechanism
Hybridization chain reaction mechanism
Three-arm junction mechanism
Cooperative AND gate mechanism
Conditional dicer substrate mechanism
Motivation for test tube design versus complex design
Ensemble decomposition of a parent node using one or more split-points sandwiched between base pairs
Hierarchical decomposition of a target structure
Estimation of physical quantities from nodal contributions
Structural features of the test set
Test tube design algorithm performance
The importance of off-target destabilization
Performance of test tube ensemble defect estimation
Efficiency implications of test tube ensemble focusing and hierarchical ensemble decomposition
Robustness of design quality predictions to model perturbations
Test tube design with competing on-target complexes
Test tube design with large numbers of on- and off-target complexes
Design objectives for HCR
Design objectives for cooperative gates
Design objectives for AND gates
Design objectives for catalytic three-arm junction assembly
Design objectives for conditional Dicer
Multistate algorithm performance for single complex design on the engineered test set
Multistate design performance
Summary of one HCR design result
Effect of preventing sequence patterns
Effect of content constraints
Effect of including structural defect weights
Example designed with and without structural defect weighting
Diagram of small box coarse-graining algorithm
Centroid structure versus MFE structure
Distance metric characterization
Small box coarse-grained results versus exhaustively enumerated results
Single strand coarse-grained results
HCR small box coarse-graining
Diagram of large box reacting species exploration
Large box coarse-grained results for AND gate
Large box coarse-grained results for cooperative gate
Large box coarse-grained results for HCR

Loop composition of target structures

Extent of multiple split point usage

Test tube algorithm performance for complex design: engineered test set

Test tube algorithm performance for complex design: random test set

Effect of sequence initialization on algorithm performance

Effect of design material on algorithm performance

Test tube design sensitivity to fstop

Test tube design sensitivity to fstringent

Test tube design sensitivity to fstringent with fstop = 0.003

Test tube design sensitivity to Mreopt

Test tube design sensitivity to Munfavorable

Test tube design sensitivity to fredecomp

Test tube design sensitivity to frefocus

Test tube design sensitivity to fpassive

Test tube design sensitivity to fsplit

Test tube design sensitivity to Hsplit

Test tube design sensitivity to Nsplit

Multistate algorithm performance for single complex design on the random test set

Test tube design

Multiobjective design

The 'Cross-active 1' tube is designed against interactions between the gate complex of one system [...]. The 'Step 2' tube optimizes the binding of B to hairpin C, forming the dimer B-C.

Small box coarse-graining

Each newly discovered reacting species is added to the set of unexplored reacting species, Ju. Comparison of the current test tube design algorithm (solid lines) with the previously published single complex design algorithm [83] (dashed lines). a) Design quality.

Enumerating complexes (necklaces)