Building on previous algorithms to characterize the ensemble of dilute solutions of nucleic acids, we present a design algorithm that enables optimization of structural features and binding energy of a test tube of interacting nucleic acid strands. These diverse nucleic acid systems are feasible to design because nucleic acids are typically dominated by the energetics of local interactions.
Nucleic acids
Here we describe the types of systems we aim to design and analyze, the underlying free-energy model and kinetic model, and previous work to design nucleic acids and coarse-grain their kinetics. By taking advantage of this specificity, researchers have been able to rationally design molecules, systems, and devices from nucleic acids [ 57 , 86 ].
Rationally designed nucleic acid systems
Starting with the unstructured initiator, each hairpin adds to the free end of the complex, performs a branch migration, and reveals a new unstructured tail. After the third hairpin is added, the initiator, which is identical to the first four domains of the now free tail of the third hairpin, is regenerated via a final branch migration step.
Physical model
Analyzing equilibrium properties
For a test tube containing the set of strands, Ψ0, the total number of complexes that can be formed up to size Lmax can be counted in closed form. The FKM algorithm [60] can be used to efficiently enumerate the set of complexes (see Appendix, Section A.1).
Analyzing kinetic properties
The overall time complexity to analyze the test tube is O(|Ψ||φ|max^3), where |φ|max is the size of the largest complex. The set of possible complexes is equal to the set of all possible necklaces over the alphabet Ψ0 of length 1 to Lmax.
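The correspondence between complexes and necklaces can be illustrated with a minimal sketch. It does not implement the FKM algorithm; it uses a brute-force canonical-rotation enumeration, which is only practical for small alphabets and small Lmax, and checks the result against the standard Burnside count of necklaces. The function names are hypothetical, and strand species are represented simply as integers.

```python
from itertools import product
from math import gcd


def count_necklaces(k, n):
    """Number of necklaces of length n over k symbols (Burnside's lemma)."""
    return sum(k ** gcd(i, n) for i in range(1, n + 1)) // n


def enumerate_necklaces(k, n):
    """Brute-force enumeration: keep the lexicographically smallest rotation.

    This is NOT the FKM algorithm; it visits all k**n words and is meant
    only to make the definition of a necklace concrete.
    """
    canonical = set()
    for word in product(range(k), repeat=n):
        canonical.add(min(word[i:] + word[:i] for i in range(n)))
    return sorted(canonical)


def all_complexes(k, l_max):
    """All strand orderings (necklaces) of 1..l_max strands from k species."""
    return [w for n in range(1, l_max + 1) for w in enumerate_necklaces(k, n)]


if __name__ == "__main__":
    complexes = all_complexes(k=4, l_max=4)
    print(len(complexes))
```

For four strand species and Lmax = 4, both the enumeration and the closed-form count give 108 complexes, which appears consistent with the two on-target dimers plus 106 off-targets described for the target test tubes.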
Previous design algorithms
Single-complex design
A weakness of this approach is its assumption that the MFE structure dominates the base-pairing properties of the ensemble. The advantages of using the more physically relevant ensemble defect instead of other objective functions have been discussed in detail by Dirks and Zadeh [15, 83].
Pathway design
Previous coarse-graining algorithms
Test tube design problem specification
Unfortunately, subsequent thermodynamic analysis in a test tube context reveals that the desired heterodimer occurs in negligible concentration relative to other undesired monomers and homodimers (right). A sequence design formulated in the context of a test tube (left) ensures that in equilibrium the desired “on target” is achieved.
Test tube ensemble defect objective function
In this case, the optimization of the test tube ensemble defect, C(φ1, s1, y1), is equivalent to the optimization of the complex ensemble defect, n(φ1, s1). Therefore, the time complexity to evaluate the test tube ensemble defect is the same as the time complexity to analyze the equilibrium base pairing in a test tube.
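To make the relationship between the two defects concrete, the following minimal sketch assumes the test tube ensemble defect takes the form C = Σj [ nj min(xj, yj) + |φj| max(yj − xj, 0) ], where nj is the complex ensemble defect, xj the equilibrium concentration, yj the target concentration, and |φj| the number of nucleotides in complex j. That form is not reproduced in this excerpt, so it is an assumption here, as are the dictionary keys and the function name.

```python
def tube_ensemble_defect(complexes):
    """Sketch of a test tube ensemble defect (assumed form, see lead-in).

    Each entry of `complexes` is a dict with hypothetical keys:
      defect : complex ensemble defect n_j (nucleotides)
      length : number of nucleotides |phi_j|
      x      : equilibrium concentration x_j
      y      : target concentration y_j (0 for off-targets)
    """
    total = 0.0
    for c in complexes:
        # Structural defect: incorrectly paired nucleotides in the complexes
        # that did form, up to the target concentration.
        structural = c["defect"] * min(c["x"], c["y"])
        # Concentration defect: every nucleotide of the missing complexes
        # counts as incorrect.
        concentration = c["length"] * max(c["y"] - c["x"], 0.0)
        total += structural + concentration
    return total


if __name__ == "__main__":
    # Single on-target complex, no off-targets, formed at full concentration:
    # the tube defect reduces to the complex ensemble defect times y.
    single = [{"defect": 2.0, "length": 10, "x": 1.0, "y": 1.0}]
    print(tube_ensemble_defect(single))
```

Under this form, off-targets (yj = 0) contribute nothing directly; they matter because any strands they consume lower the equilibrium concentrations of the on-targets through mass conservation.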
Algorithm
- Overview
- Test tube ensemble focusing
- Hierarchical decomposition of on-target structures
- Test tube ensemble defect estimation from nodal contributions
- Sequence optimization at the leaves of the decomposition forest
- Subsequence merging, redecomposition, and reoptimization
- Test tube evaluation, refocusing, and reoptimization
- Hierarchical ensemble decomposition using multiple exclusive split-points
- Test tube ensemble defect estimation using multiple exclusive split points
This estimate becomes accurate in the limit as the equilibrium probabilities of the base pairs sandwiching the decomposition split-points approach unity. A candidate sequence, φ̂ΛD, is evaluated via calculation of the test tube ensemble defect estimate, C̃D, only if the candidate mutation, ξ, is not in the set of previously rejected mutations, γunfavorable.
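The role of the rejected-mutation set can be sketched as a simple mutation loop. This is an illustration under stated assumptions, not the thesis implementation: positions are sampled uniformly rather than by defect, the defect estimate is supplied as an arbitrary callable, and the names `mutate_leaf`, `defect_fn`, and `m_unfavorable` are hypothetical.

```python
import random


def mutate_leaf(seq, defect_fn, m_unfavorable, rng):
    """Greedy single-base mutation with a memory of rejected mutations.

    seq           : starting sequence over the alphabet ACGU
    defect_fn     : callable returning a defect estimate (stand-in for the
                    test tube ensemble defect estimate)
    m_unfavorable : stop after this many consecutive unfavorable proposals
    rng           : random.Random instance, for reproducibility
    """
    best, best_defect = seq, defect_fn(seq)
    unfavorable = set()          # previously rejected mutations
    consecutive = 0
    while consecutive < m_unfavorable and best_defect > 0:
        pos = rng.randrange(len(best))
        base = rng.choice([b for b in "ACGU" if b != best[pos]])
        xi = (pos, base)         # candidate mutation
        if xi in unfavorable:
            consecutive += 1     # skip the costly evaluation entirely
            continue
        candidate = best[:pos] + base + best[pos + 1:]
        candidate_defect = defect_fn(candidate)
        if candidate_defect < best_defect:
            best, best_defect = candidate, candidate_defect
            unfavorable.clear()  # sequence changed, so old rejections are stale
            consecutive = 0
        else:
            unfavorable.add(xi)
            consecutive += 1
    return best, best_defect


if __name__ == "__main__":
    # Toy defect: Hamming distance to a fixed sequence (purely illustrative).
    goal = "GGCAUCGA"
    toy = lambda s: sum(a != b for a, b in zip(s, goal))
    print(mutate_leaf("AAAAAAAA", toy, m_unfavorable=20, rng=random.Random(0)))
```

The set is cleared after every accepted mutation because a mutation that was unfavorable for the old sequence may be favorable for the new one.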
Methods
Implementation
Together, the complex partition function estimate and pair probability estimates enable the calculation of complex concentration estimates (Section 3.2.4.2) and complex ensemble defect estimates (3.15), from which the test tube ensemble defect estimate at a given level of the decomposition can be calculated (3.18).
Target test tubes
Within each target tube, there are two target dimers (each with a target concentration of 1 µM) and 106 off-target monomers, dimers, trimers, and tetramers (each with vanishing target concentration), representing all complexes up to Lmax = 4 strands (excluding the two target dimers). The structural properties of the target structures in the engineered and random test sets are summarized in Figure 3.5.
Sequence design trials
Structures shown for the engineered test set (solid lines) and random test set (dashed lines). RNA designs at 37 °C for the engineered test set (solid lines) and random test set (dashed lines).
Results and discussion
- Algorithm performance for test tube design
- Importance of designing against off-targets
- Contributions of algorithmic ingredients
- Robustness to model perturbations
- Designing competing on-target complexes
- Test tube design with large numbers of on- and off-target complexes
The quality of the resulting design is evaluated by calculating the test tube ensemble defect for a reference ensemble containing all off-targets up to size Lmax = 5 (pentamers).
Conclusion
The initial GC content is depicted as a dashed line. d) The cost of sequence design relative to a single evaluation of the test tube ensemble defect for a test tube containing all complexes up to size Lmax = 4. The optimization of the test tube ensemble defect applies a positive design paradigm (stabilization of on-targets) and a negative design paradigm (destabilization of off-targets) at two levels: a) design for the on-target structure and against off-target structures within the structural ensemble of each on-target complex [15, 83], and b) design for on-target complexes and against off-target complexes within the test tube ensemble.
Appendix
Multistate design problem specification
C(φΨh, sΨh, yΨh) ≤ C_h^stop, (4.1) where C_h^stop is the allowed concentration of incorrectly paired nucleotides in test tube h. If the design problem consists of a single tube, this is equivalent to optimizing the test tube ensemble defect as described in the previous chapter.
Constraints
Identity constraints are specified implicitly through the use of sequence domains and explicitly through the use of the identical statement. Complementarity constraints are specified implicitly through the use of sequence domains and the base pairing of the target structures, and explicitly through the use of the complement statement.
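As a minimal sketch of how identity and complementarity constraints act on a candidate sequence, the check below represents each constraint as a pair of nucleotide indices. The representation and the function name are hypothetical, and strict Watson-Crick complementarity for RNA is assumed (wobble pairs are not treated as complementary).

```python
# Strict Watson-Crick complements for RNA (assumption: no G-U wobble pairs).
WC_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def violated_constraints(seq, identical, complement):
    """Return the constraints that a candidate sequence violates.

    seq        : candidate sequence (string over ACGU)
    identical  : iterable of index pairs (a, b) requiring seq[a] == seq[b]
    complement : iterable of index pairs (a, b) requiring seq[b] to be the
                 Watson-Crick complement of seq[a]
    """
    violations = []
    for a, b in identical:
        if seq[a] != seq[b]:
            violations.append(("identical", a, b))
    for a, b in complement:
        if WC_COMPLEMENT[seq[a]] != seq[b]:
            violations.append(("complement", a, b))
    return violations


if __name__ == "__main__":
    pairs = [(0, 5), (1, 4), (2, 3)]   # e.g. implied by a target duplex
    print(violated_constraints("GCAUGC", [(0, 4)], pairs))
    print(violated_constraints("GCAUGA", [(0, 4)], pairs))
```

A mutation that changes one nucleotide would typically have to be propagated to every index linked to it by such pairs so that the sequence stays feasible.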
Algorithm
- Ensemble focusing
- Hierarchical ensemble decomposition
- Multistate defect estimate from nodal contributions
- Leaf mutation
- Leaf reoptimization
- Subsequence merging, redecomposition, and reoptimization
- Test tube evaluation, refocusing, and reoptimization
- Constraint solving
- Structural defect weighting
As in test tube design, let Λ denote the set of all nodes in the forest. The result of leaf mutation is the set of leaf sequences, φΛD, corresponding to the lowest C̃D found.
Methods
Single complex and single test tube designs
The weighted complex ensemble defect and the weighted complex ensemble defect estimate replace their unweighted counterparts everywhere in the optimization algorithm.
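A minimal sketch of structural defect weighting follows, assuming that the complex ensemble defect is a sum of per-nucleotide defects and that weighting multiplies each nucleotide's contribution by a user-supplied weight. The matrix layout (an extra final column holding the unpaired probability) and the function names are assumptions made for illustration.

```python
def nucleotide_defects(pair_probs, target):
    """Per-nucleotide defect: 1 minus the probability of the target state.

    pair_probs[a][b] : equilibrium probability that nucleotide a pairs with b;
                       the final column (index N) is the unpaired probability
    target[a][b]     : 1 if the target structure has that state, else 0
    """
    n = len(pair_probs)
    return [
        1.0 - sum(pair_probs[a][b] * target[a][b] for b in range(n + 1))
        for a in range(n)
    ]


def weighted_complex_defect(pair_probs, target, weights):
    """Weighted sum of nucleotide defects; all weights equal to 1 recovers
    the unweighted complex ensemble defect."""
    defects = nucleotide_defects(pair_probs, target)
    return sum(w * d for w, d in zip(weights, defects))


if __name__ == "__main__":
    # Two nucleotides intended to pair with each other.
    probs = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.1]]
    target = [[0, 1, 0], [1, 0, 0]]
    print(weighted_complex_defect(probs, target, [1.0, 1.0]))
    print(weighted_complex_defect(probs, target, [1.0, 0.25]))
```

Lowering a weight, such as the value of 0.25 used on the exposed b∗ region in the example discussed in the conclusion, tells the optimizer that defects at those nucleotides matter less than defects elsewhere.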
Pathway designs
Each "Cross-target" tube optimizes against the binding of a hairpin from one system, X, to the input strands and the exposed tails of intermediates from the remaining systems, Y. The "Step 1" tube optimizes the first step of the reaction: binding of A-B to the input window Xs to form an intermediate B and waste Xs-A.
Implementation
The number of tubes, on-targets and off-targets for each design type is shown in Table 4.1. The number of instantiations is the number of orthogonal instantiations of the specified design type being designed.
Results
- Special-case comparisons to previous algorithms
- Design of nucleic acid reaction pathways
- Preventing sequence patterns
- Constraining content
- Weighting structural defects
The effect of these sequence content constraints on the quality and cost of design is shown in Figure 4.11. The effect of using structural weights is shown in Figure 4.13 for the Detect 1 target structure from a designed HCR system.
Conclusion
Panel b shows the target structure designed with nucleotide defect weights of 0.25 on the exposed branch migration region b∗. The complex designed with all weights equal to unity shows reduced base pairing in the exposed b∗ region, while the structure designed with reduced weights there maintains low nucleotide defects in the stem and in the toehold but allows some pairing in the exposed b∗ domain.
Appendix and archive content
- Overview
- Algorithm
- Methods
- Results
Base pair distance between full master equation and coarse-grained master equation simulations. The coarse-graining algorithm was tested on HCR sequences designed with the multistate sequence design algorithm presented in Chapter 4. The macrostates, stages, and master equation simulations for the HCR system are shown in Figure 5.6.
Large box coarse-graining
- Overview
- Algorithm
- Methods
- Results
A unimolecular reaction back to reacting species A (the previously detected reacting species) was detected, and a bimolecular self-reaction was detected that gives rise to a new reacting species, C (detected using the local equilibration criterion). Reacting species are selected for exploration based on their concentration during coarse-grained large box simulations.
Limitations
Only the enumerated reacting species that form at concentrations above 1 pM at some point during the simulation time frame are shown. The mass action simulation results were truncated at 100 s during the coarse-graining algorithm (shown by the dashed black line in panel c).
Conclusions
Appendix and archive content
This algorithm uses the same components as test tube design to efficiently optimize the pathways of interacting nucleic acid strands by designing initial, intermediate, and final states. A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of Computational Chemistry, October 2003.
Engineered test set generation
This appendix contains some algorithms that were either used as subroutines for the main part of the thesis, or were used as inspiration for this work. The computational cost of either method is low, but this algorithm is systematic and elegant.
Branch and propagate
Given the number of strand types k and a length n, this algorithm returns the set of all sequences of length n, treating all circular permutations as equivalent. The variable is instantiated to that value and all other values are removed from its domain.
Iterated local search
Typically, the random test set contains on-target structures with a lower fraction of paired nucleotides (panel a), more stems (panel b), and shorter stems (panel c) and a higher minimum cutoff (panel d). The target structures for the engineered test set, random test set, and large tube test set used for Figure 3.12 are provided as text files in the test tube directory of the supplementary archive.
Selection and use of multiple exclusive split-points
Ensemble decomposition using multiple exclusive split-points allows low-cost estimation of ensemble properties in some situations where individual split-points do not. From these graphs, we see that only a small proportion of parents use multiple exclusive split-points (panel a), but almost 30% of complexes (panel b) and 80% of models (panel c) include at least one parent with four children (and thus two split-points) for target lengths of 200 nt.
Algorithm performance for complex design
Sequence initialization
The initial GC content is depicted as a dashed line. d) The cost of the sequence design with respect to a single evaluation of the objective function.
RNA vs DNA design
The GC content used for seeding/reseeding is shown as dotted lines. d) Cost of sequence design relative to a single evaluation of the objective function. RNA design at 37 °C for the subset of the engineered test set with 100 nt on-targets.
Sensitivity of algorithm performance to design parameters
Single complex and test tube designs: random test set
The multistate design algorithm allows the specification of multiple target structures for a single complex, but the stopping conditions must be chosen carefully in this case. The two target structures are nearly isoenergetic to the designed sequence, and both ensemble defects are nearly 8 nt (panel b).
Language definition
- Structure definitions
- Sequence definitions
- Tube definitions
- Advanced sequence constraints
- Global parameters and options
The following productions define a single target structure using DU-plus notation or dot-parens-plus notation. Explicit external sequence constraints, complementary and identical domains, percent matches and library definitions can be defined using the following productions.
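For readers unfamiliar with dot-parens-plus notation, the following minimal sketch converts a structure string into a list of base pairs and strand break positions. It assumes the usual convention that `(` and `)` mark paired nucleotides, `.` marks an unpaired nucleotide, and `+` separates strands; the function name is hypothetical and this is not the parser defined by the productions.

```python
def parse_dot_parens_plus(structure):
    """Parse dot-parens-plus notation into base pairs and strand breaks.

    Returns (pairs, nicks), where pairs is a sorted list of 0-based
    (i, j) nucleotide index pairs with i < j, and nicks lists the index
    of the first nucleotide after each strand break.
    """
    stack, pairs, nicks = [], [], []
    index = 0                       # nucleotide index; '+' is not counted
    for ch in structure:
        if ch == "+":
            nicks.append(index)
        elif ch == "(":
            stack.append(index)
            index += 1
        elif ch == ")":
            if not stack:
                raise ValueError("unmatched ')' in structure")
            pairs.append((stack.pop(), index))
            index += 1
        elif ch == ".":
            index += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if stack:
        raise ValueError("unmatched '(' in structure")
    return sorted(pairs), nicks


if __name__ == "__main__":
    # Two 4-nt strands forming a 2-bp duplex with unpaired ends.
    print(parse_dot_parens_plus("((..+..))"))
```

The resulting pair list is the form in which a target structure is usually compared against equilibrium pair probabilities.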
Example design scripts
Simple examples
Python-generated scripts
Extending the design algorithm
In the current algorithm, we would first select molecule A, explore all unimolecular reactions and self-reactions, and find none. If fstop = 0.001, the coarse-grained algorithm would then stop, regardless of τmax, because there would be no current in Ju.
Archive content
Small box input files
This input file was used to compare the coarse-grained solution with the full master equation solution.
Large box input files
Exhaustive enumeration
- Loop types and polymer graph
- DNA AND gate mechanism
- Hybridization chain reaction mechanism
- Three-arm junction mechanism
- Cooperative AND gate mechanism
- Conditional dicer substrate mechanism
- Motivation for test tube design versus complex design
- Ensemble decomposition of a parent node using one or more split-points sandwiched by base pairs
- Hierarchical decomposition of a target structure
- Estimation of physical quantities from nodal contributions
- Structural features of the test set
- Test tube design algorithm performance
- The importance of off-target destabilization
- Performance of test tube ensemble defect estimation
- Efficiency implications of test tube ensemble focusing and hierarchical ensemble decomposition
- Robustness of design quality predictions to model perturbations
- Test tube design with competing on-target complexes
- Test tube design with large numbers of on- and off-target complexes
- Design objectives for HCR
- Design objectives for cooperative gates
- Design objectives for AND gates
- Design objectives for catalytic three-arm junction assembly
- Design objectives for conditional Dicer
- Multistate algorithm performance for single complex design on the engineered test set
- Multistate design performance
- Summary of one HCR design result
- Effect of preventing sequence patterns
- Effect of content constraints
- Effect of including structural defect weights
- Example designed with and without structural defect weighting
- Diagram of small box coarse-graining algorithm
- Centroid structure versus MFE structure
- Distance metric characterization
- Small box coarse-grained results versus exhaustively enumerated results
- Single strand coarse-grained results
- HCR small box coarse-graining
- Diagram of large box reacting species exploration
- Large box coarse-grained results for AND gate
- Large box coarse-grained results for cooperative gate
- Large box coarse-grained results for HCR
Loop composition of target structures
Extent of multiple split point usage
Test tube algorithm performance for complex design: engineered test set
Test tube algorithm performance for complex design: random test set
Effect of sequence initialization on algorithm performance
Effect of design material on algorithm performance
Test tube design sensitivity to fstop
Test tube design sensitivity to fstringent
Test tube design sensitivity to fstringent with fstop = 0.003
Test tube design sensitivity to Mreopt
Test tube design sensitivity to Munfavorable
Test tube design sensitivity to fredecomp
Test tube design sensitivity to frefocus
Test tube design sensitivity to fpassive
Test tube design sensitivity to fsplit
Test tube design sensitivity to Hsplit
Test tube design sensitivity to Nsplit
Multistate algorithm performance for single complex design on the random test set
Test tube design
Multiobjective design
The 'Cross-active 1' tube is designed against interactions involving the gate complex of one system. The 'Step 2' tube optimizes the binding of B to hairpin C, forming the dimer B-C.
Small box coarse-graining
Each newly discovered reacting species is added to the set of unexplored reacting species, Ju. Comparison of the current test tube design algorithm (solid lines) with the previously published single complex design algorithm [83] (dashed lines). a) Design quality.
Enumerating complexes (necklaces)