Without my father’s commitment to my education and the stories he told me about Caltech when I was growing up, I would never be here today.

I am grateful for the advice and opportunities you have given me, and I could not have asked for a better advisor. Through the application of molecular dynamics and modeling, I have studied insulin from several perspectives, including the incorporation of non-canonical amino acids and how these modifications may be responsible for this. Additionally, I have investigated how insulin behaves at the water-silica interface, a property that is critical for the effective delivery and administration of this therapeutic molecule.

I have helped develop a new computationally driven workflow for the integration of drug conjugates into antibody CDRs. Aiden Aceves implemented all the code used in the protein engineering parts of the paper and planned and performed experiments including data collection through analysis. Aiden Aceves also contributed to the writing of the paper, responding to reviewers and preparing revisions.

Aiden Aceves created the computational workflow to determine the optimal location for incorporation of small molecule conjugates and ran the workflow for the biotin proof-of-concept system published. Aiden Aceves parameterized and performed molecular dynamics simulations of the designed complex to help diagnose necessary stabilizing framework mutations.

Creation and refinement of enabling tools: VoxLearn

VoxLearn includes tools to take a .pdb file and convert it into an atomic dictionary. Users can augment this data with their own additional channels, such as descriptors derived from molecular mechanics force fields. To capture 3D spatial information, a 4D tensor format was adopted, where the first three dimensions are the (x, y, z) coordinates of the protein and the last dimension holds the features associated with that voxel, such as atomic identity or force-field-derived descriptors. A neural network is then built to predict the label(s) from the voxelized data.
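The voxel encoding described above can be sketched as follows. This is a minimal illustration, not VoxLearn's actual code: the function name `voxelize`, the `CHANNELS` mapping, the box size, and the resolution are all assumptions chosen for the example.

```python
import numpy as np

# Hypothetical element-to-channel mapping; VoxLearn's real channels may differ.
CHANNELS = {"C": 0, "N": 1, "O": 2, "S": 3}

def voxelize(atoms, box_size=16, resolution=1.0):
    """Map (element, x, y, z) records onto a (box, box, box, channels) grid."""
    grid = np.zeros((box_size, box_size, box_size, len(CHANNELS)), dtype=np.float32)
    coords = np.array([a[1:] for a in atoms], dtype=np.float32)
    # Centre the structure in the box so the result does not depend on
    # the absolute coordinates in the input file.
    centered = coords - coords.mean(axis=0) + box_size * resolution / 2.0
    indices = np.floor(centered / resolution).astype(int)
    for (element, *_), (i, j, k) in zip(atoms, indices):
        channel = CHANNELS.get(element)
        inside = all(0 <= v < box_size for v in (i, j, k))
        if channel is not None and inside:
            grid[i, j, k, channel] += 1.0  # occupancy count per element
    return grid

# Three made-up atoms standing in for a parsed .pdb file.
atoms = [("C", 0.0, 0.0, 0.0), ("N", 1.5, 0.0, 0.0), ("O", 0.0, 1.5, 0.0)]
tensor = voxelize(atoms)
```

Additional channels, such as force-field descriptors, would simply extend the last axis of the tensor.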

Processed tensors can be combined as one very large file to be read into a machine learning framework, or loaded incrementally with a generator function. Additional features have been added to this library over time, including the ability to parse and voxel-encode small molecule file formats, updated generator functions to work with the latest version of TensorFlow, and a number of neural network templates that have been adapted for protein technology. I have also used this package to train three undergraduate students in the basics of neural networks and in contributing to Git repositories.
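Incremental loading with a generator function might look like the sketch below. The names `batch_generator`, `load_sample`, and `fake_loader` are hypothetical and are not taken from VoxLearn; the loader here fabricates small tensors in memory where a real one would read a voxelized file from disk.

```python
import numpy as np

def batch_generator(sample_ids, load_sample, batch_size, seed=0):
    """Yield (features, targets) batches indefinitely, reshuffling each epoch."""
    rng = np.random.default_rng(seed)
    while True:
        order = rng.permutation(len(sample_ids))
        for start in range(0, len(order), batch_size):
            # Only the samples of the current batch are held in memory.
            chunk = [load_sample(sample_ids[i]) for i in order[start:start + batch_size]]
            features = np.stack([x for x, _ in chunk])
            targets = np.array([y for _, y in chunk], dtype=np.float32)
            yield features, targets

def fake_loader(sample_id):
    # Stand-in for reading one voxelized tensor and its label from disk.
    return np.full((4, 4, 4, 2), sample_id, dtype=np.float32), float(sample_id)

gen = batch_generator(list(range(5)), fake_loader, batch_size=2)
first_x, first_y = next(gen)
```

With five samples and a batch size of two, each epoch yields batches of two, two, and one sample before the order is reshuffled.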

We have also created a separate version of the package owned by Novartis and used by the modeling and chemical informatics groups.

Hello world: Experiments with a model system

The remarkable network that drove deep learning forward was the winner of the 2012 ImageNet Large Scale Visual Recognition Challenge, known as AlexNet [7]. This deep convolutional neural network (CNN), trained and evaluated with more than 10 million labeled images corresponding to more than 20,000 categories, showed that GPUs significantly speed up network training and that CNNs are easier to train because their sparser connections between nodes leave fewer parameters to fit. Several state-of-the-art networks have since implemented unique, cross-disciplinary architecture designs. However, note that the Georgiev encoding of the protein could not adequately capture the chromophore.

The predictions of the best-performing encoding type for each linear regression on the dataset reveal a weak correlation, with R² values of 0.2324 and 0.2478. Values are sorted by distribution. (a) A region of high data-point density is observed at x = 0.4.

The unexpected correlation values suggest that the data may not have been sufficiently diverse.

Figure 1. Crystal structure to 4D tensor data generation workflow. (a) Stick representation of eqFP650 (PDB ID: 4EDO, generated with Chimera), one of the 130 proteins in our dataset.

Low data regimes - Transfer learning and Siamese networks: Protein solubility and serum albumin binding

As part of this project, we elucidated key features for the model to learn well (aside from tuning model hyperparameters), including the number of pairs, the number of increments of the data, and the method of generating conformers. Because the number of pairs of data points to be compared rapidly expands as a function of . Spearman's rho was used to evaluate the reliability of the predicted solubility; the model achieved a rho of 0.62 (Figure 1).
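For reference, Spearman's rho is the Pearson correlation of the ranks of two variables. The sketch below is a minimal NumPy version that assumes no tied values; the function name `spearman_rho` and the toy numbers are illustrative and are not data from this work.

```python
import numpy as np

def spearman_rho(predicted, measured):
    """Spearman rank correlation; assumes no tied values in either input."""
    # argsort of argsort converts values into 0-based ranks.
    pred_rank = np.argsort(np.argsort(predicted))
    true_rank = np.argsort(np.argsort(measured))
    return float(np.corrcoef(pred_rank, true_rank)[0, 1])

# Made-up predicted scores and measured solubilities for four proteins.
predicted = [0.9, 0.1, 0.5, 0.7]
measured = [10.0, 2.0, 8.0, 5.0]
rho = spearman_rho(predicted, measured)
```

Because only ranks enter the calculation, the metric rewards a model for ordering proteins correctly even when its absolute estimates are off.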

In the graph, we tested our best model and then calculated its estimate of the proteins' solubility based on its prediction. Examining these results, our first observation is that the transferability of the model is highly dependent on the choice of conformer generation method. In particular, Open Babel outperformed its paid competitor ChemAxon on two of the three datasets listed above.

In examining the experimental details of the six data sets, it is not immediately clear what might affect the performance of the transferred model. Of the three data sets that did not achieve statistically significant predictions with the superimposed model using Schrödinger conformers, two were not determinations of binding to human serum albumin, but rather to bovine (in vitro) and porcine (in vivo) serum albumin.

Figure 1: Predicted rank vs. true rank for proteins. We tested our best model and then calculated its estimate of the proteins’ solubility based on its prediction.

Leveraging existing datasets: Learning to active learn with submodular regularization

We provide a rigorous analysis of the proposed algorithm and prove strong performance guarantees for the learned objective. For example, in the case of maximizing a monotone submodular function subject to a cardinality constraint, the greedy algorithm is known to achieve an approximation ratio of (1 − 1/e) relative to the optimal solution (Nemhauser et al., 1978). We first consider the loss margin of the learned policy measured against the expert policy.
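The greedy algorithm referred to above can be illustrated on a small coverage problem, since coverage is a standard example of a monotone submodular function. This is a generic sketch of the classical greedy rule, not the learned policy studied in this chapter; the function name and the toy subsets are assumptions.

```python
def greedy_max_coverage(subsets, budget):
    """Pick at most `budget` subsets, each time taking the largest marginal gain."""
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(
            (name for name in subsets if name not in chosen),
            key=lambda name: len(subsets[name] - covered),
            default=None,
        )
        if best is None or not subsets[best] - covered:
            break  # no remaining subset adds new elements
        chosen.append(best)
        covered |= subsets[best]
    return chosen, covered

# Toy instance: choose two subsets that cover as many elements as possible.
subsets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}, "D": {1, 7}}
chosen, covered = greedy_max_coverage(subsets, budget=2)
```

On this instance the greedy choice happens to cover every element, but in general the (1 − 1/e) factor is only a worst-case guarantee.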

We then provide a bound on the expected value of the learned policy, measured against the value of the optimal policy. This is an important finding, since the method proposed by Liu et al. (2018) without LISA has been shown to be a state-of-the-art method for imitation learning.

The high-level idea is to first connect the total expected utility of the learned policy π̂ with the expected utility of the expert policy π_exp, according to the analysis in DAgger (Ross et al., 2011). As a result, the variance of the total noise is linear in the number of sets. The implementation of this part of the project is almost identical to that of Liu et al.

When lambda takes the value 0.01, the magnitude of the (scaled) regularization term (represented by the blue bar) best matches the magnitude of the cross-entropy loss (represented by the orange bar). The Tversky loss function (Salehi et al., 2017), a weighted modification of the Dice similarity coefficient, is usually written as shown in Figure 6. Regarding the literature, we found this to be a relatively immature area of research (a good overview of the current state of the art is presented in Worrall et al., 2018).
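As a concrete reference for the Tversky loss, the sketch below computes one minus the Tversky index, TP / (TP + alpha·FP + beta·FN), from soft predictions. The default weights and the smoothing constant `eps` are illustrative assumptions rather than the settings used in this work; setting alpha = beta = 0.5 recovers the Dice coefficient.

```python
import numpy as np

def tversky_loss(probs, targets, alpha=0.3, beta=0.7, eps=1e-7):
    """1 - Tversky index; alpha weights false positives, beta false negatives."""
    probs = np.asarray(probs, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    tp = np.sum(probs * targets)          # soft true positives
    fp = np.sum(probs * (1.0 - targets))  # soft false positives
    fn = np.sum((1.0 - probs) * targets)  # soft false negatives
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return float(1.0 - index)

# One true positive, one false positive, one false negative.
dice_like = tversky_loss([1, 1, 0, 0], [1, 0, 1, 0], alpha=0.5, beta=0.5)
```

Choosing beta larger than alpha penalizes missed positives more heavily than false alarms, which is the usual motivation for preferring this loss on imbalanced labels.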

The concentrations of the conjugates were then determined by BCA assay using commercially available kits (Thermo).

Analysis of the interaction interface in the triplicate MD simulations of 4NBX.B-biotin103 v186_Fr against mSAWT.

Figure 1: Evaluating LeaSuRe against baselines on set cover instances

Design process of 4NBX.B-derived nanobody-biotin conjugates

To prepare the final conjugation structure, excess atoms were removed and a bond was made between the Cβ of the biotin-CH2-CH2-. Insulin lispro, the first marketed rapid-acting insulin (Holleman et al., 1997), disfavors the association of subunits into higher-order oligomers by exchanging Proline B28 and Lysine B29 near the C-terminus of the B-chain (Bakaysa et al., 1996). Because the noncanonical prolines that are incorporated are all at the C-terminus of the insulin B-chain, we focused several of our analyses of the systems on them.

Using the monomeric simulations described previously, we measured the solvent accessible surface area of the last four residues of the B chain, frame by frame, using VMD (Humphrey et al., 1996). To begin our analysis, we simulated each of the insulin varieties using standard minimization, equilibration, and production protocols. These selections were made by viewing the molecular dynamics trajectories of the protein monomers, and observing that these segments bookend the relatively stationary helix between them in the B chain.

The RMSD of each terminus was then calculated using the position of the backbone atoms compared to the reference structure. While considering the mean values of these properties produced indistinguishable results, the correlation matrices of the solvent. These sheets are placed in fully periodic cells measuring 75 × 75 × 90 Å, with the quartz plate placed perpendicular to the long axis of the box, and exactly in the center of the system (Figure 26).
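The backbone RMSD calculation mentioned above can be sketched as follows. The sketch assumes every frame has already been superimposed on the reference structure and that coordinates are stored as a NumPy array of shape (frames, atoms, 3); the function name is hypothetical, and the analysis in this work may have used a different tool.

```python
import numpy as np

def rmsd_per_frame(trajectory, reference):
    """RMSD of each frame against a reference, for pre-aligned coordinates."""
    trajectory = np.asarray(trajectory, dtype=np.float64)  # (frames, atoms, 3)
    reference = np.asarray(reference, dtype=np.float64)    # (atoms, 3)
    squared = ((trajectory - reference) ** 2).sum(axis=2)  # per-atom squared distance
    return np.sqrt(squared.mean(axis=1))                   # one value per frame

# Toy data: two backbone atoms, two frames; the second frame is shifted by (3, 4, 0).
reference = np.zeros((2, 3))
trajectory = np.stack([reference, reference + np.array([3.0, 4.0, 0.0])])
rmsd_values = rmsd_per_frame(trajectory, reference)
```

Restricting the atom selection to one terminus before calling the function gives the per-terminus RMSD series.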

We first validate the size of our simulation system by assessing the formation of different water layers on the surface of the quartz slab. To characterize the position of the protein relative to the quartz surface, the Z coordinate of the geometric center of all protein atoms was calculated and plotted for the duration of the simulations (Figures 30, 31). In none of these simulations does insulin come into contact with the quartz surface, despite an average maximum displacement in this system of 35 angstroms.
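Tracking the protein's position relative to the slab reduces to averaging the Z coordinates of all protein atoms in each frame. The sketch below assumes coordinates stored as a (frames, atoms, 3) array and a flat surface at a known height `surface_z`; both the function name and that parameter are assumptions made for illustration.

```python
import numpy as np

def center_z(trajectory):
    """Z coordinate of the geometric center for each frame of (frames, atoms, 3)."""
    return np.asarray(trajectory, dtype=np.float64)[:, :, 2].mean(axis=1)

# Toy trajectory: two atoms, two frames (coordinates in angstroms).
trajectory = [
    [[0.0, 0.0, 10.0], [0.0, 0.0, 20.0]],
    [[0.0, 0.0, 30.0], [0.0, 0.0, 40.0]],
]
z_series = center_z(trajectory)

surface_z = 5.0  # hypothetical height of the quartz surface
closest_approach = float(np.min(np.abs(z_series - surface_z)))
```

Plotting `z_series` against time gives the kind of trace described above, and `closest_approach` summarizes how near the protein center came to the surface.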

To appropriately represent the protonation state of the surface at a pH of 7.4, we randomly deprotonated 51 silanols and added sodium counterions to the quartz surface. These plots are shown in Figures 36-38, and allow for a relatively easy view of when aromatic rings are near cations on the surface of the quartz plate. We filtered for frames in which these rings are within 6 Å of the surface to identify when a cation-pi interaction is possible.
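The 6 Angstrom proximity filter can be expressed as a boolean mask over frames. This sketch treats the surface as a flat plane at height `surface_z` and uses the Z coordinate of each ring center. Both are simplifying assumptions, as are the function name and the toy values; the actual analysis may instead have measured distances to individual surface cations.

```python
import numpy as np

def frames_near_surface(ring_z, surface_z, cutoff=6.0):
    """Mask of frames where the ring center is within `cutoff` angstroms of the surface plane."""
    return np.abs(np.asarray(ring_z, dtype=np.float64) - surface_z) <= cutoff

# Made-up ring-center heights for five frames, with the surface at z = 0.
ring_z = [20.0, 12.0, 5.5, 3.0, 9.0]
mask = frames_near_surface(ring_z, surface_z=0.0)
fraction_near = float(mask.mean())
```

The mean of the mask is the fraction of the simulation spent near the surface, which is the kind of quantity compared across replicates.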

This revealed that the ends of the proteins are by far the most likely to be in close proximity to the quartz surface, most likely the N-terminus of chain B. In all three replicates of lispro and in one replicate each of aspart and wild-type insulin, this region spends more than 10% of the simulation in close proximity to the quartz surface.