Emul

1. User interface. Emul is started by simply typing "emul-.".⁶ The user interface begins by asking for the name of the preferences file. The default name of this file is always "nmr.prf". A preferences file is read by each of the programs in the package. Preferences files contain instructions for the operation of all of the programs; a user may customize the package by writing a special preferences file.

The next name Emul asks for is that of the text input file. This is the file containing the description of the experimental design and all of the experimental observations. This request is followed immediately by a request for the name of the output file. The default output file name is taken from the chosen input file name, and is given the extension specified in the preferences file. The user may either accept the default file name by typing a carriage return (-.), or may enter a different name. The user is asked for the name of the text output file only; the names of the other files written by Emul are automatically based on this file name. All of the output files (text, residuals, simulator input, and the optional data summary file) will have the same name, with extensions as specified by the preferences file. The programs do not check whether files of the given names already exist; if they do exist, they will be overwritten without notice.
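The naming rule described above can be sketched in Python. This is only an illustration: the function names `default_text_output` and `derive_output_names` and the dictionary keys are hypothetical, the extensions are those of the sample preferences file discussed later, and it is an assumption that the extension of the text output name is replaced rather than appended to.

```python
from pathlib import Path

# Extensions taken from the sample preferences file; the keys are
# illustrative labels, not names used by the program itself.
EXTENSIONS = {
    "text": "eml",
    "residuals": "res",
    "simulator_input": "bri",
    "data_summary": "dsm",
}


def default_text_output(input_name, extension="eml"):
    """Default text output name: the input file's stem plus the preferred extension."""
    return f"{Path(input_name).stem}.{extension}"


def derive_output_names(text_output_name, extensions=EXTENSIONS):
    """Return all output file names sharing the stem of the text output file.

    Assumption: the extension typed by the user is replaced, not appended to.
    """
    stem = Path(text_output_name).stem
    return {kind: f"{stem}.{ext}" for kind, ext in extensions.items()}


names = derive_output_names(default_text_output("nmr.in"))
```

Because existing files are overwritten without notice, a wrapper script could check `Path(name).exists()` for each derived name before starting a run.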

Emul then asks for the name of the error bars file. This file contains general information about the sizes of the experimental errors. All of this information may be superseded by specific information in the input file, but an error bars file must nonetheless be specified. The next information requested is a header for the text output file. This is optional; it is merely an opportunity for the user to type a line describing the experiment at the top of the text output file. Emul asks next if it should create a data summary file. If the answer is yes, then it writes a file detailing its internal representation of the experiment, including all assumed measurement uncertainties and calculations for propagating them.

(6) Hereafter, the symbol "-." will be used to denote a carriage return.

The program is finally ready for an initial guess of the binding constant. This is the kick it needs to begin adjusting the model parameters to fit the experimental data. No more information from the user is required for the rest of the run. When the run is finished, the program will report the names of all the files it has generated.

2. Input files.

(a) The preferences file. This file contains information, some of it obscure, for the operation of all of the programs in the package. Not every entry is meaningful to every program. I will attempt to explain this file by presenting a listing of a sample preferences file and describing every entry in turn. The discussion will occasionally be technical and incomprehensible; some of these values control inner workings of the program that the user never sees directly. When the entry is a numerical value, the name in the source code of the variable to which the value is assigned is given in italics; in order to fully understand the function of some of these variables, consulting the source code may be necessary. For the most part, however, there should be no need to delve into the inner workings of the procedures. The values presented in this example file should permit satisfactory performance of the programs under most conditions.

5300 This quantity, submax, is the number of replications to perform in a Monte Carlo simulation study of an NMR titration experiment. This number has no meaning for Emul, but it is used by Lucius, Portia, and Brutus. The particular number "5300" is the number of simulations required for Brutus to be able to find 99% parameter confidence limits.

Submax cannot exceed 100,000.

51 This quantity, spread, is the smoothing factor for determining empirical probability densities from the empirical cumulative distribution functions of the parameter estimates from Lucius, Portia, and the first run of Brutus. These densities are calculated by approximating the derivative of the c.d.f.; the density at the ith parameter value is approximated as the slope of the line drawn between the (i − l)th (parameter, c.d.f.) point and the (i + l)th (parameter, c.d.f.) point. Spread = 2l + 1. It must therefore be a positive odd number.
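The slope rule for the empirical density can be sketched as follows. This is a minimal illustration, not the code used by Lucius, Portia, or Brutus: the function name `empirical_density` is hypothetical, and it is an assumption that the empirical c.d.f. rises by 1/n at each of the n sorted parameter estimates.

```python
def empirical_density(estimates, spread):
    """Approximate the density at each sorted estimate from the empirical c.d.f.

    The density at point i is the slope of the line between the (i - l)th and
    (i + l)th (parameter, c.d.f.) points, where spread = 2l + 1.
    """
    if spread < 1 or spread % 2 == 0:
        raise ValueError("spread must be a positive odd number")
    half = (spread - 1) // 2                 # this is l in the text
    xs = sorted(estimates)
    n = len(xs)
    cdf = [(k + 1) / n for k in range(n)]    # assumption: steps of 1/n
    density = []
    for i in range(half, n - half):
        dx = xs[i + half] - xs[i - half]
        if dx > 0:                           # skip ties, where the slope is undefined
            density.append((xs[i], (cdf[i + half] - cdf[i - half]) / dx))
    return density
```

Points closer than l positions to either end of the sorted list have no symmetric neighbours and are simply omitted in this sketch; how the actual programs treat the ends is not stated in the text.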

nmr.in This is the default name and extension for the Emul input file.

nmr.bri This is the default name for the binary-format simulator input file. The actual file created by Emul will have a file name dictated by the name of the Emul text output file, but the extension will always be the extension given here.

nmr.lim This is the default name and extension for the confidence limits file read by Lucius, Portia, and Brutus.

eml This is the default extension of the Emul text output file.

bru This is the default extension of the Brutus text output file.

luc This is the default extension of the Lucius text output file.

por This is the default extension of the Portia text output file.

res This is the extension of the Emul tabular residuals file.

lpm This is the extension of the parameter scatter file from Lucius or Brutus.

ldn This is the extension of the parameter distribution file from Lucius or Brutus.

ppm This is the extension of the parameter scatter file from Portia.

pdn This is the extension of the parameter distribution file from Portia.

nmr.err This is the default name and extension of the error bars file read by Emul.

dsm This is the extension for the data summary file written by Emul.

0.6 This quantity, intwidth, affects the interpolated confidence function in Brutus. It is a scaling factor for the formula that assigns weights to each point (see the discussion for share). This variable acts as a smoothing factor; if it is small, the value of the interpolated function is determined almost entirely from the closest point; if it is large, the contributions from more distant points become more significant.

0.1 This quantity, share, is also a parameter for the spline interpolation function used by Brutus. The y-value of the interpolated C* function is determined at an arbitrary x-value $X$ by a weighted average of all the y-values in the set. Points whose x-values are close to $X$ are weighted more heavily than those that are farther away. The actual weighting factor for each point is proportional to $e^{-d_i^2}$, where $d_i$ is the distance from $X_i$ to $X$. It is not a simple arithmetic difference $(X - X_i)$, however. The set of points to be interpolated is unevenly spaced, and the spline would resemble a step function between points that are far apart. (Figure 10 shows an example of just such behavior, tamed greatly by these parameters.) To correct for this somewhat, an alternative distance measure is also calculated; this is proportional to the number of points between $X$ and point $i$. Let us call this measure $w$, and the measure proportional to the arithmetic difference let us call $v$. The actual weighting factor for point $i$ is $e^{-[\mathrm{share} \times v^2 + (1 - \mathrm{share}) \times w^2]}$. Acceptable values are $0 \le \mathrm{share} \le 1$.
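The weighted-average interpolation described for share and intwidth might look roughly like the sketch below. Several details are assumptions made only for illustration and are not taken from Brutus: the weight is taken to be exp(−d²) with d² = share × v² + (1 − share) × w², the arithmetic distance v is normalized by the x-range of the points, the count-based distance w is normalized by the number of points, and intwidth is applied by dividing the distance by it. The function name `interpolate` is hypothetical.

```python
import math


def interpolate(points, x, share=0.1, intwidth=0.6):
    """Weighted average of the y-values of (x, y) points, evaluated at x."""
    xs = [p[0] for p in points]
    x_range = (max(xs) - min(xs)) or 1.0
    n = len(points)
    numerator = 0.0
    denominator = 0.0
    for xi, yi in points:
        v = (x - xi) / x_range                 # assumed scaling of the arithmetic distance
        lo, hi = min(x, xi), max(x, xi)
        between = sum(1 for xk in xs if lo < xk < hi)
        w = between / n                        # assumed scaling of the point-count distance
        d_squared = (share * v * v + (1.0 - share) * w * w) / intwidth ** 2
        weight = math.exp(-d_squared)          # closer points receive larger weights
        numerator += weight * yi
        denominator += weight
    return numerator / denominator
```

With these scalings a smaller intwidth makes the weights fall off faster, so the nearest points dominate, which matches the qualitative behavior described in the intwidth entry.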

0.01 This quantity, a1, is a significance cutoff specifying how close to the desired confidence values (say, 0.05 and 0.95 if the 90% confidence region is sought) the empirical C* estimates at the sampled parameter values bracketing the eventual confidence limit estimate must fall. For example, if 5000 replications are performed in a Monte Carlo study, the 99% (that is, 1 − a1) confidence limits for the actual probability of success that returns 25 hits (0.500%) are 0.300% and 0.831%. Thus, the estimate of the parameter value that gives a 0.5% probability of success must be bracketed on the low side by a sampled value that experienced between 15 and 25 hits, and on the high side by a sampled value that experienced between 25 and 41 hits. Allowed values are 0 < a1 < [upper bound missing in source].

0.01 This quantity, a2, is the significance cutoff for Brutus to report confidence limits for the parameter confidence limits in its output files. It is strictly for the edification of the user; this number does not affect the program run in any way. Allowed values are the same as for a1.

2000 This quantity, maxit, is the maximum number of iterations the Levenberg-Marquardt SSR minimization procedures will perform without converging before giving up.

0.0001 This quantity, concrit, is the convergence criterion for the Levenberg-Marquardt minimization procedures. If SSR is improved by less than this fractional amount twice in a row, the routine considers itself converged. Allowed values are 0 < concrit < [upper bound missing in source].

1.0e100 This quantity, maxlam, is the maximum allowable value for λ, a parameter in the Levenberg-Marquardt minimization procedures whose size corresponds to the inability of the quadratic approximation to improve SSR. If λ gets this big, the procedure gives up without converging.

0.001 This quantity, upcrit, is the tolerance for a slightly worse SSR in successive L.-M. steps. If the proportional increase in SSR is less than this value × the convergence criterion, the procedure does not count that step as an actual worsening. This exists to counteract an observed tendency of these procedures to fail to converge in perfectly acceptable regions of parameter space. Allowed values: 0 ≤ upcrit < 1.

[value missing in source] This quantity, maxconsecups, is the number of consecutive failures to improve SSR that the L.-M. procedures will tolerate before considering themselves converged. If a parameter guess happens to be very close to the global minimum of the SSR surface, it is possible that the procedure will never be able to get any closer. This criterion allows it to consider itself converged in such cases.

10 This quantity, laminc, is the factor by which λ is multiplied if SSR increases in a L.-M. step.

0.1 This quantity, lamdec, is the factor by which λ is multiplied if SSR decreases in a L.-M. step.
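To show how these preferences interact, here is a minimal one-parameter Levenberg-Marquardt loop. It is a sketch under stated assumptions, not the procedure in the package: the function name `lm_fit`, the starting value of λ, and the default `maxconsecups=5` are all hypothetical (the sample value for maxconsecups is not legible in the listing), the names `laminc` and `lamdec` stand for the λ-increase and λ-decrease factors (10 and 0.1 in the sample file), and the upcrit tolerance is omitted for brevity.

```python
import math


def lm_fit(residuals, jacobian, p0, maxit=2000, concrit=1.0e-4,
           maxlam=1.0e100, laminc=10.0, lamdec=0.1, maxconsecups=5):
    """Fit a single parameter; returns (parameter, SSR, converged)."""
    p = p0
    lam = 1.0e-3                         # assumed starting value of lambda
    r = residuals(p)
    ssr = sum(ri * ri for ri in r)
    small_gains = 0                      # accepted steps in a row that gained < concrit
    consec_ups = 0                       # consecutive failures to improve SSR
    for _ in range(maxit):
        jac = jacobian(p)
        jtj = sum(j * j for j in jac)
        jtr = sum(j * ri for j, ri in zip(jac, r))
        if jtj == 0.0:
            return p, ssr, False         # no gradient information to work with
        step = -jtr / (jtj * (1.0 + lam))
        r_new = residuals(p + step)
        ssr_new = sum(ri * ri for ri in r_new)
        if ssr_new < ssr:
            gain = (ssr - ssr_new) / ssr  # fractional improvement in SSR
            p, r, ssr = p + step, r_new, ssr_new
            lam *= lamdec
            consec_ups = 0
            small_gains = small_gains + 1 if gain < concrit else 0
            if small_gains >= 2:
                return p, ssr, True
        else:
            lam *= laminc
            consec_ups += 1
            if consec_ups >= maxconsecups:
                return p, ssr, True      # cannot get any closer: treat as converged
            if lam > maxlam:
                return p, ssr, False     # give up without converging
    return p, ssr, False


# Usage: recover the decay constant of y = exp(-0.5 t) from noise-free data.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
observed = [math.exp(-0.5 * t) for t in times]
k_fit, ssr_fit, converged = lm_fit(
    lambda k: [math.exp(-k * t) - y for t, y in zip(times, observed)],
    lambda k: [-t * math.exp(-k * t) for t in times],
    p0=1.0,
)
```

The example shows the role of each cutoff: concrit and maxconsecups end a run as converged, while maxit and maxlam end it as a failure.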

44 This is the ASCII code for the column-delimiting character in the tabular data files. 44 is the code for a comma; 9 is the code for a horizontal tab. Since horizontal tabs are transferred to the Macintosh from the IRIS by Versaterm-Pro as spaces, it is necessary to specify some other character. Commas are a good choice, because Kaleidagraph can be told to recognize them as column markers.
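For illustration, the delimiter entry maps onto a character as in the short Python sketch below. The function name `write_table` is hypothetical, and Python's standard `csv` module merely stands in for whatever output routine the programs actually use.

```python
import csv
import io


def write_table(rows, delimiter_code=44):
    """Format rows as delimited text, using the ASCII code from the preferences file."""
    delimiter = chr(delimiter_code)      # 44 is a comma, 9 is a horizontal tab
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


table = write_table([[1, 2.5], [3, 4]])
```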

15 When Brutus tries to find the parameter value X that gives a probability C*(X) equal to some desired confidence value, it performs a simulation at the parameter value that is the best current guess of X. This is determined from the interpolated confidence function: the x-value of the interpolated function where y is equal to the desired confidence value is this best guess. Unfortunately, the nonparametric spline is set up so that it is easy to determine the value of y at a given x, but difficult to determine x from a given y. This is the problem of finding the root of an equation. Numerical methods for finding roots involve either iterative narrowing of a region known to contain the root, or iterative improvement of an estimate of the root. The first method employed in Brutus is bisection, a member of the former class; after a number of bisection steps have been performed, the method of false position, a member of the second class, is employed.⁷ The entry on this line is bisitmax, the maximum number of bisection steps to perform.

30 This is fispitmax, the maximum number of false-position iterations.

1.0e-5 This is the precision required for convergence of the root-finding procedures.
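The two-stage root search controlled by these three entries can be sketched as follows. This is an illustrative reconstruction rather than the routine in Brutus: the function name `find_root` is hypothetical, and the way the precision is tested (interval width during bisection, function value during false position) is an assumption. In Brutus the function `f` would presumably be the interpolated C*(x) minus the desired confidence value.

```python
def find_root(f, lo, hi, bisitmax=15, fispitmax=30, tol=1.0e-5):
    """Find x in [lo, hi] with f(x) = 0: bisection first, then false position."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if f_lo * f_hi > 0.0:
        raise ValueError("the interval does not bracket a root")

    # Stage 1: bisection, at most bisitmax steps.
    for _ in range(bisitmax):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0.0 or abs(hi - lo) < tol:
            return mid
        if f_lo * f_mid < 0.0:
            hi, f_hi = mid, f_mid
        else:
            lo, f_lo = mid, f_mid

    # Stage 2: false position, at most fispitmax steps.
    x = 0.5 * (lo + hi)
    for _ in range(fispitmax):
        x = lo - f_lo * (hi - lo) / (f_hi - f_lo)
        f_x = f(x)
        if abs(f_x) < tol:
            return x
        if f_lo * f_x < 0.0:
            hi, f_hi = x, f_x
        else:
            lo, f_lo = x, f_x
    return x


root = find_root(lambda x: x * x - 2.0, 0.0, 2.0)
```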

0.8 This quantity, squeeze, affects the root-finding procedures in Brutus by helping to find x-values of the interpolated function that bracket the desired values. If the current guess for one of the brackets (high or low) is on the wrong side of the root, squeeze tells the program how far to move it away. If squeeze = 1, the guess for the bracketing value will not move at all; if squeeze = 0, the guess will be moved all the way out to the most distant sampled point. Intermediate values cause intermediate displacements. Allowable values are 0 ≤ squeeze < [upper bound missing in source].

0.8 This quantity, limweight, tells Brutus what y (that is, C*) value to aim for when it tries to closely bracket a desired limit. The extreme values that the brackets must fall within to have "found" the limit are defined by a1; limweight tells how far inside these extreme values Brutus should attempt to sample. If limweight = 1, then Brutus will always shoot for the most extreme value; if limweight = 0.5, then it will try to sample an x that will return a y exactly halfway between the extreme value and the desired limit. Allowable values are 0.5 ≤ limweight ≤ 1.

0.85 When Brutus tries to sample the parameter values that will bracket some desired confidence limit, it estimates the necessary parameter values from the current interpolated C* function. If the sampled points defining this function are widely or unevenly spaced, the function is often unreliable.

(7) Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; Vetterling, W. T. Numerical Recipes: The Art of Scientific Computing; Cambridge University: New York, 1986; pp 243-251.

Such problems are especially acute when the y-value of one of the bracketing points is much closer to the desired confidence value than is the y-value of the other bracketing point. The empirical solution to this problem is to narrow down the interval. The entry in this line, maxlopside, defines how lopsided the bracketing must be to cause Brutus to disregard the interpolated function. In this example, maxlopside = 0.85, so corrective action will be called for if the difference between the desired confidence value and the y-value of the most distant point is more than 85% of the total height of the interval. If this happens, the next sampled parameter (x) value will not be determined from the interpolated C* function. Instead, it will be the parameter value that is 85% across the interval between the x-values of the bracketing points. This is illustrated in Figure 11. If maxlopside is set to 0.5, Brutus will never choose parameter values from the interpolated function, but instead will simply bisect the interval enclosing the desired confidence value. Allowed values are 0.5 ≤ maxlopside ≤ 1.
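The lopsidedness rule can be written as a small decision function. The sketch below is illustrative only: the name `next_sample_x` and its argument names are hypothetical, and it is an assumption that "85% across the interval" is measured from the bracketing point whose y-value is farther from the desired confidence value.

```python
def next_sample_x(x_a, y_a, x_b, y_b, target, interpolated_x, maxlopside=0.85):
    """Choose the next parameter value to simulate between two bracketing points."""
    height = abs(y_b - y_a)                          # total height of the interval
    far = max(abs(target - y_a), abs(target - y_b))  # distance to the most distant point
    if height == 0.0 or far <= maxlopside * height:
        return interpolated_x                        # bracketing is balanced enough
    # Lopsided: step maxlopside of the way from the distant point toward the near one.
    if abs(target - y_a) > abs(target - y_b):
        return x_a + maxlopside * (x_b - x_a)
    return x_b + maxlopside * (x_a - x_b)
```

For example, with bracketing points (0, 0.0) and (10, 1.0) and a desired value of 0.95, the distant point is the first one, so the sketch returns 8.5 instead of the interpolated guess.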

200 This quantity, morenmr, is the number of points of the interpolated confidence function that Brutus includes in its confidence function files.

Figure 11. In some cases, the interpolated confidence function C* (dashed line) has undesirable behavior between sampled points. In this illustration, the desired confidence limit S is between the sampled C* values obtained by simulation about the parameter values θ₁ and θ₂, but it is much closer to one of these values than to the other. Instead of using the interpolated parameter value θ′, the procedure performs its next simulation about the parameter value θ₃, which is 85% across the interval between the sampled points. This gives a better subsequent interpolated function in the region of interest.

(b) Emul input and error bars files. Before describing the format of these files, it is appropriate to explain how these programs represent a binding study. This will help to justify the input file format.

(i) Internal representation of the binding study. In order for the fitting and simulation programs to properly model a binding titration experiment, they must be able to internally represent the study. Since these experiments may follow a wide range of procedures, the internal representation must be very adaptable.

Although it is easy for humans to understand different procedures, computers, which are not as flexible, must follow a specific format. The challenge to the programmer, then, is to develop a format general enough to accommodate any feasible experiment. I hope that the current format fulfils these criteria.

All experiments are assumed to involve NMR observations of at least two samples containing both host and guest. Any or all protons of the host or guest may be observed in any of the samples. These samples are created by placing solutions containing host and guest into NMR sample tubes. Samples may be made either by adding solution to an existing sample, or by placing solution into an empty tube. Every delivery of a solution aliquot to a sample tube defines a new sample.

There are two types of solution that may be added in such an aliquot. The first type is a stock solution. An experiment may use any number of stock solutions, any of which may contain any concentration of host, guest, both, or neither. The concentration of a stock solution is considered to be a fundamental random variable; its estimated value depends on no other measurements. The second type of solution is a sample solution, which is any solution made from other solutions. The concentration of a sample solution may depend on the volumes and concentrations of all solutions from which it is made. These two types of solution are distinct. Stock solutions cannot be observed, but sample solutions can. The conceptual distinction between stock and sample solutions is so complete that if only a single stock solution is placed in a sample tube, the solution inside the tube is considered a new sample solution.

When creating samples, however, both stock and sample solutions may be added to a sample tube with impunity. The only constraint on adding sample solutions is that they must be created before they are added. This means that the first sample solution must be composed of some quantity of a stock solution added to the NMR sample tube. All sample solutions are, fundamentally, made up of nothing but stock solutions, even though they may contain complex mixtures of mixtures.

The actual observations of the NMR spectra complete the experiment. These data are grouped by proton. Associated with each proton is its uncomplexed chemical shift $\delta_\mathrm{free}$, and a list of chemical shifts $\delta_\mathrm{obs}$ observed in samples containing both host and guest. It is not necessary for a single proton to be followed in all of the samples, nor is it necessary for a given sample to have any observations associated with it. It is necessary, however, for observations to be made only on samples that contain both host and guest. The measurement of the free chemical shift is considered separately from the other NMR observations, and would require illegal mathematical operations if it were not.

All binding studies are represented by these programs in the same way. Associated with each experiment is a list of stock solutions and a list of delivery devices. Samples are in a list of lists. One list contains the different sample tubes used; the sample tubes are lists in their own right of the solutions created in them. If a binding study is conducted so that all aliquots are combined together in a single sample tube, then the list of sample tubes is but one item long. The list of observable samples comprises only the sample solutions that contain both host and guest.

The proton chemical shift information is also a list of lists. One list contains all of the observed protons, and each proton has a number of lists, corresponding to its observations, associated with it.

Every measurement has at least two stored quantities associated with it: its value and its uncertainty. The value is self-explanatory; the uncertainty is represented by the expected variance of the measurement. The measurement variances are employed in calculating the weighting factors, as described in Chapter 2. To speed the calculation of the weighting factors, the computer also stores the quantities $\partial[\mathrm{H}]_0/\partial x_j$ and $\partial[\mathrm{G}]_0/\partial x_j$, the derivatives with respect to $x_j$ of the host and guest concentrations, of every sample affected by measurement $x_j$.
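One way to picture this representation is the Python sketch below. All class and function names (`Measurement`, `StockSolution`, `SampleSolution`, `concentrations`) are hypothetical and do not come from the programs' source code. The sketch assumes that volumes are additive and ignores the loss of volume when part of one sample is transferred to another tube; it also omits delivery devices and the stored derivatives.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Measurement:
    value: float
    variance: float        # expected variance of the measurement


@dataclass
class StockSolution:
    host: Measurement      # host concentration
    guest: Measurement     # guest concentration


@dataclass
class SampleSolution:
    source: Union[StockSolution, "SampleSolution"]  # where the aliquot came from
    volume: Measurement                             # volume of the delivered aliquot
    previous: Optional["SampleSolution"] = None     # sample already in the tube, if any

    def contents(self):
        """Return (host amount, guest amount, total volume) of this sample."""
        if self.previous is None:
            host, guest, total = 0.0, 0.0, 0.0
        else:
            host, guest, total = self.previous.contents()
        c_host, c_guest = concentrations(self.source)
        v = self.volume.value
        return host + c_host * v, guest + c_guest * v, total + v

    def observable(self):
        """Only samples containing both host and guest may carry observations."""
        host, guest, _ = self.contents()
        return host > 0.0 and guest > 0.0


def concentrations(solution):
    """Host and guest concentrations of a stock or sample solution."""
    if isinstance(solution, StockSolution):
        return solution.host.value, solution.guest.value
    host, guest, total = solution.contents()
    return host / total, guest / total


# Usage: one tube, first a host stock aliquot, then a guest stock aliquot.
host_stock = StockSolution(Measurement(1.0, 1e-4), Measurement(0.0, 0.0))
guest_stock = StockSolution(Measurement(0.0, 0.0), Measurement(2.0, 1e-4))
sample_1 = SampleSolution(host_stock, Measurement(0.5, 1e-6))
sample_2 = SampleSolution(guest_stock, Measurement(0.5, 1e-6), previous=sample_1)
```

Each delivery creates a new `SampleSolution`, so a tube is the chain of samples reached through `previous`, mirroring the list-of-lists description above.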

In order for these programs to properly internally represent an experimental data set, the experimental design must be specified to them. The design includes the

From the document: Investigations in molecular recognition: statistical tools and experimental studies (pages 150-172)
