QuB Structure and Function: A Historical Guide to the Code
(Omnia qubia in tres partes divisa est)

The development of QuB has passed through three distinct stages. First, Dr. Feng Qin created the core algorithms (SKM, MIL, AMP, and MPL) and embodied each in a stand-alone Windows program. Then, working independently, Lorin Milescu wrote QuB.exe as an integrated environment for data acquisition, preprocessing, and analysis. In the third stage, Chris Nicolai joined the QuB.exe project, integrating Dr. Qin's algorithms and taking charge of interface and plumbing improvements, while Lorin developed a new suite of algorithms: Idl/Base, MIP, MAC, and Staircase. By approaching these developments historically, I hope to clarify the structure of a code base that has suffered from multiple authors, each with their own style and vision.

Stage One: The Core Algorithms

Overview

The core algorithms are primarily mathematical works, applying hidden Markov models (HMMs) to single-channel recordings. Each algorithm was written up as a paper and coded for the command line in ANSI C/C++. They were then embodied in small, special-purpose Windows programs. Together with utilities for data acquisition and pre-processing, they made up the QUB Suite.

    PRE.exe      pre-processed a recording. The user could load a raw data file
                 and save selections ("segments") in an LDT file.
    SKM.exe      idealized segmented data. The user could load an LDT file,
                 create a simple kinetic model, run SKM, and save an event list
                 as a DWT file.
    MIL.exe      solved the kinetics of an event list for a particular model.
                 The user could load a DWT file, create a kinetic model, and
                 run MIL to find optimal rate constants.
    MPL.exe      solved the kinetics of segmented data directly. The user could
                 load an LDT file, create a kinetic model, run AMP to estimate
                 conductance levels, and run MPL to find optimal rate constants.
    MSearch.exe  found the MIL likelihood of an event list for each distinct
                 n-state model.
    MMerge.exe   combined two or more models into a constrained aggregate model.

This section describes the algorithms in their original form. In part three, I will explain how they have been wrapped and re-packaged to fit inside QuB.exe.

The Model

This section describes a sample model and gives its representation in memory, with typical variable names.

A Markov model is represented by a graph. Each vertex represents a state (a shape of the molecule). If there is an edge between two states, it is labeled with the transition rates (macroscopic rate constants):

     ___       k12 = 1000        ___
    | 1 |  -------------------> | 2 |
     ---   <-------------------  ---
               k21 = 200

For example, the transition rate from state 1 to state 2 is 1000 (per second). Rates are given by k = k0 * P * exp(k1 * Q), where

    P   is typically the experimental concentration of a drug, or 1 if none
    Q   is typically the experimental voltage or pressure, or 0 if none
    k0  is the pre-exponential rate constant
    k1  is the exponential rate constant, generally 0 for rates that don't
        depend on voltage or pressure

Each state has a starting probability, P0: the chance that a recording will start in that state.

States are grouped by color into conductance classes ("classes" for short). A conductance class has

    amplitude:  the average current level
    std. dev.:  of the amplitude

States within a class are indistinguishable by amplitude, which is what makes this a hidden Markov model. For this example, let's say state 1 is black (closed) and state 2 is red (open).
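To make the rate law concrete, here is a minimal sketch of a rate evaluator (a hypothetical helper for illustration, not taken from the QUB sources):

    #include <cmath>

    // k = k0 * P * exp(k1 * Q)
    // P: drug concentration, or 1 if the rate is not ligand-dependent
    // Q: voltage or pressure, or 0 if the rate is not voltage/pressure-dependent
    double rate(double k0, double k1, double P, double Q)
    {
        return k0 * P * std::exp(k1 * Q);
    }

With k0 = 1000, k1 = 0, P = 1, and Q = 0, this reproduces k12 = 1000 above.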
The model above is kept in memory as:

    nstate = 2
    clazz  = [0, 1]          // conductance class of each state
    pr     = [0.5, 0.5]      // start probability of each state
    npath  = 2
    path   = [1, 0, 0,       // each row of path is one direction along an edge,
              0, 1, 0]       // in the form [state from, state to, index of drug
                             // (or 0 if none)]; the third column is often kept
                             // as a separate array named idrug
    x      = [6.91, 0, 5.30, 0]
                             // x[2*i]   = log(k0) of path row i
                             // x[2*i+1] = k1 of path row i

A recording may have more than one channel governed by the same model:

    nchannel = 2

Some of the rate constants (x) may be constrained by the equation [Mtx * x + vct = 0]:

    nconstraint = (integer)
    Mtx = matrix [nconstraint X 2*npath]
    vct = vector [nconstraint]

(They are actually allocated with dimension (2*npath + 1) to facilitate shifting operations.)

Constraints are entered by the user in the form [operation | list of states]. Each user constraint is translated into a row of Mtx and an entry in vct:

    [fix k0   | 0, 1]        ->  [0.0, 0.0, 1.0, 0.0], 6.91
    [scale k0 | 0, 1, 1, 0]  ->  [-1.0, 0.0, 1.0, 0.0], (5.30 - 6.91)
    [fix k1   | 0, 1]        ->  [0.0, 1.0, 0.0, 0.0], 0.0
    [scale k1 | 0, 1, 1, 0]  ->  [0.0, 1.0, 0.0, -(k1[0,1] / k1[1,0])], 0.0
    [balance loop | 0, 1]    ->  * see ModelTree.cpp: ConstraintsToMdlinf() *

A "fix k1" constraint is auto-generated for each zero-valued k1.

All the algorithms derive two matrices from the model:

    Q, with Q[i, j] = kij, and Q[i, i] set so that each row sums to 0
    A = exp(Q * dt); A[i, j] is the probability that if you are in state i,
        you will be in state j after dt seconds

The Data

Each data file has conductance parameters and experimental conditions to be used with the model:

    amp = [0.0, 1.0]  // average current in each class
    xms = [0.1, 0.1]  // std. deviation of each amp
    C   = [1.0, ...]  // drug concentrations referenced in the third column of
                      // "path"; C[0] is always 1.0, indicating no drug-dependence
    V   = 0.0         // voltage

Raw data also has a sampling interval, dt, the number of milliseconds between consecutive samples.

A segment of raw data is kept as:

    ndata   // number of data points
    data    // array of float or double, supposedly in picoAmps

A segment of idealized data is kept as:

    ndwt    // number of intervals
    idwt    // conductance class of each interval
    tdwt    // length of each interval in milliseconds

Vector, Matrix, and Tensor (matrix.h)

fqvector<T> is more or less equivalent to STL's vector<T>. It can also cast to T*, and operator* is overloaded as the dot product. matrix<T> is implemented as an array of rows of type fqvector<T>. There are three natural approaches to building a matrix:

    (1) construct with dimensions and a fill value, then assign elements:
            matrix<double> m(nr, nc, 0);
            for ( int i=0; i<nr; ++i )
                for ( int j=0; j<nc; ++j )
                    m[i][j] = ...;
    (2) default-construct, then resize:
            matrix<double> m;
            m.resize(nr, nc);
            for ( int i=0; i<nr; ++i )
                ...
    (3) build it up one row at a time:
            matrix<double> m;
            fqvector<double> v(nc);
            for ( int i=0; i<nr; ++i )
                ...   // fill v, then append it as a row of m

matrix<T> can be cast to (T**). The resulting T** is managed by the matrix object, and may not stay valid when the matrix dimensions are changed. Other utility methods of matrix<T> include transpose, submatrix, diagonal vector, and multiplication.

A tensor<T> is a 3-d matrix, implemented as an array of matrix<T>. No casts or utility methods are provided.

fqvector, matrix, and tensor are preferred to memory allocated directly with new because they are more resistant to memory leaks: if an exception occurs and the stack is unwound, stack-based objects are still properly destroyed (similar to try/finally in newer languages).

SKM: Segmental K-Means

SKM idealizes raw data by alternating (a) finding the most likely state sequence given the data and model with (b) re-estimating the model parameters, until (c) convergence.
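In outline, one run looks something like this (a hedged sketch; viterbi() and reestimate() are hypothetical stand-ins for the steps described below):

    #include <cfloat>
    #include <vector>

    struct Model;   // stands in for the model arrays above (pr, A, amp, xms, ...)

    // (a) fill mlss with the most likely state sequence; return its log likelihood
    double viterbi(const std::vector<double>& data, const Model& m,
                   std::vector<int>& mlss);
    // (b) re-estimate amp, xms, and A from the classified points
    void reestimate(const std::vector<double>& data,
                    const std::vector<int>& mlss, Model& m);

    void skmLoop(const std::vector<double>& data, Model& m,
                 double tol, int maxIter)            // maxIter is usually 10
    {
        std::vector<int> mlss(data.size());
        double llPrev = -DBL_MAX;
        for ( int iter = 0; iter < maxIter; ++iter ) {
            double ll = viterbi(data, m, mlss);      // (a)
            reestimate(data, mlss, m);               // (b)
            if ( ll - llPrev < tol )                 // (c) convergence
                break;
            llPrev = ll;
        }
    }

The details of each step follow.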
(a) The most likely state sequence

The likelihood of the first point being in class c is:

    L[0, c] = pr[c] * gauss(amp[c], xms[c])(data[0])

For subsequent points:

    L[i, c] = max over c' ( L[i-1, c'] * A[c', c] * gauss(amp[c], xms[c])(data[i]) )

The back-path, g[i, c], is the value of c' which maximized L[i, c]. g describes a lattice on which you can trace the most likely state sequence backwards from any (i, c):

    [lattice figure: data points along x, states along y; following the
    back-paths from right to left, the candidate sequences merge into one]

After finding L[n, c] and g[n, c] for all c, pick MLSS[n] = the c that maximizes L[n, c], then trace backwards:

    MLSS[i] = g[i+1, MLSS[i+1]]
    L(MLSS) = L[n, MLSS[n]]   // the likelihood of the MLSS given the model and data

(b) Re-estimation

With each data point classified, we re-calculate (amp[c], xms[c]) as the mean and std. dev. of those data[i] for which MLSS[i] == c. Similarly, A[i, j] becomes the fraction of observed transitions out of state i that go into state j. The user can choose to skip re-estimation for some or all of these:

    fixA      don't re-estimate any A[i, j]
    fixI[c]   don't re-estimate amp[c]
    fixSD[c]  don't re-estimate xms[c]

(c) Convergence

SKM converges slowly, so it most often stops after a given number of iterations, usually 10. It also stops if L(MLSS) increases by less than tol.

MLSS (aka istate) is converted into an interval list (ndwt, idwt, tdwt) by grouping consecutive points of the same class.

SKM's Simplified Model

Since SKM re-estimates the A matrix directly, it ignores constraints on the rates. Concentration, voltage, and k1 are also ignored, though they appear to be supported in xtoq() (skm-utils.cpp).

SKM's Multi-channel Model

If model.nchannel > 1, SKM generates an aggregate model:

    nmutistate              // number of aggregated states
    mutistate[nmutistate * (nstate+1)]
                            // mutistate[i][s+1] == mutistate[i*(nstate+1) + (s+1)]
                            // is the number of channels in state s for mutistate i
    nmuticlass              // number of distinct conductance levels
    cmutistate[nmutistate]  // multichannel equivalent of clazz
    mutipr[nmutistate]      //      "        "       pr
    mutiamp[nmuticlass]     //      "        "       amp, in ascending order
    mutixms[nmuticlass]     //      "        "       xms
    nmutipath               // number of one-way transitions between mutistates,
                            // where exactly one channel changes state along a
                            // single-channel path
    mutipath[nmutipath * 4] // [mutistate from, mutistate to, state from, state to]
    mutiq[nmutistate, nmutistate]   // multichannel Q matrix
    mutia[nmutistate, nmutistate]   // multichannel A matrix
    mutiFixAmp[nmuticlass]  // true if any contributing state is in a class with fixed amp
    mutiFixSD[nmuticlass]   // true if any contributing state is in a class with fixed std. dev.

Mutistates are joined into muticlasses with the help of:

    float ratio[nclass];    // ratio[i] = amp[i] - amp[0]

Assuming class 0 is closed, ratio[i] tells how much the current changes when another channel goes from closed to class i.

Inputs and Outputs

SKM operates on each segment independently. The inputs for one segment are:

    data, ndata
    model (nstate, nclass, clazz, pr, A, amp, xms, nchannel)
    tol
    maxIter
    fixA, fixI, fixSD

The outputs are:

    ndwt, idwt, tdwt
    mutiamp, mutixms, mutia (re-estimated)

Practical Matters of Implementation

The likelihood becomes extremely tiny extremely quickly. To stay within IEEE floating-point range, we work with the log of the likelihood:

    f[i, c] = log(L[i, c])

Before (a), log(pr[c]) and log(A[i, j]) are precomputed, and log( gauss(amp[c], xms[c]) ) is sampled. Multiplication becomes addition.
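For instance, the log of the Gaussian density can be evaluated directly, never forming the underflowing product (a minimal sketch, not the QUB code):

    #include <cmath>

    // log of the normal density with the given mean and std. dev., evaluated at x.
    // Sampling this instead of gauss() keeps f[i, c] within floating-point range.
    double logGauss(double mean, double sd, double x)
    {
        const double LOG_SQRT_2PI = 0.918938533204673;   // log(sqrt(2 * pi))
        double u = (x - mean) / sd;
        return -0.5 * u * u - std::log(sd) - LOG_SQRT_2PI;
    }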
f[i, c] is only needed until all f[i+1, c] and g[i+1, c] are calculated, so f is stored as a matrix[nstate, 2].

The algorithm is robust enough (given reasonable starting amps) to accommodate some half-measures: "pr" is ignored; mutipr[i] = 1.0 / nmutistate.

The astute reader will have noticed that gauss(amp, xms)(at) is not strictly a probability. However, it is more or less proportional to the probability you would get by integrating the Gaussian distribution.

Assumptions and Limitations

SKM works with one segment at a time. The segment and its multichannel lattice must fit in RAM. There is no re-estimation across segments.

The input quantities (rates, single-channel amp, xms) are not re-estimated. Instead you get the multi-channel amp, xms, and A matrix.

Parameter Optimization with DFPMIN

MIL and MPL aim to find the optimal values of the model parameters for a given data set, that is, the values which have the highest probability of generating the data. This is a function-maximization problem: express the probability (likelihood) as a function of the model parameters, then maximize that function. There is no closed-form solution, so we use the gradient-based optimizer DFPMIN from Numerical Recipes. Since DFPMIN is a minimizer, we minimize the negative log likelihood.

Inputs:

    nz aka n           // number of parameters
    z[nz] aka p        // initial parameter values
    func(z) -> ll, gz[nz], err
                       // computes the function value (ll) and the partial
                       // derivative w.r.t. each parameter (gz)
    check(iter, ll, z, gz) -> bool, aka oupt
                       // called each iteration; return false to stop iterating
    pvoid              // (void *) passed to func and check
    maxIter aka itmax  // maximum number of iterations
    convLL aka tolf    // stop if ll improves by less than convLL from one
                       // iteration to the next
    convGrad aka tolg  // stop if all partial derivatives are less than convGrad
    stepMax            // scaled maximum distance for line search (see algorithm)

Outputs:

    z        // the final parameter values
    ll       // func(z)
    iter     // number of iterations
    hessian  // the inverse Hessian matrix of second derivatives; useful for
             // calculating error limits
    err      //  0: no problem
             // -1: func returned an error
             // -2: check returned false
             // -3: iter >= maxIter
    nf       // supposedly the number of function calls, but it's up to you to
             // keep the count in func()
    ndf      // supposedly the number of gradient calls, but  "  "  "

Algorithm Overview

Consider func as a surface in nz dimensions whose height is in the ll dimension. DFPMIN assumes the surface is roughly parabolic, using the slope to infer which part of the parabola it is on. Each iteration, it searches along a straight line in a downhill direction chosen from the slope (and its running estimate of the curvature). The lowest point found becomes z for the next iteration.

Cautions and Surprises

DFPMIN finds a local minimum. The global minimum may be somewhere else.

Line search can search too far and get stuck. Try decreasing stepMax.

Sometimes it appears to iterate well beyond maxIter.

The last z passed to check() is the optimal z. The last z passed to func() may be suboptimal -- part of a failed line search.

MIL: Maximum Interval Likelihood

MIL finds optimal rate constants for a given model and idealized data. It can optimize across multiple segments, and even across multiple datasets (sets of segments) recorded at different concentrations and voltages. The optimal rates are transformed into time constants, which are displayed as probability density functions (pdfs) overlaid on dwell-time histograms.
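In DFPMIN terms, MIL's func() amounts to something like this (a hedged sketch with hypothetical names; milLL() stands in for the forward-backward calculation described below):

    #include <vector>

    struct MilData;   // segments, model, constraints, experimental conditions

    // Hypothetical: compute LL and d(LL)/dz by forward-backward; false on failure.
    bool milLL(const std::vector<double>& z, MilData& d,
               double& ll, std::vector<double>& gz);

    // func for DFPMIN: minimize the NEGATIVE log likelihood.
    int func(std::vector<double>& z, double& f,
             std::vector<double>& gz, void* pvoid)
    {
        MilData& d = *(MilData*) pvoid;
        double ll;
        if ( ! milLL(z, d, ll, gz) )
            return -1;                      // e.g. a rate outside its search limit
        f = -ll;
        for ( size_t i = 0; i < gz.size(); ++i )
            gz[i] = -gz[i];                 // negate the gradient as well
        return 0;
    }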
The MIL Multichannel Model

    nmetastate                     // number of aggregated states
    metastate[nmetastate][nstate]  // metastate[i][s] is the number of channels
                                   // in state s for metastate i
    nmetaclass                     // number of distinct conductance classes
    metaclass[nmetastate]          // conductance class of each metastate
    mpr[nmetastate]                // probability of starting a segment in each metastate
    nmetagroup[nmetaclass]         // number of metastates in each metaclass
    metaindex[nmetaclass][2*nmetastate]
        // metaindex[i][0..(nmetagroup[i]-1)]: indices of metastates in metaclass i
        // metaindex[i][nmetastate..(2*nmetastate - nmetagroup[i] - 1)]:
        //     indices of metastates not in metaclass i
    nmetapath                      // number of one-way transitions between
                                   // metastates, where exactly one channel changes
                                   // state along a single-channel path
    metapath[nmetapath][2]         // [metastate from, metastate to] of each metapath
    subpath[nmetapath][2]          // [state from, state to] of the individual channel
                                   // transition involved in each metapath
    qq[nmetastate][nmetastate]     // multichannel Q matrix
    qqe[nmetastate][nmetastate]    // multichannel Q matrix with missed-event
                                   // correction (dead time)

The rate constants (x) may be constrained (Mtx, vct). In freePar(), MIL generates a smaller list of unconstrained parameters (z) and a transformation from z to x. Confusingly, the transformation is given by a different pair also called (Mtx, vct); for clarity, I'll call them (Mtx', vct'):

    given x and (Mtx, vct) such that [Mtx * x + vct = 0],
    generate z and (Mtx', vct') such that [x = Mtx' * z + vct']

The transformation from z to qqe is accomplished by a set of functions XtoY(), which also generate derivatives dY_X (the derivative of Y w.r.t. X):

    ztox(z, Mtx', vct')                              -> x,   dx_z
    xtoq(x, path, idrug, C, V)                       -> q,   dq_x
    qtometaq(q, path, metapath, subpath, metastate)  -> qq,  dqq_q
    qtoqe(qq, metapath, metaclass, td /*see below*/) -> qqe, dqqe_qq

Thus MIL can compute the derivative of the log likelihood (LL) w.r.t. qqe, df_q, then compute the derivative of LL w.r.t. z using the chain rule:

    df_z = df_q * dqqe_qq * dqq_q * dq_x * dx_z

Missed Event Correction: the Dead Time (td)

MIL incorporates missed-event correction: modifying the rates to account for events that are too short to detect. An event is too short if it is shorter than the dead time, td (milliseconds). qqe (above) is the matrix of modified rates. The event lists are also modified: events shorter than td are joined with the preceding event.

Log Likelihood Calculation with Forward-Backward

Forward-backward computes likelihoods in two grids (for one segment):

alpha[nmetastate][2*ndwell+1] is the "forward" likelihood. Its first column is equal to mpr, the starting probabilities. Odd-numbered columns store the likelihood of staying in a class for the length of a dwell. Even-numbered columns store the likelihood of a transition from the states of one class to the next. The final column has the likelihood of a transition from the states of the last class to any other class. Probabilities are brought forward by multiplying a column by an A (transition) matrix to get the next column, then zeroing the states of the wrong class. alpha[i][2*ndwell] is the probability that this model generated these events, ending in metastate i. The overall likelihood is the sum of the final column.

beta[nmetastate][2*ndwell+1] is the "backward" likelihood. beta[i][2*ndwell] = 1.0 / nmetastate. Probabilities are brought backward similarly to alpha, such that for any t:

    overall likelihood = sum over i( alpha[i][t] * beta[i][t] )

beta is needed only for computing derivatives.
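Putting the alpha recursion in code form (an unscaled sketch; stepDwell() and stepJump() are hypothetical stand-ins for the class-restricted matrix multiplications, and the scaling described next is omitted):

    #include <cmath>
    #include <vector>

    typedef std::vector<double> Col;   // one column of alpha, length nmetastate

    // multiply by exp(qqe * t) and zero metastates outside class cls (odd columns)
    Col stepDwell(const Col& a, int cls, double t);
    // multiply by the qqe entries leading from class cls to class next (even columns)
    Col stepJump(const Col& a, int cls, int next);

    double forwardLL(const Col& mpr, const std::vector<int>& idwt,
                     const std::vector<double>& tdwt)
    {
        Col a = mpr;                                  // alpha's first column
        for ( size_t d = 0; d < idwt.size(); ++d ) {
            a = stepDwell(a, idwt[d], tdwt[d]);       // stay in the class for the dwell
            int next = (d + 1 < idwt.size()) ? idwt[d + 1]
                                             : -1;    // -1: any other class (final column)
            a = stepJump(a, idwt[d], next);           // transition out
        }
        double sum = 0.0;                             // overall likelihood is the
        for ( size_t i = 0; i < a.size(); ++i )       // sum of the final column
            sum += a[i];
        return std::log(sum);
    }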
The elements of alpha and beta can easily exceed the dynamic range of IEEE floating point, so a trick is in order: scale[2*ndwell+1]. After computing a column of alpha:

    scale[t] = 1.0 / sum over i( alpha[i][t] )
    alpha[i][t] *= scale[t]

The same scale factors are used for beta: beta[i][t] *= scale[t]. Note that the magnitude of alpha's columns is moved entirely into scale -- the sum of each column of alpha becomes 1.0 -- so:

    log likelihood = LL = log( product over t( 1.0 / scale[t] ) )
                        = - sum over t( log(scale[t]) )

The LL of multiple segments is the sum of each segment's LL. The same goes for the derivatives.

Global Fitting

Each file (dataset) can have a different C, V, nchannel, and td, so each has its own multichannel model. The overall LL and gz are the sums over all datasets.

The Transition Matrices

The A matrix for a dwell of length dt is exp( qqe * dt ). MIL computes the exponential of a matrix by the method of spectral decomposition. For efficiency, the spectra of qqe are precomputed; then for each dwell, A = expQ(class, spectra, dt).

The A matrix for an instantaneous transition is exp( qqe * infinitesimal ). This is proportional to qqe, so qqe is used directly. Note that this is not strictly a probability, so the likelihood is often greater than 1.

Tricks with Transition Matrices and alpha/beta Columns

Rather than computing the probability of going into each state, then zeroing out the entries ruled impossible by the data, MIL uses appropriate sub-transition-matrices, with rows corresponding to the permitted start states and columns to the permitted end states. This means that alpha[i][t] is the probability of being in the i'th state of class idwell[(t+1)/2], and only the first nmetagroup[ idwell[(t+1)/2] ] rows of column t are in use.

Other Tricks

The "search limit" says not to try any rates outside [initial_rate / searchlimit, initial_rate * searchlimit]. When preparing the model:

    xlimit[0][i] = x[i] / searchlimit
    xlimit[1][i] = x[i] * searchlimit

MIL checks each x against xlimit in ztox(). An out-of-bounds value causes the DFPMIN func() to fail, and the line search resumes at a shorter distance.

If there is a "balance loop" constraint but the initial rates are not in detailed balance, they are automatically balanced by changing one rate that is otherwise unconstrained.

Time Constants

Time constants (Tau) and their relative weights (Amp) are computed from the eigenvalues of qqe. There are nmetagroup[c] time constants for metaclass c. Each describes a probability distribution of the duration of events in one metastate.

    // error limits on the rate constants?

Limitations

MIL does not work with more than two (single-channel) conductance classes.

If all starting probabilities (pr) are 0.0, they are initialized to (1.0 / nstate). Perhaps the equilibrium probabilities would be better.

MPL: Maximum Point Likelihood

MPL finds optimal rate constants for a given model and raw data. It can optimize across multiple segments, and even across multiple datasets recorded at different concentrations and voltages. Like MIL, MPL calculates the LL and its derivatives using forward-backward, and optimizes with DFPMIN. MPL also finds the most likely state sequence (MLSS) at each iteration.

The MPL Model

In addition to amp[nclass] and xms[nclass], MPL supports correlated noise:

    nar              // number of correlation coefficients (the same for all classes)
    ar[nclass][nar]  // correlation coefficients

MPL theoretically has a multichannel model much like MIL's. However, it is incomplete, lacking multichannel amp, xms, and pr.
Here's what it does have:

    nmutistate                       // number of aggregated states
    mutistate[nmutistate][nstate+1]  // mutistate[i][s+1] is the number of channels
                                     // in state s for mutistate i
    nmuticlass                       // number of distinct conductance classes
    muticlass[nmutistate]            // conductance class of each mutistate
    nmutipath                        // number of one-way transitions between
                                     // mutistates, where exactly one channel changes
                                     // state along a single-channel path
    mutipath[nmutipath][4]           // [mutistate from, mutistate to, state from,
                                     // state to] of each mutipath
    qq                               // multichannel Q matrix
    a                                // multichannel A matrix = exp( qq * dt )

Like MIL, MPL honors constraints on the rates, and keeps idrug[npath] and xlimit[2][npath]. Like SKM, MPL keeps ratio[nclass], though ratio[i] == i. Someone trying to support multi-channel amp and xms might try imitating the use of ratio in SKM.

MPL also accepts filter coefficients, stored in:

    nfir         // number of filter coefficients
    fir[nfir+1]  // fir[0] == 1.0; the rest are the filter coefficients

The meaning and use of these coefficients is lost in the mists of time. nfir is typically 0.

The transformation from z (the DFPMIN free parameters) to a is accomplished by a set of functions which also generate derivatives:

    ztox(z, Mtx', vct')                        -> x,  dx   (deriv. of x w.r.t. z)
    xtoq(x, path, idrug, C, V, dx)             -> q,  dq   (deriv. of q w.r.t. z)
    qtomutiq(q, path, mutistate, mutipath, dq) -> qq, dqq  (deriv. of qq w.r.t. z)
    mexp(qq, dt, dqq)                          -> a,  da   (deriv. of a w.r.t. z)

Thus MPL can compute the derivative of LL w.r.t. a, dfa, then compute the derivative of LL w.r.t. z using the chain rule:

    gz = dfa * da

This is oversimplified; there is some fancy footwork obscuring an optimization.

Log Likelihood Calculation with Forward-Backward

I can only describe the simplified theory; the actual algorithm is complicated by ar and fir. "b" is a combination of gauss(amp[c], xms[c]), ar, and fir.

alpha[nmutistate][ndata] is the forward likelihood. Its first column is pr, the starting probability. Column t = column t-1 * a.

beta[nmutistate][ndata] is calculated similarly but backward, as in MIL. scale[ndata] plays the same role as in MIL, so LL = - sum( log(scale[i]) ).

Global fitting is just like MIL's.

Finding the Most Likely State Sequence

gamma[s][t] = alpha[s][t] * beta[s][t] is the probability that the MLSS includes state s at time t. MLSS[t] = the s that maximizes gamma[s][t].

AMP: Re-estimating Conductance Parameters and the A Matrix

AMP re-estimates the conductance parameters (amp, xms) and the A matrix using an iterative procedure. It generates the MLSS in the same way as MPL.

AMP uses the same model as MPL. It is likewise limited to one channel, due to the lack of multi-channel amp, xms, and pr.

AMP uses Baum-Welch, which alternates between forward-backward and a re-estimation procedure.

// Notes for sections still to come:
// MSearch
//   the msearch model class and model number
//   next model, mil, repeat
//   "AIC", LL, npar
//   references in case you want more states
// MMerge