mssm.src.python package

Submodules

mssm.src.python.compare module

mssm.src.python.compare.compare_CDL(model1: GAMM, model2: GAMM, correct_V: bool = True, correct_t1: bool = False, perform_GLRT: bool = False, nR=250, n_c=10, alpha=0.05, grid='JJJ2', a=1e-07, b=10000000.0, df=40, verbose=False, drop_NA=True, method='Chol', seed=None, use_upper=True, Vp_fidiff=True, use_reml_weights=True, **bfgs_options)

Computes the AIC difference and (optionally) performs an approximate GLRT on twice the difference in unpenalized log-likelihood between models model1 and model2 (see Wood et al., 2016).

For the GLRT to be appropriate, model1 should be set to the model containing more effects and model2 should be a nested, simpler variant of model1. For the degrees of freedom of the test, the expected degrees of freedom (EDF) of each model are used (i.e., this is the conditional test discussed in Wood, 2017: 6.12.4). The difference in EDF between the models serves as the DoF for computing the Chi-Square statistic. In addition, correct_t1 should be set to True when computing the GLRT.

To get the AIC for each model, 2*edf is added to twice the negative (conditional) log-likelihood (see Wood et al., 2016).

By default (correct_V=True), mssm will attempt to correct the edf for uncertainty in the estimated \(\lambda\) parameters. Which correction is computed depends on the choice for the grid argument. When grid='JJJ1', an approximation to the analytic solution for the correction proposed by Wood, Pya, & Säfken (2016) is computed - this is exact for strictly Gaussian and canonical generalized additive models. It is not efficient for sparse models and will thus become too expensive for large sparse multi-level models. When setting grid='JJJ2' (default), the analytic solution proposed by Wood, Pya, & Säfken (2016) is used as a basis for the correction and subsequently refined via numeric integration. This requires a costly sampling step (see Greven & Scheipl, 2016 and the correct_VB function in the utils module) which will take quite some time for reasonably large models with more than 3-4 smoothing parameters, but remains efficient for sparse multi-level models (assuming use_upper is set to True and correct_t1 is set to False).

In case either form of correction is too expensive, it might be better to rely on hypothesis tests for individual smooths, confidence intervals, and penalty-based selection approaches instead (see Marra & Wood, 2011 for details on the latter).

In case correct_t1=True the EDF will be set to the smoothness bias corrected expected degrees of freedom (t1 in section 6.1.2 of Wood, 2017; additionally corrected for smoothness uncertainty in case correct_V=True) for the GLRT (based on the recommendation given in section 6.12.4 of Wood, 2017). The AIC (Wood, 2017) of both models will still be based on the regular (smoothness uncertainty corrected) edf.

The computation here differs from the one performed by the compareML function in the R-package itsadug - which rather performs a version of the marginal GLRT (also discussed in Wood, 2017: 6.12.4) - and is more similar to the anova.gam implementation provided by mgcv (if grid='JJJ1'). The returned p-value is approximate - very much so if correct_V=False. Also, the GLRT should not be used to compare models differing in their random effect structures - the AIC is more appropriate for this (see Wood, 2017: 6.12.4).

References:
  • Marra, G., & Wood, S. N. (2011) Practical variable selection for generalized additive models.

  • Wood, S. N., Pya, N., & Säfken, B. (2016). Smoothing Parameter and Model Selection for General Smooth Models

  • Greven, S., & Scheipl, F. (2016). Comment on: Smoothing Parameter and Model Selection for General Smooth Models

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

  • compareML function from itsadug R-package: https://rdrr.io/cran/itsadug/man/compareML.html

  • anova.gam function from mgcv, see: https://www.rdocumentation.org/packages/mgcv/versions/1.9-1/topics/anova.gam

Parameters:
  • model1 (GAMM or GAMMLSS or GSMM) – The first model to be compared; for the GLRT this must be the more complex one.

  • model2 (GAMM or GAMMLSS or GSMM) – The second model to be compared; for the GLRT this must be a nested, simpler variant of model1.

  • correct_V (bool, optional) – Whether or not to correct for smoothness uncertainty. Defaults to True

  • correct_t1 (bool, optional) – Whether or not to also correct the smoothness bias corrected edf for smoothness uncertainty. Defaults to False.

  • perform_GLRT (bool, optional) – Whether to perform a GLRT in addition to computing the AIC or to only compute the AIC. Defaults to False.

  • nR (int, optional) – In case grid != 'JJJ1', nR samples/REML scores are generated/computed to numerically evaluate the expectations necessary for the uncertainty correction, defaults to 250

  • n_c (int, optional) – Number of cores to use to compute the smoothness uncertainty correction, defaults to 10

  • alpha (float, optional) – alpha level of the GLRT. Defaults to 0.05

  • grid (str, optional) – How to compute the smoothness uncertainty correction, defaults to ‘JJJ2’

  • a (float, optional) – Minimum \(\lambda\) value that is included when correcting for uncertainty and grid != 'JJJ1'. In addition, any of the \(\lambda\) estimates obtained from the model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are smaller than this are set to this value as well, defaults to 1e-7 (the minimum possible estimate)

  • b (float, optional) – Maximum \(\lambda\) value that is included when correcting for uncertainty and grid != 'JJJ1'. In addition, any of the \(\lambda\) estimates obtained from the model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are larger than this are set to this value as well, defaults to 1e7 (the maximum possible estimate)

  • df (int, optional) – Degrees of freedom used for the multivariate t distribution used to sample the next set of candidates. Setting this to np.inf means a multivariate normal is used for sampling, defaults to 40

  • verbose (bool, optional) – Whether to print progress information or not, defaults to False

  • drop_NA (bool,optional) – Whether to drop rows in the model matrices corresponding to NAs in the dependent variable vector. Defaults to True.

  • method (str,optional) – Which method to use to solve for the coefficients (and smoothing parameters). The default (“Chol”) relies on Cholesky decomposition. This is extremely efficient but, in principle, numerically less stable. For maximal numerical stability set this to “QR/Chol”. In that case a QR decomposition is used - which is first pivoted to maximize sparsity in the resulting decomposition but also pivots for stability in order to get an estimate of rank deficiency. A Cholesky decomposition is then used, relying on the combined pivoting strategy obtained from the QR. This takes substantially longer. If this is set to 'qEFS', then the coefficients are estimated via quasi-Newton methods and the smoothing penalties are estimated from the quasi-Newton approximation to the hessian. This only requires first-derivative information. Defaults to “Chol”.

  • seed (int,optional) – Seed to use for random parts of the correction. Defaults to None

  • use_upper (bool,optional) – Whether to compute the edf from the trace of the covariance matrix (use_upper=False) or based on numeric integration weights. The latter is much more efficient for sparse models. Only makes sense when grid='JJJ2'. Defaults to True

  • Vp_fidiff (bool,optional) – Whether to rely on a finite difference approximation to compute \(\mathbf{V}_{\boldsymbol{p}}\) for grid JJJ1 and JJJ2 or on a PQL approximation. The latter is exact for Gaussian and canonical GAMs and far cheaper if many penalties are to be estimated. Defaults to True (Finite difference approximation)

  • use_reml_weights (bool,optional) – Whether to rely on REML scores to compute the numerical integration when grid != 'JJJ1' or on the log-densities of \(\mathbf{V}_{\boldsymbol{p}}\) - the latter assumes that the unconditional posterior is normal. Defaults to True (REML scores are used)

  • bfgs_options (key=value,optional) – Any additional keyword arguments that should be passed on to the call of scipy.optimize.minimize(). If none are provided, the gtol argument will be initialized to 1e-3. Note also, that in any case the maxiter argument is automatically set to 100. Defaults to None.

Raises:
  • ValueError – If both models are from different families.

  • ValueError – If perform_GLRT=True and model1 has fewer coefficients than model2 - i.e., model1 has to be the notationally more complex one.

Returns:

A dictionary with the outcomes of all tests. Key H1 will be a bool indicating whether the null hypothesis was rejected or not, p will be the p-value, chi^2 will be the test statistic used, Res. DOF will be the degrees of freedom used by the test, and aic1 and aic2 will be the AIC scores for both models.

Return type:

dict
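
For illustration, a minimal usage sketch follows. This is hedged: model_full and model_nested are placeholders for two already-estimated models (with model_full the more complex one), and the result keys follow the description above:

   from mssm.src.python.compare import compare_CDL

   # Compare a more complex model against a nested, simpler variant.
   result = compare_CDL(model_full, model_nested,
                        correct_V=True,    # correct edf for smoothness uncertainty
                        correct_t1=True,   # recommended when performing the GLRT
                        perform_GLRT=True,
                        grid='JJJ2', nR=250, n_c=10, seed=0)

   print(result["H1"], result["p"], result["chi^2"], result["Res. DOF"])
   print(result["aic1"] - result["aic2"])  # AIC difference between the models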

mssm.src.python.constraints module

class mssm.src.python.constraints.ConstType(*values)

Bases: Enum

DIFF = 3
DROP = 1
QR = 2
class mssm.src.python.constraints.Constraint(Z: array = None, type: ConstType = None)

Bases: object

Constraint storage. Z either holds the QR-based correction matrix that needs to be multiplied with \(\mathbf{X}\), \(\mathbf{S}\), and \(\mathbf{D}\) (where \(\mathbf{D}\mathbf{D}^T = \mathbf{S}\)) to make terms subject to the conventional sum-to-zero constraints applied also in mgcv (Wood, 2017), or the column/row that should be dropped from those matrices - in which case \(\mathbf{X}\) can no longer take on a constant - or None, indicating that the model should be “difference re-coded” to enable sparse sum-to-zero constraints. The latter two options are available in mgcv’s smoothCon function by setting the sparse.cons argument to 1 or 2 respectively.

The QR-based approach is described in detail by Wood (2017) and is similar to just mean centering every basis function involved in the smooth and then dropping one column from the corresponding centered model matrix. The column-dropping approach is self-explanatory. The difference re-coding deserves some explanation. Consider the model:

\[y = f(x) = b_1 f_1(x) + b_2 f_2(x) + b_3 f_3(x) + b_4 f_4(x) + b_5 f_5(x)\]

In mssm \(f(x)\) will be parameterized via five B-spline basis functions \(f_i(x)\), each weighted by a coefficient \(b_i\). Each of the \(f_i(x)\) has support only over a small range of \(x\) - resulting in a sparse model matrix \(\mathbf{X}\). This desirable property is lost when enforcing the conventional sum-to-zero constraints (either QR-based or by subtracting the mean of each basis over \(x\) from the corresponding \(f_i(x)\)). To maintain sparsity, one can enforce a coefficients-sum-to-zero constraint (Wood, 2020) by re-coding instead:

Set \(b_1 = -b_2\)

Then, re-write:

\[\begin{aligned}
y &= -b_2 f_1(x) + b_2 f_2(x) + b_3 f_3(x) - b_3 f_2(x) + b_4 f_4(x) - b_4 f_3(x) + b_5 f_5(x) - b_5 f_4(x)\\
y &= -b_2 f_1(x) + (b_2 - b_3) f_2(x) + (b_3 - b_4) f_3(x) + (b_4 - b_5) f_4(x) + b_5 f_5(x)\\
y &= b_2 (f_2(x) - f_1(x)) + b_3 (f_3(x) - f_2(x)) + b_4 (f_4(x) - f_3(x)) + b_5 (f_5(x) - f_4(x))
\end{aligned}\]

Line 3 shows how the constraint can be absorbed for model fitting by first computing new bases \((f_i(x) - f_{i-1}(x))\) and then estimating the coefficients based on those (this is implemented in mgcv’s smoothCon function when setting sparse.cons=2). Note that the constrained model, just like when using the QR-based or dropping approach, requires dropping one of the \(k\) original coefficients for estimation since only \(k-1\) coefficients are allowed to vary. The same sum-to-zero constraint can be achieved by fixing one of the central bases in the original model to its neighbor (e.g., setting \(b_2 = -b_3\)) or by setting \(b_1 = -b_2 - b_3 - b_4 - b_5\). mssm fixes one of the central bases to its neighbor.

With a B-splines basis, it would be necessary that \(b_1 = b_2 = b_3 = b_4 = b_5\) for the model to fit a constant \(f(x)\) over all values of \(x\) (Eilers & Marx, 1996). In the constrained model this is no longer possible due to \(b_1 = -b_2\), effectively removing the constant from the space of functions that \(f(x)\) can approximate. Instead, in the constrained model if \(b_2 = b_3 = b_4 = b_5\), \(f(x)\) summed over all \(x\) will be equal to zero!

Importantly, the “new” bases obtained by computing \((f_i(x) - f_{i-1}(x))\) are all still relatively sparse. This is because neighboring \(f_i(x)\) overlap in their support and because the support of each individual \(f_i(x)\) is narrow, i.e., \(f_i(x)\) is sparse itself.
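
The equivalence of the two parameterizations can be verified numerically. Below is a small sketch (using a random dense stand-in for the B-spline model matrix, purely for illustration) showing that the re-coded bases reproduce the original model under coefficients that sum to zero:

   import numpy as np

   rng = np.random.default_rng(0)
   X = rng.random((10, 5))   # stand-in for the 5 basis columns f_1(x),...,f_5(x)
   b = rng.random(4)         # free coefficients b_2,...,b_5

   # Re-coded bases: columns hold (f_i(x) - f_{i-1}(x)) for i = 2,...,5.
   X_con = X[:, 1:] - X[:, :-1]

   # Equivalent coefficients on the original basis sum to zero.
   c = np.concatenate(([-b[0]], b[:-1] - b[1:], [b[-1]]))
   assert np.isclose(c.sum(), 0.0)
   assert np.allclose(X @ c, X_con @ b)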

However, this is not a true centering constraint: \(f(x)\) will not necessarily be orthogonal to the intercept, i.e., \(\mathbf{1}^T \mathbf{f(x)}\) will not necessarily be 0. Hence, confidence intervals will usually be wider when using ConstType.DIFF (also when using ConstType.DROP, for the same reason) instead of ConstType.QR (see Wood; 2017,2020)!

A final note regards the use of tensor smooths when te==False. Since the value of any constant estimated for a smooth depends on the type of constraint used, the marginal functions estimated for the “main effects” (\(f(x)\), \(f(z)\)) and “interaction effect” (\(f(x,z)\)) in a model \(y = a + f(x) + f(z) + f(x,z)\) will differ depending on the type of constraint used. The “Anova-like” decomposition described in detail in Wood (2017) is achievable only when using ConstType.QR.

References:
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

  • Wood, S. N. (2020). Inference and computation with generalized additive models and their extensions. TEST, 29(2), 307–339. https://doi.org/10.1007/s11749-020-00711-5

  • Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89–121. https://doi.org/10.1214/ss/1038425655

Z: array = None
type: ConstType = None

mssm.src.python.exp_fam module

class mssm.src.python.exp_fam.Binomial(link: Link = Logit(), n: int = 1)

Bases: Family

Binomial family. For this implementation we assume that we have collected proportions of success, i.e., the dependent variable specified in the model Formula needs to hold observed proportions and not counts! If we assume that each observation \(y_i\) reflects a single independent draw from a binomial (with \(n=1\), and \(p_i\) being the probability that the result is 1), then the dependent variable should hold either 1 or 0. If we have multiple independent draws from the binomial per observation (i.e., row in our data-frame), then \(n\) will usually differ between observations/rows in our data-frame (i.e., we observe \(k_i\) counts of success out of \(n_i\) draws - so that \(y_i=k_i/n_i\)). In that case, the Binomial() family accepts a vector for argument \(\mathbf{n}\) (which is simply set to 1 by default, assuming binary data), containing \(n_i\) for every observation \(y_i\).

In this implementation, the scale parameter is kept fixed/known at 1.
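
For example, grouped binomial data would be prepared along these lines (a sketch; the counts and draws are placeholders):

   import numpy as np
   from mssm.src.python.exp_fam import Binomial

   k = np.array([3, 7, 0, 10])    # k_i: observed successes per row
   n = np.array([10, 10, 5, 12])  # n_i: independent draws per row
   y = k / n                      # observed proportions - the dependent variable

   family = Binomial(n=n.tolist())  # n supplied per observation; defaults to 1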

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • link (Link) – The link function to be used by the model of the mean of this family. By default set to the canonical logit link.

  • n (int or [int], optional) – Number of independent draws from a Binomial per observation/row of data-frame. For binary data this can simply be set to 1, which is the default.

D(y, mu)

Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

V(mu)

The variance function (of the mean; see Wood, 2017, 3.1.2) for the Binomial model. Variance is minimal for \(\mu=1\) and \(\mu=0\), maximal for \(\mu=0.5\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted probability for the response distribution corresponding to each observation.

deviance(y, mu)

Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

init_mu(y)

Function providing initial \(\boldsymbol{\mu}\) vector for GAMM.

Estimation assumes proportions as the dependent variable. According to https://stackoverflow.com/questions/60526586/, the glm() function in R always initializes \(\mu\) = 0.75 for observed proportions (i.e., elements in \(\mathbf{y}\)) of 1 and \(\mu\) = 0.25 for proportions of zero. This can be achieved by adding 0.5 to the observed proportion of success (and adding one observation).
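
In other words, for binary data (each \(n_i = 1\)) the initialization described above amounts to the following arithmetic (a sketch, not the actual implementation):

   import numpy as np

   y = np.array([0.0, 1.0, 1.0, 0.0])
   mu0 = (y + 0.5) / 2   # 0.5 successes and one observation added per row
   # -> 0.25 where y == 0 and 0.75 where y == 1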

Parameters:

y ([float]) – The vector containing each observation.

llk(y, mu)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

lp(y, mu)

Log-probability of observing every proportion in \(\mathbf{y}\) under their respective binomial with mean = \(\boldsymbol{\mu}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed proportion.

  • mu ([float]) – The vector containing the predicted probability for the response distribution corresponding to each observation.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.Binomial2(link: Link = Logit(), n: int = 1)

Bases: GAMLSSFamily

Another implementation of the Binomial family that allows estimating binomial models via GAMMLSS (and thus via full Newton, without PQL!).

For this implementation we again assume that we have collected proportions of success, i.e., the dependent variable specified in the model Formula needs to hold observed proportions and not counts! If we assume that each observation \(y_i\) reflects a single independent draw from a binomial (with \(n=1\), and \(p_i\) being the probability that the result is 1), then the dependent variable should hold either 1 or 0. If we have multiple independent draws from the binomial per observation (i.e., row in our data-frame), then \(n\) will usually differ between observations/rows in our data-frame (i.e., we observe \(k_i\) counts of success out of \(n_i\) draws - so that \(y_i=k_i/n_i\)). In that case, this family accepts a vector for argument \(\mathbf{n}\) (which is simply set to 1 by default, assuming binary data), containing \(n_i\) for every observation \(y_i\).

In this implementation, the scale parameter is kept fixed/known at 1.

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • link (Link) – The link function to be used by the model of the mean of this family. By default set to the canonical logit link.

  • n (int or [int], optional) – Number of independent draws from a Binomial per observation/row of data-frame. For binary data this can simply be set to 1, which is the default.

get_resid(y, mu)

Get standardized residuals for a Binomial model

Essentially, the deviance residuals are returned, which are equivalent to \(sign(y_i - \mu_i)*D_i^{0.5}\), where \(\sum_{i=1,...N} D_i\) equals the model deviance (see Wood 2017, section 3.1.7).

Parameters:
  • y ([float]) – The vector containing each observed proportion.

  • mu ([float]) – The vector containing the predicted probability for the response distribution corresponding to each observation.

Returns:

A list of deviance residuals that should be ~ N(0,1) if the model is correct.

Return type:

[float]

init_coef(models)

Function to initialize the coefficients of the model.

Fits a GAMM (via PQL) for the mean.

Parameters:

models ([mssm.models.GAMM]) – A list of mssm.models.GAMM’s, each based on one of the formulas provided to a model.

Returns:

A numpy.array of shape (-1,1), holding initial values for all model coefficients.

Return type:

numpy array

init_lambda(formulas)

Function to initialize the smoothing parameters of the model.

Returns values in line with what is used to initialize a GAMM.

Parameters:

formulas ([mssm.src.python.formula.Formula]) – A list of mssm.src.python.formula.Formula’s provided to a model.

Returns:

A list, holding - for each \(\lambda\) parameter to be estimated - an initial value.

Return type:

[float]

llk(y, mu)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

lp(y, mu)

Log-probability of observing every proportion in \(\mathbf{y}\) under their respective binomial with mean = \(\boldsymbol{\mu}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed proportion.

  • mu ([float]) – The vector containing the predicted probability for the response distribution corresponding to each observation.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.Family(link: Link, twopar: bool, scale: float = None)

Bases: object

Base class to be implemented by Exp. family member.

Parameters:
  • link (Link) – The link function to be used by the model of the mean of this family.

  • twopar (bool) – Whether the family has two parameters (mean,scale) to be estimated (i.e., whether the likelihood is a function of two parameters), or only a single one (usually the mean).

  • scale (float or None, optional) – Known/fixed scale parameter for this family.

D(y, mu)

Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

V(mu, **kwargs)

The variance function (of the mean; see Wood, 2017, 3.1.2). Different exponential families allow for different relationships between the variance in our random response variable and the mean of it. For the normal model this is assumed to be constant.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

deviance(y, mu)

Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

init_mu(y)

Convenience function to compute an initial \(\boldsymbol{\mu}\) estimate passed to the GAMM/PIRLS estimation routine.

Parameters:

y ([float]) – The vector containing each observation.

llk(y, mu, **kwargs)

log-probability of \(\mathbf{y}\) under this family with mean = \(\boldsymbol{\mu}\). Essentially sum over all elements in the vector returned by the lp() method.

Families with more than one parameter that needs to be estimated in order to evaluate the model’s log-likelihood (i.e., twopar=True) must accept, as keyword argument, a scale parameter with a default value, e.g.,:

def llk(self, y, mu, scale=1):
   ...

You can check the implementation of the Gaussian Family for an example.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

lp(y, mu, **kwargs)

Log-probability of observing every value in \(\mathbf{y}\) under this family with mean = \(\boldsymbol{\mu}\).

Families with more than one parameter that needs to be estimated in order to evaluate the model’s log-likelihood (i.e., twopar=True) must accept, as keyword argument, a scale parameter with a default value, e.g.,:

def lp(self, y, mu, scale=1):
   ...

You can check the implementation of the Gaussian Family for an example.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]
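
To make the required interface concrete, here is a hedged sketch of a custom two-parameter family mirroring the Gaussian. mssm's own Gaussian family should be used in practice; the exact conventions assumed here - e.g., that the scale passed to llk()/lp() is a variance - are illustrative assumptions:

   import numpy as np
   import scipy.stats as scp
   from mssm.src.python.exp_fam import Family, Identity

   class MyGaussian(Family):
       def __init__(self, scale=None):
           # Two-parameter family: mean and scale need to be estimated.
           super().__init__(Identity(), True, scale)

       def init_mu(self, y):
           return y + 1e-3 * np.mean(y)  # mild perturbation as a starting value

       def V(self, mu):
           return np.ones_like(mu)       # constant mean-variance relationship

       def lp(self, y, mu, scale=1):
           # Assumption: scale is treated as a variance here.
           return scp.norm.logpdf(y, loc=mu, scale=np.sqrt(scale))

       def llk(self, y, mu, scale=1):
           return np.sum(self.lp(y, mu, scale))

       def D(self, y, mu):
           return (y - mu) ** 2          # per-observation deviance contribution

       def deviance(self, y, mu):
           return np.sum(self.D(y, mu))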

class mssm.src.python.exp_fam.GAMLSSFamily(pars: int, links: [Link])

Bases: object

Base-class to be implemented by families of Generalized Additive Mixed Models of Location, Scale, and Shape (GAMMLSS; Rigby & Stasinopoulos, 2005).

Apart from the required methods, three mandatory attributes need to be defined by the __init__() constructor of implementations of this class. These are required to evaluate the first and second (pure & mixed) derivative of the log-likelihood with respect to any of the log-likelihood’s parameters. See the variables below.

Optionally, a mean_init_fam attribute can be defined - specifying a Family member that is fitted to the data to get an initial estimate of the mean parameter of the assumed distribution.

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

Parameters:
  • pars (int) – Number of parameters of the distribution belonging to the random variables assumed to have generated the observations, e.g., 2 for the Normal: mean and standard deviation.

  • links ([Link]) – Link functions for each of the parameters of the distribution.

Variables:
  • d1 ([Callable]) – A list holding n_par functions to evaluate the first partial derivatives of llk with respect to each parameter of the llk. Needs to be initialized when calling __init__().

  • d2 ([Callable]) – A list holding n_par functions to evaluate the second (pure) partial derivatives of llk with respect to each parameter of the llk. Needs to be initialized when calling __init__().

  • d2m ([Callable]) – A list holding n_par*(n_par-1)/2 functions to evaluate the second mixed partial derivatives of llk with respect to each parameter of the llk in order: d2m[0] = \(\partial l/\partial \mu_1 \partial \mu_2\), d2m[1] = \(\partial l/\partial \mu_1 \partial \mu_3\), …, d2m[n_par-2] = \(\partial l/\partial \mu_1 \partial \mu_{n_{par}}\), d2m[n_par-1] = \(\partial l/\partial \mu_2 \partial \mu_3\), d2m[n_par] = \(\partial l/\partial \mu_2 \partial \mu_4\), … . Needs to be initialized when calling __init__().
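
As a concrete illustration of these attributes, below is a hedged sketch of a two-parameter Normal GAMMLSS family. mssm ships GAUMLSS for this purpose; the derivative expressions are standard calculus for the Normal log-likelihood with standard deviation \(\sigma\), but the exact calling conventions of the derivative functions are assumptions here:

   import numpy as np
   import scipy.stats as scp
   from mssm.src.python.exp_fam import GAMLSSFamily, Identity, LOG

   class MyGAULSS(GAMLSSFamily):
       def __init__(self):
           super().__init__(2, [Identity(), LOG()])
           # First partial derivatives of the llk w.r.t. mu and sigma:
           self.d1 = [lambda y, mu, sigma: (y - mu) / sigma**2,
                      lambda y, mu, sigma: (y - mu)**2 / sigma**3 - 1 / sigma]
           # Pure second partial derivatives:
           self.d2 = [lambda y, mu, sigma: -1 / sigma**2,
                      lambda y, mu, sigma: 1 / sigma**2 - 3 * (y - mu)**2 / sigma**4]
           # Mixed second partial derivatives: n_par*(n_par-1)/2 = 1 entry.
           self.d2m = [lambda y, mu, sigma: -2 * (y - mu) / sigma**3]

       def lp(self, y, mu, sigma):
           return scp.norm.logpdf(y, loc=mu, scale=sigma)

       def llk(self, y, mu, sigma):
           return np.sum(self.lp(y, mu, sigma))

       def get_resid(self, y, mu, sigma):
           return (y - mu) / sigma  # standardized residuals, ~ N(0,1) if correct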

get_resid(y, *mus)

Get standardized residuals for a GAMMLSS model (Rigby & Stasinopoulos, 2005).

Any implementation of this function should return a vector that looks like what could be expected from taking len(y) independent draws from \(N(0,1)\).

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mus ([[float]]) – A list containing self.n_par lists - one for each parameter of the distribution. Each of those lists contains the expected value for a particular parameter for each of the N observations.

Returns:

An N-dimensional vector containing the standardized residuals (one per observation).

Return type:

[float]

init_coef(models)

(Optional) Function to initialize the coefficients of the model.

Can return None, in which case random initialization will be used.

Parameters:

models ([mssm.models.GAMM]) – A list of mssm.models.GAMM’s, each based on one of the formulas provided to a model.

Returns:

A numpy.array of shape (-1,1), holding initial values for all model coefficients.

Return type:

numpy array

init_lambda(formulas)

(Optional) Function to initialize the smoothing parameters of the model.

Can return None, in which case random initialization will be used.

Parameters:

formulas ([mssm.src.python.formula.Formula]) – A list of mssm.src.python.formula.Formula’s provided to a model.

Returns:

A list, holding - for each \(\lambda\) parameter to be estimated - an initial value.

Return type:

[float]

llk(y, *mus)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mus ([[float]]) – A list containing self.n_par lists - one for each parameter of the distribution. Each of those lists contains the expected value for a particular parameter for each of the N observations.

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, *mus)

Log-probability of observing every element in \(\mathbf{y}\) under their respective distribution parameterized by mus.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mus ([[float]]) – A list containing self.n_par lists - one for each parameter of the distribution. Each of those lists contains the expected value for a particular parameter for each of the N observations.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.GAMMALS(links: [Link])

Bases: GAMLSSFamily

Family for a GAMMA GAMMLSS model (Rigby & Stasinopoulos, 2005).

This Family follows the Gamma family, in that we assume: \(Y_i \sim \Gamma(\mu_i,\phi_i)\). The difference to the Gamma family is that we now also model \(\phi\) as an additive combination of smooth variables and other parametric terms. The Gamma distribution is usually not expressed in terms of the mean and scale (\(\phi\)) parameter but rather in terms of a shape and rate parameter - called \(\alpha\) and \(\beta\) respectively. Wood (2017) provides \(\alpha = 1/\phi\). With this we can obtain \(\beta = 1/(\phi\mu)\) (see the source-code of the lp() method of the Gamma family for details).

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

Parameters:

links ([Link]) – Link functions for the mean and standard deviation. Standard would be links=[LOG(),LOG()].

get_resid(y, mu, scale)

Get standardized residuals for a Gamma GAMMLSS model (Rigby & Stasinopoulos, 2005).

Essentially, to get a standardized residual vector we first have to account for the mean-variance relationship of our RVs (as we also have to do for the Gamma family) - for this we can simply compute deviance residuals again (see Wood, 2017). These should be \(\sim N(0,\phi_i)\) (where \(\phi_i\) is the element in scale for a specific observation) - so if we divide each of those by the observation-specific scale we can expect the resulting standardized residuals to be \(\sim N(0,1)\) if the model is correct.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale ([float]) – The vector containing the predicted scale parameter for the response distribution corresponding to each observation.

Returns:

A list of standardized residuals that should be ~ N(0,1) if the model is correct.

Return type:

[float]

init_coef(models)

Function to initialize the coefficients of the model.

Fits a GAMM for the mean and initializes all coef. for the scale parameter to 1.

Parameters:

models ([mssm.models.GAMM]) – A list of mssm.models.GAMM’s, each based on one of the formulas provided to a model.

Returns:

A numpy.array of shape (-1,1), holding initial values for all model coefficients.

Return type:

numpy array

llk(y, mu, scale)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale ([float]) – The vector containing the predicted scale parameter for the response distribution corresponding to each observation.

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, mu, scale)

Log-probability of observing every value in \(\mathbf{y}\) under their respective Gamma with mean = \(\boldsymbol{\mu}\) and scale = \(\boldsymbol{\phi}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale ([float]) – The vector containing the predicted scale parameter for the response distribution corresponding to each observation.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.GAUMLSS(links: [Link])

Bases: GAMLSSFamily

Family for a Normal GAMMLSS model (Rigby & Stasinopoulos, 2005).

This Family follows the Gaussian family, in that we assume: \(Y_i \sim N(\mu_i,\sigma_i)\), i.e., each of the \(N\) observations is still believed to have been generated from an independent normally distributed RV with observation-specific mean.

The important difference is that the scale parameter, \(\sigma\), is now also observation-specific and modeled as an additive combination of smooth functions and other parametric terms, just like the mean is in a Normal GAM. Note that this explicitly models heteroscedasticity - the residuals are no longer assumed to be i.i.d. samples from \(N(0,\sigma)\), since \(\sigma\) can now differ between residual realizations.

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

Parameters:

links ([Link]) – Link functions for the mean and standard deviation. Standard would be links=[Identity(),LOG()].

get_resid(y, mu, sigma)

Get standardized residuals for a Normal GAMMLSS model (Rigby & Stasinopoulos, 2005).

Essentially, each residual should reflect a realization of a normal with mean zero and observation-specific standard deviation. After scaling each residual by their observation-specific standard deviation we should end up with standardized residuals that can be expected to be i.i.d \(\sim N(0,1)\) - assuming that our model is correct.
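
That is, the residuals follow the standardization just described (a small numeric sketch with placeholder values):

   import numpy as np

   rng = np.random.default_rng(1)
   mu = np.array([0.0, 1.0, 2.0, 3.0])
   sigma = np.array([0.5, 1.0, 2.0, 4.0])
   y = rng.normal(mu, sigma)

   res = (y - mu) / sigma  # i.i.d. ~ N(0,1) if the model is correct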

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • sigma ([float]) – The vector containing the predicted standard deviation for the response distribution corresponding to each observation.

Returns:

A list of standardized residuals that should be ~ N(0,1) if the model is correct.

Return type:

[float]

init_coef(models)

Function to initialize the coefficients of the model.

Fits a GAMM for the mean and initializes all coef. for the standard deviation to 1.

Parameters:

models ([mssm.models.GAMM]) – A list of mssm.models.GAMM’s, each based on one of the formulas provided to a model.

Returns:

A numpy.array of shape (-1,1), holding initial values for all model coefficients.

Return type:

numpy array

llk(y, mu, sigma)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • sigma ([float]) – The vector containing the predicted standard deviation for the response distribution corresponding to each observation.

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, mu, sigma)

Log-probability of observing every value in \(\mathbf{y}\) under their respective Normal with observation-specific mean and standard deviation.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • sigma ([float]) – The vector containing the predicted standard deviation for the response distribution corresponding to each observation.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.GENSMOOTHFamily(pars: int, links: [Link], *llkargs)

Bases: object

Base-class for General Smooth “families” as discussed by Wood, Pya, & Säfken (2016). For estimation of mssm.models.GSMM models via BFGS it is sufficient to implement llk(); gradient() and hessian() can then simply return None. For exact estimation via Newton’s method, the latter two functions need to be implemented and have to return the gradient and hessian at the current coefficient estimate respectively.

Additional parameters needed for likelihood, gradient, or hessian evaluation can be passed along via the llkargs.
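
A hedged sketch of a minimal implementation sufficient for BFGS-based estimation follows - a Gaussian log-likelihood with unit scale, up to an additive constant. The calling conventions follow the method signatures documented below; the class itself is hypothetical:

   import numpy as np
   from mssm.src.python.exp_fam import GENSMOOTHFamily, Identity

   class GaussianGSMM(GENSMOOTHFamily):
       def __init__(self):
           super().__init__(1, [Identity()])

       def llk(self, coef, coef_split_idx, y, Xs):
           eta = Xs[0] @ coef              # single linear predictor
           res = y - eta
           return -0.5 * np.sum(res**2)    # log-likelihood up to a constant

       def gradient(self, coef, coef_split_idx, y, Xs):
           return None                     # sufficient when estimating via BFGS

       def hessian(self, coef, coef_split_idx, y, Xs):
           return None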

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Nocedal & Wright (2006). Numerical Optimization. Springer New York.

Parameters:
  • pars (int) – Number of parameters of the likelihood.

  • links ([Link]) – List of Link functions for each parameter of the likelihood, e.g., links=[Identity(),LOG()].

gradient(coef, coef_split_idx, y, Xs)

Function to evaluate the gradient of the llk at current coefficient estimate coef.

Parameters:
  • coef ([float]) – The current coefficient estimate (as np.array of shape (-1,1) - so it must not be flattened!).

  • coef_split_idx ([int]) – A list used to split (via np.split()) the coef into the sub-sets associated with each parameter of the llk.

  • y ([float]) – The vector containing each observation.

  • Xs ([scp.sparse.csc_array]) – A list of sparse model matrices per likelihood parameter.

Returns:

The gradient of the log-likelihood evaluated at coef, as a numpy.array of shape (-1,1).

Return type:

[float]

hessian(coef, coef_split_idx, y, Xs)

Function to evaluate the hessian of the llk at current coefficient estimate coef.

Parameters:
  • coef ([float]) – The current coefficient estimate (as np.array of shape (-1,1) - so it must not be flattened!).

  • coef_split_idx ([int]) – A list used to split (via np.split()) the coef into the sub-sets associated with each parameter of the llk.

  • y ([float]) – The vector containing each observation.

  • Xs ([scp.sparse.csc_array]) – A list of sparse model matrices per likelihood parameter.

Returns:

The Hessian of the log-likelihood evaluated at coef.

Return type:

scipy.sparse.csc_array

init_coef(models)

(Optional) Function to initialize the coefficients of the model.

Can return None , in which case random initialization will be used.

Parameters:

models ([mssm.models.GAMM]) – A list of mssm.models.GAMM’s, each based on one of the formulas provided to a model.

Returns:

A numpy.array of shape (-1,1), holding initial values for all model coefficients.

Return type:

numpy array

init_lambda(formulas)

(Optional) Function to initialize the smoothing parameters of the model.

Can return None, in which case random initialization will be used.

Parameters:

formulas ([mssm.src.python.formula.Formula]) – A list of mssm.src.python.formula.Formula’s provided to a model.

Returns:

A list, holding - for each \(\lambda\) parameter to be estimated - an initial value.

Return type:

[float]

llk(coef, coef_split_idx, y, Xs)

log-probability of data under given model.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • coef ([float]) – The current coefficient estimate (as np.array of shape (-1,1) - so it must not be flattened!).

  • coef_split_idx ([int]) – A list used to split (via np.split()) the coef into the sub-sets associated with each parameter of the llk.

  • y ([float]) – The vector containing each observation.

  • Xs ([scp.sparse.csc_array]) – A list of sparse model matrices per likelihood parameter.

Returns:

The log-likelihood evaluated at coef.

Return type:

float

class mssm.src.python.exp_fam.Gamma(link: Link = LOG(), scale: float = None)

Bases: Family

Gamma Family.

We assume: \(Y_i \sim \Gamma(\mu_i,\phi)\). The Gamma distribution is usually not expressed in terms of the mean and scale (\(\phi\)) parameter but rather in terms of a shape and rate parameter - called \(\alpha\) and \(\beta\) respectively. Wood (2017) provides \(\alpha = 1/\phi\). With this we can obtain \(\beta = 1/(\phi\mu)\) (see the source-code of the lp() method for details).
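
Numerically, the conversion between the two parameterizations looks as follows (a sketch relying on scipy, which parameterizes the Gamma via shape and scale = 1/rate):

   import numpy as np
   import scipy.stats as scp

   mu, phi = 2.0, 0.5
   alpha = 1 / phi           # shape
   beta = 1 / (phi * mu)     # rate

   assert np.isclose(alpha / beta, mu)  # mean of Gamma(alpha, beta) is alpha/beta
   logp = scp.gamma.logpdf(1.3, a=alpha, scale=1 / beta)  # scipy takes scale = 1/rate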

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • link (Link) – The link function to be used by the model of the mean of this family. By default set to the log link.

  • scale (float or None, optional) – Known scale parameter for this family - by default set to None so that the scale parameter is estimated.

D(y, mu)

Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

A N-dimensional vector containing the contribution of each data-point to the overall model deviance.

Return type:

[float]

V(mu)

Variance function for the Gamma family.

The variance of random variable \(Y\) is proportional to it’s mean raised to the second power.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – a N-dimensional vector of the model prediction/the predicted mean

Returns:

mu raised to the power of 2

Return type:

[float]

deviance(y, mu)

Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

The model deviance.

Return type:

float

llk(y, mu, scale=1)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale (float, optional) – The (estimated) scale parameter, defaults to 1

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, mu, scale=1)

Log-probability of observing every value in \(\mathbf{y}\) under their respective Gamma with mean = \(\boldsymbol{\mu}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale (float, optional) – The (estimated) scale parameter, defaults to 1

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.Gaussian(link: Link = Identity(), scale: float = None)

Bases: Family

Normal/Gaussian Family.

We assume: \(Y_i \sim N(\mu_i,\sigma)\) - i.e., each of the \(N\) observations is generated from a normally distributed RV with observation-specific mean and shared scale parameter \(\sigma\). Equivalent to the assumption that the observed residual vector - the difference between the model prediction and the observed data - should look like what could be expected from drawing \(N\) independent samples from a Normal with mean zero and standard deviation equal to \(\sigma\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • link (Link) – The link function to be used by the model of the mean of this family. By default set to the canonical identity link.

  • scale (float or None, optional) – Known scale parameter for this family - by default set to None so that the scale parameter is estimated.

D(y, mu)

Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

A N-dimensional vector containing the contribution of each data-point to the overall model deviance.

Return type:

[float]

V(mu)

Variance function for the Normal family.

Not really a function of the mean, since the variance is assumed constant (i.e., independent of the mean) for this model.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – a N-dimensional vector of the model prediction/the predicted mean

Returns:

a N-dimensional vector of 1s

Return type:

[float]

deviance(y, mu)

Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

The model deviance.

Return type:

float

llk(y, mu, sigma=1)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • sigma (float, optional) – The (estimated) sigma parameter, defaults to 1

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, mu, sigma=1)

Log-probability of observing every value in \(\mathbf{y}\) under their respective Normal with mean = \(\boldsymbol{\mu}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • sigma (float, optional) – The (estimated) sigma parameter, defaults to 1

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.Identity

Bases: Link

Identity Link function. \(\boldsymbol{\mu}=\boldsymbol{\eta}\) and so this link is trivial.

dy1(mu)

First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

dy2(mu)

Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

f(mu)

Canonical link for normal distribution with \(\boldsymbol{\eta} = \boldsymbol{\mu}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

fi(eta)

For the identity link, \(\boldsymbol{\eta} = \boldsymbol{\mu}\), so the inverse is also just the identity. See Faraway (2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

eta ([float]) – The vector containing the model prediction corresponding to each observation.

class mssm.src.python.exp_fam.InvGauss(link: Link = LOG(), scale: float = None)

Bases: Family

Inverse Gaussian Family.

We assume: \(Y_i \sim IG(\mu_i,\phi)\). The Inverse Gaussian distribution is usually not expressed in terms of the mean and scale (\(\phi\)) parameter but rather in terms of a shape and scale parameter - called \(\nu\) and \(\lambda\) respectively (see the scipy implementation). We can simply set \(\nu=\mu\) (compare the scipy density to the one in table 3.1 of Wood, 2017). Wood (2017) shows that \(\phi=1/\lambda\), so this provides \(\lambda=1/\phi\).
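
A hedged sketch of how this maps onto scipy.stats.invgauss: under this convention scipy's shape parameter equals \(\mu\phi\) and its scale equals \(1/\phi\), recovering mean \(\mu\) and variance \(\mu^3\phi\):

   import numpy as np
   import scipy.stats as scp

   mu, phi = 2.0, 0.25
   dist = scp.invgauss(mu * phi, scale=1 / phi)

   assert np.isclose(dist.mean(), mu)
   assert np.isclose(dist.var(), mu**3 * phi)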

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • link (Link) – The link function to be used by the model of the mean of this family. By default set to the log link.

  • scale (float or None, optional) – Known scale parameter for this family - by default set to None so that the scale parameter is estimated.

D(y, mu)

Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

A N-dimensional vector containing the contribution of each data-point to the overall model deviance.

Return type:

[float]

V(mu)

Variance function for the Inverse Gaussian family.

The variance of random variable \(Y\) is proportional to it’s mean raised to the third power.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – a N-dimensional vector of the model prediction/the predicted mean

Returns:

mu raised to the power of 3

Return type:

[float]

deviance(y, mu)

Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

The model deviance.

Return type:

float

llk(y, mu, scale=1)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale (float, optional) – The (estimated) scale parameter, defaults to 1

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, mu, scale=1)

Log-probability of observing every value in \(\mathbf{y}\) under their respective inverse Gaussian with mean = \(\boldsymbol{\mu}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale (float, optional) – The (estimated) scale parameter, defaults to 1

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.LOG

Bases: Link

Log Link function. \(log(\boldsymbol{\mu}) = \boldsymbol{\eta}\).
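
For example, f() and fi() invert each other. A small numeric check follows; the derivative expressions \(1/\mu\) and \(-1/\mu^2\) mentioned in the comment are standard calculus but stated here as an assumption about the return conventions of dy1() and dy2():

   import numpy as np
   from mssm.src.python.exp_fam import LOG

   link = LOG()
   mu = np.array([0.5, 1.0, 2.0])

   assert np.allclose(link.fi(link.f(mu)), mu)  # fi() inverts f()
   # dy1(mu) should equal 1/mu and dy2(mu) should equal -1/mu**2.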

dy1(mu)

First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017).

References:
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

dy2(mu)

Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

f(mu)

Non-canonical link for Gamma distribution with \(log(\boldsymbol{\mu}) = \boldsymbol{\eta}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

fi(eta)

For the log link, \(\boldsymbol{\eta} = log(\boldsymbol{\mu})\), so \(exp(\boldsymbol{\eta})=\boldsymbol{\mu}\). See Faraway (2016).

References:
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

eta ([float]) – The vector containing the model prediction corresponding to each observation.

class mssm.src.python.exp_fam.LOGb(b)

Bases: Link

Log + b Link function. \(log(\boldsymbol{\mu} + b) = \boldsymbol{\eta}\).

Parameters:

b (float) – The constant to add to \(\mu\) before taking the log.

dy1(mu)

First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017).

References:
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

dy2(mu)

Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

f(mu)

\(log(\boldsymbol{\mu} + b) = \boldsymbol{\eta}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

fi(eta)

For the logb link, \(\boldsymbol{\eta} = log(\boldsymbol{\mu} + b)\), so \(exp(\boldsymbol{\eta})-b=\boldsymbol{\mu}\).

References:
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

eta ([float]) – The vector containing the model prediction corresponding to each observation.

class mssm.src.python.exp_fam.Link

Bases: object

Link function base class. To be implemented by any link function used for GAMMs and GAMMLSS models. Only links used by GAMMLSS models require implementing the dy2 function. Note that care must be taken that every method returns only valid values. Specifically, no returned element may be numpy.nan or numpy.inf.

dy1(mu)

First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

dy2(mu)

Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

f(mu)

Link function \(f()\) mapping mean \(\boldsymbol{\mu}\) of an exponential family to the model prediction \(\boldsymbol{\eta}\), so that \(f(\boldsymbol{\mu}) = \boldsymbol{\eta}\). See Wood (2017, 3.1.2) and Faraway (2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

fi(eta)

Inverse of the link function mapping \(\boldsymbol{\eta} = f(\boldsymbol{\mu})\) to the mean \(fi(\boldsymbol{\eta}) = fi(f(\boldsymbol{\mu})) = \boldsymbol{\mu}\). See Faraway (2016) and the Link.f function.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

eta ([float]) – The vector containing the model prediction corresponding to each observation.
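As an illustration of this contract, a minimal custom link (here simply re-implementing the log link as a sketch; mssm already provides LOG, so this is purely illustrative) could look as follows:

import numpy as np
from mssm.src.python.exp_fam import Link

class MyLog(Link):
    # Sketch of the Link contract for f(mu) = log(mu).
    def f(self, mu):
        return np.log(mu)

    def fi(self, eta):
        return np.exp(eta)

    def dy1(self, mu):
        # d f(mu)/d mu = 1/mu
        return 1 / mu

    def dy2(self, mu):
        # d^2 f(mu)/d mu^2 = -1/mu^2
        return -1 / np.power(mu, 2)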

class mssm.src.python.exp_fam.Logit

Bases: Link

Logit Link function, which is canonical for the binomial model. \(\boldsymbol{\eta}\) = log-odds of success.

dy1(mu)

First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017):

\[f(\mu) = log(\mu / (1 - \mu)) = log(\mu) - log(1 - \mu)\]

\[\partial f(\mu)/ \partial \mu = 1/\mu + 1/(1 - \mu)\]

Faraway (2016) simplifies this to: \(\partial f(\mu)/ \partial \mu = 1 / (\mu - \mu^2) = 1/((1-\mu)\mu)\)

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

dy2(mu)

Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).
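Given the first derivative stated above, this amounts to \(\partial^2 f(\mu)/ \partial \mu^2 = -1/\mu^2 + 1/(1 - \mu)^2\).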

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

f(mu)

Canonical link for binomial distribution with \(\boldsymbol{\mu}\) holding the probabilities of success, so that the model prediction \(\boldsymbol{\eta}\) is equal to the log-odds.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

fi(eta)

For the logit link and the binomial model, \(\boldsymbol{\eta}\) = log-odds, so the inverse to go from \(\boldsymbol{\eta}\) to \(\boldsymbol{\mu}\) is \(\boldsymbol{\mu} = exp(\boldsymbol{\eta}) / (1 + exp(\boldsymbol{\eta}))\). See Faraway (2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

eta ([float]) – The vector containing the model prediction corresponding to each observation.

class mssm.src.python.exp_fam.MULNOMLSS(pars: int)

Bases: GAMLSSFamily

Family for a Multinomial GAMMLSS model (Rigby & Stasinopoulos, 2005).

This Family assumes that each observation \(y_i\) corresponds to one of \(K\) classes (labeled as 0, …, \(K-1\)) and reflects a realization of an independent RV \(Y_i\) with observation-specific probability mass function defined over the \(K\) classes. These \(K\) probabilities - that \(Y_i\) takes on each of the \(K\) classes - are modeled as additive combinations of smooth functions of covariates and other parametric terms.

As an example, consider a visual search experiment where \(K-1\) distractors are presented on a computer screen together with a single target and subjects are instructed to find the target and fixate it. With a Multinomial model we can estimate how the probability of looking at each of the \(K\) stimuli on the screen changes (smoothly) over time and as a function of other predictor variables of interest (e.g., contrast of stimuli, depending on whether participants are instructed to be fast or accurate).

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

Parameters:

pars (int) – K-1, i.e., the number of classes minus one, which equals the number of linear predictors.

get_resid(y, *mus)

Get standardized residuals for a GAMMLSS model (Rigby & Stasinopoulos, 2005).

Any implementation of this function should return a vector that looks like what could be expected from taking len(y) independent draws from \(N(0,1)\).

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mus ([[float]]) – A list including self.n_par lists - one for each parameter of the distribution. Each of those lists contains the expected value for a particular parameter for each of the N observations.

Returns:

An N-dimensional vector containing the standardized residuals under the current model.

Return type:

[float]

llk(y, *mus)

Log-probability of the data under the given model. Essentially the sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed class; every element must be larger than or equal to 0 and smaller than self.n_par + 1.

  • mus ([[float]]) – A list containing K-1 (self.n_par) lists, each containing the non-normalized probabilities of observing class k for every observation.

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, *mus)

Log-probability of observing class k under current model.

Our DV consists of K classes but we essentially enforce a sum-to-zero constraint on the DV so that we end up modeling only K-1 (non-normalized) probabilities of observing class k (for all k except k==K) as an additive combination of smooth functions of our covariates and other parametric terms. The probability of observing class K as well as the normalized probabilities of observing each other class can readily be computed from these K-1 non-normalized probabilities. This is explained quite well in the Wikipedia article on multinomial logistic regression.

Specifically, the probability of the outcome being class k is simply:

\(p(Y_i == k) = \mu_k / (1 + \sum_j^{K-1} \mu_j)\) where \(\mu_k\) is the aforementioned non-normalized probability of observing class \(k\) - which is simply set to 1 for class \(K\) (this follows from the sum-to-zero constraint; see Wikipedia).

So, the log-prob of the outcome being class k is:

\(log(p(Y_i == k)) = log(\mu_k) - log(1 + \sum_j^{K-1} \mu_j)\)
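A small numpy sketch of this normalization (illustrative only; mus here holds the \(K-1\) non-normalized probabilities for a single observation):

import numpy as np

mus = np.array([0.5, 2.0, 1.0])    # K-1 = 3 non-normalized probabilities, so K = 4
denom = 1 + np.sum(mus)            # 1 + sum over the K-1 mu_j
p = np.append(mus, 1.0) / denom    # class K (the reference) gets mu = 1
assert np.isclose(np.sum(p), 1.0)  # normalized probabilities sum to one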

Parameters:
  • y ([float]) – The vector containing each observed class; every element must be larger than or equal to 0 and smaller than self.n_par + 1.

  • mus ([[float]]) – A list containing K-1 (self.n_par) lists, each containing the non-normalized probabilities of observing class k for every observation.

Returns:

An N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.Poisson(link: ~mssm.src.python.exp_fam.Link = <mssm.src.python.exp_fam.LOG object>)

Bases: Family

Poisson Family.

We assume: \(Y_i \sim P(\lambda_i)\). We can simply set \(\lambda_i=\mu_i\) (compare the scipy density to the one in table 3.1 of Wood, 2017) and treat the scale parameter of a GAMM (\(\phi\)) as fixed/known at 1.

Parameters:

link (Link) – The link function to be used by the model of the mean of this family. By default set to the log link.

D(y, mu)

Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)
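For the Poisson family this contribution takes the familiar form (compare table 3.1 in Wood, 2017; by convention \(y_i log(y_i/\mu_i) = 0\) whenever \(y_i = 0\)):

\[d_i = 2\left(y_i log(y_i/\mu_i) - (y_i - \mu_i)\right)\]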

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

An N-dimensional vector containing the contribution of each data-point to the overall model deviance.

Return type:

[float]

V(mu)

Variance function for the Poisson family.

The variance of random variable \(Y\) is proportional to its mean.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – a N-dimensional vector of the model prediction/the predicted mean

Returns:

mu

Return type:

[float]

deviance(y, mu)

Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

The model deviance.

Return type:

float

init_mu(y)

Function providing initial \(\boldsymbol{\mu}\) vector for Poisson GAMM.

We shrink extreme observed counts towards the mean.

Parameters:

y ([float]) – The vector containing each observation.
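The exact initialization is not restated here, but the idea can be sketched as follows (a hypothetical illustration of shrinking counts towards the mean, not mssm's actual code):

import numpy as np

y = np.array([0., 1., 30.])
mu_init = (y + np.mean(y)) / 2    # hypothetical: average each count with the grand mean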

llk(y, mu)

Log-probability of the data under the given model. Essentially the sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, mu)

Log-probability of observing every value in \(\mathbf{y}\) under their respective Poisson with mean = \(\boldsymbol{\mu}\).
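In terms of scipy, this is essentially (a sketch, not necessarily mssm's exact implementation):

import numpy as np
from scipy.stats import poisson

y = np.array([0, 2, 5])
mu = np.array([0.5, 2.0, 4.5])
lp = poisson.logpmf(y, mu)    # element-wise log-probability under Poisson(mu)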

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.PropHaz(ut, r)

Bases: GENSMOOTHFamily

Family for proportional Hazard model - a type of General Smooth model as discussed by Wood, Pya, & Säfken (2016).

Based on Supplementary materials G in Wood, Pya, & Säfken (2016).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Nocedal & Wright (2006). Numerical Optimization. Springer New York.

Parameters:
  • ut (numpy.array) – Unique event time vector (each time represented as int) as described by WPS (2016), holding unique event times in decreasing order.

  • r (numpy.array) – Index vector as described by WPS (2016), holding for each data-point (i.e., for each row in Xs[0]) the index to its corresponding event time in ut.

get_baseline_hazard(coef, delta, Xs)

Get the cumulative baseline hazard function + variance estimate as defined by Wood, Pya, & Säfken (2016).

The function is evaluated for all k unique event times that were available in the data.

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

Parameters:
  • coef (numpy.array) – Coefficient vector as numpy.array of shape (-1,1).

  • Xs ([scipy.sparse.csc_array]) – The list of model matrices (here holding a single model matrix) obtained from mssm.models.GSMM.get_mmat().

  • delta (numpy.array) – Dependent variable passed to mssm.src.python.formula.Formula(), holds (for each row in Xs[0]) a value in {0,1}, indicating whether for that observation the event was observed or not.

Returns:

Two arrays: the first holds the k baseline hazard function estimates, the second holds k variance estimates - one for each of the baseline hazard estimates.

Return type:

(numpy.array, numpy.array)

get_survival(coef, Xs, delta, t, x, V)

Compute survival function + variance at time-point t, given k optional covariate vector(s) x as defined by Wood, Pya, & Säfken (2016).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

Parameters:
  • coef (numpy.array) – Coefficient vector as numpy.array of shape (-1,1).

  • Xs ([scipy.sparse.csc_array]) – The list of model matrices (here holding a single model matrix) obtained from mssm.models.GSMM.get_mmat().

  • delta (numpy.array) – Dependent variable passed to mssm.src.python.formula.Formula(), holds (for each row in Xs[0]) a value in {0,1}, indicating whether for that observation the event was observed or not.

  • t (int) – Time-point at which to evaluate the survival function.

  • x (numpy.array) – Optional vector (or matrix) of covariate values. Needs to be of shape (k,len(coef)).

  • V (scipy.sparse.csc_array) – Estimated co-variance matrix of the posterior for coef.

Returns:

Two arrays: the first holds k survival function estimates, the second holds k variance estimates - one for each of the survival function estimates.

Return type:

(numpy.array, numpy.array)

gradient(coef, coef_split_idx, delta, Xs)

Gradient as defined by Wood, Pya, & Säfken (2016).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

delta (passed as dependent variable) holds values in {0,1}, indicating whether the event was observed or not.

hessian(coef, coef_split_idx, delta, Xs)

Hessian as defined by Wood, Pya, & Säfken (2016).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

delta (passed as dependent variable) holds values in {0,1}, indicating whether the event was observed or not.

init_coef(models)

Function to initialize the coefficients of the model.

Parameters:

models ([mssm.models.GAMM]) – A list of mssm.models.GAMM’s, each based on one of the formulas provided to the model.

Returns:

A numpy.array of shape (-1,1), holding initial values for all model coefficients.

Return type:

numpy.array

llk(coef, coef_split_idx, delta, Xs)

Log-likelihood function as defined by Wood, Pya, & Säfken (2016).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

delta (passed as dependent variable) holds values in {0,1}, indicating whether the event was observed or not.

mssm.src.python.exp_fam.est_scale(res, rows_X, total_edf)

Scale estimate from Wood & Fasiolo (2017).

References:
  • Wood, S. N., & Fasiolo, M. (2017). A generalized Fellner-Schall method for smoothing parameter optimization with application to Tweedie location, scale and shape models.

Parameters:
  • res ([float]) – The vector containing the difference between the model prediction and the (pseudo) data.

  • rows_X (int) – The number of observations collected.

  • total_edf (float) – The expected degrees of freedom for the model.
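The exact estimator is not restated here, but presumably it corresponds to the usual residual-based form (compare Wood, 2017), i.e., the squared residuals summed and divided by the residual degrees of freedom:

\[\hat{\phi} = \frac{\sum_i res_i^2}{N - \tau}\]

with \(N\) corresponding to rows_X and \(\tau\) to total_edf.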

mssm.src.python.formula module

class mssm.src.python.formula.Formula(lhs: ~mssm.src.python.formula.lhs, terms: [<class 'mssm.src.python.terms.GammTerm'>], data: ~pandas.core.frame.DataFrame, series_id: str = None, codebook: dict = None, print_warn=True, keep_cov=False, file_paths=[], file_loading_nc=1, file_loading_kwargs: dict = {'header': 0, 'index_col': False})

Bases: object

The formula of a regression equation.

Parameters:
  • lhs – The lhs object defining the dependent variable.

  • terms ([GammTerm]) – A list of the terms which should be added to the model. See mssm.src.python.terms for info on which terms can be added.

  • data (pd.DataFrame or None) – A pandas dataframe (with header!) of the data which should be used to estimate the model. The variable specified for lhs as well as all variables included for a term in terms need to be present in the data, otherwise the call to Formula will throw an error.

  • series_id (str, optional) – A string identifying the individual experimental units. Usually a unique trial identifier. Only necessary if approximate derivative computations are to be utilized for random smooth terms.

  • codebook (dict or None) – Codebook - keys should correspond to factor variable names specified in terms. Values should again be a dict, with keys for each of the K levels of the factor and values corresponding to an integer in {0, ..., K-1}.

  • print_warn (bool,optional) – Whether warnings should be printed. Useful when fitting models from terminal. Defaults to True.

  • keep_cov (bool,optional) – Whether or not the internal encoding structure of all predictor variables should be created when forming \(\mathbf{X}^T\mathbf{X}\) iteratively instead of forming \(\mathbf{X}\) directly. Can speed up estimation but increases memory footprint. Defaults to False.

  • file_paths ([str],optional) – A list of paths to .csv files from which \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) should be created iteratively. Setting this to a non-empty list will prevent fitting X as a whole. data should then be set to None. Defaults to an empty list.

  • file_loading_nc (int,optional) – How many cores to use to a) accumulate \(\mathbf{X}\) in parallel (if data is not None and file_paths is an empty list) or b) to accumulate \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) (and \(\mathbf{\eta}\) during estimation) (if data is None and file_paths is a non-empty list). For case b, this should really be set to the maximum number of cores available. For case a, this only really speeds up the accumulation of \(\mathbf{X}\) if \(\mathbf{X}\) has very many columns and/or rows. Defaults to 1.

  • file_loading_kwargs (dict,optional) – Any key-word arguments to pass to pandas.read_csv when \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) should be created iteratively (if data is None and file_paths is a non-empty list). Defaults to {"header":0,"index_col":False}.

Variables:
  • coef_per_term ([int]) – A list containing the number of coefficients corresponding to each term included in terms. Initialized at construction.

  • coef_names ([str]) – A list containing a named identifier (e.g., “Intercept”) for each coefficient estimated by the model. Initialized at construction.

  • n_coef (int) – The number of coefficients estimated by the model in total. Initialized at construction.

  • unpenalized_coef (int) – The number of un-penalized coefficients estimated by the model. Initialized at construction.

  • y_flat ([float] or None) – An array containing all values of the dependent variable (i.e., specified by lhs.variable) in the order of the data-frame passed to data. This variable will be initialized at construction, but only if file_paths is an empty list, i.e., in case \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) are not created iteratively.

  • cov_flat ([float] or None) – An array containing all (encoded, in case of categorical predictors) values of each predictor variable included in any of the terms (each column of cov_flat corresponds to a different predictor), in the order of the data-frame passed to data. This variable will be initialized at construction, but only if file_paths is an empty list, i.e., in case \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) are not created iteratively.

  • NOT_NA_flat ([bool] or None) – An array containing an indication for each value of the dependent variable (i.e., specified by lhs.variable) of whether the corresponding value is not a number (“NA”) or not, in the order of the data-frame passed to data. This variable will be initialized at construction, but only if file_paths is an empty list, i.e., in case \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) are not created iteratively.
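For orientation, a minimal formula might be set up as follows (a sketch assuming the usual mssm imports and a pandas DataFrame dat with a continuous column “x”, a factor “sub”, and the dependent variable “y”):

# Sketch - assumes GAMM, Gaussian, Formula, lhs, i, f, and ri have been
# imported from their respective mssm modules.
formula = Formula(lhs("y"), [i(), f(["x"], nk=9), ri("sub")], data=dat)
model = GAMM(formula, Gaussian())
model.fit()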

build_penalties()

Builds the penalties required by self.__terms. Called automatically whenever needed. Call manually only for testing.

encode_data(data, prediction=False)

Encodes data, which needs to be a pd.DataFrame and by default (if prediction==False) builds an index of which rows in data are NA in the column of the dependent variable described by self.lhs.

Parameters:
  • data (pd.DataFrame) – The data to encode.

  • prediction (bool, optional) – Whether or not a NA index and a column for the dependent variable should be generated.

Returns:

A tuple with 7 (optional) entries: (1) the dependent variable described by self.__lhs; (2) the encoded predictor variables as a (N,k) array (the number of rows matches the number of rows of the first entry, the number of columns matches the number of k variables present in the formula); (3) an indication for each row whether the dependent variable described by self.__lhs is NA; (4) like the first entry but split into a list of lists by self.series_id; (5) like the second entry but split into a list of lists by self.series_id; (6) like the third entry but split into a list of lists by self.series_id; (7) start and end points for the splits used to split the previous three elements (identifying the start and end point of every level of self.series_id).

Return type:

(numpy.array or None, numpy.array or None, numpy.array or None, [numpy.array] or None, [numpy.array] or None, [numpy.array] or None, numpy.array or None)

get_coding_factors() dict

Get a copy of the factor coding dictionary. Keys are factor variables in the data, values are dictionaries, where the keys correspond to the encoded levels (int) of the factor and the values to their levels (str).

get_data() DataFrame

Get a copy of the data specified for this formula.

get_depvar() list

Get a copy of the encoded dependent variable (defined via self.__lhs).

get_factor_codings() dict

Get a copy of the factor coding dictionary. Keys are factor variables in the data, values are dictionaries, where the keys correspond to the levels (str) of the factor and the values to their encoded levels (int).

get_factor_levels() dict

Get a copy of the factor levels dictionary. Keys are factor variables in the data, values are np.arrays holding the unique levels (as str) of the corresponding factor.

get_ir_smooth_term_idx() list[int]

Get a copy of the list of indices that identify impulse response terms in self.__terms.

get_lhs() lhs

Get a copy of the lhs specified for this formula.

get_linear_term_idx() list[int]

Get a copy of the list of indices that identify linear terms in self.__terms.

get_n_coef() int

Get the number of coefficients that are implied by the formula.

get_notNA() list

Get a copy of the encoded ‘not a NA’ vector for the dependent variable (defined via self.__lhs).

get_penalties() list

Get a copy of the penalties implied by the formula. Will be None if the penalties have not been initialized yet.

get_random_term_idx() list[int]

Get a copy of the list of indices that identify random terms in self.__terms.

get_smooth_term_idx() list[int]

Get a copy of the list of indices that identify smooth terms in self.__terms.

get_subgroup_variables() list

Returns a copy of sub-group variables for factor smooths.

get_term_names() list

Returns a copy of the list with the names of the terms specified for this formula.

get_terms() list[GammTerm]

Get a copy of the terms specified for this formula.

get_var_map() dict

Get a copy of the var map dictionary. Keys are variables in the data, values their column index in the encoded predictor matrix returned by self.encode_data.

get_var_maxs() dict

Get a copy of the var maxs dictionary. Keys are variables in the data, values are either the maximum value the variable takes on in self.__data for continuous variables or None for categorical variables.

get_var_mins() dict

Get a copy of the var mins dictionary. Keys are variables in the data, values are either the minimum value the variable takes on in self.__data for continuous variables or None for categorical variables.

get_var_mins_maxs() -> (<class 'dict'>, <class 'dict'>)

Get a tuple containing copies of both the mins and maxs dictionaries. See self.get_var_mins and self.get_var_maxs.

get_var_types() dict

Get a copy of the var types dictionary. Keys are variables in the data, values are either VarType.NUMERIC for continuous variables or VarType.FACTOR for categorical variables.

has_intercept() bool

Does this formula include an intercept or not.

has_ir_terms() bool

Does this formula include impulse response terms or not.

class mssm.src.python.formula.VarType(*values)

Bases: Enum

FACTOR = 2
NUMERIC = 1
class mssm.src.python.formula.lhs(variable: str, f: Callable = None)

Bases: object

The Left-hand side of a regression equation.

Parameters:
  • variable (str) – The dependent variable. Can point to continuous and categorical variables.

  • f (Callable, optional) – A function that will be applied to the variable before fitting. For example: np.log(). By default no function is applied to the variable.

mssm.src.python.formula.reparam(X, S, cov, option=1, n_bins=30, QR=False, identity=False, scale=False)

Options 1 - 3 correspond to the natural reparameterization discussed in Wood (2017; 5.4.2), with different strategies for the QR computation of \(\mathbf{X}\). Option 4 helps with stabilizing the REML computation and is from Wood (2011) and section 6.2.7 in Wood (2017):

  1. Form complete matrix \(\mathbf{X}\) based on entire covariate.

  2. Form matrix \(\mathbf{X}\) only based on unique covariate values.

  3. Form matrix \(\mathbf{X}\) only on a sample of the values making up the covariate. The covariate is split up into n_bins equally wide bins. The number of covariate values per bin is then calculated. Subsequently, the ratio relative to the minimum bin size is computed and each ratio is rounded to the nearest integer. Then, ratio many samples are drawn from each bin. That way, imbalance in the covariate is approximately preserved when forming the QR.

  4. Transform term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) based on Wood (2011) and section 6.2.7 in Wood (2017) so that they are full-rank and their log-determinant can be computed safely. In that case, only S needs to be provided and it has to be a list holding the penalties to be transformed.

For Options 1-3:

If QR==True then \(\mathbf{X}\) is decomposed into \(\mathbf{Q}\mathbf{R}\) directly via QR decomposition. Alternatively, we first form \(\mathbf{X}^T\mathbf{X}\) and then compute the Cholesky factor \(\mathbf{L}\) of this product - note that \(\mathbf{L}^T = \mathbf{R}\). Overall the latter strategy is much faster (in particular if option==1), but the increased loss of precision in \(\mathbf{L}^T = \mathbf{R}\) might not be acceptable for every application.

After transformation S only contains elements on its diagonal and \(\mathbf{X}\) holds the transformed functions. As discussed in Wood (2017), the transformed functions are decreasingly flexible - so the elements on the diagonal of \(\mathbf{S}\) become smaller and eventually zero, for elements that are in the kernel of the original \(\mathbf{S}\) (un-penalized == not flexible).

For a similar transformation (based solely on \(\mathbf{S}\)), Wood et al. (2013) show how to further reduce the diagonally transformed \(\mathbf{S}\) to an even simpler identity penalty. As discussed also in Wood (2017), the same behavior of decreasing flexibility when all entries on the diagonal of \(\mathbf{S}\) are 1 can only be maintained if the transformed functions are multiplied by a weight related to their wiggliness. Specifically, more flexible functions need to become smaller in amplitude - so that for the same level of penalization they are removed earlier than less flexible ones. To achieve this, Wood et al. (2013) further post-multiply the transformed matrix \(\mathbf{X}'\) with a matrix that contains on its diagonal the reciprocal of the square root of the transformed penalty matrix (and 1s in the last cells corresponding to the kernel). This is done here if identity=True.

In mgcv the transformed model matrix and penalty can optionally be scaled by the root mean square value of the transformed model matrix (see the nat.param function in mgcv). This is done here if scale=True.
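To make the QR route concrete, a dense numpy sketch of the natural reparameterization (option 1 with QR==True; mssm's actual implementation works with sparse matrices and handles all the options and flags above) might look like this:

import numpy as np

def nat_param_sketch(X, S):
    # X = QR; only the triangular factor R is needed (mode='r').
    R = np.linalg.qr(X, mode='r')
    Ri = np.linalg.inv(R)
    # Eigen-decompose R^{-T} S R^{-1} = U diag(D) U^T.
    D, U = np.linalg.eigh(Ri.T @ S @ Ri)
    D, U = D[::-1], U[:, ::-1]    # sort so the transformed penalty decreases
    Xp = X @ Ri @ U               # transformed model matrix
    Sp = np.diag(D)               # transformed penalty is diagonal
    return Xp, Sp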

For Option 4:

Option 4 enforces re-parameterization of term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) based on Wood (2011) and section 6.2.7 in Wood (2017). In mssm multiple penalties can be placed on individual terms (i.e., tensor terms, random smooths, Kernel penalty) but it is not always the case that the term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) - i.e., the sum over all those individual penalties multiplied with their \(\lambda\) parameters - is of full rank. If we need to form the inverse of the term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) this is problematic. It is also problematic, as discussed by Wood (2011), if the different \(\lambda\) are all of different magnitude, in which case forming the term-specific \(log(|\mathbf{S}_{\boldsymbol{\lambda}}|+)\) becomes numerically difficult.

The re-parameterization implemented by option 4, based on Appendix B in Wood (2011), solves these issues. After this re-parameterization a term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) has been formed that is full rank. And \(log(|\mathbf{S}_{\boldsymbol{\lambda}}|)\) - no longer just a generalized determinant - can be computed without running into numerical problems.

The strategy by Wood (2011) is more general and could be applied to form an overall - not just term-specific - \(\mathbf{S}_{\boldsymbol{\lambda}}\) with these properties. However, in mssm penalties currently cannot overlap, so this is not necessary at the moment.

References:
  • Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.

  • Wood, S. N., Scheipl, F., & Faraway, J. J. (2013). Straightforward intermediate rank tensor product smoothing in mixed models.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

  • mgcv source code (accessed 2024). smooth.R file, nat.param function.

mssm.src.python.gamm_solvers module

class mssm.src.python.gamm_solvers.Fit_info(lambda_updates: int = 0, iter: int = 0, code: int = 1, eps: float = None, K2: float = None, dropped: [<class 'int'>] = None)

Bases: object

Holds information related to convergence (speed) for GAMMs, GAMMLSS, and GSMMs.

Variables:
  • lambda_updates (int) – The total number of lambda updates computed during estimation. Initialized with 0.

  • iter (int) – The number of outer iterations (a single outer iteration can involve multiple lambda updates) completed during estimation. Initialized with 0.

  • code (int) – Convergence status. Anything above 0 indicates that the model did not converge and estimates should be considered carefully. Initialized with 1.

  • eps (float) – The fraction added to the last estimate of the negative Hessian of the penalized likelihood during GAMMLSS or GSMM estimation. If this is not 0 - the model should not be considered as converged, irrespective of what code indicates. This most likely implies that the model is not identifiable. Initialized with None and ignored for GAMM estimation.

  • K2 (float) – An estimate for the condition number of matrix A, where A.T@A=H and H is the final estimate of the negative Hessian of the penalized likelihood. Only available if check_cond>0 when model.fit() is called for any model (i.e., GAMM, GAMMLSS, GSMM). Initialized with None.

  • dropped ([int]) – The final set of coefficients dropped during GAMMLSS/GSMM estimation when using method="QR/Chol", or None in which case no coefficients were dropped. Initialized with None.

K2: float = None
code: int = 1
dropped: [<class 'int'>] = None
eps: float = None
iter: int = 0
lambda_updates: int = 0

mssm.src.python.penalties module

class mssm.src.python.penalties.LambdaTerm(S_J: ~scipy.sparse._csc.csc_array = None, S_J_emb: ~scipy.sparse._csc.csc_array = None, D_J_emb: ~scipy.sparse._csc.csc_array = None, rep_sj: int = 1, lam: float = 1.1, start_index: int = None, frozen: bool = False, type: ~mssm.src.python.penalties.PenType = None, rank: int = None, term: int = None, clust_series: [<class 'int'>] = None, clust_weights: [[<class 'float'>]] = None)

Bases: object

\(\lambda\) storage term.

Variables:
  • S_J (scipy.sparse.csc_array) – The penalty matrix associated with this lambda term. Note, in case multiple penalty matrices share the same lambda value, the rep_sj argument determines how many diagonal blocks we need to fill with this penalty matrix to get S_J_emb. Initialized with None.

  • S_J_emb (scipy.sparse.csc_array) – A zero-embedded version of the penalty matrix associated with this lambda term. Note, this matrix contains rep_sj diagonal sub-blocks each filled with S_J. Initialized with None.

  • D_J_emb (scipy.sparse.csc_array) – Root of S_J_emb, so that D_J_emb@D_J_emb.T=S_J_emb. Initialized with None.

  • rep_sj (int) – How many sequential sub-blocks of S_J_emb need to be filled with S_J. Useful if all levels of a categorical variable for which a separate smooth is to be estimated are assumed to share the same lambda value. Initialized with 1.

  • lam (float) – The current estimate for \(\lambda\). Initialized with 1.1.

  • start_index (int) – The first row and column in the overall penalty matrix taken up by S_J. Initialized with None.

  • type (PenType) – The type of this penalty term. Initialized with None.

  • rank (int) – The rank of S_J. Initialized with None.

  • term (int) – The index of the term in a mssm.src.python.formula.Formula with which this penalty is associated. Initialized with None.

D_J_emb: csc_array = None
S_J: csc_array = None
S_J_emb: csc_array = None
clust_series: [<class 'int'>] = None
clust_weights: [[<class 'float'>]] = None
frozen: bool = False
lam: float = 1.1
rank: int = None
rep_sj: int = 1
start_index: int = None
term: int = None
type: PenType = None

mssm.src.python.terms module

class mssm.src.python.terms.GammTerm(variables: list[str], type: TermType, is_penalized: bool, penalty: list[PenType], pen_kwargs: list[dict])

Bases: object

Base-class implemented by the terms passed to mssm.src.python.formula.Formula.

Parameters:
  • variables ([str]) – List of variables as strings.

  • type (TermType) – Type of the term as an enum.

  • is_penalized (bool) – Whether the term is penalized/can be penalized or not.

  • penalty ([penalties.PenType]) – The default penalties associated with a term.

  • pen_kwargs ([dict]) – A list of dictionaries, each with key-word arguments passed to the construction of the corresponding penalties.PenType in penalty.

class mssm.src.python.terms.TermType(*values)

Bases: Enum

LINEAR = 3
LSMOOTH = 1
RANDINT = 4
RANDSLOPE = 5
SMOOTH = 2
class mssm.src.python.terms.f(variables: list, by: str = None, by_cont: str = None, binary: list[str, str] = None, id: int = None, nk: int = 9, te: bool = False, rp: int = 0, constraint: ~mssm.src.python.constraints.ConstType = ConstType.QR, identifiable: bool = True, basis: ~collections.abc.Callable = <function B_spline_basis>, basis_kwargs: dict = {}, is_penalized: bool = True, penalize_null: bool = False, penalty: list[~mssm.src.python.penalties.PenType] = None, pen_kwargs: list[dict] = None)

Bases: GammTerm

A univariate or tensor interaction smooth term. If variables only contains a single variable \(x\), this term will represent a univariate \(f(x)\) in a model:

\[\mu_i = a + f(x_i)\]

For example, the model below in mgcv:

bam(y ~ s(x,k=10) + s(z,k=20))

would be expressed as follows in mssm:

GAMM(Formula(lhs("y"),[i(),f(["x"],nk=9),f(["z"],nk=19)]),Gaussian())

If variables contains two variables \(x\) and \(z\), then this term will either represent the tensor interaction \(f(x,z)\) in model:

\[\mu_i = a + f(x_i) + f(z_i) + f(x_i,z_i)\]

or in model:

\[\mu_i = a + f(x_i,z_i)\]

The first behavior is achieved by setting te=False. In that case it is necessary to add ‘main effect’ f terms for \(x\) and \(z\). In other words, the behavior then mimics the ti() term available in mgcv (Wood, 2017). If te=True, the term instead behaves like a te() term in mgcv, so no separate smooth effects for the main effects need to be included.

For example, the model below in mgcv:

bam(y ~ te(x,z,k=10))

would be expressed as follows in mssm:

GAMM(Formula(lhs("y"),[i(),f(["x","z"],nk=9,te=True)]),Gaussian())

In addition, the model below in mgcv:

bam(y ~ s(x,k=10) + s(z,k=20) + ti(x,z,k=10))

would be expressed as follows in mssm:

GAMM(Formula(lhs("y"),[i(),f(["x"],nk=9),f(["z"],nk=19),f(["x","z"],nk=9,te=False)]),Gaussian())

By default a B-spline basis is used with nk=9 basis functions (after removing identifiability constraints). This is equivalent to mgcv’s default behavior of using 10 basis functions (before removing identifiability constraints). In case variables contains more than one variable, nk can either be set to a single value or to a list containing the number of basis functions that should be used to set up the spline matrix for every variable. The former implies that the same number of coefficients should be used for all variables. Keyword arguments that change the computation of the spline basis can be passed along via a dictionary to the basis_kwargs argument. Importantly, if multiple variables are present and a list is passed to nk, a list of dictionaries with keyword arguments of the same length needs to be passed to basis_kwargs as well (see the example below).
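For example, a tensor smooth whose marginals differ in their number of basis functions could (hypothetically) be requested as:

f(["x","z"],nk=[9,14],te=True,basis_kwargs=[{},{}])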

Multiple penalties can be placed on every term by adding penalties.PenType to the penalty argument. In case variables contains multiple variables a separate tensor penalty (see Wood, 2017) will be created for every penalty included in penalty. Again, key-word arguments that alter the behavior of the penalty creation need to be passed as dictionaries to pen_kwargs for every penalty included in penalty. By default, a univariate term is penalized with a difference penalty of order 2 (Eilers & Marx, 2010).

References:

  • Eilers, P., & Marx, B. (2010). Splines, knots, and penalties. https://doi.org/10.1002/WICS.125

  • Marra, G., & Wood, S. N. (2011). Practical variable selection for generalized additive models. Computational Statistics & Data Analysis, 55(7), 2372–2387. https://doi.org/10.1016/j.csda.2011.02.004

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • variables (list[str]) – A list of the variables (strings) of which the term is a function. Need to exist in data passed to Formula. Need to be continuous.

  • by (str, optional) – A string corresponding to a factor in data passed to Formula. Separate f(variables) (and smoothness penalties) will be estimated per level of by.

  • by_cont (str, optional) – A string corresponding to a numerical variable in data passed to Formula. The model matrix for the estimated smooth term f(variables) will be multiplied by the column of this variable. Can be used to estimate ‘varying coefficient’ models but also to set up binary smooths or to only estimate a smooth term for specific levels of a factor (i.e., what is possible for ordered factors in R & mgcv).

  • binary ([str,str], optional) – A list containing two strings. The first string corresponds to a factor in data passed to Formula. A separate f(variables) will be estimated for the level of this factor corresponding to the second string.

  • id (int, optional) – Only useful in combination with specifying a by variable. If id is set to any integer the penalties placed on the separate f(variables) will share a single smoothness penalty.

  • nk (int or list[int], optional) – Number of basis functions to use. Even if identifiable is True, this number will reflect the final number of basis functions for this term (i.e., mssm acts like you would have asked for 10 basis functions if nk=9 and identifiable=True; the default).

  • te (bool, optional) – For tensor interaction terms only. If set to false, the term mimics the behavior of ti() in mgcv (Wood, 2017). Otherwise, the term behaves like a te() term in mgcv - i.e., the marginal basis functions are not removed from the interaction.

  • rp (int, optional) – Experimental - will currently break for tensor smooths or in case by is provided. Whether or not to re-parameterize the term - see mssm.src.python.formula.reparam() for details. Defaults to no re-parameterization.

  • constraint (mssm.src.constraints.ConstType, optional) – What kind of identifiability constraints should be absorbed by the terms (if they are to be identifiable). Either QR-based constraints (default, well-behaved), by means of column-dropping (no infill, not so well-behaved), or by means of difference re-coding (little infill, not so well behaved either).

  • identifiable (bool, optional) – Whether or not the constant should be removed from the space of functions this term can fit. Achieved by enforcing that \(\mathbf{1}^T \mathbf{X} = 0\) (\(\mathbf{X}\) here is the spline matrix computed for the observed data; see Wood, 2017 for details). Necessary in most cases to keep the model identifiable.

  • basis (Callable, optional) – The basis functions to use to construct the spline matrix. By default a B-spline basis (Eilers & Marx, 2010) implemented in mssm.src.smooths.B_spline_basis().

  • basis_kwargs (dict, optional) – A list containing one or multiple dictionaries specifying how the basis should be computed. For the B-spline basis the following arguments (with default values) are available: convolve=False, min_c=None, max_c=None, deg=3. See mssm.src.smooths.B_spline_basis() for details, but the default should work for most cases.

  • is_penalized (bool, optional) – Should the term be left unpenalized or not. There are rarely good reasons to set this to False.

  • penalize_null (bool, optional) – Should a separate Null-space penalty (Marra & Wood, 2011) be placed on the term. By default, the term here will leave a linear f(variables) un-penalized! Thus, there is no option for the penalty to achieve f(variables) = 0 even if that would be supported by the data. Adding a Null-space penalty provides the penalty with that power. This can be used for model selection instead of hypothesis testing and is the preferred way in mssm (see Marra & Wood, 2011 for details).

  • penalty (list[penalties.PenType], optional) – A list of penalty types to be placed on the term.

  • pen_kwargs (list[dict], optional) – A list containing one or multiple dictionaries specifying how the penalty should be created. For the default difference penalty (Eilers & Marx, 2010) the only keyword argument (with default value) available is: m=2. This reflects the order of the difference penalty. Note, that while a higher m permits penalizing towards smoother functions it also leads to an increased dimensionality of the penalty Kernel (the set of basis functions which will not be penalized). In other words, increasingly more complex functions will be left un-penalized for higher m (except if penalize_null is set to True). m=2 is usually a good choice and thus the default but see Eilers & Marx (2010) for details.

class mssm.src.python.terms.fs(variables: list, rf: str = None, nk: int = 9, m: int = 1, rp: int = 1, by_subgroup: [<class 'str'>, <class 'str'>] = None, approx_deriv: dict = None, basis: ~collections.abc.Callable = <function B_spline_basis>, basis_kwargs: dict = {})

Bases: f

Essentially a f term with by=rf, id != None, penalize_null=True, pen_kwargs=[{"m":1}], and rp=1.

This term approximates the “factor-smooth interaction” basis “fs” with m=1 available in mgcv (Wood, 2017). For example, the term below from mgcv:

s(x,sub,bs="fs")

would approximately correspond to the following term in mssm:

fs(["x"],rf="sub")

They are however not equivalent (mgcv by default uses a different basis for which the m key-word has a different functionality).

Specifically, here m=1 implies that the only function left unpenalized by the default (difference) penalty is the constant (Eilers & Marx, 2010). Thus, a linear basis is penalized by the same default penalty that also penalizes smoothness (and not by a separate penalty as is the case in mgcv when m=1 for the default basis)! Any constant basis is penalized by the null-space penalty (in both mgcv and mssm; see Marra & Wood, 2011) - the term thus shrinks towards zero (Wood, 2017).

The factor smooth basis in mgcv allows to let the penalty be different for different levels of an additional factor (by additionally specifying the by argument for a smooth with basis “fs”). I.e.,

s(Time,Subject,by='condition',bs='fs')

in mgcv would estimate a non-linear random smooth of “time” per level of the “subject” & “condition” interaction - with the same penalty being placed on all random smooth terms within the same “condition” level.

This can be achieved in mssm by adding multiple fs terms to the Formula and utilising the by_subgroup argument. This needs to be set to a list where the first element identifies the additional factor variable (e.g., “condition”) and the second element corresponds to a level of said factor variable. E.g., to approximate the aforementioned mgcv term we have to add:

*[fs(["Time"],rf="subject_cond",by_subgroup=["cond",cl]) for cl in np.unique(dat["cond"])]

to the Formula terms list. Importantly, “subject_cond” is the interaction of “subject” and “condition” - not just the “subject” variable in the data.

Model estimation can become quite expensive for fs terms when the factor variable for rf has many levels (> 10000). In that case, approximate derivative evaluation can speed things up considerably. To enforce this, the approx_deriv argument needs to be specified with a dict, having the following structure: {"no_disc":[str],"excl":[str],"split_by":[str],"restarts":int,"seed":None or int}. “no_disc” should usually be set to an empty list, and should in general only contain names of continuous variables included in the formula. Any variable mentioned here will not be discretized before clustering - this will make the approximation a bit more accurate but also require more time. Similarly, “excl” specifies any continuous variables that should be excluded from clustering. “split_by” should generally be set to a list containing all categorical variables present in the formula. “restarts” indicates the number of times to re-run the clustering (40 seems to be a good number). “seed” can either be set to None or to an integer - in the latter case, the random cluster initialization will use that seed, ensuring that the clustering outcome (and hence the model fit) is replicable. An example follows below.
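For example (hypothetical values, following the structure described above):

fs(["Time"],rf="sub",approx_deriv={"no_disc":[],"excl":[],"split_by":["cond"],"restarts":40,"seed":20})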

References:

  • Eilers, P., & Marx, B. (2010). Splines, knots, and penalties. https://doi.org/10.1002/WICS.125

  • Marra, G., & Wood, S. N. (2011). Practical variable selection for generalized additive models.Computational Statistics & Data Analysis, 55(7), 2372–2387. https://doi.org/10.1016/j.csda.2011.02.004

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.). Chapman and Hall/CRC.

Parameters:
  • variables (list[str]) – A list of the variables (strings) of which the term is a function. Need to exist in data passed to Formula. Need to be continuous.

  • rf (str, optional) – A string corresponding to a (random) factor in data passed to Formula. Separate f(variables) (but a shared smoothness penalty!) will be estimated per level of rf.

  • nk (int or list[int], optional) – Number of basis functions - 1 to use. I.e., if nk=9 (the default), the term will use 10 basis functions. By default f() has identifiability constraints applied and we act as if nk + 1 coefficients were requested. The fs() term needs no identifiability constraints, so if the same number of coefficients used for a f() term is requested (the desired approach), one coefficient is added to compensate for the lack of identifiability constraints. This is the opposite of how this is handled in mgcv: specifying nk=10 for “fixed” univariate smooths results in 9 basis functions being available. However, for a smooth in mgcv with basis=’fs’, 10 basis functions will remain available.

  • basis (Callable, optional) – The basis functions to use to construct the spline matrix. By default a B-spline basis (Eilers & Marx, 2010) implemented in mssm.src.smooths.B_spline_basis().

  • basis_kwargs (dict, optional) – A list containing one or multiple dictionaries specifying how the basis should be computed. For the B-spline basis the following arguments (with default values) are available: convolve=False, min_c=None, max_c=None, deg=3. See mssm.src.smooths.B_spline_basis() for details.

  • by_subgroup ([str,str], optional) – List including a factor variable and specific level of said variable. Allows for separate penalties as described above.

  • approx_deriv (dict, optional) – Dict holding important info for the clustering algorithm. Structure: {"no_disc":[str],"excl":[str],"split_by":[str],"restarts":int}

class mssm.src.python.terms.i

Bases: GammTerm

An intercept/offset term. In a model

\[\mu_i = a + f(x_i)\]

it reflects \(a\).

class mssm.src.python.terms.irf(variables: [<class 'str'>], event_onset: [<class 'int'>], basis_kwargs: list[dict], by: str = None, id: int = None, nk: int = 10, basis: ~collections.abc.Callable = <function B_spline_basis>, is_penalized: bool = True, penalty: list[~mssm.src.python.penalties.PenType] = None, pen_kwargs: list[dict] = None)

Bases: GammTerm

A simple impulse response term, designed to correct for events with overlapping responses in multi-level time-series modeling.

The idea (see Ehinger & Dimigen; 2019 for a detailed introduction to this kind of deconvolution analysis) is that some kind of event happens during each recorded time-series (e.g., stimulus onset, distractor display, mask onset, etc.) which is assumed to affect the recorded signal in the next X ms in some way. The moment of event onset can differ between recorded time-series. In other words, the event is believed to act like an impulse which triggers a delayed response on the signal. This term class can be used to estimate the shape of this impulse response. Multiple irf terms can be included in a Formula if multiple events happen, potentially with overlapping responses.

Example:

# Simulate time-series based on two events that elicit responses which vary in their overlap.
# The summed responses + a random intercept + noise is then the signal.
overlap_dat,onsets1,onsets2 = sim7(100,1,2,seed=20)

# Model below tries to recover the shape of the two responses in the 200 ms after event onset (max_c=200) + the random intercepts:
overlap_formula = Formula(lhs("y"),[irf(["time"],onsets1,nk=15,basis_kwargs=[{"max_c":200,"min_c":0,"convolve":True}]),
                                    irf(["time"],onsets2,nk=15,basis_kwargs=[{"max_c":200,"min_c":0,"convolve":True}]),
                                    ri("factor")],
                                    data=overlap_dat,
                                    series_id="series") # For models with irf terms, the column in the data identifying unique series need to be specified.

model = GAMM(overlap_formula,Gaussian())
model.fit()

Note that care needs to be taken when predicting for models including irf terms, because the onset of events can differ between time-series. Hence, model predictions + standard errors should first be obtained for the entire data-set used to train the model; series-specific predictions can then be extracted from the model-matrix as follows:

# Get model matrix for entire data-set but only based on the estimated shape for first irf term:
_,pred_mat,ci_b = model.predict([0],overlap_dat,ci=True)

# Now extract the prediction + approximate ci boundaries for a single series:
s = 8
s_pred = pred_mat[overlap_dat["series"] == s,:]@model.coef
s_ci = ci_b[overlap_dat["series"] == s]

# Now the estimated response following the onset of the first event can be visualized + an approximate CI:
from matplotlib import pyplot as plt
plt.plot(overlap_dat["time"][overlap_dat["series"] == s],s_pred,color='blue')
plt.plot(overlap_dat["time"][overlap_dat["series"] == s],s_pred+s_ci,color='blue',linestyle='dashed')
plt.plot(overlap_dat["time"][overlap_dat["series"] == s],s_pred-s_ci,color='blue',linestyle='dashed')

References:

  • Ehinger, B. V., & Dimigen, O. (2019). Unfold: an integrated toolbox for overlap correction, non-linear modeling, and regression-based EEG analysis. https://doi.org/10.7717/peerj.7838

Parameters:
  • variables (list[str]) – A list of the variables (strings) of which the term is a function. Need to exist in data passed to Formula. Need to be continuous.

  • event_onset ([int]) – A np.array containing, for each individual time-series, the index corresponding to the sample/time-point at which the event eliciting the response to be estimated by this term happened.

  • basis_kwargs (list[dict]) – A list containing one or multiple dictionaries specifying how the basis should be computed. For irf terms, the convolve argument has to be set to True! Also, min_c and max_c must be specified. min_c corresponds to the assumed min. delay of the response after event onset and can usually be set to 0. max_c corresponds to the assumed max. delay of the response (in ms) after which the response is believed to have returned to a zero baseline.

  • by (str, optional) – A string corresponding to a factor in data passed to Formula. Separate irf(variables) (and smoothness penalties) will be estimated per level of by.

  • id (int, optional) – Only useful in combination with specifying a by variable. If id is set to any integer, the penalties placed on the separate irf(variables) will share a single smoothness penalty.

  • nk (int, optional) – Number of basis functions to use. I.e., if nk=10 (the default), the term will use 10 basis functions (Note that these terms are not made identifiable by absorbing any kind of constraint).

  • basis (Callable, optional) – The basis functions to use to construct the spline matrix. By default a B-spline basis (Eilers & Marx, 2010) implemented in mssm.src.smooths.B_spline_basis().

  • is_penalized (bool, optional) – Whether the term should be penalized or not. There are rarely good reasons to set this to False.

  • penalty (list[penalties.PenType], optional) – A list of penalty types to be placed on the term.

  • pen_kwargs (list[dict], optional) – A list containing one or multiple dictionaries specifying how the penalty should be created. For the default difference penalty (Eilers & Marx, 2010) the only keyword argument (with default value) available is m=2. This reflects the order of the difference penalty. Note that while a higher m permits penalizing towards smoother functions, it also leads to an increased dimensionality of the penalty kernel (the set of functions of variables which will not be penalized). In other words, increasingly more complex functions will be left un-penalized for higher m (except if penalize_null is set to True). m=2 is usually a good choice and thus the default, but see Eilers & Marx (2010) for details. A sketch of this penalty is shown below.
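
The sketch below is an illustration only (not mssm's internal implementation) of what the default order-m difference penalty of Eilers & Marx (2010) looks like for nk coefficients:

import numpy as np

def diff_penalty(nk, m=2):
    # D computes m-th order differences of the nk coefficients.
    D = np.diff(np.eye(nk), n=m, axis=0)
    # S = D.T @ D penalizes squared m-th order differences of adjacent
    # coefficients; its kernel (the un-penalized space) has dimension m.
    return D.T @ D

S = diff_penalty(10, m=2)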

class mssm.src.python.terms.l(variables: list)

Bases: GammTerm

Adds a parametric (linear) term to the model formula. The model \(\mu_i = a + b*x_i\) can for example be achieved by adding [i(), l(['x'])] to the terms argument of a Formula. The coefficient \(b\) estimated for the term will then correspond to the slope of \(x\). This class can also be used to add predictors for categorical variables. If the formula includes an intercept, binary coding will be utilized to add reference-level adjustment coefficients for the remaining k-1 levels of any additional factor variable.

If more than one variable is included in variables the model will only add the len(variables)-interaction to the model! Lower-order interactions and main effects will not be included by default (see the li() function instead, which automatically includes all lower-order interactions and main effects).

Example: The interaction effect of factor variable “cond”, with two levels “1” and “2”, and a continuous variable “x” on the dependent variable “y” is of interest. To estimate such a model, the following formula can be used:

formula = Formula(lhs("y"),terms=[i(),l(["cond"]),l(["x"]),l(["cond","x"])])

This formula will estimate the following model:

\[\mu_i = a + b_1*c_i + b_2*x_i + b_3*c_i*x_i\]

Here, \(c\) is a binary predictor variable created so that it is 1 if “cond”=2 else 0 and \(b_3\) is the coefficient that is added because l(["cond","x"]) is included in the terms (i.e., the interaction effect).

To get a model with only main effects for “cond” and “x”, the following formula could be used:

formula = Formula(lhs("y"),terms=[i(),l(["cond"]),l(["x"])])

This formula will estimate:

\[\mu_i = a + b_1*c_i + b_2*x_i\]
Parameters:

variables ([str]) – A list of the variables (strings) for which linear predictors should be included

mssm.src.python.terms.li(variables: list[str])

Behaves like the l class but automatically includes all lower-order interactions and main effects.

Example: The interaction effect of factor variable “cond”, with two levels “1” and “2”, and a continuous variable “x” on the dependent variable “y” is of interest. To estimate such a model, the following formula can be used:

formula = Formula(lhs("y"),terms=[i(),*li(["cond","x"])])

Note, the use of the * operator to unpack the individual terms returned from li!

This formula will still (see l) estimate the following model:

\[\mu = a + b_1*c_i + b_2*x_i + b_3*c_i*x_i\]

with: \(c\) corresponding to a binary predictor variable created so that it is 1 if “cond”=2 else 0.

To get a model with only main effects for “cond” and “x”, li cannot be used; l needs to be used instead:

formula = Formula(lhs("y"),terms=[i(),l(["cond"]),l(["x"])])

This formula will estimate:

\[\mu_i = a + b_1*c_i + b_2*x_i\]
Parameters:

variables (list[str]) – A list of the variables (strings) for which linear predictors should be included

class mssm.src.python.terms.ri(variable: str)

Bases: GammTerm

Adds a random intercept for the factor variable to the model. The random intercepts \(b_i\) are assumed to be i.i.d \(b_i \sim N(0,\sigma_b)\) i.e., normally distributed around zero - the simplest random effect supported by mssm.

Thus, this term achieves exactly what is achieved in mgcv by adding the term:

s(variable,bs="re")

The variable needs to identify a factor-variable in the data (i.e., the .dtype of the variable has to be equal to ‘O’). If you want to add more complex random effects to the model (e.g., random slopes for continuous variable “x” per level of factor variable) use the rs class.

Parameters:

variable (str) – The name (string) of a factor variable. For every level of this factor a random intercept will be estimated. The random intercepts are assumed to follow a normal distribution centered around zero.

class mssm.src.python.terms.rs(variables: list[str], rf: str)

Bases: GammTerm

Adds random slopes for the effects of variables for each level of the random factor rf. The type of random slope created depends on the content of variables.

If len(variables)==1, and the string in variables identifies a categorical variable in the data, then a random offset adjustment (for every level of the categorical variable, so without binary coding!) will be estimated for every level of the random factor rf.

Example: The factor variable “cond”, with two levels “1” and “2” is assumed to have a general effect on the DV “y”. However, data was collected from multiple subjects (random factor rf = “subject”) and it is reasonable to assume that the effect of “cond” is slightly different for every subject (it is also assumed that all subjects took part in both conditions identified by “cond”). A model that accounts for this is estimated via:

formula = Formula(lhs("y"),terms=[i(),l(["cond"]),rs(["cond"],rf="subject")])

This formula will estimate the following model:

\[\mu = a + b_1*c_i + a_{j(i),cc(i)}\]

Here, \(c\) is again a binary predictor variable created so that it is 1 if “cond”=2 for observation i else 0, \(cc(i)\) indexes the level of “cond” at observation \(i\), \(j(i)\) indexes the level of “subject” at observation \(i\), and \(a_{j(i),cc(i)}\) identifies the random offset estimated for subject \(j\) at the level of “cond” indicated by \(cc(i)\). The \(a_{j,cc(i)}\) are assumed to be i.i.d \(\sim N(0,\sigma_a)\). Note that the fixed effect structure uses binary coding but the random effect structure does not!

Hence, rs(["cond"],rf="subject") in mssm corresponds to adding the term below to a mgcv model:

s(cond,subject,bs="re")

If all the strings in variables identify continuous variables in the data, then a random slope for the len(variables)-way interaction (will simplify to a slope for a single continuous variable if len(variables) == 1) will be estimated for every level of the random factor rf.

Example: The continuous variable “x” is assumed to have a general effect on the DV “y”. However, data was collected from multiple subjects (random factor rf =”subject”) and it is reasonable to assume that the effect of “x” is slightly different for every subject. A model that accounts for this is estimated via:

formula = Formula(lhs("y"),terms=[i(),l(["x"]),rs(["x"],rf="subject")])

This formula will estimate the following model:

\[\mu = a + b*x_i + b_{j(i)} * x_i\]

Where \(j(i)\) again indexes the level of “subject” at observation \(i\), \(b_{j(i)}\) identifies the random slope (the subject-specific slope adjustment for \(b\)) for variable “x” estimated for subject \(j\), and the \(b_{j(i)}\) are again assumed to be i.i.d from a single \(N(0,\sigma_b)\) distribution.

Note, lower-order interaction slopes (as well as main effects) are not pulled in by default! Consider the following formula:

formula = Formula(lhs("y"),terms=[i(),*li(["x","z"]),rs(["x","z"],rf="subject")])

with another continuous variable “z”. This corresponds to the model:

\[\mu = a + b_1*x_i + b_2*z_i + b_3*x_i*z_i + b_{j(i)}*x_i*z_i\]

With \(j(i)\) again indexing the level of “subject” at observation i, \(b_{j(i)}\) identifying the random slope (the subject-specific slope adjustment for \(b_3\)) for the interaction of variables \(x\) and \(z\) estimated for subject \(j\). The \(b_{j(i)}\) are again assumed to be i.i.d from a single \(\sim N(0,\sigma_b)\).

To add random slopes for the main effects of either \(x\) or \(z\), as well as an additional random intercept, additional rs terms and a ri term would have to be added to the formula:

formula = Formula(lhs("y"),terms=[i(),*li(["x","z"]),
                                 ri("subject"),
                                 rs(["x"],rf="subject"),
                                 rs(["z"],rf="subject"),
                                 rs(["x","z"],rf="subject")])

If len(variables) > 1 and at least one string in variables identifies a categorical variable in the data then random slopes for the len(variables)-way interaction will be estimated for every level of the random factor rf. Separate distribution parameters (the \(\sigma\) of the Normal) will be estimated for every level of the resulting interaction.

Example: The continuous variable “x” and the factor variable “cond”, with two levels “1” and “2” are assumed to have a general interaction effect on the DV “y”. However, data was collected from multiple subjects (random factor rf =”subject”) and it is reasonable to assume that their interaction effect is slightly different for every subject. A model that accounts for this is estimated via:

formula = Formula(lhs("y"),terms=[i(),*li(["x","cond"]),rs(["x","cond"],rf="subject")])

This formula will estimate the following model:

\[\mu = a + b_1*c_i + b_2*x_i + b_3*x_i*c_i + b_{j(i),cc(i)}*x_i\]

With, \(c\) corresponding to a binary predictor variable created so that it is 1 if “cond”=2 for observation \(i\) else 0, \(cc(i)\) corresponds to the level of “cond” at observation \(i\), \(j(i)\) corresponds to the level of “subject” at observation \(i\), and \(b_{j(i),cc(i)}\) identifies the random slope for variable \(x\) at “cond” = \(cc(i)\) estimated for subject \(j\). That is: the \(b_{j,cc(i)}\) where \(cc(i)=1\) are assumed to be i.i.d realizations from normal distribution \(N(0,\sigma_{b_1})\) and the \(b_{j,cc(i)}\) where \(cc(i)=2\) are assumed to be i.i.d realizations from a separate normal distribution \(N(0,\sigma_{b_2})\).

Hence, adding rs(["x","cond"],rf="subject") to a mssm model, is equivalent to adding the term below to a mgcv model:

s(x,subject,by=cond,bs="re")

Correlations between random effects cannot be taken into account by means of parameters (this is possible for example in lme4).

Parameters:
  • variables ([str]) – A list of variables. Can point to continuous and categorical variables.

  • rf (str) – A factor variable. Identifies the random factor in the data.

mssm.src.python.utils module

mssm.src.python.utils.REML(llk, nH, coef, scale, penalties)

Based on Wood (2011). Exact REML for Gaussian GAMs, Laplace-approximate REML (Wood et al., 2016) for everything else. Evaluated after applying the stabilizing reparameterization discussed by Wood (2011).

References:
  • Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.

  • Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

Parameters:
  • llk (float) – log-likelihood of model

  • nH (scipy.sparse.csc_array) – negative hessian of log-likelihood of model

  • coef (numpy.array) – Estimated vector of coefficients of shape (-1,1)

  • scale (float) – (Estimated) scale parameter - can be set to 1 for GAMLSS or GSMMs.

  • penalties ([LambdaTerm]) – List of penalties that were part of the model.

Returns:

(Approximate) REML score

Return type:

float
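
Roughly (ignoring the stabilizing reparameterization and other implementation details), the Laplace-approximate criterion evaluated here takes the form described by Wood (2011) and Wood et al. (2016):

\[\mathcal{V} = llk - \frac{\hat{\boldsymbol{\beta}}^T \mathbf{S}_{\lambda} \hat{\boldsymbol{\beta}}}{2\phi} + \frac{1}{2}\log|\mathbf{S}_{\lambda}/\phi|_+ - \frac{1}{2}\log|\mathbf{H} + \mathbf{S}_{\lambda}/\phi| + \frac{M_p}{2}\log(2\pi)\]

where \(\mathbf{H}\) corresponds to nH, \(\mathbf{S}_{\lambda}\) is the weighted penalty matrix implied by penalties, \(\phi\) corresponds to scale, \(|\cdot|_+\) denotes the generalized determinant, and \(M_p\) is the dimension of the penalty null-space.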

mssm.src.python.utils.adjust_CI(model, n_ps, b, predi_mat, use_terms, alpha, seed)

Internal function to adjust the point-wise CI to behave like a whole-function interval (based on Wood, 2017, section 6.10.2, and Simpson, 2016):

model.coef +- b gives the point-wise interval; for the interval to cover the whole function, 1-alpha % of posterior samples should be expected to fall completely within these boundaries.

From section 6.10 in Wood (2017) we have that \(\boldsymbol{\beta} | \mathbf{y}, \boldsymbol{\lambda} \sim N(\hat{\boldsymbol{\beta}},\mathbf{V})\). \(\mathbf{V}\) is the covariance matrix of this conditional posterior, and can be obtained by evaluating model.lvi.T @ model.lvi * model.scale (model.scale should be set to 1 for mssm.models.GAMMLSS and mssm.models.GSMM).

The implication of this result is that we can also expect the deviations \(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}\) to follow \(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}} | \mathbf{y}, \boldsymbol{\lambda} \sim N(0,\mathbf{V})\). In line with the whole-function interval definition above, 1-alpha % of predi_mat@[*coef - coef] (where [*coef - coef] represents the deviations \(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}\)) should fall within [-b, b]. Wood (2017) suggests finding a multiplier \(a\) so that [-a*b, a*b] achieves this.

To do this, we compute \(a\) for every sampled deviation vector predi_mat@[*coef - coef] and then select the final \(a\) so that 1-alpha % of samples have an equal or lower value. The consequence: 1-alpha % of samples drawn should fall completely within the modified boundaries. A sketch of this selection step is given below.
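
A minimal sketch of this selection step (illustration only; pred_devs is a hypothetical array of shape (n_pred, n_ps) holding predi_mat@[*coef - coef] for each posterior sample):

import numpy as np

# Per sample: the smallest multiplier a so that the sampled deviation curve
# falls completely within [-a*b, a*b] (b is assumed to be a np.array):
a_per_sample = np.max(np.abs(pred_devs) / np.asarray(b).reshape(-1,1), axis=0)

# Select a so that 1-alpha % of samples have an equal or lower value:
a = np.quantile(a_per_sample, 1 - alpha)

# Whole-function boundaries are then given by a*b:
b_adjusted = a * np.asarray(b)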

References:
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

  • Simpson, G. (2016). Simultaneous intervals for smooths revisited.

Parameters:
  • model (GAMM or GAMLSS or GSMM) – GAMM,GAMLSS, or GSMM model (which has been fitted) for which to estimate \(\mathbf{V}\)

  • n_ps (int) – Number of samples to obtain from posterior.

  • b ([float]) – Boundary of the point-wise CI.

  • predi_mat (scipy.sparse.csc_array) – Model matrix for a particular smooth term or additive combination of parameters evaluated usually at a representative sample of predictor variables.

  • use_terms (list[int] or None) – The indices corresponding to the terms that should be used to obtain the prediction or None in which case all terms will be used.

  • alpha (float) – The alpha level to use for the whole-function interval adjustment calculation as outlined above.

  • seed (int or None) – Can be used to provide a seed for the posterior sampling.

mssm.src.python.utils.central_hessian(coef, llkfun, *llkgargs, **llkkwargs)

Generic function to approximate the hessian of llkfun at coef.

llkfun is called as llkfun(coef,*llkgargs,**llkkwargs) - so all additional arguments have to be passed via the latter two.

Uses central finite differences with a fixed value for epsilon, based on eq. 8 and the following paragraphs in Ridout (2009) (an approach also used in numdifftools).

References:
  • Ridout, M. S. (2009). Statistical Applications of the Complex-Step Method of Numerical Differentiation.

Parameters:
  • coef (numpy.array) – Current estimate of coefficients of which llkfun is some function.

  • llkfun (Callable) – log-likelihood function to optimize.

Returns:

An approximation of the Hessian as a numpy array.

Return type:

numpy.array
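
A dense sketch of such an approximation (illustration only - mssm's actual implementation selects epsilon following Ridout, 2009, and passes additional arguments on to llkfun):

import numpy as np

def central_hessian_sketch(coef, llkfun, eps=1e-4):
    # Approximates H[i,j] = d^2 llk / (d coef_i d coef_j) via central differences.
    p = len(coef)
    H = np.zeros((p,p))
    for i in range(p):
        ei = np.zeros(p); ei[i] = eps
        for j in range(i,p):
            ej = np.zeros(p); ej[j] = eps
            H[i,j] = (llkfun(coef + ei + ej) - llkfun(coef + ei - ej)
                      - llkfun(coef - ei + ej) + llkfun(coef - ei - ej)) / (4*eps**2)
            H[j,i] = H[i,j] # The Hessian is symmetric.
    return H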

mssm.src.python.utils.compute_REML_candidate_GSMM(family, y, Xs, penalties, coef, n_coef, coef_split_idx, method='Chol', conv_tol=1e-07, n_c=10, bfgs_options={}, origNH=None)

Allows evaluating the REML criterion (e.g., Wood, 2011; Wood, 2016) efficiently for a set of \(\lambda\) values for a GSMM or GAMMLSS.

Internal function used for computing the correction applied to the edf for the GLRT - based on Wood (2017) and Wood et al., (2016).

See REML() function for more details.

mssm.src.python.utils.compute_Vb_corr_WPS(Vbr, Vpr, Vr, H, S_emb, penalties, coef, scale=1)

Computes both correction terms for Vb or \(\mathbf{V}_{\boldsymbol{\beta}}\), which is the co-variance matrix for the conditional posterior of \(\boldsymbol{\beta}\) so that \(\boldsymbol{\beta} | y, \boldsymbol{\lambda} \sim N(\hat{\boldsymbol{\beta}},\mathbf{V}_{\boldsymbol{\beta}})\), described by Wood, Pya, & Säfken (2016).

References:
  • Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.

  • Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • Vbr (scipy.sparse.csc_array) – Transpose of the root of the estimate of the (unscaled) covariance matrix of \(\boldsymbol{\beta} | y, \boldsymbol{\lambda}\) - the coefficients estimated by the model.

  • Vpr (numpy.array) – A (regularized) estimate of the covariance matrix of \(\boldsymbol{\rho}\) - the log smoothing penalties.

  • Vr (numpy.array) – Transpose of the root of the un-regularized covariance matrix of \(\boldsymbol{\rho}\) - the log smoothing penalties.

  • H (scipy.sparse.csc_array) – The Hessian of the log-likelihood

  • S_emb (scipy.sparse.csc_array) – The weighted penalty matrix.

  • penalties ([LambdaTerm]) – A list holding the LambdaTerm objects estimated for the model.

  • coef (numpy.array) – An array holding the estimated regression coefficients. Has to be of shape (-1,1)

  • scale (float) – Any scale parameter estimated as part of the model. Can be omitted for more generic models beyond GAMMs. Defaults to 1.

Raises:

ArithmeticError – Will throw an error when the negative Hessian of the penalized likelihood is ill-scaled so that a Cholesky decomposition fails.

Returns:

A tuple containing: Vc and Vcc. Vbr.T@Vbr*scale + Vc + Vcc is then approximately the correction devised by WPS (2016).

Return type:

(numpy.array, numpy.array)

mssm.src.python.utils.compute_Vp_WPS(Vbr, H, S_emb, penalties, coef, scale=1)

Computes the inverse of what is approximately the negative Hessian of the Laplace approximate REML criterion with respect to the log smoothing penalties.

The derivatives computed are only exact for Gaussian additive models and canonical generalized additive models. For all other models they are inexact in that they assume that the Hessian of the log-likelihood does not depend on \(\lambda\) (or \(log(\lambda)\)), so they are essentially the PQL derivatives of Wood et al. (2017). The inverse computed here acts as an approximation to the covariance matrix of the log smoothing parameters.

References:
  • Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.

  • Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

  • Wood, S. N., Li, Z., Shaddick, G., & Augustin, N. H. (2017). Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data.

Parameters:
  • Vbr (scipy.sparse.csc_array) – Transpose of the root of the estimate of the (unscaled) covariance matrix of \(\boldsymbol{\beta} | y, \boldsymbol{\lambda}\) - the coefficients estimated by the model.

  • H (scipy.sparse.csc_array) – The Hessian of the log-likelihood

  • S_emb (scipy.sparse.csc_array) – The weighted penalty matrix.

  • penalties ([LambdaTerm]) – A list holding the LambdaTerm objects estimated for the model.

  • coef (numpy.array) – An array holding the estimated regression coefficients. Has to be of shape (-1,1)

  • scale (float) – Any scale parameter estimated as part of the model. Can be omitted for more generic models beyond GAMMs. Defaults to 1.

mssm.src.python.utils.compute_reml_candidate_GAMM(family, y, X, penalties, n_c=10, offset=0, init_eta=None, method='Chol', compute_inv=False, origNH=None)

Allows evaluating the REML criterion (e.g., Wood, 2011; Wood, 2016) efficiently for a set of \(\lambda\) values.

Internal function used for computing the correction applied to the edf for the GLRT - based on Wood (2017) and Wood et al., (2016).

See REML() function for more details.

mssm.src.python.utils.correct_VB(model, nR=250, grid_type='JJJ3', a=1e-07, b=10000000.0, df=40, n_c=10, form_t=True, form_t1=False, verbose=False, drop_NA=True, method='Chol', only_expected_edf=False, Vp_fidiff=True, use_reml_weights=True, seed=None, **bfgs_options)

Estimate \(\mathbf{V}\), the covariance matrix of the unconditional posterior \(\boldsymbol{\beta} | y \sim N(\hat{\boldsymbol{\beta}},\mathbf{V})\) to account for smoothness uncertainty.

Wood et al. (2016) and Wood (2017) show that when basing conditional versions of model selection criteria or hypothesis tests on \(\mathbf{V}_{\boldsymbol{\beta}}\), which is the co-variance matrix for the conditional posterior of \(\boldsymbol{\beta}\) so that \(\boldsymbol{\beta} | y, \boldsymbol{\lambda} \sim N(\hat{\boldsymbol{\beta}},\mathbf{V}_{\boldsymbol{\beta}})\), the tests are severely biased. To correct for this they show that uncertainty in \(\boldsymbol{\lambda}\) needs to be accounted for. Hence they suggest to base these tests on \(\mathbf{V}\), the covariance matrix of the unconditional posterior \(\boldsymbol{\beta} | y \sim N(\hat{\boldsymbol{\beta}},\mathbf{V})\). They show how to obtain an estimate of \(\mathbf{V}\), but this requires \(\mathbf{V}_{\boldsymbol{p}}\) - an estimate of the covariance matrix of \(\boldsymbol{p}=log(\boldsymbol{\lambda})\). \(\mathbf{V}_{\boldsymbol{p}}\) requires derivatives that are not available when using the efs update.

Greven & Scheipl (2016) in their comment to the paper by Wood et al. (2016) show another option to estimate \(\mathbf{V}\) that does not require \(\mathbf{V}_{\boldsymbol{p}}\), based either on the total variance property or on numeric integration. The latter is implemented below, based on the equations for the expectations outlined in their response.

This function implements multiple strategies to correct for smoothing parameter uncertainty, based on the Wood et al. (2016) and Greven & Scheipl (2016) proposals. The most straightforward strategy (grid_type = 'JJJ1') is to obtain a finite difference approximation for \(\mathbf{V}_{\boldsymbol{p}}\) and to then compute, approximately, the Wood et al. (2016) correction (done exactly here only for Gaussian additive or canonical Generalized models). This is too costly for large sparse multi-level models. In those cases, the Greven & Scheipl (2016) proposal is attractive. When grid_type = 'JJJ2', the approximation to \(\mathbf{V}_{\boldsymbol{p}}\) is used to adaptively sample the grid on which the expectations required in their proposal are numerically evaluated. From this grid, \(\mathbf{V}\) could then be evaluated as shown by Greven & Scheipl (2016). However, by default \(\mathbf{V}\) is here obtained by first computing it according to the Wood et al. (2016) correction and then taking a weighted sum of the former (weighted by V_shrinkage_weight) and the \(\mathbf{V}\) (weighted by 1-V_shrinkage_weight) obtained as suggested by Greven & Scheipl (2016). Setting V_shrinkage_weight to 0 (and only_expected_edf=False) thus recovers exactly the Greven & Scheipl (2016) correction, albeit based on an adaptive grid. This is not efficient for large sparse models either. However, when setting only_expected_edf=True, \(\mathbf{V}\) is never actually computed and the uncertainty corrected model edf are instead approximated directly. This remains efficient for large sparse multi-level models.

References:
  • Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.

  • Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

  • Greven, S., & Scheipl, F. (2016). Comment on: Smoothing Parameter and Model Selection for General Smooth Models

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • model (GAMM or GAMMLSS or GSMM) – GAMM, GAMMLSS, or GSMM model (which has been fitted) for which to estimate \(\mathbf{V}\)

  • nR (int, optional) – In case grid_type != 'JJJ1', nR samples/reml scores are generated/computed to numerically evaluate the expectations necessary for the uncertainty correction, defaults to 250

  • grid_type (str, optional) – How to compute the smoothness uncertainty correction - see above for details, defaults to ‘JJJ3’

  • a (float, optional) – Minimum \(\lambda\) value that is included when forming the initial grid. In addition, any of the \(\lambda\) estimates obtained from model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are smaller than this are set to this value as well, defaults to 1e-7, the minimum possible estimate

  • b (float, optional) – Maximum \(\lambda\) value that is included when forming the initial grid. In addition, any of the \(\lambda\) estimates obtained from model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are larger than this are set to this value as well, defaults to 1e7, the maximum possible estimate

  • df (int, optional) – Degrees of freedom used for the multivariate t distribution used to sample the next set of candidates. Setting this to np.inf means a multivariate normal is used for sampling, defaults to 40

  • n_c (int, optional) – Number of cores to use to compute the correction, defaults to 10

  • form_t (bool, optional) – Whether or not the smoothness uncertainty corrected edf should be computed, defaults to True

  • form_t1 (bool, optional) – Whether or not the smoothness uncertainty + smoothness bias corrected edf should be computed, defaults to False

  • verbose (bool, optional) – Whether to print progress information or not, defaults to False

  • drop_NA (bool,optional) – Whether to drop rows in the model matrices corresponding to NAs in the dependent variable vector. Defaults to True.

  • method (str,optional) – Which method to use to solve for the coefficients (and smoothing parameters). The default (“Chol”) relies on Cholesky decomposition. This is extremely efficient but in principle less stable, numerically speaking. For a maximum of numerical stability set this to “QR/Chol”. In that case a QR decomposition is used - which is first pivoted to maximize sparsity in the resulting decomposition but also pivots for stability in order to get an estimate of rank deficiency. A Cholesky decomposition is then used, based on the combined pivoting strategy obtained from the QR. This takes substantially longer. If this is set to 'qEFS', then the coefficients are estimated via quasi-Newton and the smoothing penalties are estimated from the quasi-Newton approximation to the Hessian. This only requires first derivative information. Defaults to “Chol”.

  • only_expected_edf (bool,optional) – Whether to compute the edf from the trace of the covariance matrix (only_expected_edf=False) or based on numeric integration weights. The latter is much more efficient for sparse models. Only makes sense when grid_type!='JJJ1'. Defaults to False

  • Vp_fidiff (bool,optional) – Whether to rely on a finite difference approximation to compute \(\mathbf{V}_{\boldsymbol{p}}\) for grid_type JJJ1 and JJJ2 or on a PQL approximation. The latter is exact for Gaussian and canonical GAMs and far cheaper if many penalties are to be estimated. Defaults to True (Finite difference approximation)

  • use_reml_weights (bool,optional) – Whether to rely on REML scores to compute the numerical integration when grid_type != 'JJJ1' or on the log-densities of \(\mathbf{V}_{\boldsymbol{p}}\) - the latter assumes that the unconditional posterior is normal. Defaults to True (REML scores are used)

  • seed (int,optional) – Seed to use for random parts of the correction. Defaults to None

  • bfgs_options (key=value,optional) – Any additional keyword arguments that should be passed on to the call of scipy.optimize.minimize(). If none are provided, the gtol argument will be initialized to 1e-3. Note also, that in any case the maxiter argument is automatically set to 100. Defaults to None.

Returns:

A tuple containing: V - an estimate of the unconditional covariance matrix, LV - the Cholesky of the former, Vp - an estimate of the covariance matrix for \(\boldsymbol{\rho}\), Vpr - a root of the former, edf - smoothness uncertainty corrected coefficient-wise edf, total_edf - smoothness uncertainty corrected edf, edf2 - smoothness uncertainty + smoothness bias corrected coefficient-wise edf, total_edf2 - smoothness uncertainty + smoothness bias corrected edf, upper_edf - a heuristic upper bound on the uncertainty corrected edf

Return type:

(scipy.sparse.csc_array,scipy.sparse.csc_array,numpy.array,numpy.array,numpy.array,float,numpy.array,float,float)
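
A hypothetical call, unpacking the tuple described above (model is assumed to be an already fitted mssm model; argument values are illustrative only):

from mssm.src.python.utils import correct_VB

V,LV,Vp,Vpr,edf,total_edf,edf2,total_edf2,upper_edf = correct_VB(model,grid_type='JJJ2',seed=20)
# edf2 and total_edf2 will only be computed when form_t1=True.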

mssm.src.python.utils.estimateVp(model, nR=20, lR=100, n_c=10, a=1e-07, b=10000000.0, verbose=False, drop_NA=True, method='Chol', seed=None, conv_tol=1e-07, df=40, strategy='JJJ3', **bfgs_options)

Estimate the covariance matrix \(\mathbf{V}_{\boldsymbol{p}}\) of the posterior for \(\mathbf{p} = log(\boldsymbol{\lambda})\). Either (see the strategy parameter) based on a finite difference approximation or on using REML scores to approximate the expectation for the covariance matrix, similar to what was suggested by Greven & Scheipl (2016).

References:
  • Greven, S., & Scheipl, F. (2016). Comment on: Smoothing Parameter and Model Selection for General Smooth Models

Parameters:
  • model (GAMM or GAMLSS or GSMM) – GAMM,GAMLSS, or GSMM model (which has been fitted) for which to estimate \(\mathbf{V}\)

  • nR (int, optional) – The initial \(\lambda\) grid is based on nR equally-spaced samples from \(\lambda/lR\) to \(\lambda*lR\). In addition, nR*len(model.formula.penalties) updates to \(\mathbf{V}_{\boldsymbol{p}}\) are performed, during each of which an additional nR \(\lambda\) samples/reml scores are generated/computed, defaults to 20

  • lR (int, optional) – The initial \(\lambda\) grid is based on nR equally-spaced samples from \(\lambda/lR\) to \(\lambda*lR\), defaults to 100

  • n_c (int, optional) – Number of cores to use to compute the estimate, defaults to 10

  • a (float, optional) – Minimum \(\lambda\) value that is included when forming the initial grid. In addition, any of the \(\lambda\) estimates obtained from model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are smaller than this are set to this value as well, defaults to 1e-7, the minimum possible estimate

  • b (float, optional) – Maximum \(\lambda\) value that is included when forming the initial grid. In addition, any of the \(\lambda\) estimates obtained from model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are larger than this are set to this value as well, defaults to 1e7, the maximum possible estimate

  • verbose (bool, optional) – Whether to print progress information or not, defaults to False

  • drop_NA (bool,optional) – Whether to drop rows in the model matrices corresponding to NAs in the dependent variable vector. Defaults to True.

  • method (str,optional) – Which method to use to solve for the coefficients (and smoothing parameters). The default (“Chol”) relies on Cholesky decomposition. This is extremely efficient but in principle less stable, numerically speaking. For a maximum of numerical stability set this to “QR/Chol”. In that case a QR decomposition is used - which is first pivoted to maximize sparsity in the resulting decomposition but also pivots for stability in order to get an estimate of rank deficiency. A Cholesky decomposition is then used, based on the combined pivoting strategy obtained from the QR. This takes substantially longer. If this is set to 'qEFS', then the coefficients are estimated via quasi-Newton and the smoothing penalties are estimated from the quasi-Newton approximation to the Hessian. This only requires first derivative information. Defaults to “Chol”.

  • seed (int,optional) – Seed to use for random parts of the estimate. Defaults to None

  • conv_tol (float, optional) – Deprecated, defaults to 1e-7

  • df (int, optional) – Degrees of freedom used for the multivariate t distribution used to sample the next set of candidates. Setting this to np.inf means a multivariate normal is used for sampling, defaults to 40

  • strategy (str, optional) – By default (strategy='JJJ3'), the expectation for the covariance matrix is approximated via numeric integration. When strategy='JJJ1' a finite difference approximation is obtained instead. When strategy='JJJ2', the finite difference approximation is updated through numeric integration iteratively. Defaults to ‘JJJ3’

  • bfgs_options (key=value,optional) – Any additional keyword arguments that should be passed on to the call of scipy.optimize.minimize(). If none are provided, the gtol argument will be initialized to 1e-3. Note also, that in any case the maxiter argument is automatically set to 100. Defaults to None.

Returns:

An estimate of the covariance matrix of the posterior for \(\mathbf{p} = log(\boldsymbol{\lambda})\)

Return type:

numpy.array
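
A hypothetical call (model is again assumed to be an already fitted mssm model):

from mssm.src.python.utils import estimateVp

Vp = estimateVp(model,strategy='JJJ1',seed=20)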

mssm.src.python.utils.forward_hessian(coef, llkfun, *llkgargs, **llkkwargs)

Generic function to approximate the hessian of llkfun at coef.

llkfun is called as llkfun(coef,*llkgargs,**llkkwargs) - so all additional arguments have to be passed via the latter two.

Uses forward finite differences with a fixed value for epsilon, based on chapter 5.5.2 in Wood (2015).

References:
  • Wood, S. N. (2015). Core Statistics

Parameters:
  • coef (numpy.array) – Current estimate of coefficients of which llkfun is some function.

  • llkfun (Callable) – log-likelihood function to optimize.

Returns:

An approximation of the Hessian as a numpy array.

Return type:

numpy.array

mssm.src.python.utils.sample_MVN(n, mu, scale, P, L, LI=None, use=None, seed=None)

Draw n samples from multivariate normal with mean \(\boldsymbol{\mu}\) (mu) and covariance matrix \(\boldsymbol{\Sigma}\).

\(\boldsymbol{\Sigma}\) does not need to be provided. Rather the function expects either L (\(\mathbf{L}\) in what follows) or LI (\(\mathbf{L}^{-1}\) in what follows) and scale (\(\phi\) in what follows). These relate to \(\boldsymbol{\Sigma}\) so that \(\boldsymbol{\Sigma}/\phi = \mathbf{L}^{-T}\mathbf{L}^{-1}\) or \(\mathbf{L}\mathbf{L}^T = [\boldsymbol{\Sigma}/\phi]^{-1}\) so that \(\mathbf{L}*(1/\phi)^{0.5}\) is the Cholesky of the precision matrix of \(\boldsymbol{\Sigma}\).

Notably, for models available in mssm, L (and LI) have usually been computed for a permuted matrix, e.g., \(\mathbf{P}[\mathbf{X}^T\mathbf{X} + \mathbf{S}_{\lambda}]\mathbf{P}^T\) (see Wood & Fasiolo, 2017). Hence for sampling we often need to correct for the permutation matrix \(\mathbf{P}\) (P). If LI is provided, then P can be omitted and is assumed to have been used to un-pivot LI already.

Used, for example, to sample the uncorrected posterior \(\boldsymbol{\beta} | \mathbf{y}, \boldsymbol{\lambda} \sim N(\boldsymbol{\mu} = \hat{\boldsymbol{\beta}},[\mathbf{X}^T\mathbf{X} + \mathbf{S}_{\lambda}]^{-1}\phi)\) for a GAMM (see Wood, 2017). Based on section 7.4 in Gentle (2009), assuming \(\boldsymbol{\Sigma}\) is \(p \times p\) and the covariance matrix of the uncorrected posterior, samples \(\boldsymbol{\beta}\) are then obtained by computing:

\[\boldsymbol{\beta} = \hat{\boldsymbol{\beta}} + [\mathbf{P}^T \mathbf{L}^{-T}*\phi^{0.5}]\mathbf{z}\ \text{where}\ z_i \sim N(0,1)\ \forall i = 1,...,p\]

Alternatively, relying on the fact of equivalence that:

\[[\mathbf{L}^T*(1/\phi)^{0.5}]\mathbf{P}[\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}] = \mathbf{z}\]

we can first solve for \(\mathbf{y}\) in:

\[[\mathbf{L}^T*(1/\phi)^{0.5}] \mathbf{y} = \mathbf{z}\]

followed by computing:

\[ \begin{align}\begin{aligned}\mathbf{y} = \mathbf{P}[\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}]\\\boldsymbol{\beta} = \hat{\boldsymbol{\beta}} + \mathbf{P}^T\mathbf{y}\end{aligned}\end{align} \]

The latter avoids forming \(\mathbf{L}^{-1}\) (which unlike \(\mathbf{L}\) might not benefit from the sparsity preserving permutation \(\mathbf{P}\)). If LI is None, L will thus be used for sampling as outlined in these alternative steps.

Often we care only about a handful of elements in mu (e.g., the first ones corresponding to “fixed effects” in a GAMM). In that case we can generate samples only for this sub-set of interest by only using a sub-block of rows of \(\mathbf{L}\) or \(\mathbf{L}^{-1}\) (all columns remain). Argument use can be a np.array containing the indices of elements in mu that should be sampled. Because this only works efficiently when LI is available, an error is raised when use is not None but LI is None.

If mu is set to any integer (i.e., not a Numpy array/list) it is automatically treated as 0. For mssm.models.GAMMLSS or mssm.models.GSMM models, scale can be set to 1.
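
A minimal sketch of the alternative steps outlined above (illustration only; L, P, mu, scale, n, and seed are assumed to be available as described for the parameters below):

import numpy as np
from scipy.sparse.linalg import spsolve_triangular

# Standard normal deviates, one column per sample:
z = np.random.default_rng(seed).standard_normal((L.shape[1],n))

# Solve [L.T * (1/scale)**0.5] y = z for y; L.T is upper triangular:
y = spsolve_triangular((L.T * (1/scale)**0.5).tocsr(), z, lower=False)

# Un-pivot and shift by the mean:
beta = mu.reshape(-1,1) + P.T @ y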

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

  • Gentle, J. (2009). Computational Statistics.

Parameters:
  • n (int) – Number of samples to generate

  • mu (np.array) – mean of normal distribution as described above

  • scale (float) – scaling parameter of covariance matrix as described above

  • P (scp.sparse.csc_array) – Permutation matrix, optional.

  • L (scp.sparse.csc_array) – Cholesky of precision of scaled covariance matrix as described above.

  • LI (scp.sparse.csc_array, optional) – Inverse of cholesky factor of precision of scaled covariance matrix as described above.

  • use ([int], optional) – Indices of parameters in mu for which to generate samples, defaults to None in which case all parameters will be sampled

  • seed (int, optional) – Seed to use for random sample generation, defaults to None

Raises:
  • ValueError – In case neither LI nor L are provided.

  • ValueError – In case L is provided but P is not.

  • ValueError – In case use is provided but LI is not.

Returns:

Samples from multi-variate normal distribution. In case use is not provided, the returned array will be of shape (p,n) where p==LI.shape[1]. Otherwise, the returned array will be of shape (len(use),n).

Return type:

np.array

mssm.src.python.utils.updateVp(ep, remls, rGrid)

Update the covariance matrix of the posterior for \(\mathbf{p} = log(\boldsymbol{\lambda})\). REML scores are used to approximate the expectation, similar to what was suggested by Greven & Scheipl (2016).

References:
  • Greven, S., & Scheipl, F. (2016). Comment on: Smoothing Parameter and Model Selection for General Smooth Models

Parameters:
  • ep ([float]) – Model estimate of log(lambda), i.e., the expectation over rGrid

  • remls ([float]) – REML score associated with each lambda candidate in rGrid

  • rGrid (numpy.array) – A 2d array holding all lambda samples considered so far. Each row is one sample

Returns:

An estimate of the covariance matrix of log(lambda) - a 2d array of shape (len(ep), len(ep)).

Return type:

numpy.array
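
A rough sketch of how the REML scores can act as integration weights here (illustration only, not necessarily mssm's exact implementation):

import numpy as np

# Turn REML scores into normalized weights (a softmax, guarding against overflow):
w = np.exp(np.asarray(remls) - np.max(remls))
w = w / np.sum(w)

# Weighted covariance of the log(lambda) samples around the model estimate ep:
dev = rGrid - np.asarray(ep).reshape(1,-1)
Vp = (dev * w.reshape(-1,1)).T @ dev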