mssm.src.python package

Submodules

mssm.src.python.compare module

mssm.src.python.compare.compare_CDL(model1: GAMM, model2: GAMM, correct_V: bool = True, correct_t1: bool = False, perform_GLRT: bool = False, nR=250, n_c=10, alpha=0.05, grid='JJJ2', a=1e-07, b=10000000.0, df=40, verbose=False, drop_NA=True, method='Chol', seed=None, use_upper=True, Vp_fidiff=True, use_reml_weights=True, **bfgs_options)

Computes the AIC difference and (optionally) performs an approximate GLRT on twice the difference in unpenalized log-likelihood between models model1 and model2 (see Wood et al., 2016).

For the GLRT to be appropriate, model1 should be set to the model containing more effects and model2 should be a nested, simpler variant of model1. For the degrees of freedom of the test, the expected degrees of freedom (EDF) of each model are used (i.e., this is the conditional test discussed in Wood, 2017: 6.12.4). The difference in EDF between the models serves as the DoF for computing the Chi-Square statistic. In addition, correct_t1 should be set to True when computing the GLRT.

To get the AIC for each model, 2*edf is added to twice the negative (conditional) log-likelihood (see Wood et al., 2016).

By default (correct_V=True), mssm will attempt to correct the edf for uncertainty in the estimated \(\lambda\) parameters. Which correction is computed depends on the choice for the grid argument. When grid='JJJ1', an approximation to the analytic solution for the correction proposed by Wood, Pya, & Säfken (2016) is computed - this is exact for strictly Gaussian and canonical generalized additive models. It is not efficient for sparse models and will thus become too expensive for large sparse multi-level models. When setting grid='JJJ2' (default), the analytic solution proposed by Wood, Pya, & Säfken (2016) is used as a basis for the correction and subsequently refined via numeric integration. This requires a costly sampling step (see Greven & Scheipl, 2016 and the correct_VB function in the utils module) which will take quite some time for reasonably large models with more than 3-4 smoothing parameters, but remains efficient for sparse multi-level models (assuming use_upper is set to True and correct_t1 is set to False).

In case either form of correction is too expensive, it might be better to rely on hypothesis tests for individual smooths, confidence intervals, and penalty-based selection approaches instead (see Marra & Wood, 2011 for details on the latter).

In case correct_t1=True the EDF will be set to the smoothness bias corrected expected degrees of freedom (t1 in section 6.1.2 of Wood, 2017; additionally corrected for smoothness uncertainty in case correct_V=True) for the GLRT (based on the recommendation given in section 6.12.4 of Wood, 2017). The AIC (Wood, 2017) of both models will still be based on the regular (smoothness uncertainty corrected) edf.

The computation here differs from the one performed by the compareML function in the R-package itsadug - which rather performs a version of the marginal GLRT (also discussed in Wood, 2017: 6.12.4) - and is more similar to the anova.gam implementation provided by mgcv (if grid='JJJ1'). The returned p-value is approximate - very much so if correct_V=False. Also, the GLRT should not be used to compare models differing in their random effect structures - the AIC is more appropriate for this (see Wood, 2017: 6.12.4).

References:
  • Marra, G., & Wood, S. N. (2011) Practical variable selection for generalized additive models.

  • Wood, S. N., Pya, N., & Säfken, B. (2016). Smoothing Parameter and Model Selection for General Smooth Models

  • Greven, S., & Scheipl, F. (2016). Comment on: Smoothing Parameter and Model Selection for General Smooth Models

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

  • compareML function from itsadug R-package: https://rdrr.io/cran/itsadug/man/compareML.html

  • anova.gam function from mgcv, see: https://www.rdocumentation.org/packages/mgcv/versions/1.9-1/topics/anova.gam

Parameters:
  • model1 (GAMM or GAMMLSS or GSMM) – The first model to be compared; for the GLRT this must be the more complex one.

  • model2 (GAMM or GAMMLSS or GSMM) – The second model to be compared; for the GLRT this must be a nested, simpler variant of model1.

  • correct_V (bool, optional) – Whether or not to correct for smoothness uncertainty. Defaults to True

  • correct_t1 (bool, optional) – Whether or not to also correct the smoothness bias corrected edf for smoothness uncertainty. Defaults to False.

  • perform_GLRT (bool, optional) – Whether to perform a GLRT in addition to computing the AIC or to only compute the AIC. Defaults to False.

  • nR (int, optional) – In case grid != 'JJJ1', nR samples/REML scores are generated/computed to numerically evaluate the expectations necessary for the uncertainty correction, defaults to 250

  • n_c (int, optional) – Number of cores to use to compute the smoothness uncertainty correction, defaults to 10

  • alpha (float, optional) – alpha level of the GLRT. Defaults to 0.05

  • grid (str, optional) – How to compute the smoothness uncertainty correction, defaults to ‘JJJ2’

  • a (float, optional) – Minimum \(\lambda\) value that is included when correcting for uncertainty and grid != 'JJJ1'. In addition, any of the \(\lambda\) estimates obtained from the model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are smaller than this are set to this value as well, defaults to 1e-7 (the minimum possible estimate)

  • b (float, optional) – Maximum \(\lambda\) value that is included when correcting for uncertainty and grid != 'JJJ1'. In addition, any of the \(\lambda\) estimates obtained from the model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are larger than this are set to this value as well, defaults to 1e7 (the maximum possible estimate)

  • df (int, optional) – Degrees of freedom used for the multivariate t distribution used to sample the next set of candidates. Setting this to np.inf means a multivariate normal is used for sampling, defaults to 40

  • verbose (bool, optional) – Whether to print progress information or not, defaults to False

  • drop_NA (bool,optional) – Whether to drop rows in the model matrices corresponding to NAs in the dependent variable vector. Defaults to True.

  • method (str,optional) – Which method to use to solve for the coefficients (and smoothing parameters). The default (“Chol”) relies on Cholesky decomposition. This is extremely efficient but, in principle, numerically less stable. For maximal numerical stability set this to “QR/Chol”. In that case a QR decomposition is used - which is first pivoted to maximize sparsity in the resulting decomposition but also pivots for stability in order to get an estimate of rank deficiency. A Cholesky decomposition is then used, relying on the combined pivoting strategy obtained from the QR. This takes substantially longer. If this is set to 'qEFS', then the coefficients are estimated via quasi-Newton methods and the smoothing penalties are estimated from the quasi-Newton approximation to the hessian. This only requires first-derivative information. Defaults to “Chol”.

  • seed (int,optional) – Seed to use for random parts of the correction. Defaults to None

  • use_upper (bool,optional) – Whether to compute the edf from the trace of the covariance matrix (use_upper=False) or based on numeric integration weights. The latter is much more efficient for sparse models. Only makes sense when grid='JJJ2'. Defaults to True

  • Vp_fidiff (bool,optional) – Whether to rely on a finite difference approximation to compute \(\mathbf{V}_{\boldsymbol{p}}\) for grid JJJ1 and JJJ2 or on a PQL approximation. The latter is exact for Gaussian and canonical GAMs and far cheaper if many penalties are to be estimated. Defaults to True (Finite difference approximation)

  • use_reml_weights (bool,optional) – Whether to rely on REML scores to compute the numerical integration when grid != 'JJJ1' or on the log-densities of \(\mathbf{V}_{\boldsymbol{p}}\) - the latter assumes that the unconditional posterior is normal. Defaults to True (REML scores are used)

  • bfgs_options (key=value,optional) – Any additional keyword arguments that should be passed on to the call of scipy.optimize.minimize(). If none are provided, the gtol argument will be initialized to 1e-3. Note also, that in any case the maxiter argument is automatically set to 100. Defaults to None.

Raises:
  • ValueError – If both models are from different families.

  • ValueError – If perform_GLRT=True and model1 has fewer coefficients than model2 - i.e., model1 has to be the notationally more complex one.

Returns:

A dictionary with the outcomes of all tests. Key H1 will be a bool indicating whether the null hypothesis was rejected or not, p will be the p-value, chi^2 will be the test statistic used, Res. DOF will be the degrees of freedom used by the test, and aic1 and aic2 will be the AIC scores for both models.

Return type:

dict
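
For illustration, a minimal usage sketch follows. This is hedged: model_full and model_nested are placeholders for two already-estimated models (with model_full the more complex one), and the result keys follow the description above:

   from mssm.src.python.compare import compare_CDL

   # Compare a more complex model against a nested, simpler variant.
   result = compare_CDL(model_full, model_nested,
                        correct_V=True,    # correct edf for smoothness uncertainty
                        correct_t1=True,   # recommended when performing the GLRT
                        perform_GLRT=True,
                        grid='JJJ2', nR=250, n_c=10, seed=0)

   print(result["H1"], result["p"], result["chi^2"], result["Res. DOF"])
   print(result["aic1"] - result["aic2"])  # AIC difference between the models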

mssm.src.python.constraints module

class mssm.src.python.constraints.ConstType(*values)

Bases: Enum

DIFF = 3
DROP = 1
QR = 2
class mssm.src.python.constraints.Constraint(Z: array = None, type: ConstType = None)

Bases: object

Constraint storage. Z either holds the QR-based correction matrix that needs to be multiplied with \(\mathbf{X}\), \(\mathbf{S}\), and \(\mathbf{D}\) (where \(\mathbf{D}\mathbf{D}^T = \mathbf{S}\)) to make terms subject to the conventional sum-to-zero constraints applied also in mgcv (Wood, 2017), or the column/row that should be dropped from those matrices - in which case \(\mathbf{X}\) can no longer take on a constant - or None, indicating that the model should be “difference re-coded” to enable sparse sum-to-zero constraints. The latter two options are available in mgcv’s smoothCon function by setting the sparse.cons argument to 1 or 2 respectively.

The QR-based approach is described in detail by Wood (2017) and is similar to just mean centering every basis function involved in the smooth and then dropping one column from the corresponding centered model matrix. The column-dropping approach is self-explanatory. The difference re-coding deserves some explanation. Consider the model:

\[y = f(x) = b_1 f_1(x) + b_2 f_2(x) + b_3 f_3(x) + b_4 f_4(x) + b_5 f_5(x)\]

In mssm \(f(x)\) will be parameterized via five B-spline basis functions \(f_i(x)\), each weighted by a coefficient \(b_i\). Each of the \(f_i(x)\) has support only over a small range of \(x\) - resulting in a sparse model matrix \(\mathbf{X}\). This desirable property is lost when enforcing the conventional sum-to-zero constraints (either QR-based or by subtracting the mean of each basis over \(x\) from the corresponding \(f_i(x)\)). To maintain sparsity, one can enforce a coefficients-sum-to-zero constraint (Wood, 2020) by re-coding instead:

Set \(b_1 = -b_2\)

Then, re-write:

\[\begin{aligned}
y &= -b_2 f_1(x) + b_2 f_2(x) + b_3 f_3(x) - b_3 f_2(x) + b_4 f_4(x) - b_4 f_3(x) + b_5 f_5(x) - b_5 f_4(x)\\
y &= -b_2 f_1(x) + (b_2 - b_3) f_2(x) + (b_3 - b_4) f_3(x) + (b_4 - b_5) f_4(x) + b_5 f_5(x)\\
y &= b_2 (f_2(x) - f_1(x)) + b_3 (f_3(x) - f_2(x)) + b_4 (f_4(x) - f_3(x)) + b_5 (f_5(x) - f_4(x))
\end{aligned}\]

Line 3 shows how the constraint can be absorbed for model fitting by first computing new bases \((f_i(x) - f_{i-1}(x))\) and then estimating the coefficients based on those (this is implemented in mgcv’s smoothCon function when setting sparse.cons=2). Note that the constrained model, just like when using the QR-based or dropping approach, requires dropping one of the \(k\) original coefficients for estimation since only \(k-1\) coefficients are allowed to vary. The same sum-to-zero constraint can be achieved by fixing one of the central bases in the original model to its neighbor (e.g., setting \(b_2 = -b_3\)) or by setting \(b_1 = -b_2 - b_3 - b_4 - b_5\). mssm fixes one of the central bases to its neighbor.

With a B-splines basis, it would be necessary that \(b_1 = b_2 = b_3 = b_4 = b_5\) for the model to fit a constant \(f(x)\) over all values of \(x\) (Eilers & Marx, 1996). In the constrained model this is no longer possible due to \(b_1 = -b_2\), effectively removing the constant from the space of functions that \(f(x)\) can approximate. Instead, in the constrained model if \(b_2 = b_3 = b_4 = b_5\), \(f(x)\) summed over all \(x\) will be equal to zero!

Importantly, the “new” bases obtained by computing \((f_i(x) - f_{i-1}(x))\) are all still relatively sparse. This is because neighboring \(f_i(x)\) overlap in their support and because the support of each individual \(f_i(x)\) is narrow, i.e., \(f_i(x)\) is sparse itself.
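
The equivalence of the two parameterizations can be verified numerically. Below is a small sketch (using a random dense stand-in for the B-spline model matrix, purely for illustration) showing that the re-coded bases reproduce the original model under coefficients that sum to zero:

   import numpy as np

   rng = np.random.default_rng(0)
   X = rng.random((10, 5))   # stand-in for the 5 basis columns f_1(x),...,f_5(x)
   b = rng.random(4)         # free coefficients b_2,...,b_5

   # Re-coded bases: columns hold (f_i(x) - f_{i-1}(x)) for i = 2,...,5.
   X_con = X[:, 1:] - X[:, :-1]

   # Equivalent coefficients on the original basis sum to zero.
   c = np.concatenate(([-b[0]], b[:-1] - b[1:], [b[-1]]))
   assert np.isclose(c.sum(), 0.0)
   assert np.allclose(X @ c, X_con @ b)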

However, this is not a true centering constraint: \(f(x)\) will not necessarily be orthogonal to the intercept, i.e., \(\mathbf{1}^T \mathbf{f(x)}\) will not necessarily be 0. Hence, confidence intervals will usually be wider when using ConstType.DIFF (also when using ConstType.DROP, for the same reason) instead of ConstType.QR (see Wood; 2017,2020)!

A final note regards the use of tensor smooths when te==False. Since the value of any constant estimated for a smooth depends on the type of constraint used, the marginal functions estimated for the “main effects” (\(f(x)\), \(f(z)\)) and “interaction effect” (\(f(x,z)\)) in a model \(y = a + f(x) + f(z) + f(x,z)\) will differ depending on the type of constraint used. The “Anova-like” decomposition described in detail in Wood (2017) is achievable only when using ConstType.QR.

References:
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

  • Wood, S. N. (2020). Inference and computation with generalized additive models and their extensions. TEST, 29(2), 307–339. https://doi.org/10.1007/s11749-020-00711-5

  • Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89–121. https://doi.org/10.1214/ss/1038425655

Z: array = None
type: ConstType = None

mssm.src.python.exp_fam module

class mssm.src.python.exp_fam.Binomial(link: Link = Logit(), n: int = 1)

Bases: Family

Binomial family. For this implementation we assume that we have collected proportions of success, i.e., the dependent variable specified in the model Formula needs to hold observed proportions and not counts! If we assume that each observation \(y_i\) reflects a single independent draw from a binomial (with \(n=1\), and \(p_i\) being the probability that the result is 1), then the dependent variable should hold either 1 or 0. If we have multiple independent draws from the binomial per observation (i.e., row in our data-frame), then \(n\) will usually differ between observations/rows in our data-frame (i.e., we observe \(k_i\) counts of success out of \(n_i\) draws - so that \(y_i=k_i/n_i\)). In that case, the Binomial() family accepts a vector for argument \(\mathbf{n}\) (which is simply set to 1 by default, assuming binary data), containing \(n_i\) for every observation \(y_i\).

In this implementation, the scale parameter is kept fixed/known at 1.
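
For example, grouped binomial data would be prepared along these lines (a sketch; the counts and draws are placeholders):

   import numpy as np
   from mssm.src.python.exp_fam import Binomial

   k = np.array([3, 7, 0, 10])    # k_i: observed successes per row
   n = np.array([10, 10, 5, 12])  # n_i: independent draws per row
   y = k / n                      # observed proportions - the dependent variable

   family = Binomial(n=n.tolist())  # n supplied per observation; defaults to 1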

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • link (Link) – The link function to be used by the model of the mean of this family. By default set to the canonical logit link.

  • n (int or [int], optional) – Number of independent draws from a Binomial per observation/row of data-frame. For binary data this can simply be set to 1, which is the default.

D(y, mu)

Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

V(mu)

The variance function (of the mean; see Wood, 2017, 3.1.2) for the Binomial model. Variance is minimal for \(\mu=1\) and \(\mu=0\), maximal for \(\mu=0.5\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted probability for the response distribution corresponding to each observation.

deviance(y, mu)

Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

init_mu(y)

Function providing initial \(\boldsymbol{\mu}\) vector for GAMM.

Estimation assumes proportions as the dependent variable. According to https://stackoverflow.com/questions/60526586/, the glm() function in R always initializes \(\mu\) = 0.75 for observed proportions (i.e., elements in \(\mathbf{y}\)) of 1 and \(\mu\) = 0.25 for proportions of zero. This can be achieved by adding 0.5 to the observed proportion of success (and adding one observation).
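
In other words, for binary data (each \(n_i = 1\)) the initialization described above amounts to the following arithmetic (a sketch, not the actual implementation):

   import numpy as np

   y = np.array([0.0, 1.0, 1.0, 0.0])
   mu0 = (y + 0.5) / 2   # 0.5 successes and one observation added per row
   # -> 0.25 where y == 0 and 0.75 where y == 1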

Parameters:

y ([float]) – The vector containing each observation.

llk(y, mu)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

lp(y, mu)

Log-probability of observing every proportion in \(\mathbf{y}\) under their respective binomial with mean = \(\boldsymbol{\mu}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed proportion.

  • mu ([float]) – The vector containing the predicted probability for the response distribution corresponding to each observation.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.Binomial2(link: Link = Logit(), n: int = 1)

Bases: GAMLSSFamily

Another implementation of the Binomial family that allows estimating binomial models via GAMMLSS (and thus via full Newton, without PQL!).

For this implementation we again assume that we have collected proportions of success, i.e., the dependent variable specified in the model Formula needs to hold observed proportions and not counts! If we assume that each observation \(y_i\) reflects a single independent draw from a binomial (with \(n=1\), and \(p_i\) being the probability that the result is 1), then the dependent variable should hold either 1 or 0. If we have multiple independent draws from the binomial per observation (i.e., row in our data-frame), then \(n\) will usually differ between observations/rows in our data-frame (i.e., we observe \(k_i\) counts of success out of \(n_i\) draws - so that \(y_i=k_i/n_i\)). In that case, this family accepts a vector for argument \(\mathbf{n}\) (which is simply set to 1 by default, assuming binary data), containing \(n_i\) for every observation \(y_i\).

In this implementation, the scale parameter is kept fixed/known at 1.

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • link (Link) – The link function to be used by the model of the mean of this family. By default set to the canonical logit link.

  • n (int or [int], optional) – Number of independent draws from a Binomial per observation/row of data-frame. For binary data this can simply be set to 1, which is the default.

get_resid(y, mu)

Get standardized residuals for a Binomial model

Essentially, the deviance residuals are returned, which are equivalent to \(sign(y_i - \mu_i)*D_i^{0.5}\), where \(\sum_{i=1,...N} D_i\) equals the model deviance (see Wood 2017, section 3.1.7).

Parameters:
  • y ([float]) – The vector containing each observed proportion.

  • mu ([float]) – The vector containing the predicted probability for the response distribution corresponding to each observation.

Returns:

A list of deviance residuals that should be ~ N(0,1) if the model is correct.

Return type:

[float]

init_coef(models)

Function to initialize the coefficients of the model.

Fits a GAMM (via PQL) for the mean.

Parameters:

models ([mssm.models.GAMM]) – A list of mssm.models.GAMM’s, each based on one of the formulas provided to a model.

Returns:

A numpy.array of shape (-1,1), holding initial values for all model coefficients.

Return type:

numpy array

init_lambda(formulas)

Function to initialize the smoothing parameters of the model.

Returns values in line with what is used to initialize a GAMM.

Parameters:

formulas ([mssm.src.python.formula.Formula]) – A list of mssm.src.python.formula.Formula’s provided to a model.

Returns:

A list, holding - for each \(\lambda\) parameter to be estimated - an initial value.

Return type:

[float]

llk(y, mu)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

lp(y, mu)

Log-probability of observing every proportion in \(\mathbf{y}\) under their respective binomial with mean = \(\boldsymbol{\mu}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed proportion.

  • mu ([float]) – The vector containing the predicted probability for the response distribution corresponding to each observation.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.Family(link: Link, twopar: bool, scale: float = None)

Bases: object

Base class to be implemented by Exp. family member.

Parameters:
  • link (Link) – The link function to be used by the model of the mean of this family.

  • twopar (bool) – Whether the family has two parameters (mean,scale) to be estimated (i.e., whether the likelihood is a function of two parameters), or only a single one (usually the mean).

  • scale (float or None, optional) – Known/fixed scale parameter for this family.

D(y, mu)

Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

V(mu, **kwargs)

The variance function (of the mean; see Wood, 2017, 3.1.2). Different exponential families allow for different relationships between the variance in our random response variable and the mean of it. For the normal model this is assumed to be constant.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

deviance(y, mu)

Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

init_mu(y)

Convenience function to compute an initial \(\boldsymbol{\mu}\) estimate passed to the GAMM/PIRLS estimation routine.

Parameters:

y ([float]) – The vector containing each observation.

llk(y, mu, **kwargs)

log-probability of \(\mathbf{y}\) under this family with mean = \(\boldsymbol{\mu}\). Essentially sum over all elements in the vector returned by the lp() method.

Families with more than one parameter that needs to be estimated in order to evaluate the model’s log-likelihood (i.e., twopar=True) must accept, as keyword argument, a scale parameter with a default value, e.g.,:

def llk(self, y, mu, scale=1):
   ...

You can check the implementation of the Gaussian Family for an example.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

lp(y, mu, **kwargs)

Log-probability of observing every value in \(\mathbf{y}\) under this family with mean = \(\boldsymbol{\mu}\).

Families with more than one parameter that needs to be estimated in order to evaluate the model’s log-likelihood (i.e., twopar=True) must accept, as keyword argument, a scale parameter with a default value, e.g.,:

def lp(self, y, mu, scale=1):
   ...

You can check the implementation of the Gaussian Family for an example.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]
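
To make the required interface concrete, here is a hedged sketch of a custom two-parameter family mirroring the Gaussian. mssm's own Gaussian family should be used in practice; the exact conventions assumed here - e.g., that the scale passed to llk()/lp() is a variance - are illustrative assumptions:

   import numpy as np
   import scipy.stats as scp
   from mssm.src.python.exp_fam import Family, Identity

   class MyGaussian(Family):
       def __init__(self, scale=None):
           # Two-parameter family: mean and scale need to be estimated.
           super().__init__(Identity(), True, scale)

       def init_mu(self, y):
           return y + 1e-3 * np.mean(y)  # mild perturbation as a starting value

       def V(self, mu):
           return np.ones_like(mu)       # constant mean-variance relationship

       def lp(self, y, mu, scale=1):
           # Assumption: scale is treated as a variance here.
           return scp.norm.logpdf(y, loc=mu, scale=np.sqrt(scale))

       def llk(self, y, mu, scale=1):
           return np.sum(self.lp(y, mu, scale))

       def D(self, y, mu):
           return (y - mu) ** 2          # per-observation deviance contribution

       def deviance(self, y, mu):
           return np.sum(self.D(y, mu))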

class mssm.src.python.exp_fam.GAMLSSFamily(pars: int, links: [Link])

Bases: object

Base-class to be implemented by families of Generalized Additive Mixed Models of Location, Scale, and Shape (GAMMLSS; Rigby & Stasinopoulos, 2005).

Apart from the required methods, three mandatory attributes need to be defined by the __init__() constructor of implementations of this class. These are required to evaluate the first and second (pure & mixed) derivative of the log-likelihood with respect to any of the log-likelihood’s parameters. See the variables below.

Optionally, a mean_init_fam attribute can be defined - specifying a Family member that is fitted to the data to get an initial estimate of the mean parameter of the assumed distribution.

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

Parameters:
  • pars (int) – Number of parameters of the distribution belonging to the random variables assumed to have generated the observations, e.g., 2 for the Normal: mean and standard deviation.

  • links ([Link]) – Link functions for each of the parameters of the distribution.

Variables:
  • d1 ([Callable]) – A list holding n_par functions to evaluate the first partial derivatives of llk with respect to each parameter of the llk. Needs to be initialized when calling __init__().

  • d2 ([Callable]) – A list holding n_par functions to evaluate the second (pure) partial derivatives of llk with respect to each parameter of the llk. Needs to be initialized when calling __init__().

  • d2m ([Callable]) – A list holding n_par*(n_par-1)/2 functions to evaluate the second mixed partial derivatives of llk with respect to each parameter of the llk in order: d2m[0] = \(\partial l/\partial \mu_1 \partial \mu_2\), d2m[1] = \(\partial l/\partial \mu_1 \partial \mu_3\), …, d2m[n_par-2] = \(\partial l/\partial \mu_1 \partial \mu_{n_{par}}\), d2m[n_par-1] = \(\partial l/\partial \mu_2 \partial \mu_3\), d2m[n_par] = \(\partial l/\partial \mu_2 \partial \mu_4\), … . Needs to be initialized when calling __init__().
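
As a concrete illustration of these attributes, below is a hedged sketch of a two-parameter Normal GAMMLSS family. mssm ships GAUMLSS for this purpose; the derivative expressions are standard calculus for the Normal log-likelihood with standard deviation \(\sigma\), but the exact calling conventions of the derivative functions are assumptions here:

   import numpy as np
   import scipy.stats as scp
   from mssm.src.python.exp_fam import GAMLSSFamily, Identity, LOG

   class MyGAULSS(GAMLSSFamily):
       def __init__(self):
           super().__init__(2, [Identity(), LOG()])
           # First partial derivatives of the llk w.r.t. mu and sigma:
           self.d1 = [lambda y, mu, sigma: (y - mu) / sigma**2,
                      lambda y, mu, sigma: (y - mu)**2 / sigma**3 - 1 / sigma]
           # Pure second partial derivatives:
           self.d2 = [lambda y, mu, sigma: -1 / sigma**2,
                      lambda y, mu, sigma: 1 / sigma**2 - 3 * (y - mu)**2 / sigma**4]
           # Mixed second partial derivatives: n_par*(n_par-1)/2 = 1 entry.
           self.d2m = [lambda y, mu, sigma: -2 * (y - mu) / sigma**3]

       def lp(self, y, mu, sigma):
           return scp.norm.logpdf(y, loc=mu, scale=sigma)

       def llk(self, y, mu, sigma):
           return np.sum(self.lp(y, mu, sigma))

       def get_resid(self, y, mu, sigma):
           return (y - mu) / sigma  # standardized residuals, ~ N(0,1) if correct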

get_resid(y, *mus)

Get standardized residuals for a GAMMLSS model (Rigby & Stasinopoulos, 2005).

Any implementation of this function should return a vector that looks like what could be expected from taking len(y) independent draws from \(N(0,1)\).

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mus ([[float]]) – A list containing self.n_par lists - one for each parameter of the distribution. Each of those lists contains the expected value for a particular parameter for each of the N observations.

Returns:

An N-dimensional vector containing the standardized residuals (one per observation).

Return type:

[float]

init_coef(models)

(Optional) Function to initialize the coefficients of the model.

Can return None, in which case random initialization will be used.

Parameters:

models ([mssm.models.GAMM]) – A list of mssm.models.GAMM’s, each based on one of the formulas provided to a model.

Returns:

A numpy.array of shape (-1,1), holding initial values for all model coefficients.

Return type:

numpy array

init_lambda(formulas)

(Optional) Function to initialize the smoothing parameters of the model.

Can return None, in which case random initialization will be used.

Parameters:

formulas ([mssm.src.python.formula.Formula]) – A list of mssm.src.python.formula.Formula’s provided to a model.

Returns:

A list, holding - for each \(\lambda\) parameter to be estimated - an initial value.

Return type:

[float]

llk(y, *mus)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mus ([[float]]) – A list containing self.n_par lists - one for each parameter of the distribution. Each of those lists contains the expected value for a particular parameter for each of the N observations.

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, *mus)

Log-probability of observing every element in \(\mathbf{y}\) under their respective distribution parameterized by mus.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mus ([[float]]) – A list containing self.n_par lists - one for each parameter of the distribution. Each of those lists contains the expected value for a particular parameter for each of the N observations.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.GAMMALS(links: [Link])

Bases: GAMLSSFamily

Family for a GAMMA GAMMLSS model (Rigby & Stasinopoulos, 2005).

This Family follows the Gamma family, in that we assume: \(Y_i \sim \Gamma(\mu_i,\phi_i)\). The difference to the Gamma family is that we now also model \(\phi\) as an additive combination of smooth variables and other parametric terms. The Gamma distribution is usually not expressed in terms of the mean and scale (\(\phi\)) parameter but rather in terms of a shape and rate parameter - called \(\alpha\) and \(\beta\) respectively. Wood (2017) provides \(\alpha = 1/\phi\). With this we can obtain \(\beta = 1/(\phi\mu)\) (see the source-code of the lp() method of the Gamma family for details).

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

Parameters:

links ([Link]) – Link functions for the mean and standard deviation. Standard would be links=[LOG(),LOG()].

get_resid(y, mu, scale)

Get standardized residuals for a Gamma GAMMLSS model (Rigby & Stasinopoulos, 2005).

Essentially, to get a standardized residual vector we first have to account for the mean-variance relationship of our RVs (as we also have to do for the Gamma family) - for this we can simply compute deviance residuals again (see Wood, 2017). These should be \(\sim N(0,\phi_i)\) (where \(\phi_i\) is the element in scale for a specific observation) - so if we divide each of those by the observation-specific scale we can expect the resulting standardized residuals to be \(\sim N(0,1)\) if the model is correct.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale ([float]) – The vector containing the predicted scale parameter for the response distribution corresponding to each observation.

Returns:

A list of standardized residuals that should be ~ N(0,1) if the model is correct.

Return type:

[float]

init_coef(models)

Function to initialize the coefficients of the model.

Fits a GAMM for the mean and initializes all coef. for the scale parameter to 1.

Parameters:

models ([mssm.models.GAMM]) – A list of mssm.models.GAMM’s, each based on one of the formulas provided to a model.

Returns:

A numpy.array of shape (-1,1), holding initial values for all model coefficients.

Return type:

numpy array

llk(y, mu, scale)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale ([float]) – The vector containing the predicted scale parameter for the response distribution corresponding to each observation.

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, mu, scale)

Log-probability of observing every value in \(\mathbf{y}\) under their respective Gamma with mean = \(\boldsymbol{\mu}\) and scale = \(\boldsymbol{\phi}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale ([float]) – The vector containing the predicted scale parameter for the response distribution corresponding to each observation.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.GAUMLSS(links: [Link])

Bases: GAMLSSFamily

Family for a Normal GAMMLSS model (Rigby & Stasinopoulos, 2005).

This Family follows the Gaussian family, in that we assume: \(Y_i \sim N(\mu_i,\sigma_i)\), i.e., each of the \(N\) observations is still believed to have been generated from an independent normally distributed RV with observation-specific mean.

The important difference is that the scale parameter, \(\sigma\), is now also observation-specific and modeled as an additive combination of smooth functions and other parametric terms, just like the mean is in a Normal GAM. Note that this explicitly models heteroscedasticity - the residuals are no longer assumed to be i.i.d. samples from \(N(0,\sigma)\), since \(\sigma\) can now differ between residual realizations.

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

Parameters:

links ([Link]) – Link functions for the mean and standard deviation. Standard would be links=[Identity(),LOG()].

get_resid(y, mu, sigma)

Get standardized residuals for a Normal GAMMLSS model (Rigby & Stasinopoulos, 2005).

Essentially, each residual should reflect a realization of a normal with mean zero and observation-specific standard deviation. After scaling each residual by their observation-specific standard deviation we should end up with standardized residuals that can be expected to be i.i.d \(\sim N(0,1)\) - assuming that our model is correct.
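
That is, the residuals follow the standardization just described (a small numeric sketch with placeholder values):

   import numpy as np

   rng = np.random.default_rng(1)
   mu = np.array([0.0, 1.0, 2.0, 3.0])
   sigma = np.array([0.5, 1.0, 2.0, 4.0])
   y = rng.normal(mu, sigma)

   res = (y - mu) / sigma  # i.i.d. ~ N(0,1) if the model is correct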

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • sigma ([float]) – The vector containing the predicted standard deviation for the response distribution corresponding to each observation.

Returns:

A list of standardized residuals that should be ~ N(0,1) if the model is correct.

Return type:

[float]

init_coef(models)

Function to initialize the coefficients of the model.

Fits a GAMM for the mean and initializes all coef. for the standard deviation to 1.

Parameters:

models ([mssm.models.GAMM]) – A list of mssm.models.GAMM’s, each based on one of the formulas provided to a model.

Returns:

A numpy.array of shape (-1,1), holding initial values for all model coefficients.

Return type:

numpy array

llk(y, mu, sigma)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • sigma ([float]) – The vector containing the predicted standard deviation for the response distribution corresponding to each observation.

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, mu, sigma)

Log-probability of observing every value in \(\mathbf{y}\) under their respective Normal with observation-specific mean and standard deviation.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • sigma ([float]) – The vector containing the predicted standard deviation for the response distribution corresponding to each observation.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.GENSMOOTHFamily(pars: int, links: [Link], *llkargs)

Bases: object

Base-class for General Smooth “families” as discussed by Wood, Pya, & Säfken (2016). For estimation of mssm.models.GSMM models via BFGS it is sufficient to implement llk(); gradient() and hessian() can then simply return None. For exact estimation via Newton’s method, the latter two functions need to be implemented and have to return the gradient and hessian at the current coefficient estimate respectively.

Additional parameters needed for likelihood, gradient, or hessian evaluation can be passed along via the llkargs.
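
A hedged sketch of a minimal implementation sufficient for BFGS-based estimation follows - a Gaussian log-likelihood with unit scale, up to an additive constant. The calling conventions follow the method signatures documented below; the class itself is hypothetical:

   import numpy as np
   from mssm.src.python.exp_fam import GENSMOOTHFamily, Identity

   class GaussianGSMM(GENSMOOTHFamily):
       def __init__(self):
           super().__init__(1, [Identity()])

       def llk(self, coef, coef_split_idx, y, Xs):
           eta = Xs[0] @ coef              # single linear predictor
           res = y - eta
           return -0.5 * np.sum(res**2)    # log-likelihood up to a constant

       def gradient(self, coef, coef_split_idx, y, Xs):
           return None                     # sufficient when estimating via BFGS

       def hessian(self, coef, coef_split_idx, y, Xs):
           return None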

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Nocedal & Wright (2006). Numerical Optimization. Springer New York.

Parameters:
  • pars (int) – Number of parameters of the likelihood.

  • links ([Link]) – List of Link functions for each parameter of the likelihood, e.g., links=[Identity(),LOG()].

gradient(coef, coef_split_idx, y, Xs)

Function to evaluate the gradient of the llk at current coefficient estimate coef.

Parameters:
  • coef ([float]) – The current coefficient estimate (as np.array of shape (-1,1) - so it must not be flattened!).

  • coef_split_idx ([int]) – A list used to split (via np.split()) the coef into the sub-sets associated with each parameter of the llk.

  • y ([float]) – The vector containing each observation.

  • Xs ([scp.sparse.csc_array]) – A list of sparse model matrices per likelihood parameter.

Returns:

The gradient of the log-likelihood evaluated at coef, as a numpy.array of shape (-1,1).

Return type:

[float]

hessian(coef, coef_split_idx, y, Xs)

Function to evaluate the hessian of the llk at current coefficient estimate coef.

Parameters:
  • coef ([float]) – The current coefficient estimate (as np.array of shape (-1,1) - so it must not be flattened!).

  • coef_split_idx ([int]) – A list used to split (via np.split()) the coef into the sub-sets associated with each parameter of the llk.

  • y ([float]) – The vector containing each observation.

  • Xs ([scp.sparse.csc_array]) – A list of sparse model matrices per likelihood parameter.

Returns:

The Hessian of the log-likelihood evaluated at coef.

Return type:

scipy.sparse.csc_array

init_coef(models)

(Optional) Function to initialize the coefficients of the model.

Can return None , in which case random initialization will be used.

Parameters:

models ([mssm.models.GAMM]) – A list of mssm.models.GAMM’s, each based on one of the formulas provided to a model.

Returns:

A numpy.array of shape (-1,1), holding initial values for all model coefficients.

Return type:

numpy array

init_lambda(formulas)

(Optional) Function to initialize the smoothing parameters of the model.

Can return None, in which case random initialization will be used.

Parameters:

formulas ([mssm.src.python.formula.Formula]) – A list of mssm.src.python.formula.Formula’s provided to a model.

Returns:

A list, holding - for each \(\lambda\) parameter to be estimated - an initial value.

Return type:

[float]

llk(coef, coef_split_idx, y, Xs)

log-probability of data under given model.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • coef ([float]) – The current coefficient estimate (as np.array of shape (-1,1) - so it must not be flattened!).

  • coef_split_idx ([int]) – A list used to split (via np.split()) the coef into the sub-sets associated with each parameter of the llk.

  • y ([float]) – The vector containing each observation.

  • Xs ([scp.sparse.csc_array]) – A list of sparse model matrices per likelihood parameter.

Returns:

The log-likelihood evaluated at coef.

Return type:

float

class mssm.src.python.exp_fam.Gamma(link: Link = LOG(), scale: float = None)

Bases: Family

Gamma Family.

We assume: \(Y_i \sim \Gamma(\mu_i,\phi)\). The Gamma distribution is usually not expressed in terms of the mean and scale (\(\phi\)) parameter but rather in terms of a shape and rate parameter - called \(\alpha\) and \(\beta\) respectively. Wood (2017) provides \(\alpha = 1/\phi\). With this we can obtain \(\beta = 1/(\phi\mu)\) (see the source-code of the lp() method for details).
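
Numerically, the conversion between the two parameterizations looks as follows (a sketch relying on scipy, which parameterizes the Gamma via shape and scale = 1/rate):

   import numpy as np
   import scipy.stats as scp

   mu, phi = 2.0, 0.5
   alpha = 1 / phi           # shape
   beta = 1 / (phi * mu)     # rate

   assert np.isclose(alpha / beta, mu)  # mean of Gamma(alpha, beta) is alpha/beta
   logp = scp.gamma.logpdf(1.3, a=alpha, scale=1 / beta)  # scipy takes scale = 1/rate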

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • link (Link) – The link function to be used by the model of the mean of this family. By default set to the log link.

  • scale (float or None, optional) – Known scale parameter for this family - by default set to None so that the scale parameter is estimated.

D(y, mu)

Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

A N-dimensional vector containing the contribution of each data-point to the overall model deviance.

Return type:

[float]

V(mu)

Variance function for the Gamma family.

The variance of random variable \(Y\) is proportional to it’s mean raised to the second power.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – a N-dimensional vector of the model prediction/the predicted mean

Returns:

mu raised to the power of 2

Return type:

[float]

deviance(y, mu)

Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

The model deviance.

Return type:

float

llk(y, mu, scale=1)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale (float, optional) – The (estimated) scale parameter, defaults to 1

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, mu, scale=1)

Log-probability of observing every value in \(\mathbf{y}\) under their respective Gamma with mean = \(\boldsymbol{\mu}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale (float, optional) – The (estimated) scale parameter, defaults to 1

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.Gaussian(link: Link = Identity(), scale: float = None)

Bases: Family

Normal/Gaussian Family.

We assume: \(Y_i \sim N(\mu_i,\sigma)\) - i.e., each of the \(N\) observations is generated from a normally distributed RV with observation-specific mean and shared scale parameter \(\sigma\). Equivalent to the assumption that the observed residual vector - the difference between the model prediction and the observed data - should look like what could be expected from drawing \(N\) independent samples from a Normal with mean zero and standard deviation equal to \(\sigma\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • link (Link) – The link function to be used by the model of the mean of this family. By default set to the canonical identity link.

  • scale (float or None, optional) – Known scale parameter for this family - by default set to None so that the scale parameter is estimated.

D(y, mu)

Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

A N-dimensional vector containing the contribution of each data-point to the overall model deviance.

Return type:

[float]

V(mu)

Variance function for the Normal family.

Not really a function of the mean, since the variance is assumed constant (i.e., independent of the mean) for this model.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – a N-dimensional vector of the model prediction/the predicted mean

Returns:

a N-dimensional vector of 1s

Return type:

[float]

deviance(y, mu)

Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

The model deviance.

Return type:

float

llk(y, mu, sigma=1)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • sigma (float, optional) – The (estimated) sigma parameter, defaults to 1

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, mu, sigma=1)

Log-probability of observing every value in \(\mathbf{y}\) under their respective Normal with mean = \(\boldsymbol{\mu}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • sigma (float, optional) – The (estimated) sigma parameter, defaults to 1

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.Identity

Bases: Link

Identity Link function. \(\boldsymbol{\mu}=\boldsymbol{\eta}\) and so this link is trivial.

dy1(mu)

First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

dy2(mu)

Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

f(mu)

Canonical link for normal distribution with \(\boldsymbol{\eta} = \boldsymbol{\mu}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

fi(eta)

For the identity link, \(\boldsymbol{\eta} = \boldsymbol{\mu}\), so the inverse is also just the identity. See Faraway (2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

eta ([float]) – The vector containing the model prediction corresponding to each observation.

class mssm.src.python.exp_fam.InvGauss(link: Link = LOG(), scale: float = None)

Bases: Family

Inverse Gaussian Family.

We assume: \(Y_i \sim IG(\mu_i,\phi)\). The Inverse Gaussian distribution is usually not expressed in terms of the mean and scale (\(\phi\)) parameter but rather in terms of a shape and scale parameter - called \(\nu\) and \(\lambda\) respectively (see the scipy implementation). We can simply set \(\nu=\mu\) (compare the scipy density to the one in table 3.1 of Wood, 2017). Wood (2017) shows that \(\phi=1/\lambda\), so this provides \(\lambda=1/\phi\).
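
A hedged sketch of how this maps onto scipy.stats.invgauss: under this convention scipy's shape parameter equals \(\mu\phi\) and its scale equals \(1/\phi\), recovering mean \(\mu\) and variance \(\mu^3\phi\):

   import numpy as np
   import scipy.stats as scp

   mu, phi = 2.0, 0.25
   dist = scp.invgauss(mu * phi, scale=1 / phi)

   assert np.isclose(dist.mean(), mu)
   assert np.isclose(dist.var(), mu**3 * phi)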

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • link (Link) – The link function to be used by the model of the mean of this family. By default set to the log link.

  • scale (float or None, optional) – Known scale parameter for this family - by default set to None so that the scale parameter is estimated.

D(y, mu)

Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

A N-dimensional vector containing the contribution of each data-point to the overall model deviance.

Return type:

[float]

V(mu)

Variance function for the Inverse Gaussian family.

The variance of random variable \(Y\) is proportional to it’s mean raised to the third power.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – a N-dimensional vector of the model prediction/the predicted mean

Returns:

mu raised to the power of 3

Return type:

[float]

deviance(y, mu)

Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

The model deviance.

Return type:

float

llk(y, mu, scale=1)

log-probability of data under given model. Essentially sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale (float, optional) – The (estimated) scale parameter, defaults to 1

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, mu, scale=1)

Log-probability of observing every value in \(\mathbf{y}\) under their respective inverse Gaussian with mean = \(\boldsymbol{\mu}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

  • scale (float, optional) – The (estimated) scale parameter, defaults to 1

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.LOG

Bases: Link

Log Link function. \(log(\boldsymbol{\mu}) = \boldsymbol{\eta}\).
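
For example, f() and fi() invert each other. A small numeric check follows; the derivative expressions \(1/\mu\) and \(-1/\mu^2\) mentioned in the comment are standard calculus but stated here as an assumption about the return conventions of dy1() and dy2():

   import numpy as np
   from mssm.src.python.exp_fam import LOG

   link = LOG()
   mu = np.array([0.5, 1.0, 2.0])

   assert np.allclose(link.fi(link.f(mu)), mu)  # fi() inverts f()
   # dy1(mu) should equal 1/mu and dy2(mu) should equal -1/mu**2.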

dy1(mu)

First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017).

References:
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

dy2(mu)

Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

f(mu)

Non-canonical link for Gamma distribution with \(log(\boldsymbol{\mu}) = \boldsymbol{\eta}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

fi(eta)

For the log link, \(\boldsymbol{\eta} = log(\boldsymbol{\mu})\), so \(exp(\boldsymbol{\eta})=\boldsymbol{\mu}\). See Faraway (2016).

References:
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

eta ([float]) – The vector containing the model prediction corresponding to each observation.

class mssm.src.python.exp_fam.LOGb(b)

Bases: Link

Log + b Link function. \(log(\boldsymbol{\mu} + b) = \boldsymbol{\eta}\).

Parameters:

b (float) – The constant to add to \(\mu\) before taking the log.

dy1(mu)

First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017).

References:
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

dy2(mu)

Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

f(mu)

\(log(\boldsymbol{\mu} + b) = \boldsymbol{\eta}\).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

fi(eta)

For the logb link, \(\boldsymbol{\eta} = log(\boldsymbol{\mu} + b)\), so \(exp(\boldsymbol{\eta})-b=\boldsymbol{\mu}\).

References:
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

eta ([float]) – The vector containing the model prediction corresponding to each observation.

class mssm.src.python.exp_fam.Link

Bases: object

Link function base class. To be implemented by any link function used for GAMMs and GAMMLSS models. Only links used by GAMMLSS models require implementing the dy2 function. Note that care must be taken that every method returns only valid values. Specifically, no returned element may be numpy.nan or numpy.inf.

dy1(mu)

First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

dy2(mu)

Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

f(mu)

Link function \(f()\) mapping mean \(\boldsymbol{\mu}\) of an exponential family to the model prediction \(\boldsymbol{\eta}\), so that \(f(\boldsymbol{\mu}) = \boldsymbol{\eta}\). See Wood (2017, 3.1.2) and Faraway (2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

fi(eta)

Inverse of the link function mapping \(\boldsymbol{\eta} = f(\boldsymbol{\mu})\) to the mean \(fi(\boldsymbol{\eta}) = fi(f(\boldsymbol{\mu})) = \boldsymbol{\mu}\). See Faraway (2016) and the Link.f function.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

eta ([float]) – The vector containing the model prediction corresponding to each observation.
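As an illustration of this contract, a minimal custom link (here simply re-implementing the log link as a sketch; mssm already provides LOG, so this is purely illustrative) could look as follows:

import numpy as np
from mssm.src.python.exp_fam import Link

class MyLog(Link):
    # Sketch of the Link contract for f(mu) = log(mu).
    def f(self, mu):
        return np.log(mu)

    def fi(self, eta):
        return np.exp(eta)

    def dy1(self, mu):
        # d f(mu)/d mu = 1/mu
        return 1 / mu

    def dy2(self, mu):
        # d^2 f(mu)/d mu^2 = -1/mu^2
        return -1 / np.power(mu, 2)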

class mssm.src.python.exp_fam.Logit

Bases: Link

Logit Link function, which is canonical for the binomial model. \(\boldsymbol{\eta}\) = log-odds of success.

dy1(mu)

First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017):

\[f(\mu) = log(\mu / (1 - \mu)) = log(\mu) - log(1 - \mu)\]

\[\partial f(\mu)/ \partial \mu = 1/\mu + 1/(1 - \mu)\]

Faraway (2016) simplifies this to: \(\partial f(\mu)/ \partial \mu = 1 / (\mu - \mu^2) = 1/((1-\mu)\mu)\)

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

dy2(mu)

Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).
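Given the first derivative stated above, this amounts to \(\partial^2 f(\mu)/ \partial \mu^2 = -1/\mu^2 + 1/(1 - \mu)^2\).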

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

f(mu)

Canonical link for binomial distribution with \(\boldsymbol{\mu}\) holding the probabilities of success, so that the model prediction \(\boldsymbol{\eta}\) is equal to the log-odds.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

fi(eta)

For the logit link and the binomial model, \(\boldsymbol{\eta}\) = log-odds, so the inverse to go from \(\boldsymbol{\eta}\) to \(\boldsymbol{\mu}\) is \(\boldsymbol{\mu} = exp(\boldsymbol{\eta}) / (1 + exp(\boldsymbol{\eta}))\). See Faraway (2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

eta ([float]) – The vector containing the model prediction corresponding to each observation.

class mssm.src.python.exp_fam.MULNOMLSS(pars: int)

Bases: GAMLSSFamily

Family for a Multinomial GAMMLSS model (Rigby & Stasinopoulos, 2005).

This Family assumes that each observation \(y_i\) corresponds to one of \(K\) classes (labeled as 0, …, \(K-1\)) and reflects a realization of an independent RV \(Y_i\) with observation-specific probability mass function defined over the \(K\) classes. These \(K\) probabilities - that \(Y_i\) takes on each of the \(K\) classes - are modeled as additive combinations of smooth functions of covariates and other parametric terms.

As an example, consider a visual search experiment where \(K-1\) distractors are presented on a computer screen together with a single target and subjects are instructed to find the target and fixate it. With a Multinomial model we can estimate how the probability of looking at each of the \(K\) stimuli on the screen changes (smoothly) over time and as a function of other predictor variables of interest (e.g., contrast of stimuli, depending on whether participants are instructed to be fast or accurate).

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

Parameters:

pars (int) – K-1, i.e., the number of classes minus one, which equals the number of linear predictors.

get_resid(y, *mus)

Get standardized residuals for a GAMMLSS model (Rigby & Stasinopoulos, 2005).

Any implementation of this function should return a vector that looks like what could be expected from taking len(y) independent draws from \(N(0,1)\).

References:

  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mus ([[float]]) – A list including self.n_par lists - one for each parameter of the distribution. Each of those lists contains the expected value for a particular parameter for each of the N observations.

Returns:

An N-dimensional vector containing the standardized residuals under the current model.

Return type:

[float]

llk(y, *mus)

Log-probability of the data under the given model. Essentially the sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed class; every element must be larger than or equal to 0 and smaller than self.n_par + 1.

  • mus ([[float]]) – A list containing K-1 (self.n_par) lists, each containing the non-normalized probabilities of observing class k for every observation.

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, *mus)

Log-probability of observing class k under current model.

Our DV consists of K classes but we essentially enforce a sum-to-zero constraint on the DV so that we end up modeling only K-1 (non-normalized) probabilities of observing class k (for all k except k==K) as an additive combination of smooth functions of our covariates and other parametric terms. The probability of observing class K as well as the normalized probabilities of observing each other class can readily be computed from these K-1 non-normalized probabilities. This is explained quite well in the Wikipedia article on multinomial logistic regression.

Specifically, the probability of the outcome being class k is simply:

\(p(Y_i == k) = \mu_k / (1 + \sum_j^{K-1} \mu_j)\) where \(\mu_k\) is the aforementioned non-normalized probability of observing class \(k\) - which is simply set to 1 for class \(K\) (this follows from the sum-to-zero constraint; see Wikipedia).

So, the log-prob of the outcome being class k is:

\(log(p(Y_i == k)) = log(\mu_k) - log(1 + \sum_j^{K-1} \mu_j)\)
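A small numpy sketch of this normalization (illustrative only; mus here holds the \(K-1\) non-normalized probabilities for a single observation):

import numpy as np

mus = np.array([0.5, 2.0, 1.0])    # K-1 = 3 non-normalized probabilities, so K = 4
denom = 1 + np.sum(mus)            # 1 + sum over the K-1 mu_j
p = np.append(mus, 1.0) / denom    # class K (the reference) gets mu = 1
assert np.isclose(np.sum(p), 1.0)  # normalized probabilities sum to one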

Parameters:
  • y ([float]) – The vector containing each observed class; every element must be larger than or equal to 0 and smaller than self.n_par + 1.

  • mus ([[float]]) – A list containing K-1 (self.n_par) lists, each containing the non-normalized probabilities of observing class k for every observation.

Returns:

An N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.Poisson(link: ~mssm.src.python.exp_fam.Link = <mssm.src.python.exp_fam.LOG object>)

Bases: Family

Poisson Family.

We assume: \(Y_i \sim P(\lambda_i)\). We can simply set \(\lambda_i=\mu_i\) (compare the scipy density to the one in table 3.1 of Wood, 2017) and treat the scale parameter of a GAMM (\(\phi\)) as fixed/known at 1.

Parameters:

link (Link) – The link function to be used by the model of the mean of this family. By default set to the log link.

D(y, mu)

Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)
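For the Poisson family this contribution takes the familiar form (compare table 3.1 in Wood, 2017; by convention \(y_i log(y_i/\mu_i) = 0\) whenever \(y_i = 0\)):

\[d_i = 2\left(y_i log(y_i/\mu_i) - (y_i - \mu_i)\right)\]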

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

An N-dimensional vector containing the contribution of each data-point to the overall model deviance.

Return type:

[float]

V(mu)

Variance function for the Poisson family.

The variance of random variable \(Y\) is proportional to its mean.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

mu ([float]) – a N-dimensional vector of the model prediction/the predicted mean

Returns:

mu

Return type:

[float]

deviance(y, mu)

Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

The model deviance.

Return type:

float

init_mu(y)

Function providing initial \(\boldsymbol{\mu}\) vector for Poisson GAMM.

We shrink extreme observed counts towards the mean.

Parameters:

y ([float]) – The vector containing each observation.
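The exact initialization is not restated here, but the idea can be sketched as follows (a hypothetical illustration of shrinking counts towards the mean, not mssm's actual code):

import numpy as np

y = np.array([0., 1., 30.])
mu_init = (y + np.mean(y)) / 2    # hypothetical: average each count with the grand mean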

llk(y, mu)

Log-probability of the data under the given model. Essentially the sum over all elements in the vector returned by the lp() method.

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observation.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

The log-probability of observing all data under the current model.

Return type:

float

lp(y, mu)

Log-probability of observing every value in \(\mathbf{y}\) under their respective Poisson with mean = \(\boldsymbol{\mu}\).
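In terms of scipy, this is essentially (a sketch, not necessarily mssm's exact implementation):

import numpy as np
from scipy.stats import poisson

y = np.array([0, 2, 5])
mu = np.array([0.5, 2.0, 4.5])
lp = poisson.logpmf(y, mu)    # element-wise log-probability under Poisson(mu)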

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • y ([float]) – The vector containing each observed value.

  • mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.

Returns:

a N-dimensional vector containing the log-probability of observing each data-point under the current model.

Return type:

[float]

class mssm.src.python.exp_fam.PropHaz(ut, r)

Bases: GENSMOOTHFamily

Family for proportional Hazard model - a type of General Smooth model as discussed by Wood, Pya, & Säfken (2016).

Based on Supplementary materials G in Wood, Pya, & Säfken (2016).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

  • Nocedal & Wright (2006). Numerical Optimization. Springer New York.

Parameters:
  • ut (numpy.array) – Unique event time vector (each time represented as int) as described by WPS (2016), holding unique event times in decreasing order.

  • r (numpy.array) – Index vector as described by WPS (2016), holding for each data-point (i.e., for each row in Xs[0]) the index to its corresponding event time in ut.

get_baseline_hazard(coef, delta, Xs)

Get the cumulative baseline hazard function + variance estimate as defined by Wood, Pya, & Säfken (2016).

The function is evaluated for all k unique event times that were available in the data.

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

Parameters:
  • coef (numpy.array) – Coefficient vector as numpy.array of shape (-1,1).

  • Xs ([scipy.sparse.csc_array]) – The list of model matrices (here holding a single model matrix) obtained from mssm.models.GSMM.get_mmat().

  • delta (numpy.array) – Dependent variable passed to mssm.src.python.formula.Formula(), holds (for each row in Xs[0]) a value in {0,1}, indicating whether for that observation the event was observed or not.

Returns:

Two arrays: the first holds the k baseline hazard function estimates, the second holds k variance estimates - one for each of the baseline hazard estimates.

Return type:

(numpy.array, numpy.array)

get_survival(coef, Xs, delta, t, x, V)

Compute survival function + variance at time-point t, given k optional covariate vector(s) x as defined by Wood, Pya, & Säfken (2016).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

Parameters:
  • coef (numpy.array) – Coefficient vector as numpy.array of shape (-1,1).

  • Xs ([scipy.sparse.csc_array]) – The list of model matrices (here holding a single model matrix) obtained from mssm.models.GSMM.get_mmat().

  • delta (numpy.array) – Dependent variable passed to mssm.src.python.formula.Formula(), holds (for each row in Xs[0]) a value in {0,1}, indicating whether for that observation the event was observed or not.

  • t (int) – Time-point at which to evaluate the survival function.

  • x (numpy.array) – Optional vector (or matrix) of covariate values. Needs to be of shape (k,len(coef)).

  • V (scipy.sparse.csc_array) – Estimated co-variance matrix of the posterior for coef.

Returns:

Two arrays: the first holds k survival function estimates, the second holds k variance estimates - one for each of the survival function estimates.

Return type:

(numpy.array, numpy.array)

gradient(coef, coef_split_idx, delta, Xs)

Gradient as defined by Wood, Pya, & Säfken (2016).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

delta (passed as dependent variable) holds values in {0,1}, indicating whether the event was observed or not.

hessian(coef, coef_split_idx, delta, Xs)

Hessian as defined by Wood, Pya, & Säfken (2016).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

delta (passed as dependent variable) holds values in {0,1}, indicating whether the event was observed or not.

init_coef(models)

Function to initialize the coefficients of the model.

Parameters:

models ([mssm.models.GAMM]) – A list of mssm.models.GAMM’s, each based on one of the formulas provided to the model.

Returns:

A numpy.array of shape (-1,1), holding initial values for all model coefficients.

Return type:

numpy.array

llk(coef, coef_split_idx, delta, Xs)

Log-likelihood function as defined by Wood, Pya, & Säfken (2016).

References:

  • Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.

delta (passed as dependent variable) holds values in {0,1}, indicating whether the event was observed or not.

mssm.src.python.exp_fam.est_scale(res, rows_X, total_edf)

Scale estimate from Wood & Fasiolo (2017).

References:
  • Wood, S. N., & Fasiolo, M. (2017). A generalized Fellner-Schall method for smoothing parameter optimization with application to Tweedie location, scale and shape models.

Parameters:
  • res ([float]) – The vector containing the difference between the model prediction and the (pseudo) data.

  • rows_X (int) – The number of observations collected.

  • total_edf (float) – The expected degrees of freedom for the model.
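The exact estimator is not restated here, but presumably it corresponds to the usual residual-based form (compare Wood, 2017), i.e., the squared residuals summed and divided by the residual degrees of freedom:

\[\hat{\phi} = \frac{\sum_i res_i^2}{N - \tau}\]

with \(N\) corresponding to rows_X and \(\tau\) to total_edf.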

mssm.src.python.formula module

class mssm.src.python.formula.Formula(lhs: ~mssm.src.python.formula.lhs, terms: [<class 'mssm.src.python.terms.GammTerm'>], data: ~pandas.core.frame.DataFrame, series_id: str = None, codebook: dict = None, print_warn=True, keep_cov=False, file_paths=[], file_loading_nc=1, file_loading_kwargs: dict = {'header': 0, 'index_col': False})

Bases: object

The formula of a regression equation.

Parameters:
  • lhs – The lhs object defining the dependent variable.

  • terms ([GammTerm]) – A list of the terms which should be added to the model. See mssm.src.python.terms for info on which terms can be added.

  • data (pd.DataFrame or None) – A pandas dataframe (with header!) of the data which should be used to estimate the model. The variable specified for lhs as well as all variables included for a term in terms need to be present in the data, otherwise the call to Formula will throw an error.

  • series_id (str, optional) – A string identifying the individual experimental units. Usually a unique trial identifier. Only necessary if approximate derivative computations are to be utilized for random smooth terms.

  • codebook (dict or None) – Codebook - keys should correspond to factor variable names specified in terms. Values should again be a dict, with keys for each of the K levels of the factor and values corresponding to an integer in {0, ..., K-1}.

  • print_warn (bool,optional) – Whether warnings should be printed. Useful when fitting models from terminal. Defaults to True.

  • keep_cov (bool,optional) – Whether or not the internal encoding structure of all predictor variables should be created when forming \(\mathbf{X}^T\mathbf{X}\) iteratively instead of forming \(\mathbf{X}\) directly. Can speed up estimation but increases memory footprint. Defaults to False.

  • file_paths ([str],optional) – A list of paths to .csv files from which \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) should be created iteratively. Setting this to a non-empty list will prevent fitting X as a whole. data should then be set to None. Defaults to an empty list.

  • file_loading_nc (int,optional) – How many cores to use to a) accumulate \(\mathbf{X}\) in parallel (if data is not None and file_paths is an empty list) or b) to accumulate \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) (and \(\mathbf{\eta}\) during estimation) (if data is None and file_paths is a non-empty list). For case b, this should really be set to the maximum number of cores available. For case a, this only really speeds up the accumulation of \(\mathbf{X}\) if \(\mathbf{X}\) has very many columns and/or rows. Defaults to 1.

  • file_loading_kwargs (dict,optional) – Any key-word arguments to pass to pandas.read_csv when \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) should be created iteratively (if data is None and file_paths is a non-empty list). Defaults to {"header":0,"index_col":False}.

Variables:
  • coef_per_term ([int]) – A list containing the number of coefficients corresponding to each term included in terms. Initialized at construction.

  • coef_names ([str]) – A list containing a named identifier (e.g., “Intercept”) for each coefficient estimated by the model. Initialized at construction.

  • n_coef (int) – The number of coefficients estimated by the model in total. Initialized at construction.

  • unpenalized_coef (int) – The number of un-penalized coefficients estimated by the model. Initialized at construction.

  • y_flat ([float] or None) – An array containing all values of the dependent variable (i.e., specified by lhs.variable) in the order of the data-frame passed to data. This variable will be initialized at construction, but only if file_paths is an empty list, i.e., in case \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) are not created iteratively.

  • cov_flat ([float] or None) – An array containing all (encoded, in case of categorical predictors) values of each predictor variable included in any of the terms (each column of cov_flat corresponds to a different predictor), in the order of the data-frame passed to data. This variable will be initialized at construction, but only if file_paths is an empty list, i.e., in case \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) are not created iteratively.

  • NOT_NA_flat ([bool] or None) – An array containing an indication for each value of the dependent variable (i.e., specified by lhs.variable) of whether the corresponding value is not a number (“NA”) or not, in the order of the data-frame passed to data. This variable will be initialized at construction, but only if file_paths is an empty list, i.e., in case \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) are not created iteratively.
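For orientation, a minimal formula might be set up as follows (a sketch assuming the usual mssm imports and a pandas DataFrame dat with a continuous column “x”, a factor “sub”, and the dependent variable “y”):

# Sketch - assumes GAMM, Gaussian, Formula, lhs, i, f, and ri have been
# imported from their respective mssm modules.
formula = Formula(lhs("y"), [i(), f(["x"], nk=9), ri("sub")], data=dat)
model = GAMM(formula, Gaussian())
model.fit()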

build_penalties()

Builds the penalties required by self.__terms. Called automatically whenever needed. Call manually only for testing.

encode_data(data, prediction=False)

Encodes data, which needs to be a pd.DataFrame and by default (if prediction==False) builds an index of which rows in data are NA in the column of the dependent variable described by self.lhs.

Parameters:
  • data (pd.DataFrame) – The data to encode.

  • prediction (bool, optional) – Whether or not a NA index and a column for the dependent variable should be generated.

Returns:

A tuple with 7 (optional) entries: (1) the dependent variable described by self.__lhs; (2) the encoded predictor variables as a (N,k) array (the number of rows matches the number of rows of the first entry, the number of columns matches the number of k variables present in the formula); (3) an indication for each row whether the dependent variable described by self.__lhs is NA; (4) like the first entry but split into a list of lists by self.series_id; (5) like the second entry but split into a list of lists by self.series_id; (6) like the third entry but split into a list of lists by self.series_id; (7) start and end points for the splits used to split the previous three elements (identifying the start and end point of every level of self.series_id).

Return type:

(numpy.array or None, numpy.array or None, numpy.array or None, [numpy.array] or None, [numpy.array] or None, [numpy.array] or None, numpy.array or None)

get_coding_factors() dict

Get a copy of the factor coding dictionary. Keys are factor variables in the data, values are dictionaries, where the keys correspond to the encoded levels (int) of the factor and the values to their levels (str).

get_data() DataFrame

Get a copy of the data specified for this formula.

get_depvar() list

Get a copy of the encoded dependent variable (defined via self.__lhs).

get_factor_codings() dict

Get a copy of the factor coding dictionary. Keys are factor variables in the data, values are dictionaries, where the keys correspond to the levels (str) of the factor and the values to their encoded levels (int).

get_factor_levels() dict

Get a copy of the factor levels dictionary. Keys are factor variables in the data, values are np.arrays holding the unique levels (as str) of the corresponding factor.

get_ir_smooth_term_idx() list[int]

Get a copy of the list of indices that identify impulse response terms in self.__terms.

get_lhs() lhs

Get a copy of the lhs specified for this formula.

get_linear_term_idx() list[int]

Get a copy of the list of indices that identify linear terms in self.__terms.

get_n_coef() int

Get the number of coefficients that are implied by the formula.

get_notNA() list

Get a copy of the encoded ‘not a NA’ vector for the dependent variable (defined via self.__lhs).

get_penalties() list

Get a copy of the penalties implied by the formula. Will be None if the penalties have not been initialized yet.

get_random_term_idx() list[int]

Get a copy of the list of indices that identify random terms in self.__terms.

get_smooth_term_idx() list[int]

Get a copy of the list of indices that identify smooth terms in self.__terms.

get_subgroup_variables() list

Returns a copy of sub-group variables for factor smooths.

get_term_names() list

Returns a copy of the list with the names of the terms specified for this formula.

get_terms() list[GammTerm]

Get a copy of the terms specified for this formula.

get_var_map() dict

Get a copy of the var map dictionary. Keys are variables in the data, values their column index in the encoded predictor matrix returned by self.encode_data.

get_var_maxs() dict

Get a copy of the var maxs dictionary. Keys are variables in the data, values are either the maximum value the variable takes on in self.__data for continuous variables or None for categorical variables.

get_var_mins() dict

Get a copy of the var mins dictionary. Keys are variables in the data, values are either the minimum value the variable takes on in self.__data for continuous variables or None for categorical variables.

get_var_mins_maxs() -> (<class 'dict'>, <class 'dict'>)

Get a tuple containing copies of both the mins and maxs dictionaries. See self.get_var_mins and self.get_var_maxs.

get_var_types() dict

Get a copy of the var types dictionary. Keys are variables in the data, values are either VarType.NUMERIC for continuous variables or VarType.FACTOR for categorical variables.

has_intercept() bool

Does this formula include an intercept or not.

has_ir_terms() bool

Does this formula include impulse response terms or not.

class mssm.src.python.formula.VarType(*values)

Bases: Enum

FACTOR = 2
NUMERIC = 1
class mssm.src.python.formula.lhs(variable: str, f: Callable = None)

Bases: object

The Left-hand side of a regression equation.

Parameters:
  • variable (str) – The dependent variable. Can point to continuous and categorical variables.

  • f (Callable, optional) – A function that will be applied to the variable before fitting. For example: np.log(). By default no function is applied to the variable.

mssm.src.python.formula.reparam(X, S, cov, option=1, n_bins=30, QR=False, identity=False, scale=False)

Options 1 - 3 correspond to the natural reparameterization discussed in Wood (2017; 5.4.2), with different strategies for the QR computation of \(\mathbf{X}\). Option 4 helps with stabilizing the REML computation and is from Wood (2011) and section 6.2.7 in Wood (2017):

  1. Form complete matrix \(\mathbf{X}\) based on entire covariate.

  2. Form matrix \(\mathbf{X}\) only based on unique covariate values.

  3. Form matrix \(\mathbf{X}\) only on a sample of the values making up the covariate. The covariate is split up into n_bins equally wide bins. The number of covariate values per bin is then calculated. Subsequently, the ratio relative to the minimum bin size is computed and each ratio is rounded to the nearest integer. Then, ratio many samples are drawn from each bin. That way, imbalance in the covariate is approximately preserved when forming the QR.

  4. Transform term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) based on Wood (2011) and section 6.2.7 in Wood (2017) so that they are full-rank and their log-determinant can be computed safely. In that case, only S needs to be provided and it has to be a list holding the penalties to be transformed.

For Options 1-3:

If QR==True then \(\mathbf{X}\) is decomposed into \(\mathbf{Q}\mathbf{R}\) directly via QR decomposition. Alternatively, we first form \(\mathbf{X}^T\mathbf{X}\) and then compute the Cholesky factor \(\mathbf{L}\) of this product - note that \(\mathbf{L}^T = \mathbf{R}\). Overall the latter strategy is much faster (in particular if option==1), but the increased loss of precision in \(\mathbf{L}^T = \mathbf{R}\) might not be acceptable for every application.

After transformation S only contains elements on its diagonal and \(\mathbf{X}\) holds the transformed functions. As discussed in Wood (2017), the transformed functions are decreasingly flexible - so the elements on the diagonal of \(\mathbf{S}\) become smaller and eventually zero, for elements that are in the kernel of the original \(\mathbf{S}\) (un-penalized == not flexible).

For a similar transformation (based solely on \(\mathbf{S}\)), Wood et al. (2013) show how to further reduce the diagonally transformed \(\mathbf{S}\) to an even simpler identity penalty. As discussed also in Wood (2017), the same behavior of decreasing flexibility when all entries on the diagonal of \(\mathbf{S}\) are 1 can only be maintained if the transformed functions are multiplied by a weight related to their wiggliness. Specifically, more flexible functions need to become smaller in amplitude - so that for the same level of penalization they are removed earlier than less flexible ones. To achieve this, Wood et al. (2013) further post-multiply the transformed matrix \(\mathbf{X}'\) with a matrix that contains on its diagonal the reciprocal of the square root of the transformed penalty matrix (and 1s in the last cells corresponding to the kernel). This is done here if identity=True.

In mgcv the transformed model matrix and penalty can optionally be scaled by the root mean square value of the transformed model matrix (see the nat.param function in mgcv). This is done here if scale=True.
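To make the QR route concrete, a dense numpy sketch of the natural reparameterization (option 1 with QR==True; mssm's actual implementation works with sparse matrices and handles all the options and flags above) might look like this:

import numpy as np

def nat_param_sketch(X, S):
    # X = QR; only the triangular factor R is needed (mode='r').
    R = np.linalg.qr(X, mode='r')
    Ri = np.linalg.inv(R)
    # Eigen-decompose R^{-T} S R^{-1} = U diag(D) U^T.
    D, U = np.linalg.eigh(Ri.T @ S @ Ri)
    D, U = D[::-1], U[:, ::-1]    # sort so the transformed penalty decreases
    Xp = X @ Ri @ U               # transformed model matrix
    Sp = np.diag(D)               # transformed penalty is diagonal
    return Xp, Sp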

For Option 4:

Option 4 enforces re-parameterization of term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) based on Wood (2011) and section 6.2.7 in Wood (2017). In mssm multiple penalties can be placed on individual terms (i.e., tensor terms, random smooths, Kernel penalty) but it is not always the case that the term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) - i.e., the sum over all those individual penalties multiplied with their \(\lambda\) parameters - is of full rank. If we need to form the inverse of the term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) this is problematic. It is also problematic, as discussed by Wood (2011), if the different \(\lambda\) are all of different magnitude, in which case forming the term-specific \(log(|\mathbf{S}_{\boldsymbol{\lambda}}|+)\) becomes numerically difficult.

The re-parameterization implemented by option 4, based on Appendix B in Wood (2011), solves these issues. After this re-parameterization a term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) has been formed that is full rank. And \(log(|\mathbf{S}_{\boldsymbol{\lambda}}|)\) - no longer just a generalized determinant - can be computed without running into numerical problems.

The strategy by Wood (2011) is more general and could be applied to form an overall - not just term-specific - \(\mathbf{S}_{\boldsymbol{\lambda}}\) with these properties. However, in mssm penalties currently cannot overlap, so this is not necessary at the moment.

References:
  • Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.

  • Wood, S. N., Scheipl, F., & Faraway, J. J. (2013). Straightforward intermediate rank tensor product smoothing in mixed models.

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

  • mgcv source code (accessed 2024). smooth.R file, nat.param function.

mssm.src.python.gamm_solvers module

class mssm.src.python.gamm_solvers.Fit_info(lambda_updates: int = 0, iter: int = 0, code: int = 1, eps: float = None, K2: float = None, dropped: [<class 'int'>] = None)

Bases: object

Holds information related to convergence (speed) for GAMMs, GAMMLSS, and GSMMs.

Variables:
  • lambda_updates (int) – The total number of lambda updates computed during estimation. Initialized with 0.

  • iter (int) – The number of outer iterations (a single outer iteration can involve multiple lambda updates) completed during estimation. Initialized with 0.

  • code (int) – Convergence status. Anything above 0 indicates that the model did not converge and estimates should be considered carefully. Initialized with 1.

  • eps (float) – The fraction added to the last estimate of the negative Hessian of the penalized likelihood during GAMMLSS or GSMM estimation. If this is not 0 - the model should not be considered as converged, irrespective of what code indicates. This most likely implies that the model is not identifiable. Initialized with None and ignored for GAMM estimation.

  • K2 (float) – An estimate for the condition number of matrix A, where A.T@A=H and H is the final estimate of the negative Hessian of the penalized likelihood. Only available if check_cond>0 when model.fit() is called for any model (i.e., GAMM, GAMMLSS, GSMM). Initialized with None.

  • dropped ([int]) – The final set of coefficients dropped during GAMMLSS/GSMM estimation when using method="QR/Chol", or None in which case no coefficients were dropped. Initialized with None.

K2: float = None
code: int = 1
dropped: [<class 'int'>] = None
eps: float = None
iter: int = 0
lambda_updates: int = 0

mssm.src.python.penalties module

class mssm.src.python.penalties.LambdaTerm(S_J: ~scipy.sparse._csc.csc_array = None, S_J_emb: ~scipy.sparse._csc.csc_array = None, D_J_emb: ~scipy.sparse._csc.csc_array = None, rep_sj: int = 1, lam: float = 1.1, start_index: int = None, frozen: bool = False, type: ~mssm.src.python.penalties.PenType = None, rank: int = None, term: int = None, clust_series: [<class 'int'>] = None, clust_weights: [[<class 'float'>]] = None)

Bases: object

\(\lambda\) storage term.

Variables:
  • S_J (scipy.sparse.csc_array) – The penalty matrix associated with this lambda term. Note, in case multiple penalty matrices share the same lambda value, the rep_sj argument determines how many diagonal blocks we need to fill with this penalty matrix to get S_J_emb. Initialized with None.

  • S_J_emb (scipy.sparse.csc_array) – A zero-embedded version of the penalty matrix associated with this lambda term. Note, this matrix contains rep_sj diagonal sub-blocks each filled with S_J. Initialized with None.

  • D_J_emb (scipy.sparse.csc_array) – Root of S_J_emb, so that D_J_emb@D_J_emb.T=S_J_emb. Initialized with None.

  • rep_sj (int) – How many sequential sub-blocks of S_J_emb need to be filled with S_J. Useful if all levels of a categorical variable for which a separate smooth is to be estimated are assumed to share the same lambda value. Initialized with 1.

  • lam (float) – The current estimate for \(\lambda\). Initialized with 1.1.

  • start_index (int) – The first row and column in the overall penalty matrix taken up by S_J. Initialized with None.

  • type (PenType) – The type of this penalty term. Initialized with None.

  • rank (int) – The rank of S_J. Initialized with None.

  • term (int) – The index of the term in a mssm.src.python.formula.Formula with which this penalty is associated. Initialized with None.

D_J_emb: csc_array = None
S_J: csc_array = None
S_J_emb: csc_array = None
clust_series: [<class 'int'>] = None
clust_weights: [[<class 'float'>]] = None
frozen: bool = False
lam: float = 1.1
rank: int = None
rep_sj: int = 1
start_index: int = None
term: int = None
type: PenType = None

mssm.src.python.terms module

class mssm.src.python.terms.GammTerm(variables: list[str], type: TermType, is_penalized: bool, penalty: list[PenType], pen_kwargs: list[dict])

Bases: object

Base-class implemented by the terms passed to mssm.src.python.formula.Formula.

Parameters:
  • variables ([str]) – List of variables as strings.

  • type (TermType) – Type of the term as an enum.

  • is_penalized (bool) – Whether the term is penalized/can be penalized or not.

  • penalty ([penalties.PenType]) – The default penalties associated with a term.

  • pen_kwargs ([dict]) – A list of dictionaries, each with key-word arguments passed to the construction of the corresponding penalties.PenType in penalty.

class mssm.src.python.terms.TermType(*values)

Bases: Enum

LINEAR = 3
LSMOOTH = 1
RANDINT = 4
RANDSLOPE = 5
SMOOTH = 2
class mssm.src.python.terms.f(variables: list, by: str = None, by_cont: str = None, binary: list[str, str] = None, id: int = None, nk: int = 9, te: bool = False, rp: int = 0, constraint: ~mssm.src.python.constraints.ConstType = ConstType.QR, identifiable: bool = True, basis: ~collections.abc.Callable = <function B_spline_basis>, basis_kwargs: dict = {}, is_penalized: bool = True, penalize_null: bool = False, penalty: list[~mssm.src.python.penalties.PenType] = None, pen_kwargs: list[dict] = None)

Bases: GammTerm

A univariate or tensor interaction smooth term. If variables only contains a single variable \(x\), this term will represent a univariate \(f(x)\) in a model:

\[\mu_i = a + f(x_i)\]

For example, the model below in mgcv:

bam(y ~ s(x,k=10) + s(z,k=20))

would be expressed as follows in mssm:

GAMM(Formula(lhs("y"),[i(),f(["x"],nk=9),f(["z"],nk=19)]),Gaussian())

If variables contains two variables \(x\) and \(z\), then this term will either represent the tensor interaction \(f(x,z)\) in model:

\[\mu_i = a + f(x_i) + f(z_i) + f(x_i,z_i)\]

or in model:

\[\mu_i = a + f(x_i,z_i)\]

The first behavior is achieved by setting te=False. In that case it is necessary to add ‘main effect’ f terms for \(x\) and \(z\). In other words, the behavior then mimics the ti() term available in mgcv (Wood, 2017). If te=True, the term instead behaves like a te() term in mgcv, so no separate smooth effects for the main effects need to be included.

For example, the model below in mgcv:

bam(y ~ te(x,z,k=10))

would be expressed as follows in mssm:

GAMM(Formula(lhs("y"),[i(),f(["x","z"],nk=9,te=True)]),Gaussian())

In addition, the model below in mgcv:

bam(y ~ s(x,k=10) + s(z,k=20) + ti(x,z,k=10))

would be expressed as follows in mssm:

GAMM(Formula(lhs("y"),[i(),f(["x"],nk=9),f(["z"],nk=19),f(["x","z"],nk=9,te=False)]),Gaussian())

By default a B-spline basis is used with nk=9 basis functions (after removing identifiability constraints). This is equivalent to mgcv’s default behavior of using 10 basis functions (before removing identifiability constraints). In case variables contains more than one variable, nk can either be set to a single value or to a list containing the number of basis functions that should be used to set up the spline matrix for every variable. The former implies that the same number of coefficients should be used for all variables. Keyword arguments that change the computation of the spline basis can be passed along via a dictionary to the basis_kwargs argument. Importantly, if multiple variables are present and a list is passed to nk, a list of dictionaries with keyword arguments of the same length needs to be passed to basis_kwargs as well (see the example below).
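For example, a tensor smooth whose marginals differ in their number of basis functions could (hypothetically) be requested as:

f(["x","z"],nk=[9,14],te=True,basis_kwargs=[{},{}])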

Multiple penalties can be placed on every term by adding penalties.PenType to the penalty argument. In case variables contains multiple variables a separate tensor penalty (see Wood, 2017) will be created for every penalty included in penalty. Again, key-word arguments that alter the behavior of the penalty creation need to be passed as dictionaries to pen_kwargs for every penalty included in penalty. By default, a univariate term is penalized with a difference penalty of order 2 (Eilers & Marx, 2010).

References:

  • Eilers, P., & Marx, B. (2010). Splines, knots, and penalties. https://doi.org/10.1002/WICS.125

  • Marra, G., & Wood, S. N. (2011). Practical variable selection for generalized additive models. Computational Statistics & Data Analysis, 55(7), 2372–2387. https://doi.org/10.1016/j.csda.2011.02.004

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • variables (list[str]) – A list of the variables (strings) of which the term is a function. Need to exist in data passed to Formula. Need to be continuous.

  • by (str, optional) – A string corresponding to a factor in data passed to Formula. Separate f(variables) (and smoothness penalties) will be estimated per level of by.

  • by_cont (str, optional) – A string corresponding to a numerical variable in data passed to Formula. The model matrix for the estimated smooth term f(variables) will be multiplied by the column of this variable. Can be used to estimate ‘varying coefficient’ models but also to set up binary smooths or to only estimate a smooth term for specific levels of a factor (i.e., what is possible for ordered factors in R & mgcv).

  • binary ([str,str], optional) – A list containing two strings. The first string corresponds to a factor in data passed to Formula. A separate f(variables) will be estimated for the level of this factor corresponding to the second string.

  • id (int, optional) – Only useful in combination with specifying a by variable. If id is set to any integer the penalties placed on the separate f(variables) will share a single smoothness penalty.

  • nk (int or list[int], optional) – Number of basis functions to use. Even if identifiable is True, this number will reflect the final number of basis functions for this term (i.e., mssm acts like you would have asked for 10 basis functions if nk=9 and identifiable=True; the default).

  • te (bool, optional) – For tensor interaction terms only. If set to false, the term mimics the behavior of ti() in mgcv (Wood, 2017). Otherwise, the term behaves like a te() term in mgcv - i.e., the marginal basis functions are not removed from the interaction.

  • rp (int, optional) – Experimental - will currently break for tensor smooths or in case by is provided. Whether or not to re-parameterize the term - see mssm.src.python.formula.reparam() for details. Defaults to no re-parameterization.

  • constraint (mssm.src.constraints.ConstType, optional) – What kind of identifiability constraints should be absorbed by the terms (if they are to be identifiable). Either QR-based constraints (default, well-behaved), by means of column-dropping (no infill, not so well-behaved), or by means of difference re-coding (little infill, not so well behaved either).

  • identifiable (bool, optional) – Whether or not the constant should be removed from the space of functions this term can fit. Achieved by enforcing that \(\mathbf{1}^T \mathbf{X} = 0\) (\(\mathbf{X}\) here is the spline matrix computed for the observed data; see Wood, 2017 for details). Necessary in most cases to keep the model identifiable.

  • basis (Callable, optional) – The basis functions to use to construct the spline matrix. By default a B-spline basis (Eilers & Marx, 2010) implemented in mssm.src.smooths.B_spline_basis().

  • basis_kwargs (dict, optional) – A list containing one or multiple dictionaries specifying how the basis should be computed. For the B-spline basis the following arguments (with default values) are available: convolve=False, min_c=None, max_c=None, deg=3. See mssm.src.smooths.B_spline_basis() for details, but the default should work for most cases.

  • is_penalized (bool, optional) – Should the term be left unpenalized or not. There are rarely good reasons to set this to False.

  • penalize_null (bool, optional) – Should a separate Null-space penalty (Marra & Wood, 2011) be placed on the term. By default, the term here will leave a linear f(variables) un-penalized! Thus, there is no option for the penalty to achieve f(variables) = 0 even if that would be supported by the data. Adding a Null-space penalty provides the penalty with that power. This can be used for model selection instead of hypothesis testing and is the preferred way in mssm (see Marra & Wood, 2011 for details).

  • penalty (list[penalties.PenType], optional) – A list of penalty types to be placed on the term.

  • pen_kwargs (list[dict], optional) – A list containing one or multiple dictionaries specifying how the penalty should be created. For the default difference penalty (Eilers & Marx, 2010) the only keyword argument (with default value) available is: m=2. This reflects the order of the difference penalty. Note, that while a higher m permits penalizing towards smoother functions it also leads to an increased dimensionality of the penalty Kernel (the set of basis functions which will not be penalized). In other words, increasingly more complex functions will be left un-penalized for higher m (except if penalize_null is set to True). m=2 is usually a good choice and thus the default but see Eilers & Marx (2010) for details.

class mssm.src.python.terms.fs(variables: list, rf: str = None, nk: int = 9, m: int = 1, rp: int = 1, by_subgroup: [<class 'str'>, <class 'str'>] = None, approx_deriv: dict = None, basis: ~collections.abc.Callable = <function B_spline_basis>, basis_kwargs: dict = {})

Bases: f

Essentially a f term with by=rf, id != None, penalize_null=True, pen_kwargs=[{"m":1}], and rp=1.

This term approximates the “factor-smooth interaction” basis “fs” with m=1 available in mgcv (Wood, 2017). For example, the term below from mgcv:

s(x,sub,bs="fs")

would approximately correspond to the following term in mssm:

fs(["x"],rf="sub")

They are however not equivalent (mgcv by default uses a different basis for which the m key-word has a different functionality).

Specifically, here m=1 implies that the only function left unpenalized by the default (difference) penalty is the constant (Eilers & Marx, 2010). Thus, a linear basis is penalized by the same default penalty that also penalizes smoothness (and not by a separate penalty as is the case in mgcv when m=1 for the default basis)! Any constant basis is penalized by the null-space penalty (in both mgcv and mssm; see Marra & Wood, 2011) - the term thus shrinks towards zero (Wood, 2017).

The factor smooth basis in mgcv allows to let the penalty be different for different levels of an additional factor (by additionally specifying the by argument for a smooth with basis “fs”). I.e.,

s(Time,Subject,by='condition',bs='fs')

in mgcv would estimate a non-linear random smooth of “time” per level of the “subject” & “condition” interaction - with the same penalty being placed on all random smooth terms within the same “condition” level.

This can be achieved in mssm by adding multiple fs terms to the Formula and utilising the by_subgroup argument. This needs to be set to a list where the first element identifies the additional factor variable (e.g., “condition”) and the second element corresponds to a level of said factor variable. E.g., to approximate the aforementioned mgcv term we have to add:

*[fs(["Time"],rf="subject_cond",by_subgroup=["cond",cl]) for cl in np.unique(dat["cond"])]

to the Formula terms list. Importantly, “subject_cond” is the interaction of “subject” and “condition” - not just the “subject” variable in the data.

Model estimation can become quite expensive for fs terms when the factor variable for rf has many levels (> 10000). In that case, approximate derivative evaluation can speed things up considerably. To enforce this, the approx_deriv argument needs to be specified with a dict, having the following structure: {"no_disc":[str],"excl":[str],"split_by":[str],"restarts":int,"seed":None or int}. “no_disc” should usually be set to an empty list, and should in general only contain names of continuous variables included in the formula. Any variable mentioned here will not be discretized before clustering - this will make the approximation a bit more accurate but also require more time. Similarly, “excl” specifies any continuous variables that should be excluded from clustering. “split_by” should generally be set to a list containing all categorical variables present in the formula. “restarts” indicates the number of times to re-run the clustering (40 seems to be a good number). “seed” can either be set to None or to an integer - in the latter case, the random cluster initialization will use that seed, ensuring that the clustering outcome (and hence the model fit) is replicable. An example follows below.
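For example (hypothetical values, following the structure described above):

fs(["Time"],rf="sub",approx_deriv={"no_disc":[],"excl":[],"split_by":["cond"],"restarts":40,"seed":20})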

References:

  • Eilers, P., & Marx, B. (2010). Splines, knots, and penalties. https://doi.org/10.1002/WICS.125

  • Marra, G., & Wood, S. N. (2011). Practical variable selection for generalized additive models.Computational Statistics & Data Analysis, 55(7), 2372–2387. https://doi.org/10.1016/j.csda.2011.02.004

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.). Chapman and Hall/CRC.

Parameters:
  • variables (list[str]) – A list of the variables (strings) of which the term is a function. Need to exist in data passed to Formula. Need to be continuous.

  • rf (str, optional) – A string corresponding to a (random) factor in data passed to Formula. Separate f(variables) (but a shared smoothness penalty!) will be estimated per level of rf.

  • nk (int or list[int], optional) – Number of basis functions - 1 to use. I.e., if nk=9 (the default), the term will use 10 basis functions. By default f() has identifiability constraints applied and we act as if nk + 1 coefficients were requested. The fs() term needs no identifiability constraints, so if the same number of coefficients used for a f() term is requested (the desired approach), one coefficient is added to compensate for the lack of identifiability constraints. This is the opposite of how this is handled in mgcv: specifying nk=10 for “fixed” univariate smooths results in 9 basis functions being available. However, for a smooth in mgcv with basis=’fs’, 10 basis functions will remain available.

  • basis (Callable, optional) – The basis functions to use to construct the spline matrix. By default a B-spline basis (Eilers & Marx, 2010) implemented in mssm.src.smooths.B_spline_basis().

  • basis_kwargs (dict, optional) – A list containing one or multiple dictionaries specifying how the basis should be computed. For the B-spline basis the following arguments (with default values) are available: convolve=False, min_c=None, max_c=None, deg=3. See mssm.src.smooths.B_spline_basis() for details.

  • by_subgroup ([str,str], optional) – List including a factor variable and specific level of said variable. Allows for separate penalties as described above.

  • approx_deriv (dict, optional) – Dict holding important info for the clustering algorithm. Structure: {"no_disc":[str],"excl":[str],"split_by":[str],"restarts":int}

class mssm.src.python.terms.i

Bases: GammTerm

An intercept/offset term. In a model

\[\mu_i = a + f(x_i)\]

it reflects \(a\).

class mssm.src.python.terms.irf(variables: [<class 'str'>], event_onset: [<class 'int'>], basis_kwargs: list[dict], by: str = None, id: int = None, nk: int = 10, basis: ~collections.abc.Callable = <function B_spline_basis>, is_penalized: bool = True, penalty: list[~mssm.src.python.penalties.PenType] = None, pen_kwargs: list[dict] = None)

Bases: GammTerm

A simple impulse response term, designed to correct for events with overlapping responses in multi-level time-series modeling.

The idea (see Ehinger & Dimigen; 2019 for a detailed introduction to this kind of deconvolution analysis) is that some kind of event happens during each recorded time-series (e.g., stimulus onset, distractor display, mask onset, etc.) which is assumed to affect the recorded signal in the next X ms in some way. The moment of event onset can differ between recorded time-series. In other words, the event is believed to act like an impulse which triggers a delayed response on the signal. This term class can be used to estimate the shape of this impulse response. Multiple irf terms can be included in a Formula if multiple events happen, potentially with overlapping responses.

Example:

# Simulate time-series based on two events that elicit responses which vary in their overlap.
# The summed responses + a random intercept + noise is then the signal.
overlap_dat,onsets1,onsets2 = sim7(100,1,2,seed=20)

# Model below tries to recover the shape of the two responses in the 200 ms after event onset (max_c=200) + the random intercepts:
overlap_formula = Formula(lhs("y"),[irf(["time"],onsets1,nk=15,basis_kwargs=[{"max_c":200,"min_c":0,"convolve":True}]),
                                    irf(["time"],onsets2,nk=15,basis_kwargs=[{"max_c":200,"min_c":0,"convolve":True}]),
                                    ri("factor")],
                                    data=overlap_dat,
                                    series_id="series") # For models with irf terms, the column in the data identifying unique series need to be specified.

model = GAMM(overlap_formula,Gaussian())
model.fit()

Note that care needs to be taken when predicting for models including irf terms, because the onset of events can differ between time-series. Hence, model predictions + standard errors should first be obtained for the entire data-set used to train the model; series-specific predictions can then be extracted from the model-matrix as follows:

# Get model matrix for entire data-set but only based on the estimated shape for first irf term:
_,pred_mat,ci_b = model.predict([0],overlap_dat,ci=True)

# Now extract the prediction + approximate ci boundaries for a single series:
s = 8
s_pred = pred_mat[overlap_dat["series"] == s,:]@model.coef
s_ci = ci_b[overlap_dat["series"] == s]

# Now the estimated response following the onset of the first event can be visualized + an approximate CI:
from matplotlib import pyplot as plt
plt.plot(overlap_dat["time"][overlap_dat["series"] == s],s_pred,color='blue')
plt.plot(overlap_dat["time"][overlap_dat["series"] == s],s_pred+s_ci,color='blue',linestyle='dashed')
plt.plot(overlap_dat["time"][overlap_dat["series"] == s],s_pred-s_ci,color='blue',linestyle='dashed')

References:

  • Ehinger, B. V., & Dimigen, O. (2019). Unfold: an integrated toolbox for overlap correction, non-linear modeling, and regression-based EEG analysis. https://doi.org/10.7717/peerj.7838

Parameters:
  • variables (list[str]) – A list of the variables (strings) of which the term is a function. Need to exist in data passed to Formula. Need to be continuous.

  • event_onset ([int]) – A np.array containing, for each individual time-series, the index corresponding to the sample/time-point at which the event eliciting the response to be estimated by this term happened.

  • basis_kwargs (list[dict]) – A list containing one or multiple dictionaries specifying how the basis should be computed. For irf terms, the convolve argument has to be set to True! Also, min_c and max_c must be specified. min_c corresponds to the assumed min. delay of the response after event onset and can usually be set to 0. max_c corresponds to the assumed max. delay of the response (in ms) after which the response is believed to have returned to a zero baseline.

  • by (str, optional) – A string corresponding to a factor in data passed to Formula. Separate irf(variables) (and smoothness penalties) will be estimated per level of by.

  • id (int, optional) – Only useful in combination with specifying a by variable. If id is set to any integer, the penalties placed on the separate irf(variables) will share a single smoothness penalty.

  • nk (int, optional) – Number of basis functions to use. I.e., if nk=10 (the default), the term will use 10 basis functions (Note that these terms are not made identifiable by absorbing any kind of constraint).

  • basis (Callable, optional) – The basis functions to use to construct the spline matrix. By default a B-spline basis (Eilers & Marx, 2010) implemented in mssm.src.smooths.B_spline_basis().

  • is_penalized (bool, optional) – Whether the term should be penalized or not. There are rarely good reasons to set this to False.

  • penalty (list[penalties.PenType], optional) – A list of penalty types to be placed on the term.

  • pen_kwargs (list[dict], optional) – A list containing one or multiple dictionaries specifying how the penalty should be created. For the default difference penalty (Eilers & Marx, 2010) the only keyword argument (with default value) available is m=2. This reflects the order of the difference penalty. Note that while a higher m permits penalizing towards smoother functions, it also leads to an increased dimensionality of the penalty kernel (the set of functions of variables which will not be penalized). In other words, increasingly more complex functions will be left un-penalized for higher m (except if penalize_null is set to True). m=2 is usually a good choice and thus the default, but see Eilers & Marx (2010) for details. A sketch of this penalty is shown below.
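
The sketch below is an illustration only (not mssm's internal implementation) of what the default order-m difference penalty of Eilers & Marx (2010) looks like for nk coefficients:

import numpy as np

def diff_penalty(nk, m=2):
    # D computes m-th order differences of the nk coefficients.
    D = np.diff(np.eye(nk), n=m, axis=0)
    # S = D.T @ D penalizes squared m-th order differences of adjacent
    # coefficients; its kernel (the un-penalized space) has dimension m.
    return D.T @ D

S = diff_penalty(10, m=2)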

class mssm.src.python.terms.l(variables: list)

Bases: GammTerm

Adds a parametric (linear) term to the model formula. The model \(\mu_i = a + b*x_i\) can for example be achieved by adding [i(), l(['x'])] to the terms argument of a Formula. The coefficient \(b\) estimated for the term will then correspond to the slope of \(x\). This class can also be used to add predictors for categorical variables. If the formula includes an intercept, binary coding will be utilized to add reference-level adjustment coefficients for the remaining k-1 levels of any additional factor variable.

If more than one variable is included in variables the model will only add the len(variables)-interaction to the model! Lower-order interactions and main effects will not be included by default (see the li() function instead, which automatically includes all lower-order interactions and main effects).

Example: The interaction effect of factor variable “cond”, with two levels “1” and “2”, and a continuous variable “x” on the dependent variable “y” is of interest. To estimate such a model, the following formula can be used:

formula = Formula(lhs("y"),terms=[i(),l(["cond"]),l(["x"]),l(["cond","x"])])

This formula will estimate the following model:

\[\mu_i = a + b_1*c_i + b_2*x_i + b_3*c_i*x_i\]

Here, \(c\) is a binary predictor variable created so that it is 1 if “cond”=2 else 0 and \(b_3\) is the coefficient that is added because l(["cond","x"]) is included in the terms (i.e., the interaction effect).

To get a model with only main effects for “cond” and “x”, the following formula could be used:

formula = Formula(lhs("y"),terms=[i(),l(["cond"]),l(["x"])])

This formula will estimate:

\[\mu_i = a + b_1*c_i + b_2*x_i\]
Parameters:

variables ([str]) – A list of the variables (strings) for which linear predictors should be included

mssm.src.python.terms.li(variables: list[str])

Behaves like the l class but automatically includes all lower-order interactions and main effects.

Example: The interaction effect of factor variable “cond”, with two levels “1” and “2”, and a continuous variable “x” on the dependent variable “y” is of interest. To estimate such a model, the following formula can be used:

formula = Formula(lhs("y"),terms=[i(),*li(["cond","x"])])

Note, the use of the * operator to unpack the individual terms returned from li!

This formula will still (see l) estimate the following model:

\[\mu = a + b_1*c_i + b_2*x_i + b_3*c_i*x_i\]

with: \(c\) corresponding to a binary predictor variable created so that it is 1 if “cond”=2 else 0.

To get a model with only main effects for “cond” and “x”, li cannot be used; l needs to be used instead:

formula = Formula(lhs("y"),terms=[i(),l(["cond"]),l(["x"])])

This formula will estimate:

\[\mu_i = a + b_1*c_i + b_2*x_i\]
Parameters:

variables (list[str]) – A list of the variables (strings) for which linear predictors should be included

class mssm.src.python.terms.ri(variable: str)

Bases: GammTerm

Adds a random intercept for the factor variable to the model. The random intercepts \(b_i\) are assumed to be i.i.d \(b_i \sim N(0,\sigma_b)\) i.e., normally distributed around zero - the simplest random effect supported by mssm.

Thus, this term achieves exactly what is achieved in mgcv by adding the term:

s(variable,bs="re")

The variable needs to identify a factor-variable in the data (i.e., the .dtype of the variable has to be equal to ‘O’). If you want to add more complex random effects to the model (e.g., random slopes for continuous variable “x” per level of factor variable) use the rs class.

Parameters:

variable (str) – The name (string) of a factor variable. For every level of this factor a random intercept will be estimated. The random intercepts are assumed to follow a normal distribution centered around zero.

class mssm.src.python.terms.rs(variables: list[str], rf: str)

Bases: GammTerm

Adds random slopes for the effects of variables for each level of the random factor rf. The type of random slope created depends on the content of variables.

If len(variables)==1, and the string in variables identifies a categorical variable in the data, then a random offset adjustment (for every level of the categorical variable, so without binary coding!) will be estimated for every level of the random factor rf.

Example: The factor variable “cond”, with two levels “1” and “2” is assumed to have a general effect on the DV “y”. However, data was collected from multiple subjects (random factor rf = “subject”) and it is reasonable to assume that the effect of “cond” is slightly different for every subject (it is also assumed that all subjects took part in both conditions identified by “cond”). A model that accounts for this is estimated via:

formula = Formula(lhs("y"),terms=[i(),l(["cond"]),rs(["cond"],rf="subject")])

This formula will estimate the following model:

\[\mu = a + b_1*c_i + a_{j(i),cc(i)}\]

Here, \(c\) is again a binary predictor variable created so that it is 1 if “cond”=2 for observation i else 0, \(cc(i)\) indexes the level of “cond” at observation \(i\), \(j(i)\) indexes the level of “subject” at observation \(i\), and \(a_{j(i),cc(i)}\) identifies the random offset estimated for subject \(j\) at the level of “cond” indicated by \(cc(i)\). The \(a_{j,cc(i)}\) are assumed to be i.i.d \(\sim N(0,\sigma_a)\). Note that the fixed effect structure uses binary coding but the random effect structure does not!

Hence, rs(["cond"],rf="subject") in mssm corresponds to adding the term below to a mgcv model:

s(cond,subject,bs="re")

If all the strings in variables identify continuous variables in the data, then a random slope for the len(variables)-way interaction (will simplify to a slope for a single continuous variable if len(variables) == 1) will be estimated for every level of the random factor rf.

Example: The continuous variable “x” is assumed to have a general effect on the DV “y”. However, data was collected from multiple subjects (random factor rf =”subject”) and it is reasonable to assume that the effect of “x” is slightly different for every subject. A model that accounts for this is estimated via:

formula = Formula(lhs("y"),terms=[i(),l(["x"]),rs(["x"],rf="subject")])

This formula will estimate the following model:

\[\mu = a + b*x_i + b_{j(i)} * x_i\]

Where \(j(i)\) again indexes the level of “subject” at observation \(i\), \(b_{j(i)}\) identifies the random slope (the subject-specific slope adjustment for \(b\)) for variable “x” estimated for subject \(j\), and the \(b_{j(i)}\) are again assumed to be i.i.d from a single \(N(0,\sigma_b)\) distribution.

Note, lower-order interaction slopes (as well as main effects) are not pulled in by default! Consider the following formula:

formula = Formula(lhs("y"),terms=[i(),*li(["x","z"]),rs(["x","z"],rf="subject")])

with another continuous variable “z”. This corresponds to the model:

\[\mu = a + b_1*x_i + b_2*z_i + b_3*x_i*z_i + b_{j(i)}*x_i*z_i\]

With \(j(i)\) again indexing the level of “subject” at observation i, \(b_{j(i)}\) identifying the random slope (the subject-specific slope adjustment for \(b_3\)) for the interaction of variables \(x\) and \(z\) estimated for subject \(j\). The \(b_{j(i)}\) are again assumed to be i.i.d from a single \(\sim N(0,\sigma_b)\).

To add random slopes for the main effects of either \(x\) or \(z\), as well as an additional random intercept, additional rs terms and a ri term would have to be added to the formula:

formula = Formula(lhs("y"),terms=[i(),*li(["x","z"]),
                                 ri("subject"),
                                 rs(["x"],rf="subject"),
                                 rs(["z"],rf="subject"),
                                 rs(["x","z"],rf="subject")])

If len(variables) > 1 and at least one string in variables identifies a categorical variable in the data then random slopes for the len(variables)-way interaction will be estimated for every level of the random factor rf. Separate distribution parameters (the \(\sigma\) of the Normal) will be estimated for every level of the resulting interaction.

Example: The continuous variable “x” and the factor variable “cond”, with two levels “1” and “2” are assumed to have a general interaction effect on the DV “y”. However, data was collected from multiple subjects (random factor rf =”subject”) and it is reasonable to assume that their interaction effect is slightly different for every subject. A model that accounts for this is estimated via:

formula = Formula(lhs("y"),terms=[i(),*li(["x","cond"]),rs(["x","cond"],rf="subject")])

This formula will estimate the following model:

\[\mu = a + b_1*c_i + b_2*x_i + b_3*x_i*c_i + b_{j(i),cc(i)}*x_i\]

With, \(c\) corresponding to a binary predictor variable created so that it is 1 if “cond”=2 for observation \(i\) else 0, \(cc(i)\) corresponds to the level of “cond” at observation \(i\), \(j(i)\) corresponds to the level of “subject” at observation \(i\), and \(b_{j(i),cc(i)}\) identifies the random slope for variable \(x\) at “cond” = \(cc(i)\) estimated for subject \(j\). That is: the \(b_{j,cc(i)}\) where \(cc(i)=1\) are assumed to be i.i.d realizations from normal distribution \(N(0,\sigma_{b_1})\) and the \(b_{j,cc(i)}\) where \(cc(i)=2\) are assumed to be i.i.d realizations from a separate normal distribution \(N(0,\sigma_{b_2})\).

Hence, adding rs(["x","cond"],rf="subject") to a mssm model, is equivalent to adding the term below to a mgcv model:

s(x,subject,by=cond,bs="re")

Correlations between random effects cannot be taken into account by means of parameters (this is possible for example in lme4).

Parameters:
  • variables ([str]) – A list of variables. Can point to continuous and categorical variables.

  • rf (str) – A factor variable. Identifies the random factor in the data.

mssm.src.python.utils module

mssm.src.python.utils.REML(llk, nH, coef, scale, penalties)

Based on Wood (2011). Exact REML for Gaussian GAMs, Laplace-approximate REML (Wood et al., 2016) for everything else. Evaluated after applying the stabilizing reparameterization discussed by Wood (2011).

References:
  • Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.

  • Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

Parameters:
  • llk (float) – log-likelihood of model

  • nH (scipy.sparse.csc_array) – negative hessian of log-likelihood of model

  • coef (numpy.array) – Estimated vector of coefficients of shape (-1,1)

  • scale (float) – (Estimated) scale parameter - can be set to 1 for GAMLSS or GSMMs.

  • penalties ([LambdaTerm]) – List of penalties that were part of the model.

Returns:

(Approximate) REML score

Return type:

float
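
Roughly (ignoring the stabilizing reparameterization and other implementation details), the Laplace-approximate criterion evaluated here takes the form described by Wood (2011) and Wood et al. (2016):

\[\mathcal{V} = llk - \frac{\hat{\boldsymbol{\beta}}^T \mathbf{S}_{\lambda} \hat{\boldsymbol{\beta}}}{2\phi} + \frac{1}{2}\log|\mathbf{S}_{\lambda}/\phi|_+ - \frac{1}{2}\log|\mathbf{H} + \mathbf{S}_{\lambda}/\phi| + \frac{M_p}{2}\log(2\pi)\]

where \(\mathbf{H}\) corresponds to nH, \(\mathbf{S}_{\lambda}\) is the weighted penalty matrix implied by penalties, \(\phi\) corresponds to scale, \(|\cdot|_+\) denotes the generalized determinant, and \(M_p\) is the dimension of the penalty null-space.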

mssm.src.python.utils.adjust_CI(model, n_ps, b, predi_mat, use_terms, alpha, seed)

Internal function to adjust the point-wise CI to behave like a whole-function interval (based on Wood, 2017, section 6.10.2, and Simpson, 2016):

model.coef +- b gives the point-wise interval; for the interval to cover the whole function, 1-alpha % of posterior samples should be expected to fall completely within these boundaries.

From section 6.10 in Wood (2017) we have that \(\boldsymbol{\beta} | \mathbf{y}, \boldsymbol{\lambda} \sim N(\hat{\boldsymbol{\beta}},\mathbf{V})\). \(\mathbf{V}\) is the covariance matrix of this conditional posterior, and can be obtained by evaluating model.lvi.T @ model.lvi * model.scale (model.scale should be set to 1 for mssm.models.GAMMLSS and mssm.models.GSMM).

The implication of this result is that we can also expect the deviations \(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}\) to follow \(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}} | \mathbf{y}, \boldsymbol{\lambda} \sim N(0,\mathbf{V})\). In line with the whole-function interval definition above, 1-alpha % of predi_mat@[*coef - coef] (where [*coef - coef] represents the deviations \(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}\)) should fall within [-b, b]. Wood (2017) suggests finding a multiplier \(a\) so that [-a*b, a*b] achieves this.

To do this, we compute \(a\) for every sampled deviation vector predi_mat@[*coef - coef] and then select the final \(a\) so that 1-alpha % of samples have an equal or lower value. The consequence: 1-alpha % of samples drawn should fall completely within the modified boundaries. A sketch of this selection step is given below.
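
A minimal sketch of this selection step (illustration only; pred_devs is a hypothetical array of shape (n_pred, n_ps) holding predi_mat@[*coef - coef] for each posterior sample):

import numpy as np

# Per sample: the smallest multiplier a so that the sampled deviation curve
# falls completely within [-a*b, a*b] (b is assumed to be a np.array):
a_per_sample = np.max(np.abs(pred_devs) / np.asarray(b).reshape(-1,1), axis=0)

# Select a so that 1-alpha % of samples have an equal or lower value:
a = np.quantile(a_per_sample, 1 - alpha)

# Whole-function boundaries are then given by a*b:
b_adjusted = a * np.asarray(b)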

References:
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

  • Simpson, G. (2016). Simultaneous intervals for smooths revisited.

Parameters:
  • model (GAMM or GAMLSS or GSMM) – GAMM,GAMLSS, or GSMM model (which has been fitted) for which to estimate \(\mathbf{V}\)

  • n_ps (int) – Number of samples to obtain from posterior.

  • b ([float]) – Boundary of the point-wise CI.

  • predi_mat (scipy.sparse.csc_array) – Model matrix for a particular smooth term or additive combination of parameters evaluated usually at a representative sample of predictor variables.

  • use_terms (list[int] or None) – The indices corresponding to the terms that should be used to obtain the prediction or None in which case all terms will be used.

  • alpha (float) – The alpha level to use for the whole-function interval adjustment calculation as outlined above.

  • seed (int or None) – Can be used to provide a seed for the posterior sampling.

mssm.src.python.utils.central_hessian(coef, llkfun, *llkgargs, **llkkwargs)

Generic function to approximate the hessian of llkfun at coef.

llkfun is called as llkfun(coef,*llkgargs,**llkkwargs) - so all additional arguments have to be passed via the latter two.

Uses central finite differences with a fixed value for epsilon, based on eq. 8 and the following paragraphs in Ridout (2009) (an approach also used in numdifftools).

References:
  • Ridout, M. S. (2009). Statistical Applications of the Complex-Step Method of Numerical Differentiation.

Parameters:
  • coef (numpy.array) – Current estimate of coefficients of which llkfun is some function.

  • llkfun (Callable) – log-likelihood function to optimize.

Returns:

An approximation of the Hessian as a numpy array.

Return type:

numpy.array
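
A dense sketch of such an approximation (illustration only - mssm's actual implementation selects epsilon following Ridout, 2009, and passes additional arguments on to llkfun):

import numpy as np

def central_hessian_sketch(coef, llkfun, eps=1e-4):
    # Approximates H[i,j] = d^2 llk / (d coef_i d coef_j) via central differences.
    p = len(coef)
    H = np.zeros((p,p))
    for i in range(p):
        ei = np.zeros(p); ei[i] = eps
        for j in range(i,p):
            ej = np.zeros(p); ej[j] = eps
            H[i,j] = (llkfun(coef + ei + ej) - llkfun(coef + ei - ej)
                      - llkfun(coef - ei + ej) + llkfun(coef - ei - ej)) / (4*eps**2)
            H[j,i] = H[i,j] # The Hessian is symmetric.
    return H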

mssm.src.python.utils.compute_REML_candidate_GSMM(family, y, Xs, penalties, coef, n_coef, coef_split_idx, method='Chol', conv_tol=1e-07, n_c=10, bfgs_options={}, origNH=None)

Allows evaluating the REML criterion (e.g., Wood, 2011; Wood, 2016) efficiently for a set of \(\lambda\) values for a GSMM or GAMMLSS.

Internal function used for computing the correction applied to the edf for the GLRT - based on Wood (2017) and Wood et al., (2016).

See REML() function for more details.

mssm.src.python.utils.compute_Vb_corr_WPS(Vbr, Vpr, Vr, H, S_emb, penalties, coef, scale=1)

Computes both correction terms for Vb or \(\mathbf{V}_{\boldsymbol{\beta}}\), which is the co-variance matrix for the conditional posterior of \(\boldsymbol{\beta}\) so that \(\boldsymbol{\beta} | y, \boldsymbol{\lambda} \sim N(\hat{\boldsymbol{\beta}},\mathbf{V}_{\boldsymbol{\beta}})\), described by Wood, Pya, & Säfken (2016).

References:
  • Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.

  • Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • Vbr (scipy.sparse.csc_array) – Transpose of the root of the estimate of the (unscaled) covariance matrix of \(\boldsymbol{\beta} | y, \boldsymbol{\lambda}\) - the coefficients estimated by the model.

  • Vpr (numpy.array) – A (regularized) estimate of the covariance matrix of \(\boldsymbol{\rho}\) - the log smoothing penalties.

  • Vr (numpy.array) – Transpose of the root of the un-regularized covariance matrix of \(\boldsymbol{\rho}\) - the log smoothing penalties.

  • H (scipy.sparse.csc_array) – The Hessian of the log-likelihood

  • S_emb (scipy.sparse.csc_array) – The weighted penalty matrix.

  • penalties ([LambdaTerm]) – A list holding the LambdaTerm objects estimated for the model.

  • coef (numpy.array) – An array holding the estimated regression coefficients. Has to be of shape (-1,1)

  • scale (float) – Any scale parameter estimated as part of the model. Can be omitted for more generic models beyond GAMMs. Defaults to 1.

Raises:

ArithmeticError – Will throw an error when the negative Hessian of the penalized likelihood is ill-scaled so that a Cholesky decomposition fails.

Returns:

A tuple containing: Vc and Vcc. Vbr.T@Vbr*scale + Vc + Vcc is then approximately the correction devised by WPS (2016).

Return type:

(numpy.array, numpy.array)

mssm.src.python.utils.compute_Vp_WPS(Vbr, H, S_emb, penalties, coef, scale=1)

Computes the inverse of what is approximately the negative Hessian of the Laplace approximate REML criterion with respect to the log smoothing penalties.

The derivatives computed are only exact for Gaussian additive models and canonical generalized additive models. For all other models they are inexact in that they assume that the Hessian of the log-likelihood does not depend on \(\lambda\) (or \(log(\lambda)\)), so they are essentially the PQL derivatives of Wood et al. (2017). The inverse computed here acts as an approximation to the covariance matrix of the log smoothing parameters.

References:
  • Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.

  • Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

  • Wood, S. N., Li, Z., Shaddick, G., & Augustin, N. H. (2017). Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data.

Parameters:
  • Vbr (scipy.sparse.csc_array) – Transpose of the root of the estimate of the (unscaled) covariance matrix of \(\boldsymbol{\beta} | y, \boldsymbol{\lambda}\) - the coefficients estimated by the model.

  • H (scipy.sparse.csc_array) – The Hessian of the log-likelihood

  • S_emb (scipy.sparse.csc_array) – The weighted penalty matrix.

  • penalties ([LambdaTerm]) – A list holding the LambdaTerm objects estimated for the model.

  • coef (numpy.array) – An array holding the estimated regression coefficients. Has to be of shape (-1,1)

  • scale (float) – Any scale parameter estimated as part of the model. Can be omitted for more generic models beyond GAMMs. Defaults to 1.

mssm.src.python.utils.compute_reml_candidate_GAMM(family, y, X, penalties, n_c=10, offset=0, init_eta=None, method='Chol', compute_inv=False, origNH=None)

Allows evaluating the REML criterion (e.g., Wood, 2011; Wood, 2016) efficiently for a set of \(\lambda\) values.

Internal function used for computing the correction applied to the edf for the GLRT - based on Wood (2017) and Wood et al., (2016).

See REML() function for more details.

mssm.src.python.utils.correct_VB(model, nR=250, grid_type='JJJ3', a=1e-07, b=10000000.0, df=40, n_c=10, form_t=True, form_t1=False, verbose=False, drop_NA=True, method='Chol', only_expected_edf=False, Vp_fidiff=True, use_reml_weights=True, seed=None, **bfgs_options)

Estimate \(\mathbf{V}\), the covariance matrix of the unconditional posterior \(\boldsymbol{\beta} | y \sim N(\hat{\boldsymbol{\beta}},\mathbf{V})\) to account for smoothness uncertainty.

Wood et al. (2016) and Wood (2017) show that when basing conditional versions of model selection criteria or hypothesis tests on \(\mathbf{V}_{\boldsymbol{\beta}}\), which is the co-variance matrix for the conditional posterior of \(\boldsymbol{\beta}\) so that \(\boldsymbol{\beta} | y, \boldsymbol{\lambda} \sim N(\hat{\boldsymbol{\beta}},\mathbf{V}_{\boldsymbol{\beta}})\), the tests are severely biased. To correct for this they show that uncertainty in \(\boldsymbol{\lambda}\) needs to be accounted for. Hence they suggest to base these tests on \(\mathbf{V}\), the covariance matrix of the unconditional posterior \(\boldsymbol{\beta} | y \sim N(\hat{\boldsymbol{\beta}},\mathbf{V})\). They show how to obtain an estimate of \(\mathbf{V}\), but this requires \(\mathbf{V}_{\boldsymbol{p}}\) - an estimate of the covariance matrix of \(\boldsymbol{p}=log(\boldsymbol{\lambda})\). \(\mathbf{V}_{\boldsymbol{p}}\) requires derivatives that are not available when using the efs update.

Greven & Scheipl (2016) in their comment to the paper by Wood et al. (2016) show another option to estimate \(\mathbf{V}\) that does not require \(\mathbf{V}_{\boldsymbol{p}}\), based either on the total variance property or on numeric integration. The latter is implemented below, based on the equations for the expectations outlined in their response.

This function implements multiple strategies to correct for smoothing parameter uncertainty, based on the Wood et al. (2016) and Greven & Scheipl (2016) proposals. The most straightforward strategy (grid_type = 'JJJ1') is to obtain a finite difference approximation for \(\mathbf{V}_{\boldsymbol{p}}\) and to then compute, approximately, the Wood et al. (2016) correction (done exactly here only for Gaussian additive or canonical Generalized models). This is too costly for large sparse multi-level models. In those cases, the Greven & Scheipl (2016) proposal is attractive. When grid_type = 'JJJ2', the approximation to \(\mathbf{V}_{\boldsymbol{p}}\) is used to adaptively sample the grid on which the expectations required in their proposal are numerically evaluated. From this grid, \(\mathbf{V}\) could then be evaluated as shown by Greven & Scheipl (2016). However, by default \(\mathbf{V}\) is here obtained by first computing it according to the Wood et al. (2016) correction and then taking a weighted sum of the former (weighted by V_shrinkage_weight) and the \(\mathbf{V}\) (weighted by 1-V_shrinkage_weight) obtained as suggested by Greven & Scheipl (2016). Setting V_shrinkage_weight to 0 (and only_expected_edf=False) thus recovers exactly the Greven & Scheipl (2016) correction, albeit based on an adaptive grid. This is not efficient for large sparse models either. However, when setting only_expected_edf=True, \(\mathbf{V}\) is never actually computed and the uncertainty corrected model edf are instead approximated directly. This remains efficient for large sparse multi-level models.

References:
  • Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.

  • Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

  • Greven, S., & Scheipl, F. (2016). Comment on: Smoothing Parameter and Model Selection for General Smooth Models

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:
  • model (GAMM or GAMMLSS or GSMM) – GAMM, GAMMLSS, or GSMM model (which has been fitted) for which to estimate \(\mathbf{V}\)

  • nR (int, optional) – In case grid_type != 'JJJ1', nR samples/reml scores are generated/computed to numerically evaluate the expectations necessary for the uncertainty correction, defaults to 250

  • grid_type (str, optional) – How to compute the smoothness uncertainty correction - see above for details, defaults to ‘JJJ3’

  • a (float, optional) – Minimum \(\lambda\) value that is included when forming the initial grid. In addition, any of the \(\lambda\) estimates obtained from model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are smaller than this are set to this value as well, defaults to 1e-7, the minimum possible estimate

  • b (float, optional) – Maximum \(\lambda\) value that is included when forming the initial grid. In addition, any of the \(\lambda\) estimates obtained from model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are larger than this are set to this value as well, defaults to 1e7, the maximum possible estimate

  • df (int, optional) – Degrees of freedom used for the multivariate t distribution used to sample the next set of candidates. Setting this to np.inf means a multivariate normal is used for sampling, defaults to 40

  • n_c (int, optional) – Number of cores to use to compute the correction, defaults to 10

  • form_t (bool, optional) – Whether or not the smoothness uncertainty corrected edf should be computed, defaults to True

  • form_t1 (bool, optional) – Whether or not the smoothness uncertainty + smoothness bias corrected edf should be computed, defaults to False

  • verbose (bool, optional) – Whether to print progress information or not, defaults to False

  • drop_NA (bool,optional) – Whether to drop rows in the model matrices corresponding to NAs in the dependent variable vector. Defaults to True.

  • method (str,optional) – Which method to use to solve for the coefficients (and smoothing parameters). The default (“Chol”) relies on Cholesky decomposition. This is extremely efficient but in principle less stable, numerically speaking. For a maximum of numerical stability set this to “QR/Chol”. In that case a QR decomposition is used - which is first pivoted to maximize sparsity in the resulting decomposition but also pivots for stability in order to get an estimate of rank deficiency. A Cholesky decomposition is then used, based on the combined pivoting strategy obtained from the QR. This takes substantially longer. If this is set to 'qEFS', then the coefficients are estimated via quasi-Newton and the smoothing penalties are estimated from the quasi-Newton approximation to the Hessian. This only requires first derivative information. Defaults to “Chol”.

  • only_expected_edf (bool,optional) – Whether to compute the edf from the trace of the covariance matrix (only_expected_edf=False) or based on numeric integration weights. The latter is much more efficient for sparse models. Only makes sense when grid_type!='JJJ1'. Defaults to False

  • Vp_fidiff (bool,optional) – Whether to rely on a finite difference approximation to compute \(\mathbf{V}_{\boldsymbol{p}}\) for grid_type JJJ1 and JJJ2 or on a PQL approximation. The latter is exact for Gaussian and canonical GAMs and far cheaper if many penalties are to be estimated. Defaults to True (Finite difference approximation)

  • use_reml_weights (bool,optional) – Whether to rely on REML scores to compute the numerical integration when grid_type != 'JJJ1' or on the log-densities of \(\mathbf{V}_{\boldsymbol{p}}\) - the latter assumes that the unconditional posterior is normal. Defaults to True (REML scores are used)

  • seed (int,optional) – Seed to use for random parts of the correction. Defaults to None

  • bfgs_options (key=value,optional) – Any additional keyword arguments that should be passed on to the call of scipy.optimize.minimize(). If none are provided, the gtol argument will be initialized to 1e-3. Note also, that in any case the maxiter argument is automatically set to 100. Defaults to None.

Returns:

A tuple containing: V - an estimate of the unconditional covariance matrix, LV - the Cholesky of the former, Vp - an estimate of the covariance matrix for \(\boldsymbol{\rho}\), Vpr - a root of the former, edf - smoothness uncertainty corrected coefficient-wise edf, total_edf - smoothness uncertainty corrected edf, edf2 - smoothness uncertainty + smoothness bias corrected coefficient-wise edf, total_edf2 - smoothness uncertainty + smoothness bias corrected edf, upper_edf - a heuristic upper bound on the uncertainty corrected edf

Return type:

(scipy.sparse.csc_array,scipy.sparse.csc_array,numpy.array,numpy.array,numpy.array,float,numpy.array,float,float)
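
A hypothetical call, unpacking the tuple described above (model is assumed to be an already fitted mssm model; argument values are illustrative only):

from mssm.src.python.utils import correct_VB

V,LV,Vp,Vpr,edf,total_edf,edf2,total_edf2,upper_edf = correct_VB(model,grid_type='JJJ2',seed=20)
# edf2 and total_edf2 will only be computed when form_t1=True.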

mssm.src.python.utils.estimateVp(model, nR=20, lR=100, n_c=10, a=1e-07, b=10000000.0, verbose=False, drop_NA=True, method='Chol', seed=None, conv_tol=1e-07, df=40, strategy='JJJ3', **bfgs_options)

Estimate the covariance matrix \(\mathbf{V}_{\boldsymbol{p}}\) of the posterior for \(\mathbf{p} = log(\boldsymbol{\lambda})\). Either (see the strategy parameter) based on a finite difference approximation or on using REML scores to approximate the expectation for the covariance matrix, similar to what was suggested by Greven & Scheipl (2016).

References:
  • Greven, S., & Scheipl, F. (2016). Comment on: Smoothing Parameter and Model Selection for General Smooth Models

Parameters:
  • model (GAMM or GAMLSS or GSMM) – GAMM,GAMLSS, or GSMM model (which has been fitted) for which to estimate \(\mathbf{V}\)

  • nR (int, optional) – The initial \(\lambda\) grid is based on nR equally-spaced samples from \(\lambda/lR\) to \(\lambda*lR\). In addition, nR*len(model.formula.penalties) updates to \(\mathbf{V}_{\boldsymbol{p}}\) are performed, during each of which an additional nR \(\lambda\) samples/reml scores are generated/computed, defaults to 20

  • lR (int, optional) – The initial \(\lambda\) grid is based on nR equally-spaced samples from \(\lambda/lR\) to \(\lambda*lR\), defaults to 100

  • n_c (int, optional) – Number of cores to use to compute the estimate, defaults to 10

  • a (float, optional) – Minimum \(\lambda\) value that is included when forming the initial grid. In addition, any of the \(\lambda\) estimates obtained from model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are smaller than this are set to this value as well, defaults to 1e-7, the minimum possible estimate

  • b (float, optional) – Maximum \(\lambda\) value that is included when forming the initial grid. In addition, any of the \(\lambda\) estimates obtained from model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are larger than this are set to this value as well, defaults to 1e7, the maximum possible estimate

  • verbose (bool, optional) – Whether to print progress information or not, defaults to False

  • drop_NA (bool,optional) – Whether to drop rows in the model matrices corresponding to NAs in the dependent variable vector. Defaults to True.

  • method (str,optional) – Which method to use to solve for the coefficients (and smoothing parameters). The default (“Chol”) relies on Cholesky decomposition. This is extremely efficient but in principle less stable, numerically speaking. For a maximum of numerical stability set this to “QR/Chol”. In that case a QR decomposition is used - which is first pivoted to maximize sparsity in the resulting decomposition but also pivots for stability in order to get an estimate of rank deficiency. A Cholesky decomposition is then used, based on the combined pivoting strategy obtained from the QR. This takes substantially longer. If this is set to 'qEFS', then the coefficients are estimated via quasi-Newton and the smoothing penalties are estimated from the quasi-Newton approximation to the Hessian. This only requires first derivative information. Defaults to “Chol”.

  • seed (int,optional) – Seed to use for random parts of the estimate. Defaults to None

  • conv_tol (float, optional) – Deprecated, defaults to 1e-7

  • df (int, optional) – Degrees of freedom used for the multivariate t distribution used to sample the next set of candidates. Setting this to np.inf means a multivariate normal is used for sampling, defaults to 40

  • strategy (str, optional) – By default (strategy='JJJ3'), the expectation for the covariance matrix is approximated via numeric integration. When strategy='JJJ1' a finite difference approximation is obtained instead. When strategy='JJJ2', the finite difference approximation is updated through numeric integration iteratively. Defaults to ‘JJJ3’

  • bfgs_options (key=value,optional) – Any additional keyword arguments that should be passed on to the call of scipy.optimize.minimize(). If none are provided, the gtol argument will be initialized to 1e-3. Note also, that in any case the maxiter argument is automatically set to 100. Defaults to None.

Returns:

An estimate of the covariance matrix of the posterior for \(\mathbf{p} = log(\boldsymbol{\lambda})\)

Return type:

numpy.array
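
A hypothetical call (model is again assumed to be an already fitted mssm model):

from mssm.src.python.utils import estimateVp

Vp = estimateVp(model,strategy='JJJ1',seed=20)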

mssm.src.python.utils.forward_hessian(coef, llkfun, *llkgargs, **llkkwargs)

Generic function to approximate the hessian of llkfun at coef.

llkfun is called as llkfun(coef,*llkgargs,**llkkwargs) - so all additional arguments have to be passed via the latter two.

Uses forward finite differences with a fixed value for epsilon, based on chapter 5.5.2 in Wood (2015).

References:
  • Wood, S. N. (2015). Core Statistics

Parameters:
  • coef (numpy.array) – Current estimate of coefficients of which llkfun is some function.

  • llkfun (Callable) – log-likelihood function to optimize.

Returns:

An approximation of the Hessian as a numpy array.

Return type:

numpy.array

mssm.src.python.utils.sample_MVN(n, mu, scale, P, L, LI=None, use=None, seed=None)

Draw n samples from multivariate normal with mean \(\boldsymbol{\mu}\) (mu) and covariance matrix \(\boldsymbol{\Sigma}\).

\(\boldsymbol{\Sigma}\) does not need to be provided. Rather the function expects either L (\(\mathbf{L}\) in what follows) or LI (\(\mathbf{L}^{-1}\) in what follows) and scale (\(\phi\) in what follows). These relate to \(\boldsymbol{\Sigma}\) so that \(\boldsymbol{\Sigma}/\phi = \mathbf{L}^{-T}\mathbf{L}^{-1}\) or \(\mathbf{L}\mathbf{L}^T = [\boldsymbol{\Sigma}/\phi]^{-1}\) so that \(\mathbf{L}*(1/\phi)^{0.5}\) is the Cholesky of the precision matrix of \(\boldsymbol{\Sigma}\).

Notably, for models available in mssm, L (and LI) have usually been computed for a permuted matrix, e.g., \(\mathbf{P}[\mathbf{X}^T\mathbf{X} + \mathbf{S}_{\lambda}]\mathbf{P}^T\) (see Wood & Fasiolo, 2017). Hence for sampling we often need to correct for the permutation matrix \(\mathbf{P}\) (P). If LI is provided, then P can be omitted and is assumed to have been used to un-pivot LI already.

Used, for example, to sample the uncorrected posterior \(\boldsymbol{\beta} | \mathbf{y}, \boldsymbol{\lambda} \sim N(\boldsymbol{\mu} = \hat{\boldsymbol{\beta}},[\mathbf{X}^T\mathbf{X} + \mathbf{S}_{\lambda}]^{-1}\phi)\) for a GAMM (see Wood, 2017). Based on section 7.4 in Gentle (2009), assuming \(\boldsymbol{\Sigma}\) is \(p \times p\) and the covariance matrix of the uncorrected posterior, samples \(\boldsymbol{\beta}\) are then obtained by computing:

\[\boldsymbol{\beta} = \hat{\boldsymbol{\beta}} + [\mathbf{P}^T \mathbf{L}^{-T}*\phi^{0.5}]\mathbf{z}\ \text{where}\ z_i \sim N(0,1)\ \forall i = 1,...,p\]

Alternatively, relying on the fact of equivalence that:

\[[\mathbf{L}^T*(1/\phi)^{0.5}]\mathbf{P}[\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}] = \mathbf{z}\]

we can first solve for \(\mathbf{y}\) in:

\[[\mathbf{L}^T*(1/\phi)^{0.5}] \mathbf{y} = \mathbf{z}\]

followed by computing:

\[ \begin{align}\begin{aligned}\mathbf{y} = \mathbf{P}[\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}]\\\boldsymbol{\beta} = \hat{\boldsymbol{\beta}} + \mathbf{P}^T\mathbf{y}\end{aligned}\end{align} \]

The latter avoids forming \(\mathbf{L}^{-1}\) (which unlike \(\mathbf{L}\) might not benefit from the sparsity preserving permutation \(\mathbf{P}\)). If LI is None, L will thus be used for sampling as outlined in these alternative steps.

Often we care only about a handful of elements in mu (e.g., the first ones corresponding to “fixed effects” in a GAMM). In that case we can generate samples only for this sub-set of interest by only using a sub-block of rows of \(\mathbf{L}\) or \(\mathbf{L}^{-1}\) (all columns remain). Argument use can be a np.array containing the indices of elements in mu that should be sampled. Because this only works efficiently when LI is available, an error is raised when use is not None but LI is None.

If mu is set to any integer (i.e., not a Numpy array/list) it is automatically treated as 0. For mssm.models.GAMMLSS or mssm.models.GSMM models, scale can be set to 1.
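
A minimal sketch of the alternative steps outlined above (illustration only; L, P, mu, scale, n, and seed are assumed to be available as described for the parameters below):

import numpy as np
from scipy.sparse.linalg import spsolve_triangular

# Standard normal deviates, one column per sample:
z = np.random.default_rng(seed).standard_normal((L.shape[1],n))

# Solve [L.T * (1/scale)**0.5] y = z for y; L.T is upper triangular:
y = spsolve_triangular((L.T * (1/scale)**0.5).tocsr(), z, lower=False)

# Un-pivot and shift by the mean:
beta = mu.reshape(-1,1) + P.T @ y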

References:

  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

  • Gentle, J. (2009). Computational Statistics.

Parameters:
  • n (int) – Number of samples to generate

  • mu (np.array) – mean of normal distribution as described above

  • scale (float) – scaling parameter of covariance matrix as described above

  • P (scp.sparse.csc_array) – Permutation matrix, optional.

  • L (scp.sparse.csc_array) – Cholesky of precision of scaled covariance matrix as described above.

  • LI (scp.sparse.csc_array, optional) – Inverse of cholesky factor of precision of scaled covariance matrix as described above.

  • use ([int], optional) – Indices of parameters in mu for which to generate samples, defaults to None in which case all parameters will be sampled

  • seed (int, optional) – Seed to use for random sample generation, defaults to None

Raises:
  • ValueError – In case neither LI nor L are provided.

  • ValueError – In case L is provided but P is not.

  • ValueError – In case use is provided but LI is not.

Returns:

Samples from multi-variate normal distribution. In case use is not provided, the returned array will be of shape (p,n) where p==LI.shape[1]. Otherwise, the returned array will be of shape (len(use),n).

Return type:

np.array

mssm.src.python.utils.updateVp(ep, remls, rGrid)

Update the covariance matrix of the posterior for \(\mathbf{p} = log(\boldsymbol{\lambda})\). REML scores are used to approximate the expectation, similar to what was suggested by Greven & Scheipl (2016).

References:
  • Greven, S., & Scheipl, F. (2016). Comment on: Smoothing Parameter and Model Selection for General Smooth Models

Parameters:
  • ep ([float]) – Model estimate of log(lambda), i.e., the expectation over rGrid

  • remls ([float]) – REML score associated with each lambda candidate in rGrid

  • rGrid (numpy.array) – A 2d array holding all lambda samples considered so far. Each row is one sample

Returns:

An estimate of the covariance matrix of log(lambda) - a 2d array of shape (len(ep), len(ep)).

Return type:

numpy.array
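
A rough sketch of how the REML scores can act as integration weights here (illustration only, not necessarily mssm's exact implementation):

import numpy as np

# Turn REML scores into normalized weights (a softmax, guarding against overflow):
w = np.exp(np.asarray(remls) - np.max(remls))
w = w / np.sum(w)

# Weighted covariance of the log(lambda) samples around the model estimate ep:
dev = rGrid - np.asarray(ep).reshape(1,-1)
Vp = (dev * w.reshape(-1,1)).T @ dev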