mssm.src.python package
Submodules
mssm.src.python.compare module
- mssm.src.python.compare.compare_CDL(model1: GAMM, model2: GAMM, correct_V: bool = True, correct_t1: bool = False, perform_GLRT: bool = False, nR=250, n_c=10, alpha=0.05, grid='JJJ2', a=1e-07, b=10000000.0, df=40, verbose=False, drop_NA=True, method='Chol', seed=None, use_upper=True, Vp_fidiff=True, use_reml_weights=True, **bfgs_options)
Computes the AIC difference and (optionally) performs an approximate GLRT on twice the difference in unpenalized likelihood between models model1 and model2 (see Wood et al., 2016).
For the GLRT to be appropriate, model1 should be set to the model containing more effects and model2 should be a nested, simpler variant of model1. For the degrees of freedom of the test, the expected degrees of freedom (EDF) of each model are used (i.e., this is the conditional test discussed in Wood, 2017: 6.12.4). The difference between the models in EDF serves as DoF for computing the Chi-Square statistic. In addition, correct_t1 should be set to True when computing the GLRT.
To get the AIC for each model, 2*edf is added to twice the negative (conditional) likelihood (see Wood et al., 2016).
By default (correct_V=True), mssm will attempt to correct the edf for uncertainty in the estimated \(\lambda\) parameters. Which correction is computed depends on the choice for the grid argument. When grid='JJJ1', approximately the analytic solution for the correction proposed by Wood, Pya, & Säfken (2016) is computed - which is exact for strictly Gaussian and canonical Generalized additive models. This is not efficient for sparse models and will thus become too expensive for large sparse multi-level models. When setting grid='JJJ2' (default), the analytic solution proposed by Wood, Pya, & Säfken (2016) is used as a basis for the correction and subsequently refined via numeric integration. This requires a costly sampling step (see Greven & Scheipl, 2016 and the correct_VB function in the utils module) which will take quite some time for reasonably large models with more than 3-4 smoothing parameters, but remains efficient for sparse multi-level models (assuming use_upper is set to True and correct_t1 is set to False).
In case either form of correction is too expensive, it might be better to rely on hypothesis tests for individual smooths, confidence intervals, and penalty-based selection approaches instead (see Marra & Wood, 2011 for details on the latter).
In case correct_t1=True, the EDF will be set to the (smoothness uncertainty corrected, in case correct_V=True) smoothness bias corrected expected degrees of freedom (t1 in section 6.1.2 of Wood, 2017) for the GLRT (based on the recommendation given in section 6.12.4 of Wood, 2017). The AIC (Wood, 2017) of both models will still be based on the regular (smoothness uncertainty corrected) edf.
The computation here is different from the one performed by the compareML function in the R package itsadug - which rather performs a version of the marginal GLRT (also discussed in Wood, 2017: 6.12.4) - and more similar to the anova.gam implementation provided by mgcv (if grid='JJJ1'). The returned p-value is approximate - very, very much so if correct_V=False. Also, the GLRT should not be used to compare models differing in their random effect structures - the AIC is more appropriate for this (see Wood, 2017: 6.12.4).
- References:
Marra, G., & Wood, S. N. (2011) Practical variable selection for generalized additive models.
Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models
Greven, S., & Scheipl, F. (2016). Comment on: Smoothing Parameter and Model Selection for General Smooth Models
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
compareML function from the itsadug R package: https://rdrr.io/cran/itsadug/man/compareML.html
anova.gam function from mgcv, see: https://www.rdocumentation.org/packages/mgcv/versions/1.9-1/topics/anova.gam
- Parameters:
model1 (GAMM or GAMMLSS or GSMM) – GAMM, GAMMLSS, or GSMM 1.
model2 (GAMM or GAMMLSS or GSMM) – GAMM, GAMMLSS, or GSMM 2.
correct_V (bool, optional) – Whether or not to correct for smoothness uncertainty. Defaults to True.
correct_t1 (bool, optional) – Whether or not to also correct the smoothness bias corrected edf for smoothness uncertainty. Defaults to False.
perform_GLRT (bool, optional) – Whether to perform both a GLRT and compute the AIC, or to only compute the AIC. Defaults to False.
nR (int, optional) – In case grid!="JJJ1", nR samples/REML scores are generated/computed to numerically evaluate the expectations necessary for the uncertainty correction, defaults to 250.
n_c (int, optional) – Number of cores to use to compute the smoothness uncertainty correction, defaults to 10.
alpha (float, optional) – alpha level of the GLRT. Defaults to 0.05
grid (str, optional) – How to compute the smoothness uncertainty correction, defaults to ‘JJJ2’
a (float, optional) – Minimum \(\lambda\) value that is included when correcting for uncertainty and grid!='JJJ1'. In addition, any of the \(\lambda\) estimates obtained from model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are smaller than this are set to this value as well. Defaults to 1e-7 (the minimum possible estimate).
b (float, optional) – Maximum \(\lambda\) value that is included when correcting for uncertainty and grid!='JJJ1'. In addition, any of the \(\lambda\) estimates obtained from model (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are larger than this are set to this value as well. Defaults to 1e7 (the maximum possible estimate).
df (int, optional) – Degrees of freedom used for the multivariate t distribution used to sample the next set of candidates. Setting this to np.inf means a multivariate normal is used for sampling. Defaults to 40.
verbose (bool, optional) – Whether to print progress information or not. Defaults to False.
drop_NA (bool,optional) – Whether to drop rows in the model matrices corresponding to NAs in the dependent variable vector. Defaults to True.
method (str, optional) – Which method to use to solve for the coefficients (and smoothing parameters). The default ("Chol") relies on Cholesky decomposition. This is extremely efficient but, in principle, less stable numerically speaking. For a maximum of numerical stability set this to "QR/Chol". In that case a QR decomposition is used - which is first pivoted to maximize sparsity in the resulting decomposition but also pivots for stability in order to get an estimate of rank deficiency. A Cholesky is then used, employing the combined pivoting strategy obtained from the QR. This takes substantially longer. If this is set to 'qEFS', then the coefficients are estimated via quasi-Newton and the smoothing penalties are estimated from the quasi-Newton approximation to the Hessian. This only requires first-derivative information. Defaults to "Chol".
seed (int, optional) – Seed to use for random parts of the correction. Defaults to None.
use_upper (bool, optional) – Whether to compute edf from the trace of the covariance matrix (use_upper=False) or based on numeric integration weights. The latter is much more efficient for sparse models. Only makes sense when grid='JJJ2'. Defaults to True.
Vp_fidiff (bool, optional) – Whether to rely on a finite difference approximation to compute \(\mathbf{V}_{\boldsymbol{p}}\) for grids JJJ1 and JJJ2 or on a PQL approximation. The latter is exact for Gaussian and canonical GAMs and far cheaper if many penalties are to be estimated. Defaults to True (finite difference approximation).
use_reml_weights (bool, optional) – Whether to rely on REML scores to compute the numerical integration when grid_type != 'JJJ1' or on the log-densities of \(\mathbf{V}_{\boldsymbol{p}}\) - the latter assumes that the unconditional posterior is normal. Defaults to True (REML scores are used).
bfgs_options (key=value, optional) – Any additional keyword arguments that should be passed on to the call of scipy.optimize.minimize(). If none are provided, the gtol argument will be initialized to 1e-3. Note also, that in any case the maxiter argument is automatically set to 100. Defaults to None.
- Raises:
ValueError – If both models are from different families.
ValueError – If perform_GLRT=True and model1 has fewer coefficients than model2 - i.e., model1 has to be the notationally more complex one.
- Returns:
A dictionary with outcomes of all tests. Key H1 will be a bool indicating whether the Null hypothesis was rejected or not, p will be the p-value, chi^2 will be the test statistic used, Res. DOF will be the degrees of freedom used by the test, and aic1 and aic2 will be the AIC scores for both models.
- Return type:
dict
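For orientation, a minimal usage sketch (model names m1 and m2 are placeholders for two already-fitted mssm models over the same data, with m2 nested in m1):

from mssm.src.python.compare import compare_CDL

# m1 (more effects) and m2 (nested, simpler) are assumed to be fitted
# already and to share the same family - names are illustrative.
res = compare_CDL(m1, m2, correct_t1=True, perform_GLRT=True, seed=0)

print(res["aic1"], res["aic2"])  # smoothness-uncertainty corrected AICs
if res["H1"]:                    # GLRT decision at the chosen alpha
    print(res["chi^2"], res["p"], res["Res. DOF"])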
mssm.src.python.constraints module
- class mssm.src.python.constraints.Constraint(Z: array = None, type: ConstType = None)
Bases:
object
Constraint storage.
Z either holds the QR-based correction matrix that needs to be multiplied with \(\mathbf{X}\), \(\mathbf{S}\), and \(\mathbf{D}\) (where \(\mathbf{D}\mathbf{D}^T = \mathbf{S}\)) to make terms subject to the conventional sum-to-zero constraints applied also in mgcv (Wood, 2017), the column/row that should be dropped from those - in which case \(\mathbf{X}\) can also no longer take on a constant - or None, indicating that the model should be “difference re-coded” to enable sparse sum-to-zero constraints. The latter two are available in mgcv’s smoothCon function by setting the sparse.cons argument to 1 or 2 respectively.
The QR-based approach is described in detail by Wood (2017) and is similar to just mean centering every basis function involved in the smooth and then dropping one column from the corresponding centered model matrix. The column-dropping approach is self-explanatory. The difference re-coding deserves some explanation. Consider the model:
\[y = f(x) = b_1*f_1(x) + b_2*f_2(x) + b_3*f_3(x) + b_4*f_4(x) + b_5*f_5(x)\]
In mssm, \(f(x)\) will be parameterized via five B-spline basis functions \(f_i(x)\), each weighted by a coefficient \(b_i\). Each of the \(f_i(x)\) has support only over a small range of \(x\) - resulting in a sparse model matrix \(\mathbf{X}\). This desirable property is lost when enforcing the conventional sum-to-zero constraints (either QR-based or by subtracting the mean of each basis over \(x\) from the corresponding \(f_i(x)\)). To maintain sparsity, one can enforce a coefficients-sum-to-zero constraint (Wood, 2020) by re-coding instead: Set \(b_1 = -b_2\).
Then, re-write:
\[ \begin{align}\begin{aligned}y = -b_2 * f_1(x) + b_2 * f_2(x) + b_3 * f_3(x) - b_3 * f_2(x) + b_4 * f_4(x) - b_4 * f_3(x) + b_5 * f_5(x) - b_5 * f_4(x)\\y = -b_2 * f_1(x) + (b_2 - b_3) * f_2(x) + (b_3 - b_4) * f_3(x) + (b_4 - b_5) * f_4(x) + b_5 * f_5(x)\\y = b_2 * (f_2(x) - f_1(x)) + b_3 * (f_3(x) - f_2(x)) + b_4 * (f_4(x) - f_3(x)) + b_5 * (f_5(x) - f_4(x))\end{aligned}\end{align} \]
Line 3 shows how the constraint can be absorbed for model fitting by first computing new bases \((f_i(x) - f_{i-1}(x))\) and then estimating the coefficients based on those (this is implemented in mgcv’s smoothCon function when setting sparse.cons=2). Note that the constrained model, just like when using the QR-based or dropping approach, requires dropping one of the \(k\) original coefficients for estimation since only \(k-1\) coefficients are allowed to vary. The same sum-to-zero constraint can be achieved by fixing one of the central bases in the original model to its neighbor (e.g., setting \(b_2 = -b_3\)) or by setting \(b_1 = -b_2 - b_3 - b_4 - b_5\). mssm fixes one of the central bases to its neighbor.
With a B-spline basis, it would be necessary that \(b_1 = b_2 = b_3 = b_4 = b_5\) for the model to fit a constant \(f(x)\) over all values of \(x\) (Eilers & Marx, 1996). In the constrained model this is no longer possible due to \(b_1 = -b_2\), effectively removing the constant from the space of functions that \(f(x)\) can approximate. Instead, in the constrained model, if \(b_2 = b_3 = b_4 = b_5\), \(f(x)\) summed over all \(x\) will be equal to zero!
Importantly, the “new” bases obtained by computing \((f_i(x) - f_{i-1}(x))\) are all still relatively sparse. This is because neighboring \(f_i(x)\) overlap in their support and because the support of each individual \(f_i(x)\) is narrow, i.e., \(f_i(x)\) is sparse itself.
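To make the re-coding concrete, the following small numpy sketch (illustrative only, not mssm’s internal code) builds the transformation matrix implied by Line 3 above and verifies the coefficients-sum-to-zero property:

import numpy as np

# For a design matrix X with k B-spline columns f_1(x)...f_k(x), the
# constrained design is Xc = X @ D, where column j of Xc equals
# f_{j+1}(x) - f_j(x).
k = 5
D = np.zeros((k, k - 1))
for j in range(k - 1):
    D[j, j] = -1.0      # subtract f_j
    D[j + 1, j] = 1.0   # add f_{j+1}

# If all k-1 constrained coefficients are equal, the implied original
# coefficients sum to zero - the sum-to-zero constraint holds:
b_original = D @ np.ones(k - 1)
print(b_original.sum())  # 0.0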
However, this is not a true centering constraint: \(f(x)\) will not necessarily be orthogonal to the intercept, i.e., \(\mathbf{1}^T \mathbf{f(x)}\) will not necessarily be 0. Hence, confidence intervals will usually be wider when using ConstType.DIFF (also when using ConstType.DROP, for the same reason) instead of ConstType.QR (see Wood, 2017, 2020)!
A final note regards the use of tensor smooths when te==False. Since the value of any constant estimated for a smooth depends on the type of constraint used, the marginal functions estimated for the “main effects” (\(f(x)\), \(f(z)\)) and the “interaction effect” (\(f(x,z)\)) in a model \(y = a + f(x) + f(z) + f(x,z)\) will differ depending on the type of constraint used. The “Anova-like” decomposition described in detail in Wood (2017) is achievable only when using ConstType.QR.
- References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
Wood, S. N. (2020). Inference and computation with generalized additive models and their extensions. TEST, 29(2), 307–339. https://doi.org/10.1007/s11749-020-00711-5
Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89–121. https://doi.org/10.1214/ss/1038425655
- Z: array = None
mssm.src.python.exp_fam module
- class mssm.src.python.exp_fam.Binomial(link: ~mssm.src.python.exp_fam.Link = <mssm.src.python.exp_fam.Logit object>, n: int = 1)
Bases:
Family
Binomial family. For this implementation we assume that we have collected proportions of success, i.e., the dependent variable specified in the model Formula needs to hold observed proportions and not counts! If we assume that each observation \(y_i\) reflects a single independent draw from a binomial (with \(n=1\), and \(p_i\) being the probability that the result is 1) then the dependent variable should either hold 1 or 0. If we have multiple independent draws from the binomial per observation (i.e., row in our data-frame), then \(n\) will usually differ between observations/rows in our data-frame (i.e., we observe \(k_i\) counts of success out of \(n_i\) draws - so that \(y_i=k_i/n_i\)). In that case, the Binomial() family accepts a vector for argument \(\mathbf{n}\) (which is simply set to 1 by default, assuming binary data), containing \(n_i\) for every observation \(y_i\).
In this implementation, the scale parameter is kept fixed/known at 1.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
link (Link) – The link function to be used by the model of the mean of this family. By default set to the canonical logit link.
n (int or [int], optional) – Number of independent draws from a Binomial per observation/row of data-frame. For binary data this can simply be set to 1, which is the default.
- D(y, mu)
Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- V(mu)
The variance function (of the mean; see Wood, 2017, 3.1.2) for the Binomial model. Variance is minimal for \(\mu=1\) and \(\mu=0\), maximal for \(\mu=0.5\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted probability for the response distribution corresponding to each observation.
- deviance(y, mu)
Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- init_mu(y)
Function providing initial \(\boldsymbol{\mu}\) vector for GAMM.
Estimation assumes proportions as dep. variable. According to: https://stackoverflow.com/questions/60526586/ the glm() function in R always initializes \(\mu\) = 0.75 for observed proportions (i.e., elements in \(\mathbf{y}\)) of 1 and \(\mu\) = 0.25 for proportions of zero. This can be achieved by adding 0.5 to the observed proportion of success (and adding one observation).
- Parameters:
y ([float]) – The vector containing each observation.
- llk(y, mu)
log-probability of data under given model. Essentially the sum over all elements in the vector returned by the lp() method.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- lp(y, mu)
Log-probability of observing every proportion in \(\mathbf{y}\) under their respective binomial with mean = \(\boldsymbol{\mu}\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observed proportion.
mu ([float]) – The vector containing the predicted probability for the response distribution corresponding to each observation.
- Returns:
a N-dimensional vector containing the log-probability of observing each data-point under the current model.
- Return type:
[float]
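As a brief illustrative sketch (data values are made up), proportions data with varying numbers of draws per row would be prepared like this:

import numpy as np
from mssm.src.python.exp_fam import Binomial

k = np.array([3, 7, 0, 10])    # k_i: counts of success per observation
n = np.array([10, 10, 5, 12])  # n_i: draws per observation
y = k / n                      # dependent variable: observed proportions
fam = Binomial(n=n)            # pass the per-observation draw counts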
- class mssm.src.python.exp_fam.Binomial2(link: ~mssm.src.python.exp_fam.Link = <mssm.src.python.exp_fam.Logit object>, n: int = 1)
Bases:
GAMLSSFamily
Another implementation of the Binomial family that allows estimating binomial models via GAMMLSS models (and thus full Newton, no PQL!).
For this implementation we again assume that we have collected proportions of success, i.e., the dependent variable specified in the model Formula needs to hold observed proportions and not counts! If we assume that each observation \(y_i\) reflects a single independent draw from a binomial (with \(n=1\), and \(p_i\) being the probability that the result is 1) then the dependent variable should either hold 1 or 0. If we have multiple independent draws from the binomial per observation (i.e., row in our data-frame), then \(n\) will usually differ between observations/rows in our data-frame (i.e., we observe \(k_i\) counts of success out of \(n_i\) draws - so that \(y_i=k_i/n_i\)). In that case, this family accepts a vector for argument \(\mathbf{n}\) (which is simply set to 1 by default, assuming binary data), containing \(n_i\) for every observation \(y_i\).
In this implementation, the scale parameter is kept fixed/known at 1.
References:
Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
link (Link) – The link function to be used by the model of the mean of this family. By default set to the canonical logit link.
n (int or [int], optional) – Number of independent draws from a Binomial per observation/row of data-frame. For binary data this can simply be set to 1, which is the default.
- get_resid(y, mu)
Get standardized residuals for a Binomial model
Essentially, the deviance residuals are returned, which are equivalent to \(sign(y_i - \mu_i)*D_i^{0.5}\), where \(\sum_{i=1,...N} D_i\) equals the model deviance (see Wood 2017, section 3.1.7).
- Parameters:
y ([float]) – The vector containing each observed proportion.
mu ([float]) – The vector containing the predicted probability for the response distribution corresponding to each observation.
- Returns:
A list of deviance residuals that should be ~ N(0,1) if the model is correct.
- Return type:
[float]
- init_coef(models)
Function to initialize the coefficients of the model.
Fits a GAMM (via PQL) for the mean.
- Parameters:
models ([mssm.models.GAMM]) – A list of mssm.models.GAMM's, each based on one of the formulas provided to a model.
- Returns:
A numpy.array of shape (-1,1), holding initial values for all model coefficients.
- Return type:
numpy array
- init_lambda(formulas)
Function to initialize the smoothing parameters of the model.
Returns values in line with what is used to initialize a GAMM.
- Parameters:
formulas ([mssm.src.python.formula.Formula]) – A list of mssm.src.python.formula.Formula's provided to a model.
- Returns:
A list, holding - for each \(\lambda\) parameter to be estimated - an initial value.
- Return type:
[float]
- llk(y, mu)
log-probability of data under given model. Essentially the sum over all elements in the vector returned by the lp() method.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- lp(y, mu)
Log-probability of observing every proportion in \(\mathbf{y}\) under their respective binomial with mean = \(\boldsymbol{\mu}\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observed proportion.
mu ([float]) – The vector containing the predicted probability for the response distribution corresponding to each observation.
- Returns:
a N-dimensional vector containing the log-probability of observing each data-point under the current model.
- Return type:
[float]
- class mssm.src.python.exp_fam.Family(link: Link, twopar: bool, scale: float = None)
Bases:
object
Base class to be implemented by exponential family members.
- Parameters:
link (Link) – The link function to be used by the model of the mean of this family.
twopar (bool) – Whether the family has two parameters (mean,scale) to be estimated (i.e., whether the likelihood is a function of two parameters), or only a single one (usually the mean).
scale (float or None, optional) – Known/fixed scale parameter for this family.
- D(y, mu)
Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- V(mu, **kwargs)
The variance function (of the mean; see Wood, 2017, 3.1.2). Different exponential families allow for different relationships between the variance in our random response variable and the mean of it. For the normal model this is assumed to be constant.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- deviance(y, mu)
Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- init_mu(y)
Convenience function to compute an initial \(\boldsymbol{\mu}\) estimate passed to the GAMM/PIRLS estimation routine.
- Parameters:
y ([float]) – The vector containing each observation.
- llk(y, mu, **kwargs)
log-probability of \(\mathbf{y}\) under this family with mean = \(\boldsymbol{\mu}\). Essentially the sum over all elements in the vector returned by the lp() method.
Families with more than one parameter that needs to be estimated in order to evaluate the model's log-likelihood (i.e., twopar=True) must accept a scale keyword argument with a default value, e.g.:
def llk(self, y, mu, scale=1): ...
You can check the implementation of the Gaussian family for an example.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- lp(y, mu, **kwargs)
Log-probability of observing every value in \(\mathbf{y}\) under this family with mean = \(\boldsymbol{\mu}\).
Families with more than one parameter that needs to be estimated in order to evaluate the model's log-likelihood (i.e., twopar=True) must accept a scale keyword argument with a default value, e.g.:
def lp(self, y, mu, scale=1): ...
You can check the implementation of the Gaussian family for an example.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- Returns:
a N-dimensional vector containing the log-probability of observing each data-point under the current model.
- Return type:
[float]
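To illustrate the contract described above, here is a minimal sketch of a custom two-parameter family mirroring the documented Gaussian (schematic, not mssm's exact implementation):

import numpy as np
import scipy.stats as scp_stats
from mssm.src.python.exp_fam import Family, Identity

class MyGaussian(Family):
    def __init__(self, scale=None):
        # Two-parameter family (mean, scale), hence twopar=True.
        super().__init__(Identity(), True, scale)

    def lp(self, y, mu, scale=1):
        # Element-wise log-probability; note the scale keyword with a
        # default value, as required for families with twopar=True.
        return scp_stats.norm.logpdf(y, loc=mu, scale=np.sqrt(scale))

    def llk(self, y, mu, scale=1):
        # Sum over the vector returned by lp(), as documented.
        return np.sum(self.lp(y, mu, scale))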
- class mssm.src.python.exp_fam.GAMLSSFamily(pars: int, links: [<class 'mssm.src.python.exp_fam.Link'>])
Bases:
object
Base-class to be implemented by families of Generalized Additive Mixed Models of Location, Scale, and Shape (GAMMLSS; Rigby & Stasinopoulos, 2005).
Apart from the required methods, three mandatory attributes need to be defined by the __init__() constructor of implementations of this class. These are required to evaluate the first and second (pure & mixed) derivatives of the log-likelihood with respect to any of the log-likelihood's parameters. See the variables below.
Optionally, a mean_init_fam attribute can be defined - specifying a Family member that is fitted to the data to get an initial estimate of the mean parameter of the assumed distribution.
References:
Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
- Parameters:
pars (int) – Number of parameters of the distribution belonging to the random variables assumed to have generated the observations, e.g., 2 for the Normal: mean and standard deviation.
links ([Link]) – Link functions for each of the parameters of the distribution.
- Variables:
d1 ([Callable]) – A list holding n_par functions to evaluate the first partial derivatives of the llk with respect to each parameter of the llk. Needs to be initialized when calling __init__().
d2 ([Callable]) – A list holding n_par functions to evaluate the second (pure) partial derivatives of the llk with respect to each parameter of the llk. Needs to be initialized when calling __init__().
d2m ([Callable]) – A list holding n_par*(n_par-1)/2 functions to evaluate the second mixed partial derivatives of the llk with respect to each parameter of the llk, in order: d2m[0] = \(\partial l/\partial \mu_1 \partial \mu_2\), d2m[1] = \(\partial l/\partial \mu_1 \partial \mu_3\), …, d2m[n_par-2] = \(\partial l/\partial \mu_1 \partial \mu_{n_{par}}\), d2m[n_par-1] = \(\partial l/\partial \mu_2 \partial \mu_3\), d2m[n_par] = \(\partial l/\partial \mu_2 \partial \mu_4\), … . Needs to be initialized when calling __init__().
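As an illustrative sketch of how these attributes might be wired up for a two-parameter (mean, standard deviation) Normal family - the lambdas are the standard analytic derivatives of the Normal log-likelihood, but their exact call signature here is an assumption, so treat this as schematic rather than mssm's internal code:

from mssm.src.python.exp_fam import GAMLSSFamily

class NormalLSSSketch(GAMLSSFamily):
    def __init__(self, links):
        super().__init__(2, links)  # two parameters: mu and sigma
        # First partial derivatives of the llk w.r.t. mu and sigma:
        self.d1 = [
            lambda y, mu, sigma: (y - mu) / sigma**2,
            lambda y, mu, sigma: ((y - mu)**2 - sigma**2) / sigma**3,
        ]
        # Second (pure) partial derivatives:
        self.d2 = [
            lambda y, mu, sigma: -1 / sigma**2,
            lambda y, mu, sigma: (sigma**2 - 3 * (y - mu)**2) / sigma**4,
        ]
        # n_par*(n_par-1)/2 = 1 mixed derivative, d2m[0] = dl/(dmu dsigma):
        self.d2m = [
            lambda y, mu, sigma: -2 * (y - mu) / sigma**3,
        ]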
- get_resid(y, *mus)
Get standardized residuals for a GAMMLSS model (Rigby & Stasinopoulos, 2005).
Any implementation of this function should return a vector that looks like what could be expected from taking len(y) independent draws from \(N(0,1)\).
References:
Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observed value.
mus ([[float]]) – A list including self.n_par lists - one for each parameter of the distribution. Each of those lists contains the expected value for a particular parameter for each of the N observations.
- Returns:
A N-dimensional vector containing the standardized residuals under the current model.
- Return type:
[float]
- init_coef(models)
(Optional) Function to initialize the coefficients of the model.
Can return None, in which case random initialization will be used.
- Parameters:
models ([mssm.models.GAMM]) – A list of mssm.models.GAMM's, each based on one of the formulas provided to a model.
- Returns:
A numpy.array of shape (-1,1), holding initial values for all model coefficients.
- Return type:
numpy array
- init_lambda(formulas)
(Optional) Function to initialize the smoothing parameters of the model.
Can return None, in which case random initialization will be used.
- Parameters:
formulas ([mssm.src.python.formula.Formula]) – A list of mssm.src.python.formula.Formula's provided to a model.
- Returns:
A list, holding - for each \(\lambda\) parameter to be estimated - an initial value.
- Return type:
[float]
- llk(y, *mus)
log-probability of data under given model. Essentially the sum over all elements in the vector returned by the lp() method.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mus ([[float]]) – A list including self.n_par lists - one for each parameter of the distribution. Each of those lists contains the expected value for a particular parameter for each of the N observations.
- Returns:
The log-probability of observing all data under the current model.
- Return type:
float
- lp(y, *mus)
Log-probability of observing every element in \(\mathbf{y}\) under their respective distribution parameterized by mus.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observed value.
mus ([[float]]) – A list including self.n_par lists - one for each parameter of the distribution. Each of those lists contains the expected value for a particular parameter for each of the N observations.
- Returns:
a N-dimensional vector containing the log-probability of observing each data-point under the current model.
- Return type:
[float]
- class mssm.src.python.exp_fam.GAMMALS(links: [<class 'mssm.src.python.exp_fam.Link'>])
Bases:
GAMLSSFamily
Family for a GAMMA GAMMLSS model (Rigby & Stasinopoulos, 2005).
This family follows the Gamma family, in that we assume: \(Y_i \sim \Gamma(\mu_i,\phi_i)\). The difference to the Gamma family is that we now also model \(\phi\) as an additive combination of smooth variables and other parametric terms. The Gamma distribution is usually not expressed in terms of the mean and scale (\(\phi\)) parameter but rather in terms of a shape and rate parameter - called \(\alpha\) and \(\beta\) respectively. Wood (2017) provides \(\alpha = 1/\phi\). With this we can obtain \(\beta = 1/\phi/\mu\) (see the source-code of the lp() method of the Gamma family for details).
References:
Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
- Parameters:
links ([Link]) – Link functions for the mean and standard deviation. Standard would be links=[LOG(),LOG()].
- get_resid(y, mu, scale)
Get standardized residuals for a Gamma GAMMLSS model (Rigby & Stasinopoulos, 2005).
Essentially, to get a standardized residual vector we first have to account for the mean-variance relationship of our RVs (which we also have to do for the Gamma family) - for this we can simply compute deviance residuals again (see Wood, 2017). These should be \(\sim N(0,\phi_i)\) (where \(\phi_i\) is the element in scale for a specific observation) - so if we divide each of those by the observation-specific scale we can expect the resulting standardized residuals to be \(\sim N(0,1)\) if the model is correct.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
scale ([float]) – The vector containing the predicted scale parameter for the response distribution corresponding to each observation.
- Returns:
A list of standardized residuals that should be ~ N(0,1) if the model is correct.
- Return type:
[float]
- init_coef(models)
Function to initialize the coefficients of the model.
Fits a GAMM for the mean and initializes all coef. for the scale parameter to 1.
- Parameters:
models ([mssm.models.GAMM]) – A list of mssm.models.GAMM's, each based on one of the formulas provided to a model.
- Returns:
A numpy.array of shape (-1,1), holding initial values for all model coefficients.
- Return type:
numpy array
- llk(y, mu, scale)
log-probability of data under given model. Essentially the sum over all elements in the vector returned by the lp() method.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
scale ([float]) – The vector containing the predicted scale parameter for the response distribution corresponding to each observation.
- Returns:
The log-probability of observing all data under the current model.
- Return type:
float
- lp(y, mu, scale)
Log-probability of observing every value in \(\mathbf{y}\) under their respective Gamma with mean = \(\boldsymbol{\mu}\) and scale = \(\boldsymbol{\phi}\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observed value.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
scale ([float]) – The vector containing the predicted scale parameter for the response distribution corresponding to each observation.
- Returns:
a N-dimensional vector containing the log-probability of observing each data-point under the current model.
- Return type:
[float]
- class mssm.src.python.exp_fam.GAUMLSS(links: [<class 'mssm.src.python.exp_fam.Link'>])
Bases:
GAMLSSFamily
Family for a Normal GAMMLSS model (Rigby & Stasinopoulos, 2005).
This family follows the Gaussian family, in that we assume: \(Y_i \sim N(\mu_i,\sigma_i)\) - i.e., each of the \(N\) observations is still believed to have been generated from an independent normally distributed RV with observation-specific mean.
The important difference is that the scale parameter, \(\sigma\), is now also observation-specific and modeled as an additive combination of smooth functions and other parametric terms, just like the mean is in a Normal GAM. Note that this explicitly models heteroscedasticity - the residuals are no longer assumed to be i.i.d. samples from \(N(0,\sigma)\), since \(\sigma\) can now differ between residual realizations.
References:
Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
- Parameters:
links ([Link]) – Link functions for the mean and standard deviation. Standard would be links=[Identity(),LOG()].
- get_resid(y, mu, sigma)
Get standardized residuals for a Normal GAMMLSS model (Rigby & Stasinopoulos, 2005).
Essentially, each residual should reflect a realization of a normal with mean zero and observation-specific standard deviation. After scaling each residual by their observation-specific standard deviation we should end up with standardized residuals that can be expected to be i.i.d \(\sim N(0,1)\) - assuming that our model is correct.
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
sigma ([float]) – The vector containing the predicted standard deviation for the response distribution corresponding to each observation.
- Returns:
A list of standardized residuals that should be ~ N(0,1) if the model is correct.
- Return type:
[float]
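The standardization described above amounts to a simple element-wise computation (illustrative sketch with made-up numbers):

import numpy as np

y = np.array([1.2, 0.4, 2.6])
mu = np.array([1.0, 0.5, 2.0])     # observation-specific means
sigma = np.array([0.3, 0.1, 0.5])  # observation-specific std. deviations

# Scale each raw residual by its own standard deviation; if the model
# is correct these should look like i.i.d. draws from N(0,1).
std_resid = (y - mu) / sigma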
- init_coef(models)
Function to initialize the coefficients of the model.
Fits a GAMM for the mean and initializes all coef. for the standard deviation to 1.
- Parameters:
models ([mssm.models.GAMM]) – A list of mssm.models.GAMM's, each based on one of the formulas provided to a model.
- Returns:
A numpy.array of shape (-1,1), holding initial values for all model coefficients.
- Return type:
numpy array
- llk(y, mu, sigma)
log-probability of data under given model. Essentially the sum over all elements in the vector returned by the lp() method.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
sigma ([float]) – The vector containing the predicted standard deviation for the response distribution corresponding to each observation.
- Returns:
The log-probability of observing all data under the current model.
- Return type:
float
- lp(y, mu, sigma)
Log-probability of observing every value in \(\mathbf{y}\) under their respective Normal with observation-specific mean and standard deviation.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observed value.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
sigma ([float]) – The vector containing the predicted standard deviation for the response distribution corresponding to each observation.
- Returns:
a N-dimensional vector containing the log-probability of observing each data-point under the current model.
- Return type:
[float]
- class mssm.src.python.exp_fam.GENSMOOTHFamily(pars: int, links: [<class 'mssm.src.python.exp_fam.Link'>], *llkargs)
Bases:
object
Base-class for General Smooth “families” as discussed by Wood, Pya, & Säfken (2016). For estimation of mssm.models.GSMM models via BFGS it is sufficient to implement llk(). gradient() and hessian() can then simply return None. For exact estimation via Newton's method, the latter two functions need to be implemented and have to return the gradient and Hessian at the current coefficient estimate respectively.
Additional parameters needed for likelihood, gradient, or Hessian evaluation can be passed along via the llkargs.
References:
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
Nocedal & Wright (2006). Numerical Optimization. Springer New York.
- Parameters:
pars (int) – Number of parameters of the likelihood.
links ([Link]) – List of Link functions for each parameter of the likelihood, e.g., links=[Identity(),LOG()].
- gradient(coef, coef_split_idx, y, Xs)
Function to evaluate the gradient of the llk at the current coefficient estimate coef.
- Parameters:
coef ([float]) – The current coefficient estimate (as np.array of shape (-1,1) - so it must not be flattened!).
coef_split_idx ([int]) – A list used to split (via np.split()) the coef into the sub-sets associated with each parameter of the llk.
y ([float]) – The vector containing each observation.
Xs ([scp.sparse.csc_array]) – A list of sparse model matrices per likelihood parameter.
- Returns:
The gradient of the log-likelihood evaluated at coef as a numpy.array of shape (-1,1).
- Return type:
[float]
- hessian(coef, coef_split_idx, y, Xs)
Function to evaluate the Hessian of the llk at the current coefficient estimate coef.
- Parameters:
coef ([float]) – The current coefficient estimate (as np.array of shape (-1,1) - so it must not be flattened!).
coef_split_idx ([int]) – A list used to split (via np.split()) the coef into the sub-sets associated with each parameter of the llk.
y ([float]) – The vector containing each observation.
Xs ([scp.sparse.csc_array]) – A list of sparse model matrices per likelihood parameter.
- Returns:
The Hessian of the log-likelihood evaluated at coef.
- Return type:
scipy.sparse.csc_array
- init_coef(models)
(Optional) Function to initialize the coefficients of the model.
Can return None, in which case random initialization will be used.
- Parameters:
models ([mssm.models.GAMM]) – A list of mssm.models.GAMM's, each based on one of the formulas provided to a model.
- Returns:
A numpy.array of shape (-1,1), holding initial values for all model coefficients.
- Return type:
numpy array
- init_lambda(formulas)
(Optional) Function to initialize the smoothing parameters of the model.
Can return None, in which case random initialization will be used.
- Parameters:
formulas ([mssm.src.python.formula.Formula]) – A list of mssm.src.python.formula.Formula's provided to a model.
- Returns:
A list, holding - for each \(\lambda\) parameter to be estimated - an initial value.
- Return type:
[float]
- llk(coef, coef_split_idx, y, Xs)
log-probability of data under given model.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
coef ([float]) – The current coefficient estimate (as np.array of shape (-1,1) - so it must not be flattened!).
coef_split_idx ([int]) – A list used to split (via np.split()) the coef into the sub-sets associated with each parameter of the llk.
y ([float]) – The vector containing each observation.
Xs ([scp.sparse.csc_array]) – A list of sparse model matrices per likelihood parameter.
- Returns:
The log-likelihood evaluated at coef.
- Return type:
float
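A minimal sketch of the llk-only route described above, sufficient for BFGS-based estimation (schematic, assuming a single-parameter Gaussian-type likelihood; not mssm's internal code):

import numpy as np
from mssm.src.python.exp_fam import GENSMOOTHFamily, Identity

class GaussianGSMMSketch(GENSMOOTHFamily):
    def __init__(self):
        super().__init__(1, [Identity()])

    def llk(self, coef, coef_split_idx, y, Xs):
        # eta = X @ coef is the mean; the variance is profiled out
        # purely for illustration.
        res = y - Xs[0] @ coef
        sig2 = float(np.mean(res**2))
        return float(-0.5 * len(y) * np.log(2 * np.pi * sig2)
                     - np.sum(res**2) / (2 * sig2))

    def gradient(self, coef, coef_split_idx, y, Xs):
        return None  # permitted when estimating via BFGS

    def hessian(self, coef, coef_split_idx, y, Xs):
        return None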
- class mssm.src.python.exp_fam.Gamma(link: ~mssm.src.python.exp_fam.Link = <mssm.src.python.exp_fam.LOG object>, scale: float = None)
Bases:
Family
Gamma Family.
We assume: \(Y_i \sim \Gamma(\mu_i,\phi)\). The Gamma distribution is usually not expressed in terms of the mean and scale (\(\phi\)) parameter but rather in terms of a shape and rate parameter - called \(\alpha\) and \(\beta\) respectively. Wood (2017) provides \(\alpha = 1/\phi\). With this we can obtain \(\beta = 1/\phi/\mu\) (see the source-code of the lp() method for details).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
link (Link) – The link function to be used by the model of the mean of this family. By default set to the log link.
scale (float or None, optional) – Known scale parameter for this family - by default set to None so that the scale parameter is estimated.
- D(y, mu)
Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- Returns:
A N-dimensional vector containing the contribution of each data-point to the overall model deviance.
- Return type:
[float]
- V(mu)
Variance function for the Gamma family.
The variance of random variable \(Y\) is proportional to its mean raised to the second power.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – a N-dimensional vector of the model prediction/the predicted mean
- Returns:
mu raised to the power of 2
- Return type:
[float]
- deviance(y, mu)
Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- Returns:
The model deviance.
- Return type:
float
- llk(y, mu, scale=1)
log-probability of data under given model. Essentially the sum over all elements in the vector returned by the lp() method.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
scale (float, optional) – The (estimated) scale parameter, defaults to 1
- Returns:
The log-probability of observing all data under the current model.
- Return type:
float
- lp(y, mu, scale=1)
Log-probability of observing every value in \(\mathbf{y}\) under their respective Gamma with mean = \(\boldsymbol{\mu}\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observed value.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
scale (float, optional) – The (estimated) scale parameter, defaults to 1
- Returns:
a N-dimensional vector containing the log-probability of observing each data-point under the current model.
- Return type:
[float]
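The (mean, scale) to (shape, rate) conversion described in the class intro can be checked with a short scipy sketch (scipy's gamma takes a shape a and a scale equal to 1/rate):

import scipy.stats as scp_stats

mu, phi = 2.0, 0.5
alpha = 1 / phi        # shape
beta = 1 / phi / mu    # rate
# Mean alpha/beta recovers mu; variance alpha/beta**2 equals phi*mu**2,
# consistent with V(mu) being proportional to mu squared.
lp = scp_stats.gamma.logpdf(1.8, a=alpha, scale=1 / beta)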
- class mssm.src.python.exp_fam.Gaussian(link: ~mssm.src.python.exp_fam.Link = <mssm.src.python.exp_fam.Identity object>, scale: float = None)
Bases:
Family
Normal/Gaussian Family.
We assume: \(Y_i \sim N(\mu_i,\sigma)\) - i.e., each of the \(N\) observations is generated from a normally distributed RV with observation-specific mean and shared scale parameter \(\sigma\). Equivalent to the assumption that the observed residual vector - the difference between the model prediction and the observed data - should look like what could be expected from drawing \(N\) independent samples from a Normal with mean zero and standard deviation equal to \(\sigma\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
link (Link) – The link function to be used by the model of the mean of this family. By default set to the canonical identity link.
scale (float or None, optional) – Known scale parameter for this family - by default set to None so that the scale parameter is estimated.
- D(y, mu)
Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- Returns:
A N-dimensional vector containing the contribution of each data-point to the overall model deviance.
- Return type:
[float]
- V(mu)
Variance function for the Normal family.
Not really a function since the link between variance and mean of the RVs is assumed constant for this model.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – a N-dimensional vector of the model prediction/the predicted mean
- Returns:
a N-dimensional vector of 1s
- Return type:
[float]
- deviance(y, mu)
Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- Returns:
The model deviance.
- Return type:
float
- llk(y, mu, sigma=1)
log-probability of data under given model. Essentially the sum over all elements in the vector returned by the lp() method.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
sigma (float, optional) – The (estimated) sigma parameter, defaults to 1
- Returns:
The log-probability of observing all data under the current model.
- Return type:
float
- lp(y, mu, sigma=1)
Log-probability of observing every value in \(\mathbf{y}\) under their respective Normal with mean = \(\boldsymbol{\mu}\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
sigma (float, optional) – The (estimated) sigma parameter, defaults to 1
- Returns:
a N-dimensional vector containing the log-probability of observing each data-point under the current model.
- Return type:
[float]
- class mssm.src.python.exp_fam.Identity
Bases:
Link
Identity Link function. \(\boldsymbol{\mu}=\boldsymbol{\eta}\) and so this link is trivial.
- dy1(mu)
First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- dy2(mu)
Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).
References:
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- f(mu)
Canonical link for normal distribution with \(\boldsymbol{\eta} = \boldsymbol{\mu}\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- fi(eta)
For the identity link, \(\boldsymbol{\eta} = \boldsymbol{\mu}\), so the inverse is also just the identity. See Faraway (2016).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
eta ([float]) – The vector containing the model prediction corresponding to each observation.
- class mssm.src.python.exp_fam.InvGauss(link: ~mssm.src.python.exp_fam.Link = <mssm.src.python.exp_fam.LOG object>, scale: float = None)
Bases:
Family
Inverse Gaussian Family.
We assume: \(Y_i \sim IG(\mu_i,\phi)\). The Inverse Gaussian distribution is usually not expressed in terms of the mean and scale (\(\phi\)) parameter but rather in terms of a shape and scale parameter - called \(\nu\) and \(\lambda\) respectively (see the scipy implementation). We can simply set \(\nu=\mu\) (compare the scipy density to the one in table 3.1 of Wood, 2017). Wood (2017) shows that \(\phi=1/\lambda\), so this provides \(\lambda=1/\phi\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.invgauss.html
- Parameters:
link (Link) – The link function to be used by the model of the mean of this family. By default set to the log link.
scale (float or None, optional) – Known scale parameter for this family - by default set to None so that the scale parameter is estimated.
- D(y, mu)
Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- Returns:
A N-dimensional vector containing the contribution of each data-point to the overall model deviance.
- Return type:
[float]
- V(mu)
Variance function for the Inverse Gaussian family.
The variance of random variable \(Y\) is proportional to its mean raised to the third power.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – a N-dimensional vector of the model prediction/the predicted mean
- Returns:
mu raised to the power of 3
- Return type:
[float]
- deviance(y, mu)
Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- Returns:
The model deviance.
- Return type:
float
- llk(y, mu, scale=1)
log-probability of data under given model. Essentially the sum over all elements in the vector returned by the lp() method.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
scale (float, optional) – The (estimated) scale parameter, defaults to 1
- Returns:
The log-probability of observing all data under the current model.
- Return type:
float
- lp(y, mu, scale=1)
Log-probability of observing every value in \(\mathbf{y}\) under their respective inverse Gaussian with mean = \(\boldsymbol{\mu}\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observed value.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
scale (float, optional) – The (estimated) scale parameter, defaults to 1
- Returns:
a N-dimensional vector containing the log-probability of observing each data-point under the current model.
- Return type:
[float]
- class mssm.src.python.exp_fam.LOG
Bases:
Link
Log Link function. \(log(\boldsymbol{\mu}) = \boldsymbol{\eta}\).
- dy1(mu)
First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017).
- References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- dy2(mu)
Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).
References:
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- f(mu)
Non-canonical link for Gamma distribution with \(log(\boldsymbol{\mu}) = \boldsymbol{\eta}\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- fi(eta)
For the log link, \(\boldsymbol{\eta} = log(\boldsymbol{\mu})\), so \(exp(\boldsymbol{\eta})=\boldsymbol{\mu}\). see Faraway (2016)
- References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
eta ([float]) – The vector containing the model prediction corresponding to each observation.
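For reference, the four operations of this link are simple element-wise transforms; a minimal numpy sketch (not the library's implementation):
import numpy as np

mu = np.array([0.5, 1.0, 2.0])

eta = np.log(mu)       # f(mu): map the mean to the linear predictor
mu_back = np.exp(eta)  # fi(eta): invert the link
dy1 = 1 / mu           # first derivative of f with respect to mu
dy2 = -1 / mu**2       # second derivative of f with respect to mu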
- class mssm.src.python.exp_fam.LOGb(b)
Bases:
Link
Log + b Link function. \(log(\boldsymbol{\mu} + b) = \boldsymbol{\eta}\).
- Parameters:
b (float) – The constant to add to \(\mu\) before taking the log.
- dy1(mu)
First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017).
- References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- dy2(mu)
Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).
References:
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- f(mu)
\(log(\boldsymbol{\mu} + b) = \boldsymbol{\eta}\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- fi(eta)
For the logb link, \(\boldsymbol{\eta} = log(\boldsymbol{\mu} + b)\), so \(exp(\boldsymbol{\eta})-b =\boldsymbol{\mu}\)
- References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
eta ([float]) – The vector containing the model prediction corresponding to each observation.
- class mssm.src.python.exp_fam.Link
Bases:
object
Link function base class. To be implemented by any link function used for GAMMs and GAMMLSS models. Only links used by GAMMLSS models require implementing the dy2 function. Note that care must be taken that every method returns only valid values. Specifically, no returned element may be numpy.nan or numpy.inf.
- dy1(mu)
First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- dy2(mu)
Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).
References:
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- f(mu)
Link function \(f()\) mapping mean \(\boldsymbol{\mu}\) of an exponential family to the model prediction \(\boldsymbol{\eta}\), so that \(f(\boldsymbol{\mu}) = \boldsymbol{\eta}\). see Wood (2017, 3.1.2) and Faraway (2016).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- fi(eta)
Inverse of the link function mapping \(\boldsymbol{\eta} = f(\boldsymbol{\mu})\) to the mean \(fi(\boldsymbol{\eta}) = fi(f(\boldsymbol{\mu})) = \boldsymbol{\mu}\). See Faraway (2016) and the Link.f function.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
eta ([float]) – The vector containing the model prediction corresponding to each observation.
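To illustrate the contract imposed by this base class, a hypothetical identity link could be implemented as follows (a sketch only; any built-in identity link in mssm may be structured differently):
import numpy as np
from mssm.src.python.exp_fam import Link

class Identity(Link):
    # f(mu) = mu, so eta = mu.
    def f(self, mu):
        return mu

    # fi(eta) = eta: the identity map is its own inverse.
    def fi(self, eta):
        return eta

    # First derivative of f with respect to mu: constant 1.
    def dy1(self, mu):
        return np.ones_like(mu)

    # Second derivative: constant 0 (finite everywhere, as required).
    def dy2(self, mu):
        return np.zeros_like(mu)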
- class mssm.src.python.exp_fam.Logit
Bases:
Link
Logit Link function, which is canonical for the binomial model. \(\boldsymbol{\eta}\) = log-odds of success.
- dy1(mu)
First derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for Fisher scoring/PIRLS (Wood, 2017):
\[f(\mu) = log(\mu / (1 - \mu)) = log(\mu) - log(1 - \mu)\]
\[\partial f(\mu)/ \partial \mu = 1/\mu + 1/(1 - \mu)\]
Faraway (2016) simplifies this to: \(\partial f(\mu)/ \partial \mu = 1 / (\mu - \mu^2) = 1/((1-\mu)\mu)\)
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- dy2(mu)
Second derivative of \(f(\boldsymbol{\mu})\) with respect to \(\boldsymbol{\mu}\). Needed for GAMMLSS models (Wood, 2017).
References:
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- f(mu)
Canonical link for binomial distribution with \(\boldsymbol{\mu}\) holding the probabilities of success, so that the model prediction \(\boldsymbol{\eta}\) is equal to the log-odds.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- fi(eta)
For the logit link and the binomial model, \(\boldsymbol{\eta}\) = log-odds, so the inverse to go from \(\boldsymbol{\eta}\) to \(\boldsymbol{\mu}\) is \(\boldsymbol{\mu} = exp(\boldsymbol{\eta}) / (1 + exp(\boldsymbol{\eta}))\). see Faraway (2016)
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
eta ([float]) – The vector containing the model prediction corresponding to each observation.
- class mssm.src.python.exp_fam.MULNOMLSS(pars: int)
Bases:
GAMLSSFamily
Family for a Multinomial GAMMLSS model (Rigby & Stasinopoulos, 2005).
This Family assumes that each observation \(y_i\) corresponds to one of \(K\) classes (labeled as 0, …, \(K-1\)) and reflects a realization of an independent RV \(Y_i\) with an observation-specific probability mass function defined over the \(K\) classes. These \(K\) probabilities - that \(Y_i\) takes on class 0, …, \(K-1\) - are modeled as additive combinations of smooth functions of covariates and other parametric terms.
As an example, consider a visual search experiment where \(K-1\) distractors are presented on a computer screen together with a single target and subjects are instructed to find the target and fixate it. With a Multinomial model we can estimate how the probability of looking at each of the \(K\) stimuli on the screen changes (smoothly) over time and as a function of other predictor variables of interest (e.g., contrast of stimuli, depending on whether participants are instructed to be fast or accurate).
References:
Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
- Parameters:
pars (int) – K-1, i.e., the number of classes minus 1; equivalently, the number of linear predictors.
- get_resid(y, *mus)
Get standardized residuals for a GAMMLSS model (Rigby & Stasinopoulos, 2005).
Any implementation of this function should return a vector that looks like what could be expected from taking len(y) independent draws from \(N(0,1)\).
References:
Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observed value.
mus ([[float]]) – A list including self.n_par lists - one for each parameter of the distribution. Each of those lists contains the expected value for a particular parameter for each of the N observations.
- Returns:
an N-dimensional vector containing the standardized residual for each data-point under the current model.
- Return type:
[float]
- llk(y, *mus)
Log-probability of the data under the given model. Essentially the sum over all elements in the vector returned by the lp() method.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observed class, every element must be larger than or equal to 0 and smaller than self.n_par + 1.
mus ([[float]]) – A list containing K-1 (self.n_par) lists, each containing the non-normalized probabilities of observing class k for every observation.
- Returns:
The log-probability of observing all data under the current model.
- Return type:
float
- lp(y, *mus)
Log-probability of observing class k under current model.
Our DV consists of K classes, but we essentially enforce a sum-to-zero constraint on the DV so that we end up modeling only K-1 (non-normalized) probabilities of observing class k (for all k except k==K) as an additive combination of smooth functions of our covariates and other parametric terms. The probability of observing class K, as well as the normalized probabilities of observing each other class, can readily be computed from these K-1 non-normalized probabilities. This is explained quite well on Wikipedia (see refs).
Specifically, the probability of the outcome being class k is simply:
\(p(Y_i == k) = \mu_k / (1 + \sum_j^{K-1} \mu_j)\) where \(\mu_k\) is the aforementioned non-normalized probability of observing class \(k\) - which is simply set to 1 for class \(K\) (this follows from the sum-to-zero constraint; see Wikipedia).
So, the log-prob of the outcome being class k is:
\(log(p(Y_i == k)) = log(\mu_k) - log(1 + \sum_j^{K-1} \mu_j)\)
References:
Wikipedia. https://en.wikipedia.org/wiki/Multinomial_logistic_regression
gamlss.dist on Github (see Rigby & Stasinopoulos, 2005). https://github.com/gamlss-dev/gamlss.dist/blob/main/R/MN4.R
Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.
- Parameters:
y ([float]) – The vector containing each observed class, every element must be larger than or equal to 0 and smaller than self.n_par + 1.
mus ([[float]]) – A list containing K-1 (self.n_par) lists, each containing the non-normalized probabilities of observing class k for every observation.
- Returns:
an N-dimensional vector containing the log-probability of observing each data-point under the current model.
- Return type:
[float]
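The computation described above is easy to express directly; a small numpy sketch (independent of mssm's internals) of the per-observation log-probability given the K-1 non-normalized probabilities:
import numpy as np

def mulnom_lp(y, *mus):
    # mus holds K-1 arrays with the non-normalized probability of the
    # first K-1 classes; the last class has its probability fixed at 1.
    mu_mat = np.stack(mus, axis=1)                            # (N, K-1)
    denom = 1 + mu_mat.sum(axis=1)                            # 1 + sum_j mu_j
    mu_full = np.concatenate([mu_mat, np.ones((len(y), 1))], axis=1)
    mu_k = mu_full[np.arange(len(y)), y.astype(int)]          # mu for observed class
    return np.log(mu_k) - np.log(denom)

y = np.array([0, 2, 1])                                       # observed classes, K = 3
mu1, mu2 = np.array([2.0, 0.5, 1.0]), np.array([1.0, 0.2, 3.0])
print(mulnom_lp(y, mu1, mu2))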
- class mssm.src.python.exp_fam.Poisson(link: ~mssm.src.python.exp_fam.Link = <mssm.src.python.exp_fam.LOG object>)
Bases:
Family
Poisson Family.
We assume: \(Y_i \sim P(\lambda)\). We can simply set \(\lambda=\mu\) (compare scipy density to the one in table 3.1 of Wood, 2017) and treat the scale parameter of a GAMM (\(\phi\)) as fixed/known at 1.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html
- Parameters:
link (Link) – The link function to be used by the model of the mean of this family. By default set to the log link.
- D(y, mu)
Contribution of each observation to model Deviance (Wood, 2017; Faraway, 2016)
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- Returns:
An N-dimensional vector containing the contribution of each data-point to the overall model deviance.
- Return type:
[float]
- V(mu)
Variance function for the Poisson family.
The variance of random variable \(Y\) is proportional to its mean.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
mu ([float]) – an N-dimensional vector of the model prediction/the predicted mean
- Returns:
mu
- Return type:
[float]
- deviance(y, mu)
Deviance of the model under this family: 2 * (llk_max - llk_c) * scale (Wood, 2017; Faraway, 2016).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- Returns:
The model deviance.
- Return type:
float
- init_mu(y)
Function providing initial \(\boldsymbol{\mu}\) vector for Poisson GAMM.
We shrink extreme observed counts towards the mean.
- Parameters:
y ([float]) – The vector containing each observation.
- llk(y, mu)
Log-probability of the data under the given model. Essentially the sum over all elements in the vector returned by the lp() method.
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observation.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- Returns:
The log-probability of observing all data under the current model.
- Return type:
float
- lp(y, mu)
Log-probability of observing every value in \(\mathbf{y}\) under their respective Poisson with mean = \(\boldsymbol{\mu}\).
References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
y ([float]) – The vector containing each observed value.
mu ([float]) – The vector containing the predicted mean for the response distribution corresponding to each observation.
- Returns:
an N-dimensional vector containing the log-probability of observing each data-point under the current model.
- Return type:
[float]
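Since \(\phi\) is fixed at 1, the log-probabilities here are just Poisson log-pmf values; a short sketch using the scipy density referenced above:
import numpy as np
from scipy.stats import poisson

y = np.array([0, 3, 7])
mu = np.array([1.0, 2.5, 6.0])

lp = poisson.logpmf(y, mu)  # log-probability per observation
llk = lp.sum()              # the sum that llk(y, mu) is documented to return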
- class mssm.src.python.exp_fam.PropHaz(ut, r)
Bases:
GENSMOOTHFamily
Family for proportional Hazard model - a type of General Smooth model as discussed by Wood, Pya, & Säfken (2016).
Based on Supplementary materials G in Wood, Pya, & Säfken (2016).
References:
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
Nocedal & Wright (2006). Numerical Optimization. Springer New York.
- Parameters:
ut (numpy.array) – Unique event time vector (each time represented as int) as described by WPS (2016), holding unique event times in decreasing order.
r (numpy.array) – Index vector as described by WPS (2016), holding for each data-point (i.e., for each row in Xs[0]) the index to its corresponding event time in ut.
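A minimal sketch of how these two inputs could be constructed from a vector of (integer) event times; the variable names here are illustrative, not part of the API:
import numpy as np

# Hypothetical integer event times, one per row of the model matrix.
times = np.array([5, 3, 5, 8, 3, 1])

# Unique event times in decreasing order, as required for ut.
ut = np.unique(times)[::-1]

# For each data-point, the index of its event time within ut.
time_to_idx = {t: i for i, t in enumerate(ut)}
r = np.array([time_to_idx[t] for t in times])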
- get_baseline_hazard(coef, delta, Xs)
Get the cumulative baseline hazard function + variance estimate as defined by Wood, Pya, & Säfken (2016).
The function is evaluated for all k unique event times that were available in the data.
References:
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
- Parameters:
coef (numpy.array) – Coefficient vector as numpy.array of shape (-1,1).
delta (numpy.array) – Dependent variable passed to mssm.src.python.formula.Formula(), holding (for each row in Xs[0]) a value in {0,1} that indicates whether the event was observed for that observation or not.
Xs ([scipy.sparse.csc_array]) – The list of model matrices (here holding a single model matrix) obtained from mssm.models.GSMM.get_mmat().
- Returns:
Two arrays; the first holds the k cumulative baseline hazard estimates, the second holds the k variance estimates corresponding to each of them.
- Return type:
(numpy.array, numpy.array)
- get_survival(coef, Xs, delta, t, x, V)
Compute the survival function + variance at time-point t, given k optional covariate vector(s) x, as defined by Wood, Pya, & Säfken (2016).
References:
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
- Parameters:
coef (numpy.array) – Coefficient vector as numpy.array of shape (-1,1).
Xs ([scipy.sparse.csc_array]) – The list of model matrices (here holding a single model matrix) obtained from mssm.models.GSMM.get_mmat().
delta (numpy.array) – Dependent variable passed to mssm.src.python.formula.Formula(), holding (for each row in Xs[0]) a value in {0,1} that indicates whether the event was observed for that observation or not.
t (int) – Time-point at which to evaluate the survival function.
x (numpy.array) – Optional vector (or matrix) of covariate values. Needs to be of shape (k,len(coef)).
V (scipy.sparse.csc_array) – Estimated co-variance matrix of the posterior for coef.
- Returns:
Two arrays; the first holds the k survival function estimates, the second holds the k variance estimates for each of the survival function estimates.
- Return type:
(numpy.array, numpy.array)
- gradient(coef, coef_split_idx, delta, Xs)
Gradient as defined by Wood, Pya, & Säfken (2016).
References:
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
delta (passed as the dependent variable) holds values in {0,1}, indicating whether the event was observed or not.
- hessian(coef, coef_split_idx, delta, Xs)
Hessian as defined by Wood, Pya, & Säfken (2016).
References:
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
delta (passed as the dependent variable) holds values in {0,1}, indicating whether the event was observed or not.
- init_coef(models)
Function to initialize the coefficients of the model.
- Parameters:
models ([mssm.models.GAMM]) – A list of mssm.models.GAMM models, each based on one of the formulas provided to the model.
- Returns:
A numpy.array of shape (-1,1), holding initial values for all model coefficients.
- Return type:
numpy.array
- llk(coef, coef_split_idx, delta, Xs)
Log-likelihood function as defined by Wood, Pya, & Säfken (2016).
References:
Wood, Pya, & Säfken (2016). Smoothing Parameter and Model Selection for General Smooth Models.
delta (passed as the dependent variable) holds values in {0,1}, indicating whether the event was observed or not.
- mssm.src.python.exp_fam.est_scale(res, rows_X, total_edf)
Scale estimate from Wood & Fasiolo (2017).
- References:
Wood, S. N., & Fasiolo, M. (2017). A generalized Fellner-Schall method for smoothing parameter optimization with application to Tweedie location, scale and shape models.
- Parameters:
res ([float]) – The vector containing the difference between the model prediction and the (pseudo) data.
rows_X (int) – The number of observations collected.
total_edf (float) – The expected degrees of freedom for the model.
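Given the parameters above, a natural sketch of such a scale estimate is the sum of squared residuals over the residual degrees of freedom (an assumption for illustration; the exact weighting in mssm may differ):
import numpy as np

def est_scale_sketch(res, rows_X, total_edf):
    # Sum of squared residuals divided by residual degrees of freedom (N - EDF).
    return np.sum(res**2) / (rows_X - total_edf)

res = np.array([0.1, -0.4, 0.3, 0.2])
print(est_scale_sketch(res, rows_X=4, total_edf=1.5))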
mssm.src.python.formula module
- class mssm.src.python.formula.Formula(lhs: ~mssm.src.python.formula.lhs, terms: [<class 'mssm.src.python.terms.GammTerm'>], data: ~pandas.core.frame.DataFrame, series_id: str = None, codebook: dict = None, print_warn=True, keep_cov=False, file_paths=[], file_loading_nc=1, file_loading_kwargs: dict = {'header': 0, 'index_col': False})
Bases:
object
The formula of a regression equation.
- Parameters:
lhs – The lhs object defining the dependent variable.
terms ([GammTerm]) – A list of the terms which should be added to the model. See mssm.src.python.terms for info on which terms can be added.
data (pd.DataFrame or None) – A pandas dataframe (with header!) of the data which should be used to estimate the model. The variable specified for lhs as well as all variables included for a term in terms need to be present in the data, otherwise the call to Formula will throw an error.
series_id (str, optional) – A string identifying the individual experimental units. Usually a unique trial identifier. Only necessary if approximate derivative computations are to be utilized for random smooth terms.
codebook (dict or None) – Codebook - keys should correspond to factor variable names specified in terms. Values should again be a dict, with a key for each of the K levels of the factor and values corresponding to an integer in {0,K}.
print_warn (bool,optional) – Whether warnings should be printed. Useful when fitting models from terminal. Defaults to True.
keep_cov (bool,optional) – Whether or not the internal encoding structure of all predictor variables should be created when forming \(\mathbf{X}^T\mathbf{X}\) iteratively instead of forming \(\mathbf{X}\) directly. Can speed up estimation but increases memory footprint. Defaults to False.
file_paths ([str],optional) – A list of paths to .csv files from which \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) should be created iteratively. Setting this to a non-empty list will prevent fitting \(\mathbf{X}\) as a whole. data should then be set to None. Defaults to an empty list.
file_loading_nc (int,optional) – How many cores to use to a) accumulate \(\mathbf{X}\) in parallel (if data is not None and file_paths is an empty list) or b) to accumulate \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) (and \(\mathbf{\eta}\) during estimation) (if data is None and file_paths is a non-empty list). For case b, this should really be set to the maximum number of cores available. For case a, this only really speeds up accumulating \(\mathbf{X}\) if \(\mathbf{X}\) has very many columns and/or rows. Defaults to 1.
file_loading_kwargs (dict,optional) – Any key-word arguments to pass to pandas.read_csv when \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) should be created iteratively (if data is None and file_paths is a non-empty list). Defaults to {"header":0,"index_col":False}.
- Variables:
coef_per_term ([int]) – A list containing the number of coefficients corresponding to each term included in terms. Initialized at construction.
coef_names ([str]) – A list containing a named identifier (e.g., “Intercept”) for each coefficient estimated by the model. Initialized at construction.
n_coef (int) – The number of coefficients estimated by the model in total. Initialized at construction.
unpenalized_coef (int) – The number of un-penalized coefficients estimated by the model. Initialized at construction.
y_flat ([float] or None) – An array containing all values of the dependent variable (i.e., specified by lhs.variable) in the order of the data-frame passed to data. This variable will be initialized at construction, but only if file_paths=None, i.e., in case \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) are not created iteratively.
cov_flat ([float] or None) – An array containing all (encoded, in case of categorical predictors) values of each predictor variable included in any of the terms (each column of cov_flat corresponds to a different predictor), in the order of the data-frame passed to data. This variable will be initialized at construction, but only if file_paths=None, i.e., in case \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) are not created iteratively.
NOT_NA_flat ([bool] or None) – An array containing an indication, for each value of the dependent variable (i.e., specified by lhs.variable), of whether the corresponding value is not a number (“NA”) or not. In the order of the data-frame passed to data. This variable will be initialized at construction, but only if file_paths=None, i.e., in case \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) are not created iteratively.
- build_penalties()
Builds the penalties required by
self.__terms
. Called automatically whenever needed. Call manually only for testing.
- encode_data(data, prediction=False)
Encodes
data
, which needs to be apd.DataFrame
and by default (ifprediction==False
) builds an index of which rows indata
are NA in the column of the dependent variable described byself.lhs
.- Parameters:
data (pd.DataFrame) – The data to encode.
prediction (bool, optional) – Whether or not a NA index and a column for the dependent variable should be generated.
- Returns:
A tuple with 7 (optional) entries: the dependent variable described by self.__lhs; the encoded predictor variables as a (N,k) array (the number of rows matches the number of rows of the first entry returned, the number of columns matches the number of k variables present in the formula); an indication for each row whether the dependent variable described by self.__lhs is NA; like the first entry but split into a list of lists by self.series_id; like the second entry but split into a list of lists by self.series_id; like the third entry but split into a list of lists by self.series_id; start and end points for the splits used to split the previous three elements (identifying the start and end point of every level of self.series_id).
).- Return type:
(numpy.array or None, numpy.array or None, numpy.array or None, [numpy.array] or None, [numpy.array] or None, [numpy.array] or None, numpy.array or None)
- get_coding_factors() dict
Get a copy of the factor coding dictionary. Keys are factor variables in the data, values are dictionaries, where the keys correspond to the encoded levels (int) of the factor and the values to their levels (str).
- get_data() DataFrame
Get a copy of the
data
specified for this formula.
- get_depvar() list
Get a copy of the encoded dependent variable (defined via
self.__lhs
).
- get_factor_codings() dict
Get a copy of the factor coding dictionary. Keys are factor variables in the data, values are dictionaries, where the keys correspond to the levels (str) of the factor and the values to their encoded levels (int).
- get_factor_levels() dict
Get a copy of the factor levels dictionary. Keys are factor variables in the data, values are np.arrays holding the unique levels (as str) of the corresponding factor.
- get_ir_smooth_term_idx() list[int]
Get a copy of the list of indices that identify impulse response terms in
self.__terms
.
- get_linear_term_idx() list[int]
Get a copy of the list of indices that identify linear terms in
self.__terms
.
- get_n_coef() int
Get the number of coefficients that are implied by the formula.
- get_notNA() list
Get a copy of the encoded ‘not a NA’ vector for the dependent variable (defined via
self.__lhs
).
- get_penalties() list
Get a copy of the penalties implied by the formula. Will be None if the penalties have not been initialized yet.
- get_random_term_idx() list[int]
Get a copy of the list of indices that identify random terms in
self.__terms
.
- get_smooth_term_idx() list[int]
Get a copy of the list of indices that identify smooth terms in
self.__terms
.
- get_subgroup_variables() list
Returns a copy of sub-group variables for factor smooths.
- get_term_names() list
Returns a copy of the list with the names of the terms specified for this formula.
- get_var_map() dict
Get a copy of the var map dictionary. Keys are variables in the data, values their column index in the encoded predictor matrix returned by
self.encode_data
.
- get_var_maxs() dict
Get a copy of the var maxs dictionary. Keys are variables in the data, values are either the maximum value the variable takes on in
self.__data
for continuous variables orNone
for categorical variables.
- get_var_mins() dict
Get a copy of the var mins dictionary. Keys are variables in the data, values are either the minimum value the variable takes on in
self.__data
for continuous variables orNone
for categorical variables.
- get_var_mins_maxs() -> (<class 'dict'>, <class 'dict'>)
Get a tuple containing copies of both the mins and maxs dictionaries. See
self.get_var_mins
andself.get_var_maxs
.
- get_var_types() dict
Get a copy of the var types dictionary. Keys are variables in the data, values are either
VarType.NUMERIC
for continuous variables orVarType.FACTOR
for categorical variables.
- has_intercept() bool
Does this formula include an intercept or not.
- has_ir_terms() bool
Does this formula include impulse response terms or not.
- class mssm.src.python.formula.lhs(variable: str, f: Callable = None)
Bases:
object
The Left-hand side of a regression equation.
- Parameters:
variable (str) – The dependent variable. Can point to continuous and categorical variables.
f (Callable, optional) – A function that will be applied to the
variable
before fitting. For example: np.log(). By default no function is applied to thevariable
.
- mssm.src.python.formula.reparam(X, S, cov, option=1, n_bins=30, QR=False, identity=False, scale=False)
Options 1 - 3 correspond to the natural reparameterization discussed in Wood (2017; 5.4.2), with different strategies for the QR computation of \(\mathbf{X}\). Option 4 helps with stabilizing the REML computation and is from Wood (2011) and section 6.2.7 in Wood (2017):
Form complete matrix \(\mathbf{X}\) based on entire covariate.
Form matrix \(\mathbf{X}\) only based on unique covariate values.
Form matrix \(\mathbf{X}\) only based on a sample of the values making up the covariate. The covariate is split up into n_bins equally wide bins. The number of covariate values per bin is then calculated. Subsequently, the ratio relative to the minimum bin size is computed and each ratio is rounded to the nearest integer. Then ratio samples are obtained from each bin. That way, imbalance in the covariate is approximately preserved when forming the QR.
Transform term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) based on Wood (2011) and section 6.2.7 in Wood (2017) so that they are full-rank and their log-determinant can be computed safely. In that case, only S needs to be provided, and it has to be a list holding the penalties to be transformed.
For Options 1-3:
If QR==True then \(\mathbf{X}\) is decomposed into \(\mathbf{Q}\mathbf{R}\) directly via QR decomposition. Alternatively, we first form \(\mathbf{X}^T\mathbf{X}\) and then compute the Cholesky \(\mathbf{L}\) of this product - note that \(\mathbf{L}^T = \mathbf{R}\). Overall the latter strategy is much faster (in particular if option==1), but the increased loss of precision in \(\mathbf{L}^T = \mathbf{R}\) might not be acceptable for some applications.
After transformation, S only contains elements on its diagonal and \(\mathbf{X}\) holds the transformed functions. As discussed in Wood (2017), the transformed functions are decreasingly flexible - so the elements on the diagonal of \(\mathbf{S}\) become smaller, and eventually zero for elements that are in the kernel of the original \(\mathbf{S}\) (un-penalized == not flexible).
For a similar transformation (based solely on \(\mathbf{S}\)), Wood et al. (2013) show how to further reduce the diagonally transformed \(\mathbf{S}\) to an even simpler identity penalty. As discussed also in Wood (2017), the same behavior of decreasing flexibility, if all entries on the diagonal of \(\mathbf{S}\) are 1, can only be maintained if the transformed functions are multiplied by a weight related to their wiggliness. Specifically, more flexible functions need to become smaller in amplitude - so that, for the same level of penalization, they are removed earlier than less flexible ones. To achieve this, Wood further post-multiplies the transformed matrix \(\mathbf{X}'\) with a matrix that contains on its diagonal the reciprocal of the square root of the transformed penalty matrix (and 1s in the last cells corresponding to the kernel). This is done here if identity=True.
In mgcv the transformed model matrix and penalty can optionally be scaled by the root mean square value of the transformed model matrix (see the nat.param function in mgcv). This is done here if scale=True.
For Option 4:
Option 4 enforces re-parameterization of term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) based on Wood (2011) and section 6.2.7 in Wood (2017). In mssm multiple penalties can be placed on individual terms (i.e., tensor terms, random smooths, Kernel penalty), but it is not always the case that the term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) - i.e., the sum over all those individual penalties multiplied with their \(\lambda\) parameters - is of full rank. If we need to form the inverse of the term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) this is problematic. It is also problematic, as discussed by Wood (2011), if the different \(\lambda\) are all of different magnitude, in which case forming the term-specific \(log(|\mathbf{S}_{\boldsymbol{\lambda}}|+)\) becomes numerically difficult.
The re-parameterization implemented by option 4, based on Appendix B in Wood (2011), solves these issues. After this re-parameterization, a term-specific \(\mathbf{S}_{\boldsymbol{\lambda}}\) has been formed that is full rank. And \(log(|\mathbf{S}_{\boldsymbol{\lambda}}|)\) - no longer just a generalized determinant - can be computed without running into numerical problems.
The strategy by Wood (2011) is more general and could be applied to form an overall - not just term-specific - \(\mathbf{S}_{\boldsymbol{\lambda}}\) with these properties. However, in mssm penalties currently cannot overlap, so this is not necessary at the moment.
- References:
Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.
Wood, S. N., Scheipl, F., & Faraway, J. J. (2013). Straightforward intermediate rank tensor product smoothing in mixed models.
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
mgcv source code (accessed 2024). smooth.R file, nat.param function.
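To make the idea behind Options 1 - 3 concrete, a small dense numpy sketch of the natural reparameterization itself (unoptimized and independent of mssm's actual implementation, which operates on sparse matrices):
import numpy as np

def natural_reparam(X, S):
    # QR of X: R satisfies R.T @ R = X.T @ X.
    R = np.linalg.qr(X, mode='r')
    Rinv = np.linalg.inv(R)
    # Eigendecompose the penalty expressed in the new basis.
    lam, U = np.linalg.eigh(Rinv.T @ S @ Rinv)
    # Order so that the most flexible (most penalized) directions come first,
    # matching the decreasing diagonal described above.
    idx = np.argsort(lam)[::-1]
    lam, U = lam[idx], U[:, idx]
    Xt = X @ Rinv @ U   # transformed model matrix (orthonormal columns)
    St = np.diag(lam)   # transformed penalty: diagonal, decreasing
    return Xt, St

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
D = np.diff(np.eye(5), n=2, axis=0)   # root of an order-2 difference penalty
Xt, St = natural_reparam(X, D.T @ D)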
mssm.src.python.gamm_solvers module
- class mssm.src.python.gamm_solvers.Fit_info(lambda_updates: int = 0, iter: int = 0, code: int = 1, eps: float = None, K2: float = None, dropped: [<class 'int'>] = None)
Bases:
object
Holds information related to convergence (speed) for GAMMs, GAMMLSS, and GSMMs.
- Variables:
lambda_updates (int) – The total number of lambda updates computed during estimation. Initialized with 0.
iter (int) – The number of outer iterations (a single outer iteration can involve multiple lambda updates) completed during estimation. Initialized with 0.
code (int) – Convergence status. Anything above 0 indicates that the model did not converge and estimates should be considered carefully. Initialized with 1.
eps (float) – The fraction added to the last estimate of the negative Hessian of the penalized likelihood during GAMMLSS or GSMM estimation. If this is not 0 - the model should not be considered as converged, irrespective of what
code
indicates. This most likely implies that the model is not identifiable. Initialized withNone
and ignored for GAMM estimation.K2 (float) – An estimate for the condition number of matrix
A
, whereA.T@A=H
andH
is the final estimate of the negative Hessian of the penalized likelihood. Only available ifcheck_cond>0
whenmodel.fit()
is called for any model (i.e., GAMM, GAMMLSS, GSMM). Initialized withNone
dropped ([int]) – The final set of coefficients dropped during GAMMLSS/GSMM estimation when using method="QR/Chol", or None in which case no coefficients were dropped. Initialized with None.
- K2: float = None
- code: int = 1
- dropped: [<class 'int'>] = None
- eps: float = None
- iter: int = 0
- lambda_updates: int = 0
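Assuming a fitted model exposes this object (the attribute name info below is an assumption for illustration, not a documented API), convergence could be checked along these lines:
# model = GAMM(formula, Gaussian()); model.fit(check_cond=1)
info = model.info  # hypothetical attribute holding the Fit_info instance

if info.code > 0:
    print(f"No convergence after {info.iter} outer iterations "
          f"({info.lambda_updates} lambda updates).")
if info.eps is not None and info.eps != 0:
    print("Hessian was perturbed (eps > 0); the model may not be identifiable.")
if info.K2 is not None:
    print(f"Condition number estimate: {info.K2}")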
mssm.src.python.penalties module
- class mssm.src.python.penalties.LambdaTerm(S_J: ~scipy.sparse._csc.csc_array = None, S_J_emb: ~scipy.sparse._csc.csc_array = None, D_J_emb: ~scipy.sparse._csc.csc_array = None, rep_sj: int = 1, lam: float = 1.1, start_index: int = None, frozen: bool = False, type: ~mssm.src.python.penalties.PenType = None, rank: int = None, term: int = None, clust_series: [<class 'int'>] = None, clust_weights: [[<class 'float'>]] = None)
Bases:
object
\(\lambda\) storage term.
- Variables:
S_J (scipy.sparse.csc_array) – The penalty matrix associated with this lambda term. Note, in case multiple penalty matrices share the same lambda value, the
rep_sj
argument determines how many diagonal blocks we need to fill with this penalty matrix to getS_J_emb
. Initialized withNone
.S_J_emb (scipy.sparse.csc_array) – A zero-embedded version of the penalty matrix associated with this lambda term. Note, this matrix contains
rep_sj
diagonal sub-blocks each filled withS_J
. Initialized withNone
.D_J_emb (scipy.sparse.csc_array) – Root of
S_J_emb
, so thatD_J_emb@D_J_emb.T=S_J_emb
. Initialized withNone
.rep_sj (int) – How many sequential sub-blocks of
S_J_emb
need to be filled withS_J
. Useful if all levels of a categorical variable for which a separate smooth is to be estimated are assumed to share the same lambda value. Initialized with 1.lam (float) – The current estimate for \(\lambda\). Initialized with 1.1.
start_index (int) – The first row and column in the overall penalty matrix taken up by
S_J
. Initialized withNone
.type (PenType) – The type of this penalty term. Initialized with
None
.rank (int) – The rank of
S_J
. Initialized withNone
.term (int) – The index of the term in a
mssm.src.python.formula.Formula
with which this penalty is associated. Initialized withNone
.
- D_J_emb: csc_array = None
- S_J: csc_array = None
- S_J_emb: csc_array = None
- clust_series: [<class 'int'>] = None
- clust_weights: [[<class 'float'>]] = None
- frozen: bool = False
- lam: float = 1.1
- rank: int = None
- rep_sj: int = 1
- start_index: int = None
- term: int = None
- type: PenType = None
mssm.src.python.terms module
- class mssm.src.python.terms.GammTerm(variables: list[str], type: TermType, is_penalized: bool, penalty: list[PenType], pen_kwargs: list[dict])
Bases:
object
Base-class implemented by the terms passed to
mssm.src.python.formula.Formula
.- Parameters:
variables ([str]) – List of variables as strings.
type (TermType) – Type of term as enum
is_penalized (bool) – Whether the term is penalized/can be penalized or not
penalty ([penalties.PenType]) – The default penalties associated with a term.
pen_kwargs ([dict]) – A list of dictionaries, each with key-word arguments passed to the construction of the corresponding
penalties.PenType
inpenalty
.
- class mssm.src.python.terms.TermType(*values)
Bases:
Enum
- LINEAR = 3
- LSMOOTH = 1
- RANDINT = 4
- RANDSLOPE = 5
- SMOOTH = 2
- class mssm.src.python.terms.f(variables: list, by: str = None, by_cont: str = None, binary: list[str, str] = None, id: int = None, nk: int = 9, te: bool = False, rp: int = 0, constraint: ~mssm.src.python.constraints.ConstType = ConstType.QR, identifiable: bool = True, basis: ~collections.abc.Callable = <function B_spline_basis>, basis_kwargs: dict = {}, is_penalized: bool = True, penalize_null: bool = False, penalty: list[~mssm.src.python.penalties.PenType] = None, pen_kwargs: list[dict] = None)
Bases:
GammTerm
A univariate or tensor interaction smooth term. If
variables
only contains a single variable \(x\), this term will represent a univariate \(f(x)\) in a model:\[\mu_i = a + f(x_i)\]For example, the model below in
mgcv
:bam(y ~ s(x,k=10) + s(z,k=20))
would be expressed as follows in
mssm
:GAMM(Formula(lhs("y"),[i(),f(["x"],nk=9),f(["z"],nk=19)]),Gaussian())
If
variables
contains two variables \(x\) and \(z\), then this term will either represent the tensor interaction \(f(x,z)\) in model:\[\mu_i = a + f(x_i) + f(z_i) + f(x_i,z_i)\]or in model:
\[\mu_i = a + f(x_i,z_i)\]The first behavior is achieved by setting
te=False
. In that case it is necessary to add ‘main effect’f
terms for \(x\) and \(y\). In other words, the behavior then mimicks theti()
term available inmgcv
(Wood, 2017). Ifte=True
, the term instead behaves like ate()
term inmgcv
, so no separate smooth effects for the main effects need to be included.For example, the model below in
mgcv
:bam(y ~ te(x,z,k=10))
would be expressed as follows in
mssm
:GAMM(Formula(lhs("y"),[i(),f(["x","z"],nk=9,te=True)]),Gaussian())
In addition, the model below in
mgcv
:bam(y ~ s(x,k=10) + s(z,k=20) + ti(x,z,k=10))
would be expressed as follows in
mssm
:GAMM(Formula(lhs("y"),[i(),f(["x"],nk=9),f(["z"],nk=19),f(["x","z"],nk=9,te=False)]),Gaussian())
By default a B-spline basis is used with nk=9 basis functions (after removing identifiability constraints). This is equivalent to mgcv’s default behavior of using 10 basis functions (before removing identifiability constraints). In case variables contains more than one variable, nk can either be set to a single value or to a list containing the number of basis functions that should be used to set up the spline matrix for every variable. The former implies that the same number of coefficients should be used for all variables. Keyword arguments that change the computation of the spline basis can be passed along via a dictionary to the basis_kwargs argument. Importantly, if multiple variables are present and a list is passed to nk, a list of dictionaries with keyword arguments of the same length needs to be passed to basis_kwargs as well.
Multiple penalties can be placed on every term by adding penalties.PenType to the penalty argument. In case variables contains multiple variables, a separate tensor penalty (see Wood, 2017) will be created for every penalty included in penalty. Again, key-word arguments that alter the behavior of the penalty creation need to be passed as dictionaries to pen_kwargs for every penalty included in penalty. By default, a univariate term is penalized with a difference penalty of order 2 (Eilers & Marx, 2010).
References:
Eilers, P., & Marx, B. (2010). Splines, knots, and penalties. https://doi.org/10.1002/WICS.125
Marra, G., & Wood, S. N. (2011). Practical variable selection for generalized additive models. Computational Statistics & Data Analysis, 55(7), 2372–2387. https://doi.org/10.1016/j.csda.2011.02.004
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
variables (list[str]) – A list of the variables (strings) of which the term is a function. Need to exist in
data
passed toFormula
. Need to be continuous.by (str, optional) – A string corresponding to a factor in
data
passed toFormula
. Separate f(variables
) (and smoothness penalties) will be estimated per level ofby
.by_cont (str, optional) – A string corresponding to a numerical variable in
data
passed toFormula
. The model matrix for the estimated smooth term f(variables
) will be multiplied by the column of this variable. Can be used to estimate ‘varying coefficient’ models but also to set up binary smooths or to only estimate a smooth term for specific levels of a factor (i.e., what is possible for ordered factors in R & mgcv).binary ([str,str], optional) – A list containing two strings. The first string corresponds to a factor in
data
passed toFormula
. A separate f(variables
) will be estimated for the level of this factor corresponding to the second string.id (int, optional) – Only useful in combination with specifying a
by
variable. Ifid
is set to any integer the penalties placed on the separate f(variables
) will share a single smoothness penalty.nk (int or list[int], optional) – Number of basis functions to use. Even if
identifiable
is true, this number will reflect the final number of basis functions for this term (i.e., mssm acts like you would have asked for 10 basis functions ifnk=9
and identifiable=True; the default).te (bool, optional) – For tensor interaction terms only. If set to false, the term mimics the behavior of
ti()
in mgcv (Wood, 2017). Otherwise, the term behaves like ate()
term in mgcv - i.e., the marginal basis functions are not removed from the interaction.rp (int, optional) – Experimental - will currently break for tensor smooths or in case
by
is provided. Whether or not to re-parameterize the term - seemssm.src.python.formula.reparam()
for details. Defaults to no re-parameterization.constraint (mssm.src.constraints.ConstType, optional) – What kind of identifiability constraints should be absorbed by the terms (if they are to be identifiable). Either QR-based constraints (default, well-behaved), by means of column-dropping (no infill, not so well-behaved), or by means of difference re-coding (little infill, not so well behaved either).
identifiable (bool, optional) – Whether or not the constant should be removed from the space of functions this term can fit. Achieved by enforcing that \(\mathbf{1}^T \mathbf{X} = 0\) (\(\mathbf{X}\) here is the spline matrix computed for the observed data; see Wood, 2017 for details). Necessary in most cases to keep the model identifiable.
basis (Callable, optional) – The basis functions to use to construct the spline matrix. By default a B-spline basis (Eilers & Marx, 2010) implemented in
mssm.src.smooths.B_spline_basis()
.basis_kwargs (dict, optional) – A list containing one or multiple dictionaries specifying how the basis should be computed. For the B-spline basis the following arguments (with default values) are available:
convolve``=``False
,min_c``=``None
,max_c``=``None
,deg``=``3
. Seemss.src.smooths.B_spline_basis()
for details, but the default should work for most cases.is_penalized (bool, optional) – Should the term be left unpenalized or not. There are rarely good reasons to set this to False.
penalize_null (bool, optional) – Should a separate Null-space penalty (Marra & Wood, 2011) be placed on the term. By default, the term here will leave a linear f(variables) un-penalized! Thus, there is no option for the penalty to achieve f(variables) = 0 even if that would be supported by the data. Adding a Null-space penalty provides the penalty with that power. This can be used for model selection instead of Hypothesis testing and is the preferred way in
mssm
(see Marra & Wood, 2011 for details).penalty (list[penalties.PenType], optional) – A list of penalty types to be placed on the term.
pen_kwargs (list[dict], optional) – A list containing one or multiple dictionaries specifying how the penalty should be created. For the default difference penalty (Eilers & Marx, 2010) the only keyword argument (with default value) available is:
m=2
. This reflects the order of the difference penalty. Note, that while a higherm
permits penalizing towards smoother functions it also leads to an increased dimensionality of the penalty Kernel (the set of bases functions which will not be penalized). In other words, increasingly more complex functions will be left un-penalized for higherm
(except ifpenalize_null
is set to True).m=2
is usually a good choice and thus the default but see Eilers & Marx (2010) for details.
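Putting several of these arguments together, a sketch of a by-factor smooth with a shared smoothness penalty and a null-space penalty (names as used in the examples above; the factor and variable names are invented):
# One smooth of "x" per level of the factor "cond"; id=1 lets all of them
# share a single smoothness penalty, and penalize_null=True allows the
# penalty to shrink each f(x) to zero entirely.
formula = Formula(lhs("y"),
                  [i(),
                   f(["x"], by="cond", id=1, nk=9, penalize_null=True)],
                  data=dat)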
- class mssm.src.python.terms.fs(variables: list, rf: str = None, nk: int = 9, m: int = 1, rp: int = 1, by_subgroup: [<class 'str'>, <class 'str'>] = None, approx_deriv: dict = None, basis: ~collections.abc.Callable = <function B_spline_basis>, basis_kwargs: dict = {})
Bases:
f
Essentially a
f
term withby=rf
,id != None
,penalize_null= True
,pen_kwargs = [{"m":1}]
, andrp=1
.This term approximates the “factor-smooth interaction” basis “fs” with
m= 1
available inmgcv
(Wood, 2017). For example, the term below frommgcv
:s(x,sub,bs="fs"))
would approximately correspond to the following term in
mssm
:fs(["x"],rf="sub")
They are however not equivalent (mgcv by default uses a different basis for which the
m
key-word has a different functionality).Specifically, here
m= 1
implies that the only function left unpenalized by the default (difference) penalty is the constant (Eilers & Marx, 2010). Thus, a linear basis is penalized by the same default penalty that also penalizes smoothness (and not by a separate penalty as is the case inmgcv
whenm=1
for the default basis)! Any constant basis is penalized by the null-space penalty (in bothmgcv
andmssm
; see Marra & Wood, 2011) - the term thus shrinks towards zero (Wood, 2017).The factor smooth basis in mgcv allows to let the penalty be different for different levels of an additional factor (by additionally specifying the
by
argument for a smooth with basis “fs”). I.e.,s(Time,Subject,by='condition',bs='fs')
in
mgcv
would estimate a non-linear random smooth of “time” per level of the “subject” & “condition” interaction - with the same penalty being placed on all random smooth terms within the same “condition” level.This can be achieved in
mssm
by adding multiplefs
terms to theFormula
and utilising theby_subgroup
argument. This needs to be set to a list where the first element identifies the additional factor variable (e.g., “condition”) and the second element corresponds to a level of said factor variable. E.g., to approximate the aforementionedmgcv
term we have to add:*[fs(["Time"],rf="subject_cond",by_subgroup=["cond",cl]) for cl in np.unique(dat["cond"])]
to the
Formula
terms
list. Importantly, “subject_cond” is the interaction of “subject” and “condition” - not just the “subject variable in the data.Model estimation can become quite expensive for
fs
terms, when the factor variable forrf
has many levels. (> 10000) In that case, approximate derivative evaluation can speed things up considerably. To enforce this, theapprox_deriv
argument needs to be specified with a dict, having the following structure:{"no_disc":[str],"excl":[str],"split_by":[str],"restarts":int,"seed":None or int}
. “no_disc” should usually be set to an empty list, and should in general only contain names of continuous variables included in the formula. Any variable mentioned here will not be discretized before clustering - this will make the approximation a bit more accurate but also require more time. Similarly, “excl” specifies any continuous variables that should be excluded for clustering. “split_by” should generally be set to a list containing all categorical variables present in the formula. “restarts” indicates the number of times to re-produce the clustering (40 seems to be a good number). “seed” can either be set to None or to an integer - in the latter case, the random cluster initialization will use that seed, ensuring that the clustering outcome (and hence model fit) is replicable.References:
Eilers, P., & Marx, B. (2010). Splines, knots, and penalties. https://doi.org/10.1002/WICS.125
Marra, G., & Wood, S. N. (2011). Practical variable selection for generalized additive models.Computational Statistics & Data Analysis, 55(7), 2372–2387. https://doi.org/10.1016/j.csda.2011.02.004
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.). Chapman and Hall/CRC.
- Parameters:
variables (list[str]) – A list of the variables (strings) of which the term is a function. Need to exist in data passed to Formula. Need to be continuous.
rf (str, optional) – A string corresponding to a (random) factor in data passed to Formula. Separate f(variables) (but a shared smoothness penalty!) will be estimated per level of rf.
nk (int or list[int], optional) – Number of basis functions - 1 to use. I.e., if nk=9 (the default), the term will use 10 basis functions. By default f() has identifiability constraints applied and we act as if nk + 1 coefficients were requested. The fs() term needs no identifiability constraints, so if the same number of coefficients used for a f() term is requested (the desired approach), one coefficient is added to compensate for the lack of identifiability constraints. This is the opposite of how this is handled in mgcv: specifying nk=10 for “fixed” univariate smooths results in 9 basis functions being available. However, for a smooth in mgcv with basis='fs', 10 basis functions will remain available.
basis (Callable, optional) – The basis functions to use to construct the spline matrix. By default a B-spline basis (Eilers & Marx, 2010) implemented in mssm.src.smooths.B_spline_basis().
basis_kwargs (dict, optional) – A list containing one or multiple dictionaries specifying how the basis should be computed. For the B-spline basis the following arguments (with default values) are available: convolve=False, min_c=None, max_c=None, deg=3. See mssm.src.smooths.B_spline_basis() for details.
by_subgroup ([str,str], optional) – List including a factor variable and a specific level of said variable. Allows for separate penalties as described above.
approx_deriv (dict, optional) – Dict holding important info for the clustering algorithm. Structure: {"no_disc":[str],"excl":[str],"split_by":[str],"restarts":int}
- class mssm.src.python.terms.i
Bases:
GammTerm
An intercept/offset term. In a model
\[\mu_i = a + f(x_i)\]it reflects \(a\).
- class mssm.src.python.terms.irf(variables: [<class 'str'>], event_onset: [<class 'int'>], basis_kwargs: list[dict], by: str = None, id: int = None, nk: int = 10, basis: ~collections.abc.Callable = <function B_spline_basis>, is_penalized: bool = True, penalty: list[~mssm.src.python.penalties.PenType] = None, pen_kwargs: list[dict] = None)
Bases:
GammTerm
A simple impulse response term, designed to correct for events with overlapping responses in multi-level time-series modeling.
The idea (see Ehinger & Dimigen; 2019 for a detailed introduction to this kind of deconvolution analysis) is that some kind of event happens during each recorded time-series (e.g., stimulus onset, distractor display, mask onset, etc.) which is assumed to affect the recorded signal in the next X ms in some way. The moment of event onset can differ between recorded time-series. In other words, the event is believed to act like an impulse which triggers a delayed response on the signal. This term class can be used to estimate the shape of this impulse response. Multiple
irf
terms can be included in aFormula
if multiple events happen, potentially with overlapping responses.Example:
# Simulate time-series based on two events that elicit responses which vary in their overlap.
# The summed responses + a random intercept + noise is then the signal.
overlap_dat,onsets1,onsets2 = sim7(100,1,2,seed=20)

# Model below tries to recover the shape of the two responses in the 200 ms after event onset (max_c=200) + the random intercepts:
overlap_formula = Formula(lhs("y"),
                          [irf(["time"],onsets1,nk=15,basis_kwargs=[{"max_c":200,"min_c":0,"convolve":True}]),
                           irf(["time"],onsets2,nk=15,basis_kwargs=[{"max_c":200,"min_c":0,"convolve":True}]),
                           ri("factor")],
                          data=overlap_dat,
                          series_id="series") # For models with irf terms, the column in the data identifying unique series needs to be specified.

model = GAMM(overlap_formula,Gaussian())
model.fit()
Note, that care needs to be taken when predicting for models including
irf
terms, because the onset of events can differ between time-series. Hence, model predictions + standard errors should first be obtained for the entire data-set used to train the model; series-specific predictions can then be extracted from the model matrix as follows:
# Get model matrix for entire data-set but only based on the estimated shape for first irf term:
_,pred_mat,ci_b = model.predict([0],overlap_dat,ci=True)

# Now extract the prediction + approximate ci boundaries for a single series:
s = 8
s_pred = pred_mat[overlap_dat["series"] == s,:]@model.coef
s_ci = ci_b[overlap_dat["series"] == s]

# Now the estimated response following the onset of the first event can be visualized + an approximate CI:
from matplotlib import pyplot as plt
plt.plot(overlap_dat["time"][overlap_dat["series"] == s],s_pred,color='blue')
plt.plot(overlap_dat["time"][overlap_dat["series"] == s],s_pred+s_ci,color='blue',linestyle='dashed')
plt.plot(overlap_dat["time"][overlap_dat["series"] == s],s_pred-s_ci,color='blue',linestyle='dashed')
References:
Ehinger, B. V., & Dimigen, O. (2019). Unfold: an integrated toolbox for overlap correction, non-linear modeling, and regression-based EEG analysis. https://doi.org/10.7717/peerj.7838
- Parameters:
variables (list[str]) – A list of the variables (strings) of which the term is a function. Need to exist in `data` passed to `Formula`. Need to be continuous.

event_onset ([int]) – A `np.array` containing, for each individual time-series, the index corresponding to the sample/time-point at which the event eliciting the response to be estimated by this term happened.

basis_kwargs (dict) – A list containing one or multiple dictionaries specifying how the basis should be computed. For `irf` terms, the `convolve` argument has to be set to True! Also, `min_c` and `max_c` must be specified. `min_c` corresponds to the assumed minimum delay of the response after event onset and can usually be set to 0. `max_c` corresponds to the assumed maximum delay of the response (in ms) after which the response is believed to have returned to a zero baseline.

by (str, optional) – A string corresponding to a factor in `data` passed to `Formula`. Separate irf(`variables`) (and smoothness penalties) will be estimated per level of `by`.

id (int, optional) – Only useful in combination with specifying a `by` variable. If `id` is set to any integer, the penalties placed on the separate irf(`variables`) will share a single smoothness penalty.

nk (int, optional) – Number of basis functions to use. I.e., if `nk=10` (the default), the term will use 10 basis functions (note that these terms are not made identifiable by absorbing any kind of constraint).

basis (Callable, optional) – The basis functions to use to construct the spline matrix. By default, a B-spline basis (Eilers & Marx, 2010) implemented in `src.smooths.B_spline_basis`.

is_penalized (bool, optional) – Whether the term should be penalized or not. There are rarely good reasons to set this to False.

penalty (list[penalties.PenType], optional) – A list of penalty types to be placed on the term.

pen_kwargs (list[dict], optional) – A list containing one or multiple dictionaries specifying how the penalty should be created. For the default difference penalty (Eilers & Marx, 2010) the only keyword argument (with default value) available is: `m=2`. This reflects the order of the difference penalty. Note that while a higher `m` permits penalizing towards smoother functions, it also leads to an increased dimensionality of the penalty kernel (the set of f(`variables`) which will not be penalized). In other words, increasingly more complex functions will be left unpenalized for higher `m` (except if `penalize_null` is set to True). `m=2` is usually a good choice and thus the default, but see Eilers & Marx (2010) for details. A small construction sketch follows below.
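For illustration, the difference penalty described above can be constructed directly with numpy. This is only a sketch of the standard Eilers & Marx (2010) construction, not mssm's internal penalty code; it shows how `m` determines what remains unpenalized:

```python
import numpy as np

nk = 10  # number of basis functions
m = 2    # order of the difference penalty (the default)

# m-th order difference matrix: maps the coefficient vector to its m-th differences.
D = np.diff(np.eye(nk), n=m, axis=0)

# Penalty matrix: penalizes the sum of squared m-th differences of adjacent coefficients.
S = D.T @ D

# The penalty kernel has dimension m: polynomials of degree < m remain unpenalized,
# so the rank of S is nk - m.
print(np.linalg.matrix_rank(S))  # prints 8 for nk=10, m=2
```

Increasing `m` to 3 would additionally leave quadratic functions unpenalized, which is one reason `m=2` is usually a sensible default.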
- class mssm.src.python.terms.l(variables: list)
Bases:
GammTerm
Adds a parametric (linear) term to the model formula. The model \(\mu_i = a + b*x_i\) can for example be achieved by adding `[i(), l(['x'])]` to the `terms` argument of a `Formula`. The coefficient \(b\) estimated for the term will then correspond to the slope of \(x\). This class can also be used to add predictors for categorical variables. If the formula includes an intercept, binary coding will be utilized to add reference-level adjustment coefficients for the remaining k-1 levels of any additional factor variable.

If more than one variable is included in `variables`, the model will only add the len(`variables`)-interaction to the model! Lower-order interactions and main effects will not be included by default (see the `li()` function instead, which automatically includes all lower-order interactions and main effects).

Example: The interaction effect of factor variable “cond”, with two levels “1” and “2”, and a continuous variable “x” on the dependent variable “y” are of interest. To estimate such a model, the following formula can be used:
formula = Formula(lhs("y"),terms=[i(),l(["cond"]),l(["x"]),l(["cond","x"])])
This formula will estimate the following model:
\[\mu_i = a + b_1*c_i + b_2*x_i + b_3*c_i*x_i\]

Here, \(c\) is a binary predictor variable created so that it is 1 if “cond”=2 else 0, and \(b_3\) is the coefficient that is added because `l(["cond","x"])` is included in the terms (i.e., the interaction effect).

To get a model with only main effects for “cond” and “x”, the following formula could be used:
formula = Formula(lhs("y"),terms=[i(),l(["cond"]),l(["x"])])
This formula will estimate:
\[\mu_i = a + b_1*c_i + b_2*x_i\]

- Parameters:
variables ([str]) – A list of the variables (strings) for which linear predictors should be included
- mssm.src.python.terms.li(variables: list[str])
Behaves like the `l` class but automatically includes all lower-order interactions and main effects.

Example: The interaction effect of factor variable “cond”, with two levels “1” and “2”, and a continuous variable “x” on the dependent variable “y” are of interest. To estimate such a model, the following formula can be used:
formula = Formula(lhs("y"),terms=[i(),*li(["cond","x"])])
Note the use of the `*` operator to unpack the individual terms returned from li!

This formula will still (see `l`) estimate the following model:

\[\mu_i = a + b_1*c_i + b_2*x_i + b_3*c_i*x_i\]

with \(c\) corresponding to a binary predictor variable created so that it is 1 if “cond”=2 else 0.
To get a model with only main effects for “cond” and “x”, `li` cannot be used and `l` needs to be used instead:

formula = Formula(lhs("y"),terms=[i(),l(["cond"]),l(["x"])])
This formula will estimate:
\[\mu_i = a + b_1*c_i + b_2*x_i\]

- Parameters:
variables (list[str]) – A list of the variables (strings) for which linear predictors should be included
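As a sketch of the equivalence described above (assuming the factor “cond” and the continuous variable “x” exist in the data), the two term lists below should yield the same model:

```python
# Hypothetical import path, based on this module's documentation:
from mssm.src.python.terms import i, l, li

# li() expands to all main effects plus the interaction...
terms_li = [i(), *li(["cond", "x"])]

# ...which matches spelling the terms out individually via l():
terms_l = [i(), l(["cond"]), l(["x"]), l(["cond", "x"])]
```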
- class mssm.src.python.terms.ri(variable: str)
Bases:
GammTerm
Adds a random intercept for the factor `variable` to the model. The random intercepts \(b_i\) are assumed to be i.i.d. \(b_i \sim N(0,\sigma_b)\), i.e., normally distributed around zero - the simplest random effect supported by `mssm`.

Thus, this term achieves exactly what is achieved in `mgcv` by adding the term:

s(variable,bs="re")
The `variable` needs to identify a factor-variable in the data (i.e., the .dtype of the variable has to be equal to ‘O’). If you want to add more complex random effects to the model (e.g., random slopes for continuous variable “x” per level of factor `variable`) use the `rs` class.

- Parameters:
variable (str) – The name (string) of a factor variable. For every level of this factor a random intercept will be estimated. The random intercepts are assumed to follow a normal distribution centered around zero.
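A minimal usage sketch (the data frame `dat` and its columns “y”, “x”, and “subject” are assumptions for illustration):

```python
# Linear effect of x plus a random intercept per subject:
formula = Formula(lhs("y"),
                  terms=[i(), l(["x"]), ri("subject")],
                  data=dat)

model = GAMM(formula, Gaussian())
model.fit()
```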
- class mssm.src.python.terms.rs(variables: list[str], rf: str)
Bases:
GammTerm
Adds random slopes for the effects of `variables` for each level of the random factor `rf`. The type of random slope created depends on the content of `variables`.

If `len(variables)==1`, and the string in `variables` identifies a categorical variable in the data, then a random offset adjustment (for every level of the categorical variable, so without binary coding!) will be estimated for every level of the random factor `rf`.

Example: The factor variable “cond”, with two levels “1” and “2”, is assumed to have a general effect on the DV “y”. However, data was collected from multiple subjects (random factor `rf` = “subject”) and it is reasonable to assume that the effect of “cond” is slightly different for every subject (it is also assumed that all subjects took part in both conditions identified by “cond”). A model that accounts for this is estimated via:

formula = Formula(lhs("y"),terms=[i(),l(["cond"]),rs(["cond"],rf="subject")])
This formula will estimate the following model:
\[\mu_i = a + b_1*c_i + a_{j(i),cc(i)}\]

Here, \(c\) is again a binary predictor variable created so that it is 1 if “cond”=2 for observation i else 0, \(cc(i)\) indexes the level of “cond” at observation \(i\), \(j(i)\) indexes the level of “subject” at observation \(i\), and \(a_{j,cc(i)}\) identifies the random offset estimated for subject \(j\) at the level of “cond” indicated by \(cc(i)\). The \(a_{j,cc(i)}\) are assumed to be i.i.d. \(\sim N(0,\sigma_a)\). Note that the fixed effect structure uses binary coding but the random effect structure does not!
Hence, `rs(["cond"],rf="subject")` in `mssm` corresponds to adding the term below to a `mgcv` model:

s(cond,subject,bs="re")
If all the strings in `variables` identify continuous variables in the data, then a random slope for the len(`variables`)-way interaction (which will simplify to a slope for a single continuous variable if len(`variables`) == 1) will be estimated for every level of the random factor `rf`.

Example: The continuous variable “x” is assumed to have a general effect on the DV “y”. However, data was collected from multiple subjects (random factor `rf` = “subject”) and it is reasonable to assume that the effect of “x” is slightly different for every subject. A model that accounts for this is estimated via:

formula = Formula(lhs("y"),terms=[i(),l(["x"]),rs(["x"],rf="subject")])
This formula will estimate the following model:
\[\mu_i = a + b*x_i + b_{j(i)} * x_i\]

Where \(j(i)\) again indexes the level of “subject” at observation \(i\), \(b_{j(i)}\) identifies the random slope (the subject-specific slope adjustment for \(b\)) for variable “x” estimated for subject \(j(i)\), and the \(b_{j(i)}\) are again assumed to be i.i.d. realizations from a single normal distribution \(N(0,\sigma_b)\).
Note that lower-order interaction slopes (as well as main effects) are not pulled in by default! Consider the following formula:
formula = Formula(lhs("y"),terms=[i(),*li(["x","z"]),rs(["x","z"],rf="subject")])
with another continuous variable “z”. This corresponds to the model:
\[\mu_i = a + b_1*x_i + b_2*z_i + b_3*x_i*z_i + b_{j(i)}*x_i*z_i\]

With \(j(i)\) again indexing the level of “subject” at observation i, and \(b_{j(i)}\) identifying the random slope (the subject-specific slope adjustment for \(b_3\)) for the interaction of variables \(x\) and \(z\) estimated for subject \(j(i)\). The \(b_{j(i)}\) are again assumed to be i.i.d. realizations from a single \(N(0,\sigma_b)\).
To add random slopes for the main effects of either \(x\) or \(z\), as well as an additional random intercept, additional `rs` terms and a `ri` term would have to be added to the formula:

formula = Formula(lhs("y"),terms=[i(),*li(["x","z"]), ri("subject"), rs(["x"],rf="subject"), rs(["z"],rf="subject"), rs(["x","z"],rf="subject")])
If `len(variables) > 1` and at least one string in `variables` identifies a categorical variable in the data, then random slopes for the len(`variables`)-way interaction will be estimated for every level of the random factor `rf`. Separate distribution parameters (the \(\sigma\) of the Normal) will be estimated for every level of the resulting interaction.

Example: The continuous variable “x” and the factor variable “cond”, with two levels “1” and “2”, are assumed to have a general interaction effect on the DV “y”. However, data was collected from multiple subjects (random factor `rf` = “subject”) and it is reasonable to assume that their interaction effect is slightly different for every subject. A model that accounts for this is estimated via:

formula = Formula(lhs("y"),terms=[i(),*li(["x","cond"]),rs(["x","cond"],rf="subject")])
This formula will estimate the following model:
\[\mu_i = a + b_1*c_i + b_2*x_i + b_3*x_i*c_i + b_{j(i),cc(i)}*x_i\]

With \(c\) corresponding to a binary predictor variable created so that it is 1 if “cond”=2 for observation \(i\) else 0, \(cc(i)\) corresponds to the level of “cond” at observation \(i\), \(j(i)\) corresponds to the level of “subject” at observation \(i\), and \(b_{j(i),cc(i)}\) identifies the random slope for variable \(x\) at “cond” = \(cc(i)\) estimated for subject \(j\). That is: the \(b_{j,cc(i)}\) where \(cc(i)=1\) are assumed to be i.i.d. realizations from normal distribution \(N(0,\sigma_{b_1})\) and the \(b_{j,cc(i)}\) where \(cc(i)=2\) are assumed to be i.i.d. realizations from a separate normal distribution \(N(0,\sigma_{b_2})\).
Hence, adding `rs(["x","cond"],rf="subject")` to a `mssm` model is equivalent to adding the term below to a `mgcv` model:

s(x,subject,by=cond,bs="re")
Correlations between random effects cannot be taken into account by means of parameters (this is possible, for example, in `lme4`).

- Parameters:
variables ([str]) – A list of variables. Can point to continuous and categorical variables.
rf (str) – A factor variable. Identifies the random factor in the data.
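Putting the pieces together, a sketch of a model combining fixed effects with a random intercept and the random-slope variants discussed above (the data frame `dat` and its columns “y”, “x”, “cond”, and “subject” are assumptions for illustration):

```python
formula = Formula(lhs("y"),
                  terms=[i(), *li(["cond", "x"]),          # fixed effects + interaction
                         ri("subject"),                    # random intercept per subject
                         rs(["x"], rf="subject"),          # random slope for x per subject
                         rs(["x", "cond"], rf="subject")], # cond-specific random slopes for x
                  data=dat)

model = GAMM(formula, Gaussian())
model.fit()
```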
mssm.src.python.utils module
- mssm.src.python.utils.REML(llk, nH, coef, scale, penalties)
Based on Wood (2011): exact REML for Gaussian GAMs, Laplace-approximate REML (Wood et al., 2016) for everything else. Evaluated after applying the stabilizing reparameterization discussed by Wood (2011).
- References:
Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.
Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models
- Parameters:
llk (float) – log-likelihood of model
nH (scipy.sparse.csc_array) – negative hessian of log-likelihood of model
coef (numpy.array) – Estimated vector of coefficients of shape (-1,1)
scale (float) – (Estimated) scale parameter - can be set to 1 for GAMLSS or GSMMs.
penalties ([LambdaTerm]) – List of penalties that were part of the model.
- Returns:
(Approximate) REML score
- Return type:
float
- mssm.src.python.utils.adjust_CI(model, n_ps, b, predi_mat, use_terms, alpha, seed)
Internal function to adjust a point-wise CI to behave like a whole-function interval (based on Wood, 2017; section 6.10.2 and Simpson, 2016): `model.coef +- b` gives the point-wise interval, and for the interval to cover the whole function, `1-alpha` % of posterior samples should be expected to fall completely within these boundaries.

From section 6.10 in Wood (2017) we have that \(\boldsymbol{\beta} | \mathbf{y}, \boldsymbol{\lambda} \sim N(\hat{\boldsymbol{\beta}},\mathbf{V})\). \(\mathbf{V}\) is the covariance matrix of this conditional posterior, and can be obtained by evaluating `model.lvi.T @ model.lvi * model.scale` (`model.scale` should be set to 1 for `mssm.models.GAMMLSS` and `mssm.models.GSMM`).

The implication of this result is that we can also expect the deviations \(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}\) to follow \(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}} | \mathbf{y}, \boldsymbol{\lambda} \sim N(0,\mathbf{V})\). In line with the whole-function interval definition above, `1-alpha` % of `predi_mat@[*coef - coef]` (where `[*coef - coef]` represents the deviations \(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}\)) should fall within `[-b,b]`. Wood (2017) suggests to find `a` so that `[-a*b,a*b]` achieves this.

To do this, we find `a` for every `predi_mat@[*coef - coef]` and then select the final one so that `1-alpha` % of samples had an equal or lower one. The consequence: `1-alpha` % of samples drawn should fall completely within the modified boundaries.

- References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
Simpson, G. (2016). Simultaneous intervals for smooths revisited.
- Parameters:
model (GAMM or GAMLSS or GSMM) – GAMM,GAMLSS, or GSMM model (which has been fitted) for which to estimate \(\mathbf{V}\)
n_ps (int) – Number of samples to obtain from posterior.
b ([float]) – CI boundary (half-width) of the point-wise CI.
predi_mat (scipy.sparse.csc_array) – Model matrix for a particular smooth term or additive combination of parameters evaluated usually at a representative sample of predictor variables.
use_terms (list[int] or None) – The indices corresponding to the terms that should be used to obtain the prediction, or `None`, in which case all terms will be used.

alpha (float) – The alpha level to use for the whole-function interval adjustment calculation as outlined above.
seed (int or None) – Can be used to provide a seed for the posterior sampling.
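The adjustment described above can be sketched generically (a conceptual re-implementation under the stated posterior result, not mssm's internal code; `V_root` is an assumed input with `V_root.T @ V_root` equal to \(\mathbf{V}\), e.g. `model.lvi * model.scale**0.5` for a GAMM):

```python
import numpy as np

def whole_function_factor(predi_mat, V_root, b, n_ps=10000, alpha=0.05, seed=None):
    rng = np.random.default_rng(seed)

    # Sample deviations beta - beta_hat ~ N(0,V): z ~ N(0,I) mapped through V_root.T.
    z = rng.standard_normal((V_root.shape[0], n_ps))
    devs = predi_mat @ (V_root.T @ z)  # sampled curve deviations, shape (n_pred, n_ps)

    # Per sample: the smallest a so that the whole sampled curve lies within [-a*b, a*b].
    a_per_sample = np.max(np.abs(devs) / np.asarray(b)[:, None], axis=0)

    # Select a so that 1-alpha of samples fall completely within the adjusted boundaries.
    return np.quantile(a_per_sample, 1 - alpha)
```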
- mssm.src.python.utils.central_hessian(coef, llkfun, *llkgargs, **llkkwargs)
Generic function to approximate the hessian of `llkfun` at `coef`. `llkfun` is called as `llkfun(coef,*llkgargs,**llkkwargs)` - so all additional arguments have to be passed via the latter.

Uses central finite differences with a fixed value for epsilon, based on eq. 8 and the following paragraphs in Ridout (2009) (also used similarly in numdifftools).
- References:
Wood (2015). Core Statistics
M.S. Ridout (2009). Statistical Applications of the Complex-step Method of Numerical Differentiation
Brodtkorb (2014). numdifftools. see https://numdifftools.readthedocs.io/en/latest/reference/generated/numdifftools.core.Hessian.html#equation-9
- Parameters:
coef (numpy.array) – Current estimate of the coefficients of which `llkfun` is some function.

llkfun (Callable) – log-likelihood function to optimize.
- Returns:
An approximation of the Hessian as a numpy array.
- Return type:
numpy.array
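A small usage sketch (the log-likelihood and data below are invented for illustration; `forward_hessian` further down shares the same calling convention):

```python
import numpy as np
from mssm.src.python.utils import central_hessian

# Log-likelihood of i.i.d. N(mu,1) observations as a function of coef = [mu]:
def llk(coef, y):
    return -0.5 * np.sum((y - coef[0]) ** 2)

y = np.array([0.5, 1.2, 0.8, 1.0])

# Analytically the second derivative is -len(y) = -4, so the result should be close to [[-4.]]:
H = central_hessian(np.array([1.0]), llk, y)
print(H)
```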
- mssm.src.python.utils.compute_REML_candidate_GSMM(family, y, Xs, penalties, coef, n_coef, coef_split_idx, method='Chol', conv_tol=1e-07, n_c=10, bfgs_options={}, origNH=None)
Allows evaluating the REML criterion (e.g., Wood, 2011; Wood, 2016) efficiently for a set of lambda values for a GSMM or GAMMLSS.
Internal function used for computing the correction applied to the edf for the GLRT - based on Wood (2017) and Wood et al., (2016).
See the `REML()` function for more details.
- mssm.src.python.utils.compute_Vb_corr_WPS(Vbr, Vpr, Vr, H, S_emb, penalties, coef, scale=1)
Computes both correction terms for `Vb` or \(\mathbf{V}_{\boldsymbol{\beta}}\), which is the co-variance matrix for the conditional posterior of \(\boldsymbol{\beta}\) so that \(\boldsymbol{\beta} | y, \boldsymbol{\lambda} \sim N(\hat{\boldsymbol{\beta}},\mathbf{V}_{\boldsymbol{\beta}})\), described by Wood, Pya, & Säfken (2016).

- References:
Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.
Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
Vbr (scipy.sparse.csc_array) – Transpose of root for the estimate for the (unscaled) covariance matrix of \(\boldsymbol{\beta} | y, \boldsymbol{\lambda}\) - the coefficients estimated by the model.
Vpr (numpy.array) – A (regularized) estimate of the covariance matrix of \(\boldsymbol{\rho}\) - the log smoothing penalties.
Vr (numpy.array) – Transpose of root of un-regularized covariance matrix of \(\boldsymbol{\rho}\) - the log smoothing penalties.
H (scipy.sparse.csc_array) – The Hessian of the log-likelihood
S_emb (scipy.sparse.csc_array) – The weighted penalty matrix.
penalties ([LambdaTerm]) – A list holding the `LambdaTerm`s estimated for the model.
coef (numpy.array) – An array holding the estimated regression coefficients. Has to be of shape (-1,1)
scale (float) – Any scale parameter estimated as part of the model. Can be omitted for more generic models beyond GAMMs. Defaults to 1.
- Raises:
ArithmeticError – Will throw an error when the negative Hessian of the penalized likelihood is ill-scaled so that a Cholesky decomposition fails.
- Returns:
A tuple containing `Vc` and `Vcc`. `Vbr.T@Vbr*scale + Vc + Vcc` is then approximately the correction devised by WPS (2016).

- Return type:
(numpy.array, numpy.array)
- mssm.src.python.utils.compute_Vp_WPS(Vbr, H, S_emb, penalties, coef, scale=1)
Computes the inverse of what is approximately the negative Hessian of the Laplace approximate REML criterion with respect to the log smoothing penalties.
The derivatives computed are only exact for Gaussian additive models and canonical generalized additive models. For all other models they are inexact, in that they assume that the hessian of the log-likelihood does not depend on \(\lambda\) (or \(log(\lambda)\)), so they are essentially the PQL derivatives of Wood et al. (2017). The inverse computed here acts as an approximation to the covariance matrix of the log smoothing parameters.
- References:
Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.
Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
Wood, S. N., Li, Z., Shaddick, G., & Augustin, N. H. (2017). Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data.
- Parameters:
Vbr (scipy.sparse.csc_array) – Transpose of root for the estimate for the (unscaled) covariance matrix of \(\boldsymbol{\beta} | y, \boldsymbol{\lambda}\) - the coefficients estimated by the model.
H (scipy.sparse.csc_array) – The Hessian of the log-likelihood
S_emb (scipy.sparse.csc_array) – The weighted penalty matrix.
penalties ([LambdaTerm]) – A list holding the `LambdaTerm`s estimated for the model.
coef (numpy.array) – An array holding the estimated regression coefficients. Has to be of shape (-1,1)
scale (float) – Any scale parameter estimated as part of the model. Can be omitted for more generic models beyond GAMMs. Defaults to 1.
- mssm.src.python.utils.compute_reml_candidate_GAMM(family, y, X, penalties, n_c=10, offset=0, init_eta=None, method='Chol', compute_inv=False, origNH=None)
Allows evaluating the REML criterion (e.g., Wood, 2011; Wood, 2016) efficiently for a set of lambda values.
Internal function used for computing the correction applied to the edf for the GLRT - based on Wood (2017) and Wood et al., (2016).
See the `REML()` function for more details.
- mssm.src.python.utils.correct_VB(model, nR=250, grid_type='JJJ3', a=1e-07, b=10000000.0, df=40, n_c=10, form_t=True, form_t1=False, verbose=False, drop_NA=True, method='Chol', only_expected_edf=False, Vp_fidiff=True, use_reml_weights=True, seed=None, **bfgs_options)
Estimate \(\mathbf{V}\), the covariance matrix of the unconditional posterior \(\boldsymbol{\beta} | y \sim N(\hat{\boldsymbol{\beta}},\mathbf{V})\) to account for smoothness uncertainty.
Wood et al. (2016) and Wood (2017) show that when basing conditional versions of model selection criteria or hypothesis tests on \(\mathbf{V}_{\boldsymbol{\beta}}\), which is the co-variance matrix for the conditional posterior of \(\boldsymbol{\beta}\) so that \(\boldsymbol{\beta} | y, \boldsymbol{\lambda} \sim N(\hat{\boldsymbol{\beta}},\mathbf{V}_{\boldsymbol{\beta}})\), the tests are severely biased. To correct for this they show that uncertainty in \(\boldsymbol{\lambda}\) needs to be accounted for. Hence they suggest to base these tests on \(\mathbf{V}\), the covariance matrix of the unconditional posterior \(\boldsymbol{\beta} | y \sim N(\hat{\boldsymbol{\beta}},\mathbf{V})\). They show how to obtain an estimate of \(\mathbf{V}\), but this requires \(\mathbf{V}_{\boldsymbol{p}}\) - an estimate of the covariance matrix of \(\boldsymbol{p}=log(\boldsymbol{\lambda})\). \(\mathbf{V}_{\boldsymbol{p}}\) requires derivatives that are not available when using the efs update.
Greven & Scheipl (2016), in their comment on the paper by Wood et al. (2016), show another option to estimate \(\mathbf{V}\) that does not require \(\mathbf{V}_{\boldsymbol{p}}\), based either on the total variance property or on numeric integration. The latter is implemented below, based on the equations for the expectations outlined in their response.
This function implements multiple strategies to correct for smoothing parameter uncertainty, based on the Wood et al. (2016) and Greven & Scheipl (2016) proposals. The most straightforward strategy (`grid_type = 'JJJ1'`) is to obtain a finite difference approximation for \(\mathbf{V}_{\boldsymbol{p}}\) and to then complete the Wood et al. (2016) correction approximately (done exactly here only for Gaussian additive or canonical generalized additive models). This is too costly for large sparse multi-level models. In those cases, the Greven & Scheipl (2016) proposal is attractive. When `grid_type = 'JJJ2'`, the approximation to \(\mathbf{V}_{\boldsymbol{p}}\) is used to adaptively sample the grid used to numerically evaluate the expectations required in their proposal. From this grid, \(\mathbf{V}\) could then be evaluated as shown by Greven & Scheipl (2016). However, by default \(\mathbf{V}\) is here obtained by first computing it according to the Wood et al. (2016) correction and then taking a weighted sum of the former (weighted by `V_shrinkage_weight`) and the \(\mathbf{V}\) (weighted by `1-V_shrinkage_weight`) obtained as suggested by Greven & Scheipl (2016). Setting `V_shrinkage_weight` to 0 (and `only_expected_edf=False`) thus recovers exactly the Greven & Scheipl (2016) correction, albeit based on an adaptive grid. This is not efficient for large sparse models either. However, when setting `only_expected_edf=True`, \(\mathbf{V}\) is never actually computed and the uncertainty-corrected model edf are instead approximated directly. This remains efficient for large sparse multi-level models.

- References:
Wood, S. N., (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models.
Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models
Greven, S., & Scheipl, F. (2016). Comment on: Smoothing Parameter and Model Selection for General Smooth Models
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
- Parameters:
model (GAMM or GAMMLSS or GSMM) – GAMM, GAMMLSS, or GSMM model (which has been fitted) for which to estimate \(\mathbf{V}\)
nR (int, optional) – In case `grid_type != "JJJ1"`, `nR` samples/REML scores are generated/computed to numerically evaluate the expectations necessary for the uncertainty correction, defaults to 250

grid_type (str, optional) – How to compute the smoothness uncertainty correction - see above for details, defaults to ‘JJJ3’
a (float, optional) – Minimum \(\lambda\) value that is included when forming the initial grid. In addition, any of the \(\lambda\) estimates obtained from `model` (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are smaller than this are set to this value as well, defaults to 1e-7 (the minimum possible estimate)

b (float, optional) – Maximum \(\lambda\) value that is included when forming the initial grid. In addition, any of the \(\lambda\) estimates obtained from `model` (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are larger than this are set to this value as well, defaults to 1e7 (the maximum possible estimate)

df (int, optional) – Degrees of freedom used for the multivariate t distribution used to sample the next set of candidates. Setting this to `np.inf` means a multivariate normal is used for sampling, defaults to 40

n_c (int, optional) – Number of cores to use to compute the correction, defaults to 10
form_t (bool, optional) – Whether or not the smoothness uncertainty corrected edf should be computed, defaults to True
form_t1 (bool, optional) – Whether or not the smoothness uncertainty + smoothness bias corrected edf should be computed, defaults to False
verbose (bool, optional) – Whether to print progress information or not, defaults to False
drop_NA (bool,optional) – Whether to drop rows in the model matrices corresponding to NAs in the dependent variable vector. Defaults to True.
method (str, optional) – Which method to use to solve for the coefficients (and smoothing parameters). The default (“Chol”) relies on Cholesky decomposition. This is extremely efficient but, in principle, numerically less stable. For maximum numerical stability set this to “QR/Chol”. In that case a QR decomposition is used - which is first pivoted to maximize sparsity in the resulting decomposition but also pivots for stability in order to get an estimate of rank deficiency. A Cholesky decomposition is then computed using the combined pivoting strategy obtained from the QR. This takes substantially longer. If this is set to `'qEFS'`, then the coefficients are estimated via quasi-Newton optimization and the smoothing penalties are estimated from the quasi-Newton approximation to the hessian. This only requires first-derivative information. Defaults to “Chol”.

only_expected_edf (bool, optional) – Whether to compute the edf from the trace of the covariance matrix (`only_expected_edf=False`) or based on numeric integration weights. The latter is much more efficient for sparse models. Only makes sense when `grid_type != 'JJJ1'`. Defaults to False.

Vp_fidiff (bool, optional) – Whether to rely on a finite difference approximation to compute \(\mathbf{V}_{\boldsymbol{p}}\) for grid_type `JJJ1` and `JJJ2`, or on a PQL approximation. The latter is exact for Gaussian and canonical GAMs and far cheaper if many penalties are to be estimated. Defaults to True (finite difference approximation)

use_reml_weights (bool, optional) – Whether to rely on REML scores to compute the numerical integration when `grid_type != 'JJJ1'`, or on the log-densities of \(\mathbf{V}_{\boldsymbol{p}}\) - the latter assumes that the unconditional posterior is normal. Defaults to True (REML scores are used)

seed (int, optional) – Seed to use for random parts of the correction. Defaults to None
bfgs_options (key=value, optional) – Any additional keyword arguments that should be passed on to the call of `scipy.optimize.minimize()`. If none are provided, the `gtol` argument will be initialized to 1e-3. Note also that, in any case, the `maxiter` argument is automatically set to 100. Defaults to None.
- Returns:
A tuple containing: V - an estimate of the unconditional covariance matrix, LV - the Cholesky of the former, Vp - an estimate of the covariance matrix for \(\boldsymbol{\rho}\), Vpr - a root of the former, edf - smoothness uncertainty corrected coefficient-wise edf, total_edf - smoothness uncertainty corrected edf, edf2 - smoothness uncertainty + smoothness bias corrected coefficient-wise edf, total_edf2 - smoothness uncertainty + smoothness bias corrected edf, upper_edf - a heuristic upper bound on the uncertainty corrected edf
- Return type:
(scipy.sparse.csc_array,scipy.sparse.csc_array,numpy.array,numpy.array,numpy.array,float,numpy.array,float,float)
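A usage sketch for a fitted model (keyword choices are illustrative; `model` is assumed to be a fitted GAMM, GAMMLSS, or GSMM):

```python
from mssm.src.python.utils import correct_VB

# Smoothness-uncertainty corrected covariance matrix + edf:
V, LV, Vp, Vpr, edf, total_edf, edf2, total_edf2, upper_edf = correct_VB(
    model, grid_type='JJJ2', nR=250, n_c=10, seed=20
)

print(total_edf)  # uncertainty-corrected total edf, e.g. for AIC computations
```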
- mssm.src.python.utils.estimateVp(model, nR=20, lR=100, n_c=10, a=1e-07, b=10000000.0, verbose=False, drop_NA=True, method='Chol', seed=None, conv_tol=1e-07, df=40, strategy='JJJ3', **bfgs_options)
Estimate the covariance matrix \(\mathbf{V}_{\boldsymbol{p}}\) of the posterior for \(\mathbf{p} = log(\boldsymbol{\lambda})\). Either (see the `strategy` parameter) based on a finite difference approximation or on using REML scores to approximate the expectation for the covariance matrix, similar to what was suggested by Greven & Scheipl (2016).

- References:
https://en.wikipedia.org/wiki/Estimation_of_covariance_matrices
Greven, S., & Scheipl, F. (2016). Comment on: Smoothing Parameter and Model Selection for General Smooth Models
- Parameters:
model (GAMM or GAMLSS or GSMM) – GAMM,GAMLSS, or GSMM model (which has been fitted) for which to estimate \(\mathbf{V}\)
nR (int, optional) – The initial \(\lambda\) grid is based on nR equally-spaced samples from \(\lambda/lR\) to \(\lambda*lR\). In addition, `nR*len(model.formula.penalties)` updates to \(\mathbf{V}_{\boldsymbol{p}}\) are performed, during each of which nR additional \(\lambda\) samples/REML scores are generated/computed, defaults to 20

lR (int, optional) – The initial \(\lambda\) grid is based on nR equally-spaced samples from \(\lambda/lR\) to \(\lambda*lR\), defaults to 100
n_c (int, optional) – Number of cores to use to compute the estimate, defaults to 10
a (float, optional) – Minimum \(\lambda\) value that is included when forming the initial grid. In addition, any of the \(\lambda\) estimates obtained from `model` (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are smaller than this are set to this value as well, defaults to 1e-7 (the minimum possible estimate)

b (float, optional) – Maximum \(\lambda\) value that is included when forming the initial grid. In addition, any of the \(\lambda\) estimates obtained from `model` (used to define the mean for the posterior of \(\boldsymbol{p}|y \sim N(log(\hat{\boldsymbol{p}}),\mathbf{V}_{\boldsymbol{p}})\)) which are larger than this are set to this value as well, defaults to 1e7 (the maximum possible estimate)

verbose (bool, optional) – Whether to print progress information or not, defaults to False
drop_NA (bool,optional) – Whether to drop rows in the model matrices corresponding to NAs in the dependent variable vector. Defaults to True.
method (str, optional) – Which method to use to solve for the coefficients (and smoothing parameters). The default (“Chol”) relies on Cholesky decomposition. This is extremely efficient but, in principle, numerically less stable. For maximum numerical stability set this to “QR/Chol”. In that case a QR decomposition is used - which is first pivoted to maximize sparsity in the resulting decomposition but also pivots for stability in order to get an estimate of rank deficiency. A Cholesky decomposition is then computed using the combined pivoting strategy obtained from the QR. This takes substantially longer. If this is set to `'qEFS'`, then the coefficients are estimated via quasi-Newton optimization and the smoothing penalties are estimated from the quasi-Newton approximation to the hessian. This only requires first-derivative information. Defaults to “Chol”.

seed (int, optional) – Seed to use for random parts of the estimate. Defaults to None
conv_tol (float, optional) – Deprecated, defaults to 1e-7
df (int, optional) – Degrees of freedom used for the multivariate t distribution used to sample the next set of candidates. Setting this to `np.inf` means a multivariate normal is used for sampling, defaults to 40

strategy (str, optional) – By default (`strategy='JJJ3'`), the expectation for the covariance matrix is approximated via numeric integration. When `strategy='JJJ1'` a finite difference approximation is obtained instead. When `strategy='JJJ2'`, the finite difference approximation is updated through numeric integration iteratively. Defaults to ‘JJJ3’

bfgs_options (key=value, optional) – Any additional keyword arguments that should be passed on to the call of `scipy.optimize.minimize()`. If none are provided, the `gtol` argument will be initialized to 1e-3. Note also that, in any case, the `maxiter` argument is automatically set to 100. Defaults to None.
- Returns:
An estimate of the covariance matrix of the posterior for \(\mathbf{p} = log(\boldsymbol{\lambda})\)
- Return type:
numpy.array
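A short usage sketch (`model` is assumed to be a fitted model):

```python
from mssm.src.python.utils import estimateVp

# Covariance matrix of the posterior of log(lambda) via a finite difference approximation:
Vp = estimateVp(model, strategy='JJJ1', seed=20)
```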
- mssm.src.python.utils.forward_hessian(coef, llkfun, *llkgargs, **llkkwargs)
Generic function to approximate the hessian of `llkfun` at `coef`. `llkfun` is called as `llkfun(coef,*llkgargs,**llkkwargs)` - so all additional arguments have to be passed via the latter.

Uses forward finite differences with a fixed value for epsilon, based on chapter 5.5.2 in Wood (2015).
- References:
Wood (2015). Core Statistics
- Parameters:
coef (numpy.array) – Current estimate of the coefficients of which `llkfun` is some function.

llkfun (Callable) – log-likelihood function to optimize.
- Returns:
An approximation of the Hessian as a numpy array.
- Return type:
numpy.array
- mssm.src.python.utils.sample_MVN(n, mu, scale, P, L, LI=None, use=None, seed=None)
Draw `n` samples from a multivariate normal with mean \(\boldsymbol{\mu}\) (`mu`) and covariance matrix \(\boldsymbol{\Sigma}\).

\(\boldsymbol{\Sigma}\) does not need to be provided. Rather, the function expects either `L` (\(\mathbf{L}\) in what follows) or `LI` (\(\mathbf{L}^{-1}\) in what follows) and `scale` (\(\phi\) in what follows). These relate to \(\boldsymbol{\Sigma}\) so that \(\boldsymbol{\Sigma}/\phi = \mathbf{L}^{-T}\mathbf{L}^{-1}\) or \(\mathbf{L}\mathbf{L}^T = [\boldsymbol{\Sigma}/\phi]^{-1}\), so that \(\mathbf{L}*(1/\phi)^{0.5}\) is the Cholesky of the precision matrix of \(\boldsymbol{\Sigma}\).

Notably, for models available in `mssm`, `L` (and `LI`) have usually been computed for a permuted matrix, e.g., \(\mathbf{P}[\mathbf{X}^T\mathbf{X} + \mathbf{S}_{\lambda}]\mathbf{P}^T\) (see Wood & Fasiolo, 2017). Hence, for sampling we often need to correct for the permutation matrix \(\mathbf{P}\) (`P`). If `LI` is provided, then `P` can be omitted and is assumed to have been used to un-pivot `LI` already.

Used, for example, to sample from the uncorrected posterior \(\boldsymbol{\beta} | \mathbf{y}, \boldsymbol{\lambda} \sim N(\boldsymbol{\mu} = \hat{\boldsymbol{\beta}},[\mathbf{X}^T\mathbf{X} + \mathbf{S}_{\lambda}]^{-1}\phi)\) for a GAMM (see Wood, 2017). Based on section 7.4 in Gentle (2009), assuming \(\boldsymbol{\Sigma}\) is \(p*p\) and the covariance matrix of the uncorrected posterior, samples \(\boldsymbol{\beta}\) are then obtained by computing:
\[\boldsymbol{\beta} = \hat{\boldsymbol{\beta}} + [\mathbf{P}^T \mathbf{L}^{-T}*\phi^{0.5}]\mathbf{z}\ \text{where}\ z_i \sim N(0,1)\ \forall i = 1,...,p\]

Alternatively, relying on the equivalence:

\[[\mathbf{L}^T*(1/\phi)^{0.5}]\mathbf{P}[\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}] = \mathbf{z}\]

we can first solve for \(\mathbf{y}\) in:

\[[\mathbf{L}^T*(1/\phi)^{0.5}] \mathbf{y} = \mathbf{z}\]

followed by computing:
\[ \begin{align}\begin{aligned}\mathbf{y} = \mathbf{P}[\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}]\\\boldsymbol{\beta} = \hat{\boldsymbol{\beta}} + \mathbf{P}^T\mathbf{y}\end{aligned}\end{align} \]

The latter avoids forming \(\mathbf{L}^{-1}\) (which, unlike \(\mathbf{L}\), might not benefit from the sparsity-preserving permutation \(\mathbf{P}\)). If `LI is None`, `L` will thus be used for sampling as outlined in these alternative steps.

Often we care only about a handful of elements in `mu` (e.g., the first ones, corresponding to “fixed effects” in a GAMM). In that case we can generate samples only for this sub-set of interest by only using a sub-block of rows of \(\mathbf{L}\) or \(\mathbf{L}^{-1}\) (all columns remain). Argument `use` can be a `np.array` containing the indices of the elements in `mu` that should be sampled. Because this only works efficiently when `LI` is available, an error is raised when `use` is provided but `LI` is None.
If `mu` is set to any integer (i.e., not a Numpy array/list) it is automatically treated as 0. For `mssm.models.GAMMLSS` or `mssm.models.GSMM` models, `scale` can be set to 1.

References:
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
Gentle, J. (2009). Computational Statistics.
- Parameters:
n (int) – Number of samples to generate
mu (np.array) – mean of normal distribution as described above
scale (float) – scaling parameter of covariance matrix as described above
P (scp.sparse.csc_array) – Permutation matrix, optional.
L (scp.sparse.csc_array) – Cholesky of precision of scaled covariance matrix as described above.
LI (scp.sparse.csc_array, optional) – Inverse of cholesky factor of precision of scaled covariance matrix as described above.
use ([int], optional) – Indices of parameters in `mu` for which to generate samples, defaults to None in which case all parameters will be sampled

seed (int, optional) – Seed to use for random sample generation, defaults to None
- Raises:
ValueError – In case neither `LI` nor `L` are provided.

ValueError – In case `L` is provided but `P` is not.

ValueError – In case `use` is provided but `LI` is not.
- Returns:
Samples from the multivariate normal distribution. In case `use` is not provided, the returned array will be of shape `(p,n)` where `p==LI.shape[1]`. Otherwise, the returned array will be of shape `(len(use),n)`.

- Return type:

np.array
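A usage sketch for a fitted GAMM, relying on `model.lvi` (the un-pivoted inverse Cholesky factor referenced in `adjust_CI` above) so that `P` and `L` can be omitted:

```python
from mssm.src.python.utils import sample_MVN

# 1000 samples from the conditional posterior of all coefficients; shape (p, 1000):
post = sample_MVN(n=1000, mu=model.coef, scale=model.scale,
                  P=None, L=None, LI=model.lvi, seed=20)

# Samples for only the first five coefficients (e.g., "fixed effects"); shape (5, 1000):
post_sub = sample_MVN(n=1000, mu=model.coef, scale=model.scale,
                      P=None, L=None, LI=model.lvi, use=[0, 1, 2, 3, 4], seed=20)
```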
- mssm.src.python.utils.updateVp(ep, remls, rGrid)
Update the covariance matrix of the posterior for \(\mathbf{p} = log(\boldsymbol{\lambda})\). REML scores are used to approximate the expectation, similar to what was suggested by Greven & Scheipl (2016).
- References:
https://en.wikipedia.org/wiki/Estimation_of_covariance_matrices
Greven, S., & Scheipl, F. (2016). Comment on: Smoothing Parameter and Model Selection for General Smooth Models
- Parameters:
ep ([float]) – The model estimate of log(lambda), i.e., the expectation over rGrid
remls ([float]) – REML score associated with each lambda candidate in rGrid
rGrid ([float]) – A 2d array, holding all lambda samples considered so far. Each row is one sample
- Returns:
An estimate of the covariance matrix of log(lambda) - a 2d array of shape len(ep)*len(ep).
- Return type:
[float]
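A conceptual sketch of this kind of update (not mssm's internal code; the exact weighting of the REML scores is an assumption for illustration):

```python
import numpy as np

def weighted_Vp(ep, remls, rGrid):
    # Turn REML scores into normalized weights (shifting by the max for numerical stability).
    remls = np.asarray(remls, dtype=float)
    w = np.exp(remls - np.max(remls))
    w /= np.sum(w)

    # Deviations of each sampled log(lambda) vector from the model estimate ep.
    dev = np.asarray(rGrid) - np.asarray(ep).reshape(1, -1)

    # Weighted expectation of the outer products approximates the covariance matrix.
    return (dev * w[:, None]).T @ dev
```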