mssmViz package

Submodules

mssmViz.extract module

mssmViz.extract.eval_coverage(model: ~mssm.models.GAMM, pred_dat, dist_par=0, target: float = 0.0, use: [<class 'int'>] = None, alpha=0.05, whole_function=False, n_ps=10000, seed=None)

Evaluate CI coverage of target function over domain defined by pred_dat.

Parameters:

model (GAMM or GAMMLSS) – GAMM or GAMMLSS model.
pred_dat (pd.Dataframe) – pandas.DataFrame with data used to compute model predictions to be compared against target as well as CI.
dist_par (int, optional) – The index corresponding to the parameter for which to make the prediction (e.g., 0 = mean) - only necessary if a GAMMLSS model is provided, defaults to 0
target (float or [float], optional) – Target function. Can either be set to a float (e.g., 0.0 if the target function is believed to be zero everywhere across the domain) or a list/np.array ``of floats. In the latter case the shape of ``target must be a flattened 1D array. defaults to 0.0
use (list[int] or None) – The indices corresponding to the terms that should be used to obtain the prediction or None in which case all terms will be used.
alpha (float, optional) – The alpha level to use for the standard error calculation. Specifically, 1 - (alpha/2) will be used to determine the critical cut-off value according to a N(0,1).
whole_function (bool, optional) – Whether or not to adjuste the point-wise CI to behave like whole-function (based on Wood, 2017; section 6.10.2 and Simpson, 2016). Defaults to False.
n_ps (int, optional) – How many samples to draw from the posterior in case the point-wise CI is adjusted to behave like a whole-function CI.
seed (int or None, optional) – Can be used to provide a seed for the posterior sampling step in case the point-wise CI is adjusted to behave like a whole-function CI.

Returns:

A tuple with three elements. First is a bool, indicating whether the target function is covered by the CI at every evaluated value of the domain. Second is the (average) coverage across the entire domain. Third is a boolean array indicating for every evaluated value whether the corresponding value of target falls within the CI boundaries.

Return type:

(bool,float,[bool])

mssmViz.extract.get_term_coef(model: ~mssm.models.GAMM, which: [<class 'int'>], dist_par=0)

Get the coefficients associated with a specific term included in the Formula of model. Useful to extract for example the estimated random intercepts from a random effect model.

Note, coefficients will be in order of the model matrix columns. For a random intercept term (ri()) this implies that the first coefficient in the returned vector will correspond to the first encoded level of the random factor. Usually, mssm determines the level-ordering automatically and the actual factor-level corresponding to the first encoded level can be determined by inspecting the code-book returned from formula.get_factor_codings().

Parameters:

model (GAMM or GAMMLSS) – GAMM or GAMMLSS model.
which ([int]) – Index corresponding to the term in the model’s formula for which the coefficients should be extracted.
dist_par (int, optional) – The index corresponding to the parameter for which to make the prediction (e.g., 0 = mean) - only necessary if a GAMMLSS model is provided, defaults to 0

mssmViz.plot module

mssmViz.plot.plot(model: ~mssm.models.GAMM, which: [<class 'int'>] = None, dist_par=0, n_vals: int = 30, ci=None, ci_alpha=0.05, use_inter=False, whole_interval=False, n_ps=10000, seed=None, cmp: str = None, plot_exist=True, plot_exist_style='both', response_scale=False, axs=None, fig_size=(2.3622047244094486, 2.3622047244094486), math_font_size=9, math_font='cm', ylim=None, prov_cols=None, lim_dist=0.1)

Helper function to plot all smooth functions estimated by a GAMM or GAMMLSS model.

Smooth functions are automatically evaluated over a range of n_values spaced equally to cover their entire covariate. For tensor smooths a n_values*n_values grid is created. Visualizations can be obtained on the scale of the linear predictor (the default), but also on what is often referred to as the ‘response’-scale - corresponding to the estimated mean of the RVs modeled by the model. Note, that CIs will then only be approximate.

To simply obtain visualizations of all smooth terms estimated, it is sufficient to call:

plot(model) # or plot(model,dist_par=0) in case of a GAMMLSS model

This will visualize all smooth terms estimated by the model and automatically determine whether confidence intervals should be drawn or not (by default CIs are only visualized for fixed effects). Note that, for tensor smooths, areas of the smooth for which the CI contains zero will be visualized with low opacity if the CI is to be visualized.

References:

Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
Simpson, G. (2016). Simultaneous intervals for smooths revisited.

Parameters:

model (GAMM or GAMMLSS) – The estimated GAMM or GAMMLSS model for which the visualizations are to be obtained
which ([int] or None, optional) – The indices corresponding to the smooth that should be visualized or None in which case all smooth terms will be visualized, defaults to None
dist_par (int, optional) – The index corresponding to the parameter for which to make the prediction (e.g., 0 = mean) - only necessary if a GAMMLSS model is provided, defaults to 0
n_vals (int, optional) – Number of covariate values over which to evaluate the function. Will result in n_vals**2 eval points for tensor smooths, defaults to 30
ci (bool or None, optional) – Whether the standard error se for credible interval (CI; see Wood, 2017) calculation should be computed and used to visualize CIs. The CI is then [pred - se, pred + se], defaults to None in which case the CI will be visualized for fixed effects but not for random smooths
ci_alpha (float, optional) – The alpha level to use for the standard error calculation. Specifically, 1 - (alpha/2) will be used to determine the critical cut-off value according to a N(0,1), defaults to 0.05
use_inter (bool, optional) – Whether or not the standard error for CIs should be computed based on just the smooth or based on the smooth + the model intercept - the latter results in better coverage for strongly penalized functions (see Wood, 2017), defaults to False
whole_interval (bool, optional) – Whether or not to adjuste the point-wise CI to behave like whole-function (based on Wood, 2017; section 6.10.2 and Simpson, 2016), defaults to False
n_ps (int, optional) – How many samples to draw from the posterior in case the point-wise CI is adjusted to behave like a whole-function CI, defaults to 10000
seed (int, optional) – Can be used to provide a seed for the posterior sampling step in case the point-wise CI is adjusted to behave like a whole-function CI, defaults to None
cmp (str or None, optional) – string corresponding to name for a matplotlib colormap, defaults to None in which case it will be set to ‘RdYlBu_r’.
plot_exist (bool, optional) – Whether or not an indication of the data distribution should be provided. For univariate smooths setting this to True will add a rug-plot to the bottom, indicating for which covariate values samples existed in the training data. For tensor smooths setting this to true will result in a 2d scatter rug plot being added and/or values outside of data limits being hidden, defaults to True
plot_exist_style (str, optional) – Determines the style of the data distribution indication for smooths. Must be ‘rug’, ‘hide’,or ‘both’. ‘both’ will both add the rug-plot and hide values out of data limits, defaults to ‘both’
response_scale (bool, optional) – Whether or not predictions and CIs should be shown on the scale of the model predictions (linear scale) or on the ‘response-scale’ i.e., the scale of the mean, defaults to False
axs ([matplotlib.axis], optional) – A list of matplotlib.axis on which Figures should be drawn, defaults to None in which case axis will be created by the function and plot.show() will be called at the end
fig_size (tuple, optional) – Tuple holding figure size, which will be used to determine the size of the figures created if axs=None, defaults to (6/2.54,6/2.54)
math_font_size (int, optional) – Font size for math notation, defaults to 9
math_font (str, optional) – Math font to use, defaults to ‘cm’
ylim ((float,float), optional) – Tuple holding y-limits (z-limits for 2d plots), defaults to None in which case y_limits will be inferred from the predictions made
prov_cols (float or [float], optional) – A float or a list (in case of a smooth with a by argument) of floats in [0,1]. Used to get a color for unicariate smooth terms, defaults to None in which case colors will be selected automatically depending on whether the smooth has a by keyword or not
lim_dist (float, optional) – The floating point distance (on normalized scale, i.e., values have to be in [0,1]) at which a point is considered too far away from training data. Setting this to 0 means we visualize only points for which there is trainings data, setting this to 1 means visualizing everything. Defaults to 0.1

Raises:

ValueError – If fewer matplotlib axis are provided than the number of figures that would be created

mssmViz.plot.plot_diff(pred_dat1, pred_dat2, tvars, model: ~mssm.models.GAMM, use: [<class 'int'>] = None, dist_par=0, ci_alpha=0.05, whole_interval=False, n_ps=10000, seed=None, cmp: str = None, plot_exist=True, response_scale=True, ax=None, fig_size=(2.3622047244094486, 2.3622047244094486), ylim=None, col=0.7, label=None, title=None, lim_dist=0.1)

Plots the expected difference (and CI around this expected difference) between two sets of predictions, evaluated for pred_dat1 and pred_dat2.

This function is primarily designed to visualize the expected difference between two levels of a categorical/factor variable. For example, consider the following model below, including a separate smooth of “time” per level of the factor “cond”. It is often of interest to visualize when in time the two levels of “cond” differ from each other in their dependent variable. For this, the difference curve over “time”, essentially the smooth of “time” for the first level subtracted from the smooth of “time” for the second level of factor “cond” (offset terms can also be accounted for, check the use argument), can be visualized together with a CI (Wood, 2017). This CI can provide insights into whether and when the two levels can be expected to be different from each other. To visualize this difference curve as well as the difference CI, this function can be used as follows:

# Define & estimate model
model = GAMM(Formula(lhs("y"),[i(), l(["cond"]), f(["time"],by="cond")],data=dat),Gaussian())
model.fit()

# Create prediction data, differing only in the level of factor cond
time_pred = np.linspace(0,np.max(dat["time"]),30)
new_dat1 = pd.DataFrame({"cond":["a" for _ in range(len(time_pred))],
                        "time":time_pred})

new_dat2 = pd.DataFrame({"cond":["b" for _ in range(len(time_pred))],
                        "time":time_pred})

# Now visualize diff = (lpha_a + f_a(time)) - (lpha_b + f_b(time)) and the CI around diff
plot_diff(pred_dat1,pred_dat2,["time"],model)

This is only the most basic example to illustrate the usefulness of this function. Many other options are possible. Consider for example the model below, which allows for the expected time-course to vary smoothly as a function of additional covariate “x” - achieved by inclusion of the tensor smooth term of “time” and “x”. In addition, this model allows for the shape of the tensor smooth to differ between the levels of factor “cond”:

model = GAMM(Formula(lhs("y"),[i(), l(["cond"]), f(["time","x"],by="cond",te=True)],data=dat),Gaussian())

For such a model, multiple predicted differences might be of interest. One option would be to look only at a single level of “cond” and to visualize the predicted difference in the time-course for two different values of “x” (perhaps two quantiles). In that case, pred_dat1 and pred_dat2 would have to be set up to differ only in the value of “x” - they should be equivalent in terms of “time” and “cond” values.

Alternatively, it might be of interest to look at the predicted difference between the tensor smooth surfaces for two levels of factor “cond”. Rather than being interested in a difference curve, this would mean we are interested in a difference surface. To achieve this, pred_dat1 and pred_dat2 would again have to be set up to differ only in the value of “cond” - they should be equivalent in terms of “time” and “x” values. In addition, it would be necessary to specify tvars=[“time”,”x”]. Note that, for such difference surfaces, areas of the difference prediction for which the CI contains zero will again be visualized with low opacity if the CI is to be visualized.

References:

Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
Simpson, G. (2016). Simultaneous intervals for smooths revisited.

Parameters:

pred_dat1 (pandas.DataFrame) – A pandas DataFrame containing new data for which the prediction is to be compared to the prediction obtained for pred_dat2. Importantly, all variables present in the data used to fit the model also need to be present in this DataFrame. Additionally, factor variables must only include levels also present in the data used to fit the model. If you want to exclude a specific factor from the difference prediction (for example the factor subject) don’t include the terms that involve it in the use argument.
pred_dat2 (pandas.DataFrame) – Like pred_dat1 - ideally differing only in the level of a single factor variable or the value of a single continuous variable.
tvars ([str]) – List of variables to be visualized - must contain one string for difference predictions visualized as a function of a single variable, two for difference predictions visualized as a function of two variables
model (GAMM or GAMMLSS) – The estimated GAMM or GAMMLSS model for which the visualizations are to be obtained
use ([int] or None, optional) – The indices corresponding to the terms that should be used to obtain the prediction or None in which case all fixed effects will be used, defaults to None
dist_par (int, optional) – The index corresponding to the parameter for which to make the prediction (e.g., 0 = mean) - only necessary if a GAMMLSS model is provided, defaults to 0
ci_alpha (float, optional) – The alpha level to use for the standard error calculation. Specifically, 1 - (alpha/2) will be used to determine the critical cut-off value according to a N(0,1), defaults to 0.05
whole_interval (bool, optional) – Whether or not to adjuste the point-wise CI to behave like whole-function (based on Wood, 2017; section 6.10.2 and Simpson, 2016), defaults to False
n_ps (int, optional) – How many samples to draw from the posterior in case the point-wise CI is adjusted to behave like a whole-function CI, defaults to 10000
seed (int, optional) – Can be used to provide a seed for the posterior sampling step in case the point-wise CI is adjusted to behave like a whole-function CI, defaults to None
cmp (str or None, optional) – string corresponding to name for a matplotlib colormap, defaults to None in which case it will be set to ‘RdYlBu_r’.
plot_exist (bool, optional) – Whether or not an indication of the data distribution should be provided. For difference predictions visualized as a function of a single variable this will simply hide predictions outside of the data-limits. For difference predictions visualized as a function of a two variables setting this to true will result in values outside of data limits being hidden, defaults to True
response_scale (bool, optional) – Whether or not predictions and CIs should be shown on the scale of the model predictions (linear scale) or on the ‘response-scale’ i.e., the scale of the mean, defaults to True
ax (matplotlib.axis, optional) – A matplotlib.axis on which the Figure should be drawn, defaults to None in which case an axis will be created by the function and plot.show() will be called at the end
fig_size (tuple, optional) – Tuple holding figure size, which will be used to determine the size of the figures created if ax=None, defaults to (6/2.54,6/2.54)
ylim ((float,float), optional) – Tuple holding y-limits (z-limits for 2d plots), defaults to None in which case y_limits will be inferred from the predictions made
col (float, optional) – A float in [0,1]. Used to get a color for univariate predictions from the chosen colormap, defaults to 0.7
label ([str], optional) – A list of label to add to the y axis for univariate predictions or to the color-bar for tensor predictions, defaults to None
title ([str], optional) – A list of titles to add to each plot, defaults to None
lim_dist (float, optional) – The floating point distance (on normalized scale, i.e., values have to be in [0,1]) at which a point is considered too far away from training data. Setting this to 0 means we visualize only points for which there is trainings data, setting this to 1 means visualizing everything. Defaults to 0.1

Raises:

ValueError – If a visualization is requested for more than 2 variables

mssmViz.plot.plot_fitted(pred_dat, tvars, model: ~mssm.models.GAMM, use: [<class 'int'>] = None, pred_factors: [<class 'str'>] = None, dist_par=0, ci=True, ci_alpha=0.05, whole_interval=False, n_ps=10000, seed=None, cmp: str = None, plot_exist=True, plot_exist_style='both', response_scale=True, ax=None, fig_size=(2.3622047244094486, 2.3622047244094486), ylim=None, col=0.7, label=None, legend_label=False, title=None, lim_dist=0.1)

Plots the model prediction based on (a subset of) the terms included in the model for new data pred_dat.

In contrast to plot, the predictions are by default transformed to the scale of the mean (i.e., response-scale). If use=None, the model will simply use all parametric and regular smooth terms (but no random effects) for the prediction (i.e., only the “fixed” effects in the model).

For a GAMM, a simple example of this function would be:

# Fit model
model = GAMM(Formula(lhs("y"),[i(),f(["time"])],data=dat),Gaussian())
model.fit()

# Create prediction data
pred_dat = pd.DataFrame({"time":np.linspace(0,np.max(dat["time"]),30)})

# Plot predicted mean = lpha + f(time)
plot_fitted(pred_dat,["time"],model)

# This is in contrast to `plot`, which would just visualize pred = f(time)
plot(model)

Note that, for predictions visualized as a function of two variables, areas of the prediction for which the CI contains zero will again be visualized with low opacity if the CI is to be visualized.

References:

Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).
Simpson, G. (2016). Simultaneous intervals for smooths revisited.

Parameters:

pred_dat (pandas.DataFrame) – A pandas DataFrame containing new data for which to make the prediction. Importantly, all variables present in the data used to fit the model also need to be present in this DataFrame. Additionally, factor variables must only include levels also present in the data used to fit the model. If you want to exclude a specific factor from the prediction (for example the factor subject) don’t include the terms that involve it in the use argument.
tvars ([str]) – List of variables to be visualized - must contain one string for predictions visualized as a function of a single variable, two for predictions visualized as a function of two variables
model (GAMM or GAMMLSS) – The estimated GAMM or GAMMLSS model for which the visualizations are to be obtained
use ([int] or None, optional) – The indices corresponding to the terms that should be used to obtain the prediction or None in which case all fixed effects will be used, defaults to None
pred_factors ([str]) – List of factor variables to consider for data limit/availability computations - by default, all factor variables in the model are considered.
dist_par (int, optional) – The index corresponding to the parameter for which to make the prediction (e.g., 0 = mean) - only necessary if a GAMMLSS model is provided, defaults to 0
ci (bool or None, optional) – Whether the standard error se for credible interval (CI; see Wood, 2017) calculation should be computed and used to visualize CIs. The CI is then [pred - se, pred + se], defaults to None in which case the CI will be visualized for fixed effects but not for random smooths
ci_alpha (float, optional) – The alpha level to use for the standard error calculation. Specifically, 1 - (alpha/2) will be used to determine the critical cut-off value according to a N(0,1), defaults to 0.05
whole_interval (bool, optional) – Whether or not to adjuste the point-wise CI to behave like whole-function (based on Wood, 2017; section 6.10.2 and Simpson, 2016), defaults to False
n_ps (int, optional) – How many samples to draw from the posterior in case the point-wise CI is adjusted to behave like a whole-function CI, defaults to 10000
seed (int, optional) – Can be used to provide a seed for the posterior sampling step in case the point-wise CI is adjusted to behave like a whole-function CI, defaults to None
cmp (str or None, optional) – string corresponding to name for a matplotlib colormap, defaults to None in which case it will be set to ‘RdYlBu_r’.
plot_exist (bool, optional) – Whether or not an indication of the data distribution should be provided. For predictions visualized as a function of a single variable setting this to True will add a rug-plot to the bottom, indicating for which covariate values samples existed in the training data. For predictions visualized as a function of a two variables setting this to true will result in a 2d scatter rug plot being added and/or values outside of data limits being hidden, defaults to True
plot_exist_style (str, optional) – Determines the style of the data distribution indication for smooths. Must be ‘rug’, ‘hide’,or ‘both’. ‘both’ will both add the rug-plot and hide values out of data limits, defaults to ‘both’
response_scale (bool, optional) – Whether or not predictions and CIs should be shown on the scale of the model predictions (linear scale) or on the ‘response-scale’ i.e., the scale of the mean, defaults to True
ax (matplotlib.axis, optional) – A matplotlib.axis on which the Figure should be drawn, defaults to None in which case an axis will be created by the function and plot.show() will be called at the end
fig_size (tuple, optional) – Tuple holding figure size, which will be used to determine the size of the figures created if ax=None, defaults to (6/2.54,6/2.54)
ylim ((float,float), optional) – Tuple holding y-limits (z-limits for 2d plots), defaults to None in which case y_limits will be inferred from the predictions made
col (float, optional) – A float in [0,1]. Used to get a color for univariate predictions from the chosen colormap, defaults to 0.7
label (str, optional) – A label to add to the y axis for univariate predictions (or to a legend, if `legend_label`=True) or to the color-bar for tensor predictions, defaults to None
legend_label (bool, optional) – Whether or not any label should be added to a legend (don’t forget to call plt.legend()) or to the y-axis for univariate predicitions, defaults to False (the latter)
title ([str], optional) – A list of titles to add to each plot, defaults to None
lim_dist (float, optional) – The floating point distance (on normalized scale, i.e., values have to be in [0,1]) at which a point is considered too far away from training data. Setting this to 0 means we visualize only points for which there is trainings data, setting this to 1 means visualizing everything. Defaults to 0.1

Raises:

ValueError – If a visualization is requested for more than 2 variables

mssmViz.plot.plot_val(model: ~mssm.models.GAMM, pred_viz: [<class 'str'>] = None, resid_type='deviance', ar_lag=100, response_scale=False, qq=True, axs=None, fig_size=(2.3622047244094486, 2.3622047244094486))

Plots residual plots useful for validating whether the model meets the regression assumptions.

At least four plots will be generated:

A scatter-plot: Model predictions (always on response/mean scale) vs. Observations
A scatter-plot: Model predictions (optionally on response/mean scale) vs. Residuals
A Histogram/QQ-plot: Residuals (with density overlay of expected distribution)/Quantile-quantile plot for residuals against theoretical quantiles.
An ACF plot: Showing the auto-correlation in the residuals at each of ar_lag lags

For each additional predictor included in pred_viz, an additional scatter-plot will be generated plotting the predictor values against the residuals.

Which residuals will be visualized depends on the choice of model and resid_type. If model is a GAMM model, resid_type will determine whether “Pearson” or “Deviance” (default) residuals are to be plotted (Wood, 2017). Except, for a Gaussian GAMM, in which case the function will always plot the classical residuals, i.e., the difference between model predictions (mu_i) and observed values (y_i). This ensures that by default and for any GAMM we can expect the residuals to look like N independent samples from N(0,sqrt(phi)) - where phi is the scale parameter of the GAMM (sigma^2 for Gaussian). Hence, we can interpret all the plots in the same way. Note, that residuals for Binomial models will generally not look pretty or like N(0,sqrt(phi)) - but they should still be reasonably independent.

If model is a GAMMLSS model, resid_type will be ignored. Instead, the function will always plot standardized residuals that behave a lot like deviance residuals, except for the fact that we also cancel for phi, so that we can expect the residuals to look like N independent samples from N(0,1). In many cases, the computation of these residuals will thus follow the computation for GAMMs (for a Gaussian GAMMLSS model we can for example simply scale the residual vector by the sigma parameter estimated for each observation to achieve the desired distribution result) while for others more complicated standardization might be necessary (see Rigby & Stasinopoulos, 2005) - this will be noted in the docstring of the resid() method implemented by each GAMLSSFamily family. Again, we have to make an exeception for the Multinomial model (MULNOMLSS), which is currently not supported by this function.

References:

Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale and Shape.
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.).

Parameters:

model (GAMM or GAMMLSS) – Estimated GAMM or GAMMLSS model, for which the reisdual plots should be generated.
pred_viz ([str] or None, optional) – A list of additional predictor variables included in the model. For each one provided an additional plot will be created with the predictor on the x-axis and the residuals on the y-axis, defaults to None
resid_type (str, optional) – Type of residual to visualize. For a model that is a GAMM this can be “Pearson” or “Deviance” - for a Gaussian GAMM the function will alwasy plot default residuals (y_i - mu_i) independent of what is provided. For a model that is a GAMMLSS, the function will always plot standardized residuals that should approximately behave like deviance ones - except that they can be expected to look like N(0,1) if the model is specified correctly, defaults to “deviance”
ar_lag (int, optional) – Up to which lag the auto-correlation function in the residuals should be computed and visualized, defaults to 100
response_scale (bool, optional) – Whether or not predictions should be visualized on the scale of the mean or not, defaults to False - i.e., predictions are visualized on the scale of the model predictions/linear scale
qq (bool, optional) – Whether or not a qq-plot should be drawn instead of a Histogram, defaults to True
axs ([matplotlib.axis], optional) – A list of matplotlib.axis on which Figures should be drawn, defaults to None in which case axis will be created by the function and plot.show() will be called at the end
fig_size (tuple, optional) – Tuple holding figure size, which will be used to determine the size of the figures created if axs=None, defaults to (6/2.54,6/2.54)

Raises:

ValueError – If fewer matplotlib axis are provided than the number of figures that would be created
TypeError – If the function is called with a model of the MULNOMLSS family, which is currently not supported

mssmViz.sim module

mssmViz.sim.sim1(sim_size, sim_sigma=5.5, sim_lam=0.0001, sim_weak_nonlin=0.5, random_seed=None, fixed_seed=126)

First simulation for an additive time-series model with trial-level non-linear random effects. Data-set contains covariates time & x, for which the effect is different for three levels of factor variable fact.

Parameters:

sim_size (int, optional) – Number of trials, defaults to 1000
sim_sigma (float, optional) – Standard error for residuals, defaults to 5.5
sim_lam (_type_, optional) – Lambda parameter for trial-level non-linear effect complexity, defaults to 1e-4
sim_weak_nonlin (float, optional) – Strength of weakly non-linear covariate x effect, defaults to 0.5
random_seed (int, optional) – Seed for random parts of the simulation - should differ between repeated simulations, defaults to None
fixed_seed (int, optional) – Seed for fixed effects in the simulation - should NOT differ between repeated simulations, defaults to None

Returns:

Tuple, first element contains a pd.DataFrame with simulated data, second element is again a tuple containing: a np.array with the trial-level deviations, design matrices used for simulation, true coefficients used for simulation, and true intercepts used for simulation.

Return type:

(pd.Dataframe,(np.array,np.array,np.array,np.array,np.array,np.array))

mssmViz.sim.sim10(n, c=1, family=<mssm.src.python.exp_fam.GAUMLSS object>, seed=None)

Like sim9, except that c is used to scale effect of f0.

I.e. the model of the mean is: c*f0 + f1 + f4 For the sd: f2 + f3.

References:

Gu, C. & Whaba, G., (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method.

Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

mgcv source code: gam.sim.r

Parameters:

scale (float) – Standard deviation for family=’Gaussian’ else scale parameter
c (float) – Effect strength for random effect - 0 = No effect (sd=0), 1 = Maximal effect (sd=1)
family (Family, optional) – Distribution for response variable, must be: GAUMLSS() or GAMMALS(). Defaults to GAUMLSS([Identity(),LOG()])

mssmViz.sim.sim11(n, scale, c=1, binom_offset=0, n_ranef=40, family=<mssm.src.python.exp_fam.Gaussian object>, prop_q=0.95, seed=None)

Like sim4, except that a random smooth of variable x0 is added - extension of the second simulation performed by Wood et al., (2016).

c is used here to scale the contribution of the random smooth. Setting it to 0 means the ground truth is maximally wiggly. Setting it to 1 means the random smooth is actually a random intercept.

References:

Gu, C. & Whaba, G., (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method.

Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

mgcv source code: gam.sim.r

Parameters:

scale (float) – Standard deviation for family=’Gaussian’ else scale parameter
c (float) – Effect strength for random smooth - 0 = Maximally wiggly, 1 = ground truth is random intercept
binom_offset (float) – Additive adjustment to log-predictor for Binomial model (5 in mgcv) and baseline hazard parameter for Propoprtional Hazard model. Defaults to 0.
prop_q (float) – Simulated times exceeding q(prop_q)'' where ``q is the quantile function of a Weibull model parameterized with b=np.min(max(binom_offset,0.01)*np.exp(eta)) are treated as censored for Propoprtional Hazard model. Defaults to 0.
n_ranef (int) – Number of levels for the random smooth term. Defaults to 40.
family (Family, optional) – Distribution for response variable, must be: Gaussian(), Gamma(), or Binomial(). Defaults to Gaussian()

mssmViz.sim.sim12(n, c=1, n_ranef=40, family=<mssm.src.python.exp_fam.GAUMLSS object>, seed=None)

Like sim11, except for GAMLSS models. x0, x1, and x4 impact the mean - the remaining variables impact the scale parameter.

c is used here to scale the contribution of the random smooth (which is included in the model fo the mean). Setting it to 0 means the ground truth is maximally wiggly. Setting it to 1 means the random smooth is actually a random intercept.

References:

Gu, C. & Whaba, G., (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method.

Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

mgcv source code: gam.sim.r

Parameters:

c (float) – Effect strength for random smooth - 0 = Maximally wiggly, 1 = ground truth is random intercept
n_ranef (int) – Number of levels for the random smooth term. Defaults to 40.
family (Family, optional) – Distribution for response variable, must be: GAUMLSS() or GAMMALS(). Defaults to GAUMLSS([Identity(),LOGb()])

mssmViz.sim.sim2(sim_size, sim_sigma=5.5, sim_lam=0.0001, set_zero=1, random_seed=None, fixed_seed=126)

Second simulation for an additive time-series model with trial-level non-linear random effects. Data contains two additional covariates apart from time (x & z) - values for z vary within and between series, x only between series.

Ground-truth for x or z can be set to zero.

Parameters:

sim_size (int, optional) – Number of trials, defaults to 1000
sim_sigma (float, optional) – Standard error for residuals, defaults to 5.5
sim_lam (_type_, optional) – Lambda parameter for trial-level non-linear effect complexity, defaults to 1e-4
set_zero (int, optional) – Which covariate (1 or 2 for x and z respectively) to set to zero, defaults to 1
random_seed (int, optional) – Seed for random parts of the simulation - should differ between repeated simulations, defaults to None
fixed_seed (int, optional) – Seed for fixed effects in the simulation - should NOT differ between repeated simulations, defaults to None

Returns:

Tuple, first element contains a pd.DataFrame with simulated data, second element is again a tuple containing: a np.array with the trial-level deviations, design matrices used for simulation, true effects used for simulation, and true offset used for simulation.

Return type:

(pd.Dataframe,(np.array,np.array,np.array,np.array,np.array,np.array,np.array,float))

mssmViz.sim.sim3(n, scale, c=1, binom_offset=0, family=<mssm.src.python.exp_fam.Gaussian object>, prop_q=0.95, correlate=False, seed=None)

First Simulation performed by Wood et al., (2016): 4 smooths, 1 is really zero everywhere. Based on the original functions of Gu & Whaba (1991).

This is also the first simulation performed by gamSim() - except for the fact that f(x_0) can also be set to zero, as was done by Wood et al., (2016)

Covariates can also be simulated to correlate with each other, following the steps outlined in supplementary materials E of Wood et al., (2016).

References:

Gu, C. & Whaba, G., (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method.

Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

mgcv source code: gam.sim.r

Parameters:

scale (float) – Standard deviation for family=’Gaussian’ else scale parameter
c (float) – Effect strength for x3 effect - 0 = No effect, 1 = Maximal effect
binom_offset (float) – Additive adjustment to log-predictor for Binomial and Poisson model (5 in mgcv) and baseline hazard parameter for Propoprtional Hazard model. Defaults to 0.
prop_q (float) – Simulated times exceeding q(prop_q)'' where ``q is the quantile function of a Weibull model parameterized with b=np.min(max(binom_offset,0.01)*np.exp(eta)) are treated as censored for Propoprtional Hazard model. Defaults to 0.
correlate (bool) – Whether predictor covariates should correlate or not. Defaults to False
family (Family, optional) – Distribution for response variable, must be: Gaussian(), Gamma(), Binomial(), or Poisson(). Defaults to Gaussian()

mssmViz.sim.sim4(n, scale, c=1, binom_offset=0, family=<mssm.src.python.exp_fam.Gaussian object>, prop_q=0.95, correlate=False, seed=None)

Like sim3, except that a random factor is added - second simulation performed by Wood et al., (2016).

This is also the sixth simulation performed by gamSim() - except for the fact that c is used here to scale the contribution of the random factor, as was also done by Wood et al., (2016)

Covariates can also be simulated to correlate with each other, following the steps outlined in supplementary materials E of Wood et al., (2016).

References:

Gu, C. & Whaba, G., (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method.

Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

mgcv source code: gam.sim.r

Parameters:

scale (float) – Standard deviation for family=’Gaussian’ else scale parameter
c (float) – Effect strength for random effect - 0 = No effect (sd=0), 1 = Maximal effect (sd=1)
binom_offset (float) – Additive adjustment to log-predictor for Binomial and Poisson model (5 in mgcv) and baseline hazard parameter for Propoprtional Hazard model. Defaults to 0.
prop_q (float) – Simulated times exceeding q(prop_q)'' where ``q is the quantile function of a Weibull model parameterized with b=np.min(max(binom_offset,0.01)*np.exp(eta)) are treated as censored for Propoprtional Hazard model. Defaults to 0.
family (Family, optional) – Distribution for response variable, must be: Gaussian(), Gamma(), Binomial(), or Poisson(). Defaults to Gaussian()
correlate (bool) – Whether predictor covariates should correlate or not. Defaults to False

mssmViz.sim.sim5(n, seed=None)

Simulates n data-points for a Multi-nomial model - probability of Y_i being one of K=5 classes changes smoothly as a function of variable x and differently so for each class - based on slightly modified versions of the original functions of Gu & Whaba (1991).

References: - Gu, C. & Whaba, G., (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. - Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models - mgcv source code: gam.sim.r

mssmViz.sim.sim6(n, family=<mssm.src.python.exp_fam.GAUMLSS object>, seed=None)

Simulates n data-points for a Gaussian or Gamma GAMMLSS model - mean and standard deviation/scale change based on the original functions of Gu & Whaba (1991).

References: - Gu, C. & Whaba, G., (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. - Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models - mgcv source code: gam.sim.r

Parameters:: family (GAMLSSFamily, optional) – Distribution for response variable, must be: GAUMLSS(), GAMMALS(). Defaults to GAUMLSS([Identity(),LOG()])

mssmViz.sim.sim7(n, c, scale, family=<mssm.src.python.exp_fam.Gaussian object>, seed=None)

An overlap simulation with a random intercept. Two events are present that differ in their onset across different trials - each event triggers a response in the signal (modeled as two functions of Gu & Whaba).

References:

Gu, C. & Whaba, G., (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method.

Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

mgcv source code: gam.sim.r

Parameters:

n (int) – Number of trials to simulate
c (float) – Effect strength for random effect - 0 = No effect (sd=0), 1 = Maximal effect (sd=1)
scale (float) – Standard deviation for family=’Gaussian’ else scale parameter
family (Family, optional) – Distribution for response variable, must be: Gaussian(), Gamma(), or Binomial(). Defaults to Gaussian()

mssmViz.sim.sim8(n, c, family=<mssm.src.python.exp_fam.GAUMLSS object>, seed=None)

Like sim6: Simulates n data-points for a Gaussian or Gamma GAMMLSS model - mean and standard deviation/scale change based on the original functions of Gu & Whaba (1991). Difference is that the effect strength of the scale smoother can be manipulated by changing c!

References: - Gu, C. & Whaba, G., (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. - Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models - mgcv source code: gam.sim.r

Parameters:

c (float) – Effect strength for random effect - 0 = No effect (sd=0), 1 = Maximal effect (sd=1)
family (GAMLSSFamily, optional) – Distribution for response variable, must be: GAUMLSS(), GAMMALS(). Defaults to GAUMLSS([Identity(),LOG()])

mssmViz.sim.sim9(n, c=1, family=<mssm.src.python.exp_fam.GAUMLSS object>, seed=None)

Like sim4, except for a GAMMLSS model.

The random intercept to be selected is included in the model of the mean. I.e. the model of the mean is: f0 + f1 + c*f4 For the sd: f2 + f3.

References:

Gu, C. & Whaba, G., (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method.

Wood, S. N., Pya, N., Saefken, B., (2016). Smoothing Parameter and Model Selection for General Smooth Models

mgcv source code: gam.sim.r

Parameters:

scale (float) – Standard deviation for family=’Gaussian’ else scale parameter
c (float) – Effect strength for random effect - 0 = No effect (sd=0), 1 = Maximal effect (sd=1)
family (Family, optional) – Distribution for response variable, must be: GAUMLSS() or GAMMALS(). Defaults to GAUMLSS([Identity(),LOGb()])