{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import scipy as scp\n", "import matplotlib.pyplot as plt\n", "import math\n", "import matplotlib.cm as cmx\n", "from mssm.models import *\n", "from mssm.src.python.compare import compare_CDL\n", "from mssmViz.plot import *\n", "from mssmViz.sim import *\n", "from mssmViz.extract import *\n", "import matplotlib.style as mplstyle\n", "mplstyle.use('fast')\n", "np.set_printoptions(suppress=True)\n", "\n", "nc = 10\n", "colors = [cmx.tab20(x) for x in np.linspace(0.0,min(0.1*nc,1.0),nc)]\n", "single_size = 8/2.54" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# `mssm` Tutorial\n", "\n", "With `mssm` we can estimate a broad variety of smooth models as they are also supported in `mgcv` (Wood, 2017). Terms that are available in both `mgcv` and `mssm` include regular univariate smooth terms, binary smooth terms, Tensor smooth interactions, by-factor smooth terms (ordered and not ordered), and random non-linear smooth terms. Like `mgcv`, `mssm` also supports random effects (random intercepts, random slopes, and as mentioned random smooths). By default a B-spline basis is used by `mssm` for all smooth terms and penalties are either difference (Eilers & Marx, 2010) or identity penalties (Kernel penalties, Marra & Wood (2012) are also supported).\n", "\n", "\n", "Like `mgcv`, `mssm` supports estimation of conventional Generalized additive (mixed) models. Currently, Binomial models (N >= 1), Gamma models, Inverse Gaussian, Poisson, and Gaussian models are supported! `mssm` uses sparse matrices and is thus particularly well equipped to handle models of **multi-level data, including many random effects**! `mssm` also supports estimation of GAMMs of location, scale, and shape parameters (\"GAMMLSS\"; see Rigby & Stasinopoulos, 2005) and even more general smooth models (\"GSMMs\") as defined by Wood, Pya, & Säfken (2016). \n", "\n", "`mssm` automatically figures out the right amount of penalization for all terms and all these models using a combination of the approaches discussed by Wood, Pya, & Säfken (2016), Wood & Fasiolo (2017), Wood, Shaddick, & Augustin, (2017). Alternatively, the `L-qEFS` update by Krause et al. (submitted) can be used to select the regularization/penalization parameters. This update is particularly attrative when working with general smooth models, since it only requires researchers to code up the log-likelihood and (optionally) ways to compute its gradient in order to estimate and regularize the model. Krause et al. (submitted) provides a detailed description of the different algorithms implemented in the `mssm` toolbox.\n", "\n", "In contrast, this tutorial focuses on practical application of the `mssm` toolbox. It contains:\n", "\n", "- a brief discussion of the structure your data should have\n", "- an introduction to the `Formula` class, including a conversion table if you are familiar with `mgcv` syntax and just want to get started\n", "- an extensive section of examples involving Gaussian additive models so that you can get familiar with the `Formula` class and the different terms supported by `mssm`\n", "- examples on how to estimate Generalized additive models and even more general models (i.e., GAMMLSS and GSMMs)\n", "- a section on advanced topics, focused on model selection, posterior sampling, and working with the `GSMM` class\n", "\n", "Along the way, this tutorial introduces the plot functions available in `mssmViz`, which can be used to visualize model predictions and to asses model validity via residual plots for all models estimated by `mssm`. This tutorial assumes that you are already familiar with the idea behind GAMMS (and perhaps even GAMMLSS models; Rigby & Stasinopoulos, 2005) and does not aim to provivde a thorough introduction to their theory. Instead, the aim is to quickly show you how to fit, inspect, validate, and select between these models with `mssm`. If you want to learn more about the theory behind GAMMS (and GAMMLSS models), head to the reference section at the end of the tutorial, which contains pointers to papers and books focused on GAMM & GAMMLSS models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data structure" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import some simulated data\n", "dat = pd.read_csv('https://raw.githubusercontent.com/JoKra1/mssmViz/main/data/GAMM/sim_dat.csv')\n", "\n", "# mssm requires that the data-type for variables used as factors is 'O'=object\n", "dat = dat.astype({'series': 'O',\n", " 'cond':'O',\n", " 'sub':'O',\n", " 'series':'O'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some remarks about the desired layout of the data to be usable with the GAMM class:\n", "\n", "- Data has to be provided in a `pandas.DataFrame`.\n", "- The dependent variable can contain NAs, which will be taken care of. **No NAs should be present in any of the predictor columns**!\n", "- The data-type (dtype) for numeric columns should be float or int. Categorical predictors need to have the object datatype (see code above).\n", "- No specific row/column sorting is required.\n", "\n", "**Exception to the last point:**\n", "\n", "If you want to approximate the computations for random smooth models of multi-level time-series data or want to include an \"ar1\" model of the residuals (see section on advanced topics), **the data should be ordered according to series, then time**: the ordered (increasing) time-series corresponding to the first series should make up the first X rows. The ordered time-series corresponding to the next series should make up the next Y rows, and so on. More specifically, what matters is that individual series are making up a time-ordered sub-block of dat. So you could also have a dataframe that starts with a series labeled 0, then a series labeled 115, then a series labeled X. As long as any series in the data is not interrupted by rows corresponding to any other series it will be fine." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | y | \n", "time | \n", "series | \n", "cond | \n", "sub | \n", "x | \n", "
---|---|---|---|---|---|---|
0 | \n", "7.005558 | \n", "0 | \n", "0 | \n", "a | \n", "0 | \n", "9.817736 | \n", "
1 | \n", "11.122316 | \n", "20 | \n", "0 | \n", "a | \n", "0 | \n", "10.262371 | \n", "
2 | \n", "4.766720 | \n", "40 | \n", "0 | \n", "a | \n", "0 | \n", "10.445887 | \n", "
3 | \n", "2.952046 | \n", "60 | \n", "0 | \n", "a | \n", "0 | \n", "8.481554 | \n", "
4 | \n", "7.463034 | \n", "80 | \n", "0 | \n", "a | \n", "0 | \n", "10.180660 | \n", "