Generate synthetic data (group difference)

Usage

generate_synthetic_data(
  n_subjects = 100,
  n_timepoints = 11,
  prop_observed = 0.7,
  observation_variance = 1,
  random_effect_variance_ratio = 1,
  random_effect_ar1_correlation = 0.9,
  effect_size = 1,
  grpdiff_function = "sigmoid",
  missingness = "uniform",
  seed = 1
)

Arguments

n_subjects: Number of subjects (default 100). Half of them are in group 0 and the other half in group 1 (on average).
n_timepoints: Number of timepoints at which the data is sampled, regularly spaced on the $[0,1]$ interval (default 11).
prop_observed: Proportion of observed time points (default 0.7).
observation_variance: Variance of the observation error (default 1).
random_effect_variance_ratio: Ratio of the variance of the random effect to the variance of the observation error (default 1).
random_effect_ar1_correlation: Correlation between two random effects at two consecutive timepoints (default 0.9).
effect_size: Effect size of the group (default 1).
grpdiff_function: Group difference function (implemented: "sigmoid" (default), "sine"). Can also be a callable function, provided its evaluation is vectorized.
missingness: Missing observation mechanism (implemented: "uniform" (default), "contiguous", "sqrt", "fixed_uniform").
seed: Seed for the random number generator (default 1).
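
As a minimal sketch of the callable option for grpdiff_function: the function ramp_diff below is a hypothetical example (a linear ramp, not part of the package). It is vectorized in the sense that it returns one value per input time. That the callable receives a single numeric vector of times is an assumption based on the argument description, not something confirmed here.

```r
# Hypothetical custom group difference: zero until t = 0.5, then a linear ramp up to 1.
# pmax() and pmin() work elementwise, so a vector of times yields a vector of values.
ramp_diff <- function(t) {
  pmin(pmax((t - 0.5) / 0.5, 0), 1)
}

# Evaluate on the default grid of 11 regularly spaced timepoints on [0, 1].
times <- seq(0, 1, length.out = 11)
ramp_values <- ramp_diff(times)

# Assumed usage (needs the package that provides generate_synthetic_data):
# instance <- generate_synthetic_data(grpdiff_function = ramp_diff)
```

A function that only handles a scalar input (for example one built around `if (t > 0.5)`) would not satisfy the vectorization requirement.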

Value

A list containing the following elements:

data: A data frame containing the longitudinal data in long format.
data_wide: A data frame containing the longitudinal data in wide format.
data_wide_imputed: A data frame containing the longitudinal data in wide format with imputed missing values.
true_values: A data frame containing the true values of the fixed effects.
times: A vector of timepoints at which the data is sampled.
colnames: A list containing the column names of the data frames.

Details

The data is generated according to the following model: $$y_{ij} = \beta_0(t_{ij}) + \beta_1(t_{ij}) g_i + \theta_{ij} + \epsilon_{ij}$$ where $y_{ij}$ is the response of subject $i$ at the $j$-th timepoint $t_{ij}$, $g_i$ is the group of subject $i$, $\theta_{ij}$ is the random effect of subject $i$ at timepoint $j$, $\epsilon_{ij}$ is the observation error of subject $i$ at timepoint $j$, $\beta_0(\cdot)$ is the intercept function, and $\beta_1(\cdot)$ is the group effect function.

The random effects of each subject are drawn from a multivariate normal distribution with mean 0 and covariance matrix $\Sigma$, whose entry for timepoints $j$ and $k$ is $\Sigma_{jk} = r\sigma^2\rho^{|j-k|}$ ($\rho$ defined by random_effect_ar1_correlation and $r$ by random_effect_variance_ratio). The observation errors are drawn from a multivariate normal distribution with mean 0 and covariance matrix $\sigma^2 I$ ($\sigma^2$ defined by observation_variance). The random effects and observation errors are independent.

With the default grpdiff_function = "sigmoid", the response depends on time and group through the logistic function: $$f(t) = \frac{1}{1 + \exp((0.6-t)20)}$$

The observations are then retained at a proportion prop_observed of the timepoints, sampled uniformly. An imputed version of the data is also provided, in which the missing values are imputed using the function [refund::fpca.sc]. The data is returned in three formats:

data: A data frame containing the longitudinal data in long format.
data_wide: A data frame containing the longitudinal data in wide format.
data_wide_imputed: A data frame containing the longitudinal data in wide format with imputed missing values.
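
The model above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's actual implementation: it assumes $\beta_0(t) = 0$, $\beta_1(t)$ equal to effect_size times the sigmoid $f(t)$, group labels drawn with probability 1/2, and a "uniform" missingness that keeps a random subset of timepoints for each subject. The variable names are chosen to mirror the arguments.

```r
set.seed(1)

n_subjects <- 100
times <- seq(0, 1, length.out = 11)  # regular grid on [0, 1]
n_t <- length(times)
sigma2 <- 1          # observation_variance
r <- 1               # random_effect_variance_ratio
rho <- 0.9           # random_effect_ar1_correlation
effect_size <- 1
prop_observed <- 0.7

# Sigmoid from the Details section: f(t) = 1 / (1 + exp((0.6 - t) * 20))
sigmoid <- function(t) 1 / (1 + exp((0.6 - t) * 20))

# AR(1) covariance of the random effects: Sigma[j, k] = r * sigma2 * rho^|j - k|
Sigma <- r * sigma2 * rho^abs(outer(seq_len(n_t), seq_len(n_t), "-"))

# Random effects: rows of Z %*% chol(Sigma) are N(0, Sigma),
# since chol() returns the upper factor R with t(R) %*% R = Sigma.
Z <- matrix(rnorm(n_subjects * n_t), n_subjects, n_t)
theta <- Z %*% chol(Sigma)

# Independent observation errors with variance sigma2
eps <- matrix(rnorm(n_subjects * n_t, sd = sqrt(sigma2)), n_subjects, n_t)

# Assumption: each subject is in group 0 or 1 with probability 1/2
g <- rbinom(n_subjects, 1, 0.5)

# Assumptions: beta_0(t) = 0 and beta_1(t) = effect_size * f(t)
beta1 <- effect_size * sigmoid(times)
y <- outer(g, beta1) + theta + eps   # n_subjects x n_t matrix (wide layout)

# Assumed "uniform" missingness: keep n_keep random timepoints per subject
n_keep <- round(prop_observed * n_t)
for (i in seq_len(n_subjects)) {
  y[i, -sample.int(n_t, n_keep)] <- NA
}
```

In this sketch the marginal variance of an observation is $r\sigma^2 + \sigma^2$, so with the default values half of the variability comes from the subject-level random effect.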

Examples

instance = generate_synthetic_data()
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`