Generate synthetic data (multiple features)

Usage

generate_synthetic_data_p(
  n_subjects = 100,
  n_timepoints = 11,
  n_features = 2,
  feature_type = "binary",
  observation_variance = 1,
  random_effect_variance_ratio = 1,
  random_effect_ar1_correlation = 0.9,
  effect_size = 1,
  effect_groupsparsity = 0.5,
  effect_sparsity = 0.5,
  prop_observed = 0.7,
  missingness = "uniform",
  seed = 1
)

Arguments

n_subjects

Number of subjects (default 100).

n_timepoints

Number of time points at which the data is sampled, regularly spaced on the \([0,1]\) interval (default 11).

n_features

Number of features in addition to the intercept (default 2).

feature_type

Type of feature (implemented: "binary" (default), "uniform", "normal").

observation_variance

Variance of the observation error (default 1).

random_effect_variance_ratio

Ratio of the variance of the random effect to the variance of the observation error (default 1).

random_effect_ar1_correlation

Correlation between two random effects at two consecutive timepoints (default 0.9).

effect_size

Effect size of the features (default 1).

effect_groupsparsity

Proportion of features that have no effect (default 0.5).

effect_sparsity

Proportion of the domain that has null effect (default 0.5).

prop_observed

Proportion of observed time points (default 0.7).

missingness

Mechanism used to generate missing observations (implemented: "uniform" (default), "contiguous", "sqrt", "fixed_uniform").

seed

Seed for the random number generator (default 1).
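To make n_timepoints and prop_observed concrete, here is a minimal base R sketch of a regular time grid on \([0,1]\) and an observation mask. It is an illustration only: the rule used below (each subject keeps a fixed number of time points drawn uniformly without replacement) is an assumption, and the package's own "uniform" mechanism may differ. The object names `times`, `n_keep` and `observed` are hypothetical.

```r
# Hypothetical sketch, not the package's implementation.
set.seed(1)
n_subjects <- 100
n_timepoints <- 11
prop_observed <- 0.7

# Regularly spaced time points on [0, 1]
times <- seq(0, 1, length.out = n_timepoints)

# Assumption: each subject keeps round(prop_observed * n_timepoints) time points,
# drawn uniformly without replacement.
n_keep <- round(prop_observed * n_timepoints)
observed <- t(replicate(n_subjects, {
  mask <- logical(n_timepoints)
  mask[sample.int(n_timepoints, n_keep)] <- TRUE
  mask
}))

dim(observed)   # n_subjects x n_timepoints logical matrix
mean(observed)  # realized proportion of observed time points
```

Because the number of kept time points is rounded, the realized proportion can differ slightly from prop_observed.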

Value

A list containing the following elements:

data

A data frame containing the longitudinal data in long format.

data_wide

A data frame containing the longitudinal data in wide format.

data_wide_imputed

A data frame containing the longitudinal data in wide format with imputed missing values.

true_values

A data frame containing the true values of the fixed effects.

times

A vector of timepoints at which the data is sampled.

colnames

A list containing the column names of the data frames.
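The difference between the long and wide formats can be illustrated on a toy data frame with base R's `reshape()`. This is only a sketch: the column names `subject_id`, `time` and `response` are hypothetical and are not necessarily those used in the returned data frames (the colnames element holds the actual names).

```r
# Toy long-format data; the column names here are hypothetical
long <- data.frame(
  subject_id = rep(1:3, each = 3),
  time = rep(c(0, 0.5, 1), times = 3),
  response = c(1.2, 0.8, 1.5, 0.3, 0.9, 1.1, 2.0, 1.7, 1.4)
)

# Drop two rows to mimic unobserved time points
long <- long[-c(2, 9), ]

# One row per subject, one column per time point; unobserved entries become NA
wide <- reshape(long, idvar = "subject_id", timevar = "time",
                v.names = "response", direction = "wide")
wide
```

In the wide format, missing observations show up as explicit NA entries, which is what an imputation step would then fill in.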

Details

The data is generated according to the following model:

$$y_{ij} = \beta_0(t_{ij}) + \sum_{k=1}^{p} \beta_k(t_{ij}) x_{ik} + \theta_{ij} + \epsilon_{ij}$$

where \(y_{ij}\) is the \(j\)th response of subject \(i\), observed at time \(t_{ij}\), \(x_{ik}\) is the \(k\)th feature of subject \(i\), \(\theta_{ij}\) is the random effect of subject \(i\) at time \(t_{ij}\), \(\epsilon_{ij}\) is the observation error of subject \(i\) at time \(t_{ij}\), \(\beta_0(\cdot)\) is the intercept function, and \(\beta_k(\cdot)\) is the effect function of the \(k\)th feature, for \(k = 1, \dots, p\) (\(p\) defined by n_features).

The random effects of each subject are generated according to a multivariate normal distribution with mean 0 and covariance matrix \(\Sigma\) such that \(\Sigma_{jj'} = r\sigma^2\rho^{|j-j'|}\) for timepoints \(j\) and \(j'\) (\(\rho\) defined by random_effect_ar1_correlation and \(r\) by random_effect_variance_ratio). The observation errors are generated according to a multivariate normal distribution with mean 0 and covariance matrix \(\sigma^2 I\) (\(\sigma^2\) defined by observation_variance). The random effects and observation errors are independent.

The observations are then sampled uniformly at a proportion prop_observed of the timepoints. An imputed version of the data is also provided, in which the missing values are imputed using the function refund::fpca.sc.

The data is returned in three formats:

data

A data frame containing the longitudinal data in long format.

data_wide

A data frame containing the longitudinal data in wide format.

data_wide_imputed

A data frame containing the longitudinal data in wide format with imputed missing values.
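The model above can be simulated directly in base R. The following is a minimal sketch under stated assumptions, not the package's code: it uses a single binary feature, a zero intercept function, and a hypothetical effect function `beta1` that is null on the first half of the domain. Only the covariance structure \(r\sigma^2\rho^{|j-j'|}\) and the error variance \(\sigma^2\) are taken from the description above. Since the random effects and errors are independent, each response has conditional variance \(\sigma^2(r + 1)\) given the feature.

```r
# Hypothetical sketch of the generative model, not the package's implementation.
set.seed(1)
n_subjects <- 2000   # large, so empirical moments are close to theory
n_timepoints <- 11
sigma2 <- 1          # observation_variance
r <- 1               # random_effect_variance_ratio
rho <- 0.9           # random_effect_ar1_correlation

# AR(1) covariance of the random effects: Sigma[j, j'] = r * sigma2 * rho^|j - j'|
idx <- seq_len(n_timepoints)
Sigma <- r * sigma2 * rho^abs(outer(idx, idx, "-"))

# Draw theta_i ~ N(0, Sigma) via the Cholesky factor (rows are subjects)
z <- matrix(rnorm(n_subjects * n_timepoints), n_subjects, n_timepoints)
theta <- z %*% chol(Sigma)

# Independent observation errors with covariance sigma2 * I
eps <- matrix(rnorm(n_subjects * n_timepoints, sd = sqrt(sigma2)),
              n_subjects, n_timepoints)

# Assumed coefficient functions: zero intercept, effect null on [0, 0.5]
times <- seq(0, 1, length.out = n_timepoints)
beta0 <- rep(0, n_timepoints)
beta1 <- ifelse(times > 0.5, 1, 0)
x <- rbinom(n_subjects, 1, 0.5)  # one binary feature per subject

# y[i, j] = beta0(t_j) + beta1(t_j) * x_i + theta[i, j] + eps[i, j]
y <- matrix(beta0, n_subjects, n_timepoints, byrow = TRUE) +
  outer(x, beta1) + theta + eps
```

Comparing `cov(theta)` with `Sigma` is a quick way to check that the simulated random effects have the intended AR(1) structure.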

Examples

instance = generate_synthetic_data_p()
#> New names:
#>  `` -> `...1`
#>  `X1` -> `X1...2`
#>  `X2` -> `X2...3`
#>  `X1` -> `X1...4`
#>  `X2` -> `X2...5`
#> New names:
#>  `X1` -> `X1...2`
#>  `X2` -> `X2...3`
#>  `X1` -> `X1...4`
#>  `X2` -> `X2...5`