Generate synthetic data (multiple features)
Usage
generate_synthetic_data_p(
n_subjects = 100,
n_timepoints = 11,
n_features = 2,
feature_type = "binary",
observation_variance = 1,
random_effect_variance_ratio = 1,
random_effect_ar1_correlation = 0.9,
effect_size = 1,
effect_groupsparsity = 0.5,
effect_sparsity = 0.5,
prop_observed = 0.7,
missingness = "uniform",
seed = 1
)
Arguments
- n_subjects
Number of subjects (default 100).
- n_timepoints
Number of time points at which the data are sampled, spread regularly over the interval \([0,1]\) (default 11).
- n_features
Number of features in addition to the intercept (default 2).
- feature_type
Type of feature (implemented: "binary" (default), "uniform", "normal").
- observation_variance
Variance of the observation error (default 1).
- random_effect_variance_ratio
Ratio of the variance of the random effect to the variance of the observation error (default 1).
- random_effect_ar1_correlation
Correlation between two random effects at two consecutive timepoints (default 0.9).
- effect_size
Effect size of the features (default 1).
- effect_groupsparsity
Proportion of features that have no effect (default 0.5).
- effect_sparsity
Proportion of the domain that has null effect (default 0.5).
- prop_observed
Proportion of observed time points (default 0.7).
- missingness
Missing observation mechanism (implemented: "uniform" (default), "contiguous", "sqrt", "fixed_uniform").
- seed
Seed for the random number generator (default 1).
Value
A list containing the following elements:
- data
A data frame containing the longitudinal data in long format.
- data_wide
A data frame containing the longitudinal data in wide format.
- data_wide_imputed
A data frame containing the longitudinal data in wide format with imputed missing values.
- true_values
A data frame containing the true values of the fixed effects.
- times
A vector of timepoints at which the data is sampled.
- colnames
A list containing the column names of the data frames.
Details
The data is generated according to the following model:
$$y_{ij} = \beta_0(t_{ij}) + \sum_{k=1}^{p} \beta_k(t_{ij}) x_{ik} + \theta_{ij} + \epsilon_{ij}$$
where \(y_{ij}\) is the \(j\)th response of subject \(i\), observed at time \(t_{ij}\),
\(x_{ik}\) is the \(k\)th feature of subject \(i\),
\(\theta_{ij}\) is the random effect of subject \(i\) at time \(t_{ij}\),
\(\epsilon_{ij}\) is the observation error of subject \(i\) at time \(t_{ij}\),
\(\beta_0(\cdot)\) is the intercept function, and \(\beta_k(\cdot)\), \(k = 1, \dots, p\), is the time-varying effect of the \(k\)th feature.
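The fixed-effect part of the model can be sketched in base R as follows. This is only an illustration: the coefficient functions `beta0`, `beta1`, and `beta2` are hypothetical choices (the documentation does not state their actual shapes), picked so that one feature has an effect that is zero on half of the domain and the other has no effect at all.

```r
set.seed(1)
n_subjects <- 5
times <- seq(0, 1, length.out = 11)   # regular grid on [0, 1]

# Two binary features per subject (one row per subject).
x <- matrix(rbinom(n_subjects * 2, size = 1, prob = 0.5), nrow = n_subjects)

# Hypothetical coefficient functions, NOT the ones used by the package.
beta0 <- function(t) rep(0, length(t))       # intercept function
beta1 <- function(t) pmax(0, 2 * (t - 0.5))  # zero on [0, 0.5], rises to 1 at t = 1
beta2 <- function(t) rep(0, length(t))       # feature with no effect

# mu[i, j] = beta0(t_j) + beta1(t_j) * x[i, 1] + beta2(t_j) * x[i, 2]
B <- rbind(beta1(times), beta2(times))       # 2 x n_timepoints
mu <- matrix(beta0(times), n_subjects, length(times), byrow = TRUE) + x %*% B
```

Here `mu` holds the mean response of every subject at every time point; the random effects and observation errors would be added on top of it.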
The random effects are generated according to a multivariate normal distribution
with mean 0 and covariance matrix \(\Sigma\) such that \(\Sigma_{jj'} = r\sigma^2\rho^{|j-j'|}\), where \(j\) and \(j'\) index timepoints
(\(\rho\) defined by random_effect_ar1_correlation and \(r\) by random_effect_variance_ratio).
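As a minimal sketch of this covariance structure, the following base R code builds the AR(1) matrix and draws one random-effect trajectory through its Cholesky factor. The variable names are illustrative, and the sampling approach is an assumption: the package may generate the random effects differently.

```r
# Values mirroring the documented defaults.
n_timepoints <- 11
sigma2 <- 1    # observation_variance
r <- 1         # random_effect_variance_ratio
rho <- 0.9     # random_effect_ar1_correlation

# Entry (a, b) equals r * sigma2 * rho^|a - b| for time indices a and b.
idx <- seq_len(n_timepoints)
Sigma <- r * sigma2 * rho^abs(outer(idx, idx, "-"))

# chol() returns the upper-triangular factor U with t(U) %*% U == Sigma,
# so t(U) %*% z has covariance Sigma when z is standard normal.
set.seed(1)
U <- chol(Sigma)
theta <- drop(crossprod(U, rnorm(n_timepoints)))
```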
The observation errors are generated according to a multivariate normal distribution
with mean 0 and covariance matrix \(\sigma^2 I\)
(\(\sigma^2\) defined by observation_variance).
The random effects and observation errors are independent.
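Because the two noise sources are independent, their covariances add. The short derivation below (not part of the original documentation, but a direct consequence of the definitions above) gives the covariance of the responses of one subject conditional on its features, writing \(a\) and \(b\) for two time indices:
$$\mathrm{Cov}(y_{ia}, y_{ib}) = r\sigma^2\rho^{|a-b|} + \sigma^2 \mathbf{1}\{a = b\},$$
so that \(\mathrm{Var}(y_{ia}) = (1 + r)\sigma^2\) and, for \(a \neq b\),
$$\mathrm{Corr}(y_{ia}, y_{ib}) = \frac{r}{1 + r}\,\rho^{|a-b|}.$$
With the default values \(r = 1\) and \(\rho = 0.9\), two consecutive responses therefore have correlation \(0.45\).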
The observations are then sampled uniformly at a proportion prop_observed of the timepoints (this is the default "uniform" mechanism; the missingness argument selects the others).
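One plausible reading of uniform sampling is sketched below in base R: every (subject, timepoint) cell is kept independently with probability prop_observed. This is an assumption for illustration only; the package may instead fix the number of observed timepoints per subject, and the other missingness mechanisms are not covered here.

```r
set.seed(1)
n_subjects <- 100
n_timepoints <- 11
prop_observed <- 0.7

# Placeholder complete data (standard normal noise), one row per subject.
y <- matrix(rnorm(n_subjects * n_timepoints), nrow = n_subjects)

# Assumed mechanism: independent Bernoulli(prop_observed) indicator per cell.
observed <- matrix(runif(n_subjects * n_timepoints) < prop_observed,
                   nrow = n_subjects)

# Mask the unobserved cells.
y_observed <- y
y_observed[!observed] <- NA

# Realized proportion of observed cells, close to prop_observed.
realized <- mean(observed)
```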
An imputed version of the data is also provided where the missing values are imputed using
the function refund::fpca.sc.
The data is returned in three formats:
- data
A data frame containing the longitudinal data in long format.
- data_wide
A data frame containing the longitudinal data in wide format.
- data_wide_imputed
A data frame containing the longitudinal data in wide format with imputed missing values.
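To illustrate how the long and wide formats relate, here is a small base R sketch using stats::reshape. The column names `subject_id`, `time`, and `response` are hypothetical; the actual names are reported in the colnames element of the returned list.

```r
# Toy long-format data: subject 2 is not observed at time 0.5.
long <- data.frame(
  subject_id = c(1, 1, 1, 2, 2),
  time       = c(0, 0.5, 1, 0, 1),
  response   = c(0.2, 0.4, 0.9, -0.1, 0.3)
)

# One row per subject, one response column per time point;
# unobserved (subject, time) pairs become NA.
wide <- reshape(long, idvar = "subject_id", timevar = "time",
                direction = "wide")
```

In the wide table, the missing observation of subject 2 appears as an NA, which is the kind of entry an imputation step would fill in.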