Synthesising Data from Marginal Distributions
Data is synthesised by sampling from a multivariate cumulative
distribution (a copula), using the simstudy package.
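To make the idea of copula sampling concrete, the following is a minimal base R sketch of one common approach, a Gaussian copula. It is not the code used by RESIDE or simstudy, and the marginals (a normal AGE and a two-category SEX), the latent correlation `rho` and the category proportion are all hypothetical values chosen for illustration.

```r
# Illustrative Gaussian copula sampler in base R (not RESIDE or simstudy code).
set.seed(42)
n <- 5000

# Hypothetical correlation between the two latent normal variables.
rho <- 0.6
sigma <- matrix(c(1, rho, rho, 1), nrow = 2)

# Step 1: draw correlated standard normals using the Cholesky factor of sigma.
z <- matrix(rnorm(n * 2), ncol = 2) %*% chol(sigma)

# Step 2: map each column to the unit interval with the normal CDF.
u <- pnorm(z)

# Step 3: push the uniforms through the quantile function of each marginal.
age <- qnorm(u[, 1], mean = 50, sd = 10)  # hypothetical continuous marginal
sex <- ifelse(u[, 2] < 0.45, "F", "M")    # hypothetical categorical marginal

copula_sample <- data.frame(AGE = age, SEX = sex)
head(copula_sample)
```

Each column keeps its own marginal distribution, while the dependence between columns comes only from the latent correlation matrix.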
Without Correlations
Data can be synthesised from marginal distributions using the
synthesise_data() function:
library(RESIDE)
marginals <- import_marginal_distributions()
simulated_data <- synthesise_data(marginals)
With Correlations
User-specified correlations can be added to the synthesised data by
supplying a correlation matrix. An empty correlation matrix can be
generated using the export_empty_cor_matrix() function, supplying the
marginals imported using import_marginal_distributions() and a folder
path, respectively:
library(RESIDE)
marginals <- import_marginal_distributions()
export_empty_cor_matrix(marginals, folder_path = tempdir())
By default the file will be named correlation_matrix.csv, but this can be changed with the file_name parameter.
The exported CSV file will be a symmetric table which looks like:
Correlations should then be added to the CSV file, without modifying the column or row names. Correlations should be rank-order correlations. Categorical variables are represented as dummy variables, named using the format variable name, underscore, category name, e.g. SEX_F. Note that the correlation matrix should be symmetric and positive semi-definite.
Once the correlations have been added to the CSV file, they can be imported using the import_cor_matrix() function:
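As an illustration of what a rank-order correlation between a continuous variable and a dummy variable looks like, here is a small base R sketch. The data frame, its values and the variable names AGE and SEX are hypothetical. Which categories appear as dummy columns is determined by the exported CSV file, so its row and column names should be followed rather than the ones used here.

```r
# Hypothetical source data; variable names and values are for illustration only.
set.seed(1)
example_data <- data.frame(
  AGE = round(rnorm(200, mean = 50, sd = 10)),
  SEX = sample(c("F", "M"), size = 200, replace = TRUE)
)

# Build a dummy variable named <variable>_<category>, e.g. SEX_F.
numeric_data <- data.frame(
  AGE = example_data$AGE,
  SEX_F = as.integer(example_data$SEX == "F")
)

# Rank-order (Spearman) correlation matrix of the numeric columns.
rank_cor <- cor(numeric_data, method = "spearman")
round(rank_cor, 3)
```

The off-diagonal values of `rank_cor` are the kind of numbers that would be entered into the matching cells of the CSV file, in both the upper and lower triangle so that the table stays symmetric.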
library(RESIDE)
correlation_matrix <- import_cor_matrix()
By default, the file name for the correlation matrix is that of the
exported file (correlation_matrix.csv), and the file is imported from
the current working directory. This can be changed by specifying a
file_path using the corresponding parameter of the import_cor_matrix()
function. The file path can be either relative or absolute.
The import_cor_matrix() function will produce an error if the matrix
is not symmetric and positive semi-definite, or if the file does not
exist.
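It can be useful to check a matrix for these two properties before importing it. The helper below is a hypothetical base R sketch, not part of RESIDE, and the checks that import_cor_matrix() performs internally may differ. It tests symmetry with isSymmetric() and positive semi-definiteness by confirming that no eigenvalue is meaningfully negative.

```r
# Hypothetical helper, not part of RESIDE: checks the two stated requirements.
is_valid_cor_matrix <- function(m, tol = 1e-8) {
  # Must be a square matrix.
  if (!is.matrix(m) || nrow(m) != ncol(m)) {
    return(FALSE)
  }
  # Must be symmetric (names are ignored for this check).
  if (!isSymmetric(unname(m), tol = tol)) {
    return(FALSE)
  }
  # Positive semi-definite: all eigenvalues are non-negative, within tolerance.
  eigenvalues <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
  all(eigenvalues >= -tol)
}

# A valid 2 x 2 correlation matrix.
valid <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)

# Symmetric but not positive semi-definite: the pairwise values are inconsistent.
invalid <- matrix(c( 1.0, 0.9, -0.9,
                     0.9, 1.0,  0.9,
                    -0.9, 0.9,  1.0), nrow = 3, byrow = TRUE)

is_valid_cor_matrix(valid)    # TRUE
is_valid_cor_matrix(invalid)  # FALSE
```

The second matrix shows why the requirement matters: each pairwise correlation is plausible on its own, but the three together cannot all hold at once.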
With a correlation matrix, data can now be synthesised with the
user-specified correlations using the synthesise_data() function,
passing the correlation matrix imported by the import_cor_matrix()
function:
library(RESIDE)
marginals <- import_marginal_distributions()
export_empty_cor_matrix(marginals)
# Add the correlations to correlation_matrix.csv before importing it
correlation_matrix <- import_cor_matrix()
simulated_data <- synthesise_data(
  marginals,
  correlation_matrix
)
NB: It is not possible to maintain all the marginal distributions exactly when specifying correlations. This is a known limitation and is not likely to change.