Measuring re-identification risk using a synthetic estimator to enable data sharing.
Blog Article
Background
One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators modeling the attack scenario where an adversary selects a record from the microdata sample and attempts to match it with individuals in the population.

Objectives
Develop an accurate risk estimator for the sample-to-population attack.

Methods
An estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated, in terms of estimation error, through a simulation on four different datasets. Two estimators were considered: a Gaussian copula and a d-vine copula. They were compared against three other estimators proposed in the literature.
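
To make the sample-to-population attack concrete, the sketch below computes one common formulation of the risk: for each sample record, the chance of a correct match is taken as 1 divided by the number of population records sharing its quasi-identifier values, and these chances are averaged over the sample. This formulation, the function name `sample_to_population_risk`, and the toy records are assumptions for illustration and are not taken from the article. In the approach described above, the `population` argument would be the synthetic population produced by a copula model rather than real data.

```python
from collections import Counter


def sample_to_population_risk(sample, population, quasi_identifiers):
    """Average probability of correctly matching a sample record to a
    population individual (assumed formulation: mean of 1 / F_k, where
    F_k is the population equivalence class size of the record)."""
    def key(record):
        # An equivalence class is defined by the quasi-identifier values.
        return tuple(record[q] for q in quasi_identifiers)

    population_class_sizes = Counter(key(r) for r in population)

    total = 0.0
    for record in sample:
        class_size = population_class_sizes.get(key(record), 0)
        if class_size == 0:
            # Every sample record should have at least one population match.
            raise ValueError("sample record has no match in the population")
        total += 1.0 / class_size
    return total / len(sample)


# Hypothetical toy data, not from the study.
population = [
    {"age_band": "30-39", "sex": "F"},
    {"age_band": "30-39", "sex": "F"},
    {"age_band": "30-39", "sex": "M"},
    {"age_band": "40-49", "sex": "F"},
    {"age_band": "40-49", "sex": "F"},
    {"age_band": "40-49", "sex": "F"},
    {"age_band": "40-49", "sex": "F"},
]
sample = [population[0], population[2], population[3]]

risk = sample_to_population_risk(sample, population, ["age_band", "sex"])
print(round(risk, 3))
```

The three sampled records fall in population classes of size 2, 1, and 4, so the unique record dominates the average. This is why the quality of the estimated population class sizes matters so much for the risk estimate.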

Results
The average of the two copula estimates consistently had a median error below 0.05 across all sampling fractions and true risk values. This was significantly more accurate than existing methods. A sensitivity analysis of how the estimator's accuracy varies with the accuracy of its input parameters provides further application guidance.
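
As a rough illustration of how such an accuracy figure could be computed, the sketch below averages a Gaussian copula estimate and a d-vine copula estimate for each simulation run and reports the median error against the true risk. It assumes that "error" means the absolute difference between the estimate and the true risk, which the text does not state. The function names and the numbers are hypothetical and are not results from the study.

```python
from statistics import median


def averaged_estimate(gaussian_estimate, dvine_estimate):
    """Combine the two copula-based risk estimates by simple averaging."""
    return (gaussian_estimate + dvine_estimate) / 2.0


def median_absolute_error(estimates, true_risks):
    """Median of |estimate - true risk| over simulation runs
    (assumed error definition)."""
    if len(estimates) != len(true_risks):
        raise ValueError("estimates and true_risks must have equal length")
    return median(abs(e - t) for e, t in zip(estimates, true_risks))


# Hypothetical runs: (Gaussian estimate, d-vine estimate, true risk).
runs = [
    (0.10, 0.14, 0.11),
    (0.30, 0.26, 0.30),
    (0.52, 0.50, 0.48),
]

combined = [averaged_estimate(g, d) for g, d, _ in runs]
truth = [t for _, _, t in runs]
error = median_absolute_error(combined, truth)
print(round(error, 3))
```

In a full evaluation, this calculation would be repeated for each sampling fraction and each true risk level, so that the error can be checked across the whole range instead of at a single point.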
The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset.

Conclusions
The average of two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared.