TY - JOUR
T1 - Revisiting distance-based record linkage for privacy-preserving release of statistical datasets
AU - Herranz, Javier
AU - Nin, Jordi
AU - Rodríguez, Pablo
AU - Tassa, Tamir
N1 - Publisher Copyright: © 2015 Elsevier B.V. All rights reserved.
PY - 2015/11
Y1 - 2015/11
N2 - Statistical Disclosure Control (SDC, for short) studies the problem of privacy-preserving data publishing in cases where the data is expected to be used for statistical analysis. An original dataset T containing sensitive information is transformed into a sanitized version T′ which is released to the public. Both utility and privacy aspects are very important in this setting. For utility, T′ must allow data miners or statisticians to obtain similar results to those which would have been obtained from the original dataset T. For privacy, T′ must significantly reduce the ability of an adversary to infer sensitive information on the data subjects in T. One of the main a-posteriori measures that the SDC community has considered up to now when analyzing the privacy offered by a given protection method is the Distance-Based Record Linkage (DBRL) risk measure. In this work, we argue that the classical DBRL risk measure is insufficient. For this reason, we introduce the novel Global Distance-Based Record Linkage (GDBRL) risk measure. We claim that this new measure must be evaluated alongside the classical DBRL measure in order to better assess the risk in publishing T′ instead of T. After that, we describe how this new measure can be computed by the data owner and discuss the scalability of those computations. We conclude by extensive experimentation where we compare the risk assessments offered by our novel measure as well as by the classical one, using well-known SDC protection methods. Those experiments validate our hypothesis that the GDBRL risk measure issues, in many cases, higher risk assessments than the classical DBRL measure. In other words, relying solely on the classical DBRL measure for risk assessment might be misleading, as the true risk may be in fact higher. Hence, we strongly recommend that the SDC community considers the new GDBRL risk measure as an additional measure when analyzing the privacy offered by SDC protection algorithms.
KW - Distance-based record linkage
KW - Privacy measures
KW - Statistical Disclosure Control
UR - http://www.scopus.com/inward/record.url?scp=84946483381&partnerID=8YFLogxK
U2 - 10.1016/j.datak.2015.07.009
DO - 10.1016/j.datak.2015.07.009
M3 - Article
AN - SCOPUS:84946483381
SN - 0169-023X
VL - 100
SP - 78
EP - 93
JO - Data and Knowledge Engineering
JF - Data and Knowledge Engineering
ER -