TY - JOUR
T1 - Improving accuracy of classification models induced from anonymized datasets
AU - Last, Mark
AU - Tassa, Tamir
AU - Zhmudyak, Alexandra
AU - Shmueli, Erez
PY - 2014/1/20
Y1 - 2014/1/20
AB - The performance of classifiers and other data mining models can be significantly enhanced using the large repositories of digital data collected nowadays by public and private organizations. However, the original records stored in those repositories cannot be released to the data miners as they frequently contain sensitive information. The emerging field of Privacy Preserving Data Publishing (PPDP) deals with this important challenge. In this paper, we present NSVDist (Non-homogeneous generalization with Sensitive Value Distributions) - a new anonymization algorithm that, given minimal anonymity and diversity parameters along with an information loss measure, issues corresponding non-homogeneous anonymizations where the sensitive attribute is published as frequency distributions over the sensitive domain rather than in the usual form of exact sensitive values. In our experiments with eight datasets and four different classification algorithms, we show that classifiers induced from data generalized by NSVDist tend to be more accurate than classifiers induced using state-of-the-art anonymization algorithms.
KW - Classification
KW - Non-homogeneous anonymization
KW - Privacy preserving data mining
KW - Privacy preserving data publishing
KW - k-Anonymity
KW - ℓ-Diversity
UR - http://www.scopus.com/inward/record.url?scp=84887259053&partnerID=8YFLogxK
U2 - 10.1016/j.ins.2013.07.034
DO - 10.1016/j.ins.2013.07.034
M3 - Article
AN - SCOPUS:84887259053
SN - 0020-0255
VL - 256
SP - 138
EP - 161
JO - Information Sciences
JF - Information Sciences
ER -