Analyzing the structure of social networks is of interest in a wide range of disciplines. Unfortunately, sharing social-network datasets is often restrained by privacy considerations. One way to address the privacy concern is to anonymize the data before publishing. Randomly adding or deleting edges from the social graph is one of the anonymization approaches that have been proposed in the literature. Recent studies have quantified the level of anonymity that is obtained by random perturbation by means of a posteriori belief probabilities and, by conducting experiments on small datasets, arrived at the conclusion that random perturbation cannot achieve meaningful levels of anonymity without deteriorating the graph properties. We offer a new information-theoretic perspective on the question of anonymizing a social network by means of random edge additions and deletions. We make an essential distinction between image and preimage anonymity and propose a more accurate quantification, based on entropy, of the anonymity level that is provided by the perturbed network. We explain why the entropy-based quantification, which is global, is more adequate than the previously used local quantification that was based on a posteriori belief probabilities. We also prove that the anonymity level as quantified by means of entropy is always greater than or equal to the one based on a posteriori belief probabilities. In addition, we introduce and explore the method of random sparsification, which randomly removes edges, without adding new ones. Extensive experimentation on several very large datasets shows that randomization techniques for identity obfuscation are back in the game, as they may achieve meaningful levels of anonymity while still preserving properties of the original graph. As the methods we study add and remove edges, it is natural to ask whether an adversary might use the disclosed perturbed graph structure to reconstruct, even partially, the original graph. 
We thus study the resilience of obfuscation by random sparsification to adversarial attacks that are based on link prediction. Given a general link prediction method with a predefined level of prediction accuracy, we show how to quantify the level of anonymity that the obfuscation guarantees. We show empirically that even for very accurate link prediction methods, the guaranteed level of anonymity remains very close to the one before the attack. Finally, we show how the randomization method may be applied in a distributed setting, where the network data is distributed among several non-trusting sites, and explain why randomization is far more suitable for such settings than other existing approaches.
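The two central ideas of the abstract, random sparsification and the comparison between entropy-based and belief-based anonymity, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the removal probability `p`, and the reading of the anonymity level as `1 / max(p)` for a posteriori beliefs and `2 ** H(p)` for entropy are assumptions made for illustration; the abstract itself does not give these definitions.

```python
import math
import random


def sparsify(edges, p, seed=None):
    """Random sparsification: drop each edge independently with
    probability p. No new edges are ever added."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() >= p]


def anonymity_from_belief(probs):
    """Assumed local measure: the inverse of the largest a posteriori
    belief probability in the distribution."""
    return 1.0 / max(probs)


def anonymity_from_entropy(probs):
    """Assumed global measure: 2 raised to the Shannon entropy (in bits)
    of the whole distribution."""
    h = -sum(q * math.log2(q) for q in probs if q > 0)
    return 2.0 ** h


if __name__ == "__main__":
    # Hypothetical toy graph given as an edge list.
    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    print(sparsify(edges, p=0.4, seed=7))

    # Hypothetical adversary beliefs over four candidate vertices.
    beliefs = [0.5, 0.25, 0.125, 0.125]
    print(anonymity_from_belief(beliefs))
    print(anonymity_from_entropy(beliefs))
```

For any probability distribution, Shannon entropy is at least the min-entropy `-log2(max(p))`, so under these assumed definitions the entropy-based level is never smaller than the belief-based one. This matches the inequality stated in the abstract.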
Bibliographical note
Funding Information:
The first two authors were partially supported by the Spanish Centre for the Development of Industrial Technology under the CENIT program, Project CEN-20101037, “Social Media” (www.cenitsocialmedia.es). Part of this research was conducted when the third author was a guest of Yahoo! Research, Barcelona.
- Data publishing
- Information theory
- Social network