Data Engineering. Data anonymity with k-anonymity
Introduction to data anonymization with k-anonymity
In this post, we take a look at the k-anonymity algorithm. We will also look at how to use ARX for anonymizing a dataset.
Assume we have a dataset $D$ and want to modify it so that sensitive information about the individuals in $D$ is not leaked. Let $\hat{D}$ denote the modified dataset. The modification cannot be done carelessly: if $D$ is distorted too much, $\hat{D}$ has little value for analytics. Anonymization is therefore about striking a balance between privacy and the utility of the resulting dataset.
With $k$-anonymity, a dataset $D$ is transformed so that it is difficult for an intruder to determine the identity of the individuals in $D$ [1]. A $k$-anonymized dataset has the property that each record is indistinguishable from at least $k-1$ other records on the potentially identifying variables (the quasi-identifiers).
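As a minimal sketch of this property (plain Scala, not part of ARX; the `Record` type, the function names and the sample rows are made up for illustration), the check below groups records by their quasi-identifier values and verifies that every group has at least $k$ members.

```scala
object KAnonymityCheck {
  // Hypothetical representation: a record maps column names to values.
  type Record = Map[String, String]

  /** Sizes of the groups of records that share the same quasi-identifier values. */
  def classSizes(records: Seq[Record], quasiIds: Seq[String]): Seq[Int] =
    records.groupBy(r => quasiIds.map(q => r(q))).values.map(_.size).toSeq

  /** True if every record shares its quasi-identifier values with at least k-1 others. */
  def isKAnonymous(records: Seq[Record], quasiIds: Seq[String], k: Int): Boolean =
    classSizes(records, quasiIds).forall(_ >= k)

  // Made-up sample data whose quasi-identifiers are already generalized.
  val sample: Seq[Record] = Seq(
    Map("age" -> "<50", "zipcode" -> "819**", "diagnosis" -> "flu"),
    Map("age" -> "<50", "zipcode" -> "819**", "diagnosis" -> "asthma"),
    Map("age" -> ">=50", "zipcode" -> "819**", "diagnosis" -> "flu"),
    Map("age" -> ">=50", "zipcode" -> "819**", "diagnosis" -> "diabetes")
  )

  def main(args: Array[String]): Unit = {
    val quasiIds = Seq("age", "zipcode")
    println(isKAnonymous(sample, quasiIds, 2)) // true: two groups of size 2
    println(isKAnonymous(sample, quasiIds, 3)) // false: no group has 3 members
  }
}
```

Note that the `diagnosis` column is not part of the grouping key: only the quasi-identifiers decide which records count as indistinguishable.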
Common implementations of the algorithm use transformation techniques such as [1]:
- Generalization
- Global recoding
- Suppression
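The three techniques can be sketched on a single zip code column as follows. This is an illustrative toy in plain Scala, not how ARX implements them; the helper names are hypothetical, and suppression is simplified to dropping values that fall into groups smaller than $k$.

```scala
object Transformations {
  /** Generalization: replace the last `level` characters of a zip code with '*'. */
  def generalizeZip(zip: String, level: Int): String = {
    val masked = math.min(math.max(level, 0), zip.length)
    zip.dropRight(masked) + "*" * masked
  }

  /** Global recoding: apply the same generalization level to every value of the column. */
  def recodeColumn(zips: Seq[String], level: Int): Seq[String] =
    zips.map(generalizeZip(_, level))

  /** Suppression (simplified): drop values whose group has fewer than k members. */
  def suppressSmallGroups(values: Seq[String], k: Int): Seq[String] = {
    val sizes = values.groupBy(identity).map { case (value, group) => value -> group.size }
    values.filter(value => sizes(value) >= k)
  }

  def main(args: Array[String]): Unit = {
    val zips = Seq("81667", "81675", "81925")
    val recoded = recodeColumn(zips, 2)
    println(recoded)                         // List(816**, 816**, 819**)
    println(suppressSmallGroups(recoded, 2)) // List(816**, 816**)
  }
}
```

Generalization trades precision for larger groups, while suppression removes the outliers that would otherwise force even coarser generalization on every record.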
Any record in a $k$-anonymized dataset has a probability of at most $1/k$ of being re-identified [1].
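To make the $1/k$ bound concrete, the sketch below assumes a simple model: an intruder who matches a target on the quasi-identifiers can do no better than guess uniformly among the records in the matching group. Under that assumption, the risk of a record is one over the size of its group. The helper names and the group labels are hypothetical.

```scala
object ReidentificationRisk {
  /** Risk of each record under the uniform-guess assumption: 1 / size of its group. */
  def recordRisks(groupKeys: Seq[String]): Seq[Double] = {
    val sizes = groupKeys.groupBy(identity).map { case (key, group) => key -> group.size }
    groupKeys.map(key => 1.0 / sizes(key))
  }

  /** Highest risk over all records; it is determined by the smallest group. */
  def maxRisk(groupKeys: Seq[String]): Double =
    if (groupKeys.isEmpty) 0.0 else recordRisks(groupKeys).max

  def main(args: Array[String]): Unit = {
    // One label per record, naming the group the record falls into.
    val keys = Seq("<50|819**", "<50|819**", "<50|819**", ">=50|819**", ">=50|819**")
    println(maxRisk(keys)) // 0.5, since the smallest group has 2 records
  }
}
```

In this example the data is 2-anonymous, so the maximum risk equals $1/2$; enlarging the smallest group is the only way to lower it.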
The algorithm was developed to protect against two types of attacks [1]:
- Re-identification of a specific individual
- Re-identification of an arbitrary individual
In the first type of attack, the intruder knows that a particular individual is in $\hat{D}$ and wants to find the record that belongs to that individual. In the second type, the intruder does not target a specific individual; the goal is to show that some record, no matter whose, can be re-identified [1].
In most cases, the algorithm prevents identity disclosure, i.e., a record in the $k$-anonymized dataset $\hat{D}$ cannot be linked to the corresponding record in the original dataset $D$. However, it may fail to protect against attribute disclosure [2]. Approaches such as $l$-diversity and $t$-closeness have been proposed to overcome this limitation of $k$-anonymity.
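One way attribute disclosure can happen is when every record in a group carries the same sensitive value: the intruder then learns that value without singling out a record. The sketch below illustrates this with a distinct $l$-diversity check, which requires each group to contain at least $l$ different sensitive values. The `Row` type, the function names and the data are made up for illustration.

```scala
object DistinctLDiversityCheck {
  // Hypothetical row: a group label built from the quasi-identifiers plus a sensitive value.
  final case class Row(group: String, sensitive: String)

  /** True if every group has at least k rows. */
  def isKAnonymous(rows: Seq[Row], k: Int): Boolean =
    rows.groupBy(_.group).values.forall(_.size >= k)

  /** True if every group contains at least l distinct sensitive values. */
  def isDistinctLDiverse(rows: Seq[Row], l: Int): Boolean =
    rows.groupBy(_.group).values.forall(_.map(_.sensitive).distinct.size >= l)

  // Made-up data: the first group is homogeneous in its sensitive value.
  val rows: Seq[Row] = Seq(
    Row("<50|819**", "flu"), Row("<50|819**", "flu"), Row("<50|819**", "flu"),
    Row(">=50|819**", "flu"), Row(">=50|819**", "asthma"), Row(">=50|819**", "diabetes")
  )

  def main(args: Array[String]): Unit = {
    println(isKAnonymous(rows, 3))       // true: both groups have 3 rows
    println(isDistinctLDiverse(rows, 2)) // false: the first group only contains "flu"
  }
}
```

The dataset is 3-anonymous, yet anyone known to be under 50 in this zip code area is revealed to have the flu.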
The following Scala program uses ARX to enforce $k$-anonymity with $k=3$ on a small in-memory dataset.
package example_2

import org.deidentifier.arx.{ARXAnonymizer, ARXConfiguration, Data}
import org.deidentifier.arx.AttributeType.Hierarchy
import org.deidentifier.arx.criteria.KAnonymity
// Helper of this project that prints the result and the data handles
import postprocessor.ResultPrinter.{printResult, printHandle}

/**
 * Example 2: anonymize a small in-memory dataset with k-anonymity using ARX
 */
object KAnonymityARX {

  def createData: Data = {
    // Define the data: the first row holds the column names
    val data = Data.create
    data.add("age", "gender", "zipcode")
    data.add("34", "male", "81667")
    data.add("45", "female", "81675")
    data.add("66", "male", "81925")
    data.add("70", "female", "81931")
    data.add("34", "female", "81931")
    data.add("70", "male", "81931")
    data.add("45", "male", "81931")
    data
  }

  def main(args: Array[String]): Unit = {
    println("Running example 2...")
    val data = createData

    // Check the dimensions of the input
    val nCols = data.getHandle.getNumColumns
    println(s"Number of columns $nCols")
    val nRows = data.getHandle.getNumRows
    println(s"Number of rows $nRows")

    // Define the generalization hierarchies: original value first, then coarser levels
    val age = Hierarchy.create
    age.add("34", "<50", "*")
    age.add("45", "<50", "*")
    age.add("66", ">=50", "*")
    age.add("70", ">=50", "*")

    val gender = Hierarchy.create
    gender.add("male", "*")
    gender.add("female", "*")

    // Only excerpts for readability
    val zipcode = Hierarchy.create
    zipcode.add("81667", "8166*", "816**", "81***", "8****", "*****")
    zipcode.add("81675", "8167*", "816**", "81***", "8****", "*****")
    zipcode.add("81925", "8192*", "819**", "81***", "8****", "*****")
    zipcode.add("81931", "8193*", "819**", "81***", "8****", "*****")

    // Attaching a hierarchy marks the attribute as a quasi-identifier
    data.getDefinition.setAttributeType("age", age)
    data.getDefinition.setAttributeType("gender", gender)
    data.getDefinition.setAttributeType("zipcode", zipcode)

    // No attribute is marked as sensitive in this example, so this reports 0
    println("Number of sensitive variables=" + data.getHandle.getDefinition.getSensitiveAttributes.size)

    // Create an instance of the anonymizer and require 3-anonymity without suppressing records
    val anonymizer = new ARXAnonymizer
    val config = ARXConfiguration.create
    config.addPrivacyModel(new KAnonymity(3))
    config.setSuppressionLimit(0d)
    val result = anonymizer.anonymize(data, config)

    // Print info
    printResult(result, data)

    // Process results
    println(" - Transformed data:")
    printHandle(handle = result.getOutput(false))
    println("Done!")
  }
}
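To build intuition for what the anonymizer has to find, here is a brute-force sketch in plain Scala that does not depend on ARX. It reuses the dataset and the hierarchies from the example, tries every combination of generalization levels, and keeps the combinations that yield a 3-anonymous table. This is only an illustration under simplifying assumptions: ARX uses its own search strategy and utility measures, and all names below are hypothetical.

```scala
object GeneralizationSearch {
  // A hierarchy maps an original value to its generalizations, level 0 first.
  type Hierarchy = Map[String, Vector[String]]

  val age: Hierarchy = Map(
    "34" -> Vector("34", "<50", "*"),
    "45" -> Vector("45", "<50", "*"),
    "66" -> Vector("66", ">=50", "*"),
    "70" -> Vector("70", ">=50", "*")
  )
  val gender: Hierarchy = Map(
    "male" -> Vector("male", "*"),
    "female" -> Vector("female", "*")
  )
  // Level n masks the last n digits, e.g. 81667 -> 8166* -> 816** -> ... -> *****
  val zipcode: Hierarchy = Seq("81667", "81675", "81925", "81931").map { z =>
    z -> Vector.tabulate(6)(level => z.take(5 - level) + "*" * level)
  }.toMap

  val hierarchies: Vector[Hierarchy] = Vector(age, gender, zipcode)

  val records: Vector[Vector[String]] = Vector(
    Vector("34", "male", "81667"),
    Vector("45", "female", "81675"),
    Vector("66", "male", "81925"),
    Vector("70", "female", "81931"),
    Vector("34", "female", "81931"),
    Vector("70", "male", "81931"),
    Vector("45", "male", "81931")
  )

  /** Apply one generalization level per column to every record. */
  def generalize(levels: Vector[Int]): Vector[Vector[String]] =
    records.map(_.zipWithIndex.map { case (value, col) => hierarchies(col)(value)(levels(col)) })

  /** True if every generalized record occurs at least k times. */
  def isKAnonymous(rows: Vector[Vector[String]], k: Int): Boolean =
    rows.groupBy(identity).values.forall(_.size >= k)

  /** All level combinations (age, gender, zipcode) that give a k-anonymous table. */
  def satisfying(k: Int): Vector[Vector[Int]] =
    for {
      a <- Vector.range(0, 3)
      g <- Vector.range(0, 2)
      z <- Vector.range(0, 6)
      levels = Vector(a, g, z)
      if isKAnonymous(generalize(levels), k)
    } yield levels

  def main(args: Array[String]): Unit = {
    // Prefer the combinations with the least total generalization.
    val candidates = satisfying(3).sortBy(_.sum)
    candidates.take(3).foreach(levels => println(s"levels=$levels -> ${generalize(levels).distinct}"))
  }
}
```

With only seven records and no suppression allowed, every 3-anonymous solution needs the zip code generalized at least to `81***`, because the two records in the `816**` area can never form a group of three on their own.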
References
1. Khaled El Emam, Fida Kamal Dankar, Protecting privacy using k-anonymity, Journal of the American Medical Informatics Association, vol. 15, pp. 627-637, 2008.
2. Ismini Psychoula et al., A deep learning approach for privacy preservation in assisted living.