Overview

In this post, we take a look at the $k$-anonymity algorithm and at how to use ARX to anonymize a dataset.

Data anonymity with k-anonymity

Let's assume we have at our disposal a dataset $D$. We would like to modify $D$ so that sensitive information about the individuals in $D$ is not leaked. Let's denote the modified dataset by $\hat{D}$. The modification cannot be done recklessly, because an over-modified $\hat{D}$ has little value for analytics. Anonymization is thus about striking a balance between the privacy and the utility of the resulting anonymized dataset.

With $k$-anonymity, a dataset $D$ is transformed so that it is difficult for an intruder to determine the identity of the individuals in $D$ [1]. A dataset anonymized with $k$-anonymity has the property that each record is indistinguishable from at least $k-1$ other records on the potentially identifying variables (the quasi-identifiers).
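
The property can be checked directly by grouping records on their potentially identifying variables and looking at the smallest group. The following is a minimal sketch in plain Scala, not part of ARX; the names `KAnonymityCheck`, `equivalenceClasses` and `isKAnonymous`, as well as the `disease` column, are made up for illustration.

```scala
object KAnonymityCheck {

  // A record maps attribute names to (possibly generalized) values.
  type Record = Map[String, String]

  // Records that agree on every quasi-identifier fall into the same class.
  def equivalenceClasses(records: Seq[Record],
                         quasiIdentifiers: Seq[String]): Map[Seq[String], Seq[Record]] =
    records.groupBy(r => quasiIdentifiers.map(q => r(q)))

  // k-anonymity holds if every equivalence class has at least k records.
  def isKAnonymous(records: Seq[Record],
                   quasiIdentifiers: Seq[String],
                   k: Int): Boolean =
    equivalenceClasses(records, quasiIdentifiers).values.forall(_.size >= k)

  def main(args: Array[String]): Unit = {
    // Hypothetical, already generalized records
    val records = Seq(
      Map("age" -> "<50", "zipcode" -> "816**", "disease" -> "flu"),
      Map("age" -> "<50", "zipcode" -> "816**", "disease" -> "asthma"),
      Map("age" -> ">=50", "zipcode" -> "819**", "disease" -> "flu"),
      Map("age" -> ">=50", "zipcode" -> "819**", "disease" -> "diabetes")
    )
    val qi = Seq("age", "zipcode")
    println(isKAnonymous(records, qi, 2)) // true: both classes hold 2 records
    println(isKAnonymous(records, qi, 3)) // false: no class holds 3 records
  }
}
```

Note that only the quasi-identifiers take part in the grouping; the `disease` column is ignored by the check.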

Common implementations of the algorithm use various transformation techniques such as [1]:

Generalization
Global recoding
Suppression
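
To make these techniques concrete, here is a small sketch in plain Scala. It assumes that generalization means replacing a value with a coarser one, that global recoding means applying the same generalization to every value of a column, and that suppression means blanking a record out entirely. All function names are hypothetical.

```scala
object TransformationSketch {

  // Generalization: mask the last `level` characters of a zip code with '*'.
  def generalizeZip(zip: String, level: Int): String = {
    val masked = math.min(math.max(level, 0), zip.length)
    zip.dropRight(masked) + ("*" * masked)
  }

  // Generalization: replace an exact age with a coarse band.
  def generalizeAge(age: Int): String = if (age < 50) "<50" else ">=50"

  // Global recoding: the same generalization level is applied to
  // every value of the column, not chosen per record.
  def recodeZipColumn(zips: Seq[String], level: Int): Seq[String] =
    zips.map(zip => generalizeZip(zip, level))

  // Suppression: every value of the record is replaced with '*'.
  def suppress(record: Seq[String]): Seq[String] = record.map(_ => "*")

  def main(args: Array[String]): Unit = {
    println(generalizeZip("81667", 2))                 // 816**
    println(generalizeAge(34))                         // <50
    println(recodeZipColumn(Seq("81667", "81931"), 3)) // List(81***, 81***)
    println(suppress(Seq("34", "male", "81667")))      // List(*, *, *)
  }
}
```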

Any record in a $k$-anonymized dataset $\hat{D}$ has a probability of at most $1/k$ of being re-identified [1].
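
One way to read this bound: if an intruder can only narrow a target down to the records sharing its quasi-identifier values, a uniform guess within that group succeeds with probability one over the group size. The sketch below computes that per-record value under this assumption; the names `recordRisks` and `maxRisk` are illustrative.

```scala
object ReidentificationRisk {

  // Each row holds only the quasi-identifier values of a record.
  // The risk of a record is taken as 1 / (size of its equivalence class).
  def recordRisks(rows: Seq[Seq[String]]): Seq[Double] = {
    val classSizes: Map[Seq[String], Int] =
      rows.groupBy(identity).map { case (key, members) => key -> members.size }
    rows.map(row => 1.0 / classSizes(row))
  }

  // Largest risk over all records (0.0 for an empty dataset).
  def maxRisk(rows: Seq[Seq[String]]): Double =
    recordRisks(rows).foldLeft(0.0)((a, b) => math.max(a, b))

  def main(args: Array[String]): Unit = {
    // Hypothetical 3-anonymous table: one class of 3 and one class of 4
    val rows = Seq.fill(3)(Seq("<50", "81***")) ++ Seq.fill(4)(Seq(">=50", "81***"))
    val k = 3
    println(maxRisk(rows) <= 1.0 / k) // true: the smallest class has 3 records
  }
}
```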

The algorithm was developed to protect against two types of attacks [1]:

Re-identification of a specific individual
Re-identification of an arbitrary individual

In the first type of attack, the intruder knows that a particular individual exists in $\hat{D}$ and wants to discover the record that belongs to that individual. In the second type of attack, the intruder is not interested in a specific individual but only in showing that some record can be re-identified [1].

In most cases, the algorithm prevents identity disclosure, i.e. a record in the $k$-anonymized dataset $\hat{D}$ cannot be linked to the corresponding record in the non-anonymized dataset $D$. However, it may fail to protect against attribute disclosure [2]. Approaches such as $l$-diversity and $t$-closeness have been proposed to overcome this limitation of $k$-anonymity.
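
A small sketch may help to see why attribute disclosure can survive $k$-anonymity. If all records in a group share the same sensitive value, an intruder learns that value without identifying any single record. The code below uses a hypothetical `Row` type and checks the simplest (distinct) form of $l$-diversity, assuming it means at least $l$ different sensitive values per group.

```scala
object AttributeDisclosure {

  // Hypothetical row: quasi-identifier values plus one sensitive value.
  final case class Row(quasiIdentifiers: Seq[String], sensitive: String)

  // Number of distinct sensitive values inside each equivalence class.
  def distinctSensitiveValues(rows: Seq[Row]): Map[Seq[String], Int] =
    rows.groupBy(_.quasiIdentifiers).map { case (qi, group) =>
      qi -> group.map(_.sensitive).distinct.size
    }

  // Distinct l-diversity: every class holds at least l sensitive values.
  def isDistinctLDiverse(rows: Seq[Row], l: Int): Boolean =
    distinctSensitiveValues(rows).values.forall(_ >= l)

  def main(args: Array[String]): Unit = {
    // One class of three records, hence 3-anonymous on the quasi-identifiers,
    // but every record carries the same sensitive value.
    val rows = Seq(
      Row(Seq("<50", "81***"), "flu"),
      Row(Seq("<50", "81***"), "flu"),
      Row(Seq("<50", "81***"), "flu")
    )
    println(isDistinctLDiverse(rows, 2)) // false: the sensitive value is disclosed
  }
}
```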

Anonymizing a dataset with ARX

The Scala program below uses ARX to apply $k$-anonymity with $k=3$ to a small in-memory dataset.

package example_2

import org.deidentifier.arx.{ARXAnonymizer, ARXConfiguration, Data}
import org.deidentifier.arx.AttributeType.Hierarchy
import org.deidentifier.arx.criteria.KAnonymity

// Printing helpers from this project (not part of ARX)
import postprocessor.ResultPrinter.{printResult, printHandle}



/**
 * Example 2: Anonymize a small dataset with k-anonymity using ARX
 */
object KAnonymityARX
{

  def createData: Data = {

    // Define data
    val data = Data.create
    data.add("age", "gender", "zipcode")
    data.add("34", "male", "81667")
    data.add("45", "female", "81675")
    data.add("66", "male", "81925")
    data.add("70", "female", "81931")
    data.add("34", "female", "81931")
    data.add("70", "male", "81931")
    data.add("45", "male", "81931")
    data
  }

  def main(args: Array[String]): Unit = {

    System.out.println("Running example 2...")

    val data = createData

    // check the columns
    val nCols = data.getHandle.getNumColumns
    println(s"Number of columns ${nCols}")

    val nRows = data.getHandle.getNumRows
    println(s"Number of rows ${nRows}")

    // define hierarchies
    val age = Hierarchy.create
    age.add("34", "<50", "*")
    age.add("45", "<50", "*")
    age.add("66", ">=50", "*")
    age.add("70", ">=50", "*")

    val gender = Hierarchy.create
    gender.add("male", "*")
    gender.add("female", "*")

    // Only excerpts for readability
    val zipcode = Hierarchy.create
    zipcode.add("81667", "8166*", "816**", "81***", "8****", "*****")
    zipcode.add("81675", "8167*", "816**", "81***", "8****", "*****")
    zipcode.add("81925", "8192*", "819**", "81***", "8****", "*****")
    zipcode.add("81931", "8193*", "819**", "81***", "8****", "*****")

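    // Assigning a hierarchy marks the attribute as quasi-identifying in ARX
    data.getDefinition.setAttributeType("age", age)
    data.getDefinition.setAttributeType("gender", gender)
    data.getDefinition.setAttributeType("zipcode", zipcode)

    // No attribute was declared sensitive, so this count is expected to be 0
    System.out.println("Number of sensitive variables=" + data.getHandle.getDefinition.getSensitiveAttributes.size)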

    // Create the anonymizer and require 3-anonymity
    val anonymizer = new ARXAnonymizer
    val config = ARXConfiguration.create
    config.addPrivacyModel(new KAnonymity(3))
    // A suppression limit of 0 means that no record may be suppressed
    config.setSuppressionLimit(0d)
    val result = anonymizer.anonymize(data, config)

    // Print info
    printResult(result, data)

    // Process results
    System.out.println(" - Transformed data:")
    printHandle(handle = result.getOutput(false))
    System.out.println("Done!")

  }
}
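
To get a feeling for what a solution looks like, the sketch below applies the same hierarchies by hand in plain Scala, with one generalization level per column, and checks whether the result is 3-anonymous. This is only an illustration and not how ARX searches for a transformation; in particular, the level combination `(1, 1, 3)` is simply one that satisfies the constraint, not necessarily the one ARX returns. The object and function names are made up.

```scala
object GeneralizationSketch {

  // Hierarchies copied from the example: index 0 is the original value,
  // higher indices are coarser generalizations.
  val age: Map[String, Vector[String]] = Map(
    "34" -> Vector("34", "<50", "*"),
    "45" -> Vector("45", "<50", "*"),
    "66" -> Vector("66", ">=50", "*"),
    "70" -> Vector("70", ">=50", "*")
  )
  val gender: Map[String, Vector[String]] = Map(
    "male" -> Vector("male", "*"),
    "female" -> Vector("female", "*")
  )
  val zipcode: Map[String, Vector[String]] = Map(
    "81667" -> Vector("81667", "8166*", "816**", "81***", "8****", "*****"),
    "81675" -> Vector("81675", "8167*", "816**", "81***", "8****", "*****"),
    "81925" -> Vector("81925", "8192*", "819**", "81***", "8****", "*****"),
    "81931" -> Vector("81931", "8193*", "819**", "81***", "8****", "*****")
  )
  val hierarchies: Vector[Map[String, Vector[String]]] = Vector(age, gender, zipcode)

  // Same rows as in createData (columns: age, gender, zipcode)
  val rows: Seq[Vector[String]] = Seq(
    Vector("34", "male", "81667"),
    Vector("45", "female", "81675"),
    Vector("66", "male", "81925"),
    Vector("70", "female", "81931"),
    Vector("34", "female", "81931"),
    Vector("70", "male", "81931"),
    Vector("45", "male", "81931")
  )

  // Apply one generalization level per column to every row.
  def generalize(levels: Vector[Int]): Seq[Vector[String]] =
    rows.map(row => row.indices.toVector.map(i => hierarchies(i)(row(i))(levels(i))))

  // Every distinct row must occur at least k times.
  def isKAnonymous(table: Seq[Vector[String]], k: Int): Boolean =
    table.groupBy(identity).values.forall(_.size >= k)

  def main(args: Array[String]): Unit = {
    println(isKAnonymous(generalize(Vector(0, 0, 0)), 3)) // false: all rows are unique
    println(isKAnonymous(generalize(Vector(1, 1, 3)), 3)) // true: classes of size 4 and 3
    generalize(Vector(1, 1, 3)).foreach(row => println(row.mkString(", ")))
  }
}
```

With levels `(1, 1, 3)` the four records under 50 and the three records of 50 and above each collapse into one group, so no record needs to be suppressed.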

References

[1] Khaled El Emam, Fida Kamal Dankar, Protecting privacy using k-anonymity, Journal of the American Medical Informatics Association, vol. 15, pp. 627-637, 2008.
[2] Ismini Psychoula et al., A deep learning approach for privacy preservation in assisted living.