Tutorial

In this tutorial we process a data set for private record linkage with clkhash using the Python API. Note that you can also use the command line tool.

The Python package recordlinkage has a tutorial on linking data sets in the clear; we will try to duplicate that in a privacy-preserving setting.

First install clkhash, recordlinkage and a few data science tools (pandas and numpy).

In [ ]:

!pip install -U clkhash recordlinkage numpy pandas

In [24]:

import io
import numpy as np
import pandas as pd

In [17]:

import clkhash
import recordlinkage
from recordlinkage.datasets import load_febrl4

Data Exploration

First we have a look at the dataset.

In [4]:

dfA, dfB = load_febrl4()

dfA.head()

Out[4]:

	given_name	surname	street_number	address_1	address_2	suburb	postcode	state	date_of_birth	soc_sec_id
rec_id
rec-1070-org	michaela	neumann	8	stanley street	miami	winston hills	4223	nsw	19151111	5304218
rec-1016-org	courtney	painter	12	pinkerton circuit	bega flats	richlands	4560	vic	19161214	4066625
rec-4405-org	charles	green	38	salkauskas crescent	kela	dapto	4566	nsw	19480930	4365168
rec-1288-org	vanessa	parr	905	macquoid place	broadbridge manor	south grafton	2135	sa	19951119	9239102
rec-3585-org	mikayla	malloney	37	randwick road	avalind	hoppers crossing	4552	vic	19860208	7207688

For this linkage we will not use the social security ID column (soc_sec_id).

In [18]:

dfA.columns

Out[18]:

Index(['given_name', 'surname', 'street_number', 'address_1', 'address_2',
       'suburb', 'postcode', 'state', 'date_of_birth', 'soc_sec_id'],
      dtype='object')

In [41]:

a_csv = io.StringIO()
dfA.to_csv(a_csv)
a_csv.seek(0)

Out[41]:

0
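
The call to seek(0) matters because writing leaves the buffer's cursor at the end, so a reader starting there would see no rows. The following sketch shows the effect with only the standard library; the three-column sample rows are a hypothetical stand-in for the real data frame.

```python
import csv
import io

# Hypothetical two-record sample standing in for the real data frame.
rows = [
    ["rec_id", "given_name", "surname"],
    ["rec-1070-org", "michaela", "neumann"],
    ["rec-1016-org", "courtney", "painter"],
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

# After writing, the cursor sits at the end, so reading returns nothing.
at_end = buffer.read()

# Rewind so the next reader sees the header and every row.
buffer.seek(0)
parsed = list(csv.reader(buffer))

print(repr(at_end))
print(parsed[0])
```

Forgetting the rewind is an easy mistake to make, and the symptom is usually an empty result rather than an error.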

Linkage Schema Definition

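A basic schema definition tells clkhash how to treat each column. It contains one entry per CSV column, in column order (the first entry covers the rec_id index written by to_csv). The identifiers are found in clkhash/identifier_types.py. The INDEX columns will not become part of the CLK.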

Note: The schema specification is under heavy renovation for the next version.

In [42]:

column_metadata = [
            {"identifier": 'INDEX'},                 # rec_id
            {"identifier": 'NAME First Name'},       # given_name
            {"identifier": 'NAME Surname'},          # surname
            {"identifier": 'ADDRESS House Number'},  # street_number
            {"identifier": 'ADDRESS Place Name'},    # address_1
            {"identifier": 'ADDRESS Place Name'},    # address_2
            {"identifier": 'ADDRESS Place Name'},    # suburb
            {"identifier": 'ADDRESS POSTCODE'},      # postcode
            {"identifier": 'ADDRESS Place Name'},    # state
            {"identifier": 'DOB YYYY/MM/DD'},        # date_of_birth
            {"identifier": 'INDEX'}                  # soc_sec_id (not used)
        ]

In [43]:

schema = clkhash.schema.get_schema_types(column_metadata)
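
Because the schema is positional, it is worth checking that the metadata lines up with the CSV header before hashing. The following is a minimal sketch using only the standard library; pair_columns is a hypothetical helper (not part of clkhash), and the four-column sample is made up for illustration.

```python
import csv
import io


def pair_columns(csv_file, column_metadata):
    """Pair each CSV header name with its schema identifier, in order."""
    header = next(csv.reader(csv_file))
    csv_file.seek(0)  # leave the buffer rewound for the next reader
    if len(header) != len(column_metadata):
        raise ValueError(
            f"{len(header)} CSV columns but {len(column_metadata)} schema entries"
        )
    return [(name, meta["identifier"]) for name, meta in zip(header, column_metadata)]


# Hypothetical miniature data set and matching metadata.
sample_csv = io.StringIO("rec_id,given_name,surname,soc_sec_id\n")
sample_metadata = [
    {"identifier": "INDEX"},
    {"identifier": "NAME First Name"},
    {"identifier": "NAME Surname"},
    {"identifier": "INDEX"},
]

pairs = pair_columns(sample_csv, sample_metadata)
for name, identifier in pairs:
    print(f"{name:12} -> {identifier}")
```

Printing the pairs side by side makes a swapped or missing entry easy to spot by eye.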

Hash the data

We can now hash the PII in the CSV file using the schema defined above. The hashing function requires two secret keys, and both parties must use the same two keys when hashing their data.

Knowledge of these keys is sufficient to reconstruct the PII from a CLK! Do not share these keys with anyone except the other participating party.
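
The placeholder keys used in this tutorial are fine for a demonstration, but real keys should be hard to guess. One possible way to produce them is Python's secrets module, sketched below. It assumes that clkhash accepts arbitrary strings as keys, as the tutorial's own example suggests; generate_key_pair is a hypothetical helper name.

```python
import secrets


def generate_key_pair(num_bytes=32):
    """Return two independent random keys as hex strings."""
    return secrets.token_hex(num_bytes), secrets.token_hex(num_bytes)


# Generate once, then share with the other party over a secure channel.
keys = generate_key_pair()
print(len(keys), len(keys[0]))
```

Only one party should generate the pair. If each party generated its own keys, the resulting CLKs would not be comparable.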

In [44]:

hashed_data_a = clkhash.clk.generate_clk_from_csv(a_csv, ('key1', 'key2'), schema)

generating CLKs: 100%|██████████| 5.00K/5.00K [00:02<00:00, 1.46Kclk/s, mean=916, std=27.1]

Inspect the output

clkhash has hashed the PII, creating a Cryptographic Longterm Key (CLK) for each entity. The progress output of generate_clk_from_csv shows that the mean popcount (the number of bits set in each CLK) is quite high, 916 out of 1024, which can affect accuracy.
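
To see where the popcount comes from, it helps to picture a CLK as a Bloom filter: each field is split into bigrams, and every bigram sets a number of bit positions chosen by keyed hashes. The toy encoder below follows that general idea with HMAC-SHA1 and HMAC-MD5 double hashing. It is a simplified sketch and not clkhash's actual implementation; the function names, the padding rule and the value of 30 positions per bigram are all assumptions made for illustration.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a string into overlapping 2-character tokens, padded with spaces."""
    padded = f" {value} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def toy_clk(fields, keys, num_bits=1024, hashes_per_token=30):
    """Encode field values into a list of 0/1 bits (illustrative only)."""
    key_a, key_b = (key.encode("utf-8") for key in keys)
    bits = [0] * num_bits
    for value in fields:
        for token in bigrams(value):
            message = token.encode("utf-8")
            h1 = int.from_bytes(hmac.new(key_a, message, hashlib.sha1).digest(), "big")
            h2 = int.from_bytes(hmac.new(key_b, message, hashlib.md5).digest(), "big")
            # Double hashing: derive several positions from the two digests.
            for i in range(hashes_per_token):
                bits[(h1 + i * h2) % num_bits] = 1
    return bits


one_field = toy_clk(["michaela"], ("key1", "key2"))
two_fields = toy_clk(["michaela", "neumann"], ("key1", "key2"))
print(sum(one_field), sum(two_fields))
```

Every extra field can only set more bits, so encoding many fields into a fixed 1024-bit filter pushes the popcount up.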

In [45]:

len(hashed_data_a)

Out[45]:

5000

Each CLK is serialized in a JSON friendly base64 format:

In [46]:

hashed_data_a[0]

Out[46]:

'+79/+3//33/+O///fv//N67/7/+/u/vP////+3/+/7//d///////ft/////vM/7v/v//1/+///9/fv9f7/zu//5/+/ffv+9//z97v7///////9P//79f/7///76e/f9///////////P/4/b/S+///fv8//3vf//5n//r37/+/98='
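
The popcount of a single serialized CLK can be checked by decoding the string and counting the set bits. The sketch below assumes that the string is plain base64 of the filter's raw bytes; popcount_b64 is a hypothetical helper, and the 4-byte sample is made up so that the result is easy to verify by hand.

```python
import base64


def popcount_b64(encoded):
    """Return (bits set, total bits) for a base64-encoded bit array."""
    raw = base64.b64decode(encoded)
    bits_set = sum(bin(byte).count("1") for byte in raw)
    return bits_set, len(raw) * 8


# Hypothetical 32-bit filter: 8 + 4 + 0 + 1 bits set.
sample = base64.b64encode(bytes([0xFF, 0x0F, 0x00, 0x01])).decode("ascii")
bits_set, total_bits = popcount_b64(sample)
print(bits_set, total_bits)  # 13 32
```

Passing one of the entries of hashed_data_a to the same function should give that record's popcount, under the serialization assumption stated above.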

Hash data set B

Now we hash the second dataset using the same keys and same schema.

In [47]:

b_csv = io.StringIO()
dfB.to_csv(b_csv)
b_csv.seek(0)
hashed_data_b = clkhash.clk.generate_clk_from_csv(b_csv, ('key1', 'key2'), schema)

generating CLKs: 100%|██████████| 5.00K/5.00K [00:02<00:00, 2.44Kclk/s, mean=909, std=32]

In [48]:

len(hashed_data_b)

Out[48]:

5000

Wrapping Up

That is all for this tutorial. Next you might want to look at comparing the CLKs with anonlink, or uploading them to an Entity Service.
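
As a rough picture of what comparing CLKs involves, Bloom filter encodings are commonly scored with the Sørensen-Dice coefficient: twice the number of bits set in both filters, divided by the total number of bits set in each. The sketch below computes it for two base64 strings using only the standard library. It assumes the plain base64 serialization described earlier, and dice_coefficient is a hypothetical helper, not part of anonlink's API.

```python
import base64


def dice_coefficient(clk_a, clk_b):
    """Sørensen-Dice similarity of two base64-encoded bit arrays."""
    a = int.from_bytes(base64.b64decode(clk_a), "big")
    b = int.from_bytes(base64.b64decode(clk_b), "big")
    common = bin(a & b).count("1")
    total = bin(a).count("1") + bin(b).count("1")
    return 2 * common / total if total else 0.0


# Hypothetical 8-bit filters: 4 bits set in each, 2 bits in common.
left = base64.b64encode(bytes([0b11110000])).decode("ascii")
right = base64.b64encode(bytes([0b11000011])).decode("ascii")
score = dice_coefficient(left, right)
print(score)  # 0.5
```

When filters are as full as the ones in this tutorial (roughly 900 of 1024 bits set), even unrelated records share most of their set bits, so their scores sit close to those of true matches. This is one way a high popcount can hurt accuracy.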

Note that the clkhash command line tool includes commands to upload to an Entity Service run by Data61.