Tutorial¶
For this tutorial we are going to process a data set for private linkage with clkhash using the Python API. Note that you can also use the command line tool.
The Python package recordlinkage has a tutorial on linking data sets in the clear; we will try to duplicate that in a privacy-preserving setting.
First install clkhash, recordlinkage and a few data science tools (pandas and numpy).
In [ ]:
!pip install -U clkhash recordlinkage numpy pandas
In [24]:
import io
import numpy as np
import pandas as pd
In [17]:
import clkhash
import recordlinkage
from recordlinkage.datasets import load_febrl4
Data Exploration¶
First we have a look at the dataset.
In [4]:
dfA, dfB = load_febrl4()
dfA.head()
Out[4]:
rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
---|---|---|---|---|---|---|---|---|---|---|
rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
rec-1288-org | vanessa | parr | 905 | macquoid place | broadbridge manor | south grafton | 2135 | sa | 19951119 | 9239102 |
rec-3585-org | mikayla | malloney | 37 | randwick road | avalind | hoppers crossing | 4552 | vic | 19860208 | 7207688 |
For this linkage we will not use the social security ID column (soc_sec_id).
In [18]:
dfA.columns
Out[18]:
Index(['given_name', 'surname', 'street_number', 'address_1', 'address_2',
'suburb', 'postcode', 'state', 'date_of_birth', 'soc_sec_id'],
dtype='object')
In [41]:
a_csv = io.StringIO()
dfA.to_csv(a_csv)
a_csv.seek(0)
Out[41]:
0
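The seek(0) call matters because writing to a StringIO buffer leaves its position at the end, so a reader starting there would see nothing. seek returns the new position, which is the 0 shown in the output above. A minimal standard-library sketch of that behaviour, using a made-up two-column CSV rather than the FEBRL data:

```python
import io

buf = io.StringIO()
buf.write("rec_id,given_name\nrec-0,alice\n")  # hypothetical sample rows

read_at_end = buf.read()    # position is at the end, so nothing is read
new_position = buf.seek(0)  # rewind; seek returns the new position
content = buf.read()        # now the full CSV text is available

print(repr(read_at_end), new_position)
print(content)
```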
Linkage Schema Definition¶
A basic schema definition tells clkhash how to treat each column.
The identifiers are found in clkhash/identifier_types.py. The INDEX columns will not become part of the CLK.
Here the list has one entry per CSV column, in order: the rec_id index written by to_csv comes first and soc_sec_id comes last.
Note: The schema specification is under heavy renovation for the next version.
In [42]:
column_metadata = [
    {"identifier": 'INDEX'},                  # rec_id (index, not hashed)
    {"identifier": 'NAME First Name'},        # given_name
    {"identifier": 'NAME Surname'},           # surname
    {"identifier": 'ADDRESS House Number'},   # street_number
    {"identifier": 'ADDRESS Place Name'},     # address_1
    {"identifier": 'ADDRESS Place Name'},     # address_2
    {"identifier": 'ADDRESS Place Name'},     # suburb
    {"identifier": 'ADDRESS POSTCODE'},       # postcode
    {"identifier": 'ADDRESS Place Name'},     # state
    {"identifier": 'DOB YYYY/MM/DD'},         # date_of_birth
    {"identifier": 'INDEX'}                   # soc_sec_id (excluded from the CLK)
]
In [43]:
schema = clkhash.schema.get_schema_types(column_metadata)
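Because the schema is a plain list matched to the CSV columns by position, a quick check that it lines up with the header can catch off-by-one mistakes. The helper below is a hypothetical sketch (pair_columns is not part of clkhash) shown on a made-up three-column header. It pairs each column with its identifier and lists the columns that would contribute to the CLK, assuming INDEX entries are the only ones skipped.

```python
def pair_columns(columns, metadata):
    """Pair CSV column names with schema identifiers, position by position."""
    if len(columns) != len(metadata):
        raise ValueError(
            f"{len(columns)} columns but {len(metadata)} schema entries"
        )
    return [(col, entry["identifier"]) for col, entry in zip(columns, metadata)]


# Made-up example: an index column, one hashed column, one ignored column.
example_columns = ["rec_id", "postcode", "soc_sec_id"]
example_metadata = [
    {"identifier": "INDEX"},
    {"identifier": "ADDRESS POSTCODE"},
    {"identifier": "INDEX"},
]

pairs = pair_columns(example_columns, example_metadata)
hashed_columns = [col for col, ident in pairs if ident != "INDEX"]
print(pairs)
print(hashed_columns)
```

For the data in this tutorial, the column list would be the index name (rec_id) followed by list(dfA.columns).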
Hash the data¶
We can now hash the PII in the CSV file using our defined schema. We must provide two secret keys to this command; both parties hashing data must use the same keys.
Knowledge of these keys is sufficient to reconstruct the PII from a CLK! Do not share these keys with anyone except the other participating party.
In [44]:
hashed_data_a = clkhash.clk.generate_clk_from_csv(a_csv, ('key1', 'key2'), schema)
generating CLKs: 100%|██████████| 5.00K/5.00K [00:02<00:00, 1.46Kclk/s, mean=916, std=27.1]
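The keys 'key1' and 'key2' are placeholders suited to a tutorial. As a sketch of one way to pick less guessable values, the snippet below draws two random hex strings with Python's secrets module. The key length of 32 bytes is an assumption for illustration, not a clkhash requirement, and the keys would still have to be exchanged securely with the other party.

```python
import secrets

# 32 random bytes per key, hex-encoded (length chosen for illustration only).
keys = (secrets.token_hex(32), secrets.token_hex(32))

print(len(keys), [len(k) for k in keys])
```

The resulting tuple of strings should be usable in place of ('key1', 'key2') in the call above, provided both parties use exactly the same pair.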
Inspect the output¶
clkhash has hashed the PII, creating a Cryptographic Longterm Key (CLK) for each entity. The output of generate_clk_from_csv shows that the mean popcount (the number of bits set to 1) is quite high (916 out of 1024), which can affect accuracy.
In [45]:
len(hashed_data_a)
Out[45]:
5000
Each CLK is serialized in a JSON-friendly base64 format:
In [46]:
hashed_data_a[0]
Out[46]:
'+79/+3//33/+O///fv//N67/7/+/u/vP////+3/+/7//d///////ft/////vM/7v/v//1/+///9/fv9f7/zu//5/+/ffv+9//z97v7///////9P//79f/7///76e/f9///////////P/4/b/S+///fv8//3vf//5n//r37/+/98='
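To check the popcount of an individual CLK, one can decode the base64 string and count the bits that are set. The helper below is a minimal standard-library sketch (popcount_b64 is a hypothetical name, not a clkhash function), demonstrated on a tiny two-byte value rather than a real CLK.

```python
import base64


def popcount_b64(encoded):
    """Count the bits set to 1 in a base64-encoded byte string."""
    raw = base64.b64decode(encoded)
    return sum(bin(byte).count("1") for byte in raw)


# Two bytes, 0xFF and 0x00: eight bits set out of sixteen.
sample = base64.b64encode(b"\xff\x00").decode("ascii")
print(sample, popcount_b64(sample))
```

Applied to hashed_data_a[0], the same function gives that record's popcount; averaging it over all records should land close to the mean reported in the progress bar.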
Hash data set B¶
Now we hash the second data set using the same keys and the same schema.
In [47]:
b_csv = io.StringIO()
dfB.to_csv(b_csv)
b_csv.seek(0)
hashed_data_b = clkhash.clk.generate_clk_from_csv(b_csv, ('key1', 'key2'), schema)
generating CLKs: 100%|██████████| 5.00K/5.00K [00:02<00:00, 2.44Kclk/s, mean=909, std=32]
In [48]:
len(hashed_data_b)
Out[48]:
5000