Tutorial for CLI tool clkhash¶
In this tutorial we will process a data set for private linkage with clkhash using the command line tool clkutil. Note that you can also use the Python API.
The Python package recordlinkage has a tutorial linking data sets in the clear; we will try to duplicate that in a privacy-preserving setting.
First install clkhash, recordlinkage and a few data science tools (pandas and numpy).
In [ ]:
!pip install -U clkhash recordlinkage numpy pandas
In [1]:
import io
import json
import numpy as np
import pandas as pd
In [2]:
import recordlinkage
from recordlinkage.datasets import load_febrl4
Data Exploration¶
First we have a look at the dataset.
In [3]:
dfA, dfB = load_febrl4()
dfA.head()
Out[3]:
| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
| rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
| rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
| rec-1288-org | vanessa | parr | 905 | macquoid place | broadbridge manor | south grafton | 2135 | sa | 19951119 | 9239102 |
| rec-3585-org | mikayla | malloney | 37 | randwick road | avalind | hoppers crossing | 4552 | vic | 19860208 | 7207688 |
Note that for computing this linkage we will not use the social security id column or the rec_id index.
In [4]:
dfA.columns
Out[4]:
Index(['given_name', 'surname', 'street_number', 'address_1', 'address_2',
'suburb', 'postcode', 'state', 'date_of_birth', 'soc_sec_id'],
dtype='object')
In [5]:
dfA.to_csv('PII_a.csv')
Hashing Schema Definition¶
A hashing schema instructs clkhash how to treat each column when generating CLKs. A detailed description of the hashing schema can be found in the API docs. We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation.
In [6]:
%%writefile schema.json
{
"version": 1,
"clkConfig": {
"l": 1024,
"k": 30,
"hash": {
"type": "doubleHash"
},
"kdf": {
"type": "HKDF",
"hash": "SHA256",
"info": "c2NoZW1hX2V4YW1wbGU=",
"salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
"keySize": 64
}
},
"features": [
{
"identifier": "rec_id",
"ignored": true
},
{
"identifier": "given_name",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "surname",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "street_number",
"format": { "type": "integer" },
"hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
},
{
"identifier": "address_1",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "address_2",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "suburb",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "postcode",
"format": { "type": "integer", "minimum": 100, "maximum": 9999 },
"hashing": { "ngram": 1, "positional": true, "weight": 1 }
},
{
"identifier": "state",
"format": { "type": "string", "encoding": "utf-8", "maxLength": 3 },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "date_of_birth",
"format": { "type": "integer" },
"hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
},
{
"identifier": "soc_sec_id",
"ignored": true
}
]
}
Overwriting schema.json
Hash the data¶
We can now hash our Personally Identifiable Information (PII) data from the CSV file using our defined linkage schema. We must provide two secret keys to this command; both parties have to use the same keys when hashing their data. For this toy example we will use the keys ‘key1’ and ‘key2’. For real data, make sure that the keys contain enough entropy, as knowledge of these keys is sufficient to reconstruct the PII from a CLK! Also, do not share these keys with anyone except the other participating party.
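For real deployments, sufficiently random keys can be generated with Python's standard secrets module (a sketch; clkhash accepts arbitrary strings as keys):

```python
import secrets

# Two independent 256-bit secrets, hex-encoded (64 characters each).
# These would replace the toy values 'key1' and 'key2' used below.
key1 = secrets.token_hex(32)
key2 = secrets.token_hex(32)
```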
In [7]:
!clkutil hash PII_a.csv key1 key2 schema.json clks_a.json
generating CLKs: 100%|█| 5.00k/5.00k [00:05<00:00, 927clk/s, mean=885, std=33.4]
CLK data written to clks_a.json
Inspect the output¶
clkhash has hashed the PII, creating a Cryptographic Longterm Key for each entity. The progress bar output shows that the mean popcount is quite high (885 out of 1024), which can affect accuracy.
There are two ways to control the popcount:

- You can change the ‘k’ value in the clkConfig section of the linkage schema. This controls the number of entries in the CLK for each n-gram.
- You can modify the individual ‘weight’ values for the different fields. This allows you to tune the contribution of a column to the CLK, which can be used to de-emphasise columns that are less suitable for linkage (e.g. information that changes frequently).
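The mean popcount reported by the progress bar can also be checked directly from the serialised CLKs. A minimal sketch, assuming the base64 serialisation shown later in this tutorial (the sample value below is made up, not a real 1024-bit CLK):

```python
import base64

def popcount(clk_b64):
    """Number of set bits in a base64-serialised CLK."""
    raw = base64.b64decode(clk_b64)
    return sum(bin(byte).count('1') for byte in raw)

# A short, made-up bit pattern: 0b10110000 and 0b00000111 have 3 set bits each.
sample = base64.b64encode(bytes([0b10110000, 0b00000111])).decode()
print(popcount(sample))  # 6
```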
First, we will change the value of k from 30 to 15.
In [8]:
schema = json.load(open('schema.json', 'rt'))
schema['clkConfig']['k'] = 15
json.dump(schema, open('schema.json', 'wt'))
!clkutil hash PII_a.csv key1 key2 schema.json clks_a.json
generating CLKs: 100%|█| 5.00k/5.00k [00:04<00:00, 867clk/s, mean=648, std=44.1]
CLK data written to clks_a.json
And now we will modify the weights to de-emphasise the contribution of the address related columns.
In [9]:
schema = json.load(open('schema.json', 'rt'))
schema['clkConfig']['k'] = 20
address_features = ['street_number', 'address_1', 'address_2', 'suburb', 'postcode', 'state']
for feature in schema['features']:
if feature['identifier'] in address_features:
feature['hashing']['weight'] = 0.5
json.dump(schema, open('schema.json', 'wt'))
!clkutil hash PII_a.csv key1 key2 schema.json clks_a.json
generating CLKs: 100%|█| 5.00k/5.00k [00:04<00:00, 924clk/s, mean=602, std=39.8]
CLK data written to clks_a.json
Each CLK is serialized in a JSON-friendly base64 format:
In [10]:
# If you have jq tool installed:
#!jq .clks[0] clks_a.json
import json
json.load(open('clks_a.json'))['clks'][0]
Out[10]:
'BD8JWW7DzwP82PjV5/jbN40+bT3V4z7V+QBtHYcdF32WpPvDvHUdLXCX3tuV1/4rv+23v9R1fKmJcmoNi7OvoecRLMnHzqv9J5SfT15VXe7KPht9d49zRt73+l3Tfs+Web8kx32vSdo+SfnlHqKbn11V6w9zFm3kb07e67MX7tw='
Hash data set B¶
Now we hash the second dataset using the same keys and same schema.
In [11]:
dfB.to_csv('PII_b.csv')
!clkutil hash PII_b.csv key1 key2 schema.json clks_b.json
generating CLKs: 100%|█| 5.00k/5.00k [00:04<00:00, 964clk/s, mean=592, std=45.5]
CLK data written to clks_b.json
Find matches between the two sets of CLKs¶
We have generated two sets of CLKs which represent entity information in a privacy-preserving way. The more similar two CLKs are, the more likely it is that they represent the same entity.
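Similarity between two CLKs is commonly measured with the Sørensen-Dice coefficient over the set bits of the two bit arrays. A self-contained sketch of that computation (the inputs below are illustrative byte patterns, not real CLKs):

```python
import base64

def dice_similarity(clk_a_b64, clk_b_b64):
    """Sørensen-Dice coefficient of two base64-serialised bit arrays."""
    a = base64.b64decode(clk_a_b64)
    b = base64.b64decode(clk_b_b64)
    common = sum(bin(x & y).count('1') for x, y in zip(a, b))
    total = sum(bin(x).count('1') for x in a + b)
    return 2 * common / total if total else 0.0

x = base64.b64encode(bytes([0b1111])).decode()
y = base64.b64encode(bytes([0b0011])).decode()
print(dice_similarity(x, y))  # 2*2 / (4+2) ≈ 0.667
```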
For this task we will use the entity service provided by Data61. The necessary steps are as follows:

- The analyst creates a new project with the output type ‘mapping’. They will receive a set of credentials from the server.
- The analyst then distributes the update_tokens to the participating data providers.
- The data providers individually upload their respective CLKs.
- The analyst can create runs with various thresholds (and other settings).
- After the entity service has successfully computed the mapping, it can be accessed by providing the result_token.

First we check the status of the entity service:
In [13]:
SERVER = 'https://testing.es.data61.xyz'
!clkutil status --server={SERVER}
{"project_count": 223, "rate": 52027343, "status": "ok"}
The analyst creates a new project on the entity service by providing the hashing schema and result type. The server returns a set of credentials which provide access to the further steps of the project.
In [15]:
!clkutil create-project --server={SERVER} --schema schema.json --output credentials.json --type "mapping" --name "tutorial"
Entity Matching Server: https://testing.es.data61.xyz
Checking server status
Server Status: ok
The returned credentials contain:

- a project_id, which identifies the project,
- a result_token, which gives access to the mapping result once it is computed,
- upload_tokens, one for each provider, which allow uploading CLKs.
In [16]:
credentials = json.load(open('credentials.json', 'rt'))
!python -m json.tool credentials.json
{
"project_id": "5c9a47049161bcb3f32dd1fef4c71c1df9cc7658f5e2cd55",
"result_token": "2886b2faf85ad994339059f192a1b8f32206ec32d878b160",
"update_tokens": [
"7d08294eed16bbe8b3189d193358258b3b5045e67f44306f",
"04da88e3a5e90aa55049c5a2e8a7085a8bc691653d895447"
]
}
Uploading the CLKs to the entity service¶
Each party individually uploads its respective CLKs to the entity service. They need to provide the project_id, which identifies the correct project, and their update_token.
In [17]:
!clkutil upload \
--project="{credentials['project_id']}" \
--apikey="{credentials['update_tokens'][0]}" \
--output "upload_a.json" \
--server="{SERVER}" \
"clks_a.json"
!clkutil upload \
--project="{credentials['project_id']}" \
--apikey="{credentials['update_tokens'][1]}" \
--output "upload_b.json" \
--server="{SERVER}" \
"clks_b.json"
Uploading CLK data from clks_a.json
To Entity Matching Server: https://testing.es.data61.xyz
Project ID: 5c9a47049161bcb3f32dd1fef4c71c1df9cc7658f5e2cd55
Checking server status
Status: ok
Uploading CLK data to the server
Uploading CLK data from clks_b.json
To Entity Matching Server: https://testing.es.data61.xyz
Project ID: 5c9a47049161bcb3f32dd1fef4c71c1df9cc7658f5e2cd55
Checking server status
Status: ok
Uploading CLK data to the server
Now that the CLK data has been uploaded, the analyst can create one or more runs. Here we will start by calculating a mapping with a threshold of 0.9:
In [18]:
!clkutil create --verbose \
--server="{SERVER}" \
--output "run_info.json" \
--threshold=0.9 \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--name="tutorial_run"
Entity Matching Server: https://testing.es.data61.xyz
Checking server status
Server Status: ok
In [23]:
run_info = json.load(open('run_info.json', 'rt'))
run_info
Out[23]:
{'name': 'tutorial_run',
'notes': 'Run created by clkhash command line tool',
'run_id': 'b700b16393eb5eb704322497226078c36ad9e16724797239',
'threshold': 0.9}
Results¶
Now, after some delay (depending on the size of the data), we can fetch the results with clkutil:
In [26]:
!clkutil results \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--run="{run_info['run_id']}" \
--server="{SERVER}" \
--output results.txt
with open('results.txt') as f:
str_mapping = json.load(f)['mapping']
mapping = {int(k): int(v) for k,v in str_mapping.items()}
print('The service linked {} entities.'.format(len(mapping)))
The service linked 3636 entities.
Checking server status
Status: ok
Response code: 200
Received result
Let’s investigate some of those matches and the overall matching quality.
In [27]:
with open('PII_a.csv', 'rt') as f:
a_raw = f.readlines()
with open('PII_b.csv', 'rt') as f:
b_raw = f.readlines()
num_entities = len(b_raw) - 1
print('idx_a, idx_b, rec_id_a, rec_id_b')
print('--------------------------------')
for a_i in range(10):
if a_i in mapping:
a_data = a_raw[a_i + 1].split(',')
b_data = b_raw[mapping[a_i] + 1].split(',')
print('{}, {}, {}, {}'.format(a_i+1, mapping[a_i]+1, a_data[0], b_data[0]))
TP = 0; FP = 0; TN = 0; FN = 0
for a_i in range(num_entities):
if a_i in mapping:
if a_raw[a_i + 1].split(',')[0].split('-')[1] == b_raw[mapping[a_i] + 1].split(',')[0].split('-')[1]:
TP += 1
else:
FP += 1
FN += 1 # as we only report one mapping for each element in PII_a, then a wrong mapping is not only a false positive, but also a false negative, as we won't report the true mapping.
else:
FN += 1 # every element in PII_a has a partner in PII_b
print('--------------------------------')
print('Precision: {}, Recall: {}, Accuracy: {}'.format(TP/(TP+FP), TP/(TP+FN), (TP+TN)/(TP+TN+FP+FN)))
idx_a, idx_b, rec_id_a, rec_id_b
--------------------------------
2, 2751, rec-1016-org, rec-1016-dup-0
3, 4657, rec-4405-org, rec-4405-dup-0
4, 4120, rec-1288-org, rec-1288-dup-0
5, 3307, rec-3585-org, rec-3585-dup-0
7, 3945, rec-1985-org, rec-1985-dup-0
8, 993, rec-2404-org, rec-2404-dup-0
9, 4613, rec-1473-org, rec-1473-dup-0
10, 3630, rec-453-org, rec-453-dup-0
--------------------------------
Precision: 1.0, Recall: 0.7272, Accuracy: 0.7272
Precision tells us how many of the found matches are actual matches. The score of 1.0 means that we did perfectly in this respect. However, recall, the measure of how many of the actual matches were correctly identified, is quite low at only 73%.
Let’s go back and create another mapping with a threshold value of 0.8.
Great, for this threshold value we get a precision of 100% and a recall of 95.3%.
The explanation is that when the information about an entity differs slightly between the two datasets (e.g. spelling errors, abbreviations, missing values, …), then the corresponding CLKs will differ in some number of bits as well. For the datasets in this tutorial the perturbations are such that only 72.7% of the derived CLK pairs overlap by more than 90%, whereas almost all matching pairs overlap by more than 80%.
If we keep reducing the threshold value, then we will start to observe mistakes in the found matches – the precision decreases. But at the same time the recall value will keep increasing for a while, as a lower threshold allows for more of the actual matches to be found, e.g.: for threshold 0.72, we get precision: 0.997 and recall: 0.992. However, reducing the threshold further will eventually lead to a decrease in both precision and recall: for threshold 0.65 precision is 0.983 and recall is 0.980. Thus it is important to choose an appropriate threshold for the amount of perturbations present in the data.
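The precision/recall bookkeeping from the evaluation cell above can be factored into a small helper, making it easy to re-run for each threshold (a hypothetical helper, not part of clkhash; like the notebook code, it assumes every record in A has exactly one true partner in B):

```python
def evaluate(mapping, true_ids_a, true_ids_b):
    """Precision and recall of a one-to-one mapping from A-indices to
    B-indices, given the ground-truth entity id of each record."""
    tp = fp = fn = 0
    for a_i, true_id in enumerate(true_ids_a):
        if a_i in mapping:
            if true_ids_b[mapping[a_i]] == true_id:
                tp += 1
            else:
                fp += 1
                fn += 1  # the true partner was also missed
        else:
            fn += 1  # every record in A has a partner in B
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Tiny illustrative example: record 1 is matched to the wrong partner,
# record 2 is not matched at all.
p, r = evaluate({0: 0, 1: 2}, ['e1', 'e2', 'e3'], ['e1', 'x', 'y'])
print(p, r)  # precision 0.5, recall 1/3
```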
This concludes the tutorial. Feel free to go back to the CLK generation and experiment with how different settings affect the matching quality.