Tutorial for Python API¶
For this tutorial we are going to process a data set for private linkage with clkhash using the Python API. Note that you can also use the command line tool.
The Python package recordlinkage has a tutorial on linking data sets in the clear; we will try to duplicate that in a privacy-preserving setting.
First install clkhash, anonlink, recordlinkage and a few data science tools (pandas and numpy):
$ pip install -U clkhash anonlink recordlinkage numpy pandas
[1]:
import io
import numpy as np
import pandas as pd
[2]:
import clkhash
from clkhash import clk
from clkhash.field_formats import *
from clkhash.schema import Schema
[3]:
import recordlinkage
from recordlinkage.datasets import load_febrl4
Data Exploration¶
First we have a look at the dataset.
[4]:
dfA, dfB = load_febrl4()
dfA.head()
[4]:
rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id
---|---|---|---|---|---|---|---|---|---|---
rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218
rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625
rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168
rec-1288-org | vanessa | parr | 905 | macquoid place | broadbridge manor | south grafton | 2135 | sa | 19951119 | 9239102
rec-3585-org | mikayla | malloney | 37 | randwick road | avalind | hoppers crossing | 4552 | vic | 19860208 | 7207688
For this linkage we will not use the social security id column.
[5]:
dfA.columns
[5]:
Index(['given_name', 'surname', 'street_number', 'address_1', 'address_2',
'suburb', 'postcode', 'state', 'date_of_birth', 'soc_sec_id'],
dtype='object')
[6]:
a_csv = io.StringIO()
dfA.to_csv(a_csv)
Hashing Schema Definition¶
A hashing schema instructs clkhash how to treat each column when generating CLKs. A detailed description of the hashing schema can be found in the API docs. We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation.
[7]:
fields = [
Ignore('rec_id'),
StringSpec('given_name', FieldHashingProperties(ngram=2, num_bits=300)),
StringSpec('surname', FieldHashingProperties(ngram=2, num_bits=300)),
IntegerSpec('street_number', FieldHashingProperties(ngram=1, positional=True, num_bits=300, missing_value=MissingValueSpec(sentinel=''))),
StringSpec('address_1', FieldHashingProperties(ngram=2, num_bits=300)),
StringSpec('address_2', FieldHashingProperties(ngram=2, num_bits=300)),
StringSpec('suburb', FieldHashingProperties(ngram=2, num_bits=300)),
IntegerSpec('postcode', FieldHashingProperties(ngram=1, positional=True, num_bits=300)),
StringSpec('state', FieldHashingProperties(ngram=2, num_bits=300)),
IntegerSpec('date_of_birth', FieldHashingProperties(ngram=1, positional=True, num_bits=300, missing_value=MissingValueSpec(sentinel=''))),
Ignore('soc_sec_id')
]
schema = Schema(fields, 1024)
Hash the data¶
We can now hash our PII data from the CSV file using our defined schema. We must provide a list of secret keys to this command; the same keys have to be used by both parties hashing their data. For this toy example we will use the keys ‘key1’ and ‘key2’. For real data, make sure that the keys contain enough entropy, as knowledge of these keys is sufficient to reconstruct the PII from a CLK!
Also, do not share these keys with anyone, except the other participating party.
[8]:
secret_keys = ('key1', 'key2')
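For real data, high-entropy secrets could, for example, be generated with Python’s standard-library secrets module. This is only a sketch; the variable name is ours and is not used elsewhere in this tutorial:

import secrets

# Illustrative only: two 32-byte, hex-encoded random keys. Both parties must
# use the same values and exchange them over a secure channel.
real_secret_keys = tuple(secrets.token_hex(32) for _ in range(2))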
[9]:
a_csv.seek(0)
hashed_data_a = clk.generate_clk_from_csv(a_csv, secret_keys, schema)
generating CLKs: 100%|██████████| 5.00k/5.00k [00:00<00:00, 1.07kclk/s, mean=950, std=9.79]
Inspect the output¶
clkhash has hashed the PII, creating a Cryptographic Longterm Key (CLK) for each entity. The output of generate_clk_from_csv
shows that the mean popcount is quite high (950 out of 1024 bits set), which can hurt linkage accuracy because the CLKs become harder to tell apart.
We can control the popcount by adjusting the hashing strategy. There are currently two different strategies implemented in the library:

- fixed k: each n-gram of a feature’s value is inserted into the CLK k times, for a total of numberOfTokens * k insertions. Increasing k gives the corresponding feature more importance in comparisons; decreasing k de-emphasises columns which are less suitable for linkage (e.g. information that changes frequently). This strategy is set with the ‘k=30’ argument in a feature’s FieldHashingProperties.
- fixed number of bits: a fixed number of bits is always inserted into the CLK for a feature, irrespective of the number of n-grams. This strategy is set with the ‘num_bits=100’ argument in a feature’s FieldHashingProperties.
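For comparison, a minimal sketch of what the fixed k strategy could look like for a couple of fields (the variable name and the k values here are purely illustrative, not taken from this tutorial):

# Illustrative only: with the fixed k strategy each 2-gram of the value is
# inserted into the CLK k times, i.e. numberOfTokens * 20 insertions per field.
example_fixed_k_fields = [
    Ignore('rec_id'),
    StringSpec('given_name', FieldHashingProperties(ngram=2, k=20)),
    StringSpec('surname', FieldHashingProperties(ngram=2, k=20)),
    # ... the remaining fields would follow the same pattern ...
    Ignore('soc_sec_id')
]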
In this example, we will reduce the value of num_bits for the address-related columns.
[10]:
fields = [
Ignore('rec_id'),
StringSpec('given_name', FieldHashingProperties(ngram=2, num_bits=200)),
StringSpec('surname', FieldHashingProperties(ngram=2, num_bits=200)),
IntegerSpec('street_number', FieldHashingProperties(ngram=1, positional=True, num_bits=100, missing_value=MissingValueSpec(sentinel=''))),
StringSpec('address_1', FieldHashingProperties(ngram=2, num_bits=100)),
StringSpec('address_2', FieldHashingProperties(ngram=2, num_bits=100)),
StringSpec('suburb', FieldHashingProperties(ngram=2, num_bits=100)),
IntegerSpec('postcode', FieldHashingProperties(ngram=1, positional=True, num_bits=100)),
StringSpec('state', FieldHashingProperties(ngram=2, num_bits=100)),
IntegerSpec('date_of_birth', FieldHashingProperties(ngram=1, positional=True, num_bits=200, missing_value=MissingValueSpec(sentinel=''))),
Ignore('soc_sec_id')
]
schema = Schema(fields, 1024)
a_csv.seek(0)
hashed_data_a = clk.generate_clk_from_csv(a_csv, secret_keys, schema)
generating CLKs: 100%|██████████| 5.00k/5.00k [00:00<00:00, 11.3kclk/s, mean=705, std=15.5]
Each CLK is serialized in a JSON-friendly base64 format:
[11]:
hashed_data_a[0]
[11]:
'wTmf3/rPF3Pj/85fORXpee/9+v3/1o9714/7d/bW+G7+9N3Cij///a1//nr/9/cZn/BT9+kWnl9203/eOtvM4G4s3e8lX+7X+f0kXez7XbOfevz7/r6wvN99Mncp367yPeZW3uMYv9Evf9/sPuOq3+p79t6/qn/v7O5e/Jurvr8='
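As a quick sanity check (not part of the original workflow), we can decode one serialized CLK and count its set bits, which should be close to the mean popcount reported during hashing:

import base64

# Decode the first CLK and count the bits that are set.
raw_clk = base64.b64decode(hashed_data_a[0])
print(sum(bin(byte).count('1') for byte in raw_clk))  # roughly the reported mean of ~705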
Hash data set B¶
Now we hash the second dataset using the same keys and same schema.
[12]:
b_csv = io.StringIO()
dfB.to_csv(b_csv)
b_csv.seek(0)
hashed_data_b = clkhash.clk.generate_clk_from_csv(b_csv, secret_keys, schema)
generating CLKs: 100%|██████████| 5.00k/5.00k [00:00<00:00, 11.5kclk/s, mean=703, std=19.1]
[13]:
len(hashed_data_b)
[13]:
5000
Find matches between the two sets of CLKs¶
We have generated two sets of CLKs which represent entity information in a privacy-preserving way. The more similar two CLKs are, the more likely it is that they represent the same entity.
For this task we will use anonlink, a Python (and optimised C++) implementation of anonymous linkage using CLKs.
As the CLKs are serialized as strings, we first deserialize them into the bitarray type:
[14]:
from bitarray import bitarray
import base64
def deserialize_bitarray(bytes_data):
ba = bitarray(endian='big')
data_as_bytes = base64.decodebytes(bytes_data.encode())
ba.frombytes(data_as_bytes)
return ba
def deserialize_filters(filters):
    return [deserialize_bitarray(f) for f in filters]
clks_a = deserialize_filters(hashed_data_a)
clks_b = deserialize_filters(hashed_data_b)
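To make ‘similarity’ concrete: anonlink compares CLKs with the Dice coefficient, which for two bitarrays A and B is 2 * |A AND B| / (|A| + |B|), where |.| counts the set bits. A minimal sketch using the bitarrays we just created (the helper function is ours, for illustration only):

# Dice coefficient: twice the number of common set bits, divided by the
# total number of set bits in the two bitarrays.
def dice(a, b):
    return 2 * (a & b).count() / (a.count() + b.count())

print(dice(clks_a[0], clks_b[0]))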
Using anonlink we find the candidate pairs, which are all pairs of CLKs whose similarity is above the given threshold. Then we solve for the most likely mapping.
[15]:
import anonlink
def mapping_from_clks(clks_a, clks_b, threshold):
results_candidate_pairs = anonlink.candidate_generation.find_candidate_pairs(
[clks_a, clks_b],
anonlink.similarities.dice_coefficient,
threshold
)
solution = anonlink.solving.greedy_solve(results_candidate_pairs)
print('Found {} matches'.format(len(solution)))
return {a:b for ((_, a),(_, b)) in solution}
[16]:
mapping = mapping_from_clks(clks_a, clks_b, 0.9)
Found 4058 matches
Let’s investigate some of those matches and the overall matching quality.
[17]:
a_csv.seek(0)
b_csv.seek(0)
a_raw = a_csv.readlines()
b_raw = b_csv.readlines()
num_entities = len(b_raw) - 1
def describe_accuracy(mapping, show_examples=False):
if show_examples:
print('idx_a, idx_b, rec_id_a, rec_id_b')
print('---------------------------------------------')
for a_i in range(10):
if a_i in mapping:
a_data = a_raw[a_i + 1].split(',')
b_data = b_raw[mapping[a_i] + 1].split(',')
print('{:3}, {:6}, {:>15}, {:>15}'.format(a_i+1, mapping[a_i]+1, a_data[0], b_data[0]))
print('---------------------------------------------')
TP = 0; FP = 0; TN = 0; FN = 0
for a_i in range(num_entities):
if a_i in mapping:
if a_raw[a_i + 1].split(',')[0].split('-')[1] == b_raw[mapping[a_i] + 1].split(',')[0].split('-')[1]:
TP += 1
else:
FP += 1
# as we only report one mapping for each element in PII_a,
# then a wrong mapping is not only a false positive, but
# also a false negative, as we won't report the true mapping.
FN += 1
else:
FN += 1 # every element in PII_a has a partner in PII_b
print()
print("We've got {} true positives, {} false positives, and {} false negatives.".format(TP, FP, FN))
print('Precision: {:.3f}, Recall: {:.3f}, Accuracy: {:.3f}'.format(
TP/(TP+FP),
TP/(TP+FN),
(TP+TN)/(TP+TN+FP+FN)))
[18]:
describe_accuracy(mapping, show_examples=True)
idx_a, idx_b, rec_id_a, rec_id_b
---------------------------------------------
2, 2751, rec-1016-org, rec-1016-dup-0
3, 4657, rec-4405-org, rec-4405-dup-0
4, 4120, rec-1288-org, rec-1288-dup-0
5, 3307, rec-3585-org, rec-3585-dup-0
6, 2306, rec-298-org, rec-298-dup-0
7, 3945, rec-1985-org, rec-1985-dup-0
8, 993, rec-2404-org, rec-2404-dup-0
9, 4613, rec-1473-org, rec-1473-dup-0
10, 3630, rec-453-org, rec-453-dup-0
---------------------------------------------
We've got 4058 true positives, 0 false positives, and 942 false negatives.
Precision: 1.000, Recall: 0.812, Accuracy: 0.812
Precision tells us how many of the reported matches are actual matches. The score of 1.0 means we did perfectly in this respect; however recall, the measure of how many of the actual matches were correctly identified, is quite low at only 81% (4058 of the 5000 true pairs).
Let’s go back to the mapping calculation (mapping_from_clks) and reduce the value for threshold to 0.8.
[19]:
mapping = mapping_from_clks(clks_a, clks_b, 0.8)
describe_accuracy(mapping)
Found 4966 matches
We've got 4966 true positives, 0 false positives, and 34 false negatives.
Precision: 1.000, Recall: 0.993, Accuracy: 0.993
Great, for this threshold value we get a precision of 100% and a recall of 99.3%.
The explanation is that when the information about an entity differs slightly between the two datasets (e.g. spelling errors, abbreviations, missing values, …), the corresponding CLKs will differ in some number of bits as well. It is important to choose a threshold that is appropriate for the amount of perturbation present in the data (in this case, a threshold of 0.72 or below produces a perfect mapping without mistakes).
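One way to explore this trade-off is to sweep the threshold and compare the resulting matching quality, reusing the helpers defined above (a sketch; the threshold values are arbitrary):

# Try a range of thresholds and report the matching quality for each.
for threshold in (0.95, 0.9, 0.85, 0.8, 0.75):
    print('threshold =', threshold)
    mapping = mapping_from_clks(clks_a, clks_b, threshold)
    describe_accuracy(mapping)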
This concludes the tutorial. Feel free to go back to the CLK generation and experiment with how different settings affect the matching quality.