Tutorial for Python API
For this tutorial we are going to process a data set for private linkage with clkhash
using the Python API.
The Python package recordlinkage
has a tutorial linking data sets in the clear; we will try to duplicate that in a privacy-preserving setting.
First install the dependencies we will need:
[ ]:
# NBVAL_IGNORE_OUTPUT
!pip install -U clkhash anonlink recordlinkage pandas
[1]:
# NBVAL_IGNORE_OUTPUT
import io
import itertools
import pandas as pd
[2]:
import clkhash
from clkhash import clk
from clkhash.field_formats import *
from clkhash.schema import Schema
from clkhash.comparators import NgramComparison
from clkhash.serialization import serialize_bitarray
[3]:
from recordlinkage.datasets import load_febrl4
Data Exploration
First load the dataset, and preview the first few rows.
[4]:
dfA, dfB = load_febrl4()
dfA.head()
[4]:
rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
---|---|---|---|---|---|---|---|---|---|---|
rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
rec-1288-org | vanessa | parr | 905 | macquoid place | broadbridge manor | south grafton | 2135 | sa | 19951119 | 9239102 |
rec-3585-org | mikayla | malloney | 37 | randwick road | avalind | hoppers crossing | 4552 | vic | 19860208 | 7207688 |
For this linkage we will not use the social security id column.
[5]:
dfA.columns
[5]:
Index(['given_name', 'surname', 'street_number', 'address_1', 'address_2',
'suburb', 'postcode', 'state', 'date_of_birth', 'soc_sec_id'],
dtype='object')
In this tutorial we will use StringIO buffers instead of files. Let’s dump the data from the pandas DataFrame into a CSV buffer:
[6]:
a_csv = io.StringIO()
dfA.to_csv(a_csv)
Linkage Schema Definition
A hashing schema instructs clkhash how to treat each feature when encoding a CLK.
The linkage schema below details a 1024 bit encoding using equally weighted features. Most features are encoded using bigrams, although the street number, postcode and date of birth use unigrams. The schema specifies that the columns 'rec_id' and 'soc_sec_id' are ignored.
A detailed description of the linkage schema can be found in the documentation.
[7]:
fields = [
    Ignore('rec_id'),
    StringSpec('given_name', FieldHashingProperties(comparator=NgramComparison(2), strategy=BitsPerFeatureStrategy(300))),
    StringSpec('surname', FieldHashingProperties(comparator=NgramComparison(2), strategy=BitsPerFeatureStrategy(300))),
    IntegerSpec('street_number', FieldHashingProperties(comparator=NgramComparison(1, True), strategy=BitsPerFeatureStrategy(300), missing_value=MissingValueSpec(sentinel=''))),
    StringSpec('address_1', FieldHashingProperties(comparator=NgramComparison(2), strategy=BitsPerFeatureStrategy(300))),
    StringSpec('address_2', FieldHashingProperties(comparator=NgramComparison(2), strategy=BitsPerFeatureStrategy(300))),
    StringSpec('suburb', FieldHashingProperties(comparator=NgramComparison(2), strategy=BitsPerFeatureStrategy(300))),
    IntegerSpec('postcode', FieldHashingProperties(comparator=NgramComparison(1, True), strategy=BitsPerFeatureStrategy(300))),
    StringSpec('state', FieldHashingProperties(comparator=NgramComparison(2), strategy=BitsPerFeatureStrategy(300))),
    IntegerSpec('date_of_birth', FieldHashingProperties(comparator=NgramComparison(1, True), strategy=BitsPerFeatureStrategy(300), missing_value=MissingValueSpec(sentinel=''))),
    Ignore('soc_sec_id')
]
schema = Schema(fields, 1024)
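To make the bigram and unigram distinction concrete, here is a small sketch of n-gram tokenisation. It is an illustration only and not clkhash's own tokenizer: the function name ngrams, the space padding and the "position token" format are assumptions, as is reading the second argument of NgramComparison(1, True) as a positional flag.

```python
def ngrams(value, n, positional=False):
    """Split a string into overlapping n-grams (illustrative sketch only)."""
    # Assumption: pad with n-1 spaces on each side so that the first and last
    # characters take part in as many tokens as the inner characters.
    padded = ' ' * (n - 1) + value + ' ' * (n - 1)
    tokens = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    if positional:
        # Prefix each token with its 1-based position, so that the same digit
        # in a different place yields a different token.
        tokens = ['{} {}'.format(i, t) for i, t in enumerate(tokens, start=1)]
    return tokens


print(ngrams('neumann', 2))                # bigrams of a surname
print(ngrams('4223', 1, positional=True))  # positional unigrams of a postcode
```

Positional unigrams suit numeric fields such as postcodes, where the place of a digit matters as much as the digit itself.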
Encode the data
We can now encode our PII data from the CSV file using our defined schema. We must provide a secret to this command, and both parties hashing data have to use the same secret. For this toy example we will use the secret "secret". For real data, make sure that the secret contains enough entropy, as knowledge of this secret is sufficient to reconstruct the PII from a CLK!
Also, do not share this secret with anyone except the other participating party.
[8]:
secret = 'secret'
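For real data, one way to obtain a secret with enough entropy is Python's standard secrets module. This is a suggestion rather than something the tutorial prescribes, and the variable name strong_secret is only illustrative.

```python
import secrets

# 32 random bytes (256 bits) from the operating system's cryptographically
# secure random source, encoded as URL-safe text so it can be handled as a string.
strong_secret = secrets.token_urlsafe(32)

print(len(strong_secret))  # length of the text form, not the entropy in bits
```

The generated value would then have to be exchanged with the other party over a secure channel, since both sides must encode with the same secret.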
[9]:
a_csv.seek(0)
hashed_data_a = clk.generate_clk_from_csv(a_csv, secret, schema)
generating CLKs: 100%|██████████| 5.00k/5.00k [00:03<00:00, 1.39kclk/s, mean=944, std=14.4]
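To see why both parties must use the same secret, consider a toy keyed Bloom filter encoder. It is a sketch of the general idea only: the name toy_encode, the 64 bit length and the HMAC-SHA256 construction are assumptions made for illustration and do not reproduce clkhash's actual hashing scheme.

```python
import hashlib
import hmac


def toy_encode(tokens, secret, num_bits=64, hashes_per_token=2):
    """Set `hashes_per_token` secret-dependent bit positions for every token."""
    bits = 0
    for token in tokens:
        for i in range(hashes_per_token):
            # The HMAC ties every bit position to the shared secret.
            message = '{}:{}'.format(i, token).encode('utf-8')
            digest = hmac.new(secret.encode('utf-8'), message, hashlib.sha256).digest()
            position = int.from_bytes(digest[:8], 'big') % num_bits
            bits |= 1 << position
    return bits


tokens = [' n', 'ne', 'eu', 'um', 'ma', 'an', 'nn', 'n ']
encoding_1 = toy_encode(tokens, 'secret')
encoding_2 = toy_encode(tokens, 'secret')

print(encoding_1 == encoding_2)     # identical input and secret give identical bits
print(bin(encoding_1).count('1'))   # popcount: number of bits that are set
```

With a different secret the same tokens would, with overwhelming probability, land on different bit positions, so encodings produced under different secrets cannot be compared meaningfully.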
Inspect the output
clkhash has encoded the PII, creating a Cryptographic Longterm Key for each entity. The output of generate_clk_from_csv shows that the mean popcount is quite high: on average, more than 900 of the 1024 bits are set (mean=944), which can affect accuracy.
We can control the popcount by adjusting the strategy. There are currently two different strategies implemented in the library:
BitsPerToken: each token of a feature’s value is inserted into the encoding bits_per_token times. Increasing bits_per_token will give the corresponding feature more importance in comparisons; decreasing bits_per_token will de-emphasise columns which are less suitable for linkage (e.g. information that changes frequently). The BitsPerToken strategy is set with the strategy=BitsPerTokenStrategy(bits_per_token=30) argument for a feature’s FieldHashingProperties.
BitsPerFeature: in this strategy we always insert a fixed number of bits into the CLK for a feature, irrespective of the number of tokens. This strategy is set with the strategy=BitsPerFeatureStrategy(bits_per_feature=100) argument for a feature’s FieldHashingProperties.
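The difference between the two strategies can be sketched by counting how many insertions each token receives. The helper names below are hypothetical, and the even split of the per-feature budget is an assumption of this sketch; clkhash may distribute the insertions differently.

```python
def insertions_bits_per_token(tokens, bits_per_token):
    """Every token is inserted the same number of times."""
    return [bits_per_token] * len(tokens)


def insertions_bits_per_feature(tokens, bits_per_feature):
    """A fixed budget is split across the tokens as evenly as possible."""
    base, extra = divmod(bits_per_feature, len(tokens))
    # The first `extra` tokens receive one more insertion (arbitrary choice here).
    return [base + (1 if i < extra else 0) for i in range(len(tokens))]


short_value = [' a', 'al', 'l ']  # bigrams of "al"
long_value = [' a', 'al', 'le', 'ex', 'xa', 'an', 'nd', 'de', 'er', 'r ']  # "alexander"

for tokens in (short_value, long_value):
    per_token = insertions_bits_per_token(tokens, 30)
    per_feature = insertions_bits_per_feature(tokens, 100)
    print(len(tokens), sum(per_token), sum(per_feature))
```

Under the per-token strategy the longer value contributes more insertions in total, while under the per-feature strategy both values contribute the same total. Several insertions can hit the same bit, so the number of bits actually set may be lower than these totals.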
In this example, we will reduce the value of bits_per_feature for all columns: to 200 for the name and date of birth columns, and to 100 for the address-related columns.
[10]:
fields = [
    Ignore('rec_id'),
    StringSpec('given_name', FieldHashingProperties(comparator=NgramComparison(2), strategy=BitsPerFeatureStrategy(200))),
    StringSpec('surname', FieldHashingProperties(comparator=NgramComparison(2), strategy=BitsPerFeatureStrategy(200))),
    IntegerSpec('street_number', FieldHashingProperties(comparator=NgramComparison(1, True), strategy=BitsPerFeatureStrategy(100), missing_value=MissingValueSpec(sentinel=''))),
    StringSpec('address_1', FieldHashingProperties(comparator=NgramComparison(2), strategy=BitsPerFeatureStrategy(100))),
    StringSpec('address_2', FieldHashingProperties(comparator=NgramComparison(2), strategy=BitsPerFeatureStrategy(100))),
    StringSpec('suburb', FieldHashingProperties(comparator=NgramComparison(2), strategy=BitsPerFeatureStrategy(100))),
    IntegerSpec('postcode', FieldHashingProperties(comparator=NgramComparison(1, True), strategy=BitsPerFeatureStrategy(100))),
    StringSpec('state', FieldHashingProperties(comparator=NgramComparison(2), strategy=BitsPerFeatureStrategy(100))),
    IntegerSpec('date_of_birth', FieldHashingProperties(comparator=NgramComparison(1, True), strategy=BitsPerFeatureStrategy(200), missing_value=MissingValueSpec(sentinel=''))),
    Ignore('soc_sec_id')
]
schema = Schema(fields, 1024)
a_csv.seek(0)
clks_a = clk.generate_clk_from_csv(a_csv, secret, schema)
generating CLKs: 100%|██████████| 5.00k/5.00k [00:02<00:00, 2.20kclk/s, mean=696, std=22.7]
Each CLK is represented by a bitarray but can be serialized in a compact, JSON friendly base64 format:
[11]:
print("original:")
print(clks_a[0])
print("serialized:")
print(serialize_bitarray(clks_a[0]))
original:
bitarray('1111111100101100001100011011110111100111001111111000111110010100011101111111111110111000110111111110111101011111111001011111011110111011101111001101011101100111101110001101101101010011001100110011010111110011010100101010111011111100101000111111101101111011100011100111110011110110110011110001010101101011011111111011011111110101100110010101111101111111101110001111110111111101010111100101110111100110111110100100110001100010110110111101101111011010111111110011110100101010111111110111011111100110111011111100001011111100011110000101010111111011101111011110110110001000100111111111111011101111101100111110111111011011001111100011111110111110100101101001000100011110101001000010101001110110111111111001111111111111010101011001110110101010110101100110110111000111111110111111000010111111000111110011111000100101111111111011111001111100011001101000110010111110111010001111111101110100101110001111001011111011111111011010110011011011001011010101011111111011011111110101111001101111010101111111011101111010001101110011101110111101')
serialized:
/ywxvec/j5R3/7jf71/l97u812e421MzNfNSrvyj+3uOfPbPFWt/t/WZX3+4/f1eXeb6TGLb29r/PSr/d+bvwvx4Vfu97Yif/u+z79s+P76WkR6kKnb/n/9VnarWbcf78L8fPiX/vnxmjL7o/3S48vv9rNstV/t/Xm9X93o3O70=
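The relationship between the two representations can be checked by hand with the standard library. The sketch below packs a bit string into bytes and base64-encodes it; the helper names are hypothetical, and the code assumes the number of bits is a multiple of 8. The first 24 bits of the CLK printed above do yield the first four characters of its serialized form.

```python
import base64


def bits_to_base64(bits):
    """Pack a string of '0'/'1' characters into bytes and base64-encode them."""
    raw = int(bits, 2).to_bytes(len(bits) // 8, 'big')
    return base64.b64encode(raw).decode('ascii')


def base64_to_bits(text):
    """Decode base64 text back into a string of '0'/'1' characters."""
    raw = base64.b64decode(text)
    return ''.join('{:08b}'.format(byte) for byte in raw)


first_24_bits = '111111110010110000110001'  # start of the CLK printed above
print(bits_to_base64(first_24_bits))         # /ywx
print(base64_to_bits('/ywx') == first_24_bits)
print(len(bits_to_base64('0' * 1024)))       # characters needed for 1024 bits
```

A 1024 bit CLK occupies 128 bytes, which base64 turns into 172 characters.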
Hash data set B
Now we hash the second dataset using the same secret and the same schema.
[12]:
b_csv = io.StringIO()
dfB.to_csv(b_csv)
b_csv.seek(0)
clks_b = clkhash.clk.generate_clk_from_csv(b_csv, secret, schema)
generating CLKs: 100%|██████████| 5.00k/5.00k [00:01<00:00, 2.58kclk/s, mean=687, std=30.4]
[13]:
len(clks_b)
[13]:
5000
Find matches between the two sets of CLKs
We have generated two sets of CLKs which represent entity information in a privacy-preserving way. The more similar two CLKs are, the more likely it is that they represent the same entity.
For this task we will use anonlink, a Python (and optimised C++) implementation of anonymous linkage using CLKs.
Using anonlink we find the candidate pairs, that is, all pairs whose similarity is above the given threshold. Then we solve for the most likely mapping.
[14]:
import anonlink
def mapping_from_clks(clks_a, clks_b, threshold):
    results_candidate_pairs = anonlink.candidate_generation.find_candidate_pairs(
        [clks_a, clks_b],
        anonlink.similarities.dice_coefficient,
        threshold
    )
    solution = anonlink.solving.greedy_solve(results_candidate_pairs)
    print('Found {} matches'.format(len(solution)))
    # each entry in `solution` looks like this: '((0, 4039), (1, 2689))'.
    # The format is ((dataset_id, row_id), (dataset_id, row_id))
    # As we only have two parties in this example, we can remove the dataset_ids.
    # Also, turning the solution into a set will make it easier to assess the
    # quality of the matching.
    return set((a, b) for ((_, a), (_, b)) in solution)
[15]:
found_matches = mapping_from_clks(clks_a, clks_b, 0.9)
Found 4049 matches
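The two steps, candidate generation with a Dice coefficient threshold followed by greedy solving, can be sketched in plain Python on tiny bit vectors. This is a simplified illustration under stated assumptions, not anonlink's implementation: the names dice and toy_greedy_match are hypothetical, bit vectors are stored as Python ints, and only two datasets are handled.

```python
def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count('1')


def dice(a, b):
    """Dice coefficient 2*|a AND b| / (|a| + |b|) of two bit vectors."""
    total = popcount(a) + popcount(b)
    return 2 * popcount(a & b) / total if total else 0.0


def toy_greedy_match(rows_a, rows_b, threshold):
    # Candidate pairs: every (i, j) whose similarity reaches the threshold.
    candidates = [(dice(x, y), i, j)
                  for i, x in enumerate(rows_a)
                  for j, y in enumerate(rows_b)
                  if dice(x, y) >= threshold]
    # Accept the most similar pairs first; each row is matched at most once.
    matches, used_a, used_b = set(), set(), set()
    for _, i, j in sorted(candidates, reverse=True):
        if i not in used_a and j not in used_b:
            matches.add((i, j))
            used_a.add(i)
            used_b.add(j)
    return matches


rows_a = [0b11110000, 0b00001111]
rows_b = [0b00011111, 0b11110001]
print(dice(rows_a[0], rows_b[1]))
print(toy_greedy_match(rows_a, rows_b, 0.8))
print(toy_greedy_match(rows_a, rows_b, 0.9))
```

In this toy data both true pairs have a similarity of 8/9, so they are found at a threshold of 0.8 but missed at 0.9.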
Evaluate matching quality
Let’s investigate some of those matches and the overall matching quality.
Fortunately, the febrl4 datasets contain record ids which tell us the correct linkages. Using this information we are able to create a set of the true matches.
[16]:
# rec_id in dfA has the form 'rec-1070-org'. We only want the number. Additionally, as we are
# interested in the position of the records, we create a new index which contains the row numbers.
dfA_ = dfA.rename(lambda x: x[4:-4], axis='index').reset_index()
dfB_ = dfB.rename(lambda x: x[4:-6], axis='index').reset_index()
# now we can merge dfA_ and dfB_ on the record_id.
a = pd.DataFrame({'ida': dfA_.index, 'rec_id': dfA_['rec_id']})
b = pd.DataFrame({'idb': dfB_.index, 'rec_id': dfB_['rec_id']})
dfj = a.merge(b, on='rec_id', how='inner').drop(columns=['rec_id'])
# and build a set of the corresponding row numbers.
true_matches = set((row[0], row[1]) for row in dfj.itertuples(index=False))
[17]:
def describe_matching_quality(found_matches, show_examples=False):
    if show_examples:
        print('idx_a, idx_b, rec_id_a, rec_id_b')
        print('---------------------------------------------')
        for a_i, b_i in itertools.islice(found_matches, 10):
            print('{:4d}, {:5d}, {:>11}, {:>14}'.format(a_i+1, b_i+1, a.iloc[a_i]['rec_id'], b.iloc[b_i]['rec_id']))
        print('---------------------------------------------')
    tp = len(found_matches & true_matches)
    fp = len(found_matches - true_matches)
    fn = len(true_matches - found_matches)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print('Precision: {:.3f}, Recall: {:.3f}'.format(precision, recall))
[18]:
describe_matching_quality(found_matches, show_examples=True)
idx_a, idx_b, rec_id_a, rec_id_b
---------------------------------------------
3170, 259, 3730, 3730
1685, 3323, 2888, 2888
733, 2003, 4239, 4239
4550, 3627, 4216, 4216
1875, 2991, 4391, 4391
3928, 2377, 3493, 3493
4928, 4656, 276, 276
334, 945, 4848, 4848
2288, 4331, 3491, 3491
4088, 2454, 1850, 1850
---------------------------------------------
Precision: 1.000, Recall: 0.810
Precision tells us how many of the found matches are actual matches. The score of 1.0 means that we did perfectly in this respect. However, recall, the measure of how many of the actual matches were correctly identified, is quite low at only 81%.
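A worked example on made-up data shows how the two measures react differently. The sets below are invented for illustration, and precision_recall is a hypothetical helper that mirrors the calculation above while guarding against empty inputs.

```python
def precision_recall(found, truth):
    tp = len(found & truth)   # found and correct
    fp = len(found - truth)   # found but wrong
    fn = len(truth - found)   # correct but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


truth = {(0, 3), (1, 1), (2, 0), (3, 2), (4, 4)}
cautious = {(0, 3), (1, 1), (2, 0), (3, 2)}  # nothing wrong, one pair missed
lenient = truth | {(5, 6)}                   # nothing missed, one spurious pair

print(precision_recall(cautious, truth))  # (1.0, 0.8)
print(precision_recall(lenient, truth))
```

The cautious result resembles the situation above: every reported pair is right, but some true pairs are not reported.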
Let’s go back to the mapping calculation (mapping_from_clks) and reduce the value for threshold to 0.8.
[19]:
found_matches = mapping_from_clks(clks_a, clks_b, 0.8)
describe_matching_quality(found_matches)
Found 4962 matches
Precision: 1.000, Recall: 0.992
Great, for this threshold value we get a precision of 100% and a recall of 99.2%.
The explanation is that when the information about an entity differs slightly in the two datasets (e.g. spelling errors, abbreviations, missing values, …), the corresponding CLKs will differ in some number of bits as well. It is important to choose a threshold appropriate for the amount of perturbation present in the data (a threshold of 0.72 and below generates an almost perfect mapping with few mistakes).
This concludes the tutorial. Feel free to go back to the CLK generation and experiment with how different settings affect the matching quality.