API Documentation¶
Bloom filter¶
Generate a Bloom filter.

class clkhash.bloomfilter.NgramEncodings[source]¶
Bases: enum.Enum

Lists the available schemes for encoding n-grams.
Note
The slightly awkward-looking construction, with the calls to partial and the overwrite of __call__, is due to compatibility issues with Python 2.7.
BLAKE_HASH = functools.partial(<function blake_encode_ngrams>)¶
Uses the BLAKE2 hash function, one of the fastest modern hash functions, and makes fewer hash function calls than the DOUBLE_HASH based schemes. It avoids one of the exploitable weaknesses of the DOUBLE_HASH scheme. See also blake_encode_ngrams().

DOUBLE_HASH = functools.partial(<function double_hash_encode_ngrams>)¶
The initial encoding scheme, as described in Schnell, R., Bachteler, T., & Reiher, J. (2011). A Novel Error-Tolerant Anonymous Linking Code. See also double_hash_encode_ngrams().

DOUBLE_HASH_NON_SINGULAR = functools.partial(<function double_hash_encode_ngrams_non_singular>)¶
Very similar to DOUBLE_HASH, but avoids singularities in the encoding. See also double_hash_encode_ngrams_non_singular().
clkhash.bloomfilter.blake_encode_ngrams(ngrams, key, k, l)[source]¶
Computes the encoding of the provided ngrams using the BLAKE2 hash function.
We deliberately do not use the double hashing scheme as proposed in [Schnell2011], because this would introduce an exploitable structure into the Bloom filter. For more details on the weakness, see [Kroll2015].
In short, the double hashing scheme only allows for \(l^2\) different encodings for any possible n-gram, whereas the use of \(k\) different independent hash functions gives you \(\sum_{j=1}^{k}{\binom{l}{j}}\) combinations.
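The size of this gap can be checked numerically. The snippet below is a small illustration, not part of the library; the values l = 1024 and k = 30 are the defaults used elsewhere in this module.

```python
from math import comb

l, k = 1024, 30  # filter length in bits, hash functions per n-gram

# Double hashing: every index g_i is determined by the pair
# (h1 mod l, h2 mod l), so at most l^2 distinct encodings exist.
double_hash_encodings = l ** 2

# k independent hash functions: any set of between 1 and k indices is
# reachable, giving sum_{j=1}^{k} C(l, j) possible encodings.
independent_encodings = sum(comb(l, j) for j in range(1, k + 1))

print(f"double hashing:      {double_hash_encodings:.3e}")
print(f"independent hashes:  {independent_encodings:.3e}")
```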
Our construction
It is advantageous to construct Bloom filters using a family of hash functions with the property of k-independence to compute the indices for an entry. This approach minimises the chance of collisions.
Informally, a family of hash functions is k-independent if selecting a function at random from the family guarantees that the hash codes of any k designated keys are independent random variables.
Our construction utilises the fact that the output bits of a cryptographic hash function are uniformly distributed, independent, binary random variables (or at least as close to that as possible; see [Kaminsky2011] for an analysis). Thus, slicing the output of a cryptographic hash function into k different slices gives you k independent random variables.
We chose Blake2 as the cryptographic hash function mainly for two reasons:
- it is fast.
- in keyed hashing mode, Blake2 provides MACs with just one hash function call instead of the two calls in the HMAC construction used in the double hashing scheme.
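The difference between the two constructions is visible directly in Python's standard library; the key and sample bigram below are made up for illustration.

```python
import hashlib
import hmac

key = b"secret key"
data = b" fr"  # an example bigram

# Keyed BLAKE2: the key is mixed in during hash initialisation,
# so computing a MAC costs a single hash computation.
mac_blake2 = hashlib.blake2b(data, key=key).hexdigest()

# HMAC (as used in the double hashing scheme) makes two passes
# of the underlying hash function per MAC.
mac_hmac = hmac.new(key, data, hashlib.sha1).hexdigest()
```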
Warning
Please be aware that, although this construction makes the attack of [Kroll2015] infeasible, it is most likely not enough to ensure security. Or in their own words:
However, we think that using independent hash functions alone will not be sufficient to ensure security, since in this case other approaches (maybe related to or at least inspired through work from the area of Frequent Itemset Mining) are promising to detect at least the most frequent atoms automatically.

Parameters:
- ngrams – list of n-grams to be encoded
- key – secret key for blake2 as bytes
- k – number of hash functions to use per element of the ngrams
- l – length of the output bitarray (has to be a power of 2)

Returns: bitarray of length l with the bits set which correspond to the encoding of the ngrams
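The slicing idea can be sketched as follows. This is a simplified toy version, not the library's implementation: it returns set bit positions as a Python set rather than a bitarray, and the real blake_encode_ngrams() may size and partition the digest differently.

```python
import hashlib

def blake_encode_sketch(ngrams, key, k, l):
    """Toy BLAKE2-based encoding: slice one keyed digest per n-gram
    into k index values. l must be a power of 2, so taking a slice
    modulo l keeps the indices uniformly distributed."""
    assert l > 0 and l & (l - 1) == 0, "l must be a power of 2"
    bits_per_index = l.bit_length() - 1        # log2(l)
    bytes_per_index = (bits_per_index + 7) // 8 or 1
    filter_bits = set()
    for ngram in ngrams:
        digest = hashlib.blake2b(ngram.encode(), key=key).digest()
        # blake2b yields 64 bytes; we need k slices of bytes_per_index
        assert k * bytes_per_index <= len(digest)
        for i in range(k):
            chunk = digest[i * bytes_per_index:(i + 1) * bytes_per_index]
            filter_bits.add(int.from_bytes(chunk, "big") % l)
    return filter_bits
```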
clkhash.bloomfilter.calculate_bloom_filters(dataset, schema, keys, xor_folds=0)[source]¶

Parameters:
- dataset – A list of indexable records.
- schema – An iterable of identifier types.
- keys – A tuple of two lists of secret keys used in the HMAC.
- xor_folds – number of XOR folds to perform
Returns: A list of 3-tuples, each containing the Bloom filter (bitarray), the record's first element (usually an index), and the bit count (int)
clkhash.bloomfilter.crypto_bloom_filter(record, tokenizers, keys, xor_folds=0, l=1024, k=30, ngram_encoding=NgramEncodings.DOUBLE_HASH)[source]¶
Makes a Bloom filter from a record with given tokenizers and lists of keys.

Using the method from http://www.record-linkage.de/-download=wp-grlc-2011-02.pdf
Parameters:
- record – plaintext record tuple, e.g. (index, name, dob, gender)
- tokenizers – A list of IdentifierType tokenizers (one for each record element)
- keys – tuple of tuple of keys for the hash functions as bytes
- xor_folds – number of XOR folds to perform
- l – length of the Bloom filter in number of bits
- k – number of hash functions to use per element
- ngram_encoding – the n-gram encoding scheme to use (see NgramEncodings)
Returns: 3-tuple containing the Bloom filter for the record (bitarray), the first element of the record (usually an index), and the number of bits set in the Bloom filter
clkhash.bloomfilter.double_hash_encode_ngrams(ngrams, keys, k, l)[source]¶
Computes the double hash encoding of the provided ngrams with the given keys.
Using the method from http://www.record-linkage.de/-download=wp-grlc-2011-02.pdf
Parameters:
- ngrams – list of n-grams to be encoded
- keys – hmac secret keys for md5 and sha1 as bytes
- k – number of hash functions to use per element of the ngrams
- l – length of the output bitarray
Returns: bitarray of length l with the bits set which correspond to the encoding of the ngrams
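A minimal sketch of the double hashing scheme, assuming HMAC-SHA1 and HMAC-MD5 as the two hash functions as in [Schnell2011]. This is an illustration only: it returns set bit positions as a Python set, whereas the real function returns a bitarray.

```python
import hashlib
import hmac

def double_hash_sketch(ngrams, keys, k, l):
    """Toy double hashing: g_i = (h1 + i*h2) mod l, with h1 and h2
    derived from HMAC-SHA1 and HMAC-MD5 over the n-gram."""
    key_sha1, key_md5 = keys
    filter_bits = set()
    for ngram in ngrams:
        data = ngram.encode()
        h1 = int(hmac.new(key_sha1, data, hashlib.sha1).hexdigest(), 16) % l
        h2 = int(hmac.new(key_md5, data, hashlib.md5).hexdigest(), 16) % l
        for i in range(k):
            filter_bits.add((h1 + i * h2) % l)
    return filter_bits
```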
clkhash.bloomfilter.double_hash_encode_ngrams_non_singular(ngrams, keys, k, l)[source]¶
Computes the double hash encoding of the provided n-grams with the given keys.

The original construction of [Schnell2011] displays an abnormality for certain inputs: an n-gram can be encoded into just one bit, irrespective of the value of k.
Their construction goes as follows: the \(k\) different indices \(g_i\) of the Bloom filter for an n-gram \(x\) are defined as:
\[g_{i}(x) = (h_1(x) + i\, h_2(x)) \mod l\]
with \(0 \leq i < k\), where \(l\) is the length of the Bloom filter. If the hash of \(x\) under the second hash function is a multiple of \(l\), then
\[h_2(x) \equiv 0 \mod l\]
and thus
\[g_i(x) = h_1(x) \mod l,\]
irrespective of the value of \(i\). A discussion of this potential flaw can be found here.
Parameters:
- ngrams – list of n-grams to be encoded
- keys – hmac secret keys for sha1 and md5 as bytes
- k – number of hash functions to use per element of the ngrams
- l – length of the output bitarray
Returns: bitarray of length l with the bits set which correspond to the encoding of the ngrams
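The singularity described above is easy to demonstrate with made-up hash values:

```python
# If h2(x) is a multiple of l, every index collapses onto h1(x):
l, k = 1024, 30
h1, h2 = 517, 2 * l          # h2 is congruent to 0 mod l
indices = {(h1 + i * h2) % l for i in range(k)}
print(indices)  # {517}: the n-gram sets a single bit, whatever k is
```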
clkhash.bloomfilter.fold_xor(bloomfilter, folds)[source]¶
Performs XOR folding on a Bloom filter.
If the length of the original Bloom filter is n and we perform r folds, then the length of the resulting filter is n / 2 ** r.
Parameters:
- bloomfilter – Bloom filter to fold
- folds – number of folds
Returns: folded Bloom filter
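XOR folding can be sketched on a plain list of bits (the library operates on a bitarray; this toy version uses 0/1 ints):

```python
def fold_xor_sketch(bits, folds):
    """Toy XOR folding: repeatedly XOR the second half of the
    filter onto the first half, halving the length each fold."""
    for _ in range(folds):
        if len(bits) % 2 != 0:
            raise ValueError("filter length must be even to fold")
        half = len(bits) // 2
        bits = [a ^ b for a, b in zip(bits[:half], bits[half:])]
    return bits

# A length-8 filter folded once becomes length 4:
folded = fold_xor_sketch([1, 0, 1, 1, 0, 0, 1, 0], 1)
print(folded)  # [1, 0, 0, 1]
```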
clkhash.bloomfilter.stream_bloom_filters(dataset, schema_types, keys, xor_folds=0)[source]¶
Yields Bloom filters.
Parameters:
- dataset – An iterable of indexable records.
- schema_types – An iterable of identifier type names.
- keys – A tuple of two lists of secret keys used in the HMAC.
- xor_folds – number of XOR folds to perform
Returns: Yields bloom filters as 3-tuples
CLK¶
Generate CLK from CSV file.

clkhash.clk.generate_clk_from_csv(input, keys, schema_types, no_header=False, progress_bar=True, xor_folds=0)[source]¶
clkhash.clk.hash_and_serialize_chunk(chunk_pii_data, schema_types, keys, xor_folds)[source]¶
Generate Bloom filters (i.e. hash) from chunks of PII, then serialize the generated Bloom filters. It also computes and outputs the Hamming weight (or popcount) – the number of bits set to one – of the generated Bloom filters.
Parameters:
- chunk_pii_data – An iterable of indexable records.
- schema_types – An iterable of identifier type names.
- keys – A tuple of two lists of secret keys used in the HMAC.
- xor_folds – Number of XOR folds to perform. Each fold halves the hash length.
Returns: A list of serialized Bloom filters and a list of corresponding popcounts
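The popcount of a filter is straightforward to compute over its raw bytes; a minimal sketch (the library counts bits on the bitarray directly, this is for illustration only):

```python
def popcount(data: bytes) -> int:
    """Hamming weight: number of bits set to one in the given bytes."""
    return sum(bin(byte).count("1") for byte in data)

popcount(b"\xff\x01")  # 8 + 1 = 9 bits set
```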
identifier types¶
Convert PII to tokens

IdentifierType¶

class clkhash.identifier_types.IdentifierType(unigram=False, weight=1, **kwargs)[source]¶
Bases: object

Base class used for all identifier types.
Required to provide a mapping of schema to hash type, uni-gram or bi-gram.
__init__(unigram=False, weight=1, **kwargs)[source]¶

Note
For each n-gram of an identifier, we compute k different indices in the Bloom filter which will be set to true. There is a global \(k_{default}\) value, and the k value for each identifier is computed as
\[k = weight \cdot k_{default},\]
rounded to the nearest integer.
Reasons why you might want to set weights:
- Long identifiers like street name will produce a lot more n-grams than small identifiers like zip code. Thus street name will flip more bits in the Bloom filter and will have a bigger influence in the overall matching score.
- The matching might produce better results if identifiers that are stable and / or have low error rates are given higher prominence in the Bloom filter.
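The weighting rule above amounts to a one-line computation. In this sketch, k_default = 30 is an assumption taken from the default k of crypto_bloom_filter() in this module:

```python
K_DEFAULT = 30  # assumed global default, matching crypto_bloom_filter's k

def k_for_identifier(weight: float, k_default: int = K_DEFAULT) -> int:
    """Number of Bloom filter indices per n-gram for an identifier."""
    return round(weight * k_default)

k_for_identifier(1.0)  # 30: default weight leaves k unchanged
k_for_identifier(0.5)  # 15: de-emphasise a long identifier
k_for_identifier(2.0)  # 60: emphasise a stable, low-error identifier
```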
__weakref__¶
list of weak references to the object (if defined)
key derivation¶

class clkhash.key_derivation.HKDFconfig(master_secret, salt=None, info=None, hash_algo='SHA256')[source]¶
Bases: object

supported_hash_algos = ('SHA256', 'SHA512')¶
clkhash.key_derivation.generate_key_lists(master_secrets, num_identifier, key_size=64, salt=None, info=None, kdf='HKDF')[source]¶
Generates a derived key for each identifier for each master secret using a key derivation function (KDF).

The only supported key derivation function for now is 'HKDF'.
The previous key usage can be reproduced by setting kdf to 'legacy'. This is highly discouraged, as this strategy maps the same n-grams in different identifiers to the same bits in the Bloom filter and thus does not lead to good results.
Parameters:
- master_secrets – a list of master secrets (either as bytes or strings)
- num_identifier – the number of identifiers
- key_size – the size of the derived keys
- salt – salt for the KDF as bytes
- info – optional context and application specific information as bytes
- kdf – the key derivation function algorithm to use
Returns: The derived keys. First dimension is of size num_identifier, second dimension is the same as master_secrets. A key is represented as bytes.
clkhash.key_derivation.hkdf(hkdf_config, num_keys, key_size=64)[source]¶
Executes the HKDF key derivation function as described in rfc5869 to derive num_keys keys of size key_size from the master_secret.
Parameters:
- hkdf_config – an HKDFconfig object containing the configuration for the HKDF.
- num_keys – the number of keys the kdf should produce
- key_size – the size of the produced keys
Returns: Derived keys
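The extract-and-expand steps of RFC 5869 can be sketched with the standard library alone. This is a self-contained HKDF-SHA256, not the library's wrapper (which takes an HKDFconfig object and dispatches on hash_algo):

```python
import hashlib
import hmac

def hkdf_sha256(master_secret: bytes, num_keys: int, key_size: int = 64,
                salt: bytes = b"", info: bytes = b"") -> list:
    """Minimal HKDF (RFC 5869) with SHA-256: extract a pseudorandom
    key, expand it to num_keys * key_size bytes, slice into keys."""
    hash_len = hashlib.sha256().digest_size
    # Extract: PRK = HMAC-Hash(salt, IKM); empty salt defaults to zeros.
    prk = hmac.new(salt or b"\x00" * hash_len, master_secret,
                   hashlib.sha256).digest()
    # Expand: T(i) = HMAC-Hash(PRK, T(i-1) | info | i)
    length = num_keys * key_size
    okm, t, i = b"", b"", 1
    while len(okm) < length:
        t = hmac.new(prk, t + info + bytes([i]), hashlib.sha256).digest()
        okm += t
        i += 1
    return [okm[j * key_size:(j + 1) * key_size] for j in range(num_keys)]
```

The implementation reproduces the RFC 5869 test vectors, e.g. test case 1 with a 22-byte master secret derives the documented 42-byte output.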
random names¶
Module to produce a dataset of names, genders and dates of birth, and to manipulate that list.

Currently very simple and not realistic. Additional functions manipulate the list of names, producing reordered and subset lists with a specific overlap.

ClassList class - generate a list of length n of [id, name, dob, gender] lists.

TODO: Get age distribution right by using a mortality table.
TODO: Get first name distributions right by using distributions.
TODO: Generate realistic errors.
TODO: Add RESTful API to generate reasonable name data as requested.
class clkhash.randomnames.NameList(n)[source]¶
Bases: object

List of randomly generated names.
generate_random_person(n)[source]¶
Generator that yields details on a person with plausible name, sex and age.

Yields: Generated data for one person as a tuple - (id: int, name: str('First Last'), birthdate: str('DD/MM/YYYY'), sex: str('M' | 'F'))
generate_subsets(sz, overlap=0.8)[source]¶
Return a pair of random subsets of the name list with a specified proportion of elements in common.
Parameters:
- sz – length of subsets to generate
- overlap – fraction of the subsets that should have the same names in them
Raises: ValueError – if there aren't sufficiently many names in the list to satisfy the request; more precisely, raises if sz + (1 - overlap)*sz > n = len(self.names)
Returns: pair of subsets
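One way such subsets can be built is sketched below. This is an illustrative stand-alone version, not the library's implementation: draw sz + (1 - overlap)*sz distinct elements, share the first overlap*sz between both subsets, and split the remainder.

```python
import random

def random_subsets_sketch(names, sz, overlap=0.8):
    """Toy version: two size-sz subsets sharing about overlap*sz
    elements, drawn without replacement from names."""
    n_shared = int(overlap * sz)
    n_distinct = sz - n_shared
    if sz + n_distinct > len(names):
        raise ValueError("not enough names for the requested overlap")
    drawn = random.sample(names, sz + n_distinct)
    shared = drawn[:n_shared]
    subset_a = shared + drawn[n_shared:sz]   # shared + own distinct part
    subset_b = shared + drawn[sz:]           # shared + other distinct part
    return subset_a, subset_b
```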
load_names()[source]¶
This function loads a name database into the globals firstNames and lastNames.

The initial version uses data files from http://www.quietaffiliate.com/free-first-name-and-last-name-databases-csv-and-sql/
schema = [{'identifier': 'INDEX'}, {'identifier': 'NAME freetext'}, {'identifier': 'DOB YYYY/MM/DD'}, {'identifier': 'GENDER M or F'}]¶

schema_types¶
clkhash.randomnames.load_csv_data(resource_name)[source]¶
Loads a specified CSV data file and returns the first column as a Python list.
schema¶

tokenizer¶
Functions to tokenize words (PII)

clkhash.tokenizer.bigramlist(word, toremove=None)[source]¶
Make bigrams from word with prepended and appended spaces.

s -> [' ' + s0, s0 + s1, s1 + s2, ..., sN + ' ']
Parameters:
- word – string to make bigrams from
- toremove – list of strings to remove before construction
Returns: list of bigrams as strings
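The padding-and-slicing scheme can be sketched as follows (an illustration without the toremove handling of the real function):

```python
def bigrams_sketch(word):
    """Bigrams of a word padded with one space on each side,
    e.g. 'fred' -> [' f', 'fr', 're', 'ed', 'd ']."""
    padded = " " + word + " "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

bigrams_sketch("fred")  # [' f', 'fr', 're', 'ed', 'd ']
```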