API Documentation

Bloom filter

Generate a Bloom filter

class clkhash.bloomfilter.NgramEncodings[source]

Bases: enum.Enum

Lists the available schemes for encoding n-grams.

Note

The slightly awkward-looking construction, with the calls to partial and the overwriting of __call__, is due to compatibility issues with Python 2.7.

BLAKE_HASH = functools.partial(<function blake_encode_ngrams>)

uses the BLAKE2 hash function, which is one of the fastest modern hash functions and makes fewer hash function calls than the DOUBLE_HASH based schemes. It also avoids one of the exploitable weaknesses of the DOUBLE_HASH scheme. Also see blake_encode_ngrams()

DOUBLE_HASH = functools.partial(<function double_hash_encode_ngrams>)

the initial encoding scheme as described in Schnell, R., Bachteler, T., & Reiher, J. (2011). A Novel Error-Tolerant Anonymous Linking Code. Also see double_hash_encode_ngrams()

DOUBLE_HASH_NON_SINGULAR = functools.partial(<function double_hash_encode_ngrams_non_singular>)

very similar to DOUBLE_HASH, but avoids singularities in the encoding. Also see double_hash_encode_ngrams_non_singular()

clkhash.bloomfilter.blake_encode_ngrams(ngrams, key, k, l)[source]

Computes the encoding of the provided ngrams using the BLAKE2 hash function.

We deliberately do not use the double hashing scheme as proposed in [Schnell2011], because this would introduce an exploitable structure into the Bloom filter. For more details on the weakness, see [Kroll2015].

In short, the double hashing scheme only allows for \(l^2\) different encodings for any possible n-gram, whereas the use of \(k\) different independent hash functions gives \(\sum_{j=1}^{k}{\binom{l}{j}}\) combinations. For example, with \(l = 1024\) and \(k = 30\), double hashing permits at most \(1024^2 \approx 10^6\) distinct encodings, while the term \(\binom{1024}{30}\) alone already exceeds \(10^{57}\).

Our construction

It is advantageous to construct Bloom filters using a family of hash functions with the property of k-independence to compute the indices for an entry. This approach minimises the chance of collisions.

Informally, a family of hash functions is k-independent if selecting a function at random from the family guarantees that the hash codes of any k designated keys are independent random variables.

Our construction utilises the fact that the output bits of a cryptographic hash function are uniformly distributed, independent, binary random variables (or at least as close to that as possible; see [Kaminsky2011] for an analysis). Thus, slicing the output of a cryptographic hash function into k different slices gives you k independent random variables; a short sketch follows the list below.

We chose BLAKE2 as the cryptographic hash function mainly for two reasons:

  • it is fast.
  • in keyed hashing mode, BLAKE2 provides MACs with just one hash function call, instead of the two calls in the HMAC construction used in the double hashing scheme.
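
To make the slicing idea concrete, here is a minimal, hypothetical sketch (not the clkhash implementation) of deriving k indices from a single keyed BLAKE2b call; it assumes l is a power of 2 no larger than \(2^{16}\) and k ≤ 32:

    import hashlib

    def blake_indices(ngram, key, k, l):
        # One keyed BLAKE2b call yields a 64-byte digest; slice it into k
        # two-byte chunks and reduce each chunk modulo l. As l is a power
        # of 2, the reduction keeps the low log2(l) bits and stays uniform.
        assert 2 * k <= hashlib.blake2b().digest_size, 'digest too short for k slices'
        digest = hashlib.blake2b(ngram.encode(), key=key).digest()
        return [int.from_bytes(digest[2 * i:2 * i + 2], 'big') % l
                for i in range(k)]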

Warning

Please be aware that, although this construction makes the attack of [Kroll2015] infeasible, it is most likely not enough to ensure security. Or in their own words:

However, we think that using independent hash functions alone will not be sufficient to ensure security, since in this case other approaches (maybe related to or at least inspired through work from the area of Frequent Itemset Mining) are promising to detect at least the most frequent atoms automatically.
Parameters:
  • ngrams – list of n-grams to be encoded
  • key – secret key for blake2 as bytes
  • k – number of hash functions to use per element of the ngrams
  • l – length of the output bitarray (has to be a power of 2)
Returns:

bitarray of length l with the bits set which correspond to the encoding of the ngrams
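
A brief usage sketch, assuming the signature above; the n-grams and parameter values are illustrative only:

    from clkhash.bloomfilter import blake_encode_ngrams

    bf = blake_encode_ngrams(['he', 'el', 'll', 'lo'], key=b'secret', k=30, l=1024)
    print(len(bf), bf.count())  # 1024 bits total, with the encoding's bits set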

clkhash.bloomfilter.calculate_bloom_filters(dataset, schema, keys, xor_folds=0)[source]
Parameters:
  • dataset – A list of indexable records.
  • schema – An iterable of identifier types.
  • keys – A tuple of two lists of secret keys used in the HMAC.
  • xor_folds – number of XOR folds to perform
Returns:

List of Bloom filters as 3-tuples, each containing: the Bloom filter (bitarray), the first element of the record (usually an index), and the bit count (int)

clkhash.bloomfilter.crypto_bloom_filter(record, tokenizers, keys, xor_folds=0, l=1024, k=30, ngram_encoding=<NgramEncodings.DOUBLE_HASH: functools.partial(<function double_hash_encode_ngrams>)>)[source]

Makes a Bloom filter from a record with given tokenizers and lists of keys.

Using the method from http://www.record-linkage.de/-download=wp-grlc-2011-02.pdf

Parameters:
  • record – plaintext record tuple. E.g. (index, name, dob, gender)
  • tokenizers – A list of IdentifierType tokenizers (one for each record element)
  • keys – tuple of tuples of keys for the hash functions, as bytes
  • xor_folds – number of XOR folds to perform
  • l – length of the Bloom filter in number of bits
  • k – number of hash functions to use per element
  • ngram_encoding – the n-gram encoding scheme to use (see NgramEncodings)
Returns:

A 3-tuple containing:
  • the Bloom filter for the record as a bitarray
  • the first element of the record (usually an index)
  • the number of bits set in the Bloom filter

clkhash.bloomfilter.double_hash_encode_ngrams(ngrams, keys, k, l)[source]

Computes the double hash encoding of the provided ngrams with the given keys.

Using the method from http://www.record-linkage.de/-download=wp-grlc-2011-02.pdf

Parameters:
  • ngrams – list of n-grams to be encoded
  • keys – HMAC secret keys for MD5 and SHA1, as bytes
  • k – number of hash functions to use per element of the ngrams
  • l – length of the output bitarray
Returns:

bitarray of length l with the bits set which correspond to the encoding of the ngrams
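
For illustration, a sketch of this scheme under the assumption, matching the parameter description above, that the two hash functions are HMAC-SHA1 and HMAC-MD5 with separate secret keys:

    import hashlib
    import hmac

    def double_hash_indices(ngram, key_sha1, key_md5, k, l):
        # g_i(x) = (h1(x) + i * h2(x)) mod l, for 0 <= i < k
        h1 = int(hmac.new(key_sha1, ngram.encode(), hashlib.sha1).hexdigest(), 16)
        h2 = int(hmac.new(key_md5, ngram.encode(), hashlib.md5).hexdigest(), 16)
        return [(h1 + i * h2) % l for i in range(k)]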

clkhash.bloomfilter.double_hash_encode_ngrams_non_singular(ngrams, keys, k, l)[source]

Computes the double hash encoding of the provided n-grams with the given keys.

The original construction of [Schnell2011] displays an abnormality for certain inputs:
An n-gram can be encoded into just one bit, irrespective of the number of hash functions k.

Their construction goes as follows: the \(k\) different indices \(g_i\) of the Bloom filter for an n-gram \(x\) are defined as:

\[g_{i}(x) = (h_1(x) + i h_2(x)) \mod l\]

with \(0 \leq i < k\), where \(l\) is the length of the Bloom filter. If the hash of \(x\) under the second hash function is a multiple of \(l\), then

\[h_2(x) \equiv 0 \mod l\]

and thus

\[g_i(x) = h_1(x) \mod l,\]

irrespective of the value of \(i\). A discussion of this potential flaw can be found here.

Parameters:
  • ngrams – list of n-grams to be encoded
  • keys – HMAC secret keys for SHA1 and MD5, as bytes
  • k – number of hash functions to use per element of the ngrams
  • l – length of the output bitarray
Returns:

bitarray of length l with the bits set which correspond to the encoding of the ngrams
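
One way to avoid the singularity is to re-hash a perturbed input whenever \(h_2(x)\) is a multiple of \(l\). The sketch below illustrates the idea; the perturbation strategy is an assumption, not necessarily what clkhash does:

    import hashlib
    import hmac

    def double_hash_indices_non_singular(ngram, key_sha1, key_md5, k, l):
        data = ngram.encode()
        h1 = int(hmac.new(key_sha1, data, hashlib.sha1).hexdigest(), 16)
        h2 = int(hmac.new(key_md5, data, hashlib.md5).hexdigest(), 16)
        while h2 % l == 0:
            data += b'\x00'  # perturb the input and recompute the second hash
            h2 = int(hmac.new(key_md5, data, hashlib.md5).hexdigest(), 16)
        return [(h1 + i * h2) % l for i in range(k)]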

clkhash.bloomfilter.fold_xor(bloomfilter, folds)[source]

Performs XOR folding on a Bloom filter.

If the length of the original Bloom filter is n and we perform r folds, then the length of the resulting filter is n / 2 ** r.

Parameters:
  • bloomfilter – Bloom filter to fold
  • folds – number of folds
Returns:

folded bloom filter
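
A minimal sketch of the folding step on a bitarray, assuming the filter length is even at every fold:

    from bitarray import bitarray

    def fold_xor(bloomfilter, folds):
        # Each fold splits the filter in half and XORs the two halves,
        # so r folds shrink n bits down to n / 2 ** r.
        for _ in range(folds):
            half = len(bloomfilter) // 2
            bloomfilter = bloomfilter[:half] ^ bloomfilter[half:]
        return bloomfilter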

clkhash.bloomfilter.serialize_bitarray(ba)[source]

Serialize a bitarray (Bloom filter).
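
A plausible implementation (the exact wire format used by clkhash is an assumption here): pack the bits into bytes, then base64-encode them for text-safe transport:

    import base64

    def serialize_bitarray(ba):
        # bitarray.tobytes() packs the bits; base64 makes them text-safe.
        return base64.b64encode(ba.tobytes()).decode('utf8')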

clkhash.bloomfilter.stream_bloom_filters(dataset, schema_types, keys, xor_folds=0)[source]

Yield Bloom filters.

Parameters:
  • dataset – An iterable of indexable records.
  • schema_types – An iterable of identifier type names.
  • keys – A tuple of two lists of secret keys used in the HMAC.
  • xor_folds – number of XOR folds to perform
Returns:

Yields Bloom filters as 3-tuples: (Bloom filter as bitarray, first element of the record, bit count)

CLK

Generate CLK from CSV file

clkhash.clk.chunks(l, n)[source]

Yield successive n-sized chunks from l.
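
A sketch of how such a chunking generator is typically written; the final chunk may be shorter than n:

    def chunks(l, n):
        # Yield successive n-sized slices of the sequence l.
        for i in range(0, len(l), n):
            yield l[i:i + n]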

clkhash.clk.generate_clk_from_csv(input, keys, schema_types, no_header=False, progress_bar=True, xor_folds=0)[source]
clkhash.clk.generate_clks(pii_data, schema_types, key_lists, xor_folds, callback=None)[source]
clkhash.clk.hash_and_serialize_chunk(chunk_pii_data, schema_types, keys, xor_folds)[source]

Generate Bloom filters (i.e. hash) from chunks of PII, then serialize the generated Bloom filters. It also computes and outputs the Hamming weight (or popcount) – the number of bits set to one – of the generated Bloom filters.

Parameters:
  • chunk_pii_data – An iterable of indexable records.
  • schema_types – An iterable of identifier type names.
  • keys – A tuple of two lists of secret keys used in the HMAC.
  • xor_folds – Number of XOR folds to perform. Each fold halves the hash length.
Returns:

A list of serialized Bloom filters and a list of corresponding popcounts

identifier types

Convert PII to tokens

clkhash.identifier_types.identifier_type_from_description(schema_object)[source]

Convert a dictionary describing a feature into an IdentifierType

Parameters: schema_object – dictionary describing a feature
Returns: An IdentifierType

IdentifierType

class clkhash.identifier_types.IdentifierType(unigram=False, weight=1, **kwargs)[source]

Bases: object

Base class used for all identifier types.

Required to provide a mapping from the schema to the hash type, i.e. whether to tokenize as uni-grams or bi-grams.

__call__(entry)[source]

Call self as a function.

__init__(unigram=False, weight=1, **kwargs)[source]
Parameters:
  • unigram (bool) – Use uni-gram instead of using bi-grams
  • weight (float) – adjusts the “importance” of this identifier in the Bloom filter. Can be set to zero to skip the identifier entirely
  • kwargs – Extra keyword arguments passed to the tokenizer

Note

For each n-gram of an identifier, we compute k different indices in the Bloom filter which will be set to true. There is a global \(k_{default}\) value, and the k value for each identifier is computed as

\[k = \text{weight} \times k_{default},\]

rounded to the nearest integer.
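
For example, taking \(k_{default} = 30\) (the default k of crypto_bloom_filter() above):

    k_default = 30
    for weight in (0.5, 1, 2):
        print(weight, int(round(weight * k_default)))  # -> 15, 30 and 60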

Reasons why you might want to set weights:

  • Long identifiers like street name will produce many more n-grams than short identifiers like zip code. Thus street name will flip more bits in the Bloom filter and have a bigger influence on the overall matching score.
  • The matching might produce better results if identifiers that are stable and / or have low error rates are given higher prominence in the Bloom filter.
__weakref__

list of weak references to the object (if defined)

key derivation

class clkhash.key_derivation.HKDFconfig(master_secret, salt=None, info=None, hash_algo='SHA256')[source]

Bases: object

static check_is_bytes(value)[source]
static check_is_bytes_or_none(value)[source]
supported_hash_algos = ('SHA256', 'SHA512')
clkhash.key_derivation.generate_key_lists(master_secrets, num_identifier, key_size=64, salt=None, info=None, kdf='HKDF')[source]

Generates a derived key for each identifier for each master secret using a key derivation function (KDF).

The only supported key derivation function for now is ‘HKDF’.

The previous key usage can be reproduced by setting kdf to ‘legacy’. This is highly discouraged, as this strategy will map the same n-grams in different identifiers to the same bits in the Bloom filter and thus does not lead to good results.

Parameters:
  • master_secrets – a list of master secrets (either as bytes or strings)
  • num_identifier – the number of identifiers
  • key_size – the size of the derived keys
  • salt – salt for the KDF as bytes
  • info – optional context and application specific information as bytes
  • kdf – the key derivation function algorithm to use
Returns:

The derived keys. The first dimension is of size num_identifier; the second dimension has the same size as master_secrets. A key is represented as bytes.
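
A brief usage sketch, assuming the signature above; the indexing follows the dimensions described in the return value:

    from clkhash.key_derivation import generate_key_lists

    key_lists = generate_key_lists([b'secret1', b'secret2'], num_identifier=3)
    assert len(key_lists) == 3      # one entry per identifier
    assert len(key_lists[0]) == 2   # one derived key per master secret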

clkhash.key_derivation.hkdf(hkdf_config, num_keys, key_size=64)[source]

Executes the HKDF key derivation function as described in rfc5869 to derive num_keys keys of size key_size from the master_secret.

Parameters:
  • hkdf_config – an HKDFconfig object containing the configuration for the HKDF.
  • num_keys – the number of keys the kdf should produce
  • key_size – the size of the produced keys
Returns:

Derived keys
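
A sketch of HKDF expansion per RFC 5869 using the cryptography package (recent versions; whether clkhash issues exactly this call is an assumption): derive one long block of output key material and slice it into num_keys keys:

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.kdf.hkdf import HKDF

    def derive_keys(master_secret, num_keys, key_size=64, salt=None, info=None):
        okm = HKDF(algorithm=hashes.SHA256(), length=num_keys * key_size,
                   salt=salt, info=info).derive(master_secret)
        # Slice the output key material into num_keys keys of key_size bytes.
        return [okm[i * key_size:(i + 1) * key_size] for i in range(num_keys)]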

random names

Module to produce a dataset of names, genders and dates of birth and manipulate that list

Currently very simple and not realistic. Additional functions manipulate the list of names, producing reordered and subset lists with a specified overlap.

NameList class – generates a list of length n of [id, name, dob, gender] lists

TODO:
  • Get the age distribution right by using a mortality table
  • Get the first name distributions right by using distributions
  • Generate realistic errors
  • Add a RESTful API to generate reasonable name data as requested

class clkhash.randomnames.NameList(n)[source]

Bases: object

List of randomly generated names

generate_random_person(n)[source]

Generator that yields details on a person with plausible name, sex and age.

Yields: Generated data for one person as a tuple – (id: int, name: str (‘First Last’), birthdate: str (‘DD/MM/YYYY’), sex: str (‘M’ | ‘F’))
generate_subsets(sz, overlap=0.8)[source]

Return a pair of random subsets of the name list with a specified proportion of elements in common.

Parameters:
  • sz – length of subsets to generate
  • overlap – fraction of the subsets that should have the same names in them
Raises:

ValueError – if there aren’t sufficiently many names in the list to satisfy the request; more precisely, raises if sz + (1 - overlap)*sz > n = len(self.names)

Returns:

pair of subsets
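
A brief usage sketch, assuming a NameList of 1000 generated records:

    from clkhash.randomnames import NameList

    names = NameList(1000)
    subset_a, subset_b = names.generate_subsets(100, overlap=0.8)
    # roughly 80 of the 100 records in each subset refer to the same people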

load_names()[source]

This function loads a name database into the globals firstNames and lastNames.

The initial version uses data files from http://www.quietaffiliate.com/free-first-name-and-last-name-databases-csv-and-sql/

schema = [{'identifier': 'INDEX'}, {'identifier': 'NAME freetext'}, {'identifier': 'DOB YYYY/MM/DD'}, {'identifier': 'GENDER M or F'}]
schema_types
clkhash.randomnames.load_csv_data(resource_name)[source]

Loads a specified CSV data file and returns the first column as a Python list

clkhash.randomnames.random_date(start, end)[source]

This function will return a random datetime between two datetime objects.

Parameters:
  • start – datetime of start
  • end – datetime of end
Returns:

random datetime between start and end
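
A minimal sketch of such a helper, assuming start and end are at least one second apart:

    import random
    from datetime import datetime, timedelta

    def random_date(start, end):
        # Pick a uniformly random whole-second offset within [start, end).
        delta_seconds = int((end - start).total_seconds())
        return start + timedelta(seconds=random.randrange(delta_seconds))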

clkhash.randomnames.save_csv(data, schema, file)[source]

Output generated data to file as CSV with header.

Parameters:
  • data – An iterable of tuples containing raw data.
  • schema – Iterable of schema definition dicts
  • file – A writeable stream in which to write the CSV

schema

clkhash.schema.get_schema_types(schema)[source]
clkhash.schema.load_schema(schema_file)[source]

tokenizer

Functions to tokenize words (PII)

clkhash.tokenizer.bigramlist(word, toremove=None)[source]

Make bigrams from word with prepended and appended spaces

s -> [' ' + s0, s0 + s1, s1 + s2, ..., sN + ' ']

Parameters:
  • word – string to make bigrams from
  • toremove – List of strings to remove before construction
Returns:

list of bigrams as strings
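
A sketch matching the pattern above: strip the unwanted substrings, pad the word with spaces, then pair adjacent characters:

    def bigramlist(word, toremove=None):
        for sub in (toremove or []):
            word = word.replace(sub, '')
        padded = ' ' + word + ' '
        # adjacent pairs: ' ' + s0, s0 + s1, ..., sN + ' '
        return [padded[i:i + 2] for i in range(len(padded) - 1)]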

clkhash.tokenizer.positional_unigrams(instr)[source]

Make positional unigrams from a word.

E.g. 1987 -> ['1 1', '2 9', '3 8', '4 7']

Parameters: instr – input string
Returns: list of strings with unigrams
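
A sketch of the construction, prefixing each character with its 1-based position:

    def positional_unigrams(instr):
        # '1987' -> ['1 1', '2 9', '3 8', '4 7']
        return ['{} {}'.format(i, c) for i, c in enumerate(instr, start=1)]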
clkhash.tokenizer.unigramlist(instr, toremove=None, positional=False)[source]

Make 1-grams (unigrams) from a word, possibly excluding particular substrings

Parameters:
  • instr – input string
  • toremove – Iterable of strings to remove
  • positional – if True, produce positional unigrams (see positional_unigrams())
Returns:

list of strings with unigrams