API Documentation
Bloom filter
Generate a Bloom filter
clkhash.bloomfilter.calculate_bloom_filters(dataset, schema, keys, xor_folds=0)
Parameters:
- dataset – A list of indexable records.
- schema – An iterable of identifier types.
- keys – A tuple of two lists of secret keys used in the HMAC.
- xor_folds – number of XOR folds to perform
Returns: List of Bloom filters as 3-tuples, each containing the Bloom filter (bitarray), the first element of the record (usually an index), and the number of set bits (int).
clkhash.bloomfilter.crypto_bloom_filter(record, tokenizers, keys1, keys2, xor_folds=0, l=1024, k=30)
Makes a Bloom filter from a record with given tokenizers and lists of keys.
Using the method from http://www.record-linkage.de/-download=wp-grlc-2011-02.pdf
Parameters:
- record – plaintext record tuple, e.g. (index, name, dob, gender)
- tokenizers – A list of IdentifierType tokenizers (one for each record element)
- keys1 – list of keys for the first hash function, as a list of bytes
- keys2 – list of keys for the second hash function, as a list of bytes
- xor_folds – number of XOR folds to perform
- l – length of the Bloom filter in number of bits
- k – number of hash functions to use per element
Returns: 3-tuple containing:
- the Bloom filter for the record as a bitarray
- the first element of the record (usually an index)
- the number of bits set in the Bloom filter
clkhash.bloomfilter.double_hash_encode_ngrams(ngrams, key_sha1, key_md5, k, l)
Computes the double hash encoding of the provided n-grams with the given keys.
Using the method from http://www.record-linkage.de/-download=wp-grlc-2011-02.pdf
Parameters:
- ngrams – list of n-grams to be encoded
- key_sha1 – HMAC secret key for SHA1, as bytes
- key_md5 – HMAC secret key for MD5, as bytes
- k – number of hash functions to use per n-gram
- l – length of the output bitarray
Returns: bitarray of length l with the bits set that correspond to the encoding of the n-grams
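The double-hashing idea can be sketched in plain Python. This is an illustration only: it uses a list of ints instead of a bitarray, and the exact bit positions clkhash produces may differ.

```python
import hashlib
import hmac

def double_hash_encode_ngrams(ngrams, key_sha1, key_md5, k, l):
    # Double-hashing scheme: the i-th Bloom filter index for an n-gram
    # is (h1 + i * h2) mod l, where h1 and h2 are the HMAC-SHA1 and
    # HMAC-MD5 digests of the n-gram interpreted as integers.
    bits = [0] * l
    for ngram in ngrams:
        data = ngram.encode('utf-8')
        h1 = int(hmac.new(key_sha1, data, hashlib.sha1).hexdigest(), 16)
        h2 = int(hmac.new(key_md5, data, hashlib.md5).hexdigest(), 16)
        for i in range(k):
            bits[(h1 + i * h2) % l] = 1
    return bits
```

With k hash functions per n-gram, at most k * len(ngrams) bits are set; collisions make the true count smaller.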
clkhash.bloomfilter.fold_xor(bloomfilter, folds)
Performs XOR folding on a Bloom filter.
If the length of the original Bloom filter is n and we perform r folds, then the length of the resulting filter is n / 2 ** r.
Parameters:
- bloomfilter – Bloom filter to fold
- folds – number of folds
Returns: the folded Bloom filter
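XOR folding can be illustrated with a plain list of bits (a simplified stand-in for the bitarray-based implementation): each fold XORs the first half of the filter with the second half, halving its length.

```python
def fold_xor(bits, folds):
    # XOR-fold a bit list `folds` times; the length halves on each fold.
    for _ in range(folds):
        if len(bits) % 2 != 0:
            raise ValueError('filter length must be even to fold')
        half = len(bits) // 2
        bits = [a ^ b for a, b in zip(bits[:half], bits[half:])]
    return bits

fold_xor([1, 0, 1, 1, 0, 0, 1, 0], 1)  # -> [1, 0, 0, 1]
```

Folding trades some accuracy for smaller filters: a 1024-bit filter becomes 256 bits after two folds.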
clkhash.bloomfilter.serialize_bitarray(ba)
Serialize a bitarray (Bloom filter).
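A sketch of the serialization idea: pack the bits into bytes and base64-encode them. This is illustrative only; the real function operates on a bitarray object, and its exact byte order and encoding should be checked against clkhash.

```python
import base64

def serialize_bits(bits):
    # Pack bits (most significant first) into bytes, then base64-encode.
    # Assumes len(bits) is a multiple of 8, as Bloom filter lengths are.
    byte_values = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        byte_values.append(byte)
    return base64.b64encode(bytes(byte_values)).decode('ascii')
```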
clkhash.bloomfilter.stream_bloom_filters(dataset, schema_types, keys, xor_folds=0)
Yield Bloom filters.
Parameters:
- dataset – An iterable of indexable records.
- schema_types – An iterable of identifier type names.
- keys – A tuple of two lists of secret keys used in the HMAC.
- xor_folds – number of XOR folds to perform
Returns: Yields Bloom filters as 3-tuples (Bloom filter bitarray, first element of the record, number of set bits).
CLK
Generate CLKs from a CSV file
clkhash.clk.chunks(l, n)
Yield successive n-sized chunks from l.
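A minimal implementation matching this description:

```python
def chunks(l, n):
    # Yield successive n-sized chunks from list l; the final chunk may
    # be shorter when len(l) is not a multiple of n.
    for i in range(0, len(l), n):
        yield l[i:i + n]

list(chunks([1, 2, 3, 4, 5], 2))  # -> [[1, 2], [3, 4], [5]]
```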
clkhash.clk.generate_clk_from_csv(input, keys, schema_types, no_header=False, progress_bar=True, xor_folds=0)
clkhash.clk.generate_clks(pii_data, schema_types, key_lists, xor_folds, callback=None)
clkhash.clk.hash_and_serialize_chunk(chunk_pii_data, schema_types, keys, xor_folds)
Generate Bloom filters (i.e. hash) from chunks of PII, then serialize the generated Bloom filters.
Parameters:
- chunk_pii_data – An iterable of indexable records.
- schema_types – An iterable of identifier type names.
- keys – A tuple of two lists of secret keys used in the HMAC.
- xor_folds – Number of XOR folds to perform. Each fold halves the hash length.
Returns: A list of serialized Bloom filters.
identifier types
Convert PII to tokens
clkhash.identifier_types.identifier_type_from_description(schema_object)
Convert a dictionary describing a feature into an IdentifierType.
Parameters: schema_object – dictionary describing the feature
Returns: An IdentifierType
IdentifierType
class clkhash.identifier_types.IdentifierType(unigram=False, weight=1, **kwargs)
Bases: object
Base class used for all identifier types.
Required to provide a mapping of schema to hash type, uni-gram or bi-gram.
__call__(entry)
Call self as a function.
__init__(unigram=False, weight=1, **kwargs)
Note
For each n-gram of an identifier, we compute k different indices in the Bloom filter which will be set to true. There is a global \(k_{default}\) value, and the k value for each identifier is computed as
\[k = weight \cdot k_{default},\]
rounded to the nearest integer.
Reasons why you might want to set weights:
- Long identifiers like street name will produce a lot more n-grams than small identifiers like zip code. Thus street name will flip more bits in the Bloom filter and will have a bigger influence in the overall matching score.
- The matching might produce better results if identifiers that are stable and / or have low error rates are given higher prominence in the Bloom filter.
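The weight rule above can be checked in a couple of lines. Note that k_for_weight is not a clkhash function, just an illustration; the default of 30 matches the default k of crypto_bloom_filter.

```python
def k_for_weight(weight, k_default=30):
    # k for an identifier is its weight times the global default,
    # rounded to the nearest integer.
    return round(weight * k_default)

k_for_weight(1)    # -> 30
k_for_weight(0.5)  # -> 15
k_for_weight(2)    # -> 60
```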
__weakref__
list of weak references to the object (if defined)
key derivation
class clkhash.key_derivation.HKDFconfig(master_secret, salt=None, info=None, hash_algo='SHA256')
Bases: object
static check_is_bytes(value)
static check_is_bytes_or_none(value)
supported_hash_algos = ('SHA256', 'SHA512')
clkhash.key_derivation.generate_key_lists(master_secrets, num_identifier, key_size=64, salt=None, info=None, kdf='HKDF')
Generates a derived key for each identifier for each master secret using a key derivation function (KDF).
The only supported key derivation function for now is 'HKDF'.
The previous key usage can be reproduced by setting kdf to 'legacy'. This is highly discouraged, as that strategy maps the same n-grams in different identifiers to the same bits in the Bloom filter and thus does not lead to good results.
Parameters:
- master_secrets – a list of master secrets (either as bytes or strings)
- num_identifier – the number of identifiers
- key_size – the size of the derived keys
- salt – salt for the KDF as bytes
- info – optional context and application specific information as bytes
- kdf – the key derivation function algorithm to use
Returns: The derived keys. The first dimension is the same as master_secrets, the second dimension is of size num_identifier. A key is represented as bytes.
clkhash.key_derivation.hkdf(hkdf_config, num_keys, key_size=64)
Executes the HKDF key derivation function as described in RFC 5869 to derive num_keys keys of size key_size from the master_secret.
Parameters:
- hkdf_config – an HKDFconfig object containing the configuration for the HKDF
- num_keys – the number of keys the KDF should produce
- key_size – the size of the produced keys
Returns: Derived keys
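The extract-and-expand construction of RFC 5869 can be sketched with the standard library. This covers HKDF-SHA256 only and takes the master secret directly, whereas clkhash's hkdf takes an HKDFconfig object.

```python
import hashlib
import hmac

def hkdf_sha256(master_secret, num_keys, key_size=64, salt=b'', info=b''):
    # RFC 5869: extract a pseudorandom key from the master secret,
    # then expand it to num_keys * key_size bytes and split the
    # output keying material into individual keys.
    hash_len = hashlib.sha256().digest_size
    prk = hmac.new(salt or b'\x00' * hash_len, master_secret,
                   hashlib.sha256).digest()
    okm, block = b'', b''
    for counter in range(1, -(-num_keys * key_size // hash_len) + 1):
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
    return [okm[i * key_size:(i + 1) * key_size] for i in range(num_keys)]
```

Derivation is deterministic: the same master secret, salt and info always yield the same keys, which is what allows both parties in a linkage to compute matching Bloom filters.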
random names
Module to produce a dataset of names, genders and dates of birth, and to manipulate that list.
Currently very simple and not realistic. Additional functions manipulate the list of names, producing reordered and subset lists with a specified overlap.
ClassList class - generate a list of length n of [id, name, dob, gender] lists.
TODO: Get the age distribution right by using a mortality table.
TODO: Get first name distributions right by using distributions.
TODO: Generate realistic errors.
TODO: Add a RESTful API to generate reasonable name data as requested.
class clkhash.randomnames.NameList(n)
Bases: object
List of randomly generated names.
generate_random_person(n)
Generator that yields details on a person with plausible name, sex and age.
Yields: Generated data for one person as a tuple: (id: int, name: str('First Last'), birthdate: str('DD/MM/YYYY'), sex: str('M' | 'F'))
generate_subsets(sz, overlap=0.8)
Generate a pair of subsets of the name list with a specified overlap.
Parameters:
- sz – length of the subsets to generate
- overlap – fraction of the subsets that should have the same names in them
Returns: 2-tuple of lists of subsets
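One way the overlap parameter can work, sketched as a standalone function. This is illustrative; the actual NameList method may sample differently.

```python
import random

def generate_subsets(records, sz, overlap=0.8):
    # Draw enough distinct records for two subsets of size sz that
    # share int(sz * overlap) entries, then split them.
    n_shared = int(sz * overlap)
    n_unique = sz - n_shared
    sample = random.sample(records, sz + n_unique)
    shared = sample[:n_shared]
    subset_a = shared + sample[n_shared:n_shared + n_unique]
    subset_b = shared + sample[n_shared + n_unique:]
    return subset_a, subset_b
```

Two subsets with a known overlap are useful for testing a linkage pipeline, since the expected number of matches is known in advance.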
load_names()
Loads a name database into the globals firstNames and lastNames.
The initial version uses data files from http://www.quietaffiliate.com/free-first-name-and-last-name-databases-csv-and-sql/
schema = [{'identifier': 'INDEX'}, {'identifier': 'NAME freetext'}, {'identifier': 'DOB YYYY/MM/DD'}, {'identifier': 'GENDER M or F'}]
schema_types
clkhash.randomnames.load_csv_data(resource_name)
Loads a specified data file as CSV and returns the first column as a Python list.
clkhash.randomnames.random_date(start, end)
Returns a random datetime between two datetime objects.
Parameters:
- start – datetime of start
- end – datetime of end
Returns: random datetime between start and end
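A minimal sketch of such a helper, at one-second resolution and assuming end is strictly after start:

```python
import random
from datetime import datetime, timedelta

def random_date(start, end):
    # Pick a uniformly random whole-second offset inside [start, end).
    total_seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=random.randrange(total_seconds))
```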
clkhash.randomnames.save_csv(data, schema, file)
Output generated data as CSV with a header.
Parameters:
- data – An iterable of tuples containing raw data.
- schema – Iterable of schema definition dicts
- file – A writeable stream to which the CSV is written
tokenizer
Functions to tokenize words (PII)
clkhash.tokenizer.bigramlist(word, toremove=None)
Make bigrams from word with prepended and appended spaces:
s -> [' ' + s0, s0 + s1, s1 + s2, ..., sN + ' ']
Parameters:
- word – string to make bigrams from
- toremove – List of strings to remove before construction
Returns: list of bigrams as strings
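The s -> [' ' + s0, s0 + s1, ..., sN + ' '] pattern can be sketched as:

```python
def bigramlist(word, toremove=None):
    # Strip unwanted substrings, pad with spaces, then slide a
    # two-character window over the padded word.
    if toremove is not None:
        for sub in toremove:
            word = word.replace(sub, '')
    padded = ' ' + word + ' '
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

bigramlist('abc')  # -> [' a', 'ab', 'bc', 'c ']
```

The padding spaces mean the first and last characters each appear in two bigrams, like every interior character.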
clkhash.tokenizer.positional_unigrams(instr)
Make positional unigrams from a word.
E.g. 1987 -> ["1 1", "2 9", "3 8", "4 7"]
Parameters: instr – input string
Returns: list of strings with unigrams
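A sketch matching the example above:

```python
def positional_unigrams(instr):
    # Prefix each character with its 1-based position, so the same
    # character at different positions produces distinct tokens.
    return ['{} {}'.format(i, c) for i, c in enumerate(instr, start=1)]

positional_unigrams('1987')  # -> ['1 1', '2 9', '3 8', '4 7']
```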
clkhash.tokenizer.unigramlist(instr, toremove=None, positional=False)
Make 1-grams (unigrams) from a word, possibly excluding particular substrings.
Parameters:
- instr – input string
- toremove – Iterable of strings to remove
Returns: list of strings with unigrams