API Documentation¶
Bloom filter¶
Generate a Bloom filter
- clkhash.bloomfilter.blake_encode_ngrams(ngrams: Iterable[str], keys: Sequence[bytes], ks: Sequence[int], l: int, encoding: str) bitarray.bitarray [source]¶
Computes the encoding of the ngrams using the BLAKE2 hash function.
We deliberately do not use the double hashing scheme as proposed in [Schnell2011], because it would introduce an exploitable structure into the Bloom filter. For more details on the weakness, see [Kroll2015].
In short, the double hashing scheme only allows for \(l^2\) different encodings for any possible n-gram, whereas the use of \(k\) different independent hash functions gives you \(\sum_{j=1}^{k}{\binom{l}{j}}\) combinations.
Our construction
It is advantageous to construct Bloom filters using a family of hash functions with the property of k-independence to compute the indices for an entry. This approach minimises the chance of collisions.
Informally, a family of hash functions is k-independent if, for a function selected uniformly at random from the family, the hash codes of any k designated keys are independent random variables.
Our construction utilises the fact that the output bits of a cryptographic hash function are uniformly distributed, independent, binary random variables (or at least as close to that ideal as possible; see [Kaminsky2011] for an analysis). Thus, slicing the output of a cryptographic hash function into k different slices gives you k independent random variables.
We chose Blake2 as the cryptographic hash function mainly for two reasons:
it is fast.
in keyed hashing mode, Blake2 provides MACs with just one hash function call instead of the two calls in the HMAC construction used in the double hashing scheme.
Warning
Please be aware that, although this construction makes the attack of [Kroll2015] infeasible, it is most likely not enough to ensure security. Or in their own words:
However, we think that using independent hash functions alone will not be sufficient to ensure security, since in this case other approaches (maybe related to or at least inspired through work from the area of Frequent Itemset Mining) are promising to detect at least the most frequent atoms automatically.
- Parameters
ngrams – list of n-grams to be encoded
keys – secret key for blake2 as bytes
ks – ks[i] is k value to use for ngram[i]
l – length of the output bitarray (has to be a power of 2)
encoding – the encoding to use when turning the ngrams to bytes
- Returns
bitarray of length l with the bits set which correspond to the encoding of the ngrams
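The slicing idea above can be sketched with Python's standard library. This is an illustrative simplification (a single key, a fixed k for every n-gram, two-byte slices, and a plain list of bits instead of a bitarray), not clkhash's actual implementation:

```python
import hashlib

def blake_encode_sketch(ngrams, key, k, l, encoding="utf-8"):
    """Sketch: derive k Bloom filter indices per n-gram by slicing the
    keyed BLAKE2b digest into independent two-byte chunks.
    l must be a power of 2 so the modulo reduction stays uniform."""
    assert l & (l - 1) == 0, "l must be a power of 2"
    bits = [0] * l
    for ngram in ngrams:
        digest = hashlib.blake2b(ngram.encode(encoding), key=key).digest()
        # slice the 64-byte digest into k two-byte chunks, one index each
        for i in range(k):
            chunk = digest[2 * i: 2 * i + 2]
            bits[int.from_bytes(chunk, "big") % l] = 1
    return bits
```

Because the digest is only 64 bytes, this particular slicing supports k up to 32; the single keyed BLAKE2b call replaces the two HMAC calls of the double hashing scheme.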
- clkhash.bloomfilter.crypto_bloom_filter(record: Sequence[str], comparators: List[clkhash.comparators.AbstractComparison], schema: clkhash.schema.Schema, keys: Sequence[Sequence[bytes]]) Tuple[bitarray.bitarray, str, int] [source]¶
Computes the composite Bloom filter encoding of a record.
Using the method from http://www.record-linkage.de/-download=wp-grlc-2011-02.pdf
- Parameters
record – plaintext record tuple. E.g. (index, name, dob, gender)
comparators – A list of comparators. They provide a ‘tokenize’ function to turn strings into appropriate tokens.
schema – Schema
keys – Keys for the hash functions as a tuple of lists of bytes.
- Returns
3-tuple:
bloom filter for record as a bitarray
first element of record (usually an index)
number of bits set in the bloomfilter
- clkhash.bloomfilter.double_hash_encode_ngrams(ngrams: Iterable[str], keys: Sequence[bytes], ks: Sequence[int], l: int, encoding: str) bitarray.bitarray [source]¶
Computes the double hash encoding of the ngrams with the given keys.
Using the method from: Schnell, R., Bachteler, T., & Reiher, J. (2011). A Novel Error-Tolerant Anonymous Linking Code. http://grlc.german-microsimulation.de/wp-content/uploads/2017/05/downloadwp-grlc-2011-02.pdf
- Parameters
ngrams – list of n-grams to be encoded
keys – hmac secret keys for md5 and sha1 as bytes
ks – ks[i] is k value to use for ngram[i]
l – length of the output bitarray
encoding – the encoding to use when turning the ngrams to bytes
- Returns
bitarray of length l with the bits set which correspond to the encoding of the ngrams
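The double hashing scheme is compact enough to sketch with the standard library's hmac module. This is an illustrative simplification (one k value for all n-grams, a plain list of bits), not the library's exact code:

```python
import hashlib
import hmac

def double_hash_sketch(ngrams, key_sha1, key_md5, k, l, encoding="utf-8"):
    """Sketch of the Schnell et al. double hashing scheme:
    g_i(x) = (h1(x) + i * h2(x)) mod l, with h1 = HMAC-SHA1, h2 = HMAC-MD5."""
    bits = [0] * l
    for ngram in ngrams:
        data = ngram.encode(encoding)
        h1 = int(hmac.new(key_sha1, data, hashlib.sha1).hexdigest(), 16)
        h2 = int(hmac.new(key_md5, data, hashlib.md5).hexdigest(), 16)
        for i in range(k):
            bits[(h1 + i * h2) % l] = 1
    return bits
```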
- clkhash.bloomfilter.double_hash_encode_ngrams_non_singular(ngrams: Iterable[str], keys: Sequence[bytes], ks: Sequence[int], l: int, encoding: str) bitarray.bitarray [source]¶
Computes the double hash encoding of the n-grams with the given keys.
The original construction of [Schnell2011] displays an abnormality for certain inputs:
An n-gram can be encoded into just one bit irrespective of the value of k.
Their construction goes as follows: the \(k\) different indices \(g_i\) of the Bloom filter for an n-gram \(x\) are defined as:
\[g_{i}(x) = (h_1(x) + i h_2(x)) \mod l\]
with \(0 \leq i < k\), where \(l\) is the length of the Bloom filter. If the hash of \(x\) under the second hash function is a multiple of \(l\), then
\[h_2(x) = 0 \mod l\]
and thus
\[g_i(x) = h_1(x) \mod l,\]
irrespective of the value of \(i\). A discussion of this potential flaw can be found here.
- Parameters
ngrams – list of n-grams to be encoded
keys – tuple with (key_sha1, key_md5). That is, (hmac secret keys for sha1 as bytes, hmac secret keys for md5 as bytes)
ks – ks[i] is k value to use for ngram[i]
l – length of the output bitarray
encoding – the encoding to use when turning the ngrams to bytes
- Returns
bitarray of length l with the bits set which correspond to the encoding of the ngrams
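One way to sketch the non-singular variant: if \(h_2(x)\) vanishes modulo \(l\), re-hash with an appended suffix until it does not, so an n-gram can no longer collapse onto a single bit. This is an illustrative sketch of the idea, not necessarily clkhash's exact remedy:

```python
import hashlib
import hmac

def double_hash_non_singular_sketch(ngram, key_sha1, key_md5, k, l, encoding="utf-8"):
    """Sketch: compute the k indices of the double hashing scheme, but
    avoid the singularity h2(x) = 0 (mod l) by re-hashing with an
    appended suffix byte until h2 is non-zero mod l."""
    data = ngram.encode(encoding)
    h1 = int(hmac.new(key_sha1, data, hashlib.sha1).hexdigest(), 16) % l
    h2 = int(hmac.new(key_md5, data, hashlib.md5).hexdigest(), 16) % l
    while h2 == 0:
        data += b"\x00"  # perturb the input and try again
        h2 = int(hmac.new(key_md5, data, hashlib.md5).hexdigest(), 16) % l
    return [(h1 + i * h2) % l for i in range(k)]
```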
- clkhash.bloomfilter.fold_xor(bloomfilter: bitarray.bitarray, folds: int) bitarray.bitarray [source]¶
Performs XOR folding on a Bloom filter.
If the length of the original Bloom filter is n and we perform r folds, then the length of the resulting filter is n / 2 ** r.
- Parameters
bloomfilter – Bloom filter to fold
folds – number of folds
- Returns
folded bloom filter
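XOR folding is simple enough to sketch on a plain list of bits (the library itself operates on bitarray objects):

```python
def fold_xor_sketch(bits, folds):
    """Sketch: XOR-fold a Bloom filter represented as a list of 0/1 ints.
    Each fold XORs the first half with the second half, halving the length."""
    for _ in range(folds):
        if len(bits) % 2:
            raise ValueError("filter length must be even to fold")
        half = len(bits) // 2
        bits = [a ^ b for a, b in zip(bits[:half], bits[half:])]
    return bits
```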
- clkhash.bloomfilter.hashing_function_from_properties(fhp: clkhash.field_formats.FieldHashingProperties) Callable[[Iterable[str], Sequence[bytes], Sequence[int], int, str], bitarray.bitarray] [source]¶
Get the hashing function for this field.
- Parameters
fhp – hashing properties for this field
- Returns
the hashing function
- clkhash.bloomfilter.stream_bloom_filters(dataset: Iterable[Sequence[str]], keys: Sequence[Sequence[bytes]], schema: clkhash.schema.Schema) Iterable[Tuple[bitarray.bitarray, str, int]] [source]¶
Compute composite Bloom filters (CLKs) for every record in an iterable dataset.
- Parameters
dataset – An iterable of indexable records.
schema – An instantiated Schema object.
keys – A tuple of two lists of secret keys used in the HMAC.
- Returns
Generator yielding bloom filters as 3-tuples
CLK¶
Generate CLK from data.
- clkhash.clk.chunks(seq: Sequence[clkhash.clk.T], chunk_size: int) Iterable[Sequence[clkhash.clk.T]] [source]¶
Split seq into chunk_size-sized chunks.
- Parameters
seq – A sequence to chunk.
chunk_size – The size of chunk.
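A minimal sketch of such a chunking helper, where the last chunk may be shorter than chunk_size:

```python
def chunks_sketch(seq, chunk_size):
    """Sketch: split a sequence into chunk_size-sized chunks,
    yielding each chunk lazily."""
    for i in range(0, len(seq), chunk_size):
        yield seq[i:i + chunk_size]
```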
- clkhash.clk.generate_clk_from_csv(input_f: TextIO, secret: AnyStr, schema: clkhash.schema.Schema, validate: bool = True, header: Union[bool, AnyStr] = True, progress_bar: bool = True, max_workers: Optional[int] = None) List[bitarray.bitarray] [source]¶
Generate Bloom filters from a CSV file, then serialise them.
This function also computes and outputs the Hamming weight (a.k.a. popcount – the number of bits set to high) of the generated Bloom filters.
- Parameters
input_f – A file-like object of csv data to hash.
secret – A secret.
schema – Schema specifying the record formats and hashing settings.
validate – Set to False to disable validation of data against the schema. Note that this will silence warnings whose aim is to keep the hashes consistent between data sources; this may affect linkage accuracy.
header – Set to False if the CSV file does not have a header. Set to ‘ignore’ if the CSV file does have a header but it should not be checked against the schema.
progress_bar (bool) – Set to False to disable the progress bar.
max_workers (int) – Passed to ProcessPoolExecutor except for the special case where the value is 1, in which case no processes or threads are used. This may be useful or required on platforms that are not capable of spawning subprocesses.
- Returns
A list of Bloom filters as bitarrays and a list of corresponding popcounts.
- clkhash.clk.generate_clks(pii_data: Sequence[Sequence[str]], schema: clkhash.schema.Schema, secret: AnyStr, validate: bool = True, callback: Optional[Callable[[int, Sequence[int]], None]] = None, max_workers: Optional[int] = None) List[bitarray.bitarray] [source]¶
- clkhash.clk.hash_chunk(chunk_pii_data: Sequence[Sequence[str]], keys: Sequence[Sequence[bytes]], schema: clkhash.schema.Schema) Tuple[List[bitarray.bitarray], Sequence[int]] [source]¶
Generate Bloom filters (i.e. hash) from chunks of PII. It also computes and outputs the Hamming weight (or popcount) – the number of bits set to one – of the generated Bloom filters.
- Parameters
chunk_pii_data – An iterable of indexable records.
keys – A tuple of two lists of keys used in the HMAC. Should have been created by generate_key_lists.
schema (Schema) – Schema specifying the entry formats and hashing settings.
- Returns
A list of Bloom filters as bitarrays and a list of corresponding popcounts
key derivation¶
- clkhash.key_derivation.generate_key_lists(secret: Union[bytes, str], num_identifier: int, num_hashing_methods: int = 2, key_size: int = 64, salt: Optional[bytes] = None, info: Optional[bytes] = None, kdf: str = 'HKDF', hash_algo: str = 'SHA256') Tuple[Tuple[bytes, ...], ...] [source]¶
Generates num_hashing_methods derived keys for each identifier for the secret using a key derivation function (KDF).
The only supported key derivation function for now is ‘HKDF’.
The previous secret usage can be reproduced by setting kdf to ‘legacy’, but this will use the secret twice. It is highly discouraged, as this strategy maps the same n-grams in different identifiers to the same bits in the Bloom filter and thus does not lead to good results.
- Parameters
secret – a secret (either as bytes or string)
num_identifier – the number of identifiers
num_hashing_methods – number of hashing methods used per identifier, each of them requiring a different key
key_size – the size of the derived keys
salt – salt for the KDF as bytes
info – optional context and application specific information as bytes
kdf – the key derivation function algorithm to use
hash_algo – the hashing algorithm to use (ignored if kdf is not ‘HKDF’)
- Returns
The derived keys. The first dimension is of size num_identifier, the second dimension is of size num_hashing_methods. A key is represented as bytes.
- clkhash.key_derivation.hkdf(secret: bytes, num_keys: int, hash_algo: str = 'SHA256', salt: Optional[bytes] = None, info: Optional[bytes] = None, key_size: int = 64) Tuple[bytes, ...] [source]¶
Executes the HKDF key derivation function as described in RFC 5869 to derive num_keys keys of size key_size from the secret.
- Parameters
secret – input keying material
num_keys – the number of keys the kdf should produce
hash_algo – The hash function used by HKDF for the internal HMAC calls. The choice of hash function defines the maximum length of the output key material. Output bytes <= 255 * hash digest size (in bytes).
salt – HKDF is defined to operate with and without random salt. This is done to accommodate applications where a salt value is not available. We stress, however, that the use of salt adds significantly to the strength of HKDF, ensuring independence between different uses of the hash function, supporting “source-independent” extraction, and strengthening the analytical results that back the HKDF design. Random salt differs fundamentally from the initial keying material in two ways: it is non-secret and can be re-used. Ideally, the salt value is a random (or pseudorandom) string of the length HashLen. Yet, even a salt value of less quality (shorter in size or with limited entropy) may still make a significant contribution to the security of the output keying material.
info – While the ‘info’ value is optional in the definition of HKDF, it is often of great importance in applications. Its main objective is to bind the derived key material to application- and context-specific information. For example, ‘info’ may contain a protocol number, algorithm identifiers, user identities, etc. In particular, it may prevent the derivation of the same keying material for different contexts (when the same input key material (IKM) is used in such different contexts). It may also accommodate additional inputs to the key expansion part, if so desired (e.g., an application may want to bind the key material to its length L, thus making L part of the ‘info’ field). There is one technical requirement from ‘info’: it should be independent of the input key material value IKM.
key_size – the size of the produced keys
- Returns
Derived keys
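HKDF's extract-and-expand steps can be reproduced with the standard library. The sketch below follows RFC 5869 with SHA-256 and slices the output keying material into num_keys keys; it is an illustration of the KDF, not the library's implementation:

```python
import hashlib
import hmac

def hkdf_sketch(secret, num_keys, key_size=64, salt=None, info=b""):
    """Sketch of RFC 5869 HKDF-SHA256: extract a pseudorandom key (PRK)
    from the secret, then expand it into num_keys * key_size bytes of
    output keying material (OKM)."""
    hash_len = hashlib.sha256().digest_size
    salt = salt if salt is not None else b"\x00" * hash_len
    prk = hmac.new(salt, secret, hashlib.sha256).digest()   # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < num_keys * key_size:                   # expand step
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return tuple(okm[i * key_size:(i + 1) * key_size] for i in range(num_keys))
```

The expand loop computes T(i) = HMAC-Hash(PRK, T(i-1) || info || i) exactly as in the RFC, so the sketch reproduces the published test vectors.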
random names¶
Module to produce a dataset of names, genders and dates of birth, and to manipulate that list.
Names and ages are based on Australian and USA census data, but are not correlated. Additional functions manipulate the list of names, producing reordered and subset lists with a specific overlap.
ClassList class - generate a list of length n of [id, name, dob, gender] lists.
TODO: Generate realistic errors.
TODO: Add a RESTful API to generate reasonable name data as requested.
- class clkhash.randomnames.Distribution(resource_name: str)[source]¶
Bases:
object
Creates a random value generator with a weighted distribution
- class clkhash.randomnames.NameList(n: int)[source]¶
Bases:
object
Randomly generated PII records.
- SCHEMA = <Schema (v3): 4 fields>¶
- generate_random_person(n: int) Iterable[Tuple[str, str, str, str]] [source]¶
Generator that yields details on a person with plausible name, sex and age.
- Yields
Generated data for one person tuple - (id: str, name: str(‘First Last’), birthdate: str(‘DD/MM/YYYY’), sex: str(‘M’ | ‘F’) )
- generate_subsets(sz: int, overlap: float = 0.8, subsets: int = 2) Tuple[List, ...] [source]¶
Return random subsets with nonempty intersection.
The random subsets are of specified size. If an element is common to two subsets, then it is common to all subsets. This overlap is controlled by a parameter.
- Parameters
sz – size of subsets to generate
overlap – size of the intersection, as fraction of the subset length
subsets – number of subsets to generate
- Raises
ValueError – if there aren’t sufficiently many names in the list to satisfy the request; more precisely, raises if (1 - subsets) * floor(overlap * sz) + subsets * sz > len(self.names).
- Returns
tuple of subsets
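The overlap logic can be sketched as follows: draw a common core of floor(overlap * sz) elements, plus a disjoint remainder for each subset, so any element shared by two subsets is shared by all. This is a simplification of the documented behaviour with a hypothetical names input:

```python
import random

def subsets_sketch(names, sz, overlap=0.8, subsets=2):
    """Sketch: random subsets of size sz sharing a common core of
    floor(overlap * sz) elements."""
    common = int(overlap * sz)
    unique = sz - common
    needed = common + subsets * unique
    if needed > len(names):
        raise ValueError("not enough names to satisfy the request")
    pool = random.sample(names, needed)          # distinct elements
    core, rest = pool[:common], pool[common:]
    return tuple(core + rest[i * unique:(i + 1) * unique]
                 for i in range(subsets))
```

Note that needed = common + subsets * (sz - common) = (1 - subsets) * floor(overlap * sz) + subsets * sz, matching the ValueError condition documented above.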
- load_data() None [source]¶
Loads databases from package data
Uses data files sourced from http://www.quietaffiliate.com/free-first-name-and-last-name-databases-csv-and-sql/ https://www.census.gov/topics/population/genealogy/data/2010_surnames.html https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3101.0Jun%202016
- randomname_schema = {'clkConfig': {'kdf': {'hash': 'SHA256', 'info': 'c2NoZW1hX2V4YW1wbGU=', 'keySize': 64, 'salt': 'SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==', 'type': 'HKDF'}, 'l': 1024}, 'features': [{'identifier': 'INDEX', 'ignored': True}, {'identifier': 'NAME freetext', 'format': {'type': 'string', 'encoding': 'utf-8', 'case': 'mixed', 'minLength': 3}, 'hashing': {'comparison': {'type': 'ngram', 'n': 2, 'positional': False}, 'strategy': {'bitsPerToken': 15}, 'hash': {'type': 'doubleHash'}}}, {'identifier': 'DOB YYYY/MM/DD', 'format': {'type': 'date', 'description': 'Numbers separated by slashes, in the year, month, day order', 'format': '%Y/%m/%d'}, 'hashing': {'comparison': {'type': 'ngram', 'n': 1, 'positional': True}, 'strategy': {'bitsPerToken': 30}, 'hash': {'type': 'doubleHash'}}}, {'identifier': 'GENDER M or F', 'format': {'type': 'enum', 'values': ['M', 'F']}, 'hashing': {'comparison': {'type': 'ngram', 'n': 1, 'positional': False}, 'strategy': {'bitsPerToken': 60}, 'hash': {'type': 'doubleHash'}}}], 'version': 3}¶
- randomname_schema_bytes = b'{\n "version": 3,\n "clkConfig": {\n "l": 1024,\n "kdf": {\n "type": "HKDF",\n "hash": "SHA256",\n "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",\n "info": "c2NoZW1hX2V4YW1wbGU=",\n "keySize": 64\n }\n },\n "features": [\n {\n "identifier": "INDEX",\n "ignored": true\n },\n {\n "identifier": "NAME freetext",\n "format": {\n "type": "string",\n "encoding": "utf-8",\n "case": "mixed",\n "minLength": 3\n },\n "hashing": {\n "comparison": {\n "type": "ngram",\n "n": 2\n },\n "strategy": {\n "bitsPerToken": 15\n },\n "hash": {"type": "doubleHash"}\n }\n },\n {\n "identifier": "DOB YYYY/MM/DD",\n "format": {\n "type": "date",\n "description": "Numbers separated by slashes, in the year, month, day order",\n "format": "%Y/%m/%d"\n },\n "hashing": {\n "comparison": {\n "type": "ngram",\n "n": 1,\n "positional": true\n },\n "strategy": {\n "bitsPerToken": 30\n },\n "hash": {"type": "doubleHash"}\n }\n },\n {\n "identifier": "GENDER M or F",\n "format": {\n "type": "enum",\n "values": ["M", "F"]\n },\n "hashing": {\n "comparison": {\n "type": "ngram",\n "n": 1\n },\n "strategy": {\n "bitsPerToken": 60\n },\n "hash": {"type": "doubleHash"}\n }\n }\n ]\n}\n'¶
- property schema_types: Sequence[clkhash.field_formats.FieldSpec]¶
- clkhash.randomnames.random_date(year: int, age_distribution: Optional[clkhash.randomnames.Distribution]) datetime.datetime [source]¶
Generate a random birthdate based on the given reference year and age distribution.
- Parameters
year – the reference year from which ages are computed
age_distribution – a Distribution of ages to sample from
- Returns
a random datetime
- clkhash.randomnames.save_csv(data: Iterable[Tuple[Union[str, int], ...]], headers: Iterable[str], file: TextIO) None [source]¶
Output generated data to file as CSV with header.
- Parameters
data – An iterable of tuples containing raw data.
headers – Iterable of feature names
file – A writeable stream in which to write the CSV
schema¶
Schema loading and validation.
- exception clkhash.schema.MasterSchemaError[source]¶
Bases:
Exception
Master schema missing? Corrupted? Otherwise surprising? This is the exception for you!
- class clkhash.schema.Schema(fields: Sequence[clkhash.field_formats.FieldSpec], l: int, xor_folds: int = 0, kdf_type: str = 'HKDF', kdf_hash: str = 'SHA256', kdf_info: Optional[bytes] = None, kdf_salt: Optional[bytes] = None, kdf_key_size: int = 64)[source]¶
Bases:
object
Linkage Schema which describes how to encode plaintext identifiers.
- Variables
fields – the features or field definitions
l (int) – The length of the resulting encoding in bits. This is the length after XOR folding.
xor_folds (int) – The number of XOR folds to perform on the hash.
kdf_type (str) – The key derivation function to use. Currently, the only permitted value is ‘HKDF’.
kdf_hash (str) – The hash function to use in key derivation. The options are ‘SHA256’ and ‘SHA512’.
kdf_info (bytes) – The info for key derivation. See the documentation of key_derivation.hkdf() for details.
kdf_salt (bytes) – The salt for key derivation. See the documentation of key_derivation.hkdf() for details.
kdf_key_size (int) – The size of the derived keys in bytes.
- exception clkhash.schema.SchemaError(msg: str, errors: Optional[Sequence[clkhash.field_formats.InvalidSchemaError]] = None)[source]¶
Bases:
Exception
The user-defined schema is invalid.
- clkhash.schema.convert_to_latest_version(schema_dict: Dict[str, Any], validate_result: Optional[bool] = False) Dict[str, Any] [source]¶
Convert the given schema to latest schema version.
- Parameters
schema_dict – A dictionary describing a linkage schema. This dictionary must have a ‘version’ key containing a master schema version. The rest of the schema dict must conform to the corresponding master schema.
validate_result – validate converted schema against schema specification
- Returns
schema dict of the latest version
- Raises
SchemaError – if schema version is not supported
- clkhash.schema.from_json_dict(dct: Dict[str, Any], validate: bool = True) clkhash.schema.Schema [source]¶
Create a Schema of the most recent version according to dct.
If the provided schema dict is of an older version, it will be automatically converted to the latest.
- Parameters
dct – This dictionary must have a ‘features’ key specifying the columns of the dataset. It must have a ‘version’ key containing the master schema version that this schema conforms to. It must have a ‘hash’ key with all the globals.
validate – (default True) Raise an exception if the schema does not conform to the master schema.
- Raises
SchemaError – An exception containing details about why the schema is not valid.
- Returns
the Schema
- clkhash.schema.from_json_file(schema_file: TextIO, validate: bool = True) clkhash.schema.Schema [source]¶
Load a Schema object from a json file.
- Parameters
schema_file – A JSON file containing the schema.
validate – (default True) Raise an exception if the schema does not conform to the master schema.
- Raises
SchemaError – When the schema is invalid.
- Returns
the Schema
- clkhash.schema.validate_schema_dict(schema: Dict[str, Any]) None [source]¶
Validate the schema.
This raises an exception iff the schema or the master schema is invalid. If successful, it returns nothing.
- Parameters
schema – The schema to validate, as parsed by json.
- Raises
SchemaError – When the schema is invalid.
MasterSchemaError – When the master schema is invalid.
field_formats¶
Classes that specify the requirements for each column in a dataset. They take care of validation, and produce the settings required to perform the hashing.
- class clkhash.field_formats.BitsPerFeatureStrategy(bits_per_feature: int)[source]¶
Bases:
clkhash.field_formats.StrategySpec
Have a fixed number of filter insertions for a feature, irrespective of the actual number of tokens.
This strategy allows reasoning about the importance of a feature irrespective of the lengths of the feature values. For example, with the BitsPerTokenStrategy the name ‘Bob’ sets only about half the number of bits in the Bloom filter that ‘Robert’ does. With this BitsPerFeatureStrategy, both names set the same number of bits in the filter, thus allowing importance to be adjusted on a per-feature basis.
- Variables
bits_per_feature (int) – total number of insertions for this feature, will be spread across all tokens.
- class clkhash.field_formats.BitsPerTokenStrategy(bits_per_token: int)[source]¶
Bases:
clkhash.field_formats.StrategySpec
Insert every token the same number of times.
This is the strategy from the original Schnell paper. The provided value bits_per_token (the ‘k’ value in the paper) defines the number of hash functions that are used to insert each token into the Bloom filter.
One important property of this strategy is that the total number of inserted bits for a feature relates to the length of its value. This can have privacy implications, as the number of bits set in a Bloom filter correlate to the number of tokens of the PII.
- Variables
bits_per_token (int) – how often each token should be inserted into the filter
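The two strategies can be sketched as plain functions returning the per-token insertion counts (a simplification of the StrategySpec interface; the remainder-spreading rule in the second function is an assumption for illustration):

```python
def bits_per_token_strategy(bits_per_token, num_tokens):
    """BitsPerToken sketch: every token gets the same k insertions,
    so the total grows with the number of tokens."""
    return [bits_per_token] * num_tokens

def bits_per_feature_strategy(bits_per_feature, num_tokens):
    """BitsPerFeature sketch: spread a fixed total across the tokens,
    giving the remainder to the first few tokens (assumed rule)."""
    base, extra = divmod(bits_per_feature, num_tokens)
    return [base + 1 if i < extra else base for i in range(num_tokens)]
```

With the first strategy, ‘Bob’ (2 bigrams) gets fewer total insertions than ‘Robert’ (5 bigrams); with the second, both sum to the same fixed budget.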
- class clkhash.field_formats.DateSpec(identifier: str, hashing_properties: clkhash.field_formats.FieldHashingProperties, format: str, description: Optional[str] = None)[source]¶
Bases:
clkhash.field_formats.FieldSpec
Represents a field that holds dates.
Dates are specified as full-dates in a format that can be described as a strptime() (C89 standard) compatible format string. E.g.: the format for the standard internet format RFC3339 (e.g. 1996-12-19) is ‘%Y-%m-%d’.
- Variables
format (str) – The format of the date.
- OUTPUT_FORMAT = '%Y%m%d'¶
- classmethod from_json_dict(json_dict: Dict[str, Any]) clkhash.field_formats.DateSpec [source]¶
Make a DateSpec object from a dictionary containing its properties.
- Parameters
json_dict (dict) – The properties dictionary. It must contain a ‘format’ key. In addition, it must contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties.
- validate(str_in: str) None [source]¶
Validates an entry in the field.
Raises InvalidEntryError iff the entry is invalid.
An entry is invalid iff (1) the string does not represent a date in the correct format; or (2) the date it represents is invalid (such as 30 February).
- Parameters
str_in (str) – String to validate.
- Raises
InvalidEntryError – Iff entry is invalid.
ValueError – When self.format is unrecognised.
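The validate-then-normalise behaviour for dates can be sketched with strptime/strftime. This is an illustration only (the real DateSpec also handles missing values and encoding); the OUTPUT_FORMAT ‘%Y%m%d’ is the one documented above:

```python
from datetime import datetime

def format_date_sketch(str_in, fmt="%Y/%m/%d"):
    """Sketch: parse a date entry against the schema's format string,
    raising ValueError for malformed input or impossible dates
    (e.g. 30 February), then normalise it to '%Y%m%d' for hashing."""
    return datetime.strptime(str_in, fmt).strftime("%Y%m%d")
```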
- class clkhash.field_formats.EnumSpec(identifier: str, hashing_properties: clkhash.field_formats.FieldHashingProperties, values: Iterable[str], description: Optional[str] = None)[source]¶
Bases:
clkhash.field_formats.FieldSpec
Represents a field that holds an enum.
The finite collection of permitted values must be specified.
- Variables
values – The set of permitted values.
- classmethod from_json_dict(json_dict: Dict[str, Any]) clkhash.field_formats.EnumSpec [source]¶
Make a EnumSpec object from a dictionary containing its properties.
- Parameters
json_dict (dict) – This dictionary must contain an ‘enum’ key specifying the permitted values. In addition, it must contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties.
- class clkhash.field_formats.FieldHashingProperties(comparator: clkhash.comparators.AbstractComparison, strategy: clkhash.field_formats.StrategySpec, encoding: str = 'utf-8', hash_type: str = 'blakeHash', prevent_singularity: Optional[bool] = None, missing_value: Optional[clkhash.field_formats.MissingValueSpec] = None)[source]¶
Bases:
object
Stores the settings used to hash a field.
This includes the encoding and tokenisation parameters.
- Variables
comparator (AbstractComparison) – provides a tokenizer for desired comparison strategy
encoding (str) – The encoding to use when converting the string to bytes. Refer to Python’s documentation for possible values.
hash_type (str) – hash function to use for hashing
prevent_singularity (bool) – the ‘doubleHash’ function has a singularity problem
num_bits (int) – dynamic k = num_bits / number of n-grams
k (int) – max number of bits per n-gram
missing_value (MissingValueSpec) – specifies how to handle missing values
- class clkhash.field_formats.FieldSpec(identifier: str, hashing_properties: Optional[clkhash.field_formats.FieldHashingProperties], description: Optional[str] = None)[source]¶
Bases:
object
Abstract base class representing the specification of a column in the dataset. Subclasses validate entries, and modify the hashing_properties ivar to customise hashing procedures.
- Variables
identifier (str) – The name of the field.
description (str) – Description of the field format.
hashing_properties (FieldHashingProperties) – The properties for hashing. None if field ignored.
- format_value(str_in: str) str [source]¶
formats the value ‘str_in’ for hashing according to this field’s spec.
There are several reasons why this might be necessary:
This field contains missing values which have to be replaced by some other string
There are several different ways to describe a specific value for this field, e.g.: all of ‘+65’, ‘ 65’, ‘65’ are valid representations of the integer 65.
Entries of this field might contain elements with no entropy, e.g. dates might be formatted as yyyy-mm-dd, thus all dates will have ‘-’ at the same place. These artifacts have no value for entity resolution and should be removed.
- Parameters
str_in (str) – the string to format
- Returns
a string representation of ‘str_in’ which is ready to be hashed
- classmethod from_json_dict(field_dict: Dict[str, Any]) clkhash.field_formats.FieldSpec [source]¶
Initialise a FieldSpec object from a dictionary of properties.
- Parameters
field_dict (dict) – The properties dictionary to use. Must contain a ‘hashing’ key that meets the requirements of FieldHashingProperties.
- Raises
InvalidSchemaError – When the properties dictionary contains invalid values. Exactly what that means is decided by the subclasses.
- is_missing_value(str_in: str) bool [source]¶
tests if ‘str_in’ is the sentinel value for this field
- Parameters
str_in (str) – String to test if it stands for missing value
- Returns
True if a missing value is defined for this field and str_in matches this value
- abstract validate(str_in: str) None [source]¶
Validates an entry in the field.
Raises InvalidEntryError iff the entry is invalid.
Subclasses must override this method with their own validation. They should call the parent’s validate method via super.
- Parameters
str_in (str) – String to validate.
- Raises
InvalidEntryError – When entry is invalid.
- class clkhash.field_formats.Ignore(identifier: Optional[str] = None)[source]¶
Bases:
clkhash.field_formats.FieldSpec
Represents a field which will be ignored throughout the CLK processing.
- validate(str_in: str)[source]¶
Validates an entry in the field.
Raises InvalidEntryError iff the entry is invalid.
Subclasses must override this method with their own validation. They should call the parent’s validate method via super.
- Parameters
str_in (str) – String to validate.
- Raises
InvalidEntryError – When entry is invalid.
- class clkhash.field_formats.IntegerSpec(identifier: str, hashing_properties: clkhash.field_formats.FieldHashingProperties, description: Optional[str] = None, minimum: Optional[int] = None, maximum: Optional[int] = None, **kwargs: Dict[str, Any])[source]¶
Bases:
clkhash.field_formats.FieldSpec
Represents a field that holds integers.
Minimum and maximum values may be specified.
- Variables
- classmethod from_json_dict(json_dict: Dict[str, Any]) clkhash.field_formats.IntegerSpec [source]¶
Make a IntegerSpec object from a dictionary containing its properties.
- Parameters
json_dict (dict) – The properties dictionary. It may contain ‘minimum’ and ‘maximum’ keys. In addition, it must contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties.
- validate(str_in: str) None [source]¶
Validates an entry in the field.
Raises InvalidEntryError iff the entry is invalid.
An entry is invalid iff (1) the string does not represent a base-10 integer; (2) the integer is not between self.minimum and self.maximum, if those exist; or (3) the integer is negative.
- Parameters
str_in (str) – String to validate.
- Raises
InvalidEntryError – When entry is invalid.
- exception clkhash.field_formats.InvalidEntryError[source]¶
Bases:
ValueError
An entry in the data file does not conform to the schema.
- field_spec = None¶
- exception clkhash.field_formats.InvalidSchemaError[source]¶
Bases:
ValueError
Raised if the schema of a field specification is invalid.
For example, a regular expression included in the schema is not syntactically correct.
- field_spec_index = None¶
- json_field_spec = None¶
- class clkhash.field_formats.MissingValueSpec(sentinel: str, replace_with: Optional[str] = None)[source]¶
Bases:
object
Stores the information about how to find and treat missing values.
- Variables
- class clkhash.field_formats.StrategySpec[source]¶
Bases:
object
Stores the information about the insertion strategy.
A strategy has to implement the ‘bits_per_token’ function, which defines how often each token gets inserted into the Bloom filter.
- abstract bits_per_token(num_tokens: int) List[int] [source]¶
Return a list of integers, one for each of the num_tokens tokens, defining how often that token gets inserted into the Bloom filter.
- Parameters
num_tokens (int) – number of tokens in the feature’s value
- Returns
[ k, … ] with k’s >= 0
- classmethod from_json_dict(json_dict: Dict[str, Union[str, SupportsInt]]) clkhash.field_formats.StrategySpec [source]¶
- class clkhash.field_formats.StringSpec(identifier: str, hashing_properties: clkhash.field_formats.FieldHashingProperties, description: Optional[str] = None, regex: Optional[str] = None, case: str = 'mixed', min_length: int = 0, max_length: Optional[int] = None)[source]¶
Bases:
clkhash.field_formats.FieldSpec
Represents a field that holds strings.
One way to specify the format of the entries is to provide a regular expression that they must conform to. Another is to provide zero or more of: minimum length, maximum length, casing (lower, upper, mixed).
Each string field also specifies an encoding used when turning characters into bytes. This is stored in hashing_properties since it is needed for hashing.
- Variables
encoding (str) –
The encoding to use when converting the string to bytes. Refer to Python’s documentation for possible values.
regex – Compiled regular expression that entries must conform to. Present only if the specification is regex- based.
case (str) – The casing of the entries. One of ‘lower’, ‘upper’, or ‘mixed’. Default is ‘mixed’. Present only if the specification is not regex-based.
min_length (int) – The minimum length of the string. None if there is no minimum length. Present only if the specification is not regex-based.
max_length (int) – The maximum length of the string. None if there is no maximum length. Present only if the specification is not regex-based.
- classmethod from_json_dict(json_dict: Dict[str, Any]) clkhash.field_formats.StringSpec [source]¶
Make a StringSpec object from a dictionary containing its properties.
- Parameters
json_dict (dict) – This dictionary must contain an ‘encoding’ key associated with a Python-conformant encoding. It must also contain a ‘hashing’ key, whose contents are passed to
FieldHashingProperties
. Permitted keys also include ‘pattern’, ‘case’, ‘minLength’, and ‘maxLength’.
- Raises
InvalidSchemaError – When a regular expression is provided but is not a valid pattern.
- validate(str_in: str) None [source]¶
Validates an entry in the field.
Raises InvalidEntryError iff the entry is invalid.
An entry is invalid iff (1) a pattern is part of the specification of the field and the string does not match it; (2) the string does not match the provided casing, minimum length, or maximum length; or (3) the specified encoding cannot represent the string.
- Parameters
str_in (str) – String to validate.
- Raises
InvalidEntryError – When entry is invalid.
ValueError – When self.case is not one of the permitted values (‘lower’, ‘upper’, or ‘mixed’).
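The non-regex validation rules can be sketched as follows (a simplified re-implementation for illustration; the parameters mirror the case, min_length, max_length, and encoding attributes described above):

```python
def validate_string(str_in, case='mixed', min_length=0,
                    max_length=None, encoding='utf-8'):
    """Raise ValueError if str_in violates casing, length, or encoding rules."""
    if case not in ('lower', 'upper', 'mixed'):
        raise ValueError(f'invalid case spec: {case!r}')
    if case == 'lower' and str_in != str_in.lower():
        raise ValueError('expected lowercase')
    if case == 'upper' and str_in != str_in.upper():
        raise ValueError('expected uppercase')
    if len(str_in) < min_length:
        raise ValueError('too short')
    if max_length is not None and len(str_in) > max_length:
        raise ValueError('too long')
    try:
        str_in.encode(encoding)  # (3) encoding must represent the string
    except UnicodeEncodeError:
        raise ValueError('cannot encode with ' + encoding)
```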
- clkhash.field_formats.fhp_from_json_dict(json_dict: Dict[str, Any]) clkhash.field_formats.FieldHashingProperties [source]¶
Make a
FieldHashingProperties
object from a dictionary.
- Parameters
json_dict (dict) – Conforming to the hashingConfig definition in the v2 linkage schema.
- Returns
A
FieldHashingProperties
instance.
- clkhash.field_formats.spec_from_json_dict(json_dict: Dict[str, Any]) clkhash.field_formats.FieldSpec [source]¶
Turns a dictionary into the appropriate FieldSpec object.
- Parameters
json_dict (dict) – A dictionary with properties.
- Raises
- Returns
An initialised instance of the appropriate FieldSpec subclass.
comparators¶
- class clkhash.comparators.AbstractComparison[source]¶
Bases:
object
Abstract base class for all comparisons.
- abstract tokenize(word: str) Iterable[str] [source]¶
The tokenization function.
Takes a string and returns an iterable of tokens (as strings). This should be implemented in a way that the intersection of two sets of tokens produced by this function approximates the desired comparison criteria.
- Parameters
word – The string to tokenize.
- Returns
Iterable of tokens.
- class clkhash.comparators.ExactComparison[source]¶
Bases:
clkhash.comparators.AbstractComparison
Enables exact comparisons.
High similarity score if inputs are identical, low otherwise.
Internally, this is done by treating the whole input as one token. Thus, if you have chosen the ‘bitsPerToken’ strategy for hashing, you might want to adjust the value such that the corresponding feature gets an appropriate representation in the filter.
- tokenize(word: str) Iterable[str] [source]¶
The tokenization function.
Takes a string and returns an iterable of tokens (as strings). This should be implemented in a way that the intersection of two sets of tokens produced by this function approximates the desired comparison criteria.
- Parameters
word – The string to tokenize.
- Returns
Iterable of tokens.
- class clkhash.comparators.NgramComparison(n: int, positional: Optional[bool] = False)[source]¶
Bases:
clkhash.comparators.AbstractComparison
Enables ‘n’-gram comparison for approximate string matching. An n-gram is a contiguous sequence of n items from a given text.
For example: the 2-grams of ‘clkhash’ are ‘ c’, ‘cl’, ‘lk’, ‘kh’, ‘ha’, ‘as’, ‘sh’, ‘h ‘. Note the whitespace in the first and last tokens. It serves two purposes: (a) it marks the beginning and end of a word, and (b) it gives every character in the input text a representation in two tokens.
n-gram comparison of strings is tolerant of spelling mistakes; e.g., the strings ‘clkhash’ and ‘clkhush’ have 6 out of 8 2-grams in common. A single wrong character affects n n-grams, so the larger you choose n, the further the error propagates.
A positional n-gram also encodes the position of the n-gram within the word. The positional 2-grams of ‘clkhash’ are ‘1 c’, ‘2 cl’, ‘3 lk’, ‘4 kh’, ‘5 ha’, ‘6 as’, ‘7 sh’, ‘8 h ‘. Positional n-grams can be useful for comparing words where the position of the characters are important, e.g., postcodes or phone numbers.
- Variables
n – the n in n-gram, non-negative integer
positional – enables positional n-gram tokenization
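The tokenization described above can be sketched as follows (a minimal re-implementation for illustration, not clkhash’s own code; it assumes the word is padded with n-1 spaces on each side, matching the 2-gram example above):

```python
def ngram_tokenize(word, n, positional=False):
    """Produce the n-grams of word, padded with n-1 spaces on each side.
    With positional=True, each token is prefixed with its 1-based position."""
    padded = ' ' * (n - 1) + word + ' ' * (n - 1)
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    if positional:
        return [f'{i + 1} {g}' for i, g in enumerate(grams)]
    return grams

# 'clkhash' vs 'clkhush': 6 of the 8 2-grams coincide
common = set(ngram_tokenize('clkhash', 2)) & set(ngram_tokenize('clkhush', 2))
```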
- class clkhash.comparators.NonComparison[source]¶
Bases:
clkhash.comparators.AbstractComparison
Non comparison.
- tokenize(word: str) Iterable[str] [source]¶
Null tokenizer returns empty Iterable.
FieldSpec Ignore has hashing_properties = None, and get_tokenizer has to return something for this case, even though it’s never called. An alternative would be to use an Optional[Callable].
- Parameters
word – not used
- Returns
empty Iterable
- class clkhash.comparators.NumericComparison(threshold_distance: float, resolution: int, fractional_precision: int = 0)[source]¶
Bases:
clkhash.comparators.AbstractComparison
Enables numerical comparisons of integers or floating point numbers.
The numerical distance between two numbers relates to the similarity of the tokens produced by this comparison class. We implemented the idea of Vatsalan and Christen (Privacy-preserving matching of similar patients, Journal of Biomedical Informatics, 2015).
The basic idea is to encode a number’s neighbourhood such that the neighbourhoods of close numbers overlap. For example, the neighbourhood of x=21 is 19, 20, 21, 22, 23, and the neighbourhood of y=23 is 21, 22, 23, 24, 25. These two neighbourhoods share three elements. The overlap of the neighbourhoods of two numbers increases the closer the numbers are to each other.
There are two parameters to control the overlap.
- threshold_distance: the maximum distance that leads to a non-empty overlap. Neighbourhoods for points which are further apart have no elements in common. (*)
- resolution: controls how many tokens are generated (the b in the paper). Given an interval of size threshold_distance, we create ‘resolution’ tokens to either side of the mid-point plus one token for the mid-point. Thus, 2 * resolution + 1 tokens in total. A higher resolution differentiates better between different values, but should be chosen such that it plays nicely with the overall Bloom filter size and insertion strategy.
(*) the reality is a bit more tricky. We first have to quantize the inputs to multiples of threshold_distance / (2 * resolution), in order to get comparable neighbourhoods. For example, if we choose a threshold_distance of 8 and a resolution of 2, then, without quantization, the neighbourhood of x=25 would be [21, 23, 25, 27, 29] and for y=26 [22, 24, 26, 28, 30], resulting in no overlap. The quantization ensures that the inputs are mapped onto a common grid. In our example, the values would be quantized to even numbers (multiples of 8 / (2 * 2) = 2). Thus x=25 would be mapped to 26. The quantization has the side effect that sometimes two values which are further than threshold_distance but not more than threshold_distance + 1/2 quantization level apart can share a common token. For instance, a=24.99 would be mapped to 24 with a neighbourhood of [20, 22, 24, 26, 28], while b=16 has the neighbourhood [12, 14, 16, 18, 20]; the two share the token 20 even though the values are 8.99 apart.
We produce the output tokens based on the neighbourhood in the following way. Instead of creating a neighbourhood around the quantized input with values dist_interval = threshold_distance / (2 * resolution) apart, we instead multiply all values by (2 * resolution). This saves the division, which can introduce numerical inaccuracies. Thus, the tokens for x=25 are [88, 96, 104, 112, 120].
We deal with floating point numbers by quantizing them to integers: we multiply them by 10 ** fractional_precision and then round to the nearest integer.
Thus, we don’t support the full range of floats, but only the subset between 2.2250738585072014e-(308 - fractional_precision - log(resolution, 10)) and 1.7976931348623157e+(308 - fractional_precision - log(resolution, 10)).
- Variables
threshold_distance – maximum detectable distance. Points that are further apart won’t have tokens in common.
resolution – controls the amount of generated tokens. Total number of tokens will be 2 * resolution + 1
fractional_precision – number of digits after the point to be considered
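The token generation described above can be sketched as follows (a simplified re-implementation for illustration, not clkhash’s own code; the float quantization via fractional_precision is omitted, and quantization ties are rounded up to match the x=25 → 26 example above):

```python
import math

def numeric_tokens(value, threshold_distance, resolution):
    """Tokens for value's neighbourhood on the scaled grid.
    All values are multiplied by 2 * resolution to avoid division."""
    scaled = value * 2 * resolution
    # on the scaled axis, neighbouring grid points are threshold_distance apart;
    # quantize to the grid, rounding ties up (25 -> 26 in the example above)
    base = math.floor(scaled / threshold_distance + 0.5) * threshold_distance
    return [str(base + i * threshold_distance)
            for i in range(-resolution, resolution + 1)]
```

For x=25 with threshold_distance=8 and resolution=2 this reproduces the tokens [88, 96, 104, 112, 120] from the explanation above.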
- tokenize(word: str) Iterable[str] [source]¶
The tokenization function.
Takes a string and returns an iterable of tokens (as strings). This should be implemented in a way that the intersection of two sets of tokens produced by this function approximates the desired comparison criteria.
- Parameters
word – The string to tokenize.
- Returns
Iterable of tokens.
- clkhash.comparators.get_comparator(comp_desc: Dict[str, Any]) clkhash.comparators.AbstractComparison [source]¶
Creates the comparator as defined in the schema. A comparator provides a tokenization method suitable for that type of comparison.
This function takes a dictionary, containing the schema definition. It returns a subclass of AbstractComparison.
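A hypothetical sketch of this kind of dispatch, using local stand-in classes (the schema key names and comparison type values here are assumptions, not clkhash’s actual schema):

```python
class NgramSketch:
    """Stand-in for an n-gram comparator."""
    def __init__(self, n, positional=False):
        self.n, self.positional = n, positional

class ExactSketch:
    """Stand-in for an exact comparator."""

def get_comparator_sketch(comp_desc):
    """Dispatch on the 'type' key of a comparison description (assumed key)."""
    typ = comp_desc.get('type')
    if typ == 'ngram':
        return NgramSketch(n=int(comp_desc['n']),
                           positional=bool(comp_desc.get('positional', False)))
    if typ == 'exact':
        return ExactSketch()
    raise ValueError(f'unknown comparison type: {typ!r}')
```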