API Documentation

Bloom filter

Generate a Bloom filter

clkhash.bloomfilter.blake_encode_ngrams(ngrams, keys, ks, l, encoding)[source]

Computes the encoding of the ngrams using the BLAKE2 hash function.

We deliberately do not use the double hashing scheme as proposed in [Schnell2011], because this would introduce an exploitable structure into the Bloom filter. For more details on the weakness, see [Kroll2015].

In short, the double hashing scheme only allows for \(l^2\) different encodings for any possible n-gram, whereas the use of \(k\) different independent hash functions gives you \(\sum_{j=1}^{k}{\binom{l}{j}}\) combinations.

Our construction

It is advantageous to construct Bloom filters using a family of hash functions with the property of k-independence to compute the indices for an entry. This approach minimises the chance of collisions.

Informally, a family of hash functions is k-independent if, for a function selected at random from the family, the hash codes of any k designated keys are independent random variables.

Our construction utilises the fact that the output bits of a cryptographic hash function are uniformly distributed, independent, binary random variables (or at least as close to that as practical; see [Kaminsky2011] for an analysis). Thus, slicing the output of a cryptographic hash function into k different slices gives you k independent random variables.

We chose Blake2 as the cryptographic hash function mainly for two reasons:

  • it is fast.
  • in keyed hashing mode, Blake2 provides MACs with just one hash function call instead of the two calls in the HMAC construction used in the double hashing scheme.
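The slicing idea can be sketched as follows. This is an illustrative example only, not clkhash's exact implementation; the function name and the 4-byte slice width are assumptions:

```python
import hashlib

def blake2_indices(ngram, key, k, l):
    """Sketch: slice a keyed BLAKE2b digest into k chunks and
    turn each chunk into one Bloom-filter index in [0, l)."""
    assert l & (l - 1) == 0, 'l must be a power of 2'
    digest = hashlib.blake2b(ngram.encode('utf-8'), key=key).digest()  # 64 bytes
    width = 4  # bytes per slice; supports k up to 16 here
    return [int.from_bytes(digest[i * width:(i + 1) * width], 'big') % l
            for i in range(k)]
```

Because the digest bits are (close to) independent, the k slices behave as k independent hash values.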

Warning

Please be aware that, although this construction makes the attack of [Kroll2015] infeasible, it is most likely not enough to ensure security. Or in their own words:

However, we think that using independent hash functions alone will not be sufficient to ensure security, since in this case other approaches (maybe related to or at least inspired through work from the area of Frequent Itemset Mining) are promising to detect at least the most frequent atoms automatically.
Parameters:
  • ngrams – list of n-grams to be encoded
  • keys – secret key for blake2 as bytes
  • ks – ks[i] is the k value to use for ngrams[i]
  • l – length of the output bitarray (has to be a power of 2)
  • encoding – the encoding to use when turning the ngrams to bytes
Returns:

bitarray of length l with the bits set which correspond to the encoding of the ngrams

clkhash.bloomfilter.crypto_bloom_filter(record, tokenizers, schema, keys)[source]

Computes the composite Bloom filter encoding of a record.

Using the method from http://www.record-linkage.de/-download=wp-grlc-2011-02.pdf

Parameters:
  • record – plaintext record tuple. E.g. (index, name, dob, gender)
  • tokenizers – A list of tokenizers. A tokenizer is a function that returns tokens from a string.
  • schema – Schema
  • keys – Keys for the hash functions as a tuple of lists of bytes.
Returns:

3-tuple:

  • Bloom filter for the record as a bitarray
  • first element of record (usually an index)
  • number of bits set in the Bloom filter

clkhash.bloomfilter.double_hash_encode_ngrams(ngrams, keys, ks, l, encoding)[source]

Computes the double hash encoding of the ngrams with the given keys.

Using the method from: Schnell, R., Bachteler, T., & Reiher, J. (2011). A Novel Error-Tolerant Anonymous Linking Code. http://grlc.german-microsimulation.de/wp-content/uploads/2017/05/downloadwp-grlc-2011-02.pdf
Parameters:
  • ngrams – list of n-grams to be encoded
  • keys – hmac secret keys for md5 and sha1 as bytes
  • ks – ks[i] is the k value to use for ngrams[i]
  • l – length of the output bitarray
  • encoding – the encoding to use when turning the ngrams to bytes
Returns:

bitarray of length l with the bits set which correspond to the encoding of the ngrams
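The scheme can be sketched as follows (illustrative only; the function name is an assumption, and clkhash's real implementation additionally handles encodings and bitarrays):

```python
import hashlib
import hmac

def double_hash_indices(ngram, key_sha1, key_md5, k, l):
    # Sketch of the double hashing scheme: g_i(x) = (h1(x) + i*h2(x)) mod l,
    # with h1 an HMAC-SHA1 and h2 an HMAC-MD5 of the n-gram.
    h1 = int(hmac.new(key_sha1, ngram.encode('utf-8'), hashlib.sha1).hexdigest(), 16)
    h2 = int(hmac.new(key_md5, ngram.encode('utf-8'), hashlib.md5).hexdigest(), 16)
    return [(h1 + i * h2) % l for i in range(k)]
```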

clkhash.bloomfilter.double_hash_encode_ngrams_non_singular(ngrams, keys, ks, l, encoding)[source]

Computes the double hash encoding of the n-grams with the given keys.

The original construction of [Schnell2011] displays an abnormality for certain inputs:

An n-gram can be encoded into just one bit, irrespective of the value of k.

Their construction goes as follows: the \(k\) different indices \(g_i\) of the Bloom filter for an n-gram \(x\) are defined as:

\[g_{i}(x) = (h_1(x) + i h_2(x)) \mod l\]

with \(0 \leq i < k\), where \(l\) is the length of the Bloom filter. If the output of the second hash function for \(x\) is a multiple of \(l\), then

\[h_2(x) = 0 \mod l\]

and thus

\[g_i(x) = h_1(x) \mod l,\]

irrespective of the value \(i\). A discussion of this potential flaw can be found here.
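The collapse is easy to demonstrate numerically (the concrete values here are hypothetical):

```python
# If h2(x) is a multiple of l, every index collapses to h1(x) mod l:
l = 1024
h1, h2 = 12345, 4 * l          # h2 is a multiple of l
indices = {(h1 + i * h2) % l for i in range(30)}
assert indices == {h1 % l}     # a single bit set, irrespective of k
```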

Parameters:
  • ngrams – list of n-grams to be encoded
  • keys – tuple with (key_sha1, key_md5). That is, (hmac secret keys for sha1 as bytes, hmac secret keys for md5 as bytes)
  • ks – ks[i] is the k value to use for ngrams[i]
  • l – length of the output bitarray
  • encoding – the encoding to use when turning the ngrams to bytes
Returns:

bitarray of length l with the bits set which correspond to the encoding of the ngrams

clkhash.bloomfilter.fold_xor(bloomfilter, folds)[source]

Performs XOR folding on a Bloom filter.

If the length of the original Bloom filter is n and we perform r folds, then the length of the resulting filter is n / 2 ** r.

Parameters:
  • bloomfilter – Bloom filter to fold
  • folds – number of folds
Returns:

folded bloom filter
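XOR folding halves the filter by XOR-ing its two halves together, once per fold. A minimal sketch on a plain list of bits (clkhash operates on a bitarray; the function name is an assumption):

```python
def fold_xor_bits(bits, folds):
    # Each fold XORs the first half of the filter with the second half,
    # halving the length; r folds shrink length n to n / 2**r.
    for _ in range(folds):
        half = len(bits) // 2
        bits = [a ^ b for a, b in zip(bits[:half], bits[half:])]
    return bits
```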

clkhash.bloomfilter.hashing_function_from_properties(fhp)[source]

Get the hashing function for this field.

Parameters:

fhp – hashing properties for this field

Returns:

the hashing function

clkhash.bloomfilter.int_from_bytes()

Return the integer represented by the given array of bytes.

bytes
Holds the array of bytes to convert. The argument must either support the buffer protocol or be an iterable object producing bytes. Bytes and bytearray are examples of built-in objects that support the buffer protocol.
byteorder
The byte order used to represent the integer. If byteorder is ‘big’, the most significant byte is at the beginning of the byte array. If byteorder is ‘little’, the most significant byte is at the end of the byte array. To request the native byte order of the host system, use ‘sys.byteorder’ as the byte order value.
signed
Indicates whether two’s complement is used to represent the integer.
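This mirrors the behaviour of the built-in int.from_bytes; for example:

```python
# Big-endian: most significant byte first.
assert int.from_bytes(b'\x01\x00', 'big') == 256
# Little-endian: most significant byte last.
assert int.from_bytes(b'\x01\x00', 'little') == 1
# Two's complement when signed=True.
assert int.from_bytes(b'\xff', 'big', signed=True) == -1
```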
clkhash.bloomfilter.stream_bloom_filters(dataset, keys, schema)[source]

Compute composite Bloom filters (CLKs) for every record in an iterable dataset.

Parameters:
  • dataset – An iterable of indexable records.
  • schema – An instantiated Schema instance
  • keys – A tuple of two lists of secret keys used in the HMAC.
Returns:

Generator yielding bloom filters as 3-tuples

CLK

Generate CLK from data.

clkhash.clk.chunks(seq, chunk_size)[source]

Split seq into chunk_size-sized chunks.

Parameters:
  • seq – A sequence to chunk.
  • chunk_size – The size of chunk.
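A sketch of the chunking behaviour (the last chunk may be shorter than chunk_size; the function name is an assumption):

```python
def chunks_sketch(seq, chunk_size):
    # Yield successive chunk_size-sized slices; the final one may be short.
    for i in range(0, len(seq), chunk_size):
        yield seq[i:i + chunk_size]
```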
clkhash.clk.generate_clk_from_csv(input_f, keys, schema, validate=True, header=True, progress_bar=True)[source]

Generate Bloom filters from CSV file, then serialise them.

This function also computes and outputs the Hamming weight (a.k.a. popcount – the number of bits set to one) of the generated Bloom filters.

Parameters:
  • input_f – A file-like object of csv data to hash.
  • keys – A tuple of two lists of secret keys.
  • schema – Schema specifying the record formats and hashing settings.
  • validate – Set to False to disable validation of data against the schema. Note that this will silence warnings whose aim is to keep the hashes consistent between data sources; this may affect linkage accuracy.
  • header – Set to False if the CSV file does not have a header. Set to ‘ignore’ if the CSV file does have a header but it should not be checked against the schema.
  • progress_bar (bool) – Set to False to disable the progress bar.
Returns:

A list of serialized Bloom filters and a list of corresponding popcounts.

clkhash.clk.generate_clks(pii_data, schema, keys, validate=True, callback=None)[source]
clkhash.clk.hash_and_serialize_chunk(chunk_pii_data, keys, schema)[source]

Generate Bloom filters (i.e. hash) from chunks of PII, then serialize the generated Bloom filters. This function also computes and outputs the Hamming weight (or popcount) – the number of bits set to one – of the generated Bloom filters.

Parameters:
  • chunk_pii_data – An iterable of indexable records.
  • keys – A tuple of two lists of secret keys used in the HMAC.
  • schema (Schema) – Schema specifying the entry formats and hashing settings.
Returns:

A list of serialized Bloom filters and a list of corresponding popcounts

key derivation

clkhash.key_derivation.generate_key_lists(master_secrets, num_identifier, key_size=64, salt=None, info=None, kdf='HKDF', hash_algo='SHA256')[source]

Generates a derived key for each identifier for each master secret using a key derivation function (KDF).

The only supported key derivation function for now is ‘HKDF’.

The previous key usage can be reproduced by setting kdf to ‘legacy’. This is highly discouraged, as this strategy maps the same n-grams in different identifiers to the same bits in the Bloom filter and thus does not lead to good results.

Parameters:
  • master_secrets – a list of master secrets (either as bytes or strings)
  • num_identifier – the number of identifiers
  • key_size – the size of the derived keys
  • salt – salt for the KDF as bytes
  • info – optional context and application specific information as bytes
  • kdf – the key derivation function algorithm to use
  • hash_algo – the hashing algorithm to use (ignored if kdf is not ‘HKDF’)
Returns:

The derived keys. The first dimension is of size num_identifier; the second dimension has the same length as master_secrets. A key is represented as bytes.

clkhash.key_derivation.hkdf(master_secret, num_keys, hash_algo='SHA256', salt=None, info=None, key_size=64)[source]

Executes the HKDF key derivation function as described in rfc5869 to derive num_keys keys of size key_size from the master_secret.

Parameters:
  • master_secret – input keying material
  • num_keys – the number of keys the kdf should produce
  • hash_algo – The hash function used by HKDF for the internal HMAC calls. The choice of hash function defines the maximum length of the output key material. Output bytes <= 255 * hash digest size (in bytes).
  • salt –

    HKDF is defined to operate with and without random salt. This is done to accommodate applications where a salt value is not available. We stress, however, that the use of salt adds significantly to the strength of HKDF, ensuring independence between different uses of the hash function, supporting “source-independent” extraction, and strengthening the analytical results that back the HKDF design.

    Random salt differs fundamentally from the initial keying material in two ways: it is non-secret and can be re-used. Ideally, the salt value is a random (or pseudorandom) string of the length HashLen. Yet, even a salt value of less quality (shorter in size or with limited entropy) may still make a significant contribution to the security of the output keying material.

  • info – While the ‘info’ value is optional in the definition of HKDF, it is often of great importance in applications. Its main objective is to bind the derived key material to application- and context-specific information. For example, ‘info’ may contain a protocol number, algorithm identifiers, user identities, etc. In particular, it may prevent the derivation of the same keying material for different contexts (when the same input key material (IKM) is used in such different contexts). It may also accommodate additional inputs to the key expansion part, if so desired (e.g., an application may want to bind the key material to its length L, thus making L part of the ‘info’ field). There is one technical requirement from ‘info’: it should be independent of the input key material value IKM.
  • key_size – the size of the produced keys
Returns:

Derived keys
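The extract-then-expand structure of HKDF (RFC 5869) can be sketched with SHA-256. This is a minimal illustration, not clkhash's implementation; the function name is an assumption:

```python
import hashlib
import hmac

def hkdf_sha256_sketch(master_secret, num_keys, key_size=64, salt=None, info=b''):
    # Minimal HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it.
    hash_len = 32  # SHA-256 digest size
    # Extract: PRK = HMAC(salt, IKM); an absent salt defaults to zeros.
    prk = hmac.new(salt or b'\x00' * hash_len, master_secret, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) | info | i), concatenated into OKM.
    okm, t = b'', b''
    n = -(-(num_keys * key_size) // hash_len)  # ceiling division; must be <= 255
    for i in range(1, n + 1):
        t = hmac.new(prk, t + info + bytes([i]), hashlib.sha256).digest()
        okm += t
    return [okm[i * key_size:(i + 1) * key_size] for i in range(num_keys)]
```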

random names

Module to produce a dataset of names, genders and dates of birth, and to manipulate that list.

Names and ages are based on Australian and USA census data, but are not correlated. Additional functions manipulate the list of names, producing reordered and subset lists with a specified overlap.

ClassList class - generate a list of length n of [id, name, dob, gender] lists

TODO: Generate realistic errors
TODO: Add RESTful API to generate reasonable name data as requested

class clkhash.randomnames.Distribution(resource_name)[source]

Bases: object

Creates a random value generator with a weighted distribution

generate()[source]

Generates a random value, weighted by the known distribution

load_csv_data(resource_name)[source]

Loads the first two columns of the specified CSV file from package data. The first column represents the value and the second column represents the count in the population.

class clkhash.randomnames.NameList(n)[source]

Bases: object

Randomly generated PII records.

SCHEMA = <Schema (v2): 4 fields>
generate_random_person(n)[source]

Generator that yields details on a person with plausible name, sex and age.

Yields:Generated data for one person as a tuple – (id: str, name: str(‘First Last’), birthdate: str(‘DD/MM/YYYY’), sex: str(‘M’ | ‘F’))
generate_subsets(sz, overlap=0.8, subsets=2)[source]

Return random subsets with nonempty intersection.

The random subsets are of specified size. If an element is common to two subsets, then it is common to all subsets. This overlap is controlled by a parameter.

Parameters:
  • sz – size of subsets to generate
  • overlap – size of the intersection, as fraction of the subset length
  • subsets – number of subsets to generate
Raises:

ValueError – if there aren’t sufficiently many names in the list to satisfy the request; more precisely, raises if (1 - subsets) * floor(overlap * sz) + subsets * sz > len(self.names).

Returns:

tuple of subsets
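The overlap logic can be sketched as follows: every subset shares the same floor(overlap * sz) common elements, and the remainder of each subset is disjoint. The function name is an assumption and the sketch is not clkhash's exact implementation:

```python
import math
import random

def generate_subsets_sketch(names, sz, overlap=0.8, subsets=2):
    # All subsets share the same n_common elements; the rest are disjoint.
    n_common = int(math.floor(overlap * sz))
    n_unique = sz - n_common
    total = n_common + subsets * n_unique
    if total > len(names):
        raise ValueError('not enough names to satisfy the request')
    pool = random.sample(names, total)
    common, rest = pool[:n_common], pool[n_common:]
    return tuple(common + rest[i * n_unique:(i + 1) * n_unique]
                 for i in range(subsets))
```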

load_data()[source]

Loads databases from package data

Uses data files sourced from http://www.quietaffiliate.com/free-first-name-and-last-name-databases-csv-and-sql/ https://www.census.gov/topics/population/genealogy/data/2010_surnames.html https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3101.0Jun%202016

randomname_schema = {'clkConfig': {'hash': {'type': 'doubleHash'}, 'k': 30, 'kdf': {'hash': 'SHA256', 'info': 'c2NoZW1hX2V4YW1wbGU=', 'keySize': 64, 'salt': 'SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==', 'type': 'HKDF'}, 'l': 1024}, 'features': [{'identifier': 'INDEX', 'format': {'type': 'integer'}, 'hashing': {'ngram': 1, 'weight': 0}}, {'identifier': 'NAME freetext', 'format': {'type': 'string', 'encoding': 'utf-8', 'case': 'mixed', 'minLength': 3}, 'hashing': {'ngram': 2, 'weight': 0.5}}, {'identifier': 'DOB YYYY/MM/DD', 'format': {'type': 'date', 'description': 'Numbers separated by slashes, in the year, month, day order', 'format': '%Y/%m/%d'}, 'hashing': {'ngram': 1, 'positional': True}}, {'identifier': 'GENDER M or F', 'format': {'type': 'enum', 'values': ['M', 'F']}, 'hashing': {'ngram': 1, 'weight': 2}}], 'version': 1}
randomname_schema_bytes = b'{\n "version": 1,\n "clkConfig": {\n "l": 1024,\n "k": 30,\n "hash": {\n "type": "doubleHash"\n },\n "kdf": {\n "type": "HKDF",\n "hash": "SHA256",\n "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",\n "info": "c2NoZW1hX2V4YW1wbGU=",\n "keySize": 64\n }\n },\n "features": [\n {\n "identifier": "INDEX",\n "format": {\n "type": "integer"\n },\n "hashing": {\n "ngram": 1,\n "weight": 0\n }\n },\n {\n "identifier": "NAME freetext",\n "format": {\n "type": "string",\n "encoding": "utf-8",\n "case": "mixed",\n "minLength": 3\n },\n "hashing": {\n "ngram": 2,\n "weight": 0.5\n }\n },\n {\n "identifier": "DOB YYYY/MM/DD",\n "format": {\n "type": "date",\n "description": "Numbers separated by slashes, in the year, month, day order",\n "format": "%Y/%m/%d"\n },\n "hashing": {\n "ngram": 1,\n "positional": true\n }\n },\n {\n "identifier": "GENDER M or F",\n "format": {\n "type": "enum",\n "values": ["M", "F"]\n },\n "hashing": {\n "ngram": 1,\n "weight": 2\n }\n }\n ]\n}\n'
schema_types
clkhash.randomnames.random_date(year, age_distribution)[source]

Generate a random date of birth.

Parameters:
  • year – reference year for the generated date
  • age_distribution – a Distribution from which the person’s age is drawn
Returns:

a random date of birth consistent with the sampled age

clkhash.randomnames.save_csv(data, headers, file)[source]

Output generated data to file as CSV with header.

Parameters:
  • data – An iterable of tuples containing raw data.
  • headers – Iterable of feature names
  • file – A writeable stream in which to write the CSV

schema

Schema loading and validation.

exception clkhash.schema.MasterSchemaError[source]

Bases: Exception

Master schema missing? Corrupted? Otherwise surprising? This is the exception for you!

class clkhash.schema.Schema(fields, l, xor_folds=0, kdf_type='HKDF', kdf_hash='SHA256', kdf_info=None, kdf_salt=None, kdf_key_size=64)[source]

Bases: object

Linkage Schema which describes how to encode plaintext identifiers.

exception clkhash.schema.SchemaError(msg, errors=None)[source]

Bases: Exception

The user-defined schema is invalid.

clkhash.schema.convert_v1_to_v2(dict)[source]

Convert v1 schema dict to v2 schema dict.

Parameters:

dict – v1 schema dict

Returns:

v2 schema dict

clkhash.schema.from_json_dict(dct, validate=True)[source]

Create a Schema for v1 or v2 according to dct

Parameters:
  • dct – This dictionary must have a ‘features’ key specifying the columns of the dataset. It must have a ‘version’ key containing the master schema version that this schema conforms to. It must have a ‘hash’ key with all the globals.
  • validate – (default True) Raise an exception if the schema does not conform to the master schema.
Raises:

SchemaError – An exception containing details about why the schema is not valid.

Returns:

the Schema

clkhash.schema.from_json_file(schema_file, validate=True)[source]

Load a Schema object from a json file.

Parameters:
  • schema_file – A JSON file containing the schema.
  • validate – (default True) Raise an exception if the schema does not conform to the master schema.
Raises:

SchemaError – When the schema is invalid.

Returns:

the Schema
clkhash.schema.validate_schema_dict(schema)[source]

Validate the schema.

This raises iff either the schema or the master schema is invalid. If it’s successful, it returns nothing.

Parameters:

schema – The schema to validate, as parsed by json.

Raises:

  • SchemaError – When the schema is invalid.
  • MasterSchemaError – When the master schema is invalid.

field_formats

Classes that specify the requirements for each column in a dataset. They take care of validation, and produce the settings required to perform the hashing.

class clkhash.field_formats.DateSpec(identifier, hashing_properties, format, description=None)[source]

Bases: clkhash.field_formats.FieldSpec

Represents a field that holds dates.

Dates are specified as full-dates in a format that can be described as a strptime() (C89 standard) compatible format string. E.g.: the format for the standard internet format RFC3339 (e.g. 1996-12-19) is ‘%Y-%m-%d’.

Variables:format (str) – The format of the date.
OUTPUT_FORMAT = '%Y%m%d'
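For example, a date in the RFC 3339 style mentioned above can be parsed with its strptime pattern and re-rendered in OUTPUT_FORMAT:

```python
from datetime import datetime

# '%Y-%m-%d' is the strptime pattern for dates like 1996-12-19
d = datetime.strptime('1996-12-19', '%Y-%m-%d')
assert d.strftime('%Y%m%d') == '19961219'  # OUTPUT_FORMAT = '%Y%m%d'
```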
classmethod from_json_dict(json_dict)[source]

Make a DateSpec object from a dictionary containing its properties.

Parameters:json_dict (dict) – The properties dictionary. This dictionary must contain a ‘format’ key. In addition, it must contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties.
validate(str_in)[source]

Validates an entry in the field.

Raises InvalidEntryError iff the entry is invalid.

An entry is invalid iff (1) the string does not represent a date in the correct format; or (2) the date it represents is invalid (such as 30 February).

Parameters:str_in (str) – String to validate.
Raises:InvalidEntryError – When entry is invalid.
class clkhash.field_formats.EnumSpec(identifier, hashing_properties, values, description=None)[source]

Bases: clkhash.field_formats.FieldSpec

Represents a field that holds an enum.

The finite collection of permitted values must be specified.

Variables:values – The set of permitted values.
classmethod from_json_dict(json_dict)[source]

Make an EnumSpec object from a dictionary containing its properties.

Parameters:json_dict (dict) – This dictionary must contain an ‘enum’ key specifying the permitted values. In addition, it must contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties.
validate(str_in)[source]

Validates an entry in the field.

Raises InvalidEntryError iff the entry is invalid.

An entry is invalid iff it is not one of the permitted values.

Parameters:str_in (str) – String to validate.
Raises:InvalidEntryError – When entry is invalid.
class clkhash.field_formats.FieldHashingProperties(ngram, encoding='utf-8', positional=False, hash_type='blakeHash', prevent_singularity=None, num_bits=None, k=None, missing_value=None)[source]

Bases: object

Stores the settings used to hash a field.

This includes the encoding and tokenisation parameters.

Variables:
  • encoding (str) – The encoding to use when converting the string to bytes. Refer to Python’s documentation <https://docs.python.org/3/library/codecs.html#standard-encodings> for possible values.
  • ngram (int) – The n in n-gram. Possible values are 0, 1, and 2.
  • positional (bool) – Controls whether the n-grams are positional.
  • num_bits (int) – dynamic k = num_bits / number of n-grams.
  • k (int) – max number of bits per n-gram.
ks(num_ngrams)[source]

Provide a k for each ngram in the field value.

Parameters:num_ngrams – number of ngrams in the field value
Returns:[ k, … ] a k value for each of num_ngrams such that the sum is exactly num_bits
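An even split that sums exactly to num_bits can be sketched as follows (function name assumed; clkhash may differ in detail):

```python
def ks_sketch(num_bits, num_ngrams):
    # Spread num_bits over num_ngrams as evenly as possible;
    # the first `extra` entries get one extra bit so the sum is exact.
    base, extra = divmod(num_bits, num_ngrams)
    return [base + 1] * extra + [base] * (num_ngrams - extra)
```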

replace_missing_value(str_in)[source]

Returns ‘str_in’ if it is not equal to the ‘sentinel’ as defined in the missingValue section of the schema. Otherwise it returns the ‘replaceWith’ value.

Parameters:str_in
Returns:str_in or the missingValue replacement value
class clkhash.field_formats.FieldSpec(identifier, hashing_properties, description=None)[source]

Bases: object

Abstract base class representing the specification of a column in the dataset. Subclasses validate entries, and modify the hashing_properties ivar to customise hashing procedures.

Variables:
  • identifier (str) – The name of the field.
  • description (str) – Description of the field format.
  • hashing_properties (FieldHashingProperties) – The properties for hashing. None if field ignored.
format_value(str_in)[source]

Formats the value ‘str_in’ for hashing according to this field’s spec.

There are several reasons why this might be necessary:

1. This field contains missing values which have to be replaced by some other string.
2. There are several different ways to describe a specific value for this field, e.g. all of ‘+65’, ‘ 65’, ‘65’ are valid representations of the integer 65.
3. Entries of this field might contain elements with no entropy, e.g. dates might be formatted as yyyy-mm-dd, thus all dates will have ‘-’ at the same place. These artifacts have no value for entity resolution and should be removed.

Parameters:str_in (str) – the string to format
Returns:a string representation of ‘str_in’ which is ready to be hashed

classmethod from_json_dict(field_dict)[source]

Initialise a FieldSpec object from a dictionary of properties.

Parameters:field_dict (dict) – The properties dictionary to use. Must contain a ‘hashing’ key that meets the requirements of FieldHashingProperties. Subclasses may require additional keys.
Raises:InvalidSchemaError – When the properties dictionary contains invalid values. Exactly what that means is decided by the subclasses.
is_missing_value(str_in)[source]

Tests if ‘str_in’ is the sentinel value for this field.

Parameters:str_in (str) – String to test if it stands for missing value
Returns:True if a missing value is defined for this field and str_in matches this value

validate(str_in)[source]

Validates an entry in the field.

Raises InvalidEntryError iff the entry is invalid.

Subclasses must override this method with their own validation. They should call the parent’s validate method via super.

Parameters:str_in (str) – String to validate.
Raises:InvalidEntryError – When entry is invalid.
class clkhash.field_formats.Ignore(identifier=None)[source]

Bases: clkhash.field_formats.FieldSpec

Represents a field which will be ignored throughout the CLK processing.

validate(str_in)[source]

Validates an entry in the field.

Raises InvalidEntryError iff the entry is invalid.

Subclasses must override this method with their own validation. They should call the parent’s validate method via super.

Parameters:str_in (str) – String to validate.
Raises:InvalidEntryError – When entry is invalid.
class clkhash.field_formats.IntegerSpec(identifier, hashing_properties, description=None, minimum=None, maximum=None, **kwargs)[source]

Bases: clkhash.field_formats.FieldSpec

Represents a field that holds integers.

Minimum and maximum values may be specified.

Variables:
  • minimum (int) – The minimum permitted value.
  • maximum (int) – The maximum permitted value or None.
classmethod from_json_dict(json_dict)[source]

Make a IntegerSpec object from a dictionary containing its properties.

Parameters:
  • json_dict (dict) – This dictionary may contain ‘minimum’ and ‘maximum’ keys. In addition, it must contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties.
  • json_dict – The properties dictionary.
validate(str_in)[source]

Validates an entry in the field.

Raises InvalidEntryError iff the entry is invalid.

An entry is invalid iff (1) the string does not represent a base-10 integer; (2) the integer is not between self.minimum and self.maximum, if those exist; or (3) the integer is negative.

Parameters:str_in (str) – String to validate.
Raises:InvalidEntryError – When entry is invalid.
exception clkhash.field_formats.InvalidEntryError[source]

Bases: ValueError

An entry in the data file does not conform to the schema.

field_spec = None
exception clkhash.field_formats.InvalidSchemaError[source]

Bases: ValueError

Raised if the schema of a field specification is invalid.

For example, a regular expression included in the schema is not syntactically correct.

field_spec_index = None
json_field_spec = None
class clkhash.field_formats.MissingValueSpec(sentinel, replace_with=None)[source]

Bases: object

Stores the information about how to find and treat missing values.

Variables:
  • sentinel (str) – the string that identifies a missing value, e.g. ‘N/A’ or ‘’. The sentinel will not be validated against the feature format definition.
  • replace_with (str) – the string which replaces the sentinel whenever present; can be None, in which case the sentinel will not be replaced.
classmethod from_json_dict(json_dict)[source]
class clkhash.field_formats.StringSpec(identifier, hashing_properties, description=None, regex=None, case='mixed', min_length=0, max_length=None)[source]

Bases: clkhash.field_formats.FieldSpec

Represents a field that holds strings.

One way to specify the format of the entries is to provide a regular expression that they must conform to. Another is to provide zero or more of: minimum length, maximum length, casing (lower, upper, mixed).

Each string field also specifies an encoding used when turning characters into bytes. This is stored in hashing_properties since it is needed for hashing.

Variables:
  • encoding (str) – The encoding to use when converting the string to bytes. Refer to Python’s documentation for possible values.
  • regex – Compiled regular expression that entries must conform to. Present only if the specification is regex-based.
  • case (str) – The casing of the entries. One of ‘lower’, ‘upper’, or ‘mixed’. Default is ‘mixed’. Present only if the specification is not regex-based.
  • min_length (int) – The minimum length of the string. None if there is no minimum length. Present only if the specification is not regex-based.
  • max_length (int) – The maximum length of the string. None if there is no maximum length. Present only if the specification is not regex-based.
classmethod from_json_dict(json_dict)[source]

Make a StringSpec object from a dictionary containing its properties.

Parameters:json_dict (dict) – This dictionary must contain an ‘encoding’ key associated with a Python-conformant encoding. It must also contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties. Permitted keys also include ‘pattern’, ‘case’, ‘minLength’, and ‘maxLength’.
Raises:InvalidSchemaError – When a regular expression is provided but is not a valid pattern.
validate(str_in)[source]

Validates an entry in the field.

Raises InvalidEntryError iff the entry is invalid.

An entry is invalid iff (1) a pattern is part of the specification of the field and the string does not match it; (2) the string does not match the provided casing, minimum length, or maximum length; or (3) the specified encoding cannot represent the string.

Parameters:

str_in (str) – String to validate.

Raises:
  • InvalidEntryError – When entry is invalid.
  • ValueError – When self.case is not one of the permitted values (‘lower’, ‘upper’, or ‘mixed’).
clkhash.field_formats.fhp_from_json_dict(json_dict)[source]

Make a FieldHashingProperties object from a dictionary.

Parameters:json_dict (dict) – Conforming to the hashingConfig definition in the v2 linkage schema.
Returns:A FieldHashingProperties instance.
clkhash.field_formats.spec_from_json_dict(json_dict)[source]

Turns a dictionary into the appropriate FieldSpec object.

Parameters:json_dict (dict) – A dictionary with properties.
Raises:InvalidSchemaError
Returns:An initialised instance of the appropriate FieldSpec subclass.

tokenizer

Functions to tokenize words (PII)

clkhash.tokenizer.get_tokenizer(fhp)[source]

Get tokeniser function from the hash settings.

This function takes a FieldHashingProperties object. It returns a function that takes a string and tokenises based on those properties.