Hashing Schema¶
As CLKs are usually used for privacy-preserving linkage, it is important that participating organisations agree on how raw personally identifiable information is hashed to create the CLKs.
We call the configuration of how to create CLKs a hashing schema. The organisations agree on one hashing schema as configuration to ensure that their respective CLKs have been created in the same way.
This aims to be an open standard such that different client implementations could take the schema and create identical CLKs given the same data.
The hashing schema is a detailed description of exactly what is fed to the hashing operation, along with any configuration for the hashing itself.
The format of the hashing schema is defined in a separate JSON Schema document, master-schemas/v1.json.
Basic Structure¶
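A hashing schema consists of three parts:
- version, the version of the hashing schema
- clkConfig, the general configuration of the CLK construction
- features, the per-feature configuration of formats and hashing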
Example Schema¶
{
  "version": 1,
  "clkConfig": {
    "l": 1024,
    "k": 20,
    "hash": {
      "type": "doubleHash"
    },
    "kdf": {
      "type": "HKDF"
    }
  },
  "features": [
    {
      "identifier": "index",
      "ignored": true
    },
    {
      "identifier": "full name",
      "format": {
        "type": "string",
        "maxLength": 30,
        "encoding": "utf-8"
      },
      "hashing": { "ngram": 2 }
    },
    {
      "identifier": "gender",
      "format": {
        "type": "enum",
        "values": ["M", "F", "O"]
      },
      "hashing": { "ngram": 1 }
    },
    {
      "identifier": "postcode",
      "format": {
        "type": "integer",
        "minimum": 1000,
        "maximum": 9999
      },
      "hashing": {
        "ngram": 1,
        "positional": true,
        "missingValue": {
          "sentinel": "N/A",
          "replaceWith": ""
        }
      }
    }
  ]
}
A more advanced example can be found here.
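To show how a client might read such a schema, the following minimal sketch parses an abbreviated copy of the example with Python's standard json module and lists the features that are not ignored. The helper name summarise_schema is hypothetical and is not part of clkhash.

```python
import json

# Abbreviated copy of the example schema (only two features kept).
SCHEMA_JSON = """
{
  "version": 1,
  "clkConfig": {
    "l": 1024,
    "k": 20,
    "hash": {"type": "doubleHash"},
    "kdf": {"type": "HKDF"}
  },
  "features": [
    {"identifier": "index", "ignored": true},
    {"identifier": "full name",
     "format": {"type": "string", "maxLength": 30, "encoding": "utf-8"},
     "hashing": {"ngram": 2}}
  ]
}
"""


def summarise_schema(schema):
    """Return (version, CLK length, identifiers of the features that get hashed)."""
    hashed = [feature["identifier"] for feature in schema["features"]
              if not feature.get("ignored", False)]
    return schema["version"], schema["clkConfig"]["l"], hashed


schema = json.loads(SCHEMA_JSON)
version, length, hashed_features = summarise_schema(schema)
print(version, length, hashed_features)
```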
Schema Components¶
Version¶
Integer value which describes the version of the hashing schema.
clkConfig¶
Describes the general construction of the CLK.
name | type | optional | description |
---|---|---|---|
l | integer | no | the length of the CLK in bits |
k | integer | no | max number of indices per n-gram |
xorFolds | integer | yes | number of XOR folds (as proposed in [Schnell2016]). |
kdf | KDF | no | defines the key derivation function used to generate individual secrets for each feature derived from the master secret |
hash | Hash | no | defines the hashing scheme to encode the n-grams |
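As an illustration of the xorFolds parameter, the sketch below folds a bit vector in half and combines the two halves with XOR, once per fold. It is a simplified reading of XOR folding that operates on a plain list of bits, not the clkhash implementation; whether l refers to the length before or after folding is not stated here, so the sketch makes no assumption about it.

```python
def xor_fold(bits, folds):
    """Fold a list of bits in half `folds` times, XOR-ing the two halves."""
    for _ in range(folds):
        if len(bits) % 2 != 0:
            raise ValueError("the length must be even to fold")
        half = len(bits) // 2
        bits = [a ^ b for a, b in zip(bits[:half], bits[half:])]
    return bits


clk = [1, 0, 1, 1, 0, 0, 1, 0]
folded_once = xor_fold(clk, 1)   # 4 bits remain
folded_twice = xor_fold(clk, 2)  # 2 bits remain
print(folded_once, folded_twice)
```

Each fold halves the length of the encoding, so two folds reduce an 8-bit toy vector to 2 bits.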
KDF¶
We currently only support HKDF (for a basic description, see https://en.wikipedia.org/wiki/HKDF).
name | type | optional | description |
---|---|---|---|
type | string | no | must be set to “HKDF” |
hash | enum | yes | hash function used by HKDF, either “SHA256” or “SHA512” |
salt | string | yes | base64 encoded bytes |
info | string | yes | base64 encoded bytes |
keySize | integer | yes | size of the generated keys in bytes |
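The following sketch shows how the KDF parameters could fit together. It implements HKDF as specified in RFC 5869 using only the standard library, then slices the output into one key per feature. The slicing strategy, the helper name derive_feature_keys and all input values are assumptions made for illustration; they do not describe how clkhash derives its keys.

```python
import base64
import hashlib
import hmac


def hkdf(master_secret, length, salt=b"", info=b"", hash_name="SHA256"):
    """HKDF (RFC 5869): extract a pseudorandom key, then expand it."""
    name = hash_name.lower()
    digest_size = hashlib.new(name).digest_size
    if length > 255 * digest_size:
        raise ValueError("requested output is too long for HKDF")
    if not salt:
        salt = b"\x00" * digest_size  # default salt defined by RFC 5869
    prk = hmac.new(salt, master_secret, digestmod=name).digest()  # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand
        block = hmac.new(prk, block + info + bytes([counter]),
                         digestmod=name).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_feature_keys(master_secret, num_features, key_size, **kdf_args):
    """Hypothetical helper: one key of `key_size` bytes per feature."""
    okm = hkdf(master_secret, num_features * key_size, **kdf_args)
    return [okm[i * key_size:(i + 1) * key_size] for i in range(num_features)]


# 'salt' and 'info' are base64 encoded in the schema; these values are made up.
salt = base64.b64decode("c2FsdA==")
info = base64.b64decode("c2NoZW1h")
keys = derive_feature_keys(b"master secret", 3, key_size=64,
                           salt=salt, info=info, hash_name="SHA256")
print([key.hex()[:16] for key in keys])
```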
Hash¶
Describes and configures the hash that is used to encode the n-grams.
Choose one of:
- double hash, as described in [Schnell2011].
name | type | optional | description |
---|---|---|---|
type | string | no | must be set to “doubleHash” |
prevent_singularity | boolean | yes | see discussion in https://github.com/n1analytics/clkhash/issues/33 |
- blake hash
name | type | optional | description |
---|---|---|---|
type | string | no | must be set to “blakeHash” |
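The idea behind double hashing can be sketched as follows: two keyed hash values h1 and h2 are computed per n-gram, and the i-th bit position is (h1 + i * h2) mod l. The sketch is not the clkhash implementation. The use of HMAC with SHA-256 and SHA-512, the keys and the function names are assumptions chosen so that the example runs with the standard library only.

```python
import hashlib
import hmac


def double_hash_indices(ngram, key1, key2, k, l):
    """Return the bit positions (h1 + i * h2) mod l for i in 0..k-1."""
    data = ngram.encode("utf-8")
    h1 = int.from_bytes(hmac.new(key1, data, hashlib.sha256).digest(), "big")
    h2 = int.from_bytes(hmac.new(key2, data, hashlib.sha512).digest(), "big")
    # A set, because different i can map to the same position.
    return {(h1 + i * h2) % l for i in range(k)}


def encode(ngrams, key1, key2, k, l):
    """Set the bits selected by every n-gram in a bit vector of length l."""
    bits = [0] * l
    for ngram in ngrams:
        for index in double_hash_indices(ngram, key1, key2, k, l):
            bits[index] = 1
    return bits


ngrams = ["sm", "mi", "it", "th"]
bits = encode(ngrams, b"key one", b"key two", k=20, l=1024)
print(sum(bits), "of", len(bits), "bits set")
```

Because positions can collide, k is an upper bound on the number of indices per n-gram rather than an exact count.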
features¶
A feature is either described by a featureConfig or, alternatively, ignored by the clkhash library by defining an ignoreFeature section.
ignoreFeature¶
If defined, then clkhash will ignore this feature.
name | type | optional | description |
---|---|---|---|
identifier | string | no | the name of the feature |
ignored | boolean | no | has to be set to true |
description | string | yes | free text, ignored by clkhash |
featureConfig¶
A feature is configured in three parts:
- identifier, the name of the feature
- format, describes the expected format of the values of this feature
- hashing, configures the hashing
name | type | optional | description |
---|---|---|---|
identifier | string | no | the name of the feature |
description | string | yes | free text, ignored by clkhash |
hashing | hashingConfig | no | configures feature specific hashing parameters |
format | one of: textFormat, textPatternFormat, numberFormat, dateFormat, enumFormat | no | describes the expected format of the feature values |
hashingConfig¶
name | type | optional | description |
---|---|---|---|
ngram | integer | no | specifies the n in n-gram (the tokenization of the input values). |
positional | boolean | yes | adds the position to the n-grams. For example, the string “222” would be tokenized (as unigrams) to “1 2”, “2 2”, “3 2” |
weight | float | yes | positive number that adjusts the number of hash functions (k) used for encoding, giving this feature more or less importance compared to others. |
missingValue | missingValue | yes | defines how missing values are handled |
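The ngram and positional options can be illustrated with a small tokenizer. This is a sketch only: it applies no padding at the start or end of the value, which is an assumption, and the function name tokenize is hypothetical.

```python
def tokenize(value, n, positional=False):
    """Split `value` into n-grams, optionally prefixed with their 1-based position."""
    grams = [value[i:i + n] for i in range(len(value) - n + 1)]
    if positional:
        return [f"{position} {gram}"
                for position, gram in enumerate(grams, start=1)]
    return grams


positional_unigrams = tokenize("222", 1, positional=True)
bigrams = tokenize("smith", 2)
print(positional_unigrams)  # ['1 2', '2 2', '3 2']
print(bigrams)              # ['sm', 'mi', 'it', 'th']
```

Without positions, the three unigrams of “222” are identical; with positions, each occurrence yields a distinct token.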
missingValue¶
Data sets are not always complete – they can contain missing values. If specified, then clkhash will not check the format for these missing values, and will optionally replace them with the ‘replaceWith’ value. This can be useful if the data
name | type | optional | description |
---|---|---|---|
sentinel | string | no | the sentinel value indicates missing data, e.g. ‘Null’, ‘N/A’, ‘’, … |
replaceWith | string | yes | specifies the value clkhash should use instead of the sentinel value. |
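A minimal sketch of this behaviour, assuming that the sentinel is compared to the raw value by exact string equality and that the value is kept unchanged when no replaceWith is given (the helper name resolve_missing is hypothetical):

```python
def resolve_missing(value, missing_value=None):
    """Return (value_to_hash, is_missing) for a raw input value."""
    if missing_value is not None and value == missing_value["sentinel"]:
        # Missing values skip the format check; replacement is optional.
        return missing_value.get("replaceWith", value), True
    return value, False


config = {"sentinel": "N/A", "replaceWith": ""}
print(resolve_missing("N/A", config))   # ('', True)
print(resolve_missing("2000", config))  # ('2000', False)
```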
textFormat¶
name | type | optional | description |
---|---|---|---|
type | string | no | has to be “string” |
encoding | enum | yes | one of “ascii”, “utf-8”, “utf-16”, “utf-32”. Default is “utf-8”. |
case | enum | yes | one of “upper”, “lower”, “mixed”. |
minLength | integer | yes | positive integer describing the minimum length of the input string. |
maxLength | integer | yes | positive integer describing the maximum length of the input string. |
description | string | yes | free text, ignored by clkhash. |
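The constraints of a textFormat could be checked along these lines. The sketch treats an absent case entry as "no constraint" and measures length in characters; both are assumptions, and check_text is a hypothetical name.

```python
def check_text(value, fmt):
    """Return a list of human-readable violations of a textFormat."""
    errors = []
    encoding = fmt.get("encoding", "utf-8")
    try:
        value.encode(encoding)
    except UnicodeEncodeError:
        errors.append(f"not encodable as {encoding}")
    case = fmt.get("case")
    if case == "upper" and value != value.upper():
        errors.append("expected upper case")
    if case == "lower" and value != value.lower():
        errors.append("expected lower case")
    if "minLength" in fmt and len(value) < fmt["minLength"]:
        errors.append("too short")
    if "maxLength" in fmt and len(value) > fmt["maxLength"]:
        errors.append("too long")
    return errors


name_format = {"type": "string", "maxLength": 30, "encoding": "utf-8"}
print(check_text("Alice Smith", name_format))  # []
print(check_text("x" * 31, name_format))       # ['too long']
```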
textPatternFormat¶
name | type | optional | description |
---|---|---|---|
type | string | no | has to be “string” |
encoding | enum | yes | one of “ascii”, “utf-8”, “utf-16”, “utf-32”. Default is “utf-8”. |
pattern | string | no | a regular expression describing the input format. |
description | string | yes | free text, ignored by clkhash. |
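For a textPatternFormat, a check could look like the sketch below. It assumes that the pattern has to match the entire value and that Python's re syntax is acceptable for the pattern; neither is confirmed here, and the example pattern is made up.

```python
import re


def matches_pattern(value, fmt):
    """True if the whole value matches the regular expression in `fmt`."""
    return re.fullmatch(fmt["pattern"], value) is not None


# Hypothetical format: exactly four decimal digits.
four_digits = {"type": "string", "pattern": r"[0-9]{4}"}
print(matches_pattern("2000", four_digits))   # True
print(matches_pattern("20000", four_digits))  # False
```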
numberFormat¶
name | type | optional | description |
---|---|---|---|
type | string | no | has to be “integer” |
minimum | integer | yes | integer describing the lower bound of the input values. |
maximum | integer | yes | integer describing the upper bound of the input values. |
description | string | yes | free text, ignored by clkhash. |
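A numberFormat check could be sketched as follows, assuming that input values arrive as strings (for example from a CSV file) and that both bounds are inclusive. The name check_integer is hypothetical.

```python
def check_integer(value, fmt):
    """True if `value` parses as an integer within the optional bounds."""
    try:
        number = int(value)
    except ValueError:
        return False
    if "minimum" in fmt and number < fmt["minimum"]:
        return False
    if "maximum" in fmt and number > fmt["maximum"]:
        return False
    return True


postcode_format = {"type": "integer", "minimum": 1000, "maximum": 9999}
print(check_integer("2000", postcode_format))  # True
print(check_integer("999", postcode_format))   # False
```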
dateFormat¶
A date is described by an ISO C89 compatible strftime() format string. For example, the format string for the internet date format described in RFC 3339 is ‘%Y-%m-%d’. The clkhash library converts the given date to the ‘%Y%m%d’ representation for hashing, because fill characters such as ‘-’ or ‘/’ do not add to the uniqueness of an entity.
name | type | optional | description |
---|---|---|---|
type | string | no | has to be “date” |
format | string | no | ISO C89 compatible format string, e.g. for 1989-11-09 the format is ‘%Y-%m-%d’ |
description | string | yes | free text, ignored by clkhash. |
The following subset contains the most useful format codes:
directive | meaning | example |
---|---|---|
%Y | Year with century as a decimal number | 1984, 3210, 0001 |
%y | Year without century, zero-padded | 00, 09, 99 |
%m | Month as a zero-padded decimal number | 01, 12 |
%d | Day of the month, zero-padded | 01, 25, 31 |
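The conversion to the ‘%Y%m%d’ representation can be reproduced with Python's datetime module, whose strptime() and strftime() accept these directives. The sketch below illustrates the described behaviour and is not taken from clkhash; normalise_date is a hypothetical name.

```python
from datetime import datetime


def normalise_date(value, fmt):
    """Parse `value` with the schema's format string and emit '%Y%m%d'."""
    return datetime.strptime(value, fmt["format"]).strftime("%Y%m%d")


iso_format = {"type": "date", "format": "%Y-%m-%d"}
short_format = {"type": "date", "format": "%d/%m/%y"}
print(normalise_date("1989-11-09", iso_format))  # 19891109
print(normalise_date("09/11/89", short_format))  # 19891109
```

Two differently formatted inputs for the same day produce the same string, so they are hashed identically.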
enumFormat¶
name | type | optional | description |
---|---|---|---|
type | string | no | has to be “enum” |
values | array | no | an array of items of type “string” |
description | string | yes | free text, ignored by clkhash. |