Field Format API Documentation

field_formats

Classes that specify the requirements for each column in a dataset. They take care of validation, and produce the settings required to perform the hashing.

class clkhash.field_formats.DateSpec(identifier, hashing_properties, format, description=None)[source]

Bases: clkhash.field_formats.FieldSpec

Represents a field that holds dates.

Dates are specified as full-dates, as defined in RFC 3339, e.g., 1996-12-19.

Variables:format (str) – The format of the date.
classmethod from_json_dict(json_dict)[source]

Make a DateSpec object from a dictionary containing its properties.

Parameters:json_dict (dict) – The properties dictionary. It must contain a ‘format’ key. In addition, it must contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties.
validate(str_in)[source]

Validates an entry in the field.

Raises InvalidEntryError iff the entry is invalid.

An entry is invalid iff (1) the string does not represent a date in the correct format; or (2) the date it represents is invalid (such as 30 February).

Parameters:str_in (str) – String to validate.
Raises:InvalidEntryError – When entry is invalid.
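
The two invalidity conditions for dates (wrong format, impossible date) can be illustrated with a minimal sketch built on the standard library. This is not the clkhash implementation: validate_date is a hypothetical helper, the exception class is a local stand-in, and the strftime-style format string "%Y-%m-%d" is an assumption chosen to match the RFC 3339 full-date example.

```python
from datetime import datetime


class InvalidEntryError(ValueError):
    """Local stand-in for clkhash.field_formats.InvalidEntryError."""


def validate_date(str_in, date_format="%Y-%m-%d"):
    """Raise InvalidEntryError unless str_in is a real date in date_format.

    Hypothetical helper; the format string is an assumption.
    """
    try:
        # strptime raises ValueError for a malformed string and for an
        # impossible calendar date such as 30 February.
        datetime.strptime(str_in, date_format)
    except ValueError as e:
        raise InvalidEntryError(
            "{!r} is not a valid date for format {!r}".format(str_in, date_format)
        ) from e


validate_date("1996-12-19")  # valid: returns without raising

rejected = []
for entry in ("19/12/1996", "1996-02-30"):
    try:
        validate_date(entry)
    except InvalidEntryError:
        rejected.append(entry)
```

Both the wrongly formatted entry and the impossible date end up in rejected.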
class clkhash.field_formats.EnumSpec(identifier, hashing_properties, values, description=None)[source]

Bases: clkhash.field_formats.FieldSpec

Represents a field that holds an enum.

The finite collection of permitted values must be specified.

Variables:values – The set of permitted values.
classmethod from_json_dict(json_dict)[source]

Make an EnumSpec object from a dictionary containing its properties.

Parameters:json_dict (dict) – This dictionary must contain an ‘enum’ key specifying the permitted values. In addition, it must contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties.
validate(str_in)[source]

Validates an entry in the field.

Raises InvalidEntryError iff the entry is invalid.

An entry is invalid iff it is not one of the permitted values.

Parameters:str_in (str) – String to validate.
Raises:InvalidEntryError – When entry is invalid.
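
As a rough illustration of the membership rule above, the following sketch checks an entry against a set of permitted values. The helper validate_enum, the local exception class, and the example values are all hypothetical; the library's own method may differ in detail.

```python
class InvalidEntryError(ValueError):
    """Local stand-in for clkhash.field_formats.InvalidEntryError."""


def validate_enum(str_in, values):
    """Raise InvalidEntryError unless str_in is one of the permitted values."""
    if str_in not in values:
        raise InvalidEntryError(
            "{!r} is not one of the permitted values {}".format(
                str_in, sorted(values)
            )
        )


permitted = {"red", "green", "blue"}  # hypothetical permitted values

validate_enum("green", permitted)  # valid: returns without raising

try:
    validate_enum("purple", permitted)
    enum_rejected = False
except InvalidEntryError:
    enum_rejected = True
```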
class clkhash.field_formats.FieldHashingProperties(ngram, encoding='utf-8', weight=1, positional=False)[source]

Bases: object

Stores the settings used to hash a field. This includes the encoding and tokenisation parameters.

Variables:
  • encoding (str) – The encoding to use when converting the string to bytes. Refer to Python’s documentation for possible values.
  • ngram (int) – The n in n-gram. Possible values are 0, 1, and 2.
  • positional (bool) – Controls whether the n-grams are positional.
  • weight (float) – Controls the weight of the field in the Bloom filter.
classmethod from_json_dict(json_dict)[source]

Make a FieldHashingProperties object from a dictionary.

Parameters:json_dict (dict) – The dictionary must have an ‘ngram’ key. It may have ‘positional’ and ‘weight’ keys; if these are missing, then they are filled with the default values. The encoding is always set to the default value.
Returns:A FieldHashingProperties instance.
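
The default-filling behaviour described above might look roughly like the sketch below. HashingPropertiesSketch is an illustrative stand-in, not the clkhash class; its defaults are copied from the constructor signature shown earlier, and raising KeyError for a missing ‘ngram’ key is an assumption.

```python
class HashingPropertiesSketch:
    """Illustrative stand-in for FieldHashingProperties (not the real class)."""

    DEFAULT_ENCODING = "utf-8"
    DEFAULT_WEIGHT = 1
    DEFAULT_POSITIONAL = False

    def __init__(self, ngram, encoding=DEFAULT_ENCODING,
                 weight=DEFAULT_WEIGHT, positional=DEFAULT_POSITIONAL):
        self.ngram = ngram
        self.encoding = encoding
        self.weight = weight
        self.positional = positional

    @classmethod
    def from_json_dict(cls, json_dict):
        # 'ngram' is required; in this sketch a missing key raises KeyError.
        # 'weight' and 'positional' fall back to the defaults when absent.
        # The encoding is never read from the dictionary, so it always
        # keeps its default value.
        return cls(
            ngram=json_dict["ngram"],
            weight=json_dict.get("weight", cls.DEFAULT_WEIGHT),
            positional=json_dict.get("positional", cls.DEFAULT_POSITIONAL),
        )


# An 'encoding' entry in the dictionary is ignored by this sketch.
props = HashingPropertiesSketch.from_json_dict({"ngram": 2, "encoding": "ascii"})
weighted = HashingPropertiesSketch.from_json_dict(
    {"ngram": 1, "weight": 2.5, "positional": True}
)
```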
class clkhash.field_formats.FieldSpec(identifier, hashing_properties, description=None)[source]

Bases: object

Abstract base class representing the specification of a column in the dataset. Subclasses validate entries and modify the hashing_properties instance variable to customise hashing procedures.

Variables:
  • identifier (str) – The name of the field.
  • description (str) – Description of the field format.
  • hashing_properties (FieldHashingProperties) – The properties for hashing.
classmethod from_json_dict(field_dict)[source]

Initialise a FieldSpec object from a dictionary of properties.

Parameters:field_dict (dict) – The properties dictionary to use. Must contain a ‘hashing’ key that meets the requirements of FieldHashingProperties. Subclasses may require additional keys.
Raises:InvalidSchemaError – When the properties dictionary contains invalid values. Exactly what that means is decided by the subclasses.
validate(str_in)[source]

Validates an entry in the field.

Raises InvalidEntryError iff the entry is invalid.

Subclasses must override this method with their own validation. They should call the parent’s validate method via super.

Parameters:str_in (str) – String to validate.
Raises:InvalidEntryError – When entry is invalid.
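
The override-and-call-super pattern described above can be sketched as follows. FieldSpecSketch and NonEmptySpec are hypothetical classes written for illustration; what the real base-class validate checks is not stated here, so the base method in this sketch does nothing.

```python
class InvalidEntryError(ValueError):
    """Local stand-in for clkhash.field_formats.InvalidEntryError."""


class FieldSpecSketch:
    """Illustrative stand-in for the FieldSpec base class."""

    def __init__(self, identifier, description=None):
        self.identifier = identifier
        self.description = description

    def validate(self, str_in):
        # Placeholder for checks shared by all fields (none in this sketch).
        pass


class NonEmptySpec(FieldSpecSketch):
    """Hypothetical subclass that rejects empty entries."""

    def validate(self, str_in):
        # Run the parent's checks first, then add the subclass's own rule.
        super().validate(str_in)
        if not str_in:
            raise InvalidEntryError(
                "empty entry in field {!r}".format(self.identifier)
            )


spec = NonEmptySpec("surname")
spec.validate("Smith")  # valid: returns without raising

try:
    spec.validate("")
    empty_rejected = False
except InvalidEntryError:
    empty_rejected = True
```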
class clkhash.field_formats.IntegerSpec(identifier, hashing_properties, description=None, minimum=0, maximum=None, **kwargs)[source]

Bases: clkhash.field_formats.FieldSpec

Represents a field that holds integers.

Minimum and maximum values may be specified.

Variables:
  • minimum (int) – The minimum permitted value.
  • maximum (int) – The maximum permitted value or None.
classmethod from_json_dict(json_dict)[source]

Make an IntegerSpec object from a dictionary containing its properties.

Parameters:json_dict (dict) – The properties dictionary. It may contain ‘minimum’ and ‘maximum’ keys. In addition, it must contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties.
validate(str_in)[source]

Validates an entry in the field.

Raises InvalidEntryError iff the entry is invalid.

An entry is invalid iff (1) the string does not represent a base-10 integer; (2) the integer is not between self.minimum and self.maximum, if those exist; or (3) the integer is negative.

Parameters:str_in (str) – String to validate.
Raises:InvalidEntryError – When entry is invalid.
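
A minimal sketch of the three integer conditions above, using only the standard library. validate_integer is a hypothetical helper and the exception class is a local stand-in. Two assumptions are made: the bounds are treated as inclusive, and parsing is delegated to int(), which is more lenient than a strict digit check.

```python
class InvalidEntryError(ValueError):
    """Local stand-in for clkhash.field_formats.InvalidEntryError."""


def validate_integer(str_in, minimum=0, maximum=None):
    """Raise InvalidEntryError unless str_in is an acceptable integer."""
    try:
        # int() also accepts surrounding whitespace, a leading sign, and
        # underscores between digits; a stricter check may be wanted.
        value = int(str_in, 10)
    except ValueError as e:
        raise InvalidEntryError(
            "{!r} is not a base-10 integer".format(str_in)
        ) from e

    if value < 0:
        raise InvalidEntryError("{} is negative".format(value))
    # Assumption: both bounds are inclusive.
    if minimum is not None and value < minimum:
        raise InvalidEntryError("{} is below {}".format(value, minimum))
    if maximum is not None and value > maximum:
        raise InvalidEntryError("{} is above {}".format(value, maximum))


validate_integer("42", maximum=150)  # valid: returns without raising

rejected = []
for entry in ("forty-two", "-3", "200"):
    try:
        validate_integer(entry, maximum=150)
    except InvalidEntryError:
        rejected.append(entry)
```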
exception clkhash.field_formats.InvalidEntryError[source]

Bases: ValueError

An entry in the data file does not conform to the schema.

exception clkhash.field_formats.InvalidSchemaError[source]

Bases: ValueError

The schema is not valid.

This exception is raised if, for example, a regular expression included in the schema is not syntactically correct.
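
The regular-expression case mentioned above could be handled along these lines. This is a sketch under stated assumptions: compile_pattern is a hypothetical helper and the exception class is a local stand-in; only re.compile and re.error are real standard-library names.

```python
import re


class InvalidSchemaError(ValueError):
    """Local stand-in for clkhash.field_formats.InvalidSchemaError."""


def compile_pattern(pattern):
    """Compile a schema pattern, reporting syntax errors as schema errors."""
    try:
        return re.compile(pattern)
    except re.error as e:
        raise InvalidSchemaError(
            "invalid regular expression {!r}: {}".format(pattern, e)
        ) from e


compiled = compile_pattern(r"[A-Z]{3}\d+")  # syntactically valid

try:
    compile_pattern("(unclosed")  # missing closing parenthesis
    schema_rejected = False
except InvalidSchemaError:
    schema_rejected = True
```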

class clkhash.field_formats.StringSpec(identifier, hashing_properties, description=None, regex=None, case='mixed', min_length=0, max_length=None)[source]

Bases: clkhash.field_formats.FieldSpec

Represents a field that holds strings.

One way to specify the format of the entries is to provide a regular expression that they must conform to. Another is to provide zero or more of: minimum length, maximum length, casing (lower, upper, mixed).

Each string field also specifies an encoding used when turning characters into bytes. This is stored in hashing_properties since it is needed for hashing.

Variables:
  • regex – Compiled regular expression that entries must conform to. Present only if the specification is regex-based.
  • case (str) – The casing of the entries. One of ‘lower’, ‘upper’, or ‘mixed’. Default is ‘mixed’. Present only if the specification is not regex-based.
  • min_length (int) – The minimum length of the string. None if there is no minimum length. Present only if the specification is not regex-based.
  • max_length (int) – The maximum length of the string. None if there is no maximum length. Present only if the specification is not regex-based.
classmethod from_json_dict(json_dict)[source]

Make a StringSpec object from a dictionary containing its properties.

Parameters:json_dict (dict) – This dictionary must contain an ‘encoding’ key associated with a Python-conformant encoding. It must also contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties. Permitted keys also include ‘pattern’, ‘case’, ‘minLength’, and ‘maxLength’.
Raises:InvalidSchemaError – When a regular expression is provided but is not a valid pattern.
validate(str_in)[source]

Validates an entry in the field.

Raises InvalidEntryError iff the entry is invalid.

An entry is invalid iff (1) a pattern is part of the specification of the field and the string does not match it; (2) the string does not match the provided casing, minimum length, or maximum length; or (3) the specified encoding cannot represent the string.

Parameters:str_in (str) – String to validate.
Raises:
  • InvalidEntryError – When entry is invalid.
  • ValueError – When self.case is not one of the permitted values (‘lower’, ‘upper’, or ‘mixed’).
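
The string rules above can be approximated with the following sketch. It is not the clkhash implementation: validate_string is a hypothetical helper, the exception class is a local stand-in, and requiring the pattern to match the whole entry (re.fullmatch) is an assumption.

```python
import re


class InvalidEntryError(ValueError):
    """Local stand-in for clkhash.field_formats.InvalidEntryError."""


def validate_string(str_in, regex=None, case="mixed",
                    min_length=0, max_length=None, encoding="utf-8"):
    """Raise InvalidEntryError unless str_in satisfies the specification."""
    if regex is not None:
        # Regex-based specification. Assumption: the whole entry must match.
        if re.fullmatch(regex, str_in) is None:
            raise InvalidEntryError("{!r} does not match pattern".format(str_in))
    else:
        # Specification based on casing and length.
        if case not in ("lower", "upper", "mixed"):
            raise ValueError("unknown case: {!r}".format(case))
        if case == "lower" and str_in != str_in.lower():
            raise InvalidEntryError("{!r} is not lower case".format(str_in))
        if case == "upper" and str_in != str_in.upper():
            raise InvalidEntryError("{!r} is not upper case".format(str_in))
        if min_length is not None and len(str_in) < min_length:
            raise InvalidEntryError("{!r} is too short".format(str_in))
        if max_length is not None and len(str_in) > max_length:
            raise InvalidEntryError("{!r} is too long".format(str_in))

    try:
        # The entry must be representable in the field's encoding.
        str_in.encode(encoding)
    except UnicodeEncodeError as e:
        raise InvalidEntryError(
            "{!r} cannot be encoded as {}".format(str_in, encoding)
        ) from e


validate_string("alice", case="lower", max_length=10)  # valid
validate_string("AB123", regex=r"[A-Z]{2}\d{3}")       # valid

checks = [
    ("Alice", {"case": "lower"}),
    ("abc", {"regex": r"\d+"}),
    ("toolong", {"max_length": 3}),
    ("café", {"encoding": "ascii"}),
]
rejected = []
for entry, options in checks:
    try:
        validate_string(entry, **options)
    except InvalidEntryError:
        rejected.append(entry)
```

Each of the four entries in checks violates exactly one rule, so all four are collected in rejected.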
clkhash.field_formats.spec_from_json_dict(json_dict)[source]

Turns a dictionary into the appropriate object.

Parameters:json_dict (dict) – A dictionary with properties.
Returns:An initialised instance of the appropriate FieldSpec subclass.