Field Format API Documentation
field_formats
Classes that specify the requirements for each column in a dataset. They take care of validation, and produce the settings required to perform the hashing.
-
class clkhash.field_formats.DateSpec(identifier, hashing_properties, format, description=None)
Bases: clkhash.field_formats.FieldSpec
Represents a field that holds dates.
Dates are specified as full-dates as defined in RFC 3339, e.g., 1996-12-19.
Variables: format (str) – The format of the date.
-
classmethod from_json_dict(json_dict)
Make a DateSpec object from a dictionary containing its properties.
Parameters: json_dict (dict) – The properties dictionary. This dictionary must contain a ‘format’ key. In addition, it must contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties.
-
validate(str_in)
Validates an entry in the field.
Raises InvalidEntryError iff the entry is invalid.
An entry is invalid iff (1) the string does not represent a date in the correct format; or (2) the date it represents is invalid (such as 30 February).
Parameters: str_in (str) – String to validate.
Raises: - InvalidEntryError – Iff entry is invalid.
- ValueError – When self.format is unrecognised.
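The two invalidity conditions above can be illustrated with a small stand-alone sketch. It does not use clkhash: `validate_full_date` and the local `InvalidEntryError` class are hypothetical stand-ins, and the sketch assumes the RFC 3339 full-date layout (YYYY-MM-DD) rather than whatever formats the library itself recognises.

```python
import re
from datetime import datetime


class InvalidEntryError(ValueError):
    """Local stand-in for clkhash.field_formats.InvalidEntryError."""


# Shape of an RFC 3339 full-date: four-digit year, two-digit month and day.
_FULL_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def validate_full_date(str_in):
    """Raise InvalidEntryError unless str_in is a valid YYYY-MM-DD date."""
    # Condition (1): the string is not in the expected format.
    if _FULL_DATE.fullmatch(str_in) is None:
        raise InvalidEntryError(
            "{!r} is not in YYYY-MM-DD format".format(str_in))
    # Condition (2): the format is right but the date does not exist,
    # for example 30 February. strptime raises ValueError in that case.
    try:
        datetime.strptime(str_in, "%Y-%m-%d")
    except ValueError as e:
        raise InvalidEntryError(
            "{!r} is not a real calendar date".format(str_in)) from e
```

With this sketch, `validate_full_date("1996-12-19")` returns silently, while `"1996-02-30"` and `"19/12/1996"` both raise `InvalidEntryError`.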
-
class clkhash.field_formats.EnumSpec(identifier, hashing_properties, values, description=None)
Bases: clkhash.field_formats.FieldSpec
Represents a field that holds an enum.
The finite collection of permitted values must be specified.
Variables: values – The set of permitted values. -
classmethod from_json_dict(json_dict)
Make an EnumSpec object from a dictionary containing its properties.
Parameters: json_dict (dict) – This dictionary must contain an ‘enum’ key specifying the permitted values. In addition, it must contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties.
-
validate(str_in)
Validates an entry in the field.
Raises InvalidEntryError iff the entry is invalid.
An entry is invalid iff it is not one of the permitted values.
Parameters: str_in (str) – String to validate.
Raises: InvalidEntryError – When entry is invalid.
-
class clkhash.field_formats.FieldHashingProperties(ngram, encoding='utf-8', weight=1, positional=False)
Bases: object
Stores the settings used to hash a field. This includes the encoding and tokenisation parameters.
Variables: - encoding (str) – The encoding to use when converting the string to bytes. Refer to Python’s documentation for possible values.
- ngram (int) – The n in n-gram. Possible values are 0, 1, and 2.
- positional (bool) – Controls whether the n-grams are positional.
- weight (float) – Controls the weight of the field in the Bloom filter.
-
classmethod from_json_dict(json_dict)
Make a FieldHashingProperties object from a dictionary.
Parameters: json_dict (dict) – The dictionary must have an ‘ngram’ key. It may have ‘positional’ and ‘weight’ keys; if these are missing, then they are filled with the default values. The encoding is always set to the default value.
Returns: A FieldHashingProperties instance.
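The defaulting rules described for from_json_dict can be sketched as follows. This is a stand-alone illustration, not the library's code: `HashingPropertiesSketch` is a hypothetical class whose defaults are copied from the constructor signature documented above, and raising `KeyError` for a missing ‘ngram’ key is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class HashingPropertiesSketch:
    """Hypothetical stand-in mirroring the documented constructor defaults."""
    ngram: int
    encoding: str = "utf-8"
    weight: float = 1
    positional: bool = False

    @classmethod
    def from_json_dict(cls, json_dict):
        # 'ngram' is required; in this sketch a missing key raises KeyError.
        # 'weight' and 'positional' fall back to their defaults when absent.
        # 'encoding' is never read from the dictionary, so it always keeps
        # its default value.
        return cls(
            ngram=json_dict["ngram"],
            weight=json_dict.get("weight", 1),
            positional=json_dict.get("positional", False),
        )


props = HashingPropertiesSketch.from_json_dict({"ngram": 2, "weight": 2.5})
```

Here `props.weight` comes from the dictionary, while `props.positional` and `props.encoding` keep their defaults.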
-
class clkhash.field_formats.FieldSpec(identifier, hashing_properties, description=None)
Bases: object
Abstract base class representing the specification of a column in the dataset. Subclasses validate entries, and modify the hashing_properties instance variable to customise hashing procedures.
Variables: - identifier (str) – The name of the field.
- description (str) – Description of the field format.
- hashing_properties (FieldHashingProperties) – The properties for hashing.
-
classmethod from_json_dict(field_dict)
Initialise a FieldSpec object from a dictionary of properties.
Parameters: field_dict (dict) – The properties dictionary to use. Must contain a ‘hashing’ key that meets the requirements of FieldHashingProperties. Subclasses may require additional keys.
Raises: InvalidSchemaError – When the properties dictionary contains invalid values. Exactly what that means is decided by the subclasses.
-
validate(str_in)
Validates an entry in the field.
Raises InvalidEntryError iff the entry is invalid.
Subclasses must override this method with their own validation. They should call the parent’s validate method via super.
Parameters: str_in (str) – String to validate.
Raises: InvalidEntryError – When entry is invalid.
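The override-and-call-super pattern described for validate might look like the sketch below. All names here are hypothetical stand-ins rather than clkhash classes, and the check performed by the base class is an assumption made only so that the parent call has something to do.

```python
class InvalidEntryError(ValueError):
    """Local stand-in for clkhash.field_formats.InvalidEntryError."""


class BaseSpecSketch:
    """Hypothetical base class playing the role of FieldSpec."""

    def __init__(self, identifier):
        self.identifier = identifier

    def validate(self, str_in):
        # Assumed base-level check: entries must be strings.
        if not isinstance(str_in, str):
            raise InvalidEntryError("entry must be a string")


class NonEmptySpecSketch(BaseSpecSketch):
    """Hypothetical subclass that additionally rejects empty entries."""

    def validate(self, str_in):
        # Run the parent's validation first, then the subclass's own rule.
        super().validate(str_in)
        if not str_in:
            raise InvalidEntryError(
                "{}: entry must not be empty".format(self.identifier))
```

Calling the parent first means every subclass inherits the shared checks without repeating them.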
-
class clkhash.field_formats.IntegerSpec(identifier, hashing_properties, description=None, minimum=0, maximum=None, **kwargs)
Bases: clkhash.field_formats.FieldSpec
Represents a field that holds integers.
Minimum and maximum values may be specified.
-
classmethod from_json_dict(json_dict)
Make an IntegerSpec object from a dictionary containing its properties.
Parameters: json_dict (dict) – The properties dictionary. This dictionary may contain ‘minimum’ and ‘maximum’ keys. In addition, it must contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties.
-
validate(str_in)
Validates an entry in the field.
Raises InvalidEntryError iff the entry is invalid.
An entry is invalid iff (1) the string does not represent a base-10 integer; (2) the integer is not between self.minimum and self.maximum, if those exist; or (3) the integer is negative.
Parameters: str_in (str) – String to validate.
Raises: InvalidEntryError – When entry is invalid.
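The three integer conditions can be sketched in a few lines. This is a stand-alone illustration that does not call clkhash: `validate_integer` is a hypothetical function, and two details are assumptions of the sketch, namely that an optional leading sign is accepted when parsing and that the bounds are inclusive.

```python
import re


class InvalidEntryError(ValueError):
    """Local stand-in for clkhash.field_formats.InvalidEntryError."""


def validate_integer(str_in, minimum=0, maximum=None):
    """Raise InvalidEntryError unless str_in passes all three checks."""
    # Condition (1): the string must represent a base-10 integer.
    # Accepting an optional sign is an assumption of this sketch.
    if re.fullmatch(r"[+-]?[0-9]+", str_in) is None:
        raise InvalidEntryError(
            "{!r} is not a base-10 integer".format(str_in))
    value = int(str_in)
    # Condition (3): negative integers are rejected.
    if value < 0:
        raise InvalidEntryError("{} is negative".format(value))
    # Condition (2): respect the bounds when they exist (assumed inclusive).
    if minimum is not None and value < minimum:
        raise InvalidEntryError(
            "{} is below the minimum {}".format(value, minimum))
    if maximum is not None and value > maximum:
        raise InvalidEntryError(
            "{} is above the maximum {}".format(value, maximum))
```

For example, `validate_integer("42")` passes, while `"4.2"`, `"-1"`, and `validate_integer("11", maximum=10)` all raise `InvalidEntryError`.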
-
exception clkhash.field_formats.InvalidEntryError
Bases: ValueError
An entry in the data file does not conform to the schema.
-
exception clkhash.field_formats.InvalidSchemaError
Bases: ValueError
The schema is not valid.
This exception is raised if, for example, a regular expression included in the schema is not syntactically correct.
-
class clkhash.field_formats.StringSpec(identifier, hashing_properties, description=None, regex=None, case='mixed', min_length=0, max_length=None)
Bases: clkhash.field_formats.FieldSpec
Represents a field that holds strings.
One way to specify the format of the entries is to provide a regular expression that they must conform to. Another is to provide zero or more of: minimum length, maximum length, casing (lower, upper, mixed).
Each string field also specifies an encoding used when turning characters into bytes. This is stored in hashing_properties since it is needed for hashing.
Variables: - regex – Compiled regular expression that entries must conform to. Present only if the specification is regex-based.
- case (str) – The casing of the entries. One of ‘lower’, ‘upper’, or ‘mixed’. Default is ‘mixed’. Present only if the specification is not regex-based.
- min_length (int) – The minimum length of the string. None if there is no minimum length. Present only if the specification is not regex-based.
- max_length (int) – The maximum length of the string. None if there is no maximum length. Present only if the specification is not regex-based.
-
classmethod from_json_dict(json_dict)
Make a StringSpec object from a dictionary containing its properties.
Parameters: json_dict (dict) – This dictionary must contain an ‘encoding’ key associated with a Python-conformant encoding. It must also contain a ‘hashing’ key, whose contents are passed to FieldHashingProperties. Permitted keys also include ‘pattern’, ‘case’, ‘minLength’, and ‘maxLength’.
Raises: InvalidSchemaError – When a regular expression is provided but is not a valid pattern.
-
validate(str_in)
Validates an entry in the field.
Raises InvalidEntryError iff the entry is invalid.
An entry is invalid iff (1) a pattern is part of the specification of the field and the string does not match it; (2) the string does not match the provided casing, minimum length, or maximum length; or (3) the specified encoding cannot represent the string.
Parameters: str_in (str) – String to validate.
Raises: - InvalidEntryError – When entry is invalid.
- ValueError – When self.case is not one of the permitted values (‘lower’, ‘upper’, or ‘mixed’).
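The string checks above, including the distinction between an invalid entry and an unrecognised casing value, can be sketched as follows. This stand-alone example does not use clkhash: `validate_string` is a hypothetical function, matching the pattern against the whole string is an assumption, and so is testing the casing by comparing the string with its lower-cased or upper-cased form.

```python
import re


class InvalidEntryError(ValueError):
    """Local stand-in for clkhash.field_formats.InvalidEntryError."""


def validate_string(str_in, encoding="utf-8", pattern=None,
                    case="mixed", min_length=0, max_length=None):
    """Raise InvalidEntryError if str_in fails any of the checks."""
    # Condition (3): the encoding must be able to represent the string.
    try:
        str_in.encode(encoding)
    except UnicodeEncodeError as e:
        raise InvalidEntryError(
            "{!r} cannot be encoded as {}".format(str_in, encoding)) from e

    # Condition (1): regex-based specification.
    if pattern is not None:
        if re.fullmatch(pattern, str_in) is None:
            raise InvalidEntryError(
                "{!r} does not match the pattern".format(str_in))
        return

    # Condition (2): casing and length, used when there is no pattern.
    if case == "lower":
        if str_in != str_in.lower():
            raise InvalidEntryError("{!r} is not lower case".format(str_in))
    elif case == "upper":
        if str_in != str_in.upper():
            raise InvalidEntryError("{!r} is not upper case".format(str_in))
    elif case != "mixed":
        # A bad specification, not a bad entry, hence a plain ValueError.
        raise ValueError("unrecognised case: {!r}".format(case))

    if len(str_in) < min_length:
        raise InvalidEntryError("{!r} is too short".format(str_in))
    if max_length is not None and len(str_in) > max_length:
        raise InvalidEntryError("{!r} is too long".format(str_in))
```

Because `InvalidEntryError` derives from `ValueError`, callers that need to tell the two failures apart should catch `InvalidEntryError` first.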