Reference

dapla_metadata package

Subpackages

Module contents

Tools and clients for working with the Dapla Metadata system.

dapla_metadata.datasets package

Subpackages

Submodules

dapla_metadata.datasets.code_list module

class CodeList(executor, classification_id)

Bases: GetExternalSource

Class for retrieving classifications from Klass.

This class fetches a classification given a classification ID and supports multiple languages.

Parameters:
  • executor (ThreadPoolExecutor)

  • classification_id (int | None)

supported_languages

A list of supported language codes.

_classifications

A list to store classification items.

classification_id

The ID of the classification to retrieve.

classifications_dataframes

A dictionary to store dataframes of classifications.

property classifications: list[CodeListItem]

Get the list of classifications.

Returns:

A list of CodeListItem objects.
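
Example

A minimal usage sketch; the classification ID and the executor setup are illustrative assumptions, not values mandated by the API:

>>> from concurrent.futures import ThreadPoolExecutor
>>> code_list = CodeList(ThreadPoolExecutor(max_workers=1), classification_id=91)  # hypothetical Klass ID
>>> code_list.wait_for_external_result()  # inherited from GetExternalSource; blocks until the fetch completes
>>> items = code_list.classifications     # list of CodeListItem objects (empty if nothing was retrieved)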

class CodeListItem(titles, code)

Bases: object

Data structure for a code list item.

Parameters:
titles

A dictionary mapping language codes to titles.

code

The code associated with the item.

code: str
get_title(language)

Return the title in the specified language.

Parameters:

language (SupportedLanguages) – The language code for which to get the title.

Return type:

str

Returns:

The title in the specified language. It returns the title in Norwegian Bokmål (“nb”) if the language is either Norwegian Bokmål or Norwegian Nynorsk, otherwise it returns the title in English (“en”). If none of these are available, it returns an empty string and logs an exception.

titles: dict[SupportedLanguages, str]
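
Example

A minimal sketch of the fallback behaviour described above; keying the titles dictionary by plain language codes is an assumption about how items are constructed:

>>> item = CodeListItem(titles={"nb": "Heltid", "en": "Full-time"}, code="01")
>>> item.get_title(SupportedLanguages.NORSK_NYNORSK)  # no Nynorsk title, so the Bokmål title is returned
'Heltid'
>>> item.get_title(SupportedLanguages.ENGLISH)
'Full-time'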

dapla_metadata.datasets.config module

Configuration management for dataset package.

get_dapla_region()

Get the Dapla region we’re running on.

Return type:

DaplaRegion | None

get_dapla_service()

Get the Dapla service we’re running on.

Return type:

DaplaService | None

get_jupyterhub_user()

Get the JupyterHub user name.

Return type:

str | None

get_oidc_token()

Get the JWT token from the environment.

Return type:

str | None

get_statistical_subject_source_url()

Get the URL to the statistical subject source.

Return type:

str | None
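
Example

These getters read configuration provided by the runtime environment and return None when nothing is set; a minimal sketch:

>>> region = get_dapla_region()       # e.g. DaplaRegion.DAPLA_LAB when running on Dapla Lab, otherwise None
>>> service = get_dapla_service()     # e.g. DaplaService.JUPYTERLAB, otherwise None
>>> get_jupyterhub_user() is None     # True when not running under JupyterHub
True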

dapla_metadata.datasets.core module

Handle reading, updating and writing of metadata.

class Datadoc(dataset_path=None, metadata_document_path=None, statistic_subject_mapping=None, *, errors_as_warnings=False)

Bases: object

Handle reading, updating and writing of metadata.

If a metadata document exists, that information is loaded and nothing is inferred from the dataset. If only a dataset path is supplied, the metadata document path is built from the dataset path.

Example: /path/to/dataset.parquet -> /path/to/dataset__DOC.json

Parameters:
  • dataset_path (str | None)

  • metadata_document_path (str | None)

  • statistic_subject_mapping (StatisticSubjectMapping | None)

  • errors_as_warnings (bool)

dataset_path

The file path to where the dataset is stored.

metadata_document_path

A path to a metadata document if it exists.

statistic_subject_mapping

An instance of StatisticSubjectMapping.

static build_metadata_document_path(dataset_path)

Build the path to the metadata document corresponding to the given dataset.

Parameters:

dataset_path (Path | CloudPath) – Path to the dataset we wish to create metadata for.

Return type:

Path | CloudPath
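
Example

A sketch of the path mapping described above, on a POSIX filesystem and with a hypothetical path:

>>> Datadoc.build_metadata_document_path(pathlib.Path("/path/to/dataset.parquet"))
PosixPath('/path/to/dataset__DOC.json')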

property percent_complete: int

The percentage of obligatory metadata completed.

A metadata field is counted as complete when any non-None value is assigned. Used for a live progress bar in the UI, as well as being saved in the datadoc as a simple quality indicator.

write_metadata_document()

Write all currently known metadata to file.

Return type:

None

Side Effects:
  • Updates the dataset’s metadata_last_updated_date and metadata_last_updated_by attributes.

  • Updates the dataset’s file_path attribute.

  • Validates the metadata model and stores it in a MetadataContainer.

  • Writes the validated metadata to a file if the metadata_document attribute is set.

  • Logs the action and the content of the metadata document.

Raises:

ValueError – If no metadata document is specified for saving.
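
Example

A minimal end-to-end sketch; the dataset path is hypothetical:

>>> meta = Datadoc(dataset_path="befolkning/klargjorte_data/person_data_p2021_v1.parquet")
>>> meta.percent_complete              # integer percentage of obligatory fields filled in
>>> meta.write_metadata_document()     # writes person_data_p2021_v1__DOC.json next to the dataset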


exception InconsistentDatasetsError

Bases: ValueError

Existing and new datasets differ significantly from one another.

exception InconsistentDatasetsWarning

Bases: UserWarning

Existing and new datasets differ significantly from one another.

dapla_metadata.datasets.dapla_dataset_path_info module

Extract info from a path following SSB’s dataset naming convention.

class DaplaDatasetPathInfo(dataset_path)

Bases: object

Extract info from a path following SSB’s dataset naming convention.

Parameters:

dataset_path (str | os.PathLike[str])

property bucket_name: str | None

Extract the bucket name from the dataset path.

Returns:

The bucket name or None if the dataset path is not a GCS path.

Examples

>>> DaplaDatasetPathInfo('gs://ssb-staging-dapla-felles-data-delt/datadoc/utdata/person_data_p2021_v2.parquet').bucket_name
ssb-staging-dapla-felles-data-delt
>>> DaplaDatasetPathInfo(pathlib.Path('gs://ssb-staging-dapla-felles-data-delt/datadoc/utdata/person_data_p2021_v2.parquet')).bucket_name
ssb-staging-dapla-felles-data-delt
>>> DaplaDatasetPathInfo('gs:/ssb-staging-dapla-felles-data-delt/datadoc/utdata/person_data_p2021_v2.parquet').bucket_name
ssb-staging-dapla-felles-data-delt
>>> DaplaDatasetPathInfo('ssb-staging-dapla-felles-data-delt/datadoc/utdata/person_data_p2021_v2.parquet').bucket_name
None
property contains_data_from: date | None

The earliest date for which data in the dataset is relevant.

Returns:

The earliest relevant date for the dataset if available, otherwise None.

property contains_data_until: date | None

The latest date until which data in the dataset is relevant.

Returns:

The latest relevant date for the dataset if available, otherwise None.

property dataset_short_name: str | None

Extract the dataset short name from the filepath.

The dataset short name is defined as the first section of the stem, up to the period information or the version information if no period information is present.

Returns:

The extracted dataset short name if it can be determined, otherwise None.

Examples

>>> DaplaDatasetPathInfo('prosjekt/befolkning/klargjorte_data/person_data_v1.parquet').dataset_short_name
person_data
>>> DaplaDatasetPathInfo('befolkning/inndata/sykepenger_p2022Q1_p2022Q2_v23.parquet').dataset_short_name
sykepenger
>>> DaplaDatasetPathInfo('my_data/simple_dataset_name.parquet').dataset_short_name
simple_dataset_name
property dataset_state: DataSetState | None

Extract the dataset state from the path.

We assume that files are saved in the Norwegian language as specified by SSB.

Returns:

The extracted dataset state if it can be determined from the path, otherwise None.

Examples

>>> DaplaDatasetPathInfo('klargjorte_data/person_data_v1.parquet').dataset_state
<DataSetState.PROCESSED_DATA: 'PROCESSED_DATA'>
>>> DaplaDatasetPathInfo('klargjorte-data/person_data_v1.parquet').dataset_state
<DataSetState.PROCESSED_DATA: 'PROCESSED_DATA'>
>>> DaplaDatasetPathInfo('utdata/min_statistikk/person_data_v1.parquet').dataset_state
<DataSetState.OUTPUT_DATA: 'OUTPUT_DATA'>
>>> DaplaDatasetPathInfo('my_special_data/person_data_v1.parquet').dataset_state
None
property dataset_version: str | None

Extract the version information from the filename, if present.

Returns:

The extracted version information if available in the filename, otherwise None.

Examples

>>> DaplaDatasetPathInfo('person_data_v1.parquet').dataset_version
'1'
>>> DaplaDatasetPathInfo('person_data_v20.parquet').dataset_version
'20'
>>> DaplaDatasetPathInfo('person_data.parquet').dataset_version
None
path_complies_with_naming_standard()

Check if path is valid according to SSB standard.

Read more about SSB naming convention in the Dapla manual: https://manual.dapla.ssb.no/statistikkere/navnestandard.html

Return type:

bool

Returns:

True if the path conforms to the SSB naming standard, otherwise False.

property statistic_short_name: str | None

Extract the statistical short name from the filepath.

The statistical short name is the path element immediately before the dataset state, following the Dapla filepath naming convention.

Returns:

The extracted statistical short name if it can be determined, otherwise None.

Examples

>>> DaplaDatasetPathInfo('prosjekt/befolkning/klargjorte_data/person_data_v1.parquet').statistic_short_name
befolkning
>>> DaplaDatasetPathInfo('befolkning/inndata/person_data_v1.parquet').statistic_short_name
befolkning
>>> DaplaDatasetPathInfo('befolkning/person_data.parquet').statistic_short_name
None
class DateFormat(name, regex_pattern, arrow_pattern, timeframe)

Bases: ABC

A superclass for date formats.

Parameters:
  • name (str)

  • regex_pattern (str)

  • arrow_pattern (str)

  • timeframe (Literal['year', 'month', 'day', 'week'])

arrow_pattern: str
abstract get_ceil(period_string)

Abstract method implemented in the child class.

Return the last date of the timeframe period.

Parameters:

period_string (str) – A string representing the timeframe period.

Return type:

date | None

abstract get_floor(period_string)

Abstract method implemented in the child class.

Return the first date of the timeframe period.

Parameters:

period_string (str) – A string representing the timeframe period.

Return type:

date | None

name: str
regex_pattern: str
timeframe: Literal['year', 'month', 'day', 'week']
class IsoDateFormat(name, regex_pattern, arrow_pattern, timeframe)

Bases: DateFormat

A subclass of DateFormat with relevant patterns for ISO dates.

Parameters:
  • name (str)

  • regex_pattern (str)

  • arrow_pattern (str)

  • timeframe (Literal['year', 'month', 'day', 'week'])

get_ceil(period_string)

Return last date of timeframe period defined in ISO date format.

Return type:

date | None

Parameters:

period_string (str)

Examples

>>> ISO_YEAR.get_ceil("1921")
datetime.date(1921, 12, 31)
>>> ISO_YEAR_MONTH.get_ceil("2021-05")
datetime.date(2021, 5, 31)
get_floor(period_string)

Return first date of timeframe period defined in ISO date format.

Return type:

date | None

Parameters:

period_string (str)

Examples

>>> ISO_YEAR_MONTH.get_floor("1980-08")
datetime.date(1980, 8, 1)
>>> ISO_YEAR.get_floor("2021")
datetime.date(2021, 1, 1)
class SsbDateFormat(name, regex_pattern, arrow_pattern, timeframe, ssb_dates)

Bases: DateFormat

A subclass of DateFormat with relevant patterns for SSB-specific dates.

Parameters:
  • name (str)

  • regex_pattern (str)

  • arrow_pattern (str)

  • timeframe (Literal['year', 'month', 'day', 'week'])

  • ssb_dates (dict)

ssb_dates

A dictionary where keys are date format strings and values are corresponding date patterns specific to SSB.

get_ceil(period_string)

Return last date of the timeframe period defined in SSB date format.

Convert SSB format to date-string and return the last date.

Parameters:

period_string (str) – A string representing the timeframe period in SSB format.

Return type:

date | None

Returns:

The last date of the period if the period_string is a valid SSB format, otherwise None.

Example

>>> SSB_TRIANNUAL.get_ceil("1999T11")
None
>>> SSB_HALF_YEAR.get_ceil("2024H1")
datetime.date(2024, 6, 30)
>>> SSB_HALF_YEAR.get_ceil("2024-H1")
datetime.date(2024, 6, 30)
get_floor(period_string)

Return first date of the timeframe period defined in SSB date format.

Convert SSB format to date-string and return the first date.

Parameters:

period_string (str) – A string representing the timeframe period in SSB format.

Return type:

date | None

Returns:

The first date of the period if the period_string is a valid SSB format, otherwise None.

Example

>>> SSB_BIMESTER.get_floor("2003B8")
None
>>> SSB_BIMESTER.get_floor("2003B4")
datetime.date(2003, 7, 1)
>>> SSB_BIMESTER.get_floor("2003-B4")
datetime.date(2003, 7, 1)
ssb_dates: dict
categorize_period_string(period)

Categorize a period string into one of the supported date formats.

Parameters:

period (str) – A string representing the period to be categorized.

Return type:

IsoDateFormat | SsbDateFormat

Returns:

An instance of either IsoDateFormat or SsbDateFormat depending on the format of the input period string.

Raises:

NotImplementedError – If the period string is not recognized as either an ISO or SSB date format.

Examples

>>> date_format = categorize_period_string('2022-W01')
>>> date_format.name
ISO_YEAR_WEEK
>>> date_format = categorize_period_string('1954T2')
>>> date_format.name
SSB_TRIANNUAL
>>> categorize_period_string('unknown format')
Traceback (most recent call last):
...
NotImplementedError: Period format unknown format is not supported

dapla_metadata.datasets.dataset_parser module

Abstractions for dataset file formats.

Handles reading in the data and transforming data types to generic metadata types.

class DatasetParser(dataset)

Bases: ABC

Abstract Base Class for all Dataset parsers.

Implements:
  • A static factory method to get the correct implementation for each file extension.

  • A static method for data type conversion.

Requires implementation by subclasses:
  • A method to extract variables (columns) from the dataset, so they may be documented.

Parameters:

dataset (pathlib.Path | CloudPath)

static for_file(dataset)

Return the correct subclass based on the given dataset file.

Return type:

DatasetParser

Parameters:

dataset (Path | CloudPath)

abstract get_fields()

Abstract method, must be implemented by subclasses.

Return type:

list[Variable]

static transform_data_type(data_type)

Transform a concrete data type to an abstract data type.

In statistical metadata, one is not interested in how the data is technically stored, but in the meaning of the data type. Because of this, we transform known data types to their abstract metadata representations.

If we encounter a data type we don’t know, we just ignore it and let the user handle it in the GUI.

Parameters:

data_type (str) – The concrete data type to map.

Return type:

DataType | None
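
Example

A sketch of the intended mapping; treating 'int64' as a known integer type is an assumption:

>>> DatasetParser.transform_data_type("int64") == DataType.INTEGER   # known concrete type (assumed)
True
>>> DatasetParser.transform_data_type("some_exotic_type") is None    # unknown types are returned as None for the user to handle
True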

class DatasetParserParquet(dataset)

Bases: DatasetParser

Concrete implementation for parsing parquet files.

Parameters:

dataset (pathlib.Path | CloudPath)

get_fields()

Extract the fields from this dataset.

Return type:

list[Variable]
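
Example

A usage sketch of the factory together with field extraction; the file path is hypothetical:

>>> parser = DatasetParser.for_file(pathlib.Path("klargjorte_data/person_data_v1.parquet"))
>>> isinstance(parser, DatasetParserParquet)
True
>>> variables = parser.get_fields()    # one Variable per column in the dataset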

class DatasetParserSas7Bdat(dataset)

Bases: DatasetParser

Concrete implementation for parsing SAS7BDAT files.

Parameters:

dataset (pathlib.Path | CloudPath)

get_fields()

Extract the fields from this dataset.

Return type:

list[Variable]

dapla_metadata.datasets.model_backwards_compatibility module

Upgrade old metadata files to be compatible with new versions.

An important principle of Datadoc is that we ALWAYS guarantee backwards compatibility of existing metadata documents. This means that we guarantee that a user will never lose data, even if their document is decades old.

For each document version we release with breaking changes, we implement a handler and register the version by defining a BackwardsCompatibleVersion instance. These documents will then be upgraded when they’re opened in Datadoc.

A test must also be implemented for each new version.

class BackwardsCompatibleVersion(version, handler)

Bases: object

A version which we support with backwards compatibility.

This class registers a version and its corresponding handler function for backwards compatibility.

Parameters:
  • version (str)

  • handler (Callable[[dict[str, Any]], dict[str, Any]])

handler: Callable[[dict[str, Any]], dict[str, Any]]
version: str
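
Example

Registering a version and its handler might look like the following sketch; the exact pairings shown are assumptions about how the module wires versions to handlers:

>>> v_3_3_0 = BackwardsCompatibleVersion(version="3.3.0", handler=handle_version_3_3_0)
>>> v_latest = BackwardsCompatibleVersion(version="4.0.0", handler=handle_current_version)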
exception UnknownModelVersionError(supplied_version, *args)

Bases: Exception

Exception raised for unknown model versions.

This error is thrown when an unrecognized model version is encountered.

Parameters:
  • supplied_version (str)

  • args (tuple[Any, ...])

Return type:

None

add_container(existing_metadata)

Add container for previous versions.

Adds a container structure for previous versions of metadata. This function wraps the existing metadata in a new container structure that includes the ‘document_version’, ‘datadoc’, and ‘pseudonymization’ fields. The ‘document_version’ is set to “0.0.1” and ‘pseudonymization’ is set to None.

Parameters:

existing_metadata (dict) – The original metadata dictionary to be wrapped.

Return type:

dict

Returns:

A new dictionary containing the wrapped metadata with additional fields.

handle_current_version(supplied_metadata)

Handle the current version of the metadata.

This function returns the supplied metadata unmodified.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata for the current version.

Return type:

dict[str, Any]

Returns:

The unmodified supplied metadata.

handle_version_0_1_1(supplied_metadata)

Handle breaking changes for version 0.1.1.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 1.0.0. Specifically, it renames certain keys within the dataset and variables sections, and replaces empty string values with None for dataset keys.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary that needs to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

References

PR ref: https://github.com/statisticsnorway/ssb-datadoc-model/pull/4

handle_version_1_0_0(supplied_metadata)

Handle breaking changes for version 1.0.0.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 2.1.0. Specifically, it updates the date fields ‘metadata_created_date’ and ‘metadata_last_updated_date’ to ISO 8601 format with UTC timezone. It also converts the ‘data_source’ field from a string to a dictionary with language keys if necessary and removes the ‘data_source_path’ field. The ‘document_version’ is updated to “2.1.0”.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

handle_version_2_1_0(supplied_metadata)

Handle breaking changes for version 2.1.0.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 2.2.0. Specifically, it updates the ‘owner’ field in the ‘dataset’ section of the supplied metadata dictionary by converting it from a LanguageStringType to a string. The ‘document_version’ is updated to “2.2.0”.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

handle_version_2_2_0(supplied_metadata)

Handle breaking changes for version 2.2.0.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 3.1.0. Specifically, it updates the ‘subject_field’ in the ‘dataset’ section of the supplied metadata dictionary by converting it to a string. It also removes the ‘register_uri’ field from the ‘dataset’. Additionally, it removes ‘sentinel_value_uri’ from each variable, sets ‘special_value’ and ‘custom_type’ fields to None, and updates language strings in the ‘variables’ and ‘dataset’ sections. The ‘document_version’ is updated to “3.1.0”.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

handle_version_3_1_0(supplied_metadata)

Handle breaking changes for version 3.1.0.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 3.2.0. Specifically, it updates the ‘data_source’ field in both the ‘dataset’ and ‘variables’ sections of the supplied metadata dictionary by converting value to string. The ‘document_version’ field is also updated to “3.2.0”.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

handle_version_3_2_0(supplied_metadata)

Handle breaking changes for version 3.2.0.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 3.3.0. Specifically, it updates the ‘contains_data_from’ and ‘contains_data_until’ fields in both the ‘dataset’ and ‘variables’ sections of the supplied metadata dictionary to ensure they are stored as date strings. It also updates the ‘document_version’ field to “3.3.0”.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

handle_version_3_3_0(supplied_metadata)

Handle breaking changes for version 3.3.0.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 4.0.0. Specifically, it removes the ‘direct_person_identifying’ field from each variable in ‘datadoc.variables’ and updates the ‘document_version’ field to “4.0.0”.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

is_metadata_in_container_structure(metadata)

Check if the metadata is in the container structure.

At a certain point a metadata ‘container’ was introduced. The container provides a structure for different ‘types’ of metadata, such as ‘datadoc’, ‘pseudonymization’ etc. This function determines if the given metadata dictionary follows this container structure by checking for the presence of the ‘datadoc’ field.

Parameters:

metadata (dict) – The metadata dictionary to check.

Return type:

bool

Returns:

True if the metadata is in the container structure (i.e., contains the ‘datadoc’ field), False otherwise.

upgrade_metadata(fresh_metadata)

Upgrade the metadata to the latest version using registered handlers.

This function checks the version of the provided metadata and applies a series of upgrade handlers to migrate the metadata to the latest version. It starts from the provided version and applies all subsequent handlers in sequence. If the metadata is already in the latest version or the version cannot be determined, appropriate actions are taken.

Parameters:

fresh_metadata (dict[str, Any]) – The metadata dictionary to be upgraded. This dictionary must include version information that determines which handlers to apply.

Return type:

dict[str, Any]

Returns:

The upgraded metadata dictionary, after applying all necessary handlers.

Raises:

UnknownModelVersionError – If the metadata’s version is unknown or unsupported.
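
Example

A minimal sketch of reading and upgrading an old document; the file path is hypothetical:

>>> import json
>>> with open("person_data_v1__DOC.json") as f:
...     old_metadata = json.load(f)
>>> upgraded = upgrade_metadata(old_metadata)   # applies every handler registered after the document's version
>>> upgraded["datadoc"]["document_version"]
'4.0.0'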

dapla_metadata.datasets.model_validation module

Handle validation for metadata with pydantic validators and custom warnings.

exception ObligatoryDatasetWarning

Bases: UserWarning

Custom warning for checking obligatory metadata for dataset.

exception ObligatoryVariableWarning

Bases: UserWarning

Custom warning for checking obligatory metadata for variables.

class ValidateDatadocMetadata(**data)

Bases: DatadocMetadata

Class that inherits from DatadocMetadata, providing additional validation.

Parameters:
  • percentage_complete (int | None)

  • document_version (Literal['4.0.0'])

  • dataset (Dataset | None)

  • variables (list[Variable] | None)

check_date_order()

Validate the order of date fields.

Check that dataset and variable date fields contains_data_from and contains_data_until are in chronological order.

Mode: This validator runs after other validation.

Return type:

Self

Returns:

The instance of the model after validation.

Raises:

ValueError – If contains_data_until date is earlier than contains_data_from date.

check_inherit_values()

Inherit values from dataset to variables if not set.

Sets values for ‘data source’, ‘temporality type’, ‘contains data from’, and ‘contains data until’ if they are None.

Mode: This validator runs after other validation.

Return type:

Self

Returns:

The instance of the model after validation.

check_metadata_created_date()

Ensure metadata_created_date is set for the dataset.

Sets the current timestamp if metadata_created_date is None.

Mode: This validator runs after other validation.

Return type:

Self

Returns:

The instance of the model after validation.

check_obligatory_dataset_metadata()

Check obligatory dataset fields and issue a warning if any are missing.

Mode: This validator runs after other validation.

Return type:

Self

Returns:

The instance of the model after validation.

Raises:

ObligatoryDatasetWarning – If not all obligatory dataset metadata fields are filled in.

check_obligatory_variables_metadata()

Check obligatory variable fields and issue a warning if any are missing.

Mode: This validator runs after other validation.

Return type:

Self

Returns:

The instance of the model after validation.

Raises:

ObligatoryVariableWarning – If not all obligatory variable metadata fields are filled in.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'use_enum_values': True, 'validate_assignment': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'dataset': FieldInfo(annotation=Union[Dataset, NoneType], required=False, default=None), 'document_version': FieldInfo(annotation=Literal['4.0.0'], required=False, default='4.0.0', description='Version of this model'), 'percentage_complete': FieldInfo(annotation=Union[int, NoneType], required=False, default=None, description='Percentage of obligatory metadata fields populated.'), 'variables': FieldInfo(annotation=Union[list[Variable], NoneType], required=False, default=None)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.
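
Example

A minimal validation sketch; model_validate is the standard Pydantic v2 entry point, and the dictionary is assumed to contain a dataset section:

>>> validated = ValidateDatadocMetadata.model_validate(metadata_dictionary)   # runs the validators listed above
>>> validated.dataset.metadata_created_date is not None   # set by check_metadata_created_date if it was missing
True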

exception ValidationWarning

Bases: UserWarning

Custom warning for validation purposes.

custom_warning_handler(message, category, filename, lineno, file=None, line=None)

Handle warnings.

Return type:

None

Parameters:
  • message (Warning | str)

  • category (type[Warning])

  • filename (str)

  • lineno (int)

  • file (TextIO | None)

  • line (str | None)

dapla_metadata.datasets.statistic_subject_mapping module

class PrimarySubject(titles, subject_code, secondary_subjects)

Bases: Subject

Data structure for primary subjects or ‘hovedemne’.

Parameters:
  • titles (dict[str, str])

  • subject_code (str)

  • secondary_subjects (list[SecondarySubject])

secondary_subjects: list[SecondarySubject]
class SecondarySubject(titles, subject_code, statistic_short_names)

Bases: Subject

Data structure for secondary subjects or ‘delemne’.

Parameters:
  • titles (dict[str, str])

  • subject_code (str)

  • statistic_short_names (list[str])

statistic_short_names: list[str]
class StatisticSubjectMapping(executor, source_url)

Bases: GetExternalSource

Provide mapping between statistic short name and primary and secondary subject.

Parameters:
  • executor (ThreadPoolExecutor)

  • source_url (str | None)

get_secondary_subject(statistic_short_name)

Looks up the secondary subject for the given statistic short name in the mapping dict.

Returns the secondary subject string if found, else None.

Return type:

str | None

Parameters:

statistic_short_name (str | None)

property primary_subjects: list[PrimarySubject]

Getter for primary subjects.
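
Example

A minimal usage sketch; the statistic short name is illustrative:

>>> from concurrent.futures import ThreadPoolExecutor
>>> mapping = StatisticSubjectMapping(ThreadPoolExecutor(max_workers=1), source_url=get_statistical_subject_source_url())
>>> mapping.wait_for_external_result()
>>> mapping.get_secondary_subject("befolkning")   # the secondary subject code, or None if the short name is unknown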

class Subject(titles, subject_code)

Bases: object

Base class for Primary and Secondary subjects.

A statistical subject is a related grouping of statistics.

Parameters:
  • titles (dict[str, str])

  • subject_code (str)

get_title(language)

Get the title in the given language.

Return type:

str

Parameters:

language (SupportedLanguages)

subject_code: str
titles: dict[str, str]

dapla_metadata.datasets.user_info module

class DaplaLabUserInfo

Bases: object

Information about the current user when running on Dapla Lab.

property short_email: str | None

Get the short email address.

class JupyterHubUserInfo

Bases: object

Information about the current user when running on JupyterHub.

property short_email: str | None

Get the short email address.

class TestUserInfo

Bases: object

Information about the current user for local development and testing.

property short_email: str | None

Get the short email address.

class UnknownUserInfo

Bases: object

Fallback when no implementation is found.

property short_email: str | None

Unknown email address.

class UserInfo(*args, **kwargs)

Bases: Protocol

Information about the current user.

Implementations may be provided for different platforms or testing.

property short_email: str | None

Get the short email address.

get_user_info_for_current_platform()

Return the correct implementation of UserInfo for the current platform.

Return type:

UserInfo
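
Example

A minimal sketch; the email address shown is purely illustrative:

>>> user = get_user_info_for_current_platform()
>>> user.short_email   # e.g. 'abc@ssb.no' on Dapla Lab, or None if it cannot be determined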

Module contents

Document dataset.

dapla_metadata.datasets.utility package

Submodules

dapla_metadata.datasets.utility.constants module

Repository for constant values in Datadoc backend.

dapla_metadata.datasets.utility.enums module

Enumerations used in Datadoc.

class DaplaRegion(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: str, Enum

Dapla platforms/regions.

BIP = 'BIP'
CLOUD_RUN = 'CLOUD_RUN'
DAPLA_LAB = 'DAPLA_LAB'
ON_PREM = 'ON_PREM'
class DaplaService(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: str, Enum

Dapla services.

DATADOC = 'DATADOC'
JUPYTERLAB = 'JUPYTERLAB'
KILDOMATEN = 'KILDOMATEN'
R_STUDIO = 'R_STUDIO'
VS_CODE = 'VS_CODE'
class SupportedLanguages(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: str, Enum

The list of languages metadata may be recorded in.

Reference: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

ENGLISH = 'en'
NORSK_BOKMÅL = 'nb'
NORSK_NYNORSK = 'nn'

dapla_metadata.datasets.utility.utils module

calculate_percentage(completed, total)

Calculate percentage as a rounded integer.

Parameters:
  • completed (int) – The number of completed items.

  • total (int) – The total number of items.

Return type:

int

Returns:

The rounded percentage of completed items out of the total.
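
Example

A minimal sketch:

>>> calculate_percentage(3, 4)
75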

derive_assessment_from_state(state)

Derive assessment from dataset state.

Parameters:

state (DataSetState) – The state of the dataset.

Return type:

Assessment

Returns:

The derived assessment of the dataset.

get_missing_obligatory_dataset_fields(dataset)

Identify all obligatory dataset fields that are missing values.

This function checks for obligatory fields that are either directly missing (i.e., set to None) or have multilanguage values with empty content.

Parameters:

dataset (Dataset) – The dataset object to examine. This object must support the model_dump() method which returns a dictionary of field names and values.

Returns:

A list of field names (as strings) that are missing values. This includes:

  • Fields that are directly None and are listed as obligatory metadata.

  • Multilanguage fields (listed as obligatory metadata) where the value exists but the primary language text is empty.

Return type:

list[str]

get_missing_obligatory_variables_fields(variables)

Identify obligatory variable fields that are missing values for each variable.

This function checks for obligatory fields that are either directly missing (i.e., set to None) or have multilanguage values with empty content.

Parameters:

variables (list) – A list of variable objects to check for missing obligatory fields.

Return type:

list[dict]

Returns:

A list of dictionaries with variable short names as keys and lists of missing obligatory variable fields as values. This includes:

  • Fields that are directly None and are listed as obligatory metadata.

  • Multilanguage fields (listed as obligatory metadata) where the value exists but the primary language text is empty.

get_timestamp_now()

Return a timestamp for the current moment.

Return type:

datetime

incorrect_date_order(date_from, date_until)

Evaluate the chronological order of two dates.

This function checks if ‘date until’ is earlier than ‘date from’. If so, it indicates an incorrect date order.

Parameters:
  • date_from (date | None) – The start date of the time period.

  • date_until (date | None) – The end date of the time period.

Return type:

bool

Returns:

True if ‘date_until’ is earlier than ‘date_from’ or if only ‘date_from’ is None, False otherwise.

Example

>>> incorrect_date_order(datetime.date(1980, 1, 1), datetime.date(1967, 1, 1))
True
>>> incorrect_date_order(datetime.date(1967, 1, 1), datetime.date(1980, 1, 1))
False
>>> incorrect_date_order(None, datetime.date(2024,7,1))
True
merge_variables(existing_metadata, extracted_metadata, merged_metadata)

Merges variables from the extracted metadata into the existing metadata and updates the merged metadata.

This function compares the variables from extracted_metadata with those in existing_metadata. For each variable in extracted_metadata, it checks if a variable with the same short_name exists in existing_metadata. If a match is found, it updates the existing variable with information from extracted_metadata. If no match is found, the variable from extracted_metadata is directly added to merged_metadata.

Parameters:
  • existing_metadata (DatadocMetadata) – The metadata object containing the current state of variables.

  • extracted_metadata (DatadocMetadata) – The metadata object containing new or updated variables to merge.

  • merged_metadata (DatadocMetadata) – The metadata object that will contain the result of the merge.

Returns:

The merged_metadata object containing variables from both existing_metadata and extracted_metadata.

Return type:

model.DatadocMetadata

normalize_path(path)

Obtain a pathlib compatible Path.

Obtains a pathlib compatible Path regardless of whether the file is on a filesystem or in GCS.

Parameters:

path (str) – Path on a filesystem or in cloud storage.

Return type:

Path | CloudPath

Returns:

Pathlib compatible object.
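
Example

A minimal sketch; both paths are hypothetical:

>>> normalize_path("gs://ssb-staging-dapla-felles-data-delt/datadoc/dataset.parquet")   # returns a CloudPath
>>> normalize_path("/buckets/produkt/dataset.parquet")                                  # returns a pathlib.Path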

num_obligatory_dataset_fields_completed(dataset)

Count the number of completed obligatory dataset fields.

This function returns the total count of obligatory fields in the dataset that have values (are not None).

Parameters:

dataset (Dataset) – The dataset object for which to count the fields.

Return type:

int

Returns:

The number of obligatory dataset fields that have been completed (not None).

num_obligatory_variable_fields_completed(variable)

Count the number of obligatory fields completed for one variable.

This function calculates the total number of obligatory fields that have values (are not None) for the given variable.

Parameters:

variable (Variable) – The variable to count obligatory fields for.

Return type:

int

Returns:

The total number of obligatory variable fields that have been completed (not None) for one variable.

num_obligatory_variables_fields_completed(variables)

Count the number of obligatory fields completed for all variables.

This function calculates the total number of obligatory fields that have values (are not None) across all variables in the list.

Parameters:

variables (list) – A list with variable objects.

Return type:

int

Returns:

The total number of obligatory variable fields that have been completed (not None) for all variables.

override_dataset_fields(merged_metadata, existing_metadata)

Overrides specific fields in the dataset of merged_metadata with values from the dataset of existing_metadata.

This function iterates over a predefined list of fields, DATASET_FIELDS_FROM_EXISTING_METADATA, and sets the corresponding fields in the merged_metadata.dataset object to the values from the existing_metadata.dataset object.

Parameters:
  • merged_metadata (DatadocMetadata) – An instance of DatadocMetadata containing the dataset to be updated.

  • existing_metadata (DatadocMetadata) – An instance of DatadocMetadata containing the dataset whose values are used to update merged_metadata.dataset.

Return type:

None

Returns:

None.

running_in_notebook()

Return True if running in Jupyter Notebook.

Return type:

bool

set_default_values_dataset(dataset)

Set default values on dataset.

Parameters:

dataset (Dataset) – The dataset object to set default values on.

Return type:

None

Example

>>> dataset = model.Dataset(id=None, contains_personal_data=None)
>>> set_default_values_dataset(dataset)
>>> dataset.id is not None
True
>>> dataset.contains_personal_data == False
True
set_default_values_variables(variables)

Set default values on variables.

Parameters:

variables (list) – A list of variable objects to set default values on.

Return type:

None

Example

>>> variables = [model.Variable(short_name="pers",id=None, is_personal_data = None), model.Variable(short_name="fnr",id='9662875c-c245-41de-b667-12ad2091a1ee', is_personal_data='PSEUDONYMISED_ENCRYPTED_PERSONAL_DATA')]
>>> set_default_values_variables(variables)
>>> isinstance(variables[0].id, uuid.UUID)
True
>>> variables[1].is_personal_data == 'PSEUDONYMISED_ENCRYPTED_PERSONAL_DATA'
True
>>> variables[0].is_personal_data == 'NOT_PERSONAL_DATA'
True
set_variables_inherit_from_dataset(dataset, variables)

Set specific dataset values on a list of variable objects.

This function populates ‘data source’, ‘temporality type’, ‘contains data from’, and ‘contains data until’ fields in each variable if they are not set (None). The values are inherited from the corresponding fields in the dataset.

Parameters:
  • dataset (Dataset) – The dataset object from which to inherit values.

  • variables (list) – A list of variable objects to update with dataset values.

Return type:

None

Example

>>> dataset = model.Dataset(short_name='person_data_v1',data_source='01',temporality_type='STATUS',id='9662875c-c245-41de-b667-12ad2091a1ee',contains_data_from="2010-09-05",contains_data_until="2022-09-05")
>>> variables = [model.Variable(short_name="pers",data_source =None,temporality_type = None, contains_data_from = None,contains_data_until = None)]
>>> set_variables_inherit_from_dataset(dataset, variables)
>>> variables[0].data_source == dataset.data_source
True
>>> variables[0].temporality_type is None
False
>>> variables[0].contains_data_from == dataset.contains_data_from
True
>>> variables[0].contains_data_until == dataset.contains_data_until
True

Module contents

Utility files for Datadoc.

dapla_metadata.datasets.external_sources package

Submodules

dapla_metadata.datasets.external_sources.external_sources module

class GetExternalSource(executor)

Bases: ABC, Generic[T]

Abstract base class for retrieving data from external sources asynchronously.

This class provides methods to initiate an asynchronous data retrieval operation, check its status, and retrieve the result once the operation completes. Subclasses must implement the _fetch_data_from_external_source method to define how data is fetched from the specific external source.

Parameters:

executor (ThreadPoolExecutor)

check_if_external_data_is_loaded()

Check if the thread getting the external data has finished running.

Return type:

bool

Returns:

True if the data fetching operation is complete, False otherwise.

retrieve_external_data()

Retrieve the result of the data fetching operation.

This method checks if the asynchronous data fetching operation has completed. If the operation is finished, it returns the result. Otherwise, it returns None.

Return type:

Optional[TypeVar(T)]

Returns:

The result of the data fetching operation if it is complete or None if the operation has not yet finished.

wait_for_external_result()

Wait for the thread responsible for loading the external request to finish.

If there is no future to wait for, it logs a warning and returns immediately.

Return type:

None
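
Example

A minimal subclass sketch; the class name, data, and return type are hypothetical, and the private method name is taken from the description above:

>>> from concurrent.futures import ThreadPoolExecutor
>>> class CountryCodes(GetExternalSource[list[str]]):
...     def _fetch_data_from_external_source(self) -> list[str]:
...         return ["NO", "SE", "DK"]   # in practice, a call to the external source
>>> source = CountryCodes(ThreadPoolExecutor(max_workers=1))   # the executor runs the fetch in the background
>>> source.wait_for_external_result()
>>> source.retrieve_external_data()
['NO', 'SE', 'DK']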

Module contents

Abstract parent class for interacting with external resources asynchronously.