Reference

dapla_metadata package

Subpackages

Module contents

Tools and clients for working with the Dapla Metadata system.

dapla_metadata.datasets package

Subpackages

Submodules

dapla_metadata.datasets.code_list module

class CodeList(executor, classification_id)

Bases: GetExternalSource

Class for retrieving classifications from Klass.

This class fetches a classification given a classification ID and supports multiple languages.

Parameters:
  • executor (ThreadPoolExecutor)

  • classification_id (int | None)

supported_languages

A list of supported language codes.

_classifications

A list to store classification items.

classification_id

The ID of the classification to retrieve.

classifications_dataframes

A dictionary to store dataframes of classifications.

property classifications: list[CodeListItem]

Get the list of classifications.

Returns:

A list of CodeListItem objects.
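
Example

A minimal usage sketch; the classification ID and the executor setup are illustrative assumptions, not values mandated by the API:

>>> from concurrent.futures import ThreadPoolExecutor
>>> code_list = CodeList(ThreadPoolExecutor(max_workers=1), classification_id=91)  # hypothetical Klass ID
>>> code_list.wait_for_external_result()  # inherited from GetExternalSource; blocks until the fetch completes
>>> items = code_list.classifications     # list of CodeListItem objects (empty if nothing was retrieved)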

class CodeListItem(titles, code)

Bases: object

Data structure for a code list item.

Parameters:
titles

A dictionary mapping language codes to titles.

code

The code associated with the item.

code: str
get_title(language)

Return the title in the specified language.

Parameters:

language (SupportedLanguages) – The language code for which to get the title.

Return type:

str

Returns:

The title in the specified language. It returns the title in Norwegian Bokmål (“nb”) if the language is either Norwegian Bokmål or Norwegian Nynorsk, otherwise it returns the title in English (“en”). If none of these are available, it returns an empty string and logs an exception.

titles: dict[SupportedLanguages, str]
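
Example

A minimal sketch of the fallback behaviour described above; keying the titles dictionary by plain language codes is an assumption about how items are constructed:

>>> item = CodeListItem(titles={"nb": "Heltid", "en": "Full-time"}, code="01")
>>> item.get_title(SupportedLanguages.NORSK_NYNORSK)  # no Nynorsk title, so the Bokmål title is returned
'Heltid'
>>> item.get_title(SupportedLanguages.ENGLISH)
'Full-time'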

dapla_metadata.datasets.config module

Configuration management for dataset package.

get_dapla_region()

Get the Dapla region we’re running on.

Return type:

DaplaRegion | None

get_dapla_service()

Get the Dapla service we’re running on.

Return type:

DaplaService | None

get_jupyterhub_user()

Get the JupyterHub user name.

Return type:

str | None

get_oidc_token()

Get the JWT token from the environment.

Return type:

str | None

get_statistical_subject_source_url()

Get the URL to the statistical subject source.

Return type:

str | None
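
Example

These getters read configuration provided by the runtime environment and return None when nothing is set; a minimal sketch:

>>> region = get_dapla_region()       # e.g. DaplaRegion.DAPLA_LAB when running on Dapla Lab, otherwise None
>>> service = get_dapla_service()     # e.g. DaplaService.JUPYTERLAB, otherwise None
>>> get_jupyterhub_user() is None     # True when not running under JupyterHub
True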

dapla_metadata.datasets.core module

Handle reading, updating and writing of metadata.

class Datadoc(dataset_path=None, metadata_document_path=None, statistic_subject_mapping=None, *, errors_as_warnings=False)

Bases: object

Handle reading, updating and writing of metadata.

If a metadata document exists, that information is loaded and nothing is inferred from the dataset. If only a dataset path is supplied, the metadata document path is built from the dataset path.

Example: /path/to/dataset.parquet -> /path/to/dataset__DOC.json

Parameters:
  • dataset_path (str | None)

  • metadata_document_path (str | None)

  • statistic_subject_mapping (StatisticSubjectMapping | None)

  • errors_as_warnings (bool)

dataset_path

The file path to where the dataset is stored.

metadata_document_path

A path to a metadata document if it exists.

statistic_subject_mapping

An instance of StatisticSubjectMapping.

static build_metadata_document_path(dataset_path)

Build the path to the metadata document corresponding to the given dataset.

Parameters:

dataset_path (Path | CloudPath) – Path to the dataset we wish to create metadata for.

Return type:

Path | CloudPath
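
Example

A sketch of the path mapping described above, on a POSIX filesystem and with a hypothetical path:

>>> Datadoc.build_metadata_document_path(pathlib.Path("/path/to/dataset.parquet"))
PosixPath('/path/to/dataset__DOC.json')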

property percent_complete: int

The percentage of obligatory metadata completed.

A metadata field is counted as complete when any non-None value is assigned. Used for a live progress bar in the UI, as well as being saved in the datadoc as a simple quality indicator.

write_metadata_document()

Write all currently known metadata to file.

Return type:

None

Side Effects:
  • Updates the dataset’s metadata_last_updated_date and metadata_last_updated_by attributes.

  • Updates the dataset’s file_path attribute.

  • Validates the metadata model and stores it in a MetadataContainer.

  • Writes the validated metadata to a file if the metadata_document attribute is set.

  • Logs the action and the content of the metadata document.

Raises:

ValueError – If no metadata document is specified for saving.
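
Example

A minimal end-to-end sketch; the dataset path is hypothetical:

>>> meta = Datadoc(dataset_path="befolkning/klargjorte_data/person_data_p2021_v1.parquet")
>>> meta.percent_complete              # integer percentage of obligatory fields filled in
>>> meta.write_metadata_document()     # writes person_data_p2021_v1__DOC.json next to the dataset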


exception InconsistentDatasetsError

Bases: ValueError

Existing and new datasets differ significantly from one another.

exception InconsistentDatasetsWarning

Bases: UserWarning

Existing and new datasets differ significantly from one another.

dapla_metadata.datasets.dapla_dataset_path_info module

Extract info from a path following SSB’s dataset naming convention.

class DaplaDatasetPathInfo(dataset_path)

Bases: object

Extract info from a path following SSB’s dataset naming convention.

Parameters:

dataset_path (str | os.PathLike[str])

property bucket_name: str | None

Extract the bucket name from the dataset path.

Returns:

The bucket name or None if the dataset path is not a GCS path.

Examples

>>> DaplaDatasetPathInfo('gs://ssb-staging-dapla-felles-data-delt/datadoc/utdata/person_data_p2021_v2.parquet').bucket_name
ssb-staging-dapla-felles-data-delt
>>> DaplaDatasetPathInfo(pathlib.Path('gs://ssb-staging-dapla-felles-data-delt/datadoc/utdata/person_data_p2021_v2.parquet')).bucket_name
ssb-staging-dapla-felles-data-delt
>>> DaplaDatasetPathInfo('gs:/ssb-staging-dapla-felles-data-delt/datadoc/utdata/person_data_p2021_v2.parquet').bucket_name
ssb-staging-dapla-felles-data-delt
>>> DaplaDatasetPathInfo('ssb-staging-dapla-felles-data-delt/datadoc/utdata/person_data_p2021_v2.parquet').bucket_name
None
property contains_data_from: date | None

The earliest date for which data in the dataset is relevant.

Returns:

The earliest relevant date for the dataset if available, otherwise None.

property contains_data_until: date | None

The latest date until which data in the dataset is relevant.

Returns:

The latest relevant date for the dataset if available, otherwise None.

property dataset_short_name: str | None

Extract the dataset short name from the filepath.

The dataset short name is defined as the first section of the stem, up to the period information or the version information if no period information is present.

Returns:

The extracted dataset short name if it can be determined, otherwise None.

Examples

>>> DaplaDatasetPathInfo('prosjekt/befolkning/klargjorte_data/person_data_v1.parquet').dataset_short_name
person_data
>>> DaplaDatasetPathInfo('befolkning/inndata/sykepenger_p2022Q1_p2022Q2_v23.parquet').dataset_short_name
sykepenger
>>> DaplaDatasetPathInfo('my_data/simple_dataset_name.parquet').dataset_short_name
simple_dataset_name
property dataset_state: DataSetState | None

Extract the dataset state from the path.

We assume that files are saved in the Norwegian language as specified by SSB.

Returns:

The extracted dataset state if it can be determined from the path, otherwise None.

Examples

>>> DaplaDatasetPathInfo('klargjorte_data/person_data_v1.parquet').dataset_state
<DataSetState.PROCESSED_DATA: 'PROCESSED_DATA'>
>>> DaplaDatasetPathInfo('klargjorte-data/person_data_v1.parquet').dataset_state
<DataSetState.PROCESSED_DATA: 'PROCESSED_DATA'>
>>> DaplaDatasetPathInfo('utdata/min_statistikk/person_data_v1.parquet').dataset_state
<DataSetState.OUTPUT_DATA: 'OUTPUT_DATA'>
>>> DaplaDatasetPathInfo('my_special_data/person_data_v1.parquet').dataset_state
None
property dataset_version: str | None

Extract the version information from the filename, if present.

Returns:

The extracted version information if available in the filename, otherwise None.

Examples

>>> DaplaDatasetPathInfo('person_data_v1.parquet').dataset_version
'1'
>>> DaplaDatasetPathInfo('person_data_v20.parquet').dataset_version
'20'
>>> DaplaDatasetPathInfo('person_data.parquet').dataset_version
None
path_complies_with_naming_standard()

Check if path is valid according to SSB standard.

Read more about SSB naming convention in the Dapla manual: https://manual.dapla.ssb.no/statistikkere/navnestandard.html

Return type:

bool

Returns:

True if the path conforms to the SSB naming standard, otherwise False.

property statistic_short_name: str | None

Extract the statistical short name from the filepath.

The statistical short name is the path element immediately before the dataset state, following the Dapla filepath naming convention.

Returns:

The extracted statistical short name if it can be determined, otherwise None.

Examples

>>> DaplaDatasetPathInfo('prosjekt/befolkning/klargjorte_data/person_data_v1.parquet').statistic_short_name
befolkning
>>> DaplaDatasetPathInfo('befolkning/inndata/person_data_v1.parquet').statistic_short_name
befolkning
>>> DaplaDatasetPathInfo('befolkning/person_data.parquet').statistic_short_name
None
class DateFormat(name, regex_pattern, arrow_pattern, timeframe)

Bases: ABC

A superclass for date formats.

Parameters:
  • name (str)

  • regex_pattern (str)

  • arrow_pattern (str)

  • timeframe (Literal['year', 'month', 'day', 'week'])

arrow_pattern: str
abstract get_ceil(period_string)

Abstract method implemented in the child class.

Return the last date of the timeframe period.

Parameters:

period_string (str) – A string representing the timeframe period.

Return type:

date | None

abstract get_floor(period_string)

Abstract method implemented in the child class.

Return the first date of the timeframe period.

Parameters:

period_string (str) – A string representing the timeframe period.

Return type:

date | None

name: str
regex_pattern: str
timeframe: Literal['year', 'month', 'day', 'week']
class IsoDateFormat(name, regex_pattern, arrow_pattern, timeframe)

Bases: DateFormat

A subclass of DateFormat with relevant patterns for ISO dates.

Parameters:
  • name (str)

  • regex_pattern (str)

  • arrow_pattern (str)

  • timeframe (Literal['year', 'month', 'day', 'week'])

get_ceil(period_string)

Return last date of timeframe period defined in ISO date format.

Return type:

date | None

Parameters:

period_string (str)

Examples

>>> ISO_YEAR.get_ceil("1921")
datetime.date(1921, 12, 31)
>>> ISO_YEAR_MONTH.get_ceil("2021-05")
datetime.date(2021, 5, 31)
get_floor(period_string)

Return first date of timeframe period defined in ISO date format.

Return type:

date | None

Parameters:

period_string (str)

Examples

>>> ISO_YEAR_MONTH.get_floor("1980-08")
datetime.date(1980, 8, 1)
>>> ISO_YEAR.get_floor("2021")
datetime.date(2021, 1, 1)
class SsbDateFormat(name, regex_pattern, arrow_pattern, timeframe, ssb_dates)

Bases: DateFormat

A subclass of DateFormat with relevant patterns for SSB-specific dates.

Parameters:
  • name (str)

  • regex_pattern (str)

  • arrow_pattern (str)

  • timeframe (Literal['year', 'month', 'day', 'week'])

  • ssb_dates (dict)

ssb_dates

A dictionary where keys are date format strings and values are corresponding date patterns specific to SSB.

get_ceil(period_string)

Return last date of the timeframe period defined in SSB date format.

Convert SSB format to date-string and return the last date.

Parameters:

period_string (str) – A string representing the timeframe period in SSB format.

Return type:

date | None

Returns:

The last date of the period if the period_string is a valid SSB format, otherwise None.

Example

>>> SSB_TRIANNUAL.get_ceil("1999T11")
None
>>> SSB_HALF_YEAR.get_ceil("2024H1")
datetime.date(2024, 6, 30)
>>> SSB_HALF_YEAR.get_ceil("2024-H1")
datetime.date(2024, 6, 30)
get_floor(period_string)

Return first date of the timeframe period defined in SSB date format.

Convert SSB format to date-string and return the first date.

Parameters:

period_string (str) – A string representing the timeframe period in SSB format.

Return type:

date | None

Returns:

The first date of the period if the period_string is a valid SSB format, otherwise None.

Example

>>> SSB_BIMESTER.get_floor("2003B8")
None
>>> SSB_BIMESTER.get_floor("2003B4")
datetime.date(2003, 7, 1)
>>> SSB_BIMESTER.get_floor("2003-B4")
datetime.date(2003, 7, 1)
ssb_dates: dict
categorize_period_string(period)

Categorize a period string into one of the supported date formats.

Parameters:

period (str) – A string representing the period to be categorized.

Return type:

IsoDateFormat | SsbDateFormat

Returns:

An instance of either IsoDateFormat or SsbDateFormat depending on the format of the input period string.

Raises:

NotImplementedError – If the period string is not recognized as either an ISO or SSB date format.

Examples

>>> date_format = categorize_period_string('2022-W01')
>>> date_format.name
ISO_YEAR_WEEK
>>> date_format = categorize_period_string('1954T2')
>>> date_format.name
SSB_TRIANNUAL
>>> categorize_period_string('unknown format')
Traceback (most recent call last):
...
NotImplementedError: Period format unknown format is not supported

dapla_metadata.datasets.dataset_parser module

Abstractions for dataset file formats.

Handles reading in the data and transforming data types to generic metadata types.

class DatasetParser(dataset)

Bases: ABC

Abstract Base Class for all Dataset parsers.

Implements:
  • A static factory method to get the correct implementation for each file extension.

  • A static method for data type conversion.

Requires implementation by subclasses:
  • A method to extract variables (columns) from the dataset, so they may be documented.

Parameters:

dataset (pathlib.Path | CloudPath)

static for_file(dataset)

Return the correct subclass based on the given dataset file.

Return type:

DatasetParser

Parameters:

dataset (Path | CloudPath)

abstract get_fields()

Abstract method, must be implemented by subclasses.

Return type:

list[Variable]

static transform_data_type(data_type)

Transform a concrete data type to an abstract data type.

In statistical metadata, one is not interested in how the data is technically stored, but in the meaning of the data type. Because of this, we transform known data types to their abstract metadata representations.

If we encounter a data type we don’t know, we just ignore it and let the user handle it in the GUI.

Parameters:

data_type (str) – The concrete data type to map.

Return type:

DataType | None
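
Example

A sketch of the intended mapping; treating 'int64' as a known integer type is an assumption:

>>> DatasetParser.transform_data_type("int64") == DataType.INTEGER   # known concrete type (assumed)
True
>>> DatasetParser.transform_data_type("some_exotic_type") is None    # unknown types are returned as None for the user to handle
True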

class DatasetParserParquet(dataset)

Bases: DatasetParser

Concrete implementation for parsing parquet files.

Parameters:

dataset (pathlib.Path | CloudPath)

get_fields()

Extract the fields from this dataset.

Return type:

list[Variable]
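
Example

A usage sketch of the factory together with field extraction; the file path is hypothetical:

>>> parser = DatasetParser.for_file(pathlib.Path("klargjorte_data/person_data_v1.parquet"))
>>> isinstance(parser, DatasetParserParquet)
True
>>> variables = parser.get_fields()    # one Variable per column in the dataset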

class DatasetParserSas7Bdat(dataset)

Bases: DatasetParser

Concrete implementation for parsing SAS7BDAT files.

Parameters:

dataset (pathlib.Path | CloudPath)

get_fields()

Extract the fields from this dataset.

Return type:

list[Variable]

dapla_metadata.datasets.model_backwards_compatibility module

Upgrade old metadata files to be compatible with new versions.

An important principle of Datadoc is that we ALWAYS guarantee backwards compatibility of existing metadata documents. This means that we guarantee that a user will never lose data, even if their document is decades old.

For each document version we release with breaking changes, we implement a handler and register the version by defining a BackwardsCompatibleVersion instance. These documents will then be upgraded when they’re opened in Datadoc.

A test must also be implemented for each new version.

class BackwardsCompatibleVersion(version, handler)

Bases: object

A version which we support with backwards compatibility.

This class registers a version and its corresponding handler function for backwards compatibility.

Parameters:
  • version (str)

  • handler (Callable[[dict[str, Any]], dict[str, Any]])

handler: Callable[[dict[str, Any]], dict[str, Any]]
version: str
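
Example

Registering a version and its handler might look like the following sketch; the exact pairings shown are assumptions about how the module wires versions to handlers:

>>> v_3_3_0 = BackwardsCompatibleVersion(version="3.3.0", handler=handle_version_3_3_0)
>>> v_latest = BackwardsCompatibleVersion(version="4.0.0", handler=handle_current_version)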
exception UnknownModelVersionError(supplied_version, *args)

Bases: Exception

Exception raised for unknown model versions.

This error is thrown when an unrecognized model version is encountered.

Parameters:
  • supplied_version (str)

  • args (tuple[Any, ...])

Return type:

None

add_container(existing_metadata)

Add container for previous versions.

Adds a container structure for previous versions of metadata. This function wraps the existing metadata in a new container structure that includes the ‘document_version’, ‘datadoc’, and ‘pseudonymization’ fields. The ‘document_version’ is set to “0.0.1” and ‘pseudonymization’ is set to None.

Parameters:

existing_metadata (dict) – The original metadata dictionary to be wrapped.

Return type:

dict

Returns:

A new dictionary containing the wrapped metadata with additional fields.

handle_current_version(supplied_metadata)

Handle the current version of the metadata.

This function returns the supplied metadata unmodified.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata for the current version.

Return type:

dict[str, Any]

Returns:

The unmodified supplied metadata.

handle_version_0_1_1(supplied_metadata)

Handle breaking changes for version 0.1.1.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 1.0.0. Specifically, it renames certain keys within the dataset and variables sections, and replaces empty string values with None for dataset keys.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary that needs to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

References

PR ref: https://github.com/statisticsnorway/ssb-datadoc-model/pull/4

handle_version_1_0_0(supplied_metadata)

Handle breaking changes for version 1.0.0.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 2.1.0. Specifically, it updates the date fields ‘metadata_created_date’ and ‘metadata_last_updated_date’ to ISO 8601 format with UTC timezone. It also converts the ‘data_source’ field from a string to a dictionary with language keys if necessary and removes the ‘data_source_path’ field. The ‘document_version’ is updated to “2.1.0”.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

handle_version_2_1_0(supplied_metadata)

Handle breaking changes for version 2.1.0.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 2.2.0. Specifically, it updates the ‘owner’ field in the ‘dataset’ section of the supplied metadata dictionary by converting it from a LanguageStringType to a string. The ‘document_version’ is updated to “2.2.0”.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

handle_version_2_2_0(supplied_metadata)

Handle breaking changes for version 2.2.0.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 3.1.0. Specifically, it updates the ‘subject_field’ in the ‘dataset’ section of the supplied metadata dictionary by converting it to a string. It also removes the ‘register_uri’ field from the ‘dataset’. Additionally, it removes ‘sentinel_value_uri’ from each variable, sets ‘special_value’ and ‘custom_type’ fields to None, and updates language strings in the ‘variables’ and ‘dataset’ sections. The ‘document_version’ is updated to “3.1.0”.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

handle_version_3_1_0(supplied_metadata)

Handle breaking changes for version 3.1.0.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 3.2.0. Specifically, it updates the ‘data_source’ field in both the ‘dataset’ and ‘variables’ sections of the supplied metadata dictionary by converting value to string. The ‘document_version’ field is also updated to “3.2.0”.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

handle_version_3_2_0(supplied_metadata)

Handle breaking changes for version 3.2.0.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 3.3.0. Specifically, it updates the ‘contains_data_from’ and ‘contains_data_until’ fields in both the ‘dataset’ and ‘variables’ sections of the supplied metadata dictionary to ensure they are stored as date strings. It also updates the ‘document_version’ field to “3.3.0”.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

handle_version_3_3_0(supplied_metadata)

Handle breaking changes for version 3.3.0.

This function modifies the supplied metadata to accommodate breaking changes introduced in version 4.0.0. Specifically, it removes the ‘direct_person_identifying’ field from each variable in ‘datadoc.variables’ and updates the ‘document_version’ field to “4.0.0”.

Parameters:

supplied_metadata (dict[str, Any]) – The metadata dictionary to be updated.

Return type:

dict[str, Any]

Returns:

The updated metadata dictionary.

is_metadata_in_container_structure(metadata)

Check if the metadata is in the container structure.

At a certain point a metadata ‘container’ was introduced. The container provides a structure for different ‘types’ of metadata, such as ‘datadoc’, ‘pseudonymization’ etc. This function determines if the given metadata dictionary follows this container structure by checking for the presence of the ‘datadoc’ field.

Parameters:

metadata (dict) – The metadata dictionary to check.

Return type:

bool

Returns:

True if the metadata is in the container structure (i.e., contains the ‘datadoc’ field), False otherwise.

upgrade_metadata(fresh_metadata)

Upgrade the metadata to the latest version using registered handlers.

This function checks the version of the provided metadata and applies a series of upgrade handlers to migrate the metadata to the latest version. It starts from the provided version and applies all subsequent handlers in sequence. If the metadata is already in the latest version or the version cannot be determined, appropriate actions are taken.

Parameters:

fresh_metadata (dict[str, Any]) – The metadata dictionary to be upgraded. This dictionary must include version information that determines which handlers to apply.

Return type:

dict[str, Any]

Returns:

The upgraded metadata dictionary, after applying all necessary handlers.

Raises:

UnknownModelVersionError – If the metadata’s version is unknown or unsupported.
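
Example

A minimal sketch of reading and upgrading an old document; the file path is hypothetical:

>>> import json
>>> with open("person_data_v1__DOC.json") as f:
...     old_metadata = json.load(f)
>>> upgraded = upgrade_metadata(old_metadata)   # applies every handler registered after the document's version
>>> upgraded["datadoc"]["document_version"]
'4.0.0'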

dapla_metadata.datasets.model_validation module

Handle validation for metadata with pydantic validators and custom warnings.

exception ObligatoryDatasetWarning

Bases: UserWarning

Custom warning for checking obligatory metadata for dataset.

exception ObligatoryVariableWarning

Bases: UserWarning

Custom warning for checking obligatory metadata for variables.

class ValidateDatadocMetadata(**data)

Bases: DatadocMetadata

Class that inherits from DatadocMetadata, providing additional validation.

Parameters:
  • percentage_complete (int | None)

  • document_version (Literal['4.0.0'])

  • dataset (Dataset | None)

  • variables (list[Variable] | None)

check_date_order()

Validate the order of date fields.

Check that dataset and variable date fields contains_data_from and contains_data_until are in chronological order.

Mode: This validator runs after other validation.

Return type:

Self

Returns:

The instance of the model after validation.

Raises:

ValueError – If contains_data_until date is earlier than contains_data_from date.

check_inherit_values()

Inherit values from dataset to variables if not set.

Sets values for ‘data source’, ‘temporality type’, ‘contains data from’, and ‘contains data until’ if they are None.

Mode: This validator runs after other validation.

Return type:

Self

Returns:

The instance of the model after validation.

check_metadata_created_date()

Ensure metadata_created_date is set for the dataset.

Sets the current timestamp if metadata_created_date is None.

Mode: This validator runs after other validation.

Return type:

Self

Returns:

The instance of the model after validation.

check_obligatory_dataset_metadata()

Check obligatory dataset fields and issue a warning if any are missing.

Mode: This validator runs after other validation.

Return type:

Self

Returns:

The instance of the model after validation.

Raises:

ObligatoryDatasetWarning – If not all obligatory dataset metadata fields are filled in.

check_obligatory_variables_metadata()

Check obligatory variable fields and issue a warning if any are missing.

Mode: This validator runs after other validation.

Return type:

Self

Returns:

The instance of the model after validation.

Raises:

ObligatoryVariableWarning – If not all obligatory variable metadata fields are filled in.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'use_enum_values': True, 'validate_assignment': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'dataset': FieldInfo(annotation=Union[Dataset, NoneType], required=False, default=None), 'document_version': FieldInfo(annotation=Literal['4.0.0'], required=False, default='4.0.0', description='Version of this model'), 'percentage_complete': FieldInfo(annotation=Union[int, NoneType], required=False, default=None, description='Percentage of obligatory metadata fields populated.'), 'variables': FieldInfo(annotation=Union[list[Variable], NoneType], required=False, default=None)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.
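
Example

A minimal validation sketch; model_validate is the standard Pydantic v2 entry point, and the dictionary is assumed to contain a dataset section:

>>> validated = ValidateDatadocMetadata.model_validate(metadata_dictionary)   # runs the validators listed above
>>> validated.dataset.metadata_created_date is not None   # set by check_metadata_created_date if it was missing
True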

exception ValidationWarning

Bases: UserWarning

Custom warning for validation purposes.

custom_warning_handler(message, category, filename, lineno, file=None, line=None)

Handle warnings.

Return type:

None

Parameters:
  • message (Warning | str)

  • category (type[Warning])

  • filename (str)

  • lineno (int)

  • file (TextIO | None)

  • line (str | None)

dapla_metadata.datasets.statistic_subject_mapping module

class PrimarySubject(titles, subject_code, secondary_subjects)

Bases: Subject

Data structure for primary subjects or ‘hovedemne’.

Parameters:
  • titles (dict[str, str])

  • subject_code (str)

  • secondary_subjects (list[SecondarySubject])

secondary_subjects: list[SecondarySubject]
class SecondarySubject(titles, subject_code, statistic_short_names)

Bases: Subject

Data structure for secondary subjects or ‘delemne’.

Parameters:
  • titles (dict[str, str])

  • subject_code (str)

  • statistic_short_names (list[str])

statistic_short_names: list[str]
class StatisticSubjectMapping(executor, source_url)

Bases: GetExternalSource

Provide mapping between statistic short name and primary and secondary subject.

Parameters:
  • executor (ThreadPoolExecutor)

  • source_url (str | None)

get_secondary_subject(statistic_short_name)

Looks up the secondary subject for the given statistic short name in the mapping dict.

Returns the secondary subject string if found, else None.

Return type:

str | None

Parameters:

statistic_short_name (str | None)

property primary_subjects: list[PrimarySubject]

Getter for primary subjects.
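
Example

A minimal usage sketch; the statistic short name is illustrative:

>>> from concurrent.futures import ThreadPoolExecutor
>>> mapping = StatisticSubjectMapping(ThreadPoolExecutor(max_workers=1), source_url=get_statistical_subject_source_url())
>>> mapping.wait_for_external_result()
>>> mapping.get_secondary_subject("befolkning")   # the secondary subject code, or None if the short name is unknown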

class Subject(titles, subject_code)

Bases: object

Base class for Primary and Secondary subjects.

A statistical subject is a related grouping of statistics.

Parameters:
  • titles (dict[str, str])

  • subject_code (str)

get_title(language)

Get the title in the given language.

Return type:

str

Parameters:

language (SupportedLanguages)

subject_code: str
titles: dict[str, str]

dapla_metadata.datasets.user_info module

class DaplaLabUserInfo

Bases: object

Information about the current user when running on Dapla Lab.

property short_email: str | None

Get the short email address.

class JupyterHubUserInfo

Bases: object

Information about the current user when running on JupyterHub.

property short_email: str | None

Get the short email address.

class TestUserInfo

Bases: object

Information about the current user for local development and testing.

property short_email: str | None

Get the short email address.

class UnknownUserInfo

Bases: object

Fallback when no implementation is found.

property short_email: str | None

Unknown email address.

class UserInfo(*args, **kwargs)

Bases: Protocol

Information about the current user.

Implementations may be provided for different platforms or testing.

property short_email: str | None

Get the short email address.

get_user_info_for_current_platform()

Return the correct implementation of UserInfo for the current platform.

Return type:

UserInfo
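
Example

A minimal sketch; the email address shown is purely illustrative:

>>> user = get_user_info_for_current_platform()
>>> user.short_email   # e.g. 'abc@ssb.no' on Dapla Lab, or None if it cannot be determined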

Module contents

Document dataset.

dapla_metadata.datasets.utility package

Submodules

dapla_metadata.datasets.utility.constants module

Repository for constant values in Datadoc backend.

dapla_metadata.datasets.utility.enums module

Enumerations used in Datadoc.

class DaplaRegion(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: str, Enum

Dapla platforms/regions.

BIP = 'BIP'
CLOUD_RUN = 'CLOUD_RUN'
DAPLA_LAB = 'DAPLA_LAB'
ON_PREM = 'ON_PREM'
class DaplaService(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: str, Enum

Dapla services.

DATADOC = 'DATADOC'
JUPYTERLAB = 'JUPYTERLAB'
KILDOMATEN = 'KILDOMATEN'
R_STUDIO = 'R_STUDIO'
VS_CODE = 'VS_CODE'
class SupportedLanguages(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: str, Enum

The list of languages metadata may be recorded in.

Reference: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

ENGLISH = 'en'
NORSK_BOKMÅL = 'nb'
NORSK_NYNORSK = 'nn'

dapla_metadata.datasets.utility.utils module

calculate_percentage(completed, total)

Calculate percentage as a rounded integer.

Parameters:
  • completed (int) – The number of completed items.

  • total (int) – The total number of items.

Return type:

int

Returns:

The rounded percentage of completed items out of the total.
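
Example

A minimal sketch:

>>> calculate_percentage(3, 4)
75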

derive_assessment_from_state(state)

Derive assessment from dataset state.

Parameters:

state (DataSetState) – The state of the dataset.

Return type:

Assessment

Returns:

The derived assessment of the dataset.

get_missing_obligatory_dataset_fields(dataset)

Identify all obligatory dataset fields that are missing values.

This function checks for obligatory fields that are either directly missing (i.e., set to None) or have multilanguage values with empty content.

Parameters:

dataset (Dataset) – The dataset object to examine. This object must support the model_dump() method which returns a dictionary of field names and values.

Returns:

A list of field names (as strings) that are missing values. This includes:

  • Fields that are directly None and are listed as obligatory metadata.

  • Multilanguage fields (listed as obligatory metadata) where the value exists but the primary language text is empty.

Return type:

list[str]

get_missing_obligatory_variables_fields(variables)

Identify obligatory variable fields that are missing values for each variable.

This function checks for obligatory fields that are either directly missing (i.e., set to None) or have multilanguage values with empty content.

Parameters:

variables (list) – A list of variable objects to check for missing obligatory fields.

Return type:

list[dict]

Returns:

A list of dictionaries with variable short names as keys and lists of missing obligatory variable fields as values. This includes:

  • Fields that are directly None and are listed as obligatory metadata.

  • Multilanguage fields (listed as obligatory metadata) where the value exists but the primary language text is empty.

get_timestamp_now()

Return a timestamp for the current moment.

Return type:

datetime

incorrect_date_order(date_from, date_until)

Evaluate the chronological order of two dates.

This function checks if ‘date until’ is earlier than ‘date from’. If so, it indicates an incorrect date order.

Parameters:
  • date_from (date | None) – The start date of the time period.

  • date_until (date | None) – The end date of the time period.

Return type:

bool

Returns:

True if ‘date_until’ is earlier than ‘date_from’ or if only ‘date_from’ is None, False otherwise.

Example

>>> incorrect_date_order(datetime.date(1980, 1, 1), datetime.date(1967, 1, 1))
True
>>> incorrect_date_order(datetime.date(1967, 1, 1), datetime.date(1980, 1, 1))
False
>>> incorrect_date_order(None, datetime.date(2024,7,1))
True
merge_variables(existing_metadata, extracted_metadata, merged_metadata)

Merges variables from the extracted metadata into the existing metadata and updates the merged metadata.

This function compares the variables from extracted_metadata with those in existing_metadata. For each variable in extracted_metadata, it checks if a variable with the same short_name exists in existing_metadata. If a match is found, it updates the existing variable with information from extracted_metadata. If no match is found, the variable from extracted_metadata is directly added to merged_metadata.

Parameters:
  • existing_metadata (DatadocMetadata) – The metadata object containing the current state of variables.

  • extracted_metadata (DatadocMetadata) – The metadata object containing new or updated variables to merge.

  • merged_metadata (DatadocMetadata) – The metadata object that will contain the result of the merge.

Returns:

The merged_metadata object containing variables from both existing_metadata and extracted_metadata.

Return type:

model.DatadocMetadata

normalize_path(path)

Obtain a pathlib compatible Path.

Obtains a pathlib compatible Path regardless of whether the file is on a filesystem or in GCS.

Parameters:

path (str) – Path on a filesystem or in cloud storage.

Return type:

Path | CloudPath

Returns:

Pathlib compatible object.
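
Example

A minimal sketch; both paths are hypothetical:

>>> normalize_path("gs://ssb-staging-dapla-felles-data-delt/datadoc/dataset.parquet")   # returns a CloudPath
>>> normalize_path("/buckets/produkt/dataset.parquet")                                  # returns a pathlib.Path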

num_obligatory_dataset_fields_completed(dataset)

Count the number of completed obligatory dataset fields.

This function returns the total count of obligatory fields in the dataset that have values (are not None).

Parameters:

dataset (Dataset) – The dataset object for which to count the fields.

Return type:

int

Returns:

The number of obligatory dataset fields that have been completed (not None).

num_obligatory_variable_fields_completed(variable)

Count the number of obligatory fields completed for one variable.

This function calculates the total number of obligatory fields that have values (are not None) for the given variable.

Parameters:

variable (Variable) – The variable to count obligatory fields for.

Return type:

int

Returns:

The total number of obligatory variable fields that have been completed (not None) for one variable.

num_obligatory_variables_fields_completed(variables)

Count the number of obligatory fields completed for all variables.

This function calculates the total number of obligatory fields that have values (are not None) across all variables in the list.

Parameters:

variables (list) – A list with variable objects.

Return type:

int

Returns:

The total number of obligatory variable fields that have been completed (not None) for all variables.

override_dataset_fields(merged_metadata, existing_metadata)

Overrides specific fields in the dataset of merged_metadata with values from the dataset of existing_metadata.

This function iterates over a predefined list of fields, DATASET_FIELDS_FROM_EXISTING_METADATA, and sets the corresponding fields in the merged_metadata.dataset object to the values from the existing_metadata.dataset object.

Parameters:
  • merged_metadata (DatadocMetadata) – An instance of DatadocMetadata containing the dataset to be updated.

  • existing_metadata (DatadocMetadata) – An instance of DatadocMetadata containing the dataset whose values are used to update merged_metadata.dataset.

Return type:

None

Returns:

None.

running_in_notebook()

Return True if running in Jupyter Notebook.

Return type:

bool

set_default_values_dataset(dataset)

Set default values on dataset.

Parameters:

dataset (Dataset) – The dataset object to set default values on.

Return type:

None

Example

>>> dataset = model.Dataset(id=None, contains_personal_data=None)
>>> set_default_values_dataset(dataset)
>>> dataset.id is not None
True
>>> dataset.contains_personal_data == False
True
set_default_values_variables(variables)

Set default values on variables.

Parameters:

variables (list) – A list of variable objects to set default values on.

Return type:

None

Example

>>> variables = [model.Variable(short_name="pers",id=None, is_personal_data = None), model.Variable(short_name="fnr",id='9662875c-c245-41de-b667-12ad2091a1ee', is_personal_data='PSEUDONYMISED_ENCRYPTED_PERSONAL_DATA')]
>>> set_default_values_variables(variables)
>>> isinstance(variables[0].id, uuid.UUID)
True
>>> variables[1].is_personal_data == 'PSEUDONYMISED_ENCRYPTED_PERSONAL_DATA'
True
>>> variables[0].is_personal_data == 'NOT_PERSONAL_DATA'
True
set_variables_inherit_from_dataset(dataset, variables)

Set specific dataset values on a list of variable objects.

This function populates ‘data source’, ‘temporality type’, ‘contains data from’, and ‘contains data until’ fields in each variable if they are not set (None). The values are inherited from the corresponding fields in the dataset.

Parameters:
  • dataset (Dataset) – The dataset object from which to inherit values.

  • variables (list) – A list of variable objects to update with dataset values.

Return type:

None

Example

>>> dataset = model.Dataset(short_name='person_data_v1',data_source='01',temporality_type='STATUS',id='9662875c-c245-41de-b667-12ad2091a1ee',contains_data_from="2010-09-05",contains_data_until="2022-09-05")
>>> variables = [model.Variable(short_name="pers",data_source =None,temporality_type = None, contains_data_from = None,contains_data_until = None)]
>>> set_variables_inherit_from_dataset(dataset, variables)
>>> variables[0].data_source == dataset.data_source
True
>>> variables[0].temporality_type is None
False
>>> variables[0].contains_data_from == dataset.contains_data_from
True
>>> variables[0].contains_data_until == dataset.contains_data_until
True

Module contents

Utility files for Datadoc.

dapla_metadata.datasets.external_sources package

Submodules

dapla_metadata.datasets.external_sources.external_sources module

class GetExternalSource(executor)

Bases: ABC, Generic[T]

Abstract base class for retrieving data from external sources asynchronously.

This class provides methods to initiate an asynchronous data retrieval operation, check its status, and retrieve the result once the operation completes. Subclasses must implement the _fetch_data_from_external_source method to define how data is fetched from the specific external source.

Parameters:

executor (ThreadPoolExecutor)

check_if_external_data_is_loaded()

Check if the thread getting the external data has finished running.

Return type:

bool

Returns:

True if the data fetching operation is complete, False otherwise.

retrieve_external_data()

Retrieve the result of the data fetching operation.

This method checks if the asynchronous data fetching operation has completed. If the operation is finished, it returns the result. Otherwise, it returns None.

Return type:

Optional[TypeVar(T)]

Returns:

The result of the data fetching operation if it is complete or None if the operation has not yet finished.

wait_for_external_result()

Wait for the thread responsible for loading the external request to finish.

If there is no future to wait for, it logs a warning and returns immediately.

Return type:

None
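
Example

A minimal subclass sketch; the class name, data, and return type are hypothetical, and the private method name is taken from the description above:

>>> from concurrent.futures import ThreadPoolExecutor
>>> class CountryCodes(GetExternalSource[list[str]]):
...     def _fetch_data_from_external_source(self) -> list[str]:
...         return ["NO", "SE", "DK"]   # in practice, a call to the external source
>>> source = CountryCodes(ThreadPoolExecutor(max_workers=1))   # the executor runs the fetch in the background
>>> source.wait_for_external_result()
>>> source.retrieve_external_data()
['NO', 'SE', 'DK']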

Module contents

Abstract parent class for interacting with external resources asynchronously.