dapla_metadata.datasets package¶
Subpackages¶
- dapla_metadata.datasets.compatibility package
- dapla_metadata.datasets.external_sources package
- dapla_metadata.datasets.utility package
- dapla_metadata.datasets.utility.constants module
- dapla_metadata.datasets.utility.enums module
- dapla_metadata.datasets.utility.urn module
- dapla_metadata.datasets.utility.utils module
calculate_percentage(), derive_assessment_from_state(), get_current_date(), get_missing_obligatory_dataset_fields(), get_missing_obligatory_variables_fields(), get_missing_obligatory_variables_pseudo_fields(), get_timestamp_now(), incorrect_date_order(), num_obligatory_dataset_fields_completed(), num_obligatory_pseudo_fields_missing(), num_obligatory_variable_fields_completed(), num_obligatory_variables_fields_completed(), running_in_notebook(), set_dataset_owner(), set_default_values_dataset(), set_default_values_pseudonymization(), set_default_values_variables(), set_variables_inherit_from_dataset()
dapla_metadata.datasets.code_list module¶
- class CodeList(executor, classification_id)[source]¶
Bases: GetExternalSource
Class for retrieving classifications from Klass.
This class fetches a classification given a classification ID and supports multiple languages.
- Parameters:
executor (ThreadPoolExecutor)
classification_id (int | None)
- supported_languages¶
A list of supported language codes.
- _classifications¶
A list to store classification items.
- classification_id¶
The ID of the classification to retrieve.
- classifications_dataframes¶
A dictionary to store dataframes of classifications.
- property classifications: list[CodeListItem]¶
Get the list of classifications.
- Returns:
A list of CodeListItem objects.
- class CodeListItem(titles, code)[source]¶
Bases: object
Data structure for a code list item.
- Parameters:
titles (dict[SupportedLanguages, str])
code (str)
- titles¶
A dictionary mapping language codes to titles.
- code¶
The code associated with the item.
- code: str¶
- get_title(language)[source]¶
Return the title in the specified language.
- Parameters:
language (SupportedLanguages) – The language code for which to get the title.
- Return type:
str
- Returns:
The title in the specified language. Returns the Norwegian Bokmål (“nb”) title when the language is Norwegian Bokmål or Norwegian Nynorsk, otherwise the English (“en”) title. If neither is available, returns an empty string and logs an exception.
- titles: dict[SupportedLanguages, str]¶
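The language fallback described above can be sketched in plain Python (a minimal re-implementation for illustration only; the real method is CodeListItem.get_title and takes a SupportedLanguages enum member rather than a plain string):

```python
def get_title(titles: dict[str, str], language: str) -> str:
    # Sketch of the documented fallback: the Bokmål ("nb") title covers
    # both Bokmål and Nynorsk requests; anything else falls back to
    # English ("en"); a missing title yields an empty string.
    key = "nb" if language in ("nb", "nn") else "en"
    return titles.get(key, "")
```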
dapla_metadata.datasets.core module¶
Handle reading, updating and writing of metadata.
- class Datadoc(dataset_path=None, metadata_document_path=None, statistic_subject_mapping=None, errors_as_warnings=False, validate_required_fields_on_existing_metadata=False)[source]¶
Bases: object
Handle reading, updating and writing of metadata.
If a metadata document exists, its contents are loaded and nothing is inferred from the dataset. If only a dataset path is supplied, the metadata document path is built from the dataset path.
Example: /path/to/dataset.parquet -> /path/to/dataset__DOC.json
- Parameters:
dataset_path (ReadablePathLike | None)
metadata_document_path (ReadablePathLike | None)
statistic_subject_mapping (StatisticSubjectMapping | None)
errors_as_warnings (bool)
validate_required_fields_on_existing_metadata (bool)
- dataset_path¶
The file path to where the dataset is stored.
- metadata_document_path¶
A path to a metadata document if it exists.
- statistic_subject_mapping¶
An instance of StatisticSubjectMapping.
- add_pseudonymization(variable_short_name, pseudonymization=None)[source]¶
Adds a new pseudo variable to the list of pseudonymized variables.
If pseudonymization is not supplied, an empty Pseudonymization structure will be created and assigned to the variable. If an encryption algorithm is recognized (one of the standard Dapla algorithms), default values are filled for any missing fields.
- Parameters:
variable_short_name (str) – The short name of the variable to update the pseudonymization for.
pseudonymization (Optional[Pseudonymization]) – The updated pseudonymization.
- Return type:
None
- static build_metadata_document_path(dataset_path)[source]¶
Build the path to the metadata document corresponding to the given dataset.
- Parameters:
dataset_path (Union[ReadablePath, VFSPathLike, PathLike[str], str]) – Path to the dataset we wish to create metadata for.
- Return type:
UPath
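For a local file path, the documented convention can be sketched with pathlib (an illustrative simplification; the real method returns a UPath and also handles cloud storage paths):

```python
from pathlib import PurePosixPath

def build_metadata_document_path(dataset_path: str) -> str:
    # /path/to/dataset.parquet -> /path/to/dataset__DOC.json
    path = PurePosixPath(dataset_path)
    return str(path.with_name(f"{path.stem}__DOC.json"))
```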
- property percent_complete: int¶
The percentage of obligatory metadata completed.
A metadata field is counted as complete when any non-None value is assigned. Used for a live progress bar in the UI, as well as being saved in the datadoc as a simple quality indicator.
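The indicator reduces to a simple ratio, sketched below (an illustrative re-implementation; the real computation counts obligatory dataset and variable fields with non-None values):

```python
def calculate_percentage(completed: int, total: int) -> int:
    # Whole-number share of obligatory metadata fields that have a value.
    return round((completed / total) * 100)
```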
- remove_pseudonymization(variable_short_name)[source]¶
Removes a pseudo variable identified by its short name.
Updates the pseudo variable lookup by creating a new one.
- Parameters:
variable_short_name (str) – The short name of the variable to remove the pseudonymization for.
- Return type:
None
- write_metadata_document()[source]¶
Write all currently known metadata to file.
- Return type:
None
- Side Effects:
Updates the dataset’s metadata_last_updated_date and metadata_last_updated_by attributes.
Updates the dataset’s file_path attribute.
Validates the metadata model and stores it in a MetadataContainer.
Writes the validated metadata to a file if the metadata_document attribute is set.
Logs the action and the content of the metadata document.
- Raises:
ValueError – If no metadata document is specified for saving.
dapla_metadata.datasets.dapla_dataset_path_info module¶
Extract info from a path following SSB’s dataset naming convention.
- class DaplaDatasetPathInfo(dataset_path)[source]¶
Bases: object
Extract info from a path following SSB’s dataset naming convention.
- Parameters:
dataset_path (ReadablePathLike)
- property bucket_name: str | None¶
Extract the bucket name from the dataset path.
- Returns:
The bucket name, or None if the dataset path is neither a GCS path nor an SSB bucket path.
Examples
>>> DaplaDatasetPathInfo('gs://ssb-staging-dapla-felles-data-delt/datadoc/utdata/person_data_p2021_v2.parquet').bucket_name
ssb-staging-dapla-felles-data-delt
>>> DaplaDatasetPathInfo('ssb-staging-dapla-felles-data-delt/datadoc/utdata/person_data_p2021_v2.parquet').bucket_name
None
>>> DaplaDatasetPathInfo('buckets/ssb-staging-dapla-felles-data-delt/stat/utdata/person_data_p2021_v2.parquet').bucket_name
ssb-staging-dapla-felles-data-delt
>>> DaplaDatasetPathInfo('buckets/ssb-staging-dapla-felles-data-delt/person_data_p2021_v2.parquet').bucket_name
ssb-staging-dapla-felles-data-delt
>>> DaplaDatasetPathInfo('home/work/buckets/ssb-staging-dapla-felles-produkt/stat/utdata/person_data_p2021_v2.parquet').bucket_name
ssb-staging-dapla-felles-produkt
- property contains_data_from: date | None¶
The earliest date from which data in the dataset is relevant.
- Returns:
The earliest relevant date for the dataset if available, otherwise None.
- property contains_data_until: date | None¶
The latest date until which data in the dataset is relevant.
- Returns:
The latest relevant date for the dataset if available, otherwise None.
- property dataset_short_name: str | None¶
Extract the dataset short name from the filepath.
The dataset short name is defined as the first section of the stem, up to the period information or the version information if no period information is present.
- Returns:
The extracted dataset short name if it can be determined, otherwise None.
Examples
>>> DaplaDatasetPathInfo('prosjekt/befolkning/klargjorte_data/person_data_v1.parquet').dataset_short_name
person_data
>>> DaplaDatasetPathInfo('befolkning/inndata/sykepenger_p2022Q1_p2022Q2_v23.parquet').dataset_short_name
sykepenger
>>> DaplaDatasetPathInfo('my_data/simple_dataset_name.parquet').dataset_short_name
simple_dataset_name
>>> DaplaDatasetPathInfo('gs://ssb-staging-dapla-felles-data-delt/datadoc/utdata/person_data_p2021_v2.parquet').dataset_short_name
person_data
>>> DaplaDatasetPathInfo('buckets/ssb-staging-dapla-felles-data-delt/stat/utdata/folk_data_p2021_v2.parquet').dataset_short_name
folk_data
>>> DaplaDatasetPathInfo('buckets/ssb-staging-dapla-felles-data-delt/stat/utdata/dapla/bus_p2021_v2.parquet').dataset_short_name
bus
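The rule “first section of the stem, up to the period or version information” can be sketched with a regular expression (illustrative only; the real property handles more of the naming convention):

```python
import re

def dataset_short_name(path: str) -> str:
    # Take the file stem and cut it at the first period marker ("_p<digit>...")
    # or at a trailing version marker ("_v<digits>").
    stem = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return re.split(r"_p\d|_v\d+$", stem)[0]
```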
- property dataset_state: DataSetState | None¶
Extract the dataset state from the path.
We assume that files are saved in the Norwegian language as specified by SSB.
- Returns:
The extracted dataset state if it can be determined from the path, otherwise None.
Examples
>>> DaplaDatasetPathInfo('klargjorte_data/person_data_v1.parquet').dataset_state
<DataSetState.PROCESSED_DATA: 'PROCESSED_DATA'>
>>> DaplaDatasetPathInfo('klargjorte-data/person_data_v1.parquet').dataset_state
<DataSetState.PROCESSED_DATA: 'PROCESSED_DATA'>
>>> DaplaDatasetPathInfo('utdata/min_statistikk/person_data_v1.parquet').dataset_state
<DataSetState.OUTPUT_DATA: 'OUTPUT_DATA'>
>>> DaplaDatasetPathInfo('buckets/bucket_name/stat_name/inndata/min_statistikk/person_data_v1.parquet').dataset_state
<DataSetState.INPUT_DATA: 'INPUT_DATA'>
>>> DaplaDatasetPathInfo('my_special_data/person_data_v1.parquet').dataset_state
None
- property dataset_version: str | None¶
Extract version information from the filename if present.
- Returns:
The extracted version information if available in the filename, otherwise None.
Examples
>>> DaplaDatasetPathInfo('person_data_v1.parquet').dataset_version
'1'
>>> DaplaDatasetPathInfo('person_data_v20.parquet').dataset_version
'20'
>>> DaplaDatasetPathInfo('person_data.parquet').dataset_version
None
>>> DaplaDatasetPathInfo('buckets/bucket_name/stat_name/inndata/min_statistikk/person_data_v1.parquet').dataset_version
'1'
>>> DaplaDatasetPathInfo('buckets/bucket_name/stat_name/inndata/min_statistikk/person_data.parquet').dataset_version
None
- path_complies_with_naming_standard()[source]¶
Check if path is valid according to SSB standard.
Read more about SSB naming convention in the Dapla manual: https://manual.dapla.ssb.no/statistikkere/navnestandard.html
- Return type:
bool
- Returns:
True if the path conforms to the SSB naming standard, otherwise False.
- property statistic_short_name: str | None¶
Extract the statistical short name from the filepath.
The statistical short name is located either right after the bucket name or right before the dataset state, following the Dapla filepath naming convention.
- Returns:
The extracted statistical short name if it can be determined, otherwise None.
Examples
>>> DaplaDatasetPathInfo('prosjekt/befolkning/klargjorte_data/person_data_v1.parquet').statistic_short_name
befolkning
>>> DaplaDatasetPathInfo('buckets/prosjekt/befolkning/person_data_v1.parquet').statistic_short_name
befolkning
>>> DaplaDatasetPathInfo('befolkning/inndata/person_data_v1.parquet').statistic_short_name
befolkning
>>> DaplaDatasetPathInfo('buckets/bucket_name/stat_name/inndata/min_statistikk/person_data.parquet').statistic_short_name
stat_name
>>> DaplaDatasetPathInfo('buckets/stat_name/utdata/person_data.parquet').statistic_short_name
None
>>> DaplaDatasetPathInfo('befolkning/person_data.parquet').statistic_short_name
None
>>> DaplaDatasetPathInfo('buckets/produkt/befolkning/utdata/person_data.parquet').statistic_short_name
befolkning
>>> DaplaDatasetPathInfo('resources/buckets/produkt/befolkning/utdata/person_data.parquet').statistic_short_name
befolkning
>>> DaplaDatasetPathInfo('gs://statistikk/produkt/klargjorte-data/persondata_p1990-Q1_p2023-Q4_v1/aar=2019/data.parquet').statistic_short_name
produkt
>>> DaplaDatasetPathInfo('gs://statistikk/produkt/persondata_p1990-Q1_p2023-Q4_v1/aar=2019/data.parquet').statistic_short_name
None
>>> DaplaDatasetPathInfo('buckets/ssb-staging-dapla-felles-data-delt/person_data_p2021_v2.parquet').statistic_short_name
None
- class DateFormat(name, regex_pattern, arrow_pattern, timeframe)[source]¶
Bases: ABC
A super class for date formats.
- Parameters:
name (str)
regex_pattern (str)
arrow_pattern (str)
timeframe (Literal['year', 'month', 'day', 'week'])
- arrow_pattern: str¶
- abstract get_ceil(period_string)[source]¶
Abstract method implemented in the child class.
Return the last date of the timeframe period.
- Parameters:
period_string (str) – A string representing the timeframe period.
- Return type:
Optional[date]
- abstract get_floor(period_string)[source]¶
Abstract method implemented in the child class.
Return the first date of the timeframe period.
- Parameters:
period_string (str) – A string representing the timeframe period.
- Return type:
Optional[date]
- name: str¶
- regex_pattern: str¶
- timeframe: Literal['year', 'month', 'day', 'week']¶
- class IsoDateFormat(name, regex_pattern, arrow_pattern, timeframe)[source]¶
Bases: DateFormat
A subclass of DateFormat with relevant patterns for ISO dates.
- Parameters:
name (str)
regex_pattern (str)
arrow_pattern (str)
timeframe (Literal['year', 'month', 'day', 'week'])
- class SsbDateFormat(name, regex_pattern, arrow_pattern, timeframe, ssb_dates)[source]¶
Bases: DateFormat
A subclass of DateFormat with patterns for SSB-specific date formats.
- Parameters:
name (str)
regex_pattern (str)
arrow_pattern (str)
timeframe (Literal['year', 'month', 'day', 'week'])
ssb_dates (dict)
- ssb_dates¶
A dictionary where keys are date format strings and values are corresponding date patterns specific to SSB.
- get_ceil(period_string)[source]¶
Return last date of the timeframe period defined in SSB date format.
Convert SSB format to date-string and return the last date.
- Parameters:
period_string (str) – A string representing the timeframe period in SSB format.
- Return type:
Optional[date]
- Returns:
The last date of the period if the period_string is a valid SSB format, otherwise None.
Example
>>> SSB_TRIANNUAL.get_ceil("1999T11")
None
>>> SSB_HALF_YEAR.get_ceil("2024H1")
datetime.date(2024, 6, 30)
>>> SSB_HALF_YEAR.get_ceil("2024-H1")
datetime.date(2024, 6, 30)
- get_floor(period_string)[source]¶
Return first date of the timeframe period defined in SSB date format.
Convert SSB format to date-string and return the first date.
- Parameters:
period_string (str) – A string representing the timeframe period in SSB format.
- Return type:
Optional[date]
- Returns:
The first date of the period if the period_string is a valid SSB format, otherwise None.
Example
>>> SSB_BIMESTER.get_floor("2003B8")
None
>>> SSB_BIMESTER.get_floor("2003B4")
datetime.date(2003, 7, 1)
>>> SSB_BIMESTER.get_floor("2003-B4")
datetime.date(2003, 7, 1)
- ssb_dates: dict¶
- categorize_period_string(period)[source]¶
Categorize a period string into one of the supported date formats.
- Parameters:
period (str) – A string representing the period to be categorized.
- Return type:
Union[IsoDateFormat, SsbDateFormat]
- Returns:
An instance of either IsoDateFormat or SsbDateFormat depending on the format of the input period string.
- Raises:
NotImplementedError – If the period string is not recognized as either an ISO or SSB date format.
Examples
>>> date_format = categorize_period_string('2022-W01')
>>> date_format.name
ISO_YEAR_WEEK
>>> date_format = categorize_period_string('1954T2')
>>> date_format.name
SSB_TRIANNUAL
>>> categorize_period_string('unknown format')
Traceback (most recent call last):
...
NotImplementedError: Period format unknown format is not supported
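The categorization can be sketched as a first-match scan over regex patterns (the format names and patterns below are illustrative assumptions; the real function returns DateFormat instances rather than strings):

```python
import re

def categorize_period_string(period: str) -> str:
    # Try ISO patterns first, then SSB-specific ones; first full match wins.
    patterns = {
        "ISO_YEAR": r"\d{4}",
        "ISO_YEAR_MONTH": r"\d{4}-\d{2}",
        "ISO_YEAR_DAY": r"\d{4}-\d{2}-\d{2}",
        "ISO_YEAR_WEEK": r"\d{4}-?W\d{2}",
        "SSB_BIMESTER": r"\d{4}-?B\d",
        "SSB_QUARTERLY": r"\d{4}-?Q\d",
        "SSB_TRIANNUAL": r"\d{4}-?T\d",
        "SSB_HALF_YEAR": r"\d{4}-?H\d",
    }
    for name, pattern in patterns.items():
        if re.fullmatch(pattern, period):
            return name
    raise NotImplementedError(f"Period format {period} is not supported")
```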
dapla_metadata.datasets.dataset_parser module¶
Abstractions for dataset file formats.
Handles reading in the data and transforming data types to generic metadata types.
- class DatasetParser(dataset)[source]¶
Bases: ABC
Abstract Base Class for all Dataset parsers.
Implements:
- A static factory method to get the correct implementation for each file extension.
- A static method for data type conversion.
Requires implementation by subclasses:
- A method to extract variables (columns) from the dataset, so they may be documented.
- Parameters:
dataset (ReadablePathLike)
- static for_file(dataset)[source]¶
Return the correct subclass based on the given dataset file.
- Return type:
DatasetParser
- Parameters:
dataset (ReadablePath | VFSPathLike | PathLike[str] | str)
- abstract get_fields()[source]¶
Abstract method, must be implemented by subclasses.
- Return type:
list[Variable]
- static transform_data_type(data_type)[source]¶
Transform a concrete data type to an abstract data type.
In statistical metadata, one is not interested in how the data is technically stored, but in the meaning of the data type. Because of this, we transform known data types to their abstract metadata representations.
If we encounter a data type we don’t know, we just ignore it and let the user handle it in the GUI.
- Parameters:
data_type (str) – The concrete data type to map.
- Return type:
Optional[DataType]
- Returns:
The abstract data type, or None.
- class DatasetParserParquet(dataset)[source]¶
Bases: DatasetParser
Concrete implementation for parsing parquet files.
- Parameters:
dataset (UPath)
- class DatasetParserSas7Bdat(dataset)[source]¶
Bases: DatasetParser
Concrete implementation for parsing SAS7BDAT files.
- Parameters:
dataset (ReadablePathLike)
dapla_metadata.datasets.model_validation module¶
Handle validation for metadata with pydantic validators and custom warnings.
- exception ObligatoryDatasetWarning[source]¶
Bases: UserWarning
Custom warning for missing obligatory dataset metadata.
- exception ObligatoryVariableWarning[source]¶
Bases: UserWarning
Custom warning for missing obligatory variable metadata.
- class ValidateDatadocMetadata(**data)[source]¶
Bases: DatadocMetadata
Class that inherits from DatadocMetadata, providing additional validation.
- Parameters:
percentage_complete (int | None)
document_version (Literal['6.1.0'])
dataset (Dataset | None)
variables (list[Variable] | None)
- check_date_order()[source]¶
Validate the order of date fields.
Check that dataset and variable date fields contains_data_from and contains_data_until are in chronological order.
Mode: This validator runs after other validation.
- Return type:
Self
- Returns:
The instance of the model after validation.
- Raises:
ValueError – If contains_data_until date is earlier than contains_data_from date.
- check_inherit_values()[source]¶
Inherit values from dataset to variables if not set.
Sets values for ‘data source’, ‘temporality type’, ‘contains data from’, and ‘contains data until’ if they are None.
Mode: This validator runs after other validation.
- Return type:
Self
- Returns:
The instance of the model after validation.
- check_metadata_created_date()[source]¶
Ensure metadata_created_date is set for the dataset.
Sets the current timestamp if metadata_created_date is None.
Mode: This validator runs after other validation.
- Return type:
Self
- Returns:
The instance of the model after validation.
- check_obligatory_dataset_metadata()[source]¶
Check obligatory dataset fields and issue a warning if any are missing.
- Mode:
This validator runs after other validation.
- Return type:
Self
- Returns:
The instance of the model after validation.
- Raises:
ObligatoryDatasetWarning – If not all obligatory dataset metadata fields are filled in.
- check_obligatory_variables_metadata()[source]¶
Check obligatory variable fields and issue a warning if any are missing.
- Mode:
This validator runs after other validation.
- Return type:
Self
- Returns:
The instance of the model after validation.
- Raises:
ObligatoryVariableWarning – If not all obligatory variable metadata fields are filled in.
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'use_enum_values': True, 'validate_assignment': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
dapla_metadata.datasets.statistic_subject_mapping module¶
- class PrimarySubject(titles, subject_code, secondary_subjects)[source]¶
Bases: Subject
Data structure for primary subjects, or ‘hovedemne’.
- Parameters:
titles (dict[str, str])
subject_code (str)
secondary_subjects (list[SecondarySubject])
- secondary_subjects: list[SecondarySubject]¶
- class SecondarySubject(titles, subject_code, statistic_short_names)[source]¶
Bases: Subject
Data structure for secondary subjects, or ‘delemne’.
- Parameters:
titles (dict[str, str])
subject_code (str)
statistic_short_names (list[str])
- statistic_short_names: list[str]¶
- class StatisticSubjectMapping(executor, source_url)[source]¶
Bases: GetExternalSource
Provide mapping between statistic short name and primary and secondary subject.
- Parameters:
executor (ThreadPoolExecutor)
source_url (str | None)
- get_secondary_subject(statistic_short_name)[source]¶
Looks up the secondary subject for the given statistic short name in the mapping dict.
Returns the secondary subject string if found, else None.
- Return type:
Optional[str]
- Parameters:
statistic_short_name (str | None)
- property primary_subjects: list[PrimarySubject]¶
Getter for primary subjects.
- class Subject(titles, subject_code)[source]¶
Bases: object
Base class for Primary and Secondary subjects.
A statistical subject is a related grouping of statistics.
- Parameters:
titles (dict[str, str])
subject_code (str)
- get_title(language)[source]¶
Get the title in the given language.
- Return type:
str
- Parameters:
language (SupportedLanguages)
- subject_code: str¶
- titles: dict[str, str]¶