ssb_utdanning.katalog package

Subpackages

ssb_utdanning.katalog.katalog module

The main class for Katalogs at 360.

Katalogs are files that sit somewhere in between real data and metadata.

They usually have a single column with some sort of identifier, such as orgnr or nus2000, followed by two or more columns of other groupings or data that can be “attached” to real data. For example, a katalog may hold a list of idents for which no other information exists, but for which we have tracked down information that we would like to “re-attach” each year.
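As a rough illustration of that shape, the plain-Python sketch below builds a tiny katalog-like table and re-attaches its columns to a dataset. It does not use UtdKatalog itself; every column name except orgnr, and all values, are made up for the example.

```python
# Hypothetical katalog: one identifier column (orgnr) plus two extra columns.
katalog_rows = [
    {"orgnr": "123456789", "school_type": "type_a", "region": "north"},
    {"orgnr": "987654321", "school_type": "type_b", "region": "south"},
]

# Hypothetical "real" data that only carries the identifier.
data_rows = [
    {"orgnr": "987654321", "students": 1200},
    {"orgnr": "123456789", "students": 450},
    {"orgnr": "555555555", "students": 30},  # ident missing from the katalog
]

# Index the katalog by its key column so each lookup is a dict access.
by_orgnr = {row["orgnr"]: row for row in katalog_rows}

# "Re-attach" the katalog columns; idents without a match get None.
attached = []
for row in data_rows:
    extra = by_orgnr.get(row["orgnr"], {})
    attached.append(
        {
            **row,
            "school_type": extra.get("school_type"),
            "region": extra.get("region"),
        }
    )

for row in attached:
    print(row)
```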

Katalogs can also be called “kodeverk”, “kodelister”, “omkodingskatalog”, etc. Treat “katalog” as an umbrella term covering all of these.

class UtdKatalog(key_cols, data=None, path='', glob_pattern_latest='', exclude_keywords=None)

Bases: UtdData

The main class for handling catalog-like datasets, extending UtdData with additional catalog-specific functionalities.

Parameters:
  • key_cols (list[str] | str)

  • data (DataFrame | None)

  • path (Path | GSPath | str)

  • glob_pattern_latest (str)

  • exclude_keywords (list[str] | None)

apply_format(df, catalog_col_name='', data_key_col_name='', catalog_key_col_name='', new_col_data_name='', level=0, ordered=False, remove_unused=False)

Applies format on DataFrame.

Applies the catalog formatting to a DataFrame by mapping values from a catalog column to a dataset column based on a key. The function can also set the new column as a categorical type, with options for ordering and removing unused categories.

Parameters:
  • df (pd.DataFrame) – The dataset to which the catalog formatting will be applied.

  • catalog_col_name (str) – The name of the column in the catalog whose values will be applied to the dataset. If not specified, the second column of the catalog data is used by default.

  • data_key_col_name (str) – The name of the column in the dataset to which the catalog values will be mapped. If not specified, it defaults to the catalog key column name.

  • catalog_key_col_name (str) – The name of the key column in the catalog used for mapping. If not specified, it defaults to the first key column of the catalog.

  • new_col_data_name (str) – The name for the new column in the dataset after applying the catalog format. If not specified, it defaults to the catalog column name.

  • level (int) – The level of detail (length of string positions) for the key used in formatting. Defaults to 0, which includes all.

  • ordered (bool) – Specifies whether the new column should be treated as an ordered categorical. Defaults to False.

  • remove_unused (bool) – Whether to remove unused categories from the new categorical column. Defaults to False.

Returns:

The DataFrame with the new formatting applied.

Return type:

pd.DataFrame

Notes

This method involves several default behaviors when parameters are not specified, including defaulting to the second column of the catalog for the value mapping and the first key column for the key mapping. Care should be taken when leaving parameters unspecified to ensure the correct application of the format.
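The mapping idea behind this method can be sketched in plain Python. This is not the library's implementation: the helper name apply_mapping, the labels, and the reading of level as "keep only catalog keys of exactly that length" are all assumptions made for illustration. The real method works on DataFrames and categorical columns.

```python
def apply_mapping(rows, mapping, data_key_col, new_col, level=0):
    """Return copies of rows with new_col looked up from mapping.

    Assumption: a non-zero level keeps only catalog keys whose string
    length equals level; level=0 keeps every key.
    """
    if level:
        mapping = {k: v for k, v in mapping.items() if len(k) == level}
    return [{**row, new_col: mapping.get(row[data_key_col])} for row in rows]


# Hypothetical catalog: keys from a "nus2000" column, values from a label column.
catalog = {"1": "Group A", "2": "Group B", "21": "Group B1"}

data = [{"nus2000": "1"}, {"nus2000": "21"}, {"nus2000": "9"}]

# level=0: every catalog key is available for the lookup.
formatted = apply_mapping(data, catalog, data_key_col="nus2000", new_col="label")

# level=1: only the one-character keys "1" and "2" remain in the mapping.
one_digit = apply_mapping(data, catalog, "nus2000", "label", level=1)

# Category handling, sketched: all catalog values versus only those in use,
# loosely mirroring what a remove_unused-style option would decide between.
all_categories = list(dict.fromkeys(catalog.values()))
used = {row["label"] for row in formatted if row["label"] is not None}
used_categories = [c for c in all_categories if c in used]
```

Keys that are absent from the mapping end up as None here, which is why it pays to name the key and value columns explicitly instead of relying on defaults.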

merge_on(dataset, key_col_in_data, keep_cols=None, merge=False, return_lengths=False)

Merges catalog data with an external dataset based on a specified key column.

Parameters:
  • dataset (pd.DataFrame | UtdData) – The dataset to merge with the catalog.

  • key_col_in_data (str) – The key column name in the dataset for merging.

  • keep_cols (list[str] | None) – Specific columns to keep from the catalog in the merged data.

  • merge (bool) – If True, performs an actual merge operation; otherwise just checks for matching keys.

  • return_lengths (bool) – If True, returns a tuple of the merged DataFrame and a dictionary of lengths of each category after merge.

Returns:

The result of the merge operation, containing data from both the catalog and the input dataset.

Return type:

pd.DataFrame
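The two behaviours described for this method, checking key overlap and attaching catalog columns, can be sketched without the library. The helper names key_overlap and left_join, the count labels, and the sample rows are hypothetical. Whether the real method performs a left join, and what exactly it reports in its lengths dictionary, are assumptions here.

```python
def key_overlap(data_rows, catalog_rows, key_col_in_data, catalog_key_col):
    """Count distinct keys found in both tables or in only one of them."""
    data_keys = {row[key_col_in_data] for row in data_rows}
    catalog_keys = {row[catalog_key_col] for row in catalog_rows}
    return {
        "both": len(data_keys & catalog_keys),
        "data_only": len(data_keys - catalog_keys),
        "catalog_only": len(catalog_keys - data_keys),
    }


def left_join(data_rows, catalog_rows, key_col_in_data, catalog_key_col, keep_cols):
    """Attach keep_cols from the catalog to every data row (None if no match)."""
    lookup = {row[catalog_key_col]: row for row in catalog_rows}
    merged = []
    for row in data_rows:
        match = lookup.get(row[key_col_in_data], {})
        merged.append({**row, **{col: match.get(col) for col in keep_cols}})
    return merged


# Hypothetical catalog and dataset; the key column is named differently in each.
catalog = [
    {"orgnr": "111", "name": "School A", "county": "X"},
    {"orgnr": "222", "name": "School B", "county": "Y"},
]
dataset = [
    {"org": "111", "pupils": 10},
    {"org": "333", "pupils": 20},
]

lengths = key_overlap(dataset, catalog, "org", "orgnr")
merged = left_join(dataset, catalog, "org", "orgnr", keep_cols=["name"])
print(lengths)
print(merged)
```

Checking the overlap first makes unmatched idents visible before any columns are attached.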

to_dict(col='', level=0, key_col='')

Converts a column from the Katalog data into a dictionary, mapping keys from another column to these values.

Parameters:
  • col (str) – The column whose values will be used as dictionary values. Defaults to the second column if not specified.

  • level (int) – The level (length) of the key entries to be included. Defaults to all if 0.

  • key_col (str) – The column to use as keys in the dictionary. Defaults to the first key column specified in key_cols.

Returns:

A dictionary mapping keys to values as per the specified columns and level.

Return type:

dict[str, str | int | float]
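A minimal plain-Python sketch of this kind of conversion follows. The helper name column_to_dict and the sample rows are hypothetical, and it assumes that a non-zero level keeps only keys whose string length equals level.

```python
def column_to_dict(rows, key_col, col, level=0):
    """Map values of key_col to values of col.

    Assumption: with a non-zero level, only keys whose length equals
    level are included; level=0 includes every key.
    """
    result = {}
    for row in rows:
        key = row[key_col]
        if level and len(key) != level:
            continue
        result[key] = row[col]
    return result


# Hypothetical katalog rows with a key column and one value column.
rows = [
    {"nus2000": "1", "label": "Group A"},
    {"nus2000": "11", "label": "Group A1"},
    {"nus2000": "2", "label": "Group B"},
]

everything = column_to_dict(rows, key_col="nus2000", col="label")
two_chars = column_to_dict(rows, key_col="nus2000", col="label", level=2)
print(everything)
print(two_chars)
```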

ssb_utdanning.katalog.katalog_utils module

create_new_utd_katalog(path, key_col_name, extra_cols=None, versioned=True, **metadata)

Make a new, empty Katalog.

Parameters:
  • path (str) – Path the katalog should be stored to.

  • key_col_name (str) – Name of the key column.

  • extra_cols (list[str] | None) – Extra columns to add to the katalog. Defaults to None, which is treated as an empty list.

  • versioned (bool) – If True, the katalog will be versioned. Defaults to True.

  • **metadata (str | dict[str, str]) – Additional metadata to add to the katalog.

Returns:

The new katalog.

Return type:

UtdKatalog

open_utd_katalog_from_metadata(meta_path)

Opens a Katalog from its metadata file. The metadata contains the path of the Katalog, so the katalog data can be opened given only the metadata file.

Parameters:

meta_path (str) – Path to the metadata-file.

Returns:

The katalog.

Return type:

UtdKatalog
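The lookup pattern described here (the metadata file stores the data path, so the metadata path alone is enough) can be sketched with the standard library. The formats and names are assumptions: this sketch uses a JSON metadata file with a hypothetical "path" field and a CSV data file, which may differ from what the package actually reads and writes.

```python
import csv
import json
import tempfile
from pathlib import Path


def open_from_metadata(meta_path):
    """Read the data path from a metadata file, then load that data file."""
    meta = json.loads(Path(meta_path).read_text(encoding="utf-8"))
    # Assumption: the metadata stores the data location under the key "path".
    with open(meta["path"], newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "katalog.csv"
    meta_path = Path(tmp) / "katalog_metadata.json"

    # Write a one-row katalog file with a key column and one extra column.
    with open(data_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["orgnr", "name"])
        writer.writeheader()
        writer.writerow({"orgnr": "111", "name": "School A"})

    # The metadata file records where the data lives.
    meta_path.write_text(json.dumps({"path": str(data_path)}), encoding="utf-8")

    # Only the metadata path is needed to get at the data.
    rows = open_from_metadata(meta_path)

print(rows)
```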