ssb_utdanning.data package

ssb_utdanning.data.dtypes module

Automatically sets dtypes on pandas DataFrames using rule-based logic.

Keeps object columns as strings when their values are numeric but have leading zeros. Downcasts integers to the smallest size that fits. Converts suitable columns to categoricals.
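The three rules described above can be illustrated with a minimal sketch. This is not the module's actual implementation: the function name apply_dtype_rules and the max_categories threshold are hypothetical and chosen only for illustration.

```python
import pandas as pd


def apply_dtype_rules(df: pd.DataFrame, max_categories: int = 10) -> pd.DataFrame:
    """Illustrative dtype rules; name and threshold are assumptions."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Downcast integers to the smallest signed size that fits.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_string_dtype(series):
            text = series.dropna().astype(str)
            numeric_like = len(text) > 0 and bool(text.str.fullmatch(r"\d+").all())
            if numeric_like and bool(text.str.startswith("0").any()):
                # Codes such as "0301" would lose the zero if cast to int.
                out[col] = series.astype("string")
            elif series.nunique() <= max_categories:
                # Few distinct values: a categorical saves memory.
                out[col] = series.astype("category")
    return out


example = pd.DataFrame(
    {
        "kommune": ["0301", "1103", "4601"],
        "antall": [1, 2, 3],
        "kjonn": ["M", "F", "M"],
    }
)
result = apply_dtype_rules(example)
print(result.dtypes)
```

In this sketch the leading-zero check runs before the categorical check, so code columns stay as strings even when they have few distinct values.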

dtype_apply_from_json(df, json_path, filesystem)

Apply dtypes to a pandas DataFrame from a JSON file stored on disk.

Parameters:
  • df (pd.DataFrame) – Dataframe to apply dtypes to.

  • json_path (str) – Path to json file with dtypes.

  • filesystem (gcsfs.GCSFileSystem) – If working on Google Cloud, provide the GCSFileSystem.

Returns:

Dataframe with dtypes applied.

Return type:

pd.DataFrame

dtype_store_json(df, json_path, filesystem)

Store dtypes of a pandas DataFrame as a JSON file.

Parameters:
  • df (pd.DataFrame) – Dataframe to store dtypes from.

  • json_path (str) – Path to json file to store dtypes in.

  • filesystem (gcsfs.GCSFileSystem) – If working on Google Cloud, provide the GCSFileSystem.

Return type:

None

Returns:

None
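The store/apply round trip can be sketched roughly as follows for a local file. The helper names store_dtypes_sketch and apply_dtypes_sketch are hypothetical, the filesystem argument is left out, and the JSON layout (column name mapped to dtype string) is an assumption rather than the package's documented format.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def store_dtypes_sketch(df: pd.DataFrame, json_path: Path) -> None:
    # Assumed layout: {"column name": "dtype string"}.
    dtypes = {col: str(dtype) for col, dtype in df.dtypes.items()}
    json_path.write_text(json.dumps(dtypes, indent=2), encoding="utf-8")


def apply_dtypes_sketch(df: pd.DataFrame, json_path: Path) -> pd.DataFrame:
    dtypes = json.loads(json_path.read_text(encoding="utf-8"))
    # Only cast columns that are present in this dataframe.
    present = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
    return df.astype(present)


typed = pd.DataFrame(
    {
        "antall": pd.Series([1, 2, 3], dtype="int8"),
        "kjonn": pd.Series(["M", "F", "M"], dtype="category"),
    }
)
raw = pd.DataFrame({"antall": [1, 2, 3], "kjonn": ["M", "F", "M"]})

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "dtypes.json"
    store_dtypes_sketch(typed, json_path)
    stored = json.loads(json_path.read_text(encoding="utf-8"))
    restored = apply_dtypes_sketch(raw, json_path)
```

Storing only the dtype name means details such as the category order are not preserved in this sketch.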

ssb_utdanning.data.utd_data module

class OverwriteMode(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum for specifying overwrite behaviors in file operations.

overwrite

Allows overwriting of existing files.

Type:

str

filebump

Bumps the file version if a file already exists, preventing overwriting.

Type:

str

NONE

Does not allow overwriting; if the file exists, an error is raised.

Type:

str

NONE = ''
filebump = 'filebump'
overwrite = 'overwrite'
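As a sketch of how an enum with these three values behaves under standard Python Enum lookup, the stand-in class below mirrors the member values listed above. OverwriteModeSketch and coerce_mode are hypothetical names; the real package may normalize string input differently.

```python
from enum import Enum


class OverwriteModeSketch(Enum):
    """Stand-in with the same member values as listed above."""

    NONE = ""
    filebump = "filebump"
    overwrite = "overwrite"


def coerce_mode(mode):
    """Accept either an enum member or its string value."""
    if isinstance(mode, OverwriteModeSketch):
        return mode
    # Enum lookup by value; raises ValueError for unknown strings.
    return OverwriteModeSketch(mode)


print(coerce_mode("overwrite"))
print(coerce_mode(""))
```

With plain value lookup, the empty string resolves to NONE, while any string that is not a member value raises ValueError.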
class UtdData(data=None, path='', glob_pattern_latest='', exclude_keywords=None)

Bases: object

Manages loading, saving and access to metadata.

Manages the loading, storage, and access of structured data, with path validation and metadata management based on specified paths or patterns. The class can load data from local or cloud storage using file paths or glob patterns, manage data versions, and check that file paths meet the required criteria before loading.

The class also supports excluding keywords from file searches and integrates metadata handling.

Parameters:
  • data (DataFrame | None)

  • path (Path | GSPath | str)

  • glob_pattern_latest (str)

  • exclude_keywords (list[str] | None)

data

The DataFrame loaded into the instance, either provided at initialization or loaded from the specified path.

Type:

pd.DataFrame | None

path

The primary file path associated with the data.

Type:

Union[Path, GSPath, str]

metadata

Metadata associated with the data, automatically managed based on file path changes or data updates.

Type:

DataDocMetadata

__init__()

Constructor for initializing a new UtdData instance with optional data and path specifications.

get_data()

Loads data from the specified path if not already loaded.

Return type:

None | DataFrame | tuple[DataFrame, dict[str, str | bool]]

save()

Saves the current data and metadata to a specified path, managing file versioning and overwrite behavior.

Parameters:
  • path (str | Path | GSPath)

  • bump_version (bool)

  • overwrite_mode (str | OverwriteMode)

  • save_metadata (bool)

Return type:

None

_metadata_from_path()

Updates metadata based on the current data path, extracting relevant details as needed.

This class is designed to be versatile, supporting various data storage formats and environments, and is easily extendable for additional data handling and processing needs.

static bump_path(path, num_bumps=1)

Increments the version number in the path.

Parameters:
  • path (Union[str, Path, GSPath]) – The file path to increment.

  • num_bumps (int) – Number of version increments.

Returns:

Updated path with the incremented version number.

Return type:

str

Raises:

TypeError – If the type returned by version.bump_path is unrecognized.
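To make the idea of a version bump concrete, here is a minimal sketch. It assumes that the version is encoded as a "_v<number>" suffix at the end of the file stem, which this page does not state; the function name bump_path_sketch is hypothetical, and the real method delegates to version.bump_path.

```python
import re
from pathlib import PurePosixPath

# Assumption: versions look like "..._v3" at the end of the file stem.
_VERSION_RE = re.compile(r"_v(\d+)$")


def bump_path_sketch(path: str, num_bumps: int = 1) -> str:
    p = PurePosixPath(path)
    match = _VERSION_RE.search(p.stem)
    if match is None:
        raise ValueError(f"No '_v<number>' version found in {path!r}")
    base = p.stem[: match.start()]
    new_version = int(match.group(1)) + num_bumps
    # Rebuild the file name with the incremented version and same suffix.
    return str(p.with_name(f"{base}_v{new_version}{p.suffix}"))


print(bump_path_sketch("data/person_p2023_v1.parquet"))
print(bump_path_sketch("data/person_p2023_v1.parquet", num_bumps=3))
```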

get_data()

Loads the data from the specified path, or the most recent file version if the specified path is outdated.

Returns:

The loaded data and metadata if successful, None otherwise.

Return type:

None | tuple[pd.DataFrame, dict[str, str|bool]]

Raises:

OSError – If the file extension is not parquet or sas7bdat.

get_latest_version_path()

Determines the most recent version path for the current file.

Returns:

Path of the most recent file version.

Return type:

str

get_similar_paths()

Finds paths that are similar to the current path, excluding versions.

Returns:

A sorted list of similar file paths.

Return type:

List[str]

get_version(path='')

Gets the version number of the file at the specified path.

Parameters:

path (Union[str, Path, GSPath]) – The path to check for versioning.

Returns:

The version number.

Return type:

int
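The version lookups above can be illustrated with a short sketch that picks the newest file from a list of similar paths. It assumes a "_v<number>" suffix before the file extension, which is not confirmed by this page, and the names version_of and latest_version_path are hypothetical.

```python
import re

# Assumption: the version sits just before the extension, e.g. "x_v12.parquet".
_VERSION_RE = re.compile(r"_v(\d+)\.[^./]+$")


def version_of(path: str) -> int:
    match = _VERSION_RE.search(path)
    if match is None:
        raise ValueError(f"No version number found in {path!r}")
    return int(match.group(1))


def latest_version_path(paths: list[str]) -> str:
    # Compare versions numerically, not as text.
    return max(paths, key=version_of)


similar = ["person_v9.parquet", "person_v10.parquet", "person_v2.parquet"]
print(latest_version_path(similar))
```

Comparing the parsed integers matters because a plain alphabetical sort would place "person_v9.parquet" after "person_v10.parquet".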

save(path='', bump_version=True, overwrite_mode=OverwriteMode.NONE, save_metadata=True)

Saves data to path.

Saves the data to a specified path, manages file versioning, and handles file overwriting based on the provided parameters. The method also offers the option to save associated metadata alongside the data.

Parameters:
  • path (Union[str, Path, GSPath]) – The file path where the data should be saved. Defaults to the current object path.

  • bump_version (bool) – Whether to automatically increment the file version. Defaults to True.

  • overwrite_mode (Union[str, OverwriteMode]) – Specifies how to handle existing files at the target path. Can be ‘none’, ‘overwrite’, or ‘filebump’, or the corresponding OverwriteMode member (note that OverwriteMode.NONE has the value ''). Defaults to OverwriteMode.NONE.

  • save_metadata (bool) – Whether to save metadata along with the data. Defaults to True.

Return type:

None

Returns:

None

Raises:

OSError – If the file already exists and the conditions for overwriting are not met as per the overwrite_mode.

Notes

The method converts string paths to Path or GSPath based on the runtime environment. The path is also forced to have a ‘.parquet’ suffix. If bump_version is enabled, the file path will be adjusted to accommodate a new version number according to existing files in the target directory.

Interactive prompts are used to confirm overwriting actions when necessary, providing a safety mechanism to prevent accidental data loss.
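The path handling described in these notes might look roughly like the sketch below. It is a guess at the decision logic, not the method's real code: the function resolve_save_path, the "_v<number>" naming scheme, and the rule of using the highest existing version plus one are all assumptions, and the interactive prompts are left out.

```python
import re
from pathlib import PurePosixPath

# Assumption: file stems end in "_v<number>".
VERSION_RE = re.compile(r"^(?P<base>.+)_v(?P<num>\d+)$")


def resolve_save_path(path, existing, bump_version=True, overwrite_mode=""):
    # Force a ".parquet" suffix, as the notes describe.
    target = PurePosixPath(path).with_suffix(".parquet")
    match = VERSION_RE.match(target.stem)
    if match is None:
        raise ValueError(f"No '_v<number>' version found in {path!r}")
    base, version = match["base"], int(match["num"])

    # Collect versions already taken by files with the same base name.
    taken = set()
    for other in existing:
        other_path = PurePosixPath(other)
        other_match = VERSION_RE.match(other_path.stem)
        if (
            other_match
            and other_match["base"] == base
            and other_path.parent == target.parent
        ):
            taken.add(int(other_match["num"]))

    if bump_version and taken:
        version = max(taken) + 1
    elif version in taken:
        if overwrite_mode == "filebump":
            version = max(taken) + 1
        elif overwrite_mode != "overwrite":
            raise OSError(f"{target} already exists and overwriting is not allowed")
    return str(target.with_name(f"{base}_v{version}.parquet"))


files = ["data/person_v1.parquet", "data/person_v2.parquet"]
print(resolve_save_path("data/person_v1.sas7bdat", files))
```

The list of existing files is passed in explicitly so the sketch stays independent of local or cloud storage.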