ssb_utdanning.data package
ssb_utdanning.data.dtypes module
Automatically changes dtypes on pandas DataFrames using a fixed set of rules.
Keeps object columns as strings when their values are numeric but have leading zeros. Downcasts integers to the smallest size that fits. Converts suitable columns to categoricals.
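The rules above can be illustrated with a small standard-library sketch. It is not the module's implementation: the function names, the `max_categories` threshold and the exact order of the checks are assumptions made for illustration, and it works on plain lists of strings rather than on DataFrame columns.

```python
import re

# Signed integer ranges, narrowest first (standard two's-complement bounds).
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def has_leading_zero(values):
    """True if any value is a digit string starting with zero, such as '0301'."""
    return any(re.fullmatch(r"0\d+", v) for v in values)


def smallest_int_dtype(numbers):
    """Name of the narrowest signed integer type that holds every number."""
    low, high = min(numbers), max(numbers)
    for name, lower, upper in INT_RANGES:
        if lower <= low and high <= upper:
            return name
    raise OverflowError("values do not fit in int64")


def suggest_dtype(values, max_categories=10):
    """Suggest a dtype name for a column of strings (hypothetical rule set)."""
    all_integers = all(re.fullmatch(r"-?\d+", v) for v in values)
    if all_integers and not has_leading_zero(values):
        # Plain integers: downcast to the narrowest type that fits.
        return smallest_int_dtype([int(v) for v in values])
    if has_leading_zero(values):
        # Codes such as '0301' would lose their zeros as integers.
        return "string"
    if len(set(values)) <= max_categories:
        # Few distinct values: a categorical is a reasonable candidate.
        return "category"
    return "string"
```

For example, `suggest_dtype(["0301", "0402"])` keeps the column as a string, while `suggest_dtype(["1", "200", "-5"])` picks `int16` because 200 does not fit in `int8`.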
- dtype_apply_from_json(df, json_path, filesystem)
Apply dtypes to a pandas DataFrame from a JSON file stored on disk.
- Parameters:
df (pd.DataFrame) – Dataframe to apply dtypes to.
json_path (str) – Path to the JSON file with dtypes.
filesystem (gcsfs.GCSFileSystem) – If working on Google Cloud, provide the GCSFileSystem.
- Returns:
Dataframe with dtypes applied.
- Return type:
pd.DataFrame
- dtype_store_json(df, json_path, filesystem)
Store the dtypes of a pandas DataFrame as a JSON file.
- Parameters:
df (pd.DataFrame) – Dataframe to store dtypes from.
json_path (str) – Path to the JSON file to store dtypes in.
filesystem (gcsfs.GCSFileSystem) – If working on Google Cloud, provide the GCSFileSystem.
- Return type:
None
- Returns:
None
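The store/apply pair amounts to a round trip of a column-to-dtype mapping through JSON. The sketch below shows that round trip with the standard library only. The JSON layout (a flat object of column name to dtype name), the helper names and the example column names are assumptions, not the format the module is documented to write.

```python
import json
import tempfile
from pathlib import Path


def store_dtypes(dtypes, json_path):
    """Write a column -> dtype-name mapping to a JSON file (assumed layout)."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(dtypes, f, indent=2)


def load_dtypes(json_path):
    """Read the mapping back as a plain dict."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


# Round trip through a temporary directory; column names are made up.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dtypes.json"
    original = {"kommune_nr": "string", "alder": "int8", "kjonn": "category"}
    store_dtypes(original, path)
    restored = load_dtypes(path)
```

With pandas, a mapping of this shape could be built with `df.dtypes.astype(str).to_dict()` and applied with `df.astype(restored)`; whether the module does exactly this is not stated here.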
ssb_utdanning.data.utd_data module
- class OverwriteMode(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Enum for specifying overwrite behaviors in file operations.
- overwrite
Allows overwriting of existing files.
- Type:
str
- filebump
Bumps the file version if a file already exists, preventing overwriting.
- Type:
str
- NONE
Does not allow overwriting; if the file exists, an error is raised.
- Type:
str
- NONE = ''
- filebump = 'filebump'
- overwrite = 'overwrite'
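As a quick illustration of how these members behave, the sketch below re-declares an enum with the same three values and looks members up by value. The class is a local stand-in for the documented one, and `to_mode` is a hypothetical helper, not part of the package.

```python
from enum import Enum


class OverwriteMode(Enum):
    """Local stand-in mirroring the documented member values."""

    NONE = ""
    filebump = "filebump"
    overwrite = "overwrite"


def to_mode(value):
    """Hypothetical helper: accept either a member or its string value."""
    if isinstance(value, OverwriteMode):
        return value
    # Enum lookup by value; raises ValueError for unknown strings.
    return OverwriteMode(value)
```

In this sketch `to_mode("overwrite")` and `to_mode("")` resolve to members, while `to_mode("none")` raises `ValueError`, because the value of `NONE` is the empty string. How the library itself treats the string 'none' is not shown by this example.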
- class UtdData(data=None, path='', glob_pattern_latest='', exclude_keywords=None)
Bases:
object
Manages loading, saving and access to metadata.
Loads structured data from local or cloud storage, given either a file path or a glob pattern, and keeps track of file versions. File paths are validated against the expected criteria before any data is loaded.
File searches can optionally exclude paths that contain given keywords. Metadata is handled alongside the data, which supports data integrity and version control.
- Parameters:
data (DataFrame | None)
path (Path | GSPath | str)
glob_pattern_latest (str)
exclude_keywords (list[str] | None)
- data
The DataFrame loaded into the instance, either provided at initialization or loaded from the specified path.
- Type:
pd.DataFrame | None
- path
The primary file path associated with the data.
- Type:
Union[Path, GSPath, str]
- metadata
Metadata associated with the data, automatically managed based on file path changes or data updates.
- Type:
DataDocMetadata
- __init__()
Constructor for initializing a new UtdData instance with optional data and path specifications.
- get_data()
Loads data from the specified path if not already loaded.
- Return type:
None | DataFrame | tuple[DataFrame, dict[str, str | bool]]
- save()
Saves the current data and metadata to a specified path, managing file versioning and overwrite behavior.
- Parameters:
path (str | Path | GSPath)
bump_version (bool)
overwrite_mode (str | OverwriteMode)
save_metadata (bool)
- Return type:
None
- _metadata_from_path()
Updates metadata based on the current data path, extracting relevant details as needed.
The class supports several data storage formats and environments, and can be extended with further data handling and processing.
- static bump_path(path, num_bumps=1)
Increments the version number in the path.
- Parameters:
path (Union[str, Path, GSPath]) – The file path to increment.
num_bumps (int) – Number of version increments.
- Returns:
Updated path with the incremented version number.
- Return type:
str
- Raises:
TypeError – If the type returned by version.bump_path is unrecognized.
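A version bump of this kind can be sketched with a regular expression on the file stem. The `_v<N>` suffix convention, the `ValueError` for unversioned names and the restriction to plain POSIX-style paths are all assumptions of this sketch. The documented method delegates to version.bump_path, whose behaviour is not described here.

```python
import re
from pathlib import PurePosixPath

# Assumed naming convention: <name>_v<N>.<ext>
VERSION_RE = re.compile(r"_v(\d+)$")


def bump_path(path, num_bumps=1):
    """Return the path with its _v<N> version increased by num_bumps."""
    p = PurePosixPath(str(path))
    match = VERSION_RE.search(p.stem)
    if match is None:
        raise ValueError(f"no _v<N> version suffix in {p.name!r}")
    new_version = int(match.group(1)) + num_bumps
    new_stem = p.stem[: match.start()] + f"_v{new_version}"
    # Keep the directory and extension, swap only the file name.
    return str(p.with_name(new_stem + p.suffix))
```

Under these assumptions, `bump_path("data/elever_p2023_v1.parquet")` gives `"data/elever_p2023_v2.parquet"`, and `num_bumps=2` skips one version.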
- get_data()
Loads the data from the specified path, or the most recent file version if the specified path is outdated.
- Returns:
The loaded data and metadata if successful, None otherwise.
- Return type:
None | tuple[pd.DataFrame, dict[str, str|bool]]
- Raises:
OSError – If the file extension is not parquet or sas7bdat.
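The extension check behind this error can be sketched as a lookup on the file suffix. The helper name and the case-insensitive comparison are assumptions, and the actual reading step is left out.

```python
from pathlib import PurePosixPath

# The two extensions named in the Raises entry above.
SUPPORTED_SUFFIXES = {".parquet", ".sas7bdat"}


def check_extension(path):
    """Return the lower-cased suffix, or raise OSError if it is unsupported."""
    suffix = PurePosixPath(str(path)).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise OSError(
            f"Unsupported file extension {suffix!r}; "
            "expected .parquet or .sas7bdat"
        )
    return suffix
```

A loader could then dispatch on the returned suffix, for instance to `pandas.read_parquet` or `pandas.read_sas`; which readers the library actually uses is not stated here.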
- get_latest_version_path()
Determines the most recent version path for the current file.
- Returns:
Path of the most recent file version.
- Return type:
str
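Picking the most recent version among candidate paths comes down to comparing version numbers numerically rather than as text. The sketch below assumes a `_v<N>.<ext>` naming convention and treats unversioned files as version 0; both are assumptions, and the helper names are hypothetical.

```python
import re

# Assumed naming convention: <name>_v<N>.<ext>
VERSION_RE = re.compile(r"_v(\d+)\.[^./]+$")


def version_of(path):
    """Version number parsed from the path, or 0 if there is none."""
    match = VERSION_RE.search(str(path))
    return int(match.group(1)) if match else 0


def latest_version_path(paths):
    """Return the candidate with the highest numeric version."""
    return max(paths, key=version_of)
```

Comparing integers matters once versions reach two digits: a plain string sort would place `x_v10.parquet` before `x_v2.parquet`.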
- get_similar_paths()
Finds paths that are similar to the current path, excluding versions.
- Returns:
A sorted list of similar file paths.
- Return type:
List[str]
- get_version(path='')
Gets the version number of the file at the specified path.
- Parameters:
path (Union[str, Path, GSPath]) – The path to check for versioning.
- Returns:
The version number.
- Return type:
int
- save(path='', bump_version=True, overwrite_mode=OverwriteMode.NONE, save_metadata=True)
Saves data to path.
Saves the data to a specified path, manages file versioning, and handles file overwriting based on the provided parameters. The method also offers the option to save associated metadata alongside the data.
- Parameters:
path (Union[str, Path, GSPath]) – The file path where the data should be saved. Defaults to the current object path.
bump_version (bool) – Whether to automatically increment the file version. Defaults to True.
overwrite_mode (Union[str, OverwriteMode]) – Specifies how to handle existing files at the target path. Can be ‘none’, ‘overwrite’, or ‘filebump’. Defaults to OverwriteMode.NONE.
save_metadata (bool) – Whether to save metadata along with the data. Defaults to True.
- Return type:
None
- Returns:
None
- Raises:
OSError – If the file already exists and the conditions for overwriting are not met as per the overwrite_mode.
Notes
The method converts string paths to Path or GSPath based on the runtime environment. The path is also forced to have a ‘.parquet’ suffix. If bump_version is enabled, the file path will be adjusted to accommodate a new version number according to existing files in the target directory.
Interactive prompts are used to confirm overwriting actions when necessary, providing a safety mechanism to prevent accidental data loss.
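The path handling described in these notes can be sketched as a pure function that decides where to write. This is a simplified illustration under several assumptions: the enum is a local stand-in, `resolve_save_path` and its `exists` callback are hypothetical, versions follow a `_v<N>` suffix, and the `bump_version` flag and the interactive prompts are left out entirely.

```python
import re
from enum import Enum
from pathlib import PurePosixPath

VERSION_RE = re.compile(r"_v(\d+)$")  # assumed convention: <name>_v<N>


class OverwriteMode(Enum):
    """Local stand-in mirroring the documented member values."""

    NONE = ""
    filebump = "filebump"
    overwrite = "overwrite"


def resolve_save_path(path, exists, overwrite_mode=OverwriteMode.NONE):
    """Return the path to write to, or raise OSError if writing is refused.

    exists is a callable telling whether a path string is already taken.
    """
    # Force a .parquet suffix, as described in the notes above.
    target = PurePosixPath(str(path)).with_suffix(".parquet")
    mode = OverwriteMode(overwrite_mode)
    if not exists(str(target)) or mode is OverwriteMode.overwrite:
        return str(target)
    if mode is OverwriteMode.filebump:
        # Bump the version until a free file name is found.
        while exists(str(target)):
            match = VERSION_RE.search(target.stem)
            if match is None:
                break  # unversioned name: nothing to bump
            bumped = f"_v{int(match.group(1)) + 1}"
            target = target.with_name(
                target.stem[: match.start()] + bumped + ".parquet"
            )
        else:
            return str(target)
    raise OSError(f"{target} already exists (overwrite_mode={mode.name})")
```

Passing `exists` as a callback keeps the decision logic testable without touching a file system; a set's `__contains__` method is enough to try it out.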