ssb_utdanning.paths package
ssb_utdanning.paths.get_paths module
- get_path_dates(path)
Extracts date information about a file from its path.
Extracts date information from a given file path based on specific patterns in the filename. This function assumes that the filename encodes dates in segments separated by underscores. Dates are expected either in segments prefixed with ‘p’, following conventions like ‘pYYYY-MM-DD’, or directly as the second-to-last part of the filename.
- Parameters:
path (str) – The full path to the file, from which the filename will be parsed for date information.
- Returns:
A tuple containing the parsed datetime objects. The tuple will contain two datetime objects if the filename includes two date segments (‘pYYYY-MM-DD_pYYYY-MM-DD’); otherwise, it will contain only one datetime object if only one date segment is found.
- Return type:
tuple[datetime.datetime] | tuple[datetime.datetime, datetime.datetime]
Note
The function utilizes the dateutil.parser.parse method to convert date strings into datetime objects, which allows for flexible parsing but also requires careful handling to avoid misinterpretation of non-standard date formats. It defaults to using a global DEFAULT_DATE as the base date if the date string is incomplete or partially specified.
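To make the filename convention concrete, here is a minimal sketch of the parsing step. It is not the package's implementation: it uses only the standard library in place of dateutil.parser.parse, accepts only complete ‘pYYYY-MM-DD’ segments, and ignores DEFAULT_DATE. The function name sketch_path_dates and the example filenames are hypothetical.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Assumption: a date segment is exactly 'p' followed by YYYY-MM-DD.
_DATE_SEGMENT = re.compile(r"p(\d{4}-\d{2}-\d{2})")


def sketch_path_dates(path: str) -> tuple:
    """Return the dates found in 'pYYYY-MM-DD' segments of the filename."""
    # Drop the directory and the extension, keeping only the filename stem.
    stem = PurePosixPath(path).stem
    dates = []
    for segment in stem.split("_"):
        match = _DATE_SEGMENT.fullmatch(segment)
        if match:
            dates.append(datetime.strptime(match.group(1), "%Y-%m-%d"))
    return tuple(dates)


# Hypothetical filenames with one and two date segments.
one = sketch_path_dates("/data/example_p2022-10-01_v1.parquet")
two = sketch_path_dates("/data/example_p2022-10-01_p2023-09-30_v1.parquet")
```

With these inputs, one holds a single datetime and two holds a start and an end datetime, mirroring the two tuple shapes in the return type.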
- get_path_latest(glob_pattern, exclude_keywords=None)
Retrieves path for most recent file matching glob pattern.
Retrieves the path of the most recent file that matches a specified glob pattern, excluding any files containing specified keywords. This function uses get_paths to gather all matching paths, then returns the first path from the sorted list, which represents the most recent file because get_paths sorts the paths in descending order.
- Parameters:
glob_pattern (str) – The glob pattern used to identify files. This pattern can specify locations within the local filesystem or a cloud storage path, depending on the execution environment.
exclude_keywords (str | list[str] | None) – A keyword or list of keywords to exclude in the search for files. Files containing these keywords in their filenames will be ignored. Defaults to None, which means no exclusions.
- Returns:
The path to the most recent file matching the specified criteria. If no files match, this function raises an IndexError.
- Return type:
str
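Because an empty match surfaces as an IndexError rather than a return value, callers may want to guard the lookup. The sketch below mimics the "first element of a descending list" step on a plain Python list. The helper name sketch_latest is hypothetical and it does not call the package.

```python
from typing import List, Optional


def sketch_latest(sorted_paths: List[str]) -> Optional[str]:
    """Return the first path of an already sorted list, or None if it is empty."""
    try:
        # Indexing an empty list raises IndexError, as described above.
        return sorted_paths[0]
    except IndexError:
        return None


# Hypothetical paths, already sorted in descending order.
latest = sketch_latest(["d/file_v3.parquet", "d/file_v2.parquet"])
missing = sketch_latest([])
```

Returning None is only one possible policy. Re-raising with a clearer message may suit pipelines where a missing file should stop execution.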
- get_path_reference_date(reference_datetime, glob_pattern, exclude_keywords=None)
Finds path from glob pattern and reference date.
Finds and returns the file path for files identified by glob_pattern whose date or date range matches a specified reference date. The function ensures the reference date falls within the date range or on the specified date but not on the exact bounds of a date range.
- Parameters:
reference_datetime (datetime.datetime | str) – The reference date used for finding the file. If a string is provided, it will be parsed into a datetime object.
glob_pattern (str) – The glob pattern to identify relevant files. This pattern should align with how files are named or structured to contain date information.
exclude_keywords (str | list[str] | None, optional) – Keywords to exclude certain files from being considered. Files containing any of these keywords in their path will be ignored.
- Returns:
The path of the file that aligns with the reference datetime.
- Return type:
str
- Raises:
ValueError – Raised in three cases: (1) the reference datetime exactly matches a boundary date of a range when two dates are provided; (2) no files meet the criteria of being within the specified date range or on the date; (3) the datetime string is malformed and cannot be parsed.
Note
The function uses get_paths_dates, which retrieves a mapping of file paths to their respective dates. Dates should be part of the filenames and parseable based on the structure expected by get_paths_dates. The date comparison is inclusive of the start date but exclusive of the end date when a range is specified.
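The selection step can be illustrated on a plain dictionary of paths and date tuples. This is a sketch under stated assumptions, not the package's logic: the name sketch_pick_by_reference is hypothetical, and the boundary handling follows the ValueError description (a reference date equal to either bound of a range is rejected), which is an assumption about the actual behaviour.

```python
from datetime import datetime
from typing import Dict, Tuple


def sketch_pick_by_reference(
    reference: datetime, paths_dates: Dict[str, Tuple[datetime, ...]]
) -> str:
    """Return the first path whose date or date range matches the reference."""
    for path, dates in paths_dates.items():
        if len(dates) == 2:
            start, end = dates
            # Assumption: a reference exactly on a bound is treated as an error.
            if reference in (start, end):
                raise ValueError(f"Reference date equals a bound of {path}")
            if start < reference < end:
                return path
        elif dates[0] == reference:
            return path
    raise ValueError("No file matches the reference date")


# Hypothetical mapping, shaped like the result of get_paths_dates.
mapping = {
    "a_p2021-01-01_p2021-12-31.parquet": (datetime(2021, 1, 1), datetime(2021, 12, 31)),
    "a_p2022-01-01_p2022-12-31.parquet": (datetime(2022, 1, 1), datetime(2022, 12, 31)),
}
chosen = sketch_pick_by_reference(datetime(2022, 6, 15), mapping)
```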
- get_paths(glob_pattern, exclude_keywords=None)
Retrieves a list of file paths that match a specified glob pattern and do not include any of the specified exclude keywords.
This function supports both local file systems and Google Cloud Storage (GCS) through the Data Platform (DP) interface, adapting based on the execution environment (REGION). It filters out any paths that contain keywords specified in exclude_keywords.
- Parameters:
glob_pattern (str) – The glob pattern used to find files. This pattern can represent a path on the local filesystem or a GCS bucket depending on the REGION.
exclude_keywords (str | list[str] | None, optional) – A string or list of strings representing keywords to exclude in the file search. Files containing these keywords in their paths will be excluded from the results. Defaults to None, which means no exclusions.
- Returns:
A list of strings, each representing a path to a file that matches the glob pattern and does not include the exclude keywords. The list is sorted in descending order based on the natural sorting of the paths.
- Return type:
list[str]
- Raises:
None directly, but the function may log errors or propagate exceptions depending on file system access permissions and availability.
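The filtering and ordering described for get_paths can be sketched on an in-memory list of paths. The sketch assumes that "natural sorting" means comparing runs of digits as numbers and that keywords are matched as plain substrings. It uses only the standard library, performs no globbing, and does not touch GCS. The names natural_key and sketch_filter_and_sort are hypothetical.

```python
import re
from typing import List, Optional, Union


def natural_key(path: str) -> list:
    """Split a path into text and integer chunks so 'v10' sorts after 'v2'."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", path)]


def sketch_filter_and_sort(
    paths: List[str], exclude_keywords: Optional[Union[str, List[str]]] = None
) -> List[str]:
    """Drop paths containing any keyword, then sort in descending natural order."""
    if exclude_keywords is None:
        keywords = []
    elif isinstance(exclude_keywords, str):
        keywords = [exclude_keywords]
    else:
        keywords = list(exclude_keywords)
    kept = [p for p in paths if not any(k in p for k in keywords)]
    return sorted(kept, key=natural_key, reverse=True)


# Hypothetical paths; a plain string sort would place v2 ahead of v10.
candidates = [
    "d/f_p2022_v2.parquet",
    "d/f_p2022_v10.parquet",
    "d/f_p2022_v1.parquet",
    "d/f_p2022_temp_v11.parquet",
]
result = sketch_filter_and_sort(candidates, "temp")
```

Here result lists the v10 file first, then v2, then v1, and the path containing "temp" is dropped.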
- get_paths_dates(glob_pattern, exclude_keywords=None)
Retrieves a dictionary mapping paths that match a glob pattern to their corresponding dates.
This function aggregates the dates associated with multiple files. It maps each file path that matches a specified glob pattern to the dates extracted from its filename by the get_path_dates function. Files containing any specified exclude keywords are omitted from the search.
- Parameters:
glob_pattern (str) – The glob pattern used to identify files. This can include paths on a local filesystem or within cloud storage, depending on the execution environment.
exclude_keywords (str | list[str] | None, optional) – A keyword or list of keywords that, if present in a file’s path, will cause that file to be excluded from the results. Defaults to None, meaning no exclusions are applied.
- Returns:
A dictionary where each key is a file path and each value is the tuple of dates associated with that file, as determined by the get_path_dates function. Each tuple holds one or two datetime objects, depending on how many date segments the filename contains.
- Return type:
dict[str, tuple[datetime.datetime] | tuple[datetime.datetime, datetime.datetime]]
Note
This function assumes that get_path_dates can extract a meaningful date from each path. The dates are parsed from the filename as described for get_path_dates.
ssb_utdanning.paths.versioning module
- bump_path(path, n=1)
Bumps version.
Increments the version number encoded in the file name of the specified path by a given amount. The version number is expected to be at the end of the filename, immediately preceding the file extension and prefixed with ‘v’, such as ‘file_v2.txt’.
This function can handle both string paths and Path-like objects (pathlib.Path, GSPath), updating the version accordingly.
- Parameters:
path (Union[str, Path, GSPath]) – The original file path whose version needs to be incremented. Can be a string or a Path-like object.
n (int) – The amount by which the version number should be incremented. Defaults to 1.
- Returns:
The new file path with the incremented version number. The type of the returned path matches the type of the input path.
- Return type:
Union[str, Path, GSPath]
Examples
Given a path ‘data/file_v2.txt’ and n=1, it returns ‘data/file_v3.txt’.
For a pathlib.Path object representing ‘output/report_v12.csv’ and n=2, it returns a Path object for ‘output/report_v14.csv’.
Note
The function assumes that the file name ends with a version number formatted as ‘_v<number>’. It also assumes that the file name contains only one period, which separates the name from the extension.
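The two examples above can be reproduced with a small regex-based sketch. It is an illustration under stated assumptions rather than the package's code: it handles only str and pathlib paths (GSPath is left out), and the name sketch_bump_path is hypothetical.

```python
import re
from pathlib import Path
from typing import Union

# Assumption: the name ends with '_v<number>' followed by a single extension.
_VERSION_SUFFIX = re.compile(r"_v(\d+)(\.[^./\\]+)$")


def sketch_bump_path(path: Union[str, Path], n: int = 1) -> Union[str, Path]:
    """Return the path with its trailing version number increased by n."""
    text = str(path)
    match = _VERSION_SUFFIX.search(text)
    if match is None:
        raise ValueError(f"No '_v<number>' suffix found in {text!r}")
    new_version = int(match.group(1)) + n
    bumped = f"{text[:match.start()]}_v{new_version}{match.group(2)}"
    # Hand back the same type that was passed in.
    return bumped if isinstance(path, str) else type(path)(bumped)


as_text = sketch_bump_path("data/file_v2.txt")
as_path = sketch_bump_path(Path("output/report_v12.csv"), n=2)
```

Anchoring the pattern at the end of the string keeps an earlier ‘_v’ fragment inside a directory or file name from being mistaken for the version.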
- get_version(path)
Extracts version number.
Extracts the version number from a file path where the version is expected to be encoded in the filename, typically at the end of the filename just before the file extension and prefixed by ‘v’. For example, in ‘file_v2.txt’, the version number is 2.
- Parameters:
path (Union[str, Path, GSPath]) – The file path or path object. The path can be a string, or an object from pathlib.Path or a cloud path object that implements similar functionality.
- Returns:
The version number extracted from the file name.
- Return type:
int
- Raises:
ValueError – If the version segment in the file name is not prefixed with ‘v’, or if the version number following ‘v’ is not a digit.
Examples
Given a path ‘folder/subfolder/file_v10.txt’, it will return 10.
Given a path ‘dataset_v3.parquet’, it will return 3.
Note
The function is dependent on a strict naming convention and will not correctly interpret file names that do not follow the expected pattern (‘prefix_v<number>.extension’). It will raise an error if the version part is malformed or absent.
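A minimal sketch of the extraction and its failure mode, assuming the version is the last underscore-separated segment of the filename stem. The name sketch_get_version is hypothetical, and the real function may split the name differently.

```python
from pathlib import PurePath


def sketch_get_version(path) -> int:
    """Return the integer after 'v' in the last '_' segment of the filename."""
    stem = PurePath(str(path)).stem  # filename without its extension
    segment = stem.rsplit("_", 1)[-1]
    if not segment.startswith("v") or not segment[1:].isdecimal():
        raise ValueError(f"Expected a 'v<number>' segment, got {segment!r}")
    return int(segment[1:])


first = sketch_get_version("folder/subfolder/file_v10.txt")
second = sketch_get_version("dataset_v3.parquet")
```

A name without a version segment, such as ‘dataset.parquet’, makes this sketch raise a ValueError, matching the strictness described in the note.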