arbmark.functions package¶
Submodules¶
arbmark.functions.aggregation module¶
- proc_sums(df, groups, values=None, agg_func=None)¶
Compute aggregations for combinations of columns and return a new DataFrame with these aggregations.
- Parameters:
df (DataFrame) – The input DataFrame.
groups (list[str]) – List of columns to be considered for groupings.
values (list[str] | None) – List of columns on which the aggregation functions will be applied. If None and agg_func is provided, it defaults to the keys of agg_func.
agg_func (dict[str, Any | list[Any]] | None) – Dictionary mapping columns to aggregation functions corresponding to the ‘values’ list. If None, defaults to ‘sum’ for all columns in ‘values’. Default None.
- Return type:
DataFrame
- Returns:
A DataFrame containing aggregations for all combinations of ‘groups’ with an additional ‘level’ column indicating the level of grouping.
- Raises:
ValueError – If any of the specified columns in ‘groups’ or ‘values’ are not present in the DataFrame.
ValueError – If any columns in ‘values’ are not numeric and no aggregation function is provided.
Note
The returned DataFrame also contains an additional column named ‘level’ indicating the level of grouping.
Columns not used in a particular level of grouping will have a value ‘Total’.
If ‘values’ is None and ‘agg_func’ is provided, ‘values’ is automatically set to the keys of ‘agg_func’.
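The grouping over combinations described above can be illustrated with a plain-Python sketch. It is not the library's pandas implementation: the helper name `grouped_sums`, the choice to number `level` by how many columns are grouped on, and the omission of a grand-total row are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def grouped_sums(rows, groups, values):
    """Sum `values` for every non-empty combination of `groups`.

    Hypothetical helper: columns left out of a combination get the
    placeholder 'Total', and 'level' is assumed to be the number of
    columns actually grouped on.
    """
    out = []
    for size in range(1, len(groups) + 1):
        for subset in combinations(groups, size):
            # One running total per distinct key of the current subset.
            totals = defaultdict(lambda: dict.fromkeys(values, 0))
            for row in rows:
                key = tuple(row[col] for col in subset)
                for val in values:
                    totals[key][val] += row[val]
            for key, sums in totals.items():
                record = dict.fromkeys(groups, "Total")
                record.update(zip(subset, key))
                record.update(sums)
                record["level"] = size
                out.append(record)
    return out


rows = [
    {"sex": "F", "region": "N", "jobs": 10},
    {"sex": "F", "region": "S", "jobs": 5},
    {"sex": "M", "region": "N", "jobs": 7},
]
result = grouped_sums(rows, ["sex", "region"], ["jobs"])
```

With two grouping columns the sketch produces one block of rows per single column (the other column set to ‘Total’) and one block for the full pair.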
arbmark.functions.files module¶
This function is outdated; use ‘latest_version_path’ from ssb-fagfunksjoner instead.
- read_latest(path, name, dottype='.parquet')¶
Finds the latest version of a specified file in a given directory and returns its path.
This function is outdated; use ‘latest_version_path’ from ssb-fagfunksjoner instead.
This function searches for files in the specified path that match the given name and file type, sorts them by modification time, and returns the path of the latest version. If no files are found, it returns None.
- Parameters:
path (str) – The directory path where the files are located.
name (str) – The base name of the files to search for.
dottype (str) – The file extension to look for. Defaults to “.parquet”.
- Returns:
The path of the latest version of the file if found, None otherwise.
- Return type:
Optional[str]
arbmark.functions.interval module¶
- pinterval(start_p, end_p, sep='', freq='m')¶
This function generates a list of monthly or quarterly periods between two given periods.
The periods are strings in the format ‘YYYY<separator>MM’ or ‘YYYYMM’ for monthly intervals, and ‘YYYY<separator>Q’ for quarterly intervals, where YYYY is a 4-digit year and MM is a 2-digit month (01 to 12) or Q is a 1-digit quarter (1 to 4). The function handles cases where the start and end periods are in the same year or in different years. The separator between year and month/quarter is customizable.
- Parameters:
start_p (str) – The start period in the format ‘YYYY<sep>MM’ or ‘YYYYMM’ for monthly intervals, and ‘YYYY<sep>Q’ for quarterly intervals.
end_p (str) – The end period in the format ‘YYYY<sep>MM’ or ‘YYYYMM’ for monthly intervals, and ‘YYYY<sep>Q’ for quarterly intervals.
sep (str) – A string to separate the year and month/quarter. Defaults to empty.
freq (str) – The intervals frequency, ‘m’ for monthly or ‘q’ for quarterly. Defaults to ‘m’.
- Return type:
list[str]
- Returns:
A list of strings representing the monthly or quarterly periods from start_p to end_p, inclusive.
- Raises:
ValueError – If the frequency is not ‘monthly’ or ‘quarterly’.
ValueError – If the start and end period do not include the specified separator.
Example:
>>> pinterval('2022k1', '2023k2', sep='k', freq='quarterly')
['2022k1', '2022k2', '2022k3', '2022k4', '2023k1', '2023k2']
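One way to generate such a period list is to map each period to a running index (year times periods per year, plus the offset within the year) and walk the range. The sketch below is an assumption-laden illustration, not the package's code: the name `periods_between` is hypothetical, and it accepts only the short frequency codes ‘m’ and ‘q’ from the parameter description.

```python
def periods_between(start_p: str, end_p: str, sep: str = "", freq: str = "m") -> list[str]:
    """List every monthly ('m') or quarterly ('q') period from start_p to end_p."""
    if freq not in ("m", "q"):
        raise ValueError("freq must be 'm' or 'q'")
    if sep and (sep not in start_p or sep not in end_p):
        raise ValueError("both periods must contain the separator")
    # Months are zero-padded to two digits, quarters are a single digit.
    per_year, width = (12, 2) if freq == "m" else (4, 1)

    def to_index(period: str) -> int:
        # The year is the first four characters; the rest follows the separator.
        year, part = int(period[:4]), int(period[4 + len(sep):])
        return year * per_year + (part - 1)

    return [
        f"{idx // per_year}{sep}{idx % per_year + 1:0{width}d}"
        for idx in range(to_index(start_p), to_index(end_p) + 1)
    ]


quarters = periods_between("2022k1", "2023k2", sep="k", freq="q")
months = periods_between("202211", "202302")
```

Working on a single integer index avoids special handling for ranges that cross a year boundary.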
arbmark.functions.merge module¶
- indicate_merge(left, right, how, on)¶
Performs a merge of two DataFrames and prints a frequency table indicating the merge type for each row.
- The merge types are determined as follows (left-to-right):
‘one-to-zero’: Rows that exist only in the left DataFrame.
‘zero-to-one’: Rows that exist only in the right DataFrame.
‘many-to-zero’: Rows in the left DataFrame with multiple identical entries and no matching entries in the right DataFrame.
‘zero-to-many’: Rows in the right DataFrame with multiple identical entries and no matching entries in the left DataFrame.
‘one-to-one’: Rows that have a matching entry in both left and right DataFrames.
‘many-to-one’: Rows in the right DataFrame with multiple matching entries in the left DataFrame.
‘one-to-many’: Rows in the left DataFrame with multiple matching entries in the right DataFrame.
‘many-to-many’: Rows in both left and right DataFrames with multiple matching entries.
- Parameters:
left (DataFrame) – The left DataFrame to be merged.
right (DataFrame) – The right DataFrame to be merged.
how (Literal['left', 'right', 'outer', 'inner', 'cross']) – The type of merge to be performed. Options are: ‘inner’, ‘outer’, ‘left’, ‘right’.
on (str | list[str]) – A column name or a list of column names to merge on.
- Return type:
DataFrame
- Returns:
The merged DataFrame.
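The merge-type labels can be thought of as a pair of counts per key: how often the key occurs on the left and how often on the right. The following plain-Python sketch shows that reading; it is an illustration only, and `merge_type` and the sample keys are hypothetical rather than part of the package.

```python
from collections import Counter


def merge_type(n_left: int, n_right: int) -> str:
    """Name the left-to-right relationship for a key seen n_left and n_right times."""
    def label(n: int) -> str:
        return "zero" if n == 0 else "one" if n == 1 else "many"
    return f"{label(n_left)}-to-{label(n_right)}"


left_keys = ["a", "b", "b", "c"]
right_keys = ["b", "c", "c", "d"]
left_counts, right_counts = Counter(left_keys), Counter(right_keys)

# Classify every key that appears on either side; a Counter returns 0
# for keys it has never seen, which gives the "zero" cases for free.
types = {
    key: merge_type(left_counts[key], right_counts[key])
    for key in sorted(set(left_keys) | set(right_keys))
}
```

Here `types` maps ‘a’ to ‘one-to-zero’, ‘b’ to ‘many-to-one’, ‘c’ to ‘one-to-many’ and ‘d’ to ‘zero-to-one’.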
arbmark.functions.quarter module¶
- first_last_date_quarter(year_str, quarter_str)¶
Given a year and a quarter, this function calculates the first and last dates of the specified quarter using pandas.
- Parameters:
year_str (str) – The year as a string.
quarter_str (str) – The quarter as a string.
- Return type:
tuple[str, str]
- Returns:
A tuple containing two strings, the first and last dates of the specified quarter in ‘YYYY-MM-DD’ format.
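As a rough illustration of the calculation, the quarter boundaries can be derived with the standard library alone. The documented function uses pandas; this sketch does not, and it assumes the quarter is passed as a digit string such as ‘1’. The name `quarter_bounds` is hypothetical.

```python
import calendar
from datetime import date


def quarter_bounds(year_str: str, quarter_str: str) -> tuple[str, str]:
    """Return the first and last dates of a quarter as 'YYYY-MM-DD' strings."""
    year, quarter = int(year_str), int(quarter_str)
    first_month = 3 * (quarter - 1) + 1
    last_month = first_month + 2
    # monthrange returns (weekday of day 1, number of days in the month).
    last_day = calendar.monthrange(year, last_month)[1]
    return (
        date(year, first_month, 1).isoformat(),
        date(year, last_month, last_day).isoformat(),
    )


q1_2024 = quarter_bounds("2024", "1")
q4_2023 = quarter_bounds("2023", "4")
```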
arbmark.functions.reference module¶
- ref_day(from_dates, to_dates)¶
Determines if the reference day falls between given date ranges.
This function checks if the 16th day of each month (reference day) is within the range specified by the corresponding ‘from_dates’ and ‘to_dates’. It requires that both ‘from_dates’ and ‘to_dates’ are in the same year and month.
- Parameters:
from_dates (Series) – A Series of dates representing the start of a period. These dates should be in the ‘YYYY-MM-DD’ format.
to_dates (Series) – A Series of dates representing the end of a period. These dates should also be in the ‘YYYY-MM-DD’ format.
- Return type:
Series
- Returns:
A Pandas Series of boolean values. Each element in the Series corresponds to whether the 16th day of the month for each period is within the respective date range.
- Raises:
ValueError – If ‘from_dates’ and ‘to_dates’ are not in the same year, or if they are not in the same month.
- ref_tuesday(from_dates, to_dates)¶
Determines if the Tuesday in the same week as the 16th falls between given date ranges.
This function finds the Tuesday in the same week as the 16th day of each month and checks if it is within the range specified by the corresponding ‘from_dates’ and ‘to_dates’. It requires that both ‘from_dates’ and ‘to_dates’ are in the same year and month.
- Parameters:
from_dates (Series) – A Series of dates representing the start of a period. These dates should be in the ‘YYYY-MM-DD’ format.
to_dates (Series) – A Series of dates representing the end of a period. These dates should also be in the ‘YYYY-MM-DD’ format.
- Return type:
Series
- Returns:
A Pandas Series of boolean values. Each element in the Series corresponds to whether the Tuesday in the week of the 16th day of the month for each period is within the respective date range.
- Raises:
ValueError – If ‘from_dates’ and ‘to_dates’ are not in the same year, or if they are not in the same month.
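For a single period, the reference-Tuesday check might look like the sketch below. It works on `datetime.date` values rather than a pandas Series, and it assumes that weeks start on Monday and that both endpoints of the period are inclusive; the helper names are hypothetical.

```python
from datetime import date, timedelta


def reference_tuesday(year: int, month: int) -> date:
    """Return the Tuesday of the Monday-based week that contains the 16th."""
    sixteenth = date(year, month, 16)
    monday = sixteenth - timedelta(days=sixteenth.weekday())
    return monday + timedelta(days=1)


def covers_ref_tuesday(from_date: date, to_date: date) -> bool:
    """Check one period; both dates must share year and month."""
    if (from_date.year, from_date.month) != (to_date.year, to_date.month):
        raise ValueError("from_date and to_date must be in the same year and month")
    tuesday = reference_tuesday(from_date.year, from_date.month)
    return from_date <= tuesday <= to_date


# 16 March 2024 is a Saturday, so the reference Tuesday is 12 March.
tuesday_march_2024 = reference_tuesday(2024, 3)
early_period = covers_ref_tuesday(date(2024, 3, 1), date(2024, 3, 12))
late_period = covers_ref_tuesday(date(2024, 3, 13), date(2024, 3, 31))
```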
- ref_week(from_dates, to_dates)¶
Determines if any date in each date range falls in the reference week.
This function checks if any date between the ‘from_dates’ and ‘to_dates’ is within the reference week. The reference week is defined as the ISO week which includes the 16th day of each month. The use of the ISO week date system ensures consistency with international standards, where the week starts on Monday and the first week of the year is the one containing the first Thursday. It requires that both ‘from_dates’ and ‘to_dates’ are in the same year and month.
- Parameters:
from_dates (Series) – A Series of dates representing the start of a period. These dates should be in the ‘YYYY-MM-DD’ format.
to_dates (Series) – A Series of dates representing the end of a period. These dates should also be in the ‘YYYY-MM-DD’ format.
- Return type:
Series
- Returns:
A Series of booleans, where each boolean corresponds to whether any date in the period from ‘from_dates’ to ‘to_dates’ falls within the reference week of the month as defined by the ISO week date system.
- Raises:
ValueError – If ‘from_dates’ and ‘to_dates’ are not in the same year, or if they are not in the same month.
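The reference-week test amounts to an interval overlap between the period and the Monday-to-Sunday week that contains the 16th. A minimal sketch for one period, using `datetime.date` instead of a pandas Series and hypothetical helper names, could be:

```python
from datetime import date, timedelta


def reference_week(year: int, month: int) -> tuple[date, date]:
    """Return (Monday, Sunday) of the ISO week containing the 16th."""
    sixteenth = date(year, month, 16)
    # isoweekday() is 1 for Monday through 7 for Sunday.
    monday = sixteenth - timedelta(days=sixteenth.isoweekday() - 1)
    return monday, monday + timedelta(days=6)


def overlaps_ref_week(from_date: date, to_date: date) -> bool:
    """True if any day in [from_date, to_date] lies in the reference week."""
    if (from_date.year, from_date.month) != (to_date.year, to_date.month):
        raise ValueError("from_date and to_date must be in the same year and month")
    monday, sunday = reference_week(from_date.year, from_date.month)
    return from_date <= sunday and to_date >= monday


# 16 January 2023 is a Monday, so the reference week is 16 to 22 January.
week_jan_2023 = reference_week(2023, 1)
before = overlaps_ref_week(date(2023, 1, 1), date(2023, 1, 15))
touching = overlaps_ref_week(date(2023, 1, 22), date(2023, 1, 31))
```

A period that ends on the 15th misses the week entirely, while one that starts on the 22nd still touches its last day.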
arbmark.functions.statbank_formats module¶
- sb_integer(number, unit=0)¶
Format a pandas Series of numbers as rounded integers, with optional unit scaling.
- Parameters:
number (Series) – A pandas Series containing numeric values.
unit (int) – The power of 10 to which to round the numbers. Default is 0 (no scaling).
- Return type:
Series
- Returns:
A pandas Series with the numbers rounded to the specified unit, converted to strings, and with NaNs replaced by empty strings.
- sb_percent(fraction, decimals=1)¶
Convert a pandas Series of fractions to percentages, formatted as strings.
- Parameters:
fraction (Series) – A pandas Series containing fractional values (e.g., 0.25 for 25%).
decimals (int) – Number of decimal places to round the percentage values to. Default is 1.
- Return type:
Series
- Returns:
A pandas Series with the percentage values formatted as strings, with a comma as the decimal separator and empty strings for NaNs and infinities.
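The formatting rule for a single value can be sketched as follows. This is a scalar illustration under assumptions, not the Series-based implementation: the name `format_percent` is hypothetical, and the sketch assumes the output carries no percent sign.

```python
import math


def format_percent(fraction: float, decimals: int = 1) -> str:
    """Format a fraction as a percentage string with a decimal comma."""
    # Missing and infinite values become empty strings.
    if math.isnan(fraction) or math.isinf(fraction):
        return ""
    return f"{fraction * 100:.{decimals}f}".replace(".", ",")


quarter = format_percent(0.25)
two_decimals = format_percent(0.1234, decimals=2)
missing = format_percent(float("nan"))
```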
arbmark.functions.workdays module¶
- count_days(from_dates, to_dates, calendar)¶
Counts the days between pairs of start and end dates using a provided calendar.
- Parameters:
from_dates (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]) – Numpy array of start dates.
to_dates (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]) – Numpy array of end dates.
calendar (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]) – Numpy array representing the days to be counted.
- Return type:
Series
- Returns:
A Pandas Series with the count of days for each pair.
- count_holidays(from_dates, to_dates)¶
Counts the number of holidays between pairs of dates in given series.
This function calculates the number of holidays for each pair of start and end dates provided in the from_dates and to_dates series. It uses the holidays specific to Norway and considers each pair’s specific date range.
- Parameters:
from_dates (Series) – A pandas Series containing the start dates of the periods.
to_dates (Series) – A pandas Series containing the end dates of the periods.
- Return type:
Series
- Returns:
A Pandas Series containing the number of holidays for each date pair.
- count_weekend_days(from_dates, to_dates)¶
Counts the number of weekend days between pairs of dates in given series.
This function calculates the number of weekend days for each pair of start and end dates provided in the from_dates and to_dates series. It identifies weekends based on a calculation using the Unix epoch as the reference point. The result includes the total number of Saturdays and Sundays within each specified date range.
- Parameters:
from_dates (Series) – A pandas Series containing the start dates of the periods.
to_dates (Series) – A pandas Series containing the end dates of the periods.
- Return type:
Series
- Returns:
A Pandas Series containing the number of weekend days for each date pair.
- count_workdays(from_dates, to_dates)¶
Counts the number of workdays between pairs of dates in given series.
This function calculates the number of workdays for each pair of start and end dates provided in the from_dates and to_dates series. It handles date ranges spanning multiple years and excludes weekends and holidays specific to Norway. The function dynamically fetches Norwegian holidays for the relevant years based on the input dates. Weekends are identified using a calculation that considers the Unix epoch (1970-01-01) as the reference starting point. After adjusting with a -4 shift and modulo 7, the weekdays are mapped as Monday (0) through Sunday (6), with Saturday (5) and Sunday (6) identified as the weekend days.
- Parameters:
from_dates (Series) – A pandas Series containing the start dates of the periods.
to_dates (Series) – A pandas Series containing the end dates of the periods.
- Return type:
Series
- Returns:
A Pandas Series containing the number of workdays for each date pair.
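The epoch-based weekday mapping can be checked with a small standard-library sketch. Since 1970-01-01 was a Thursday, subtracting 4 from the day count and taking modulo 7 lands on Monday=0 through Sunday=6. The helper names, the inclusive treatment of both endpoints, and the hand-written holiday set are assumptions for illustration; the package itself works on Numpy arrays and fetches the Norwegian holidays itself.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)  # a Thursday


def weekday_from_epoch(day: date) -> int:
    """Map a date to Monday=0 ... Sunday=6 using days since the Unix epoch."""
    days_since_epoch = (day - EPOCH).days
    return (days_since_epoch - 4) % 7


def workdays_between(from_date: date, to_date: date, holidays: set) -> int:
    """Count days in [from_date, to_date] that are neither weekend nor holiday."""
    n_days = (to_date - from_date).days + 1
    calendar = (from_date + timedelta(days=i) for i in range(n_days))
    return sum(
        1 for day in calendar
        if weekday_from_epoch(day) < 5 and day not in holidays
    )


# Illustrative holiday set: 17 May 2024 (Constitution Day) falls on a Friday.
holidays = {date(2024, 5, 17)}
saturday_code = weekday_from_epoch(date(2024, 5, 18))
workdays = workdays_between(date(2024, 5, 13), date(2024, 5, 19), holidays)
```

For the week of 13 to 19 May 2024 this leaves Monday through Thursday as workdays.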
- filter_holidays(calendar, holidays)¶
Selects the holiday dates from a given calendar.
This function identifies and returns only the dates in the calendar that are recognized as holidays, excluding holidays that fall on weekends.
- Parameters:
calendar (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]) – Numpy array of dates.
holidays (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]) – Numpy array of dates that are holidays.
- Return type:
ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- Returns:
A Numpy array of dates that are recognized as holidays.
- filter_weekends(calendar)¶
Selects the weekend dates from a given calendar.
This function identifies which days in the provided calendar are weekends and returns only those dates.
- Parameters:
calendar (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]) – Numpy array of dates, typically encompassing multiple weeks.
- Return type:
ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- Returns:
A Numpy array of dates that fall on weekends (Saturday and Sunday).
- filter_workdays(calendar, holidays)¶
Filters out weekends and holidays from a calendar, leaving only workdays.
- Parameters:
calendar (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]) – Numpy array of dates.
holidays (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]) – Numpy array of holiday dates.
- Return type:
ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- Returns:
A Numpy array of dates that are workdays.
- get_calendar(from_date, to_date)¶
Generates a calendar as a range of dates from a start date to an end date.
- Parameters:
from_date (datetime64) – The start date.
to_date (datetime64) – The end date.
- Return type:
ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- Returns:
A Numpy array representing a range of dates from start to end.
- get_norwegian_holidays(years)¶
Fetches Norwegian holidays for a given range of years and returns them as a sorted Numpy array of dates.
- Parameters:
years (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]) – Numpy array of years for which holidays are to be fetched.
- Return type:
ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- Returns:
A Numpy array of holiday dates.
- get_years(from_dates, to_dates)¶
Extracts unique years from two series of dates.
- Parameters:
from_dates (Series) – A Pandas Series of start dates.
to_dates (Series) – A Pandas Series of end dates.
- Return type:
ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- Returns:
A Numpy array of unique years derived from the date ranges.
- is_weekend(calendar)¶
Determines which days in a given calendar are weekends.
- Parameters:
calendar (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]) – Numpy array of dates.
- Return type:
ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- Returns:
A Numpy boolean array where True indicates a weekend.
- numpy_dates(dates)¶
Converts a Pandas Series of timestamps to a Numpy array of dates in ‘datetime64[D]’ format.
- Parameters:
dates (Series) – A pandas Series containing timestamps.
- Return type:
ndarray[Any, dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- Returns:
A Numpy array containing dates.
Module contents¶
A collection of useful functions.