Reference

arbmark package

Subpackages

arbmark.functions package

arbmark.functions.aggregation module

proc_sums(df, groups, values=None, agg_func=None)

Compute aggregations for combinations of columns and return a new DataFrame with these aggregations.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • groups (list[str]) – List of columns to be considered for groupings.

  • values (list[str] | None) – List of columns on which the aggregation functions will be applied. If None and agg_func is provided, it defaults to the keys of agg_func.

  • agg_func (dict[str, Any | list[Any]] | None) – Dictionary mapping columns to aggregation functions corresponding to the ‘values’ list. If None, defaults to ‘sum’ for all columns in ‘values’. Default None.

Return type:

DataFrame

Returns:

A DataFrame containing aggregations for all combinations of ‘groups’ with an additional ‘level’ column indicating the level of grouping.

Raises:
  • ValueError – If any of the specified columns in ‘groups’ or ‘values’ are not present in the DataFrame.

  • ValueError – If any columns in ‘values’ are not numeric and no aggregation function is provided.

Note

  • The returned DataFrame also contains an additional column named ‘level’ indicating the level of grouping.

  • Columns not used in a particular level of grouping will have a value ‘Total’.

  • If ‘values’ is None and ‘agg_func’ is provided, ‘values’ is automatically set to the keys of ‘agg_func’.
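The idea of aggregating over every combination of grouping columns can be illustrated with a small standard-library sketch. This is not the implementation of proc_sums: the helper name grouped_sums, the list-of-dicts input and the sample columns are hypothetical, and the ‘level’ value used here (the number of columns grouped on) is an assumption about how levels are numbered.

```python
from collections import defaultdict
from itertools import combinations


def grouped_sums(rows, groups, value):
    """Sum `value` for every combination of the `groups` columns.

    Columns left out of a combination are filled with 'Total'.
    Assumption: 'level' is the number of columns grouped on; the real
    proc_sums may number its levels differently.
    """
    result = []
    for level in range(len(groups) + 1):
        for combo in combinations(groups, level):
            totals = defaultdict(float)
            for row in rows:
                # Keep the real value for grouped columns, 'Total' otherwise.
                key = tuple(row[col] if col in combo else "Total" for col in groups)
                totals[key] += row[value]
            for key, total in sorted(totals.items()):
                out = dict(zip(groups, key))
                out[value] = total
                out["level"] = level
                result.append(out)
    return result


# Hypothetical sample data.
rows = [
    {"sex": "F", "region": "North", "wage": 10.0},
    {"sex": "F", "region": "South", "wage": 20.0},
    {"sex": "M", "region": "North", "wage": 30.0},
]
table = grouped_sums(rows, ["sex", "region"], "wage")
```

The first row of table is the grand total, where both grouping columns hold ‘Total’.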

arbmark.functions.files module

This function is outdated; use ‘latest_version_path’ from ssb-fagfunksjoner instead.

read_latest(path, name, dottype='.parquet')

Finds the latest version of a specified file in a given directory and returns its path.

This function is outdated; use ‘latest_version_path’ from ssb-fagfunksjoner instead.

This function searches for files in the specified path that match the given name and file type, sorts them by modification time, and returns the path of the latest version. If no files are found, it returns None.

Parameters:
  • path (str) – The directory path where the files are located.

  • name (str) – The base name of the files to search for.

  • dottype (str) – The file extension to look for. Defaults to “.parquet”.

Returns:

The path of the latest version of the file if found, None otherwise.

Return type:

Optional[str]
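A minimal sketch of the "newest file by modification time" lookup described above, using only the standard library. The helper name newest_file is hypothetical, and the glob pattern (name, anything, then the extension) is an assumption about how file names are matched; read_latest may match differently.

```python
import glob
import os


def newest_file(path, name, dottype=".parquet"):
    """Return the most recently modified file matching name*dottype, or None."""
    # Assumption: versions share the base name and differ in a suffix before the extension.
    pattern = os.path.join(path, f"{name}*{dottype}")
    matches = glob.glob(pattern)
    if not matches:
        return None
    # The file with the largest modification time is the latest version.
    return max(matches, key=os.path.getmtime)
```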

arbmark.functions.interval module

pinterval(start_p, end_p, sep='', freq='m')

This function generates a list of monthly or quarterly periods between two given periods.

The periods are strings in the format ‘YYYY<separator>MM’ or ‘YYYYMM’ for monthly intervals, and ‘YYYY<separator>Q’ for quarterly intervals, where YYYY is a 4-digit year and MM is a 2-digit month (01 to 12) or Q is a 1-digit quarter (1 to 4). The function handles cases where the start and end periods are in the same year or in different years. The separator between year and month/quarter is customizable.

Parameters:
  • start_p (str) – The start period in the format ‘YYYY<sep>MM’ or ‘YYYYMM’ for monthly intervals, and ‘YYYY<sep>Q’ for quarterly intervals.

  • end_p (str) – The end period in the format ‘YYYY<sep>MM’ or ‘YYYYMM’ for monthly intervals, and ‘YYYY<sep>Q’ for quarterly intervals.

  • sep (str) – A string to separate the year and month/quarter. Defaults to empty.

  • freq (str) – The interval frequency, ‘m’ for monthly or ‘q’ for quarterly. Defaults to ‘m’.

Return type:

list[str]

Returns:

A list of strings representing the monthly or quarterly periods from start_p to end_p, inclusive.

Raises:
  • ValueError – If the frequency is not ‘monthly’ or ‘quarterly’.

  • ValueError – If the start and end periods do not include the specified separator.

Example

>>> pinterval('2022k1', '2023k2', sep='k', freq='quarterly')
['2022k1', '2022k2', '2022k3', '2022k4', '2023k1', '2023k2']
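One way to generate such a list is to convert each period to a running index, walk the range, and convert back. The sketch below is an illustration under that assumption, not the code behind pinterval; the helper name periods_between and its per_year parameter are hypothetical.

```python
def periods_between(start_p, end_p, sep="", per_year=12):
    """List periods from start_p to end_p, inclusive.

    per_year=12 gives two-digit months, per_year=4 gives one-digit quarters.
    """
    width = 2 if per_year == 12 else 1

    def to_index(period):
        # The first four characters are the year; the sub-period follows the separator.
        year = int(period[:4])
        sub = int(period[4 + len(sep):])
        return year * per_year + (sub - 1)

    result = []
    for index in range(to_index(start_p), to_index(end_p) + 1):
        year, sub = divmod(index, per_year)
        result.append(f"{year}{sep}{sub + 1:0{width}d}")
    return result


quarters = periods_between("2022k1", "2023k2", sep="k", per_year=4)
months = periods_between("202211", "202302")
```

Working with a single integer index avoids special handling for ranges that cross a year boundary.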

arbmark.functions.merge module

indicate_merge(left, right, how, on)

Performs a merge of two DataFrames and prints a frequency table indicating the merge type for each row.

The merge types are determined as follows (left-to-right):
  • ‘one-to-zero’: Rows that exist only in the left DataFrame.

  • ‘zero-to-one’: Rows that exist only in the right DataFrame.

  • ‘many-to-zero’: Rows in the left DataFrame with multiple identical entries and no matching entries in the right DataFrame.

  • ‘zero-to-many’: Rows in the right DataFrame with multiple identical entries and no matching entries in the left DataFrame.

  • ‘one-to-one’: Rows that have a matching entry in both left and right DataFrames.

  • ‘many-to-one’: Rows in the right DataFrame with multiple matching entries in the left DataFrame.

  • ‘one-to-many’: Rows in the left DataFrame with multiple matching entries in the right DataFrame.

  • ‘many-to-many’: Rows in both left and right DataFrames with multiple matching entries.

Parameters:
  • left (DataFrame) – The left DataFrame to be merged.

  • right (DataFrame) – The right DataFrame to be merged.

  • how (Literal['left', 'right', 'outer', 'inner', 'cross']) – The type of merge to be performed. Options are: ‘inner’, ‘outer’, ‘left’, ‘right’.

  • on (str | list[str]) – A column name or a list of column names to merge on.

Return type:

DataFrame

Returns:

The merged DataFrame.
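The left-to-right naming of the merge types can be illustrated by counting how often each key occurs on either side. This standard-library sketch classifies keys rather than rows and is not how indicate_merge is implemented; the helper name merge_types and the sample keys are hypothetical.

```python
from collections import Counter


def merge_types(left_keys, right_keys):
    """Label each key as '<left count>-to-<right count>' (zero, one or many)."""
    left, right = Counter(left_keys), Counter(right_keys)

    def side(count):
        if count == 0:
            return "zero"
        return "one" if count == 1 else "many"

    # Counter returns 0 for keys that are missing on one side.
    return {
        key: f"{side(left[key])}-to-{side(right[key])}"
        for key in sorted(set(left) | set(right))
    }


labels = merge_types(left_keys=[1, 2, 2, 3], right_keys=[2, 3, 3, 4])
```

Here key 1 exists only on the left, key 2 is duplicated on the left, key 3 is duplicated on the right, and key 4 exists only on the right.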

arbmark.functions.quarter module

first_last_date_quarter(year_str, quarter_str)

Given a year and a quarter, this function calculates the first and last dates of the specified quarter using pandas.

Parameters:
  • year_str (str) – The year as a string.

  • quarter_str (str) – The quarter as a string.

Return type:

tuple[str, str]

Returns:

A tuple containing two strings, the first and last dates of the specified quarter in ‘YYYY-MM-DD’ format.
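The calculation can be sketched with the standard library instead of pandas. The helper name quarter_bounds is hypothetical, and the sketch assumes the quarter is passed as ‘1’ to ‘4’; the accepted input format of first_last_date_quarter is not stated above.

```python
import calendar
from datetime import date


def quarter_bounds(year_str, quarter_str):
    """Return the first and last date of a quarter as 'YYYY-MM-DD' strings."""
    year, quarter = int(year_str), int(quarter_str)
    first_month = 3 * (quarter - 1) + 1
    last_month = first_month + 2
    # monthrange returns (weekday of first day, number of days in the month).
    last_day = calendar.monthrange(year, last_month)[1]
    first = date(year, first_month, 1)
    last = date(year, last_month, last_day)
    return first.isoformat(), last.isoformat()
```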

arbmark.functions.reference module

ref_day(from_dates, to_dates)

Determines if the reference day falls between given date ranges.

This function checks if the 16th day of each month (reference day) is within the range specified by the corresponding ‘from_dates’ and ‘to_dates’. It requires that both ‘from_dates’ and ‘to_dates’ are in the same year and month.

Parameters:
  • from_dates (Series) – A Series of dates representing the start of a period. These dates should be in the ‘YYYY-MM-DD’ format.

  • to_dates (Series) – A Series of dates representing the end of a period. These dates should also be in the ‘YYYY-MM-DD’ format.

Return type:

Series

Returns:

A Pandas Series of boolean values. Each element in the Series corresponds to whether the 16th day of the month for each period is within the respective date range.

Raises:

ValueError – If ‘from_dates’ and ‘to_dates’ are not in the same year, or if they are not in the same month.
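For a single pair of dates, the check amounts to the following sketch. It uses datetime.date values rather than a Series, the helper name covers_reference_day is hypothetical, and treating both endpoints as inclusive is an assumption.

```python
from datetime import date


def covers_reference_day(from_date, to_date):
    """Return True if the 16th of the month lies within [from_date, to_date]."""
    if (from_date.year, from_date.month) != (to_date.year, to_date.month):
        raise ValueError("from_date and to_date must be in the same year and month")
    reference_day = from_date.replace(day=16)
    # Assumption: both endpoints count as part of the period.
    return from_date <= reference_day <= to_date
```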

ref_tuesday(from_dates, to_dates)

Determines if the Tuesday in the same week as the 16th falls between given date ranges.

This function finds the Tuesday in the same week as the 16th day of each month and checks if it is within the range specified by the corresponding ‘from_dates’ and ‘to_dates’. It requires that both ‘from_dates’ and ‘to_dates’ are in the same year and month.

Parameters:
  • from_dates (Series) – A Series of dates representing the start of a period. These dates should be in the ‘YYYY-MM-DD’ format.

  • to_dates (Series) – A Series of dates representing the end of a period. These dates should also be in the ‘YYYY-MM-DD’ format.

Return type:

Series

Returns:

A Pandas Series of boolean values. Each element in the Series corresponds to whether the Tuesday in the week of the 16th day of the month for each period is within the respective date range.

Raises:

ValueError – If ‘from_dates’ and ‘to_dates’ are not in the same year, or if they are not in the same month.
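Locating the reference Tuesday for a given month can be sketched as below. The helper name reference_tuesday is hypothetical, and the sketch assumes weeks run from Monday to Sunday, which the description above does not state explicitly.

```python
from datetime import date, timedelta


def reference_tuesday(year, month):
    """Return the Tuesday in the Monday-to-Sunday week that contains the 16th."""
    sixteenth = date(year, month, 16)
    # weekday() is 0 for Monday, so this steps back to the Monday of that week.
    monday = sixteenth - timedelta(days=sixteenth.weekday())
    return monday + timedelta(days=1)
```

A period then qualifies when this date lies between its start and end dates.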

ref_week(from_dates, to_dates)

Determines if any date in each date range falls in the reference week.

This function checks if any date between the ‘from_dates’ and ‘to_dates’ is within the reference week. The reference week is defined as the ISO week which includes the 16th day of each month. The use of the ISO week date system ensures consistency with international standards, where the week starts on Monday and the first week of the year is the one containing the first Thursday. It requires that both ‘from_dates’ and ‘to_dates’ are in the same year and month.

Parameters:
  • from_dates (Series) – A Series of dates representing the start of a period. These dates should be in the ‘YYYY-MM-DD’ format.

  • to_dates (Series) – A Series of dates representing the end of a period. These dates should also be in the ‘YYYY-MM-DD’ format.

Return type:

Series

Returns:

A Series of booleans, where each boolean corresponds to whether any date in the period from ‘from_dates’ to ‘to_dates’ falls within the reference week of the month as defined by the ISO week date system.

Raises:

ValueError – If ‘from_dates’ and ‘to_dates’ are not in the same year, or if they are not in the same month.
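For one pair of dates, the reference-week test is an interval overlap check against the Monday-to-Sunday week containing the 16th. The sketch below uses datetime.date values; the helper name overlaps_reference_week is hypothetical and inclusive endpoints are an assumption.

```python
from datetime import date, timedelta


def overlaps_reference_week(from_date, to_date):
    """Return True if [from_date, to_date] shares a day with the reference week."""
    sixteenth = from_date.replace(day=16)
    # ISO weeks start on Monday; weekday() is 0 for Monday.
    week_start = sixteenth - timedelta(days=sixteenth.weekday())
    week_end = week_start + timedelta(days=6)
    # Two closed intervals overlap when each starts no later than the other ends.
    return from_date <= week_end and to_date >= week_start
```

In June 2024 the 16th is a Sunday, so the reference week runs from 10 June to 16 June.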

arbmark.functions.workdays module

count_days(from_dates, to_dates, calendar)

Counts the days between pairs of start and end dates using a provided calendar.

Parameters:
  • from_dates (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]) – Numpy array of start dates.

  • to_dates (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]) – Numpy array of end dates.

  • calendar (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]) – Numpy array representing the days to be counted.

Return type:

Series

Returns:

A Pandas Series with the count of days for each pair.

count_holidays(from_dates, to_dates)

Counts the number of holidays between pairs of dates in given series.

This function calculates the number of holidays for each pair of start and end dates provided in the from_dates and to_dates series. It uses the holidays specific to Norway and considers each pair’s specific date range.

Parameters:
  • from_dates (Series) – A pandas Series containing the start dates of the periods.

  • to_dates (Series) – A pandas Series containing the end dates of the periods.

Return type:

Series

Returns:

A Pandas Series containing the number of holidays for each date pair.

count_weekend_days(from_dates, to_dates)

Counts the number of weekend days between pairs of dates in given series.

This function calculates the number of weekend days for each pair of start and end dates provided in the from_dates and to_dates series. It identifies weekends based on a calculation using the Unix epoch as the reference point. The result includes the total number of Saturdays and Sundays within each specified date range.

Parameters:
  • from_dates (Series) – A pandas Series containing the start dates of the periods.

  • to_dates (Series) – A pandas Series containing the end dates of the periods.

Return type:

Series

Returns:

A Pandas Series containing the number of weekend days for each date pair.

count_workdays(from_dates, to_dates)

Counts the number of workdays between pairs of dates in given series.

This function calculates the number of workdays for each pair of start and end dates provided in the from_dates and to_dates series. It handles date ranges spanning multiple years and excludes weekends and holidays specific to Norway. The function dynamically fetches Norwegian holidays for the relevant years based on the input dates. Weekends are identified using a calculation that considers the Unix epoch (1970-01-01) as the reference starting point. After adjusting with a -4 shift and modulo 7, the weekdays are mapped as Monday (0) through Sunday (6), with Saturday (5) and Sunday (6) identified as the weekend days.

Parameters:
  • from_dates (Series) – A pandas Series containing the start dates of the periods.

  • to_dates (Series) – A pandas Series containing the end dates of the periods.

Return type:

Series

Returns:

A Pandas Series containing the number of workdays for each date pair.
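The epoch-based weekday calculation works because 1970-01-01 was a Thursday: subtracting 4 from the day count and taking it modulo 7 maps Thursday to 3 and therefore Monday to 0. The sketch below shows this for single dates using the standard library. The helper names are hypothetical, holidays are passed in rather than fetched, and counting both endpoints is an assumption.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)  # a Thursday


def epoch_weekday(day):
    """Map a date to Monday (0) through Sunday (6) from its days since the epoch."""
    return ((day - EPOCH).days - 4) % 7


def count_workdays_sketch(from_date, to_date, holidays=frozenset()):
    """Count days in [from_date, to_date] that are neither weekend nor holiday."""
    total = 0
    day = from_date
    while day <= to_date:
        # Saturday is 5 and Sunday is 6 in this mapping.
        if epoch_weekday(day) < 5 and day not in holidays:
            total += 1
        day += timedelta(days=1)
    return total


# Week of 13 to 19 May 2024, with 17 May supplied as a holiday by the caller.
workdays = count_workdays_sketch(date(2024, 5, 13), date(2024, 5, 19), {date(2024, 5, 17)})
```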

filter_holidays(calendar, holidays)

Selects the holiday dates from a given calendar.

This function identifies and returns only the dates in the calendar that are recognized as holidays, excluding holidays on weekends.

Parameters:
  • calendar (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]) – Numpy array of dates.

  • holidays (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]) – Numpy array of dates that are holidays.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A Numpy array of dates that are recognized as holidays.

filter_weekends(calendar)

Selects the weekend dates from a given calendar.

This function identifies which days in the provided calendar are weekends and returns only those dates.

Parameters:

calendar (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]) – Numpy array of dates, typically encompassing multiple weeks.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A Numpy array of dates that fall on weekends (Saturday and Sunday).

filter_workdays(calendar, holidays)

Filters out weekends and holidays from a calendar, leaving only workdays.

Parameters:
  • calendar (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]) – Numpy array of dates.

  • holidays (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]) – Numpy array of holiday dates.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A Numpy array of dates that are workdays.

get_calendar(from_date, to_date)

Generates a calendar as a range of dates from a start date to an end date.

Parameters:
  • from_date (datetime64) – The start date.

  • to_date (datetime64) – The end date.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A Numpy array representing a range of dates from start to end.

get_norwegian_holidays(years)

Fetches Norwegian holidays for a given range of years and returns them as a sorted Numpy array of dates.

Parameters:

years (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]) – Numpy array of years for which holidays are to be fetched.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A Numpy array of holiday dates.

get_years(from_dates, to_dates)

Extracts unique years from two series of dates.

Parameters:
  • from_dates (Series) – A Pandas Series of start dates.

  • to_dates (Series) – A Pandas Series of end dates.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A Numpy array of unique years derived from the date ranges.

is_weekend(calendar)

Determines which days in a given calendar are weekends.

Parameters:

calendar (ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]) – Numpy array of dates.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A Numpy boolean array where True indicates a weekend.

numpy_dates(dates)

Converts a Pandas Series of timestamps to a Numpy array of dates in ‘datetime64[D]’ format.

Parameters:

dates (Series) – A pandas Series containing timestamps.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A Numpy array containing dates.

arbmark.groups package

arbmark.groups.age module

alder_5grp(alder, display='label')

Categorize a pandas Series of person ages into predefined groups used in ARBLONN.

Parameters:
  • alder (Series) – A pandas Series containing the person ages.

  • display (str) – If ‘label’, returns group labels; if ‘number’, returns keys; for any other string, returns a combination of keys and labels.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A numpy Array where the original person ages are replaced by group labels, keys, or a combination.

alder_grp(alder, display='label')

Categorize a pandas Series of person ages into predefined groups used in SYKEFR.

Parameters:
  • alder (Series) – A pandas Series containing the person ages.

  • display (str) – If ‘label’, returns group labels; if ‘number’, returns keys; for any other string, returns a combination of keys and labels.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A numpy Array where the original person ages are replaced by group labels, keys, or a combination.

arbmark.groups.company_size module

virk_str_8grp(ansatte, display='label')

Categorize a pandas Series of employee counts into predefined groups.

Parameters:
  • ansatte (Series) – A pandas Series containing the employee counts.

  • display (str) – If ‘label’, returns group labels; if ‘number’, returns keys; for any other string, returns a combination of keys and labels.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A numpy Array where the original employee counts are replaced by group labels or keys.

arbmark.groups.country_origin module

landbakgrunn_grp(landbakgrunn, display='label')

Categorize a pandas Series of country origins from 3 generations into world regions.

Parameters:
  • landbakgrunn (Series) – A pandas Series containing the country origins.

  • display (str) – If ‘label’, returns group labels; if ‘number’, returns keys; if ‘arblonn’, returns specific labels for ARBLONN; for any other string, returns a combination of keys and labels.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A numpy Array where the original country origins are replaced by group labels or keys.

arbmark.groups.nace module

clean_nace_17_groups(val)

Cleans the NACE code value by removing redundant parts.

This function checks if the input string val contains a hyphen (‘-’) and if the parts before and after the hyphen are identical. If they are, it returns only the part before the hyphen. Otherwise, it returns the original input value.

Parameters:

val (str) – A string containing the NACE code to be cleaned.

Return type:

str

Returns:

A string with the cleaned NACE code.
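The rule described above can be sketched in a few lines. The helper name clean_code and the sample codes are hypothetical, and splitting on the first hyphen only is an assumption about inputs that contain more than one hyphen.

```python
def clean_code(val):
    """Return the part before the hyphen when both sides of it are identical."""
    if "-" in val:
        # Assumption: only the first hyphen separates the two parts.
        before, after = val.split("-", 1)
        if before == after:
            return before
    return val
```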

nace_sn07_47grp(nace_sn07, display='label')

Categorize a pandas Series of NACE-codes (SN07) into predefined groups.

Parameters:
  • nace_sn07 (Series) – A pandas Series containing the NACE-codes.

  • display (str) – If ‘label’, returns group labels; if ‘number’, returns keys; for any other string, returns a combination of keys and labels.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A numpy Array where the original NACE-codes are replaced by group labels or keys.

nace_to_17_groups(nace, label=False)

Converts NACE codes in a Pandas Series to their corresponding group codes or labels.

NACE (Nomenclature of Economic Activities) is the European industry standard classification system. This function maps NACE codes to a higher-level group (level 2) and optionally returns the group’s name instead of its code.

Parameters:
  • nace (Series) – A Pandas Series containing NACE codes.

  • label (bool) – If True, returns the names of the groups instead of their codes. Defaults to False.

Return type:

Series

Returns:

A Pandas Series with the mapped group codes or names, depending on the ‘label’ argument.

Note

The function relies on a predefined mapping (‘KlassVariant(1616).data’) to perform the conversion. It assumes that this mapping has a specific structure, with ‘level’, ‘code’, and ‘parentCode’ (or ‘name’ if labels are requested) columns.

arbmark.groups.sector module

sektor2_grp(sektor, display='label')

Categorize a pandas Series of sectors into predefined groups.

Parameters:
  • sektor (Series) – A pandas Series containing the sector codes.

  • display (str) – If ‘label’, returns group labels; if ‘number’, returns keys; for any other string, returns a combination of keys and labels.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

A numpy Array where the original sector is replaced by group labels or keys.

arbmark.groups.shift_work module

turnuskoder(arb_tid_ordning)

Assigns codes based on work schedule categories.

This function takes a pandas Series containing work schedule categories and assigns corresponding codes based on specific conditions. The conditions are as follows:
  • ‘20’ is assigned to categories [‘dogn355’, ‘helkont336’, ‘offshore336’, ‘skift365’, ‘andre_skift’].

  • ‘25’ is assigned to the ‘ikke_skift’ category.

  • ‘99’ is assigned to values [‘-2’, ‘’, ‘-1’] or NaN values in the series.

Any value that doesn’t match these conditions will be assigned an empty string.

Parameters:

arb_tid_ordning (Series) – A pandas Series object containing strings that represent different work schedule categories.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Returns:

An array of strings, where each string is a code corresponding to the work schedule category in arb_tid_ordning.

Example

>>> arb_tid_ordning = pd.Series(['dogn355', 'helkont336', 'ikke_skift', '-2', 'offshore336', ''])
>>> turnuskoder(arb_tid_ordning)
array(['20', '20', '25', '99', '20', '99'], dtype='<U2')
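The mapping rules can also be written out for a single value, which makes the fallback cases explicit. This is a plain-Python sketch rather than the vectorised implementation; the helper name shift_code is hypothetical, and treating None like NaN is an assumption.

```python
import math

SHIFT_CATEGORIES = {"dogn355", "helkont336", "offshore336", "skift365", "andre_skift"}
MISSING_VALUES = {"-2", "", "-1"}


def shift_code(category):
    """Return the code for one work schedule category."""
    # Assumption: None is treated the same way as NaN.
    if category is None or (isinstance(category, float) and math.isnan(category)):
        return "99"
    if category in SHIFT_CATEGORIES:
        return "20"
    if category == "ikke_skift":
        return "25"
    if category in MISSING_VALUES:
        return "99"
    # Anything unrecognised gets an empty string.
    return ""


codes = [shift_code(c) for c in ["dogn355", "helkont336", "ikke_skift", "-2", "offshore336", ""]]
```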