Reference¶
fagfunksjoner package¶
Subpackages¶
- fagfunksjoner.api package
- Submodules
- fagfunksjoner.api.statistikkregisteret module
Contact
FuturePublishingError
LangText
MultiplePublishings
Name
Owningsection
PublishingSpecifics
PublishingSpecifics.desk_flow
PublishingSpecifics.has_changed
PublishingSpecifics.is_cancelled
PublishingSpecifics.is_period
PublishingSpecifics.name
PublishingSpecifics.period_from
PublishingSpecifics.period_until
PublishingSpecifics.precision
PublishingSpecifics.publish_id
PublishingSpecifics.revision
PublishingSpecifics.statistic
PublishingSpecifics.status
PublishingSpecifics.time
PublishingSpecifics.time_changed
PublishingSpecifics.title
PublishingSpecifics.variant
SinglePublishing
SinglePublishing.annual_reporting
SinglePublishing.approved
SinglePublishing.changed
SinglePublishing.changes
SinglePublishing.contacts
SinglePublishing.continuation
SinglePublishing.created_date
SinglePublishing.default_lang
SinglePublishing.desk_flow
SinglePublishing.dir_flow
SinglePublishing.firstpublishing
SinglePublishing.name
SinglePublishing.old_subjectcodes
SinglePublishing.owner_code
SinglePublishing.owner_name
SinglePublishing.owningsection
SinglePublishing.publish_id
SinglePublishing.publishings
SinglePublishing.regional_levels
SinglePublishing.short_name
SinglePublishing.start_year
SinglePublishing.status
SinglePublishing.triggerwords
SinglePublishing.variants
StatisticPublishingShort
Variant
etree_to_dict()
find_latest_publishing()
find_publishings()
find_stat_shortcode()
get_contacts()
get_singles_publishings()
get_statistics_register()
handle_children()
kwargs_specifics()
parse_contact_single()
parse_contacts()
parse_data_single()
parse_eierseksjon_single()
parse_lang_text_single()
parse_name_single()
parse_single_stat_from_englishjson()
parse_triggerord_single()
parse_variant_single()
raise_on_missing_future_publish()
sections_publishings()
single_stat()
specific_publishing()
time_until_publishing()
- fagfunksjoner.api.valuta module
- Module contents
- fagfunksjoner.dapla package
- fagfunksjoner.data package
- Submodules
- fagfunksjoner.data.datadok_extract module
ArchiveData
CodeList
ContextVariable
Metadata
add_dollar_or_nondollar_path()
add_pii_paths()
bumpcheck_file_years_back()
codelist_to_df()
codelist_to_dict()
convert_dates()
convert_to_pathlib()
date_formats()
date_parser()
downcast_ints()
extract_codelist()
extract_context_variables()
extract_parameters()
get_path_combinations()
get_yr_char_ranges()
go_back_in_time()
handle_decimals()
import_archive_data()
look_for_filepath()
metadata_to_df()
open_path_datadok()
open_path_metapath_datadok()
replace_dollar_stamme()
test_url()
test_url_combos()
url_from_path()
- fagfunksjoner.data.dicts module
- fagfunksjoner.data.pandas_combinations module
- fagfunksjoner.data.pandas_dtypes module
- fagfunksjoner.data.pyarrow module
- fagfunksjoner.data.view_dataframe module
- Module contents
- fagfunksjoner.log package
- fagfunksjoner.paths package
- fagfunksjoner.prodsone package
Submodules¶
fagfunksjoner.fagfunksjoner_logger module¶
- class ColoredFormatter(*args, colors=None, **kwargs)¶
Bases:
Formatter
Colored log formatter.
Initialize the formatter with specified format strings.
- Parameters:
args (Any)
colors (dict[str, str] | None)
kwargs (Any)
- format(record)¶
Format the specified record as text.
- Return type:
str
- Parameters:
record (LogRecord)
- silence_logger(func, *args, **kwargs)¶
Silences INFO and WARNING logs for the duration of the function call.
- Return type:
Any
- Parameters:
func (Callable[[...], Any])
args (Any)
kwargs (Any)
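A minimal sketch of how these two helpers might be combined with the standard logging module. The format string, the messages and the noisy() function are hypothetical, and it assumes ColoredFormatter passes its positional arguments straight on to logging.Formatter, as the signature suggests.
import logging
from fagfunksjoner.fagfunksjoner_logger import ColoredFormatter, silence_logger

handler = logging.StreamHandler()
# Assumed: the format string is forwarded to logging.Formatter.
handler.setFormatter(ColoredFormatter("%(levelname)s: %(message)s"))
logging.getLogger().addHandler(handler)

def noisy() -> int:
    logging.warning("This WARNING is suppressed while silence_logger wraps the call.")
    return 42

result = silence_logger(noisy)  # INFO/WARNING emitted inside noisy() are silenced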
Module contents¶
Fagfunksjoner is a collection of “loose, small functionality” produced at Statistics Norway in Python.
Often created by subject-matter staff (“fag”) rather than IT, these are typically small helper functions that many others might find useful.
- class ProjectRoot¶
Bases:
object
Context manager for importing local modules from the project root using a “with” statement.
As in:
with ProjectRoot(): from src.functions.local_functions import local_function
The class changes the working directory to the project root on entry and back again on exit, so the import can be done in a single line/“instruction”.
Initializing the ProjectRoot finds the correct root folder, and stores the starting folder so it can navigate back to it.
- static load_toml(config_file)¶
Looks for a .toml file to load the contents from.
Looks in the current folder, the specified path, the project root.
- Parameters:
config_file (str) – The path or filename of the config-file to load.
- Returns:
The contents of the toml-file.
- Return type:
dict[Any]
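A short sketch of the documented usage; the src module comes from the docstring above, the config file name config/settings.toml is hypothetical, and it assumes ProjectRoot is importable from the fagfunksjoner package as listed in this section.
from fagfunksjoner import ProjectRoot

with ProjectRoot():
    # Resolves because the working directory is temporarily the project root.
    from src.functions.local_functions import local_function

settings = ProjectRoot.load_toml("config/settings.toml")  # hypothetical file name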
- class SsbFormat(start_dict=None)¶
Bases:
dict[Any, Any]
Custom dictionary class designed to handle specific formatting conventions, including mapping intervals (defined as range strings) even when they map to the same value.
Initializes the SsbFormat instance.
- Parameters:
start_dict (dict, optional) – Initial dictionary to populate SsbFormat.
- static check_if_na(key)¶
Checks if the specified key represents a NA (Not Available) value.
- Parameters:
key (Any) – Key to be checked for NA value.
- Returns:
True if the key represents NA, False otherwise.
- Return type:
bool
- int_str_confuse(key)¶
Handles conversion between integer and string keys.
- Parameters:
key (str | int | float | NAType | None) – Key to be converted or checked for existence in the dictionary.
- Return type:
None | Any
- Returns:
The value associated with the key (if found) or None.
- look_in_ranges(key)¶
Returns the mapping value for the key if it falls within any defined range.
The method attempts to convert the key to a float and then checks if it lies within any of the stored range intervals. If the key is None, NA, or not of a convertible type, the method returns None.
- Return type:
None | Any
- Parameters:
key (str | int | float | NAType | None)
- set_na_value()¶
Sets the value for NA (Not Available) keys in the SsbFormat.
- Returns:
True if NA value is successfully set, False otherwise.
- Return type:
bool
- set_other_as_lowercase()¶
Ensures that the ‘other’ key is stored in lowercase.
If a key matching ‘other’ in any other case is found, its value is reassigned to ‘other’.
- Return type:
None
- store(output_path, force=False)¶
Stores the SsbFormat instance in a specified output path.
- Parameters:
output_path (str) – Path where the format will be stored.
force (bool) – Flag to force storing even for cached instances.
- Raises:
ValueError – If storing a cached SsbFormat might lead to an unexpectedly large number of keys.
- Return type:
None
- store_ranges()¶
Stores ranges by converting range-string keys into tuple keys.
For example, a key “0-18” with value “A” will be stored as {(0.0, 18.0): “A”}.
- Return type:
None
- update_format()¶
Update method to set special instance attributes.
- Return type:
None
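A minimal sketch of the range-mapping behaviour described above. The age groups and labels are illustrative, and it assumes SsbFormat is importable from the fagfunksjoner package as listed in this section.
from fagfunksjoner import SsbFormat

age_format = SsbFormat({"0-17": "child", "18-66": "adult", "67-120": "senior", "other": "unknown"})
age_format.store_ranges()               # "0-17" is also stored under the tuple key (0.0, 17.0)
print(age_format.look_in_ranges(42))    # expected to return "adult"
print(SsbFormat.check_if_na(None))      # expected to return True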
- class StatLogger(log_level=10, log_file='app.log', loggers=(LoggerType.CONSOLE, LoggerType.FILE))¶
Bases:
object
A root logger class that facilitates logging to console and files.
This class is meant to be the root-level logger in an application; it receives log messages from all other modules, formats them in a uniform way, and directs them to the specified outputs (console, file, etc.).
There is only one instance of this class, ensured by a singleton pattern implementation.
Initialize the StatLogger class.
- Parameters:
log_level (int) – The logging level. Defaults to logging.DEBUG.
log_file (str | Path) – The file where logs will be written. Defaults to ‘app.log’.
loggers (Iterable[LoggerType]) – Optional list of LoggerTypes that should be added. Defaults to LoggerType.CONSOLE and LoggerType.FILE.
args (Any)
kwargs (Any)
- Raises:
TypeError – If not all loggers have type LoggerType.
- Return type:
Any
- getLogger()¶
Returns the configured logger instance.
- Return type:
Logger
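A sketch of the documented pattern: create the singleton root logger once at application start-up, then log as usual. The log file name and the message are examples.
import logging
from fagfunksjoner import StatLogger

stat_logger = StatLogger(log_level=logging.INFO, log_file="app.log")
logger = stat_logger.getLogger()
logger.info("Pipeline started")  # goes to both console and app.log with the defaults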
- all_combos_agg(df, groupcols, valuecols=None, aggargs=None, fillna_dict=None, keep_empty=False, grand_total='')¶
Generate all aggregation levels for a set of columns in a dataframe.
Creates aggregations over all combinations of categorical variables specified in groupcols and applies aggregation functions on valuecols. Allows for inclusion of grand totals and customized fill values for missing groups, similar to “proc means” in SAS.
- Parameters:
df (DataFrame) – DataFrame to aggregate.
groupcols (list[str]) – List of columns to group by.
valuecols (list[str] | None) – List of columns to apply aggregation functions on. Defaults to None, in which case all numeric columns are used.
aggargs (Callable[[Any], Any] | str | ufunc | Mapping[str, Callable[[Any], Any] | str | ufunc] | dict[str, list[str]] | None) – Dictionary or function specifying aggregation for each column in valuecols. If None, defaults to ‘sum’ for each column in valuecols.
fillna_dict (dict[str, Any] | None) – Dictionary specifying values to fill NA in each column of groupcols. Useful for indicating totals in the final table.
keep_empty (bool) – If True, preserves empty groups in the output.
grand_total (dict[str, str] | str) – Dictionary or string to indicate a grand total row. If a dictionary, the values are applied in each corresponding groupcols.
- Returns:
DataFrame with all aggregation levels, including the columns:
groupcols: group-by columns with filled total values as needed.
level: indicates aggregation level.
ways: counts the number of grouping columns used for each aggregation.
- Return type:
pd.DataFrame
Examples
>>> data = pd.DataFrame({
...     'age': [20, 60, 33, 33, 20],
...     'region': ['0301', '3001', '0301', '5401', '0301'],
...     'gender': ['1', '2', '1', '2', '2'],
...     'income': [1000000, 120000, 220000, 550000, 50000],
...     'wealth': [25000, 50000, 33000, 44000, 90000]
... })
>>> all_combos_agg(data, groupcols=['gender', 'age'], aggargs={'income': ['mean', 'sum']})
- all_combos_agg_inclusive(df, groupcols=None, category_mappings=None, valuecols=None, aggargs=None, totalcodes=None, keep_empty=False, grand_total=True)¶
Generate all aggregation levels for a set of columns in a dataframe, for non-exclusive categories.
Creates aggregations over all combinations of categorical variables specified in groupcols and applies aggregation functions on valuecols. Allows for inclusion of grand totals and customized fill values for missing groups. It is basically a more general version of the all_combos_agg function, allowing for inclusive (non-exclusive) categories. Inclusive categories are defined by a dictionary of mappings in category_mappings. Variables in groupcols are assumed to be categorical, and their categories are treated as mutually exclusive.
- Parameters:
df (DataFrame) – DataFrame to aggregate.
groupcols (None | list[str]) – List of columns to group by.
category_mappings (None | dict[str, dict[str, list[Any] | str] | Any]) – Dictionary of dictionaries, where each key is a column name and each value is a dictionary of mappings. ‘__ALL__’ can be used to indicate ‘all values’ in a column, and is used for totals.
valuecols (None | list[str]) – List of columns to apply aggregation functions on. Defaults to None, in which case all numeric columns are used.
aggargs (None | dict[str, Any] | Callable[..., Any] | str | list[Any]) – Dictionary or function specifying aggregation for each column in valuecols. If None, defaults to ‘sum’ for each column in valuecols.
totalcodes (None | dict[str, str]) – Dictionary specifying values to use as labels representing totals in each column.
keep_empty (bool) – If True, preserves empty groups in the output.
grand_total (bool) – If True, a grand total row is included.
- Raises:
ValueError – If a column in groupcols is not found in the DataFrame.
- Returns:
DataFrame with all aggregation levels for the specified columns.
- Return type:
pd.DataFrame
Examples
>>> # Define the categorical bins based on the metadata
>>> gender_bins = {"1": "Menn", "2": "Kvinner"}
>>> # Generate synthetic data
>>> np.random.seed(42)
>>> num_samples = 100
>>> synthetic_data = pd.DataFrame({
...     "Tid": np.random.choice(["2021", "2022", "2023"], num_samples),
...     "UtdanningOppl": np.random.choice(list(range(1, 19)), num_samples),
...     "Kjonn": np.random.choice(list(gender_bins.keys()), num_samples),
...     "Alder": np.random.randint(15, 67, num_samples),  # Ages between 15 and 66
...     "syss_student": np.random.choice(["01", "02", "03", "04"], num_samples),
...     "n": 1
... })
>>> category_mappings = {
...     "Alder": {
...         "15-24": range(15, 25), "25-34": range(25, 35), "35-44": range(35, 45),
...         "45-54": range(45, 55), "55-66": range(55, 67),
...         "15-21": range(15, 22), "22-30": range(22, 31), "31-40": range(31, 41),
...         "41-50": range(41, 51), "51-66": range(51, 67),
...         "15-30": range(15, 31), "31-45": range(31, 46), "46-66": range(46, 67),
...     },
...     "syss_student": {
...         "01": ["01", "02"], "02": ["03", "04"], "03": ["02"], "04": ["04"],
...     },
...     "Kjonn": {
...         "Menn": ["1"], "Kvinner": ["2"],
...     }
... }
>>> totalcodes = {"Alder": "Total", "syss_student": "Total", "Kjonn": "Begge"}
>>> all_combos_agg_inclusive(
...     synthetic_data,
...     groupcols=[],
...     category_mappings=category_mappings,
...     totalcodes=totalcodes,
...     valuecols=["n"],
...     aggargs={"n": "sum"},
...     grand_total=True,
... )
- auto_dtype(df, cardinality_threshold=0, copy_df=True, show_memory=True)¶
Clean up a dataframe’s dtypes.
First lowers all column names. Tries to decode byte strings to utf8. Runs pandas’ convert_dtypes(). Tries to convert object columns to string and strips empty spaces. Downcasts ints to smaller int types. If cardinality_threshold is set above 0, converts object and string columns to categoricals when the number of unique values in the column is below the threshold.
- Parameters:
df (DataFrame) – The dataframe to manipulate.
cardinality_threshold (int) – Columns with fewer unique values than this threshold are converted to categoricals. Defaults to 0, meaning no conversion to categoricals.
copy_df (bool) – The reverse of inplace; make a copy in memory. This may have a memory impact, but is safer. Defaults to True.
show_memory (bool) – Show the user how much memory was saved by the conversion; requires some processing. Defaults to True.
- Returns:
The dataframe with cleaned-up dtypes.
- Return type:
pd.DataFrame
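A small sketch of the clean-up described above; the dataframe is made up, and the threshold of 10 is just an example.
import pandas as pd
from fagfunksjoner import auto_dtype

df = pd.DataFrame({
    "Region": [b"0301", b"3001", b"0301"],   # byte strings get decoded, column names lowered
    "Count": [1, 2, 3],
})
clean = auto_dtype(df, cardinality_threshold=10)  # low-cardinality text columns become categoricals
print(clean.dtypes)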
- check_env(raise_err=True)¶
Check if you are on Dapla or in prodsone.
- Parameters:
raise_err (bool) – Set to False if you don’t want the code to raise an error on an unrecognized environment.
- Returns:
“DAPLA” if on Dapla, “PROD” if in prodsone, otherwise “UNKNOWN”.
- Return type:
str
- Raises:
OSError – If no environment indications match (Dapla or Prod), and raise_err is set to True.
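A sketch of branching on the returned environment string; the data roots are hypothetical.
from fagfunksjoner import check_env

env = check_env(raise_err=False)
if env == "DAPLA":
    data_root = "gs://example-bucket/data/"     # hypothetical bucket
elif env == "PROD":
    data_root = "/ssb/stammeXX/kortkode/data/"  # hypothetical prodsone path
else:
    data_root = "./data/"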
- get_latest_fileversions(glob_list_path)¶
Receives a list of filenames with multiple versions and returns the latest versions of the files.
Recommend using a glob operation to create the input list. See the docs for glob operations:
- GCS: https://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFileSystem.glob
- Locally: https://docs.python.org/3/library/glob.html
- Parameters:
glob_list_path (list[str] | list[Path] | str | Path) – List of strings/Paths or single string/Path that represents a filepath. Recommend that the list is created with a glob operation.
- Returns:
List of strings, or Paths (if path was submitted) with unique filepaths and their latest versions.
- Return type:
list[str | Path]
- Raises:
TypeError – If parameter does not fit with type-narrowing to list of strings.
Example:
import dapla as dp
fs = dp.FileClient.get_gcs_file_system()
all_files = fs.glob("gs://dir/statdata_v*.parquet")
latest_files = get_latest_fileversions(all_files)
- latest_version_path(filepath)¶
Finds the path to the latest version of a specified file.
This function retrieves all versioned files matching the provided file path pattern and identifies the latest version. It supports both Google Cloud Storage (GCS) paths and local file paths, provided they follow the required naming convention with version numbers (e.g., ‘_v1’). If no versions are found, it defaults to returning a pattern representing version 1.
- Parameters:
filepath (str | Path) – The full path of the file, either a GCS path or a local path. It should follow the naming standard, including the version indicator.
- Returns:
The path to the latest version of the file. If no versions are found, returns a pattern for version 1 of the file.
- Return type:
str | Path
- Raises:
ValueError – If get_latest_fileversions returns a list of more than one file.
ValueError – If the filepath does not follow the naming convention with ‘_v’ followed by digits to denote version, when a versioned file is required.
Examples
‘ssb-prod-ofi-skatteregn-data-produkt/skatteregn/inndata/skd_data/2023/skd_p2023-01_v1.parquet’
‘/ssb/stammeXX/kortkode/inndata/skd_data/2023/skd_p2023-01_v1.parquet’
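A usage sketch built on the example path above; it assumes one or more versioned files matching that pattern exist.
from fagfunksjoner import latest_version_path

newest = latest_version_path(
    "ssb-prod-ofi-skatteregn-data-produkt/skatteregn/inndata/skd_data/2023/skd_p2023-01_v1.parquet"
)
# e.g. ".../skd_p2023-01_v3.parquet" if _v3 is the highest existing version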
- linux_shortcuts(insert_environ=False)¶
Manually load the Linux shortcuts (“linux-forkortelser”) as a dict, if the function can find the file they are shared in.
- Parameters:
insert_environ (bool) – Set to True if you want the dict to be inserted into the environment variables (os.environ).
- Returns:
The “linux-forkortelser” as a dict
- Return type:
dict[str, str]
- Raises:
ValueError – If the stamme_variabel file is wrongly formatted.
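A short sketch; the printed mapping is illustrative only, since the actual shortcuts come from the shared file.
from fagfunksjoner import linux_shortcuts

shortcuts = linux_shortcuts(insert_environ=True)  # also writes the pairs into os.environ
print(shortcuts)  # e.g. {"STAMME01": "/ssb/stamme01", ...} (illustrative values)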
- make_klass_xml_codelist(path, codes, names_bokmaal=None, names_nynorsk=None, names_engelsk=None)¶
Make a KLASS XML file and pandas DataFrame from a list of codes and names.
This XML can be loaded into the old KLASS UI under version -> import to the top right.
- Parameters:
path (str) – Path to save the xml file.
codes (list[str|int]) – List of codes.
names_bokmaal (list[str] | None) – List of names in Bokmål.
names_nynorsk (list[str] | None) – List of names in Nynorsk.
names_engelsk (list[str] | None) – List of names in English.
- Returns:
Dataframe with columns for codes and names.
- Return type:
pd.DataFrame
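A hypothetical two-code list; the codes, names and output path are examples (names_nynorsk is simply omitted here).
from fagfunksjoner import make_klass_xml_codelist

df = make_klass_xml_codelist(
    path="sivilstand_codelist.xml",          # hypothetical output file
    codes=["1", "2"],
    names_bokmaal=["Ugift", "Gift"],
    names_engelsk=["Unmarried", "Married"],
)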
- next_version_path(filepath)¶
Generates a new file path with an incremented version number.
Constructs a filepath for a new version of a file, based on the latest existing version found in a specified folder. Meaning it skips to “one after the highest version it finds”. It increments the version number by one, to ensure the new file path is unique.
- Parameters:
filepath (str | Path) – The path for the file.
- Returns:
The new file path with an incremented version number and specified suffix.
- Return type:
str | Path
Example:
next_version_path('gs://my-bucket/datasets/data_v1.parquet')
'gs://my-bucket/datasets/data_v2.parquet'
- open_path_datadok(path, **read_fwf_params)¶
Get archive data only based on the path of the .dat or .txt file.
This function attempts to correct and test path options to track down the file and the metadata it mentions.
- Parameters:
path (str | Path) – The path to the archive file in prodsone to attempt to get metadata for and open.
read_fwf_params (Any) – Remaining parameters to pass to pd.read_fwf; dtype, widths, names and na_values are overwritten, so don’t pass those.
- Returns:
An ArchiveData object containing the imported data, metadata, and code lists.
- Return type:
ArchiveData
- Raises:
ValueError – If no datadok-api endpoint is found for the path given.
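A sketch under stated assumptions: the dollar path is hypothetical, and the extra keyword argument is forwarded to pd.read_fwf.
from fagfunksjoner import open_path_datadok

# Hypothetical flat-file path; encoding is forwarded to pd.read_fwf.
archive = open_path_datadok("$UTD/eksempel/arkiv/g2023.dat", encoding="latin1")
# archive is an ArchiveData object holding the imported data, metadata and code lists.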
- open_path_metapath_datadok(path, metapath, **read_fwf_params)¶
If open_path_datadok doesn’t work, specify both the path on Linux AND the path in Datadok.
- Parameters:
path (str) – Path to the archive file on Linux.
metapath (str) – Path as described in Datadok.
read_fwf_params (Any) – Remaining parameters to pass to pd.read_fwf; dtype, widths, names and na_values are overwritten, so don’t pass those.
- Returns:
An ArchiveData object containing the imported data, metadata, and code lists.
- Return type:
ArchiveData
- round_up(data, decimal_places=0, col_names='')¶
Round up a number to a given number of decimal places. Avoids Python’s default of rounding to even.
- Parameters:
data (DataFrame | object | float | NAType) – The data to round up; can be a float, Series, or DataFrame.
decimal_places (int) – The number of decimal places to round up to. Ignored if you pass a dictionary of column names and decimal places into col_names.
col_names (str | list[str] | dict[str, int]) – The column names to round up. If a dictionary is provided, it should map column names to the number of decimal places for each column. If a list is provided, it should contain the names of the columns to round up. If a string is provided, it should be the name of a single column to round up.
- Returns:
The rounded up number as an int, float, Series, or DataFrame.
- Return type:
pd.DataFrame | pd.Series | int | float
- Raises:
TypeError – If data is not a DataFrame, Series, int, float, or NAType.
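A small sketch of the per-column behaviour; the dataframe and column names are made up.
import pandas as pd
from fagfunksjoner import round_up

df = pd.DataFrame({"income": [1000.25, 120000.5], "rate": [0.12345, 0.5551]})
rounded = round_up(df, col_names={"income": 0, "rate": 2})  # decimals per column
print(round_up(2.5))  # rounds up to 3 instead of Python's round-half-to-even (2)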
- saspy_df_from_path(path)¶
Use df_from_sasfile instead; this is the old (bad) name for the function.
- Parameters:
path (str) – The full path to the SAS file you want to open with SAS.
- Returns:
The raw content of the SAS file straight from saspy.
- Return type:
pandas.DataFrame
- saspy_session()¶
Get an initialized saspy.SASsession object.
Use the default config, getting your password if you’ve set one.
- Returns:
An initialized saspy-session
- Return type:
saspy.SASsession
- view_dataframe(dataframe, column, operator='==', unique_limit=100)¶
Display an interactive widget for filtering and viewing data in a DataFrame based on selection of values in one column.
- Parameters:
dataframe (DataFrame) – The DataFrame containing the data to be filtered.
column (str) – The column in the DataFrame to be filtered.
operator (str) – The comparison operator for filtering (may be altered during the display). Options: ‘==’, ‘!=’, ‘>=’, ‘>’, ‘<’, ‘<=’. Default: ‘==’.
unique_limit (int) – The maximum number of unique values in the column for using the ‘==’ or ‘!=’ operators. Default: 100.
- Returns:
An interactive widget for filtering and viewing data based on the specified criteria.
The ‘==’ and ‘!=’ operators use a dropdown list for multiple selection; the other (interval) operators use a slider.
- Return type:
widgets.interactive
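A notebook-oriented sketch; the dataframe is made up, and the call returns an ipywidgets object that renders interactively.
import pandas as pd
from fagfunksjoner import view_dataframe

df = pd.DataFrame({"region": ["0301", "3001", "5401"], "income": [550000, 320000, 410000]})
view_dataframe(df, column="region", operator="==", unique_limit=50)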