Reference

fagfunksjoner package

Subpackages

Submodules

fagfunksjoner.fagfunksjoner_logger module

class ColoredFormatter(*args, colors=None, **kwargs)

Bases: Formatter

Colored log formatter.

Parameters:
  • args (Any)

  • colors (dict[str, str] | None)

  • kwargs (Any)

format(record)

Format the specified record as text.

Return type:

str

Parameters:

record (LogRecord)
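A minimal sketch of attaching the formatter to a logging handler. The format string, logger name and ANSI escape codes are illustrative, and the assumption that the colors dict maps level names to ANSI codes is not confirmed by the signature alone:

import logging

from fagfunksjoner.fagfunksjoner_logger import ColoredFormatter

handler = logging.StreamHandler()
# Assumed shape: colors maps level names to ANSI escape codes.
handler.setFormatter(ColoredFormatter(
    "%(levelname)s | %(message)s",
    colors={"WARNING": "\033[33m", "ERROR": "\033[31m"},
))
logger = logging.getLogger("example")
logger.addHandler(handler)
logger.warning("Rendered with the WARNING color, if the terminal supports ANSI.")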

silence_logger(func, *args, **kwargs)

Silences INFO and WARNING logs for the duration of the function call.

Return type:

Any

Parameters:
  • func (Callable[[...], Any])

  • args (Any)

  • kwargs (Any)
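For example, to call a chatty function without its INFO and WARNING output (noisy_function is a hypothetical stand-in):

import logging

def noisy_function(x):
    # Hypothetical function that logs at INFO level.
    logging.getLogger(__name__).info("Chatty progress message.")
    return x * 2

result = silence_logger(noisy_function, 21)  # returns 42; INFO/WARNING suppressed during the call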

Module contents

Fagfunksjoner is a place for “loose, small functionality” produced at Statistics Norway in Python.

Often created by “fag” (subject-matter staff) rather than IT, these are small “helper functions” that many might find useful.

class ProjectRoot

Bases: object

Context manager for importing local modules inside a “with” block.

As in:

with ProjectRoot():
    from src.functions.local_functions import local_function

The class navigates to the project root and back again using a single line/“instruction”.

static load_toml(config_file)

Looks for a .toml file and loads its contents.

Looks in the current folder, at the specified path, and in the project root.

Parameters:

config_file (str) – The path or filename of the config-file to load.

Returns:

The contents of the toml-file.

Return type:

dict[str, Any]
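A usage sketch; “config.toml” and the key are example names:

config = ProjectRoot.load_toml("config.toml")
print(config["some_section"])  # hypothetical key in the toml file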

all_combos_agg(df, groupcols, valuecols=None, aggargs=None, fillna_dict=None, keep_empty=False, grand_total='')

Generate all aggregation levels for a set of columns in a dataframe.

Creates aggregations over all combinations of categorical variables specified in groupcols and applies aggregation functions on valuecols. Allows for inclusion of grand totals and customized fill values for missing groups, similar to “proc means” in SAS.

Parameters:
  • df (DataFrame) – DataFrame to aggregate.

  • groupcols (list[str]) – List of columns to group by.

  • valuecols (list[str] | None) – List of columns to apply aggregation functions on. Defaults to None, in which case all numeric columns are used.

  • aggargs (Callable[[Any], Any] | str | ufunc | Mapping[str, Callable[[Any], Any] | str | ufunc] | dict[str, list[str]] | None) – Dictionary or function specifying aggregation for each column in valuecols. If None, defaults to ‘sum’ for each column in valuecols.

  • fillna_dict (dict[str, Any] | None) – Dictionary specifying values to fill NA in each column of groupcols. Useful for indicating totals in the final table.

  • keep_empty (bool) – If True, preserves empty groups in the output.

  • grand_total (dict[str, str] | str) – Dictionary or string indicating a grand total row. If a dictionary, the values are applied in the corresponding groupcols columns.

Returns:

DataFrame with all aggregation levels, including:

  • groupcols: group-by columns with filled total values as needed.

  • level: indicates the aggregation level.

  • ways: counts the number of grouping columns used for each aggregation.

Return type:

DataFrame

Examples

>>> import pandas as pd
>>> data = pd.DataFrame({
...     'age': [20, 60, 33, 33, 20],
...     'region': ['0301', '3001', '0301', '5401', '0301'],
...     'gender': ['1', '2', '1', '2', '2'],
...     'income': [1000000, 120000, 220000, 550000, 50000],
...     'wealth': [25000, 50000, 33000, 44000, 90000]
... })
>>> all_combos_agg(data, groupcols=['gender', 'age'], aggargs={'income': ['mean', 'sum']})
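A further sketch combining fillna_dict and grand_total to label the total rows; the label strings are arbitrary:

>>> all_combos_agg(
...     data,
...     groupcols=['gender', 'age'],
...     aggargs={'income': ['mean', 'sum']},
...     fillna_dict={'gender': 'Total gender', 'age': 'Total age'},
...     grand_total='Grand total',
... )
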
auto_dtype(df, cardinality_threshold=0, copy_df=True, show_memory=True)

Clean up a dataframe’s dtypes.

First lowercases all column names. Tries to decode byte strings to UTF-8. Runs pandas’ convert_dtypes(). Tries to convert object columns to string, and strips surrounding spaces. Downcasts ints to smaller int types. If cardinality_threshold is set above 0, converts object and string columns to categoricals when the number of unique values in the column is below the threshold.

Parameters:
  • df (DataFrame) – The dataframe to manipulate

  • cardinality_threshold (int) – Columns with fewer unique values than this threshold are converted to categoricals. Defaults to 0, meaning no conversion to categoricals.

  • copy_df (bool) – The opposite of inplace: work on a copy in memory. This costs some memory but is safer. Defaults to True.

  • show_memory (bool) – Show the user how much memory was saved by the conversion; this requires some extra processing. Defaults to True.

Returns:

The dataframe with cleaned-up dtypes.

Return type:

pd.DataFrame
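A usage sketch; the input frame and threshold value are constructed examples, and the comments describe the documented behavior:

import pandas as pd

raw = pd.DataFrame({"REGION": [b"0301", b"3001", b"0301"], "COUNT": [1, 2, 3]})
cleaned = auto_dtype(raw, cardinality_threshold=10)
# Column names are lowercased, byte strings decoded, ints downcast,
# and "region" becomes a categorical since it has fewer than 10 unique values.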

check_env(raise_err=True)

Check if you are on Dapla or in prodsone.

Parameters:

raise_err (bool) – Set to False if you don’t want the code to raise an error on an unrecognized environment.

Returns:

“DAPLA” if on Dapla, “PROD” if in prodsone, otherwise “UNKNOWN”.

Return type:

str

Raises:

OSError – If no environment indications match (Dapla or Prod), and raise_err is set to True.
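A typical branching pattern on the returned value:

env = check_env(raise_err=False)  # "DAPLA", "PROD" or "UNKNOWN"
if env == "DAPLA":
    pass  # Dapla-specific setup goes here
elif env == "PROD":
    pass  # prodsone-specific setup goes here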

get_latest_fileversions(glob_list_path)

Receives a list of filenames with multiple versions and returns the latest versions of the files.

Recommend using a glob operation to create the input list. See the docs for glob operations:

  • GCS: https://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFileSystem.glob

  • Locally: https://docs.python.org/3/library/glob.html

Parameters:

glob_list_path (list[str] | str) – List of strings, or a single string, representing filepaths. Recommended to create the list with a glob operation.

Returns:

List of strings with unique filepaths and their latest versions.

Return type:

list[str]

Raises:

TypeError – If parameter does not fit with type-narrowing to list of strings.

Example:

import dapla as dp
fs = dp.FileClient.get_gcs_file_system()
all_files = fs.glob("gs://dir/statdata_v*.parquet")
latest_files = get_latest_fileversions(all_files)

linux_shortcuts(insert_environ=False)

Manually load the “linux-forkortelser” (Linux path abbreviations) as a dict.

Works only if the function can find the file in which they are shared.

Parameters:

insert_environ (bool) – Set to True if you want the dict to be inserted into the environment variables (os.environ).

Returns:

The “linux-forkortelser” as a dict

Return type:

dict[str, str]

Raises:

ValueError – If the stamme_variabel file is wrongly formatted.
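A usage sketch:

shortcuts = linux_shortcuts(insert_environ=True)
# The returned dict maps shortcut names to paths; with insert_environ=True
# the pairs are also placed into os.environ.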

next_version_path(filepath)

Generates a new file path with an incremented version number.

Constructs a filepath for a new version of a file, based on the latest existing version found in the file’s folder: the highest version number found is incremented by one, ensuring the new file path is unique.

Parameters:

filepath (str) – The address for the file.

Returns:

The new file path with an incremented version number and specified suffix.

Return type:

str

Example:

next_version_path('gs://my-bucket/datasets/data_v1.parquet')
'gs://my-bucket/datasets/data_v2.parquet'

open_path_datadok(path, **read_fwf_params)

Get archive data only based on the path of the .dat or .txt file.

This function attempts to correct and test options to track down the file and the metadata mentioned.

Parameters:
  • path (str | Path) – The path to the archive file in prodsonen to attempt to get metadata for and open.

  • read_fwf_params (Any) – Remaining parameters to pass to pd.read_fwf; dtype, widths, names and na_values are overwritten, so don’t pass those.

Returns:

An ArchiveData object containing the imported data, metadata, and code lists.

Return type:

ArchiveData

Raises:

ValueError – If no datadok-api endpoint is found for the path given.
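A usage sketch; the path is hypothetical, and the attribute name on the returned ArchiveData object is assumed for illustration:

archive = open_path_datadok("path/to/arkivfil.dat")  # hypothetical path
df = archive.df  # assuming the imported data is exposed as a .df attribute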

open_path_metapath_datadok(path, metapath, **read_fwf_params)

If open_path_datadok doesn’t work, specify both the path on Linux AND the path in Datadok.

Parameters:
  • path (str) – Path to the archive file on Linux.

  • metapath (str) – The path described in Datadok.

  • read_fwf_params (Any) – Remaining parameters to pass to pd.read_fwf; dtype, widths, names and na_values are overwritten, so don’t pass those.

Returns:

An ArchiveData object containing the imported data, metadata, and code lists.

Return type:

ArchiveData
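A usage sketch with hypothetical paths:

archive = open_path_metapath_datadok(
    "path/on/linux/arkivfil.dat",    # where the file actually lives
    "path/in/datadok/arkivfil",      # where Datadok describes it
)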

saspy_df_from_path(path)

Use df_from_sasfile instead; this is the old (bad) name for the function.

Parameters:

path (str) – The full path to the sasfile you want to open with sas.

Returns:

The raw content of the sasfile straight from saspy

Return type:

pandas.DataFrame

saspy_session()

Get an initialized saspy.SASsession object.

Use the default config, getting your password if you’ve set one.

Returns:

An initialized saspy-session

Return type:

saspy.SASsession
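A usage sketch built on the standard saspy API:

sas = saspy_session()
result = sas.submit("proc means data=sashelp.class; run;")
print(result["LST"])  # the listing output from the submitted code
sas.endsas()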

view_dataframe(dataframe, column, operator='==', unique_limit=100)

Display an interactive widget for filtering and viewing data in a DataFrame based on selection of values in one column.

Parameters:
  • dataframe (DataFrame) – The DataFrame containing the data to be filtered.

  • column (str) – The column in the DataFrame to be filtered.

  • operator (str) – The comparison operator for filtering (may be altered during the display). Options: ‘==’, ‘!=’, ‘>=’, ‘>’, ‘<’, ‘<=’. Default: ‘==’.

  • unique_limit (int) – The maximum number of unique values in the column for using ‘==’ or ‘!=’ operators. Default: 100.

Returns:

An interactive widget for filtering and viewing data based on the specified criteria.

The ‘==’ and ‘!=’ operators use a dropdown list for multiple selection. The other (interval) operators use a slider.

Return type:

widgets.interactive
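A usage sketch, reusing the data frame from the all_combos_agg example above:

view_dataframe(data, column='region', operator='==')   # dropdown with multiple selection
view_dataframe(data, column='income', operator='>=')   # interval operators get a slider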