Reference

fagfunksjoner package

Submodules

fagfunksjoner.fagfunksjoner_logger module

class ColoredFormatter(*args, colors=None, **kwargs)

Bases: Formatter

Colored log formatter.

Parameters:
  • args (Any)

  • colors (dict[str, str] | None)

  • kwargs (Any)

format(record)

Format the specified record as text.

Return type:

str

Parameters:

record (LogRecord)
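
A minimal usage sketch; the assumption that the colors mapping goes from level name to ANSI escape code is based on the signature only, not documented here:

import logging
from fagfunksjoner.fagfunksjoner_logger import ColoredFormatter

# Assumed convention: level name -> ANSI escape code.
formatter = ColoredFormatter(
    "%(asctime)s %(levelname)s %(message)s",
    colors={"WARNING": "\033[33m", "ERROR": "\033[31m"},
)
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logging.getLogger(__name__).addHandler(handler)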

silence_logger(func, *args, **kwargs)

Silences INFO and WARNING logs for the duration of the function call.

Return type:

Any

Parameters:
  • func (Callable[[...], Any])

  • args (Any)

  • kwargs (Any)
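
A minimal usage sketch; chatty is a hypothetical stand-in for any function that logs at INFO or WARNING level:

import logging
from fagfunksjoner.fagfunksjoner_logger import silence_logger

def chatty(x):
    # Hypothetical function: its INFO log is suppressed during the call.
    logging.getLogger(__name__).info("you will not see this")
    return x * 2

result = silence_logger(chatty, 21)  # returns 42, INFO/WARNING silenced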

Module contents

Fagfunksjoner is a place for “loose, small functionality” produced at Statistics Norway in Python.

Often created by “fag” (subject-matter staff) rather than by IT, these are typically small “helper functions” that many might be interested in.

class ProjectRoot

Bases: object

Context manager for importing local modules inside a “with” block.

As in:

with ProjectRoot():
    from src.functions.local_functions import local_function

This class navigates to the project root and back again in a single line/“instruction”.

static load_toml(config_file)

Looks for a .toml file to load the contents from.

Looks in the current folder, the specified path, and the project root.

Parameters:

config_file (str) – The path or filename of the config-file to load.

Returns:

The contents of the toml-file.

Return type:

dict[str, Any]
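
A minimal usage sketch, assuming a config.toml exists in the current folder, at the given path, or at the project root:

from fagfunksjoner import ProjectRoot

# Assumes a config.toml can be found in one of the searched locations.
config = ProjectRoot.load_toml("config.toml")
print(config.keys())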

all_combos_agg(df, groupcols, aggargs, fillna_dict=None, keep_empty=False, grand_total='')

Generate all aggregation levels for a set of columns in a dataframe.

Parameters:
  • df (DataFrame) – dataframe to aggregate.

  • groupcols (list[str]) – List of columns to group by.

  • aggargs (Callable[[Any], Any] | str | ufunc | Mapping[str, Callable[[Any], Any] | str | ufunc] | dict[str, list[str]]) – How to aggregate; passed to the pandas agg function, see its documentation.

  • fillna_dict (dict[str, Any] | None) – Fills “totals” in the groupcols, by filling their NA values. Send a dict with col names as keys, and string-values to put in cells as values.

  • keep_empty (bool) – Keep groups without observations through the process. Removing them is the default behaviour of pandas.

  • grand_total (dict[str, str] | str) – Fill with this value if you want a grand total in your aggregations. If you use a string, it is put into the fields of the groupcol columns. If you send a dict, as with the fillna_dict parameter, the grand-total cells will reflect the values in the dict.

Returns:

A dataframe with all the group-by columns, all the aggregation columns combined with the aggregation functions, a column called aggregation_level which separates the different aggregation levels, and a column called aggregation_ways which counts the number of group columns used for the aggregation.

Return type:

pd.DataFrame

Known problems:

You should not use dataframes with multi-index columns as they cause trouble.

Examples:

import pandas as pd
from fagfunksjoner.data.pandas_combinations import all_combos_agg

data = {'alder': [20, 60, 33, 33, 20],
        'kommune': ['0301', '3001', '0301', '5401', '0301'],
        'kjonn': ['1', '2', '1', '2', '2'],
        'inntekt': [1000000, 120000, 220000, 550000, 50000],
        'formue': [25000, 50000, 33000, 44000, 90000]
        }

pers = pd.DataFrame(data)

agg1 = all_combos_agg(pers, groupcols=['kjonn'], keep_empty=True, aggargs={'inntekt':['mean', 'sum']})
display(agg1)

agg2 = all_combos_agg(pers, groupcols=['kjonn', 'alder'], aggargs={'inntekt':['mean', 'sum']})
display(agg2)

agg3 = all_combos_agg(pers, groupcols=['kjonn', 'alder'], grand_total='Grand total', aggargs={'inntekt':['mean', 'sum']})
display(agg3)
agg4 = all_combos_agg(pers, groupcols=['kjonn', 'alder'], fillna_dict={'kjonn': 'Total kjønn', 'alder': 'Total alder'}, aggargs={'inntekt':['mean', 'sum'], 'formue': ['count', 'min', 'max']}, grand_total="Total")
display(agg4)
pers['antall'] = 1
groupcols = pers.columns[0:3].tolist()
func_dict = {'inntekt':['mean', 'sum'], 'formue': ['sum', 'std', 'count']}
fillna_dict = {'kjonn': 'Total kjønn', 'alder': 'Total alder', 'kommune': 'Total kommune'}
agg5 = all_combos_agg(pers, groupcols=groupcols, aggargs=func_dict, fillna_dict=fillna_dict, grand_total=fillna_dict)
display(agg5)

auto_dtype(df, cardinality_threshold=0, copy_df=True, show_memory=True)

Clean up a dataframe’s dtypes.

First lowercases all column names. Tries to decode byte strings as UTF-8. Runs pandas’ convert_dtypes(). Tries to convert object columns to string and strips surrounding spaces. Downcasts ints to smaller int types. If cardinality_threshold is set above 0, converts object and string columns to categoricals when the number of unique values in a column is below the threshold.

Parameters:
  • df (DataFrame) – The dataframe to manipulate.

  • cardinality_threshold (int) – Columns with fewer unique values than this threshold are converted to categoricals. Defaults to 0, meaning no conversion to categoricals.

  • copy_df (bool) – The reverse of inplace: make a copy in memory. This may cost some memory, but is safer. Defaults to True.

  • show_memory (bool) – Show the user how much memory the conversion saved; this requires some extra processing. Defaults to True.

Returns:

The dataframe with cleaned-up dtypes.

Return type:

pd.DataFrame
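
A minimal usage sketch; the threshold of 10 is only an illustration:

import pandas as pd
from fagfunksjoner import auto_dtype

pers = pd.DataFrame({"Kjonn": ["1", "2", "1"], "Inntekt": [100000, 120000, 95000]})
# Lowercases the column names, downcasts the ints, and converts "kjonn"
# to a categorical since it has fewer than 10 unique values.
pers_clean = auto_dtype(pers, cardinality_threshold=10)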

check_env(raise_err=True)

Check if you are on Dapla or in prodsone.

Parameters:

raise_err (bool) – Set to False if you don’t want the code to raise an error on an unrecognized environment.

Returns:

“DAPLA” if on Dapla, “PROD” if in prodsone, otherwise “UNKNOWN”.

Return type:

str

Raises:

OSError – If no environment indications match (Dapla or Prod), and raise_err is set to True.
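
A minimal usage sketch, with raise_err=False so an unrecognized environment returns “UNKNOWN” instead of raising:

from fagfunksjoner import check_env

if check_env(raise_err=False) == "DAPLA":
    ...  # e.g. read from GCS buckets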

get_latest_fileversions(glob_list_path)

Receives a list of filenames with multiple versions and returns the latest versions of the files.

We recommend using a glob operation to create the input list. See the documentation for glob operations:
  • GCS: https://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFileSystem.glob
  • Locally: https://docs.python.org/3/library/glob.html

Parameters:

glob_list_path (list[str] | str) – A list of strings, or a single string, representing filepaths. We recommend creating the list with a glob operation.

Returns:

List of strings with unique filepaths and their latest versions.

Return type:

list[str]

Raises:

TypeError – If the parameter cannot be type-narrowed to a list of strings.

Example:

import dapla as dp
fs = dp.FileClient.get_gcs_file_system()
all_files = fs.glob("gs://dir/statdata_v*.parquet")
latest_files = get_latest_fileversions(all_files)
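
The same pattern works locally with the standard library’s glob; the directory here is hypothetical:

import glob
from fagfunksjoner import get_latest_fileversions

all_files = glob.glob("/some/dir/statdata_v*.parquet")  # hypothetical directory
latest_files = get_latest_fileversions(all_files)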

linux_shortcuts(insert_environ=False)

Manually load the “linux-forkortelser” (Linux path abbreviations) as a dict.

Only works if the function can find the file they are shared in.

Parameters:

insert_environ (bool) – Set to True if you want the dict to be inserted into the environment variables (os.environ).

Returns:

The “linux-forkortelser” as a dict

Return type:

dict[str, str]

Raises:

ValueError – If the stamme_variabel file is wrongly formatted.
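
A minimal usage sketch; with insert_environ=True the abbreviations are also exposed through os.environ:

import os
from fagfunksjoner import linux_shortcuts

stammer = linux_shortcuts(insert_environ=True)
# The abbreviations are now also available as environment variables.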

next_version_path(filepath)

Generates a new file path with an incremented version number.

Constructs a filepath for a new version of a file, based on the latest existing version found in the specified folder; that is, it skips to one past the highest version it finds. The version number is incremented by one to ensure the new file path is unique.

Parameters:

filepath (str) – The address for the file.

Returns:

The new file path with an incremented version number and specified suffix.

Return type:

str

Example:

next_version_path('gs://my-bucket/datasets/data_v1.parquet')
'gs://my-bucket/datasets/data_v2.parquet'

open_path_datadok(path, **read_fwf_params)

Get archive data only based on the path of the .dat or .txt file.

This function attempts to correct and test options to track down the file and the metadata mentioned.

Parameters:
  • path (str | Path) – The path to the archive file in prodsonen to attempt to get metadata for and open.

  • read_fwf_params (Any) – Remaining parameters to pass to pd.read_fwf; dtype, widths, names and na_values are overwritten, so don’t pass those.

Returns:

An ArchiveData object containing the imported data, metadata, and code lists.

Return type:

ArchiveData

Raises:

ValueError – If no datadok-api endpoint is found for the path given.
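
A minimal usage sketch; the path is hypothetical, and the extra keyword argument is passed through to pd.read_fwf:

from fagfunksjoner import open_path_datadok

# Hypothetical archive path in prodsonen.
archive = open_path_datadok("/ssb/stamme01/utd/arkiv/data.dat", encoding="latin1")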

open_path_metapath_datadok(path, metapath, **read_fwf_params)

If open_path_datadok doesn’t work, specify the path on linux AND the path in Datadok.

Parameters:
  • path (str) – Path to the archive file on linux.

  • metapath (str) – Path described in datadok.

  • read_fwf_params (Any) – Remaining parameters to pass to pd.read_fwf; dtype, widths, names and na_values are overwritten, so don’t pass those.

Returns:

An ArchiveData object containing the imported data, metadata, and code lists.

Return type:

ArchiveData
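
A minimal usage sketch with hypothetical paths:

from fagfunksjoner import open_path_metapath_datadok

archive = open_path_metapath_datadok(
    "/ssb/stamme01/utd/arkiv/data.dat",  # hypothetical path on linux
    "$UTD/arkiv/data",  # hypothetical path as registered in Datadok
)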

saspy_df_from_path(path)

Use df_from_sasfile instead; this is the old (bad) name for the function.

Parameters:

path (str) – The full path to the sasfile you want to open with sas.

Returns:

The raw content of the sasfile, straight from saspy.

Return type:

pandas.DataFrame

saspy_session()

Get an initialized saspy.SASsession object.

Use the default config, getting your password if you’ve set one.

Returns:

An initialized saspy-session

Return type:

saspy.SASsession
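
A minimal usage sketch; submit() is standard saspy and returns a dict holding the LOG and LST output:

from fagfunksjoner import saspy_session

sas = saspy_session()
result = sas.submit("proc options option=memsize; run;")
print(result["LOG"])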

view_dataframe(dataframe, column, operator='==', unique_limit=100)

Display an interactive widget for filtering and viewing data in a DataFrame based on selection of values in one column.

Parameters:
  • dataframe (DataFrame) – The DataFrame containing the data to be filtered.

  • column (str) – The column in the DataFrame to be filtered.

  • operator (str) – The comparison operator for filtering (may be altered during the display). Options: ‘==’, ‘!=’, ‘>=’, ‘>’, ‘<’, ‘<=’. Default: ‘==’.

  • unique_limit (int) – The maximum number of unique values in the column for using ‘==’ or ‘!=’ operators. Default: 100.

Returns:

An interactive widget for filtering and viewing data based on the specified criteria.

The ‘==’ and ‘!=’ operators use a dropdown list for multiple selection; the other (interval) operators use a slider.

Return type:

widgets.interactive
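
A minimal usage sketch; since ‘>=’ is an interval operator, the widget shows a slider:

import pandas as pd
from fagfunksjoner import view_dataframe

pers = pd.DataFrame({"alder": [20, 60, 33], "inntekt": [1000000, 120000, 220000]})
view_dataframe(pers, column="alder", operator=">=")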