Reference¶
fagfunksjoner package¶
Subpackages¶
- fagfunksjoner.api package
- Submodules
- fagfunksjoner.api.statistikkregisteret module
Contact
FuturePublishingError
LangText
MultiplePublishings
Name
Owningsection
PublishingSpecifics
PublishingSpecifics.desk_flow
PublishingSpecifics.has_changed
PublishingSpecifics.is_cancelled
PublishingSpecifics.is_period
PublishingSpecifics.name
PublishingSpecifics.period_from
PublishingSpecifics.period_until
PublishingSpecifics.precision
PublishingSpecifics.publish_id
PublishingSpecifics.revision
PublishingSpecifics.statistic
PublishingSpecifics.status
PublishingSpecifics.time
PublishingSpecifics.time_changed
PublishingSpecifics.title
PublishingSpecifics.variant
SinglePublishing
SinglePublishing.annual_reporting
SinglePublishing.approved
SinglePublishing.changed
SinglePublishing.changes
SinglePublishing.contacts
SinglePublishing.continuation
SinglePublishing.created_date
SinglePublishing.default_lang
SinglePublishing.desk_flow
SinglePublishing.dir_flow
SinglePublishing.firstpublishing
SinglePublishing.name
SinglePublishing.old_subjectcodes
SinglePublishing.owner_code
SinglePublishing.owner_name
SinglePublishing.owningsection
SinglePublishing.publish_id
SinglePublishing.publishings
SinglePublishing.regional_levels
SinglePublishing.short_name
SinglePublishing.start_year
SinglePublishing.status
SinglePublishing.triggerwords
SinglePublishing.variants
StatisticPublishingShort
Variant
etree_to_dict()
find_latest_publishing()
find_publishings()
find_stat_shortcode()
get_contacts()
get_singles_publishings()
get_statistics_register()
handle_children()
kwargs_specifics()
parse_contact_single()
parse_contacts()
parse_data_single()
parse_eierseksjon_single()
parse_lang_text_single()
parse_name_single()
parse_single_stat_from_englishjson()
parse_triggerord_single()
parse_variant_single()
raise_on_missing_future_publish()
sections_publishings()
single_stat()
specific_publishing()
time_until_publishing()
- fagfunksjoner.api.valuta module
- Module contents
- fagfunksjoner.dapla package
- fagfunksjoner.data package
- Submodules
- fagfunksjoner.data.datadok_extract module
ArchiveData
CodeList
ContextVariable
Metadata
add_dollar_or_nondollar_path()
add_pii_paths()
bumpcheck_file_years_back()
codelist_to_df()
codelist_to_dict()
convert_dates()
convert_to_pathlib()
date_formats()
date_parser()
downcast_ints()
extract_codelist()
extract_context_variables()
extract_parameters()
get_path_combinations()
get_yr_char_ranges()
go_back_in_time()
handle_decimals()
import_archive_data()
look_for_filepath()
metadata_to_df()
open_path_datadok()
open_path_metapath_datadok()
replace_dollar_stamme()
test_url()
test_url_combos()
url_from_path()
- fagfunksjoner.data.dicts module
- fagfunksjoner.data.pandas_combinations module
- fagfunksjoner.data.pandas_dtypes module
- fagfunksjoner.data.pyarrow module
- fagfunksjoner.data.view_dataframe module
- Module contents
- fagfunksjoner.log package
- fagfunksjoner.paths package
- fagfunksjoner.prodsone package
Submodules¶
fagfunksjoner.fagfunksjoner_logger module¶
- class ColoredFormatter(*args, colors=None, **kwargs)¶
Bases:
Formatter
Colored log formatter.
Initialize the formatter with specified format strings.
- Parameters:
args (Any)
colors (dict[str, str] | None)
kwargs (Any)
- format(record)¶
Format the specified record as text.
- Return type:
str
- Parameters:
record (LogRecord)
- silence_logger(func, *args, **kwargs)¶
Silences INFO and WARNING logs for the duration of the function call.
- Return type:
Any
- Parameters:
func (Callable[[...], Any])
args (Any)
kwargs (Any)
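A minimal sketch of how these two helpers might be combined with the standard logging module. The format string, the messages and the noisy() function are hypothetical, and it assumes ColoredFormatter passes its positional arguments straight on to logging.Formatter, as the signature suggests.
import logging
from fagfunksjoner.fagfunksjoner_logger import ColoredFormatter, silence_logger

handler = logging.StreamHandler()
# Assumed: the format string is forwarded to logging.Formatter.
handler.setFormatter(ColoredFormatter("%(levelname)s: %(message)s"))
logging.getLogger().addHandler(handler)

def noisy() -> int:
    logging.warning("This WARNING is suppressed while silence_logger wraps the call.")
    return 42

result = silence_logger(noisy)  # INFO/WARNING emitted inside noisy() are silenced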
Module contents¶
Fagfunksjoner is a collection of “loose, small functionality” produced at Statistics Norway in Python.
Often created by subject-matter staff (“fag”) rather than IT, these are typically small helper functions that many others might find useful.
- class ProjectRoot¶
Bases:
object
Context manager for importing local modules from the project root using a “with” statement.
As in:
with ProjectRoot(): from src.functions.local_functions import local_function
The class changes the working directory to the project root on entry and back again on exit, so the import can be done in a single line/“instruction”.
Initializing the ProjectRoot finds the correct root folder, and stores the starting folder so it can navigate back to it.
- static load_toml(config_file)¶
Looks for a .toml file to load the contents from.
Looks in the current folder, the specified path, the project root.
- Parameters:
config_file (str) – The path or filename of the config-file to load.
- Returns:
The contents of the toml-file.
- Return type:
dict[Any]
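A short sketch of the documented usage; the src module comes from the docstring above, the config file name config/settings.toml is hypothetical, and it assumes ProjectRoot is importable from the fagfunksjoner package as listed in this section.
from fagfunksjoner import ProjectRoot

with ProjectRoot():
    # Resolves because the working directory is temporarily the project root.
    from src.functions.local_functions import local_function

settings = ProjectRoot.load_toml("config/settings.toml")  # hypothetical file name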
- class SsbFormat(start_dict=None)¶
Bases:
dict[Any, Any]
Custom dictionary class designed to handle specific formatting conventions, including mapping intervals (defined as range strings) even when they map to the same value.
Initializes the SsbFormat instance.
- Parameters:
start_dict (dict, optional) – Initial dictionary to populate SsbFormat.
- static check_if_na(key)¶
Checks if the specified key represents a NA (Not Available) value.
- Parameters:
key (Any) – Key to be checked for NA value.
- Returns:
True if the key represents NA, False otherwise.
- Return type:
bool
- int_str_confuse(key)¶
Handles conversion between integer and string keys.
- Parameters:
key (str | int | float | NAType | None) – Key to be converted or checked for existence in the dictionary.
- Return type:
None | Any
- Returns:
The value associated with the key (if found) or None.
- look_in_ranges(key)¶
Returns the mapping value for the key if it falls within any defined range.
The method attempts to convert the key to a float and then checks if it lies within any of the stored range intervals. If the key is None, NA, or not of a convertible type, the method returns None.
- Return type:
None | Any
- Parameters:
key (str | int | float | NAType | None)
- set_na_value()¶
Sets the value for NA (Not Available) keys in the SsbFormat.
- Returns:
True if NA value is successfully set, False otherwise.
- Return type:
bool
- set_other_as_lowercase()¶
Ensures that the ‘other’ key is stored in lowercase.
If a key matching ‘other’ in any other case is found, its value is reassigned to ‘other’.
- Return type:
None
- store(output_path, force=False)¶
Stores the SsbFormat instance in a specified output path.
- Parameters:
output_path (str) – Path where the format will be stored.
force (bool) – Flag to force storing even for cached instances.
- Raises:
ValueError – If storing a cached SsbFormat might lead to an unexpectedly large number of keys.
- Return type:
None
- store_ranges()¶
Stores ranges by converting range-string keys into tuple keys.
For example, a key “0-18” with value “A” will be stored as {(0.0, 18.0): “A”}.
- Return type:
None
- update_format()¶
Update method to set special instance attributes.
- Return type:
None
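A minimal sketch of the range-mapping behaviour described above. The age groups and labels are illustrative, and it assumes SsbFormat is importable from the fagfunksjoner package as listed in this section.
from fagfunksjoner import SsbFormat

age_format = SsbFormat({"0-17": "child", "18-66": "adult", "67-120": "senior", "other": "unknown"})
age_format.store_ranges()               # "0-17" is also stored under the tuple key (0.0, 17.0)
print(age_format.look_in_ranges(42))    # expected to return "adult"
print(SsbFormat.check_if_na(None))      # expected to return True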
- class StatLogger(log_level=10, log_file='app.log', loggers=(LoggerType.CONSOLE, LoggerType.FILE))¶
Bases:
object
A root logger class that facilitates logging to console and files.
This class is meant to be the root-level logger in an application; it receives log messages from all other modules, formats them in a uniform way, and directs them to the specified outputs (console, file, etc.).
There is only one instance of this class, ensured by a singleton pattern implementation.
Initialize the StatLogger class.
- Parameters:
log_level (int) – The logging level. Defaults to logging.DEBUG.
log_file (str | Path) – The file where logs will be written. Defaults to ‘app.log’.
loggers (Iterable[LoggerType]) – Optional list of LoggerTypes that should be added. Defaults to LoggerType.CONSOLE and LoggerType.FILE.
args (Any)
kwargs (Any)
- Raises:
TypeError – If not all loggers have type LoggerType.
- Return type:
Any
- getLogger()¶
Returns the configured logger instance.
- Return type:
Logger
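A sketch of the documented pattern: create the singleton root logger once at application start-up, then log as usual. The log file name and the message are examples.
import logging
from fagfunksjoner import StatLogger

stat_logger = StatLogger(log_level=logging.INFO, log_file="app.log")
logger = stat_logger.getLogger()
logger.info("Pipeline started")  # goes to both console and app.log with the defaults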
- all_combos_agg(df, groupcols, valuecols=None, aggargs=None, fillna_dict=None, keep_empty=False, grand_total='')¶
Generate all aggregation levels for a set of columns in a dataframe.
Creates aggregations over all combinations of categorical variables specified in groupcols and applies aggregation functions on valuecols. Allows for inclusion of grand totals and customized fill values for missing groups, similar to “proc means” in SAS.
- Parameters:
df (DataFrame) – DataFrame to aggregate.
groupcols (list[str]) – List of columns to group by.
valuecols (list[str] | None) – List of columns to apply aggregation functions on. Defaults to None, in which case all numeric columns are used.
aggargs (Callable[[Any], Any] | str | ufunc | Mapping[str, Callable[[Any], Any] | str | ufunc] | dict[str, list[str]] | None) – Dictionary or function specifying aggregation for each column in valuecols. If None, defaults to ‘sum’ for each column in valuecols.
fillna_dict (dict[str, Any] | None) – Dictionary specifying values to fill NA in each column of groupcols. Useful for indicating totals in the final table.
keep_empty (bool) – If True, preserves empty groups in the output.
grand_total (dict[str, str] | str) – Dictionary or string to indicate a grand total row. If a dictionary, the values are applied in each corresponding groupcols.
- Returns:
DataFrame with all aggregation levels, including the columns:
groupcols: group-by columns with filled total values as needed.
level: indicates aggregation level.
ways: counts the number of grouping columns used for each aggregation.
- Return type:
pd.DataFrame
Examples
>>> data = pd.DataFrame({
...     'age': [20, 60, 33, 33, 20],
...     'region': ['0301', '3001', '0301', '5401', '0301'],
...     'gender': ['1', '2', '1', '2', '2'],
...     'income': [1000000, 120000, 220000, 550000, 50000],
...     'wealth': [25000, 50000, 33000, 44000, 90000]
... })
>>> all_combos_agg(data, groupcols=['gender', 'age'], aggargs={'income': ['mean', 'sum']})
- all_combos_agg_inclusive(df, groupcols=None, category_mappings=None, valuecols=None, aggargs=None, totalcodes=None, keep_empty=False, grand_total=True)¶
Generate all aggregation levels for a set of columns in a dataframe, for non-exclusive categories.
Creates aggregations over all combinations of categorical variables specified in groupcols and applies aggregation functions on valuecols. Allows for inclusion of grand totals and customized fill values for missing groups. It is basically a more general version of the all_combos_agg function, allowing for inclusive (non-exclusive) categories. Inclusive categories are defined by a dictionary of mappings in category_mappings. Variables in groupcols are assumed to be categorical, and their categories are treated as mutually exclusive.
- Parameters:
df (DataFrame) – DataFrame to aggregate.
groupcols (None | list[str]) – List of columns to group by.
category_mappings (None | dict[str, dict[str, list[Any] | str] | Any]) – Dictionary of dictionaries, where each key is a column name and each value is a dictionary of mappings. ‘__ALL__’ can be used to indicate ‘all values’ in a column, and is used for totals.
valuecols (None | list[str]) – List of columns to apply aggregation functions on. Defaults to None, in which case all numeric columns are used.
aggargs (None | dict[str, Any] | Callable[..., Any] | str | list[Any]) – Dictionary or function specifying aggregation for each column in valuecols. If None, defaults to ‘sum’ for each column in valuecols.
totalcodes (None | dict[str, str]) – Dictionary specifying values to use as labels representing totals in each column.
keep_empty (bool) – If True, preserves empty groups in the output.
grand_total (bool) – If True, a grand total row is included.
- Raises:
ValueError – If a column in groupcols is not found in the DataFrame.
- Returns:
DataFrame with all aggregation levels for the specified columns.
- Return type:
pd.DataFrame
Examples
>>> # Define the categorical bins based on the metadata
>>> gender_bins = {"1": "Menn", "2": "Kvinner"}
>>> # Generate synthetic data
>>> np.random.seed(42)
>>> num_samples = 100
>>> synthetic_data = pd.DataFrame({
...     "Tid": np.random.choice(["2021", "2022", "2023"], num_samples),
...     "UtdanningOppl": np.random.choice(list(range(1, 19)), num_samples),
...     "Kjonn": np.random.choice(list(gender_bins.keys()), num_samples),
...     "Alder": np.random.randint(15, 67, num_samples),  # Ages between 15 and 66
...     "syss_student": np.random.choice(["01", "02", "03", "04"], num_samples),
...     "n": 1
... })
>>> category_mappings = {
...     "Alder": {
...         "15-24": range(15, 25), "25-34": range(25, 35), "35-44": range(35, 45),
...         "45-54": range(45, 55), "55-66": range(55, 67),
...         "15-21": range(15, 22), "22-30": range(22, 31), "31-40": range(31, 41),
...         "41-50": range(41, 51), "51-66": range(51, 67),
...         "15-30": range(15, 31), "31-45": range(31, 46), "46-66": range(46, 67),
...     },
...     "syss_student": {
...         "01": ["01", "02"], "02": ["03", "04"], "03": ["02"], "04": ["04"],
...     },
...     "Kjonn": {
...         "Menn": ["1"], "Kvinner": ["2"],
...     }
... }
>>> totalcodes = {"Alder": "Total", "syss_student": "Total", "Kjonn": "Begge"}
>>> all_combos_agg_inclusive(
...     synthetic_data,
...     groupcols=[],
...     category_mappings=category_mappings,
...     totalcodes=totalcodes,
...     valuecols=["n"],
...     aggargs={"n": "sum"},
...     grand_total=True,
... )
- auto_dtype(df, cardinality_threshold=0, copy_df=True, show_memory=True)¶
Clean up a dataframe’s dtypes.
First lowers all column names. Tries to decode byte strings to utf8. Runs pandas’ convert_dtypes(). Tries to convert object columns to string and strips empty spaces. Downcasts ints to smaller int types. If cardinality_threshold is set above 0, converts object and string columns to categoricals when the number of unique values in the column is below the threshold.
- Parameters:
df (DataFrame) – The dataframe to manipulate.
cardinality_threshold (int) – Columns with fewer unique values than this threshold are converted to categoricals. Defaults to 0, meaning no conversion to categoricals.
copy_df (bool) – The reverse of inplace; make a copy in memory. This may have a memory impact, but is safer. Defaults to True.
show_memory (bool) – Show the user how much memory was saved by the conversion; requires some processing. Defaults to True.
- Returns:
The dataframe with cleaned-up dtypes.
- Return type:
pd.DataFrame
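A small sketch of the clean-up described above; the dataframe is made up, and the threshold of 10 is just an example.
import pandas as pd
from fagfunksjoner import auto_dtype

df = pd.DataFrame({
    "Region": [b"0301", b"3001", b"0301"],   # byte strings get decoded, column names lowered
    "Count": [1, 2, 3],
})
clean = auto_dtype(df, cardinality_threshold=10)  # low-cardinality text columns become categoricals
print(clean.dtypes)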
- check_env(raise_err=True)¶
Check if you are on Dapla or in prodsone.
- Parameters:
raise_err (bool) – Set to False if you don’t want the code to raise an error on an unrecognized environment.
- Returns:
“DAPLA” if on Dapla, “PROD” if in prodsone, otherwise “UNKNOWN”.
- Return type:
str
- Raises:
OSError – If no environment indications match (Dapla or Prod), and raise_err is set to True.
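A sketch of branching on the returned environment string; the data roots are hypothetical.
from fagfunksjoner import check_env

env = check_env(raise_err=False)
if env == "DAPLA":
    data_root = "gs://example-bucket/data/"     # hypothetical bucket
elif env == "PROD":
    data_root = "/ssb/stammeXX/kortkode/data/"  # hypothetical prodsone path
else:
    data_root = "./data/"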
- get_latest_fileversions(glob_list_path)¶
Receives a list of filenames with multiple versions and returns the latest versions of the files.
Recommend using a glob operation to create the input list. See the docs for glob operations:
- GCS: https://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFileSystem.glob
- Locally: https://docs.python.org/3/library/glob.html
- Parameters:
glob_list_path (list[str] | list[Path] | str | Path) – List of strings/Paths or single string/Path that represents a filepath. Recommend that the list is created with a glob operation.
- Returns:
List of strings, or Paths (if path was submitted) with unique filepaths and their latest versions.
- Return type:
list[str | Path]
- Raises:
TypeError – If parameter does not fit with type-narrowing to list of strings.
Example:
import dapla as dp
fs = dp.FileClient.get_gcs_file_system()
all_files = fs.glob("gs://dir/statdata_v*.parquet")
latest_files = get_latest_fileversions(all_files)
- latest_version_path(filepath)¶
Finds the path to the latest version of a specified file.
This function retrieves all versioned files matching the provided file path pattern and identifies the latest version. It supports both Google Cloud Storage (GCS) paths and local file paths, provided they follow the required naming convention with version numbers (e.g., ‘_v1’). If no versions are found, it defaults to returning a pattern representing version 1.
- Parameters:
filepath (str | Path) – The full path of the file, either a GCS path or a local path. It should follow the naming standard, including the version indicator.
- Returns:
The path to the latest version of the file. If no versions are found, returns a pattern for version 1 of the file.
- Return type:
str | Path
- Raises:
ValueError – If get_latest_fileversions returns a list of more than one file.
ValueError – If the filepath does not follow the naming convention with ‘_v’ followed by digits to denote version, when a versioned file is required.
Examples
‘ssb-prod-ofi-skatteregn-data-produkt/skatteregn/inndata/skd_data/2023/skd_p2023-01_v1.parquet’
‘/ssb/stammeXX/kortkode/inndata/skd_data/2023/skd_p2023-01_v1.parquet’
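A usage sketch built on the example path above; it assumes one or more versioned files matching that pattern exist.
from fagfunksjoner import latest_version_path

newest = latest_version_path(
    "ssb-prod-ofi-skatteregn-data-produkt/skatteregn/inndata/skd_data/2023/skd_p2023-01_v1.parquet"
)
# e.g. ".../skd_p2023-01_v3.parquet" if _v3 is the highest existing version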
- linux_shortcuts(insert_environ=False)¶
Manually load the Linux shortcuts (“linux-forkortelser”) as a dict, if the function can find the file they are shared in.
- Parameters:
insert_environ (bool) – Set to True if you want the dict to be inserted into the environment variables (os.environ).
- Returns:
The “linux-forkortelser” as a dict
- Return type:
dict[str, str]
- Raises:
ValueError – If the stamme_variabel file is wrongly formatted.
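A short sketch; the printed mapping is illustrative only, since the actual shortcuts come from the shared file.
from fagfunksjoner import linux_shortcuts

shortcuts = linux_shortcuts(insert_environ=True)  # also writes the pairs into os.environ
print(shortcuts)  # e.g. {"STAMME01": "/ssb/stamme01", ...} (illustrative values)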
- make_klass_xml_codelist(path, codes, names_bokmaal=None, names_nynorsk=None, names_engelsk=None)¶
Make a KLASS XML file and pandas DataFrame from a list of codes and names.
This XML can be loaded into the old KLASS UI under version -> import to the top right.
- Parameters:
path (str) – Path to save the xml file.
codes (list[str|int]) – List of codes.
names_bokmaal (list[str] | None) – List of names in Bokmål.
names_nynorsk (list[str] | None) – List of names in Nynorsk.
names_engelsk (list[str] | None) – List of names in English.
- Returns:
Dataframe with columns for codes and names.
- Return type:
pd.DataFrame
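A hypothetical two-code list; the codes, names and output path are examples (names_nynorsk is simply omitted here).
from fagfunksjoner import make_klass_xml_codelist

df = make_klass_xml_codelist(
    path="sivilstand_codelist.xml",          # hypothetical output file
    codes=["1", "2"],
    names_bokmaal=["Ugift", "Gift"],
    names_engelsk=["Unmarried", "Married"],
)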
- next_version_path(filepath)¶
Generates a new file path with an incremented version number.
Constructs a filepath for a new version of a file, based on the latest existing version found in a specified folder. Meaning it skips to “one after the highest version it finds”. It increments the version number by one, to ensure the new file path is unique.
- Parameters:
filepath (str | Path) – The path for the file.
- Returns:
The new file path with an incremented version number and specified suffix.
- Return type:
str | Path
Example:
next_version_path('gs://my-bucket/datasets/data_v1.parquet')
'gs://my-bucket/datasets/data_v2.parquet'
- open_path_datadok(path, **read_fwf_params)¶
Get archive data only based on the path of the .dat or .txt file.
This function attempts to correct and test path options to track down the file and the metadata it mentions.
- Parameters:
path (str | Path) – The path to the archive file in prodsone to attempt to get metadata for and open.
read_fwf_params (Any) – Remaining parameters to pass to pd.read_fwf; dtype, widths, names and na_values are overwritten, so don’t pass those.
- Returns:
An ArchiveData object containing the imported data, metadata, and code lists.
- Return type:
ArchiveData
- Raises:
ValueError – If no datadok-api endpoint is found for the path given.
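A sketch under stated assumptions: the dollar path is hypothetical, and the extra keyword argument is forwarded to pd.read_fwf.
from fagfunksjoner import open_path_datadok

# Hypothetical flat-file path; encoding is forwarded to pd.read_fwf.
archive = open_path_datadok("$UTD/eksempel/arkiv/g2023.dat", encoding="latin1")
# archive is an ArchiveData object holding the imported data, metadata and code lists.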
- open_path_metapath_datadok(path, metapath, **read_fwf_params)¶
If open_path_datadok doesn’t work, specify both the path on Linux AND the path in Datadok.
- Parameters:
path (str) – Path to the archive file on Linux.
metapath (str) – Path as described in Datadok.
read_fwf_params (Any) – Remaining parameters to pass to pd.read_fwf; dtype, widths, names and na_values are overwritten, so don’t pass those.
- Returns:
An ArchiveData object containing the imported data, metadata, and code lists.
- Return type:
ArchiveData
- round_up(data, decimal_places=0, col_names='')¶
Round up a number to a given number of decimal places. Avoids Python’s default of rounding to even.
- Parameters:
data (DataFrame | object | float | NAType) – The data to round up; can be a float, Series, or DataFrame.
decimal_places (int) – The number of decimal places to round up to. Ignored if you pass a dictionary of column names and decimal places into col_names.
col_names (str | list[str] | dict[str, int]) – The column names to round up. If a dictionary is provided, it should map column names to the number of decimal places for each column. If a list is provided, it should contain the names of the columns to round up. If a string is provided, it should be the name of a single column to round up.
- Returns:
The rounded up number as an int, float, Series, or DataFrame.
- Return type:
pd.DataFrame | pd.Series | int | float
- Raises:
TypeError – If data is not a DataFrame, Series, int, float, or NAType.
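A small sketch of the per-column behaviour; the dataframe and column names are made up.
import pandas as pd
from fagfunksjoner import round_up

df = pd.DataFrame({"income": [1000.25, 120000.5], "rate": [0.12345, 0.5551]})
rounded = round_up(df, col_names={"income": 0, "rate": 2})  # decimals per column
print(round_up(2.5))  # rounds up to 3 instead of Python's round-half-to-even (2)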
- saspy_df_from_path(path)¶
Use df_from_sasfile instead; this is the old (bad) name for the function.
- Parameters:
path (str) – The full path to the SAS file you want to open with SAS.
- Returns:
The raw content of the SAS file straight from saspy.
- Return type:
pandas.DataFrame
- saspy_session()¶
Get an initialized saspy.SASsession object.
Use the default config, getting your password if you’ve set one.
- Returns:
An initialized saspy-session
- Return type:
saspy.SASsession
- view_dataframe(dataframe, column, operator='==', unique_limit=100)¶
Display an interactive widget for filtering and viewing data in a DataFrame based on selection of values in one column.
- Parameters:
dataframe (DataFrame) – The DataFrame containing the data to be filtered.
column (str) – The column in the DataFrame to be filtered.
operator (str) – The comparison operator for filtering (may be altered during the display). Options: ‘==’, ‘!=’, ‘>=’, ‘>’, ‘<’, ‘<=’. Default: ‘==’.
unique_limit (int) – The maximum number of unique values in the column for using the ‘==’ or ‘!=’ operators. Default: 100.
- Returns:
An interactive widget for filtering and viewing data based on the specified criteria.
The ‘==’ and ‘!=’ operators use a dropdown list for multiple selection; the other (interval) operators use a slider.
- Return type:
widgets.interactive
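A notebook-oriented sketch; the dataframe is made up, and the call returns an ipywidgets object that renders interactively.
import pandas as pd
from fagfunksjoner import view_dataframe

df = pd.DataFrame({"region": ["0301", "3001", "5401"], "income": [550000, 320000, 410000]})
view_dataframe(df, column="region", operator="==", unique_limit=50)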