fagfunksjoner.data package

Submodules

fagfunksjoner.data.datadok_extract module

class ArchiveData(df, metadata_df, codelist_df, codelist_dict, names, widths, datatypes)

Bases: object

Class representing the archive data along with its metadata and code lists.

Parameters:
  • df (DataFrame)

  • metadata_df (DataFrame)

  • codelist_df (DataFrame)

  • codelist_dict (dict[str, dict[str, str]])

  • names (list[str])

  • widths (list[int])

  • datatypes (dict[str, str])

codelist_df: DataFrame
codelist_dict: dict[str, dict[str, str]]
datatypes: dict[str, str]
df: DataFrame
metadata_df: DataFrame
names: list[str]
widths: list[int]
class CodeList(context_id, codelist_title, codelist_description, code_value, code_text)

Bases: object

Class representing a code list.

Parameters:
  • context_id (str)

  • codelist_title (str)

  • codelist_description (str)

  • code_value (str)

  • code_text (str)

code_text: str
code_value: str
codelist_description: str
codelist_title: str
context_id: str
class ContextVariable(context_id, title, description, datatype, length, start_position, precision, division)

Bases: object

Class representing a context variable.

Parameters:
  • context_id (str)

  • title (str)

  • description (str)

  • datatype (str)

  • length (int)

  • start_position (int)

  • precision (int | None)

  • division (str)

context_id: str
datatype: str
description: str
division: str
length: int
precision: int | None
start_position: int
title: str
class Metadata(context_variables, codelists)

Bases: object

Class representing metadata which includes context variables and code lists.

Parameters:
  • context_variables (list[ContextVariable])

  • codelists (list[CodeList])

codelists: list[CodeList]
context_variables: list[ContextVariable]
add_dollar_or_nondollar_path(path, add_dollar=True)

Add a $-path or non-$-path to an existing path. Output should be a list of length 2.

Parameters:
  • path (str | Path) – The path to expand on.

  • add_dollar (bool) – Whether to add dollar-paths (not needed for opening the file).

Raises:

TypeError – If the dollar-path lookup does not return a single string.

Returns:

A list containing the original path and the added variant.

Return type:

list[str]

add_pii_paths(paths)

Add PII-paths to a list of paths, to look in more places.

Parameters:

paths (list[str]) – List of paths to add PII or non-PII counterparts for.

Returns:

The list with PII or non-PII counterpart paths added.

Return type:

list[str]

bumpcheck_file_years_back(curr_path, yr_char_ranges, exts)

Modify the path to point at older versions of the file, looking for valid Datadok-API paths.

Parameters:
  • curr_path (Path) – The path given by user to look for.

  • yr_char_ranges (list[tuple[int, int]]) – The placement of the year ranges in the paths.

  • exts (list[str]) – The base extensions to explore.

Return type:

Path | None

codelist_to_df(codelist)

Converts a list of CodeList objects to a DataFrame.

Parameters:

codelist (list[CodeList]) – A list of CodeList objects.

Returns:

A DataFrame containing the code list information.

Return type:

pd.DataFrame

codelist_to_dict(codelist_df)

Converts a DataFrame containing code lists to a dictionary.

Parameters:

codelist_df (DataFrame) – DataFrame containing the code list information.

Returns:

A dictionary mapping code list titles to dictionaries of code values and texts.

Return type:

dict[str, dict[str, str]]

convert_dates(df, metadata_df)

It is faster to convert columns vectorized after importing them as strings than to run every row through a lambda.

Parameters:
  • df (DataFrame) – The DataFrame containing archive data.

  • metadata_df (DataFrame) – The DataFrame containing metadata.

Returns:

The modified archive DataFrame with converted datetime columns.

Return type:

pd.DataFrame

convert_to_pathlib(path)

Make sure the path is converted to pathlib.Path.

Parameters:

path (str | Path) – The path to possibly convert.

Returns:

The converted path.

Return type:

Path

date_formats(metadata_df)

Creates a dictionary of date conversion functions based on the metadata DataFrame.

Parameters:

metadata_df (DataFrame) – DataFrame containing metadata.

Returns:

A dictionary mapping column titles to date conversion formats.

Return type:

dict[str, str]

Raises:

ValueError – On unrecognized date formats.

date_parser(date_str, date_format)

Parses a date string into a datetime object based on the provided format.

Parameters:
  • date_str (str) – The date string to be parsed.

  • date_format (str) – The format in which the date string is.

Returns:

The parsed datetime object, or pd.NaT if parsing fails.

Return type:

datetime

downcast_ints(df, metadata_df)

Store ints as the lowest possible datatype that can contain the values.

Parameters:
  • df (DataFrame) – The DataFrame containing archive data.

  • metadata_df (DataFrame) – The DataFrame containing metadata.

Returns:

The modified archive DataFrame with downcast ints.

Return type:

pd.DataFrame

extract_codelist(root)

Extracts code lists from the XML root element and returns a list of CodeList objects.

Parameters:

root (Element) – The root element of the XML tree to parse.

Returns:

A list of CodeList objects.

Return type:

list[CodeList]

extract_context_variables(root)

Extracts context variables from the XML root element and returns a list of ContextVariable objects.

Parameters:

root (Element) – The root element of the XML tree to parse.

Returns:

A list of ContextVariable objects.

Return type:

list[ContextVariable]

Raises:

ValueError – Missing information in the XML.

extract_parameters(df)

Extracts parameters from the metadata DataFrame for importing archive data.

Parameters:

df (DataFrame) – A DataFrame containing metadata.

Returns:

Extracted parameters to input into archive import.

Return type:

tuple[list[str], list[int], dict[str, str], str]

get_path_combinations(path, file_exts=None, add_dollar=True)

Generate a list of combinations of possible paths and file extensions for a given path.

Parameters:
  • path (str | Path) – The given path; it will be expanded to include both the $UTD, $UTD_PII, utd and utd_pii variants.

  • file_exts (list[str] | str | None) – Possible file extensions for the files. Defaults to ["", ".dat", ".txt"].

  • add_dollar (bool) – Whether to add dollar-paths (not needed for opening the file).

Returns:

The generated combinations for possible locations of the files.

Return type:

list[tuple[str, str]]

get_yr_char_ranges(path)

Find the character ranges containing years in the path. Usually 1-4 ranges.

Parameters:

path (str | Path) – The filename to look at for character ranges.

Returns:

A list of 2-tuples, each containing the start position and the end position of a year range.

Return type:

list[tuple[int, int]]

go_back_in_time(path, file_exts=None)

Look for Datadok-API URLs back in time. Sometimes new entries are not added if the previous one still works.

Only modifies yearly publications for now.

Parameters:
  • path (str | Path) – The path to modify and test for previous years.

  • file_exts (list[str] | None) – The different file extensions to try.

Returns:

The path that was found, with a corresponding URL that has content in the Datadok-API.

If nothing is found, returns None.

Return type:

str | None

handle_decimals(df, metadata_df)

Adjusts the decimal values in the archive DataFrame based on the metadata or a decimal sign contained in the values.

Parameters:
  • df (DataFrame) – The DataFrame containing archive data.

  • metadata_df (DataFrame) – The DataFrame containing metadata.

Returns:

The modified archive DataFrame with adjusted decimal values.

Return type:

pd.DataFrame

import_archive_data(archive_desc_xml, archive_file, **read_fwf_params)

Imports archive data based on the given XML description and archive file.

Parameters:
  • archive_desc_xml (str) – Path or URL to the XML file describing the archive.

  • archive_file (str | Path) – Path to the archive file.

  • read_fwf_params (Any) – Remaining parameters to pass to pd.read_fwf; dtype, widths, names and na_values are overwritten, so do not pass those.

Returns:

An ArchiveData object containing the imported data, metadata, and code lists.

Return type:

ArchiveData

Raises:
  • ParseError – If we cannot parse the content at the Datadok-API endpoint as XML.

  • ValueError – If parameters that the import function overwrites are passed through read_fwf_params.

Example usage:

archive_data = import_archive_data('path_to_xml.xml', 'path_to_archive_file.txt')
print(archive_data.df)
look_for_filepath(path_lib)

Look for possible placements of the physical “flatfile” on disk.

Parameters:

path_lib (Path) – The given path from the user as a pathlib.Path

Raises:
  • FileNotFoundError – If more than one matching file is found, so we do not know which to pick.

  • FileNotFoundError – If no matching files are found.

Returns:

The found path of an actual physical file.

Return type:

Path

metadata_to_df(context_variables)

Converts a list of ContextVariable objects to a DataFrame.

Parameters:

context_variables (list[ContextVariable]) – A list of ContextVariable objects.

Returns:

A DataFrame containing the context variable information.

Return type:

pd.DataFrame

open_path_datadok(path, **read_fwf_params)

Get archive data only based on the path of the .dat or .txt file.

This function attempts to correct and test options, trying to track down the file and the metadata it refers to.

Parameters:
  • path (str | Path) – The path to the archive file in prodsonen (the production zone), to attempt to get metadata for and open.

  • read_fwf_params (Any) – Remaining parameters to pass to pd.read_fwf; dtype, widths, names and na_values are overwritten, so do not pass those.

Returns:

An ArchiveData object containing the imported data, metadata, and code lists.

Return type:

ArchiveData

Raises:

ValueError – If no datadok-api endpoint is found for the path given.
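
Example usage (a minimal sketch; the path is hypothetical, and the function is assumed importable from the documented module path):

from fagfunksjoner.data.datadok_extract import open_path_datadok

archive_data = open_path_datadok('$UTD/path/to/archive/file2020')  # hypothetical path
print(archive_data.df.head())
print(archive_data.metadata_df)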

open_path_metapath_datadok(path, metapath, **read_fwf_params)

If open_path_datadok does not work, specify both the path on linux AND the path in Datadok.

Parameters:
  • path (str) – Path to the archive file on linux.

  • metapath (str) – Path described in datadok.

  • read_fwf_params (Any) – Remaining parameters to pass to pd.read_fwf; dtype, widths, names and na_values are overwritten, so do not pass those.

Returns:

An ArchiveData object containing the imported data, metadata, and code lists.

Return type:

ArchiveData
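
Example usage (a hedged sketch with hypothetical paths, for when the file location on linux differs from the path registered in Datadok):

from fagfunksjoner.data.datadok_extract import open_path_metapath_datadok

archive_data = open_path_metapath_datadok(
    '/path/on/linux/to/archive_file.dat',   # where the flatfile actually is (hypothetical)
    '$UTD/path/registered/in/datadok',      # the path Datadok knows about (hypothetical)
)
print(archive_data.df.head())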

replace_dollar_stamme(path_lib)

Replace the dollar in a path with the full path using the linux-stammer.

Parameters:

path_lib (Path) – The input path, suspected to contain a dollar.

Returns:

The corrected path.

Return type:

Path | None

test_url(url)

Test if there is content at the given endpoint in the Datadok-API.

Parameters:

url (str) – The URL we should test.

Returns:

True if there is content at the URL. False otherwise.

Return type:

bool

test_url_combos(combinations)

Tests a set of path combinations for valid responses from the Datadok-API.

Parameters:

combinations (list[tuple[Path, str]]) – A list of tuples, each containing two elements: the first is most of the file path, the second is the file extension, including the ".".

Returns:

The tested path if one test passes; None if nothing is found.

Return type:

None | str

url_from_path(path)

Append the given path to the endpoint URL that Datadok uses.

Parameters:

path (str | Path) – The path to append to the endpoint.

Returns:

The URL for the given path.

Return type:

str
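
A small sketch combining url_from_path with test_url (documented above); the archive path is hypothetical:

from fagfunksjoner.data.datadok_extract import test_url, url_from_path

api_url = url_from_path('$UTD/path/to/archive/file2020')  # hypothetical path
if test_url(api_url):
    print('The Datadok-API has metadata for this path')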

fagfunksjoner.data.dicts module

Extra functionality operating on baseline dicts added to this module.

get_key_by_value(data, value)

Searches through the values in a dict for a match, returns the key.

Parameters:
  • data (dict[TypeVar(KeyType, bound= Hashable), Any]) – The data to search through; only the top-most level is searched.

  • value (Any) – The value to look for in the dict

Returns:

A single key if there is a single match on the value,

otherwise a list of keys that share the value.

Return type:

KeyType | list[KeyType]

Raises:

ValueError – If no matches are found on the value.
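
Example usage (a minimal sketch with an ad-hoc dict):

from fagfunksjoner.data.dicts import get_key_by_value

mapping = {'0301': 'Oslo', '1103': 'Stavanger', '5001': 'Trondheim'}
get_key_by_value(mapping, 'Oslo')    # returns '0301'
get_key_by_value(mapping, 'Bergen')  # raises ValueError, no match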

fagfunksjoner.data.pandas_combinations module

The background for these functions is a common operation before publishing to the “statbank” at Statistics Norway.

All combinations (including total groups) over all categorical codes in a set of columns need to have their numbers aggregated. This has functionality similar to “proc means” in SAS.

all_combos_agg(df, groupcols, valuecols=None, aggargs=None, fillna_dict=None, keep_empty=False, grand_total='')

Generate all aggregation levels for a set of columns in a dataframe.

Creates aggregations over all combinations of categorical variables specified in groupcols and applies aggregation functions on valuecols. Allows for inclusion of grand totals and customized fill values for missing groups, similar to “proc means” in SAS.

Parameters:
  • df (DataFrame) – DataFrame to aggregate.

  • groupcols (list[str]) – List of columns to group by.

  • valuecols (list[str] | None) – List of columns to apply aggregation functions on. Defaults to None, in which case all numeric columns are used.

  • aggargs (Callable[[Any], Any] | str | ufunc | Mapping[str, Callable[[Any], Any] | str | ufunc] | dict[str, list[str]] | None) – Dictionary or function specifying aggregation for each column in valuecols. If None, defaults to ‘sum’ for each column in valuecols.

  • fillna_dict (dict[str, Any] | None) – Dictionary specifying values to fill NA in each column of groupcols. Useful for indicating totals in the final table.

  • keep_empty (bool) – If True, preserves empty groups in the output.

  • grand_total (dict[str, str] | str) – Dictionary or string to indicate a grand total row. If a dictionary, the values are applied to the corresponding columns in groupcols.

Returns:

DataFrame with all aggregation levels, including:

  • groupcols: group-by columns with filled total values as needed.

  • level: indicates aggregation level.

  • ways: counts the number of grouping columns used for each aggregation.

Return type:

pd.DataFrame

Examples

>>> data = pd.DataFrame({
        'age': [20, 60, 33, 33, 20],
        'region': ['0301', '3001', '0301', '5401', '0301'],
        'gender': ['1', '2', '1', '2', '2'],
        'income': [1000000, 120000, 220000, 550000, 50000],
        'wealth': [25000, 50000, 33000, 44000, 90000]
    })
>>> all_combos_agg(data, groupcols=['gender', 'age'], aggargs={'income': ['mean', 'sum']})
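
A further sketch on the same data, filling the total groups and adding a grand total row; the label 'Total' is an arbitrary choice:

>>> all_combos_agg(data, groupcols=['gender', 'age'],
        aggargs={'income': ['mean', 'sum']},
        fillna_dict={'gender': 'Total', 'age': 'Total'},
        grand_total='Total')
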
calculate_aggregates(df, combos, aggargs, keep_empty)

Calculate aggregates for each combination of group columns.

Parameters:
  • df (DataFrame) – The dataframe to aggregate.

  • combos (list[tuple[str, ...]]) – List of group column combinations.

  • aggargs (Callable[[Any], Any] | str | ufunc | Mapping[str, Callable[[Any], Any] | str | ufunc] | dict[str, list[str]]) – Aggregation functions to apply.

  • keep_empty (bool) – Whether to keep groups without observations.

Returns:

The dataframe with calculated aggregates for each combination.

Return type:

pd.DataFrame

check_column_arguments(df, groupcols, valuecols=None, aggargs=None)

Validate and set defaults for grouping and aggregation arguments.

Confirms that columns in groupcols and valuecols exist in df, assigns default aggregations if none are provided, and ensures all columns are numeric if aggregations are unspecified.

Parameters:
  • df (DataFrame) – The input DataFrame to check.

  • groupcols (list[str]) – List of column names to group by.

  • valuecols (list[str] | None) – List of columns to aggregate. Defaults to None, in which case all numeric columns are used or the keys of aggargs if provided.

  • aggargs (Callable[[Any], Any] | str | ufunc | Mapping[str, Callable[[Any], Any] | str | ufunc] | dict[str, list[str]] | None) – Aggregation functions for valuecols. Defaults to ‘sum’ for all numeric columns.

Returns:

  • required_columns: List of columns needed for grouping and aggregation.

  • aggargs: Updated aggregation functions for each column in valuecols.

Return type:

tuple

Raises:
  • ValueError – If a column in groupcols or valuecols is not in df.

  • ValueError – If any column in valuecols is non-numeric and lacks an aggregation function.

Example

>>> data = pd.DataFrame({
        'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]
    })
>>> check_column_arguments(data, groupcols=['A'], valuecols=['B', 'C'])
(['A', 'B', 'C'], {'B': 'sum', 'C': 'sum'})
fill_na_dict(df, mapping)

Fills NAs in the passed dataframe with a dict.

Keys in the dict should be column names; the values are what should be inserted into the cells. Also handles categorical columns if they exist in the dataframe.

Parameters:
  • df (DataFrame) – The DataFrame to fill NAs on.

  • mapping (dict[str, Any]) – What each of the columns should have their NAs filled with.

Returns:

The DataFrame with filled NAs.

Return type:

pd.DataFrame
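
A minimal sketch with an ad-hoc dataframe:

import pandas as pd
from fagfunksjoner.data.pandas_combinations import fill_na_dict

df = pd.DataFrame({'region': ['0301', None], 'income': [1000.0, None]})
filled = fill_na_dict(df, {'region': 'Total', 'income': 0.0})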

finalize_dataframe(all_levels, df, groupcols, aggargs, grand_total, fillna_dict, keep_empty)

Finalize the dataframe by calculating the grand total and filling missing values.

Parameters:
  • all_levels (DataFrame) – The dataframe with calculated aggregates.

  • df (DataFrame) – The original dataframe.

  • groupcols (list[str]) – List of columns to group by.

  • aggargs (Callable[[Any], Any] | str | ufunc | Mapping[str, Callable[[Any], Any] | str | ufunc] | dict[str, list[str]]) – Aggregation functions to apply.

  • grand_total (dict[str, str] | str) – Value(s) to use for the grand total row.

  • fillna_dict (dict[str, Any] | None) – Values to fill in missing data.

  • keep_empty (bool) – Whether to keep groups without observations.

Returns:

Final DataFrame with all aggregations and filled values.

Return type:

pd.DataFrame

flatten_col_multiindex(df, sep='_')

Flatten the columns of a dataframe if they are a MultiIndex.

Flattens it by combining the level names of the MultiIndex, using the separator (sep).

Parameters:
  • df (DataFrame) – The DataFrame with multiindexed columns.

  • sep (str) – What should separate the names of the levels in the multiindex. Defaults to “_”.

Returns:

The DataFrame with the flattened column headers.

Return type:

pd.DataFrame
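
A sketch on a dataframe whose columns become a MultiIndex after a groupby-aggregation:

import pandas as pd
from fagfunksjoner.data.pandas_combinations import flatten_col_multiindex

wide = (pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1, 2, 3]})
        .groupby('g')
        .agg({'v': ['sum', 'mean']}))
flat = flatten_col_multiindex(wide)  # columns ('v', 'sum'), ('v', 'mean') become 'v_sum', 'v_mean'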

handle_grand_total(all_levels, df, groupcols, grand_total, aggargs)

Handle the totals of groupcols, in addition to a grand total for the whole dataset.

Parameters:
  • all_levels (DataFrame) – The inherited dataset from the previous step.

  • df (DataFrame) – The original dataframe.

  • groupcols (list[str]) – List of columns to group by.

  • grand_total (dict[str, str] | str) – Value(s) to use for the grand total row.

  • aggargs (Callable[[Any], Any] | str | ufunc | Mapping[str, Callable[[Any], Any] | str | ufunc] | dict[str, list[str]]) – Aggregation functions to apply.

Returns:

The modified dataset, now containing the grand totals.

Return type:

pd.DataFrame

Raises:

ValueError – If ‘grand_total’ is not a string or a dictionary.

prepare_combinations(groupcols)

Generate all possible combinations of group columns.

Parameters:

groupcols (list[str]) – List of columns to group by.

Return type:

list[tuple[str, ...]]

Returns:

List of tuples representing all group column combinations.

prepare_dataframe(df, groupcols, collist)

Prepare DataFrame by selecting necessary columns and setting empty groups.

Parameters:
  • df (DataFrame) – The dataframe to process.

  • groupcols (list[str]) – List of columns to group by.

  • collist (list[str]) – List of all required columns for aggregation.

Return type:

DataFrame

Returns:

The DataFrame with required columns, optionally converted to category dtype.

fagfunksjoner.data.pandas_dtypes module

Automatically changes dtypes on pandas dataframes using logic.

Tries to keep object columns as strings if they are numeric but have leading zeros. Downcasts ints to the smallest size. Changes suitable columns to categoricals. The function you most likely want is “auto_dtype”.

auto_dtype(df, cardinality_threshold=0, copy_df=True, show_memory=True)

Clean up a dataframe's dtypes.

First lowercases all column names. Tries to decode byte strings to utf8. Runs pandas’ convert_dtypes(). Tries to convert objects to strings, and strips empty spaces. Downcasts ints to smaller int types. If cardinality_threshold is set above 0, converts object and string columns to categoricals if the number of unique values in the column is below the threshold.

Parameters:
  • df (DataFrame) – The dataframe to manipulate

  • cardinality_threshold (int) – Columns with fewer unique values than this threshold are converted to categoricals. Defaults to 0, meaning no conversion to categoricals.

  • copy_df (bool) – The reverse of inplace; make a copy in memory. This may have a memory impact, but is safer. Defaults to True.

  • show_memory (bool) – Show the user how much memory was saved by doing the conversion, does require some processing. Defaults to True.

Returns:

The dataframe with cleaned-up dtypes.

Return type:

pd.DataFrame
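
Example usage (a hedged sketch; the threshold value is arbitrary):

import pandas as pd
from fagfunksjoner.data.pandas_dtypes import auto_dtype

df = pd.DataFrame({'NR': ['001', '002'], 'Fylke': [b'Oslo', b'Viken'], 'antall': [10, 20]})
cleaned = auto_dtype(df, cardinality_threshold=5)
# column names are lowercased, byte strings decoded, ints downcast,
# and low-cardinality columns converted to categoricals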

categories_threshold(df, cardinality_threshold=0, copy_df=True)

Convert to categoricals using a threshold of unique values.

Parameters:
  • df (DataFrame) – The dataframe to convert to categoricals on.

  • cardinality_threshold (int) – Columns with fewer unique values than this threshold are converted to categoricals. Defaults to 0, meaning no conversion to categoricals.

  • copy_df (bool) – The reverse of inplace; make a copy in memory. This may have a memory impact, but is safer. Defaults to True.

Returns:

The dataframe with converted columns to categoricals.

Return type:

pd.DataFrame

decode_bytes(df, copy_df=True, check_row_len=50)

Check whether object columns contain bytes, and if so attempt to convert them to real utf8 strings.

Parameters:
  • df (DataFrame) – The dataframe to check.

  • copy_df (bool) – The reverse of inplace; make a copy in memory. This may have a memory impact, but is safer. Defaults to True.

  • check_row_len (int) – How many rows to look for byte content in; this conserves processing, but might miss columns if set too low. Defaults to 50.

Returns:

The dataframe with converted byte-columns to string-columns.

Return type:

pd.DataFrame

dtype_set_from_json(df, json_path)

Use a stored json to change the dtypes of a dataframe to match what was stored.

Parameters:
  • df (DataFrame) – The Dataframe to manipulate towards the stored dtypes.

  • json_path (str) – The json file containing the dtypes of the columns.

Returns:

The manipulated dataframe, with newly set dtypes.

Return type:

pd.DataFrame

dtype_store_json(df, json_path)

Store the dtypes of a dataframe's columns as json for later reference.

Parameters:
  • df (DataFrame) – The dataframe to look at for column names and dtypes.

  • json_path (str) – The path to the json file to store the dtypes in.

Return type:

None
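
A roundtrip sketch with a hypothetical json path, covering dtype_store_json together with dtype_set_from_json above:

import pandas as pd
from fagfunksjoner.data.pandas_dtypes import dtype_set_from_json, dtype_store_json

df = pd.DataFrame({'kommune': ['0301', '1103'], 'antall': [5, 7]})
dtype_store_json(df, 'dtypes.json')                  # store the current dtypes

fresh = pd.DataFrame({'kommune': ['0301', '1103'], 'antall': [5, 7]})
fresh = dtype_set_from_json(fresh, 'dtypes.json')    # reapply the stored dtypes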

object_to_strings(df, copy_df=True)

Convert columns that are still “object” to pyarrow strings.

Parameters:
  • df (DataFrame) – The dataframe to manipulate.

  • copy_df (bool) – The reverse of inplace; make a copy in memory. This may have a memory impact, but is safer. Defaults to True.

Returns:

The modified dataframe.

Return type:

pd.DataFrame

smaller_ints(df, copy_df=True)

Downcasts ints to smaller int-dtypes to conserve space.

Parameters:
  • df (DataFrame) – The dataframe to manipulate.

  • copy_df (bool) – The reverse of inplace; make a copy in memory. This may have a memory impact, but is safer. Defaults to True.

Returns:

The manipulated dataframe.

Return type:

pd.DataFrame

strings_to_int(df, copy_df=True)

Checks string columns to see if their content can be converted safely to ints.

This can conserve a lot of storage and memory.

Parameters:
  • df (DataFrame) – The dataframe to manipulate.

  • copy_df (bool) – The reverse of inplace; make a copy in memory. This may have a memory impact, but is safer. Defaults to True.

Returns:

The manipulated dataframe.

Return type:

pd.DataFrame

fagfunksjoner.data.pyarrow module

cast_pyarrow_table_schema(data, schema)

Set the correct schema on a pyarrow Table, especially when a dictionary datatype is wanted.

Parameters:
  • data (Table) – The pyarrow table data

  • schema (Schema) – The wanted schema to cast onto the table data. All columns in the pyarrow table must be present in the schema. The order of the columns in the schema will be used.

Returns:

A new pyarrow table with correct schema.

Return type:

pa.Table
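
A minimal sketch, assuming a dictionary-encoded column is wanted:

import pyarrow as pa
from fagfunksjoner.data.pyarrow import cast_pyarrow_table_schema

data = pa.table({'kommune': ['0301', '0301', '1103'], 'antall': [1, 2, 3]})
schema = pa.schema([
    pa.field('kommune', pa.dictionary(pa.int32(), pa.string())),
    pa.field('antall', pa.int64()),
])
casted = cast_pyarrow_table_schema(data, schema)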

restructur_pyarrow_schema(inuse_schema, wanted_schema)

Reorder and set the schema you want to fit the in-use schema.

The column names in the in-use schema must be present in the wanted schema. They should preferably have the same datatype, though not necessarily the same datatype settings, especially when it comes to DictionaryType. If the datatypes differ, the wanted schema's datatype is used; if DictionaryType is present in that case, you must change your datatypes before casting this new schema.

Parameters:
  • inuse_schema (Schema) – The schema in use by your pyarrow dataset or table.

  • wanted_schema (Schema) – The schema that you want, but which is not in the same order as the schema in use.

Returns:

A new pyarrow schema that has the same order as the in-use schema,

but with the correct datatypes from the wanted schema.

Return type:

pa.Schema
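
A sketch showing a wanted schema being reordered to match the column order of the schema in use:

import pyarrow as pa
from fagfunksjoner.data.pyarrow import restructur_pyarrow_schema

inuse = pa.schema([pa.field('antall', pa.int64()), pa.field('kommune', pa.string())])
wanted = pa.schema([
    pa.field('kommune', pa.dictionary(pa.int32(), pa.string())),
    pa.field('antall', pa.int64()),
])
new_schema = restructur_pyarrow_schema(inuse, wanted)  # order from inuse, datatypes from wanted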

fagfunksjoner.data.view_dataframe module

filter_display(dataframe, column, value, operator)

Filter data based on args, and display the result.

Parameters:
  • dataframe (DataFrame) – The DataFrame to filter.

  • column (str) – Column to base filter on.

  • value (str | int | float | tuple[str | int | float, ...]) – The value to compare filter against.

  • operator (str) – How to compare column against value.

Returns:

None; the function only has visual side-effects.

Return type:

None

Raises:

TypeError – On combinations of value and operator we can’t handle.

view_dataframe(dataframe, column, operator='==', unique_limit=100)

Display an interactive widget for filtering and viewing data in a DataFrame based on selection of values in one column.

Parameters:
  • dataframe (DataFrame) – The DataFrame containing the data to be filtered.

  • column (str) – The column in the DataFrame to be filtered.

  • operator (str) – The comparison operator for filtering (may be altered during the display). Options: ‘==’, ‘!=’, ‘>=’, ‘>’, ‘<’, ‘<=’. Default: ‘==’.

  • unique_limit (int) – The maximum number of unique values in the column for using ‘==’ or ‘!=’ operators. Default: 100.

Returns:

An interactive widget for filtering and viewing data based on the specified criteria.

The ‘==’ and ‘!=’ operators use a dropdown list for multiple selection. The other (interval) operators use a slider.

Return type:

widgets.interactive
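
A notebook usage sketch with an ad-hoc dataframe:

import pandas as pd
from fagfunksjoner.data.view_dataframe import view_dataframe

df = pd.DataFrame({'region': ['0301', '1103', '5001'], 'income': [100, 200, 300]})
view_dataframe(df, column='region', operator='==')  # dropdown for multiple selection
view_dataframe(df, column='income', operator='>=')  # slider for an interval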

Module contents

This module contains functionality around manipulating standalone data, often as pandas dataframes.