nudb_use.variables package

Subpackages

nudb_use.variables.checks module

Validation utilities for ensuring variable schemas match expectations.

check_cols_against_klass_codelists(df, col_codelist=None)

Validate DataFrame values against KLASS codelists.

Return type:

None

Parameters:
  • df (DataFrame)

  • col_codelist (dict[str, list[str] | dict[str, str]] | None)
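As an illustration only, the kind of check described above can be sketched as a comparison between observed column values and the allowed codes. This is not the package's implementation: the helper name find_values_outside_codelists, the column name "kjoenn", and the assumption that a dict-form codelist maps code to label are all hypothetical.

```python
import pandas as pd


def find_values_outside_codelists(df, col_codelist):
    """Hypothetical helper: map each column to the values not in its codelist."""
    outside = {}
    for col, codes in col_codelist.items():
        if col not in df.columns:
            continue  # nothing to validate for columns that are absent
        # Assumption: a dict codelist maps code -> label, so its keys are the codes.
        valid = set(codes)
        observed = set(df[col].dropna().astype(str).unique())
        invalid = sorted(observed - valid)
        if invalid:
            outside[col] = invalid
    return outside


# Made-up example data; "kjoenn" is an illustrative column name.
df = pd.DataFrame({"kjoenn": ["1", "2", "9", None]})
result = find_values_outside_codelists(df, {"kjoenn": {"1": "Mann", "2": "Kvinne"}})
```

The documented function returns None, so it presumably reports problems through logging or exceptions rather than a return value; the sketch returns a dict only to make the result easy to inspect.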

check_column_presence(df, dataset_name=None, check_for=None, raise_errors=True)

Validate columns against config or a supplied list.

Return type:

list[Exception]

Parameters:
  • df (DataFrame)

  • dataset_name (str | None)

  • check_for (None | list[str])

  • raise_errors (bool)
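A minimal sketch of the "collect errors, optionally raise" pattern that the signature suggests, assuming the columns to look for are passed explicitly. The helper name missing_column_errors and the choice of KeyError and ValueError are assumptions; the config lookup via dataset_name is not reproduced here.

```python
import pandas as pd


def missing_column_errors(df, check_for, raise_errors=True):
    """Hypothetical helper: one KeyError per expected column that is absent."""
    errors = [
        KeyError(f"Missing column: {col}")
        for col in check_for
        if col not in df.columns
    ]
    if errors and raise_errors:
        # Assumption: all problems are combined into a single exception.
        raise ValueError("; ".join(str(e) for e in errors))
    return errors


df = pd.DataFrame({"a": [1], "b": [2]})
errors = missing_column_errors(df, ["a", "c"], raise_errors=False)
```

Passing raise_errors=False lets a caller gather every problem before deciding how to react, which matches the list[Exception] return type documented above.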

identify_cols_not_in_keep_drop_in_paths(paths, cols_keep, cols_drop, raise_error_found=False)

Identify columns present in data files that are missing from keep/drop lists.

Return type:

set[str]

Parameters:
  • paths (list[Path])

  • cols_keep (list[str])

  • cols_drop (list[str])

  • raise_error_found (bool)
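The core of this check is a set difference, sketched below under the assumption that the column names of each file have already been read. The helper name cols_not_in_keep_drop and the example column names are hypothetical; reading the files themselves is left out.

```python
def cols_not_in_keep_drop(columns_per_file, cols_keep, cols_drop, raise_error_found=False):
    """Hypothetical helper: columns seen in any file but listed in neither list."""
    known = set(cols_keep) | set(cols_drop)
    unknown = set()
    for columns in columns_per_file:
        unknown |= set(columns) - known
    if unknown and raise_error_found:
        # Assumption: a ValueError is raised; the real exception type is not documented.
        raise ValueError(f"Columns not in keep/drop lists: {sorted(unknown)}")
    return unknown


# Column names per file, as they might be read from two data files.
found = cols_not_in_keep_drop(
    [["snr", "nus2000", "extra"], ["snr", "tmp"]],
    cols_keep=["snr", "nus2000"],
    cols_drop=["tmp"],
)
```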

pyarrow_columns_from_metadata(path)

Read column names from a Parquet file via metadata only.

Return type:

list[str]

Parameters:

  • path (str | Path)

nudb_use.variables.cleanup module

Utilities for reorganizing and trimming NUDB datasets.

move_col_after_col(df, col_anchor, col_move_after)

Move a specified column in a DataFrame to immediately follow another specified column.

Parameters:
  • df (DataFrame) – Input pandas DataFrame.

  • col_anchor (str) – Name of the column after which the specified column will be moved.

  • col_move_after (str) – Name of the column to move.

Returns:

New DataFrame with the specified column moved to follow the anchor column.

Return type:

DataFrame
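A minimal sketch of such a reordering in pandas, assuming both columns exist and differ from each other. The helper name move_after is hypothetical and error handling is not modelled on the package.

```python
import pandas as pd


def move_after(df, col_anchor, col_move_after):
    """Hypothetical helper: return df with col_move_after placed right after col_anchor."""
    # Drop the moving column from the order, then reinsert it after the anchor.
    cols = [c for c in df.columns if c != col_move_after]
    anchor_pos = cols.index(col_anchor)
    cols.insert(anchor_pos + 1, col_move_after)
    return df[cols]


df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
reordered = move_after(df, "a", "c")
```

Selecting with a reordered column list returns a new DataFrame and leaves the input's column order untouched.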

move_content_from_col_to(df, from_col, to_col)

Fill empty values (NA) in one column with values from another column.

Parameters:
  • df (DataFrame) – Input pandas DataFrame.

  • from_col (str) – Column to take values from.

  • to_col (str) – Column whose empty (NA) values are filled.

Returns:

DataFrame with the empty values filled in.

Return type:

DataFrame
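The filling step can be sketched with pandas fillna, as below. The helper name fill_from and the column names are hypothetical. The sketch only fills to_col; whether the real function also clears or drops from_col is not documented, so that is left out.

```python
import pandas as pd


def fill_from(df, from_col, to_col):
    """Hypothetical helper: fill NA values in to_col with the values in from_col."""
    out = df.copy()  # leave the caller's DataFrame unchanged
    out[to_col] = out[to_col].fillna(out[from_col])
    return out


df = pd.DataFrame({"old": ["x", "y", None], "new": [None, "b", None]})
filled = fill_from(df, "old", "new")
```

Rows where both columns are empty stay empty, since there is nothing to copy.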

sort_all_values(df, priority_cols=None)

Sort the dataset semi-deterministically on all columns, giving priority to a chosen few.

Notes

  • If the people analyzing our data do not sort on all columns…

  • and they deduplicate with a semi-random "keep the first duplicate" strategy…

  • then they depend on the order of the values we send them to reproduce the same aggregations across versions.

  • Providing a semi-deterministic sort order may therefore be helpful.

  • Even so, anyone removing duplicates with a semi-random strategy should strive to make it less random.

Parameters:
  • df (DataFrame) – The dataframe to be sorted.

  • priority_cols (list[str] | None) – Columns to sort on first. If None, defaults to ["utd_skoleaar_start", "nus2000", "utd_skolekom"].

Returns:

The sorted pandas DataFrame.

Return type:

DataFrame

Raises:

KeyError – If any of the priority_cols are missing from the input DataFrame.
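A rough sketch of the idea, assuming the priority columns are used as the leading sort keys and every remaining column follows in its existing order as a tie-breaker. The helper name sort_on_all_columns, the ordering of the remaining columns, and the example column "snr" are assumptions, not the package's behaviour.

```python
import pandas as pd

# Default taken from the parameter description above.
DEFAULT_PRIORITY = ["utd_skoleaar_start", "nus2000", "utd_skolekom"]


def sort_on_all_columns(df, priority_cols=None):
    """Hypothetical helper: sort on priority columns first, then on all others."""
    if priority_cols is None:
        priority_cols = DEFAULT_PRIORITY
    missing = [c for c in priority_cols if c not in df.columns]
    if missing:
        raise KeyError(f"priority_cols not in dataframe: {missing}")
    # Assumption: remaining columns break ties in their existing left-to-right order.
    remaining = [c for c in df.columns if c not in priority_cols]
    return df.sort_values(by=list(priority_cols) + remaining)


df = pd.DataFrame(
    {
        "snr": ["b", "a", "c"],
        "utd_skoleaar_start": ["2020", "2020", "2019"],
        "nus2000": ["2", "2", "1"],
        "utd_skolekom": ["0301", "0301", "0301"],
    }
)
sorted_df = sort_on_all_columns(df)
```

Because every column takes part in the sort, rows that tie on the priority columns still end up in a reproducible order, so a downstream "keep first" deduplication picks the same row each time.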