nudb_use.variables package

Subpackages

nudb_use.variables.checks module

Validation utilities for ensuring variable schemas match expectations.

check_cols_against_klass_codelists(df, col_codelist=None)

Validate DataFrame values against KLASS codelists.

Return type:

None

Parameters:

df (DataFrame)
col_codelist (dict[str, list[str] | dict[str, str]] | None)
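
As a rough illustration of this kind of check, the sketch below compares column values against explicitly supplied codelists using plain pandas. The helper `find_invalid_codes`, the column name `kjoenn`, and the codes are all hypothetical and not part of nudb_use. The sketch never contacts KLASS, and what the real function does when col_codelist is None is not covered here.

```python
import pandas as pd


def find_invalid_codes(df: pd.DataFrame, col_codelist: dict) -> dict:
    """Hypothetical helper: map each column to its values missing from the codelist."""
    invalid = {}
    for col, codes in col_codelist.items():
        # A codelist may be a list of codes or a dict of code -> label.
        # Iterating a dict yields its keys, so set() covers both shapes.
        valid = set(codes)
        observed = df[col].dropna()
        bad = sorted(set(observed[~observed.isin(valid)]))
        if bad:
            invalid[col] = bad
    return invalid


# Made-up column and codes, for illustration only.
df = pd.DataFrame({"kjoenn": ["1", "2", "9", None]})
result = find_invalid_codes(df, {"kjoenn": {"1": "Mann", "2": "Kvinne"}})
```

Here `result` maps `kjoenn` to the single unexpected code `"9"`. Missing values are skipped rather than reported, which is a choice made for this sketch.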

check_column_presence(df, dataset_name=None, check_for=None, raise_errors=True)

Validate columns against config or a supplied list.

Return type:

list[Exception]

Parameters:

df (DataFrame)
dataset_name (str | None)
check_for (None | list[str])
raise_errors (bool)
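
The interplay between the returned list of exceptions and raise_errors could look roughly like the sketch below. It only handles an explicit list of expected columns. The helper `missing_column_errors` and the choice of KeyError are assumptions, and the config lookup via dataset_name is not reproduced.

```python
import pandas as pd


def missing_column_errors(df, check_for, raise_errors=True):
    """Hypothetical helper: build one KeyError per expected column that is absent."""
    errors = [
        KeyError(f"Missing column: {col}")
        for col in check_for
        if col not in df.columns
    ]
    if errors and raise_errors:
        # Combine all problems into a single exception for the caller.
        raise KeyError("; ".join(str(e.args[0]) for e in errors))
    return errors


df = pd.DataFrame({"nus2000": ["1"]})
errors = missing_column_errors(df, ["nus2000", "utd_skolekom"], raise_errors=False)
```

With raise_errors=False the caller receives the list and can decide how to report it. With the default, the sketch raises as soon as anything is missing.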

identify_cols_not_in_keep_drop_in_paths(paths, cols_keep, cols_drop, raise_error_found=False)

Identify columns present in data files that are missing from keep/drop lists.

Return type:

set[str]

Parameters:

paths (list[Path])
cols_keep (list[str])
cols_drop (list[str])
raise_error_found (bool)
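
The core of such a check is a set difference. The sketch below shows it on column names that have already been read, so it takes a dict of file name to column list instead of paths. The helper `cols_not_in_keep_drop`, the file names, `extra_col`, and the use of ValueError are hypothetical.

```python
def cols_not_in_keep_drop(columns_per_file, cols_keep, cols_drop, raise_error_found=False):
    """Hypothetical helper: return columns that appear in neither list."""
    seen = set()
    for columns in columns_per_file.values():
        seen.update(columns)
    # Anything seen in the files but not classified as keep or drop.
    unclassified = seen - set(cols_keep) - set(cols_drop)
    if unclassified and raise_error_found:
        raise ValueError(f"Columns in neither keep nor drop list: {sorted(unclassified)}")
    return unclassified


unclassified = cols_not_in_keep_drop(
    {"a.parquet": ["nus2000", "utd_skolekom"], "b.parquet": ["nus2000", "extra_col"]},
    cols_keep=["nus2000"],
    cols_drop=["utd_skolekom"],
)
```

In this example only `extra_col` is left unclassified, so it is the single member of the returned set.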

pyarrow_columns_from_metadata(path)

Read column names from a Parquet file via metadata only.

Return type:

list[str]

Parameters:

path (str | Path)

nudb_use.variables.cleanup module

Utilities for reorganizing and trimming NUDB datasets.

move_col_after_col(df, col_anchor, col_move_after)

Move a specified column in a DataFrame to immediately follow another specified column.

Parameters:

df (DataFrame) – Input pandas DataFrame.
col_anchor (str) – Name of the column after which the specified column will be moved.
col_move_after (str) – Name of the column to move.

Returns:

New DataFrame with the specified column moved to follow the anchor column.

Return type:

DataFrame
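
The reordering described above can be sketched in plain pandas as follows. The helper `move_after` is a hypothetical re-implementation for illustration and may differ from the library's own code.

```python
import pandas as pd


def move_after(df, col_anchor, col_move_after):
    """Hypothetical re-implementation: reorder columns and return a new DataFrame."""
    cols = [c for c in df.columns if c != col_move_after]
    # Insert the moved column directly after the anchor's position.
    cols.insert(cols.index(col_anchor) + 1, col_move_after)
    return df[cols]


df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
reordered = move_after(df, col_anchor="a", col_move_after="c")
```

After the call, the columns of `reordered` are `a`, `c`, `b`, while `df` keeps its original column order.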

move_content_from_col_to(df, from_col, to_col)

Fill empty (NA) values in to_col with the corresponding values from from_col.

Parameters:

df (DataFrame) – Input pandas DataFrame.
from_col (str) – Column the values are taken from.
to_col (str) – Column the values are moved to.

Returns:

DataFrame with values filled out.

Return type:

DataFrame
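
A minimal pandas sketch of this fill behaviour is shown below. The helper `fill_from` and the column names `old` and `new` are hypothetical. The sketch leaves from_col untouched, and whether the real function also clears or drops that column is not stated here.

```python
import pandas as pd


def fill_from(df, from_col, to_col):
    """Hypothetical helper: fill NA in to_col with the values from from_col."""
    out = df.copy()  # leave the caller's DataFrame unchanged
    out[to_col] = out[to_col].fillna(out[from_col])
    return out


df = pd.DataFrame({"old": ["x", "y", None], "new": [None, "b", None]})
filled = fill_from(df, from_col="old", to_col="new")
```

Only missing values in `new` are filled: the first row takes `"x"` from `old`, the second row keeps its existing `"b"`, and the third stays missing because both columns are empty there.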

sort_all_values(df, priority_cols=None)

Sort the dataset semi-deterministically by sorting on all columns, with the priority columns first.

Notes

If the people analyzing our data do not sort on all columns and instead deduplicate with a semi-random “keep the first of the duplicates” strategy, they depend on the row order of the data we deliver to reproduce the same aggregations across versions.
Providing a semi-deterministic sort order may therefore be helpful.
Even so, anyone whose deduplication depends on a semi-random strategy should strive to make it less random.

Parameters:

df (DataFrame) – The dataframe to be sorted.
priority_cols (list[str] | None) – Columns to sort on first. If None, defaults to ["utd_skoleaar_start", "nus2000", "utd_skolekom"].

Returns:

The sorted pandas DataFrame.

Return type:

DataFrame

Raises:

KeyError – If any of the priority_cols are missing from the DataFrame.
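
The idea of sorting on every column with some columns weighted first can be sketched as below. The helper `sort_on_all_columns` is hypothetical. The order of the remaining columns and the index reset are assumptions made for this sketch, not documented behaviour. The `extra` column and all values are made up.

```python
import pandas as pd

DEFAULT_PRIORITY = ["utd_skoleaar_start", "nus2000", "utd_skolekom"]


def sort_on_all_columns(df, priority_cols=None):
    """Hypothetical helper: sort on priority columns first, then on all others."""
    priority = DEFAULT_PRIORITY if priority_cols is None else list(priority_cols)
    missing = [c for c in priority if c not in df.columns]
    if missing:
        raise KeyError(f"priority_cols not in dataframe: {missing}")
    # Remaining columns keep their original left-to-right order as tie-breakers.
    rest = [c for c in df.columns if c not in priority]
    return df.sort_values(priority + rest).reset_index(drop=True)


df = pd.DataFrame(
    {
        "utd_skoleaar_start": ["2020", "2019", "2020"],
        "nus2000": ["2", "1", "1"],
        "utd_skolekom": ["0301", "0301", "0301"],
        "extra": ["b", "a", "a"],
    }
)
sorted_df = sort_on_all_columns(df)
```

Because every column takes part in the sort, only fully identical rows can tie, so the result does not depend on the incoming row order: sorting `df.iloc[::-1]` gives the same DataFrame.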