nudb_use.quality package

Submodules

nudb_use.quality.check_bool_string_columns module

Checks for string columns containing boolean-like literal values.

check_bool_string_columns(df, raise_errors=True)

Detect string columns that contain literal boolean values.

Parameters:
  • df (DataFrame) – DataFrame to inspect.

  • raise_errors (bool) – When True, raise grouped errors if violations are found.

Returns:

Errors describing columns with boolean-like string literals, or an empty list when none are found.

Return type:

list[NudbQualityError]
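
As a rough illustration of the idea, the sketch below flags columns whose string values include boolean-like literals. It is not the library's implementation: a plain dict of lists stands in for the DataFrame so the example runs without pandas or nudb_use, and both the helper name find_bool_string_columns and the set of literals are assumptions.

```python
# Hypothetical set of literals; the library's actual list is not documented here.
BOOL_LITERALS = {"true", "false"}


def find_bool_string_columns(columns):
    """Return names of columns whose string values include boolean-like literals."""
    flagged = []
    for name, values in columns.items():
        # Only string values are considered; real booleans and numbers are ignored.
        strings = [v for v in values if isinstance(v, str)]
        if any(s.strip().lower() in BOOL_LITERALS for s in strings):
            flagged.append(name)
    return flagged


# A dict of lists stands in for a DataFrame (column name -> values).
data = {
    "is_active": ["True", "False", "True"],  # booleans stored as text
    "name": ["Ada", "Grace", "Linus"],
    "count": [1, 2, 3],
}
flagged = find_bool_string_columns(data)
print(flagged)
```

Only is_active is reported here, since it is the only column that stores booleans as text.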

nudb_use.quality.check_drop_cols module

Ensure requested drop columns do not collide with configured variables.

check_drop_cols_for_valid_cols(drop_cols, ignores=None, raise_errors=False)

Warn when requested drop columns overlap with defined valid columns.

Parameters:
  • drop_cols (list[str]) – Column names that are about to be dropped.

  • ignores (list[str] | str | None) – Optional iterable or single column name that should be ignored when computing the overlap.

  • raise_errors (bool) – When True, raise a NudbQualityError instead of returning it. Defaults to False.

Returns:

Error describing the overlapping columns, or None when no problematic columns are found.

Return type:

NudbQualityError | None

Raises:

NudbQualityError – Raised when overlaps exist and raise_errors is True.
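
The overlap logic can be pictured with a small stand-alone sketch. The helper overlapping_drop_cols and the column names are hypothetical, and the list of valid columns is passed in explicitly here, whereas the documented function takes it from its configuration.

```python
def overlapping_drop_cols(drop_cols, valid_cols, ignores=None):
    """Return the drop columns that are also valid columns, minus ignored names."""
    # Accept None, a single name, or any iterable of names for `ignores`.
    if ignores is None:
        ignored = set()
    elif isinstance(ignores, str):
        ignored = {ignores}
    else:
        ignored = set(ignores)
    valid = set(valid_cols)
    return [col for col in drop_cols if col in valid and col not in ignored]


# Hypothetical column names, for illustration only.
valid_cols = ["person_id", "school_year", "program_code"]
overlap = overlapping_drop_cols(
    ["person_id", "tmp_col", "program_code"], valid_cols, ignores="person_id"
)
print(overlap)
```

A non-empty result corresponds to the case where the documented function returns or raises an error.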

nudb_use.quality.colored_views module

Style helpers that colorize quality summary tables.

empty_cols_in_time_colored(df, time_col)

Highlight columns that are empty for entire periods.

Parameters:
  • df (DataFrame) – DataFrame to summarize.

  • time_col (str) – Column to group by when calculating emptiness.

Returns:

Styled output with red/green highlights.

Return type:

Styler
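
The summary behind the highlighting can be sketched without the styling layer. The example below is an assumption-laden stand-in: it uses a list of dicts instead of a DataFrame, the helper name empty_columns_by_period is hypothetical, and it only computes which columns are entirely empty per period (the part that would be colored), not the Styler itself.

```python
from collections import defaultdict


def empty_columns_by_period(rows, time_col):
    """For each period, report whether each other column is entirely missing."""
    by_period = defaultdict(list)
    for row in rows:
        by_period[row[time_col]].append(row)

    summary = {}
    for period, group in by_period.items():
        # Assumption: every row carries the same keys as the first row in its group.
        columns = [c for c in group[0] if c != time_col]
        summary[period] = {
            c: all(r.get(c) is None for r in group) for c in columns
        }
    return summary


rows = [
    {"year": 2021, "score": None, "grade": "A"},
    {"year": 2021, "score": None, "grade": "B"},
    {"year": 2022, "score": 4.5, "grade": None},
    {"year": 2022, "score": None, "grade": "C"},
]
summary = empty_columns_by_period(rows, "year")
print(summary)
```

Here score is empty for all of 2021, which is the kind of cell that would be highlighted as a problem.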

grade_cell_by_time_col(df, time_col)

Colorize per-period completeness percentages.

Parameters:
  • df (DataFrame) – DataFrame to summarize.

  • time_col (str) – Column representing the grouping period.

Returns:

Styled completeness summary for display.

Return type:

Styler

nudb_use.quality.duplicated_columns module

Checks for duplicated DataFrame columns and reports them as errors.

check_duplicated_columns(df)

Return quality errors for duplicated columns in a DataFrame.

Parameters:

df (DataFrame) – DataFrame to inspect.

Returns:

Errors summarizing each duplicated column.

Return type:

list[NudbQualityError]
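
Finding duplicated column names is essentially a counting problem. A minimal sketch, assuming the check works on column labels only (the helper name duplicated_column_names is hypothetical and the library may detect duplicates differently):

```python
from collections import Counter


def duplicated_column_names(column_names):
    """Return a mapping of each repeated column name to how often it occurs."""
    counts = Counter(column_names)
    return {name: n for name, n in counts.items() if n > 1}


dupes = duplicated_column_names(["id", "year", "id", "code", "year", "id"])
print(dupes)
```

Each key in the result would correspond to one reported error.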

nudb_use.quality.missing module

Checks ensuring NUDB datasets respect missing-value thresholds.

check_columns_only_missing(df, raise_errors=True)

Identify columns that consist entirely of missing values.

Parameters:
  • df (DataFrame) – DataFrame to inspect.

  • raise_errors (bool) – When True, raise grouped errors if violations are found.

Returns:

Errors describing columns that contain only missing values, or an empty list when every column has data.

Return type:

list[NudbQualityError]

check_missing_thresholds_dataset_name(df, dataset_name, raise_errors=True)

Validate a dataset against the configured missing-value thresholds.

Parameters:
  • df (DataFrame) – DataFrame to validate.

  • dataset_name (str) – Name of the dataset whose threshold config should be used.

  • raise_errors (bool) – When True, raise grouped errors if violations are found.

Returns:

Errors describing columns that exceed their thresholds, or an empty list when all limits are met.

Return type:

list[NudbQualityError]

check_non_missing(df, cols_not_empty, raise_errors=True)

Ensure the provided columns never contain missing values.

Parameters:
  • df (DataFrame) – DataFrame to inspect.

  • cols_not_empty (list[str]) – Column names that must be fully populated.

  • raise_errors (bool) – When True, raise grouped errors if violations are found.

Returns:

Errors describing columns that contain missing values, or an empty list if all columns are complete.

Return type:

list[NudbQualityError]

df_within_missing_thresholds(df, thresholds=None, raise_errors=True)

Check whether each column respects its configured missing-value threshold.

Parameters:
  • df (DataFrame) – DataFrame providing the values to inspect.

  • thresholds (dict[str, float] | None) – Mapping of column names to allowed missing-value percentages.

  • raise_errors (bool) – When True, raise grouped errors if violations are found.

Returns:

Errors describing columns that exceed their thresholds, or an empty list when all limits are met.

Return type:

list[NudbQualityError]
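
The threshold comparison can be sketched as follows. This is an illustration under stated assumptions, not the library's code: a dict of lists stands in for the DataFrame, None marks a missing value, thresholds are percentages in the range 0 to 100, and columns named in thresholds but absent from the data are skipped.

```python
def missing_percent(values):
    """Percentage of None entries in a list (0.0 for an empty list)."""
    if not values:
        return 0.0
    return 100.0 * sum(v is None for v in values) / len(values)


def columns_over_threshold(columns, thresholds):
    """Map each column that exceeds its allowed missing percentage to its actual percentage."""
    over = {}
    for name, allowed in thresholds.items():
        if name not in columns:
            continue  # assumption: columns absent from the data are skipped
        actual = missing_percent(columns[name])
        if actual > allowed:
            over[name] = actual
    return over


data = {"a": [1, None, 3, 4], "b": [None, None, 3, 4]}
violations = columns_over_threshold(data, {"a": 30.0, "b": 30.0})
print(violations)
```

Column a is 25% missing and passes the 30% limit, while b is 50% missing and is reported.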

empty_percents_over_columns(df, group_cols=None)

Check the percentage of empty values in specified columns in a DataFrame.

Parameters:
  • df (DataFrame) – DataFrame to check columns in.

  • group_cols (str | list[str] | None) – List of columns to check for percentage of empty values.

Returns:

DataFrame with the percentage of empty values for each column.

Return type:

pd.DataFrame

get_thresholds_from_config(dataset_name)

Retrieve percentage completion threshold values for a given dataset from config.

Parameters:

dataset_name (str) – Name of the dataset to retrieve threshold values for.

Returns:

Dictionary mapping variable names to specified percentage completion threshold values.

Return type:

dict[str, float]

last_period_within_thresholds(df, period_col, thresholds=None, raise_errors=True)

Validate that the latest period satisfies missing-value thresholds.

Parameters:
  • df (DataFrame) – DataFrame containing the data to evaluate.

  • period_col (str) – Column identifying the period dimension.

  • thresholds (dict[str, float] | None) – Mapping of column names to allowed missing-value percentages.

  • raise_errors (bool) – When True, raise grouped errors if violations are found.

Returns:

Errors describing columns that exceed thresholds in the most recent period, or an empty list if all pass.

Return type:

list[NudbQualityError]
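
Restricting the check to the latest period matters because an overall average can hide a recent drop in completeness. A minimal sketch of that idea, assuming the latest period is simply the maximum value of period_col and using a list of dicts in place of a DataFrame (helper names are hypothetical):

```python
def last_period_rows(rows, period_col):
    """Keep only the rows belonging to the largest period value."""
    latest = max(row[period_col] for row in rows)
    return [row for row in rows if row[period_col] == latest]


def missing_percent_by_column(rows, columns):
    """Percentage of None values per column across the given rows."""
    total = len(rows)
    return {
        c: 100.0 * sum(r.get(c) is None for r in rows) / total for c in columns
    }


rows = [
    {"year": 2022, "grade": "A"},
    {"year": 2022, "grade": "B"},
    {"year": 2023, "grade": None},
    {"year": 2023, "grade": "C"},
]
latest = last_period_rows(rows, "year")
percents = missing_percent_by_column(latest, ["grade"])
# Allow at most 25% missing values in the latest period.
failing = [c for c, p in percents.items() if p > 25.0]
print(percents, failing)
```

Across all four rows grade is 25% missing and would pass a 25% limit, but within 2023 alone it is 50% missing and fails.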

nudb_use.quality.outdated_variables module

Checks that columns marked as outdated are absent from datasets.

check_outdated_variables(df)

Return errors for columns marked as outdated in config.

Parameters:

df (DataFrame) – DataFrame to inspect.

Returns:

Errors describing each outdated column present.

Return type:

list[NudbQualityError]

find_outdated_variables_in_df(df)

Return metadata for outdated variables present in a DataFrame.

Parameters:

df (DataFrame) – DataFrame to inspect.

Returns:

Mapping from variable name to metadata entries.

Return type:

dict[str, Variable]

nudb_use.quality.suite module

High-level orchestration for NUDB quality checks.

run_quality_suite(df, dataset_name, data_time_start=None, data_time_end=None, raise_errors=True, **kwargs)

Run the full NUDB quality suite over a dataset.

Parameters:
  • df (DataFrame) – DataFrame to validate.

  • dataset_name (str) – Name of the dataset in config; selects which part of the config supplies the values used in the validations.

  • data_time_start (str | None) – Optional start date used by codelist validations.

  • data_time_end (str | None) – Optional end date used by codelist validations.

  • raise_errors (bool) – When True, raise grouped exceptions if any check fails.

  • **kwargs (object) – Additional keyword arguments forwarded to specific checks.

Returns:

All collected quality errors, or an empty sequence when every check passes.

Return type:

Sequence[Exception]

Raises:

TypeError – If the first parameter df is not a pandas DataFrame.
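
The collect-then-raise pattern described here can be sketched generically. Everything in this example is hypothetical: GroupedQualityError stands in for whatever grouped exception the library raises, the two checks are toy functions, and a dict of lists replaces the DataFrame.

```python
class GroupedQualityError(Exception):
    """Hypothetical container for several collected errors."""

    def __init__(self, errors):
        super().__init__(f"{len(errors)} quality check(s) failed")
        self.errors = list(errors)


def empty_column_check(data):
    """Toy check: one error per column that holds only None."""
    return [
        ValueError(f"column {name!r} contains only missing values")
        for name, values in data.items()
        if all(v is None for v in values)
    ]


def bool_string_check(data):
    """Toy check: one error per column that holds 'True' or 'False' as text."""
    return [
        ValueError(f"column {name!r} holds boolean-like strings")
        for name, values in data.items()
        if any(v in ("True", "False") for v in values)
    ]


def run_checks(data, checks, raise_errors=True):
    """Run every check, collect all errors, then raise them together or return them."""
    errors = []
    for check in checks:
        errors.extend(check(data))
    if errors and raise_errors:
        raise GroupedQualityError(errors)
    return errors


data = {"a": [None, None], "b": ["True", "x"], "c": [1, 2]}
checks = [empty_column_check, bool_string_check]
collected = run_checks(data, checks, raise_errors=False)
print([str(e) for e in collected])
```

Running every check before raising means a single run surfaces all problems, rather than stopping at the first failure.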

nudb_use.quality.thresholds module

Threshold helpers for completeness and fill-rate validations.

filled_value_to_threshold(col, value, threshold_lower, raise_error=True)

Ensure the proportion of specific values stays above a threshold.

Parameters:
  • col (Series) – Series to inspect.

  • value (Iterable[object] | object) – Single value or iterable of values that must meet the threshold.

  • threshold_lower (float) – Minimum allowed percentage of matching values.

  • raise_error (bool) – When True, raise NudbQualityError if the threshold is not met.

Returns:

Error describing the shortage when the threshold is not met (if raise_error is False), otherwise None.

Return type:

NudbQualityError | None

Raises:

NudbQualityError – If the percentage of matching values is below the threshold while raise_error is True.
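
A small sketch of the raise-or-return behaviour, under these assumptions: a plain list replaces the Series, ValueError stands in for NudbQualityError, the threshold is a percentage from 0 to 100, and a value exactly equal to the threshold passes. The helper names are hypothetical.

```python
from collections.abc import Iterable


def matching_percent(values, wanted):
    """Percentage of entries in `values` equal to one of the wanted values."""
    # A string is treated as one value, not as an iterable of characters.
    if isinstance(wanted, str) or not isinstance(wanted, Iterable):
        wanted = [wanted]
    wanted = list(wanted)
    if not values:
        return 0.0
    return 100.0 * sum(v in wanted for v in values) / len(values)


def check_value_threshold(values, wanted, threshold_lower, raise_error=True):
    """Return None when the threshold is met; otherwise raise or return an error."""
    percent = matching_percent(values, wanted)
    if percent >= threshold_lower:
        return None
    error = ValueError(
        f"only {percent:.1f}% of values match, below the required {threshold_lower}%"
    )
    if raise_error:
        raise error
    return error


values = ["M", "F", "F", None]
ok = check_value_threshold(values, ["M", "F"], 70.0)
problem = check_value_threshold(values, ["M", "F"], 80.0, raise_error=False)
print(ok, problem)
```

Three of four values match (75%), so the 70% threshold passes and the 80% threshold produces an error object.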

non_empty_to_threshold(col, threshold_lower, raise_error=True)

Ensure the proportion of non-empty values stays above a threshold.

Parameters:
  • col (Series) – Series to inspect.

  • threshold_lower (float) – Minimum allowed percentage of non-empty values.

  • raise_error (bool) – When True, raise NudbQualityError if the threshold is not met.

Returns:

Error describing the shortage when the threshold is not met (if raise_error is False), otherwise None.

Return type:

NudbQualityError | None

Raises:

NudbQualityError – If the percentage of non-empty values is below the threshold while raise_error is True.

nudb_use.quality.values module

Utilities for inspecting column fill rates and unexpected values.

get_fill_amount_per_column(df)

Calculate the percentage of filled (non-null) values per column.

Parameters:

df (DataFrame) – DataFrame whose columns should be summarized.

Returns:

Mapping of column name to percentage of filled cells.

Return type:

dict[str, float]

values_not_in_column(col, values, raise_error=False)

Check whether certain values appear inside a column.

Parameters:
  • col (Series) – Series to inspect.

  • values (Sequence[object] | object) – Values that must not appear in the column; may be a single value or a list.

  • raise_error (bool) – When True, raise ValueError immediately when matches occur.

Returns:

None when the column is clean, otherwise a ValueError describing the unexpected values.

Return type:

None | ValueError

Raises:

ValueError – If forbidden values are found and raise_error is True.
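
The return-or-raise contract can be sketched as below, reading values as the entries that must not occur, as the Raises note implies. A plain list replaces the Series, and the helper name check_forbidden and the sample codes are hypothetical.

```python
def check_forbidden(values, forbidden, raise_error=False):
    """Return None when no forbidden value occurs; otherwise return or raise a ValueError."""
    # Accept a single value as well as a list of values.
    if isinstance(forbidden, (str, bytes)) or not hasattr(forbidden, "__iter__"):
        forbidden = [forbidden]
    forbidden = list(forbidden)

    # Collect each offending value once, in order of first appearance.
    found = []
    for v in values:
        if v in forbidden and v not in found:
            found.append(v)

    if not found:
        return None
    error = ValueError(f"Unexpected values found: {found}")
    if raise_error:
        raise error
    return error


codes = ["0101", "9999", "0202", "9999"]
clean = check_forbidden(codes, ["0000"])
dirty = check_forbidden(codes, "9999")
print(clean, dirty)
```

With raise_error left at False the caller receives the exception object and can decide how to report it.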

nudb_use.quality.widths module

Validation helpers that ensure column values follow expected widths.

check_column_widths(df, widths=None, raise_errors=True)

Validate that string lengths in each column match expected widths.

Note: ignore_na is currently unused.

Parameters:
  • df (DataFrame) – DataFrame to inspect.

  • widths (dict[str, list[int]] | None) – Optional mapping of column names to allowed string lengths. When omitted or malformed, definitions are loaded from config.

  • raise_errors (bool) – When True, raise grouped errors if mismatches are found.

Returns:

Errors describing columns whose values are outside the allowed width definitions, or an empty list when all pass.

Return type:

list[NudbQualityError]
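
The shape of the widths mapping and the comparison it drives can be illustrated with a short sketch. It is not the library's implementation: a dict of lists replaces the DataFrame, non-string values are skipped (an assumption), and the helper name and column names are hypothetical.

```python
def width_violations(columns, widths):
    """Map each column to the sorted string lengths that are not in its allowed list."""
    violations = {}
    for name, allowed in widths.items():
        bad = sorted(
            {
                len(v)
                for v in columns.get(name, [])
                if isinstance(v, str) and len(v) not in allowed
            }
        )
        if bad:
            violations[name] = bad
    return violations


data = {
    "postal_code": ["0150", "5003", "712"],  # "712" lost its leading zero
    "county": ["03", "46", None],
}
violations = width_violations(data, {"postal_code": [4], "county": [2]})
print(violations)
```

A typical cause of such violations is a code column read as a number, which drops leading zeros.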