nudb_use.quality package

Subpackages

nudb_use.quality.specific_variables package

nudb_use.quality.check_bool_string_columns module

Checks for string columns containing boolean-like literal values.

check_bool_string_columns(df, raise_errors=True)

Detect string columns that contain literal boolean values.

Parameters:

df (DataFrame) – DataFrame to inspect.
raise_errors (bool) – When True, raise grouped errors if violations are found.

Returns:

Errors describing columns with boolean-like string literals, or an empty list when none are found.

Return type:

list[NudbQualityError]
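
The kind of detection described above can be sketched in plain Python. This is not the library's implementation: the literal set `BOOL_LITERALS`, the helper name `find_bool_string_columns`, and the dict of lists standing in for a DataFrame are all assumptions made for illustration.

```python
# Hypothetical literal set; which literals nudb_use actually checks is not stated here.
BOOL_LITERALS = {"true", "false"}


def find_bool_string_columns(columns: dict[str, list[object]]) -> list[str]:
    """Return names of columns holding at least one boolean-like string."""
    flagged = []
    for name, values in columns.items():
        # Only string cells are compared; real booleans and None are skipped.
        if any(isinstance(v, str) and v.strip().lower() in BOOL_LITERALS for v in values):
            flagged.append(name)
    return flagged


data = {
    "is_active": ["True", "False", None],  # strings, not real booleans
    "name": ["Ola", "Kari", "Per"],
    "flag": [True, False, True],  # genuine booleans are not flagged
}
print(find_bool_string_columns(data))  # ['is_active']
```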

nudb_use.quality.check_drop_cols module

Ensure requested drop columns do not collide with configured variables.

check_drop_cols_for_valid_cols(drop_cols, ignores=None, raise_errors=False)

Warn when requested drop columns overlap with defined valid columns.

Parameters:

drop_cols (list[str]) – Column names that are about to be dropped.
ignores (list[str] | str | None) – Optional iterable or single column name that should be ignored when computing the overlap.
raise_errors (bool) – When True, raise a NudbQualityError instead of returning it. Defaults to False.

Returns:

Error describing the overlapping columns, or None when no problematic columns are found.

Return type:

NudbQualityError | None

Raises:

NudbQualityError – Raised when overlaps exist and raise_errors is True.
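
The overlap computation, including the handling of `ignores` as either a single name or a collection, might look roughly like the sketch below. The helper `overlapping_drop_cols`, its explicit `valid_cols` argument, and the column names are hypothetical; the documented function takes no such argument, so it presumably reads the valid columns from config.

```python
def overlapping_drop_cols(drop_cols, valid_cols, ignores=None):
    """Return drop columns that are also valid columns, minus ignored names."""
    if ignores is None:
        ignored = set()
    elif isinstance(ignores, str):
        ignored = {ignores}  # a single name is treated as a one-element collection
    else:
        ignored = set(ignores)
    valid = set(valid_cols)
    # Preserve the caller's ordering so the report is stable.
    return [c for c in drop_cols if c in valid and c not in ignored]


# Hypothetical configured variables, for illustration only.
valid_cols = ["person_id", "school_year", "edu_code"]
overlap = overlapping_drop_cols(
    ["tmp_col", "person_id", "edu_code"], valid_cols, ignores="edu_code"
)
print(overlap)  # ['person_id']
```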

nudb_use.quality.colored_views module

Style helpers that colorize quality summary tables.

empty_cols_in_time_colored(df, time_col)

Highlight columns that are empty for entire periods.

Parameters:

df (DataFrame) – DataFrame to summarize.
time_col (str) – Column to group by when calculating emptiness.

Returns:

Styled output with red/green highlights.

Return type:

Styler

grade_cell_by_time_col(df, time_col)

Colorize per-period completeness percentages.

Parameters:

df (DataFrame) – DataFrame to summarize.
time_col (str) – Column representing the grouping period.

Returns:

Styled completeness summary for display.

Return type:

Styler

nudb_use.quality.duplicated_columns module

Checks for duplicated DataFrame columns and reports them as errors.

check_duplicated_columns(df)

Return quality errors for duplicated columns in a DataFrame.

Parameters:

df (DataFrame) – DataFrame to inspect.

Returns:

Errors summarizing each duplicated column.

Return type:

list[NudbQualityError]
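
Finding repeated column labels is simple to illustrate. The sketch below works on a plain list of column names; the helper `duplicated_column_names` is hypothetical and says nothing about how the library builds its error objects.

```python
from collections import Counter


def duplicated_column_names(column_names):
    """Return each column name that occurs more than once, in first-seen order."""
    counts = Counter(column_names)
    return [name for name, n in counts.items() if n > 1]


print(duplicated_column_names(["id", "year", "id", "code", "year"]))  # ['id', 'year']
```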

nudb_use.quality.missing module

Checks ensuring NUDB datasets respect missing-value thresholds.

check_columns_only_missing(df, raise_errors=True)

Identify columns that consist entirely of missing values.

Parameters:

df (DataFrame) – DataFrame to inspect.
raise_errors (bool) – When True, raise grouped errors if violations are found.

Returns:

Errors describing columns that contain only missing values, or an empty list when every column has data.

Return type:

list[NudbQualityError]
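
A minimal sketch of the "entirely missing" test, assuming that only `None` and NaN count as missing (whether the library also treats empty strings as missing is not stated here). The helpers `is_missing` and `columns_only_missing` are hypothetical, and a dict of lists stands in for the DataFrame.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def columns_only_missing(columns):
    """Return names of columns in which every value is missing."""
    return [name for name, values in columns.items() if all(is_missing(v) for v in values)]


data = {
    "a": [None, float("nan"), None],
    "b": [1, None, 3],
}
print(columns_only_missing(data))  # ['a']
```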

check_missing_thresholds_dataset_name(df, dataset_name, raise_errors=True)

Validate a dataset against the configured missing-value thresholds.

Parameters:

df (DataFrame) – DataFrame to validate.
dataset_name (str) – Name of the dataset whose threshold config should be used.
raise_errors (bool) – When True, raise grouped errors if violations are found.

Returns:

Errors describing columns that exceed their thresholds, or an empty list when all limits are met.

Return type:

list[NudbQualityError]

check_non_missing(df, cols_not_empty, raise_errors=True)

Ensure the provided columns never contain missing values.

Parameters:

df (DataFrame) – DataFrame to inspect.
cols_not_empty (list[str]) – Column names that must be fully populated.
raise_errors (bool) – When True, raise grouped errors if violations are found.

Returns:

Errors describing columns that contain missing values, or an empty list if all columns are complete.

Return type:

list[NudbQualityError]

df_within_missing_thresholds(df, thresholds=None, raise_errors=True)

Check whether each column respects its configured missing-value threshold.

Parameters:

df (DataFrame) – DataFrame providing the values to inspect.
thresholds (dict[str, float] | None) – Mapping of column names to allowed missing-value percentages.
raise_errors (bool) – When True, raise grouped errors if violations are found.

Returns:

Errors describing columns that exceed their thresholds, or an empty list when all limits are met.

Return type:

list[NudbQualityError]
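
The comparison of per-column missing percentages against a threshold mapping can be sketched as follows. It assumes thresholds are expressed on a 0 to 100 scale and that only `None` counts as missing; neither is confirmed by the description above. The helper name and sample data are hypothetical.

```python
def columns_over_missing_threshold(columns, thresholds):
    """Return {column: missing_percent} for columns exceeding their allowed percentage."""
    violations = {}
    for name, allowed in thresholds.items():
        values = columns.get(name)
        if not values:
            continue  # column absent or empty: nothing to measure in this sketch
        missing = sum(v is None for v in values)
        percent = 100.0 * missing / len(values)
        if percent > allowed:
            violations[name] = percent
    return violations


data = {
    "municipality": ["0301", None, None, "4601"],  # half the values are missing
    "grade": [5, 4, None, 6],  # a quarter of the values are missing
}
result = columns_over_missing_threshold(data, {"municipality": 10.0, "grade": 30.0})
print(result)  # {'municipality': 50.0}
```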

empty_percents_over_columns(df, group_cols=None)

Check the percentage of empty values in specified columns in a DataFrame.

Parameters:

df (DataFrame) – DataFrame to check columns in.
group_cols (str | list[str] | None) – List of columns to check for percentage of empty values.

Returns:

DataFrame with percentage values for empty values for each column.

Return type:

pd.DataFrame

get_thresholds_from_config(dataset_name)

Retrieve the completion-percentage thresholds configured for a given dataset.

Parameters:

dataset_name (str) – Name of the dataset to retrieve threshold values for.

Returns:

Dictionary mapping variable names to their configured completion-percentage thresholds.

Return type:

dict[str, float]

last_period_within_thresholds(df, period_col, thresholds=None, raise_errors=True)

Validate that the latest period satisfies missing-value thresholds.

Parameters:

df (DataFrame) – DataFrame containing the data to evaluate.
period_col (str) – Column identifying the period dimension.
thresholds (dict[str, float] | None) – Mapping of column names to allowed missing-value percentages.
raise_errors (bool) – When True, raise grouped errors if violations are found.

Returns:

Errors describing columns that exceed thresholds in the most recent period, or an empty list if all pass.

Return type:

list[NudbQualityError]
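
Restricting a check to the most recent period can be sketched as below. It assumes the latest period is simply the maximum value of the period column, which the description does not confirm. Rows are plain dicts, and the helper `last_period_missing_percent` is hypothetical.

```python
def last_period_missing_percent(rows, period_col):
    """Return (latest_period, {column: missing_percent}) computed on the latest period only."""
    latest = max(row[period_col] for row in rows)
    recent = [row for row in rows if row[period_col] == latest]
    columns = [c for c in rows[0] if c != period_col]
    percents = {
        c: 100.0 * sum(row[c] is None for row in recent) / len(recent) for c in columns
    }
    return latest, percents


rows = [
    {"year": 2022, "grade": 5},
    {"year": 2022, "grade": 4},
    {"year": 2023, "grade": None},
    {"year": 2023, "grade": 6},
]
latest, percents = last_period_missing_percent(rows, "year")
print(latest, percents)  # 2023 {'grade': 50.0}
```

The resulting percentages could then be compared against a threshold mapping in the same way as for the whole dataset.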

nudb_use.quality.outdated_variables module

Checks that columns marked as outdated are absent from datasets.

check_outdated_variables(df)

Return errors for columns marked as outdated in config.

Parameters:

df (DataFrame) – DataFrame to inspect.

Returns:

Errors describing each outdated column present.

Return type:

list[NudbQualityError]

find_outdated_variables_in_df(df)

Return metadata for outdated variables present in a DataFrame.

Parameters:

df (DataFrame) – DataFrame to inspect.

Returns:

Mapping from variable name to metadata entries.

Return type:

dict[str, Variable]

nudb_use.quality.suite module

High-level orchestration for NUDB quality checks.

run_quality_suite(df, dataset_name, data_time_start=None, data_time_end=None, raise_errors=True, **kwargs)

Run the full NUDB quality suite over a dataset.

Parameters:

df (DataFrame) – DataFrame to validate.
dataset_name (str) – Name of the dataset in config; selects which part of the config supplies the values used in the validations.
data_time_start (str | None) – Optional start date used by codelist validations.
data_time_end (str | None) – Optional end date used by codelist validations.
raise_errors (bool) – When True, raise grouped exceptions if any check fails.
**kwargs (object) – Additional keyword arguments forwarded to specific checks.

Returns:

All collected quality errors, or an empty sequence when every check passes.

Return type:

Sequence[Exception]

Raises:

TypeError – If the first parameter, df, is not a pandas DataFrame.
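
The collect-then-raise pattern that a suite like this follows can be sketched generically. Everything here is hypothetical: `GroupedQualityError`, `run_checks`, and the two toy checks are stand-ins, and the way nudb_use actually groups its exceptions is not shown in this reference.

```python
class GroupedQualityError(Exception):
    """Hypothetical container for several check failures raised together."""

    def __init__(self, errors):
        self.errors = list(errors)
        super().__init__(f"{len(self.errors)} quality check(s) failed")


def run_checks(data, checks, raise_errors=True):
    """Run every check, collect the errors they return, then raise or return them."""
    collected = []
    for check in checks:
        # Each check returns a list of errors rather than raising on its own.
        collected.extend(check(data))
    if collected and raise_errors:
        raise GroupedQualityError(collected)
    return collected


def check_not_empty(data):
    return [] if data else [ValueError("dataset is empty")]


def check_has_id(data):
    return [] if "id" in data else [KeyError("missing column: id")]


errors = run_checks({"year": [2023]}, [check_not_empty, check_has_id], raise_errors=False)
print(errors)
```

Running every check before raising means one call reports all problems at once, rather than stopping at the first failure.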

nudb_use.quality.thresholds module

Threshold helpers for completeness and fill-rate validations.

filled_value_to_threshold(col, value, threshold_lower, raise_error=True)

Ensure the proportion of specific values stays above a threshold.

Parameters:

col (Series) – Series to inspect.
value (Iterable[object] | object) – Single value or iterable of values that must meet the threshold.
threshold_lower (float) – Minimum allowed percentage of matching values.
raise_error (bool) – When True, raise NudbQualityError if the threshold is not met.

Returns:

Error describing the shortage when the threshold is not met (if raise_error is False), otherwise None.

Return type:

NudbQualityError | None

Raises:

NudbQualityError – If the percentage of matching values is below the threshold while raise_error is True.
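
A rough illustration of the lower-threshold comparison, assuming percentages on a 0 to 100 scale and that a share exactly equal to the threshold passes. The helpers `percent_matching` and `below_lower_threshold` are hypothetical, return a message string rather than a NudbQualityError, and work on a plain list instead of a Series.

```python
def percent_matching(values, wanted):
    """Return the percentage of entries equal to wanted (or contained in it, if a collection)."""
    if isinstance(wanted, (list, tuple, set, frozenset)):
        targets = set(wanted)
    else:
        targets = {wanted}  # strings and other scalars count as a single value
    matches = sum(v in targets for v in values)
    return 100.0 * matches / len(values)


def below_lower_threshold(values, wanted, threshold_lower):
    """Return an error message when the matching share is too low, else None."""
    percent = percent_matching(values, wanted)
    if percent < threshold_lower:
        return f"only {percent:.1f}% of values match, expected at least {threshold_lower}%"
    return None


status = ["done", "done", "open", "done"]
print(below_lower_threshold(status, "done", 80.0))  # a message: 75% is below 80%
print(below_lower_threshold(status, ["done", "open"], 80.0))  # None
```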

non_empty_to_threshold(col, threshold_lower, raise_error=True)

Ensure the proportion of non-empty values stays above a threshold.

Parameters:

col (Series) – Series to inspect.
threshold_lower (float) – Minimum allowed percentage of non-empty values.
raise_error (bool) – When True, raise NudbQualityError if the threshold is not met.

Returns:

Error describing the shortage when the threshold is not met (if raise_error is False), otherwise None.

Return type:

NudbQualityError | None

Raises:

NudbQualityError – If the percentage of non-empty values is below the threshold while raise_error is True.

nudb_use.quality.values module

Utilities for inspecting column fill rates and unexpected values.

get_fill_amount_per_column(df)

Calculate the percentage of filled (non-null) values per column.

Parameters:

df (DataFrame) – DataFrame whose columns should be summarized.

Returns:

Mapping of column name to percentage of filled cells.

Return type:

dict[str, float]
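
The shape of the returned mapping can be illustrated with a small sketch. It assumes percentages on a 0 to 100 scale, non-empty columns, and that only `None` is null; the helper `fill_percent_per_column` is hypothetical.

```python
def fill_percent_per_column(columns):
    """Return {column: percentage of values that are not None}."""
    return {
        name: 100.0 * sum(v is not None for v in values) / len(values)
        for name, values in columns.items()
    }


data = {"id": [1, 2, 3, 4], "grade": [5, None, None, 6]}
print(fill_percent_per_column(data))  # {'id': 100.0, 'grade': 50.0}
```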

values_not_in_column(col, values, raise_error=False)

Check whether certain values appear inside a column.

Parameters:

col (Series) – Series to inspect.
values (Sequence[object] | object) – Values that should not appear in the column; may be a single value or a list.
raise_error (bool) – When True, raise ValueError immediately when matches occur.

Returns:

None when the column is clean, otherwise a ValueError describing the unexpected values.

Return type:

None | ValueError

Raises:

ValueError – If forbidden values are found and raise_error is True.
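
The return-or-raise behavior can be sketched as follows. In line with the Raises entry, this sketch treats `values` as entries that should not occur in the column. The helper `find_forbidden_values` and the error message are hypothetical, and a plain list stands in for the Series.

```python
def find_forbidden_values(column, forbidden, raise_error=False):
    """Return None if no forbidden value occurs, else a ValueError (or raise it)."""
    if isinstance(forbidden, (str, bytes)) or not hasattr(forbidden, "__iter__"):
        forbidden = [forbidden]  # wrap a single value
    # dict.fromkeys de-duplicates while keeping first-seen order.
    found = [v for v in dict.fromkeys(column) if v in forbidden]
    if not found:
        return None
    error = ValueError(f"unexpected values in column: {found}")
    if raise_error:
        raise error
    return error


result = find_forbidden_values(["A", "B", "X", "A", "?"], ["X", "?"])
print(repr(result))
print(find_forbidden_values(["A", "B"], "X"))  # None
```

Returning the exception object instead of raising it lets a caller gather several findings before deciding how to report them.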

nudb_use.quality.widths module

Validation helpers that ensure column values follow expected widths.

check_column_widths(df, widths=None, raise_errors=True)

Validate that string lengths in each column match expected widths.

Note: ignore_na is currently unused.

Parameters:

df (DataFrame) – DataFrame to inspect.
widths (dict[str, list[int]] | None) – Optional mapping of column names to allowed string lengths. When omitted or malformed, definitions are loaded from config.
raise_errors (bool) – When True, raise grouped errors if mismatches are found.

Returns:

Errors describing columns whose values are outside the allowed width definitions, or an empty list when all pass.

Return type:

list[NudbQualityError]
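
A width check against a mapping of the documented shape (`dict[str, list[int]]`) might look like the sketch below. The helper `width_violations` and the sample columns are hypothetical, and skipping non-string values is an assumption of the sketch rather than documented behavior.

```python
def width_violations(columns, widths):
    """Return {column: sorted unexpected lengths} for columns with off-width strings."""
    violations = {}
    for name, allowed in widths.items():
        # Only string values are measured; None and other types are skipped here.
        lengths = {len(v) for v in columns.get(name, []) if isinstance(v, str)}
        unexpected = sorted(lengths - set(allowed))
        if unexpected:
            violations[name] = unexpected
    return violations


data = {
    "municipality": ["0301", "4601", "301"],  # "301" has lost its leading zero
    "country": ["NO", "SE", None],
}
found = width_violations(data, {"municipality": [4], "country": [2]})
print(found)  # {'municipality': [3]}
```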