nudb_use.quality package
Subpackages
- nudb_use.quality.specific_variables package
  - nudb_use.quality.specific_variables.gro_elevstatus module
  - nudb_use.quality.specific_variables.grunnskolepoeng module
  - nudb_use.quality.specific_variables.kommune module
  - nudb_use.quality.specific_variables.land module
  - nudb_use.quality.specific_variables.nus2000 module
  - nudb_use.quality.specific_variables.run_all module
  - nudb_use.quality.specific_variables.skoleaar module
  - nudb_use.quality.specific_variables.sn07 module
  - nudb_use.quality.specific_variables.snr_fnr module
  - nudb_use.quality.specific_variables.unique_per_person module
  - nudb_use.quality.specific_variables.utils module
  - nudb_use.quality.specific_variables.vg_fullfoertkode_detaljert module
nudb_use.quality.check_bool_string_columns module
Checks for string columns containing boolean-like literal values.
- check_bool_string_columns(df, raise_errors=True)
Detect string columns that contain literal boolean values.
- Parameters:
df (DataFrame) – DataFrame to inspect.
raise_errors (bool) – When True, raise grouped errors if violations are found.
- Returns:
Errors describing columns with boolean-like string literals, or an empty list when none are found.
- Return type:
list[NudbQualityError]
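The kind of detection described above can be sketched with plain pandas. This is an illustration only, not the implementation in nudb_use: the helper name `find_bool_string_columns` and the set of literals treated as "boolean-like" are assumptions, since the source does not list them.

```python
import pandas as pd

# Assumption: the source does not say which literals count as boolean-like.
BOOL_LITERALS = {"true", "false"}


def find_bool_string_columns(df: pd.DataFrame) -> list[str]:
    """Return names of columns that hold boolean-like values stored as strings."""
    flagged = []
    for name in df.columns:
        # Only real str values are considered; genuine bools and NaN are ignored.
        strings = [v for v in df[name].dropna() if isinstance(v, str)]
        if any(s.strip().lower() in BOOL_LITERALS for s in strings):
            flagged.append(name)
    return flagged


example = pd.DataFrame(
    {
        "flag": ["True", "False", None],  # strings that look like booleans
        "name": ["a", "b", "c"],          # ordinary strings
        "real": [True, False, True],      # a proper boolean column
    }
)
flagged_columns = find_bool_string_columns(example)
```

Only `flag` is reported here, because `real` already has a boolean dtype and `name` holds no boolean-like literals.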
nudb_use.quality.check_drop_cols module
Ensure requested drop columns do not collide with configured variables.
- check_drop_cols_for_valid_cols(drop_cols, ignores=None, raise_errors=False)
Warn when requested drop columns overlap with defined valid columns.
- Parameters:
drop_cols (list[str]) – Column names that are about to be dropped.
ignores (list[str] | str | None) – Optional iterable or single column name that should be ignored when computing the overlap.
raise_errors (bool) – When True, raise a NudbQualityError instead of returning it. Defaults to False.
- Returns:
Error describing the overlapping columns, or None when no problematic columns are found.
- Return type:
NudbQualityError | None
- Raises:
NudbQualityError – Raised when overlaps exist and raise_errors is True.
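The overlap computation described above amounts to a set intersection minus the ignored names. The sketch below is a stand-in, not the library's code: the function name `overlapping_drop_cols` is hypothetical, and `valid_cols` is passed explicitly here, whereas the documented function takes no such argument and presumably reads the valid columns from config.

```python
def overlapping_drop_cols(drop_cols, valid_cols, ignores=None):
    """Return drop columns that are also valid columns, minus ignored names."""
    # Accept None, a single name, or an iterable of names for `ignores`.
    if ignores is None:
        ignored = set()
    elif isinstance(ignores, str):
        ignored = {ignores}
    else:
        ignored = set(ignores)
    return sorted((set(drop_cols) & set(valid_cols)) - ignored)


# Illustrative column names only.
overlap = overlapping_drop_cols(
    drop_cols=["fnr", "tmp", "kommune"],
    valid_cols=["fnr", "kommune", "land"],
    ignores="kommune",
)
```

An empty result corresponds to the documented `None` return; a non-empty one corresponds to the case where an error is returned or raised.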
nudb_use.quality.colored_views module
Style helpers that colorize quality summary tables.
- empty_cols_in_time_colored(df, time_col)
Highlight columns that are empty for entire periods.
- Parameters:
df (DataFrame) – DataFrame to summarize.
time_col (str) – Column to group by when calculating emptiness.
- Returns:
Styled output with red/green highlights.
- Return type:
Styler
- grade_cell_by_time_col(df, time_col)
Colorize per-period completeness percentages.
- Parameters:
df (DataFrame) – DataFrame to summarize.
time_col (str) – Column representing the grouping period.
- Returns:
Styled completeness summary for display.
- Return type:
Styler
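Both helpers above style a per-period summary. The table underneath such a view can be approximated as below, as a rough sketch that leaves out the coloring step. The function name `completeness_by_period` is hypothetical, and the 0 to 100 percentage scale is an assumption.

```python
import pandas as pd


def completeness_by_period(df: pd.DataFrame, time_col: str) -> pd.DataFrame:
    """Percentage of non-missing values per column within each period."""
    value_cols = [c for c in df.columns if c != time_col]
    # notna() yields booleans; their mean per group is the filled share.
    return df[value_cols].notna().groupby(df[time_col]).mean() * 100


example = pd.DataFrame(
    {
        "aar": [2020, 2020, 2021, 2021],
        "a": [1.0, None, None, None],
        "b": ["x", "y", "z", None],
    }
)
summary = completeness_by_period(example, "aar")
# Columns that are empty for an entire period show up as 0.
empty_in_period = summary.eq(0)
```

A styled view could then color cells of `summary` by value, for instance red where `empty_in_period` is True.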
nudb_use.quality.duplicated_columns module
Checks for duplicated DataFrame columns and reports them as errors.
- check_duplicated_columns(df)
Return quality errors for duplicated columns in a DataFrame.
- Parameters:
df (DataFrame) – DataFrame to inspect.
- Returns:
Errors summarizing each duplicated column.
- Return type:
list[NudbQualityError]
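Duplicated column labels can be found with pandas' own `Index.duplicated`. The sketch below shows that idea only; the helper name is hypothetical, and it returns plain names rather than NudbQualityError objects.

```python
import pandas as pd


def duplicated_column_names(df: pd.DataFrame) -> list[str]:
    """Return each column label that occurs more than once."""
    # Index.duplicated marks every occurrence after the first.
    return sorted(set(df.columns[df.columns.duplicated()]))


example = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])
duplicates = duplicated_column_names(example)
```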
nudb_use.quality.missing module
Checks ensuring NUDB datasets respect missing-value thresholds.
- check_columns_only_missing(df, raise_errors=True)
Identify columns that consist entirely of missing values.
- Parameters:
df (DataFrame) – DataFrame to inspect.
raise_errors (bool) – When True, raise grouped errors if violations are found.
- Returns:
Errors describing columns that contain only missing values, or an empty list when every column has data.
- Return type:
list[NudbQualityError]
- check_missing_thresholds_dataset_name(df, dataset_name, raise_errors=True)
Validate a dataset against the configured missing-value thresholds.
- Parameters:
df (DataFrame) – DataFrame to validate.
dataset_name (str) – Name of the dataset whose threshold config should be used.
raise_errors (bool) – When True, raise grouped errors if violations are found.
- Returns:
Errors describing columns that exceed their thresholds, or an empty list when all limits are met.
- Return type:
list[NudbQualityError]
- check_non_missing(df, cols_not_empty, raise_errors=True)
Ensure the provided columns never contain missing values.
- Parameters:
df (DataFrame) – DataFrame to inspect.
cols_not_empty (list[str]) – Column names that must be fully populated.
raise_errors (bool) – When True, raise grouped errors if violations are found.
- Returns:
Errors describing columns that contain missing values, or an empty list if all columns are complete.
- Return type:
list[NudbQualityError]
- df_within_missing_thresholds(df, thresholds=None, raise_errors=True)
Check whether each column respects its configured missing-value threshold.
- Parameters:
df (DataFrame) – DataFrame providing the values to inspect.
thresholds (dict[str, float] | None) – Mapping of column names to allowed missing-value percentages.
raise_errors (bool) – When True, raise grouped errors if violations are found.
- Returns:
Errors describing columns that exceed their thresholds, or an empty list when all limits are met.
- Return type:
list[NudbQualityError]
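A per-column threshold check of this kind can be sketched as follows. The sketch assumes that thresholds are the maximum allowed share of missing values on a 0 to 100 scale and that a column exactly at its limit passes; the source does not confirm either point. The helper name is hypothetical, and it returns a plain dict rather than error objects.

```python
import pandas as pd


def columns_over_missing_threshold(
    df: pd.DataFrame, thresholds: dict[str, float]
) -> dict[str, float]:
    """Map each offending column to its observed missing-value percentage."""
    missing_pct = df.isna().mean() * 100
    return {
        col: float(missing_pct[col])
        for col, limit in thresholds.items()
        # Columns absent from the frame are skipped in this sketch.
        if col in df.columns and missing_pct[col] > limit
    }


example = pd.DataFrame({"a": [1, None, None, 4], "b": [1, 2, 3, None]})
violations = columns_over_missing_threshold(example, {"a": 25.0, "b": 25.0})
```

Here `a` is half empty and exceeds its limit, while `b` sits exactly at 25 percent and passes under the stated assumption.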
- empty_percents_over_columns(df, group_cols=None)
Calculate the percentage of empty values in the specified columns of a DataFrame.
- Parameters:
df (DataFrame) – DataFrame whose columns are checked.
group_cols (str | list[str] | None) – Column or list of columns to check for percentage of empty values.
- Returns:
DataFrame with the percentage of empty values for each column.
- Return type:
pd.DataFrame
- get_thresholds_from_config(dataset_name)
Retrieve percentage completion threshold values for a given dataset from config.
- Parameters:
dataset_name (str) – Name of the dataset to retrieve threshold values for.
- Returns:
Dictionary mapping variable names to specified percentage completion threshold values.
- Return type:
dict[str, float]
- last_period_within_thresholds(df, period_col, thresholds=None, raise_errors=True)
Validate that the latest period satisfies missing-value thresholds.
- Parameters:
df (DataFrame) – DataFrame containing the data to evaluate.
period_col (str) – Column identifying the period dimension.
thresholds (dict[str, float] | None) – Mapping of column names to allowed missing-value percentages.
raise_errors (bool) – When True, raise grouped errors if violations are found.
- Returns:
Errors describing columns that exceed thresholds in the most recent period, or an empty list if all pass.
- Return type:
list[NudbQualityError]
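Restricting a missing-value check to the most recent period could look like the sketch below. It assumes that "latest" means the maximum value of the period column, which the source does not state, and the helper name `last_period_missing_pct` is hypothetical.

```python
import pandas as pd


def last_period_missing_pct(df: pd.DataFrame, period_col: str) -> dict[str, float]:
    """Missing-value percentage per column, computed on the latest period only."""
    latest = df[period_col].max()  # assumption: periods sort chronologically
    last = df[df[period_col] == latest].drop(columns=period_col)
    return (last.isna().mean() * 100).to_dict()


example = pd.DataFrame(
    {
        "aar": [2020, 2020, 2021, 2021],
        "a": [1.0, 2.0, None, None],
        "b": ["x", "y", "z", None],
    }
)
latest_missing = last_period_missing_pct(example, "aar")
```

The resulting percentages can then be compared against a thresholds mapping in the same way as for the whole DataFrame.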
nudb_use.quality.outdated_variables module
Checks that flag columns marked as outdated are absent from datasets.
- check_outdated_variables(df)
Return errors for columns marked as outdated in config.
- Parameters:
df (DataFrame) – DataFrame to inspect.
- Returns:
Errors describing each outdated column present.
- Return type:
list[NudbQualityError]
- find_outdated_variables_in_df(df)
Return metadata for outdated variables present in a DataFrame.
- Parameters:
df (DataFrame) – DataFrame to inspect.
- Returns:
Mapping from variable name to metadata entries.
- Return type:
dict[str, Variable]
nudb_use.quality.suite module
High-level orchestration for NUDB quality checks.
- run_quality_suite(df, dataset_name, data_time_start=None, data_time_end=None, raise_errors=True, **kwargs)
Run the full NUDB quality suite over a dataset.
- Parameters:
df (DataFrame) – DataFrame to validate.
dataset_name (str) – Name of the dataset in config; determines which part of the config supplies the values used in the validations.
data_time_start (str | None) – Optional start date used by codelist validations.
data_time_end (str | None) – Optional end date used by codelist validations.
raise_errors (bool) – When True, raise grouped exceptions if any check fails.
**kwargs (object) – Additional keyword arguments forwarded to specific checks.
- Returns:
All collected quality errors, or an empty sequence when every check passes.
- Return type:
Sequence[Exception]
- Raises:
TypeError – If the first parameter df is not a pandas DataFrame.
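The collect-then-raise pattern that a suite like this follows can be illustrated generically. Everything in the sketch below is hypothetical: `QualityErrors`, `run_checks`, and the toy check `no_empty_frame` are stand-ins that do not exist in nudb_use, and the library's actual grouping mechanism may differ.

```python
import pandas as pd


class QualityErrors(Exception):
    """Hypothetical container that groups the errors from several checks."""

    def __init__(self, errors):
        super().__init__(f"{len(errors)} quality check(s) failed")
        self.errors = list(errors)


def run_checks(df, checks, raise_errors=True):
    """Run every check, collect the errors, and optionally raise them together."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame")
    errors = []
    for check in checks:
        # Each check returns a list of errors (empty when it passes).
        errors.extend(check(df))
    if errors and raise_errors:
        raise QualityErrors(errors)
    return errors


def no_empty_frame(df):
    """Toy check used only for this illustration."""
    return [ValueError("DataFrame has no rows")] if df.empty else []


collected = run_checks(pd.DataFrame(), [no_empty_frame], raise_errors=False)
```

Running all checks before raising means one call reports every problem, rather than stopping at the first failure.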
nudb_use.quality.thresholds module
Threshold helpers for completeness and fill-rate validations.
- filled_value_to_threshold(col, value, threshold_lower, raise_error=True)
Ensure the proportion of specific values stays above a threshold.
- Parameters:
col (Series) – Series to inspect.
value (Iterable[object] | object) – Single value or iterable of values that must meet the threshold.
threshold_lower (float) – Minimum allowed percentage of matching values.
raise_error (bool) – When True, raise NudbQualityError if the threshold is not met.
- Returns:
Error describing the shortage when the threshold is not met (if raise_error is False), otherwise None.
- Return type:
NudbQualityError | None
- Raises:
NudbQualityError – If the percentage of matching values is below the threshold while raise_error is True.
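The share of matching values that such a check compares against `threshold_lower` can be computed with `Series.isin`. The sketch below is illustrative: `value_pct` is a hypothetical helper, and the 0 to 100 scale is an assumption, since the source only says "percentage".

```python
from collections.abc import Iterable

import pandas as pd


def value_pct(col: pd.Series, value) -> float:
    """Percentage of entries in `col` equal to `value` (or any of the values)."""
    # Treat strings as single values even though they are iterable.
    if isinstance(value, (str, bytes)) or not isinstance(value, Iterable):
        values = [value]
    else:
        values = list(value)
    return float(col.isin(values).mean() * 100)


codes = pd.Series(["a", "b", "a", "c"])
single_pct = value_pct(codes, "a")
multi_pct = value_pct(codes, ["a", "b"])
```

A threshold check then reduces to comparing the result with `threshold_lower`.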
- non_empty_to_threshold(col, threshold_lower, raise_error=True)
Ensure the proportion of non-empty values stays above a threshold.
- Parameters:
col (Series) – Series to inspect.
threshold_lower (float) – Minimum allowed percentage of non-empty values.
raise_error (bool) – When True, raise NudbQualityError if the threshold is not met.
- Returns:
Error describing the shortage when the threshold is not met (if raise_error is False), otherwise None.
- Return type:
NudbQualityError | None
- Raises:
NudbQualityError – If the percentage of non-empty values is below the threshold while raise_error is True.
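A lower-bound check on non-empty values might be sketched as below. Two assumptions are made: "empty" is taken to mean null (NaN or None), although the library may also count empty strings, and percentages use a 0 to 100 scale. Both helper names are hypothetical.

```python
import pandas as pd


def non_empty_pct(col: pd.Series) -> float:
    """Percentage of entries that are not null."""
    return float(col.notna().mean() * 100)


def meets_lower_threshold(col: pd.Series, threshold_lower: float) -> bool:
    """True when the non-empty share is not below the threshold."""
    return non_empty_pct(col) >= threshold_lower


values = pd.Series([1, None, 3, 4])
filled = non_empty_pct(values)
```

The comparison uses `>=` because the documented error is raised only when the percentage is below the threshold.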
nudb_use.quality.values module
Utilities for inspecting column fill rates and unexpected values.
- get_fill_amount_per_column(df)
Calculate the percentage of filled (non-null) values per column.
- Parameters:
df (DataFrame) – DataFrame whose columns should be summarized.
- Returns:
Mapping of column name to percentage of filled cells.
- Return type:
dict[str, float]
- values_not_in_column(col, values, raise_error=False)
Check whether certain values appear inside a column.
- Parameters:
col (Series) – Series to inspect.
values (Sequence[object] | object) – Allowed values; may be a single value or a list.
raise_error (bool) – When True, raise ValueError immediately when matches occur.
- Returns:
None when the column is clean, otherwise a ValueError describing the unexpected values.
- Return type:
None | ValueError
- Raises:
ValueError – If forbidden values are found and raise_error is True.
nudb_use.quality.widths module
Validation helpers that ensure column values follow expected widths.
- check_column_widths(df, widths=None, raise_errors=True)
Validate that string lengths in each column match expected widths.
Note: ignore_na is currently unused.
- Parameters:
df (DataFrame) – DataFrame to inspect.
widths (dict[str, list[int]] | None) – Optional mapping of column names to allowed string lengths. When omitted or malformed, definitions are loaded from config.
raise_errors (bool) – When True, raise grouped errors if mismatches are found.
- Returns:
Errors describing columns whose values are outside the allowed width definitions, or an empty list when all pass.
- Return type:
list[NudbQualityError]
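A width check over a `dict[str, list[int]]` mapping could be sketched as follows. This is an illustration under stated assumptions: missing values are skipped, values are converted to strings before measuring, and widths are passed explicitly rather than loaded from config. The helper name and the example column names are hypothetical.

```python
import pandas as pd


def columns_with_bad_widths(
    df: pd.DataFrame, widths: dict[str, list[int]]
) -> dict[str, list[int]]:
    """Map each offending column to the unexpected string lengths found in it."""
    bad = {}
    for col, allowed in widths.items():
        if col not in df.columns:
            continue
        # Assumption: nulls are ignored rather than counted as width violations.
        lengths = df[col].dropna().astype(str).str.len()
        unexpected = sorted(set(lengths.tolist()) - set(allowed))
        if unexpected:
            bad[col] = unexpected
    return bad


example = pd.DataFrame(
    {
        "kommune": ["0301", "301", None],  # "301" has lost its leading zero
        "land": ["000", "578", "104"],
    }
)
bad_widths = columns_with_bad_widths(example, {"kommune": [4], "land": [3]})
```

Checks like this commonly catch codes whose leading zeros were lost in a numeric conversion, as with the three-character `kommune` value here.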