Reference
vaskify package
vaskify.createdata module
- create_test_data(n=5, n_periods=5, freq='monthly', seed=None, wide=False)
Generate test data with columns: NACE, number of employees, turnover, time period.
- Parameters:
n (int) – Number of unique companies to create.
n_periods (int) – Number of time periods to create.
freq (str) – Frequency of the time periods: ‘monthly’, ‘quarterly’ or ‘yearly’.
seed (int | None) – Random seed for reproducibility.
wide (bool) – If True, return data in wide format with time periods as columns.
- Returns:
Test data in long format, or in wide format if wide is True.
- Return type:
pd.DataFrame
- Raises:
ValueError – If freq is not one of “monthly”, “quarterly”, or “yearly”.
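The difference between the long and wide layouts can be sketched in plain Python, without pandas. The rows, the key names (`id`, `period`, `turnover`) and the wide column naming (variable name, underscore, period) are all assumptions made for illustration; the actual column names produced by create_test_data are not specified here.

```python
from collections import defaultdict

# Hypothetical long-format rows: one row per company and time period.
long_rows = [
    {"id": "A", "period": "2024-01", "turnover": 100},
    {"id": "A", "period": "2024-02", "turnover": 110},
    {"id": "B", "period": "2024-01", "turnover": 50},
    {"id": "B", "period": "2024-02", "turnover": 55},
]


def to_wide(rows, id_key, period_key, value_key):
    """Pivot long rows so that each time period becomes its own column."""
    wide = defaultdict(dict)
    for row in rows:
        # Assumed naming scheme: "<variable>_<period>".
        column = f"{value_key}_{row[period_key]}"
        wide[row[id_key]][column] = row[value_key]
    # One output row per unit, with the id first and one column per period.
    return [{id_key: unit, **columns} for unit, columns in wide.items()]


wide_rows = to_wide(long_rows, "id", "period", "turnover")
```

In the long layout each company appears once per period; in the wide layout each company appears once, with one column per period.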
vaskify.detect module
- class Detect(data, id_nr, logger_level='warning')
Bases: object
Class for data editing.
- Parameters:
data (DataFrame)
id_nr (str)
logger_level (str)
- accumulation_error(y_var, time_var, error=0.5, flag='flag_accumulation', impute=False, impute_var='', output_format='data')
Detect accumulation errors based on previous periods (unstable beta method).
- Parameters:
y_var (str) – The variable of interest to check.
time_var (str) – String variable indicating the time period. This should be in an ISO 8601 standard format, for example ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’, or an SSB standard like ‘YYYY-Qq’.
error (float) – Float for the allowed error factor.
flag (str) – String for the name of the flag variable to add to the data. Default is ‘flag_accumulation’.
impute (bool) – Boolean for whether to impute the flagged observations. Default is False. (NOT IMPLEMENTED)
impute_var (str) – String for the name of the imputed variable.
output_format (str) – String for whether to return a data frame (‘data’) or just the identified outlier units (‘outliers’).
- Return type:
DataFrame
- Returns:
Data frame containing a flag variable for identified outliers, or a data frame containing only the outliers.
- change_logging_level(logger_level)
Change the logging print level.
- Parameters:
logger_level (str) – Detail level for information output. Choose between ‘debug’, ‘info’, ‘warning’, ‘error’ and ‘critical’.
- Return type:
None
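The five accepted names match the level names of Python's standard logging module. The sketch below shows one way such a name could be turned into a logging level; `resolve_level` and the logger name are hypothetical, and this is not necessarily how change_logging_level is implemented.

```python
import logging

VALID_LEVELS = ("debug", "info", "warning", "error", "critical")


def resolve_level(logger_level):
    """Translate a lowercase level name into the numeric logging constant."""
    if logger_level not in VALID_LEVELS:
        raise ValueError(f"Unknown logging level: {logger_level!r}")
    # logging.DEBUG, logging.INFO, ... are module-level integer constants.
    return getattr(logging, logger_level.upper())


logger = logging.getLogger("vaskify_sketch")  # hypothetical logger name
logger.setLevel(resolve_level("debug"))
```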
- hb(y_var, time_var=None, time_periods=None, strata_var='', pu=0.5, pa=0.05, pc=20, percentiles=(0.25, 0.75), flag='flag_hb', output_format='wide', output_scope='all')
Outlier detection using the Hidiroglou-Berthelot (HB) method.
Detects possible outliers of a variable in period t by comparing it with values from period t-1.
- Parameters:
y_var (str | list[str]) – String for the name of the variable of interest to check.
time_var (str | None) – String variable indicating the time period. This should be in an ISO 8601 standard format, for example ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’, or an SSB standard like ‘YYYY-Qq’. Set to None for wide-format data.
time_periods (list[str] | None) – List of strings for the two time periods to compare. Default is None, in which case the time variable is assumed to contain exactly two time periods.
strata_var (str) – String variable for stratification. Default is blank (“”).
pu (float) – Parameter that adjusts for different levels of the variables. Default value 0.5.
pa (float) – Parameter that adjusts for small differences between the median and the 1st or 3rd quartile. Default value 0.05.
pc (float) – Parameter that controls the width of the confidence interval. Default value 20.
percentiles (tuple[float, float]) – Tuple of percentile values to use. Default (0.25, 0.75).
flag (str) – String variable name to use to indicate outliers.
output_format (str) – String for the data format to return. Can be ‘wide’ (default), ‘long’ or ‘infer’. For ‘infer’, the data format returned (wide or long) will be that of the input data.
output_scope (str) – String for which units to return, either all (‘all’) or just the outliers (‘outliers’).
- Return type:
DataFrame
- Returns:
Data frame with flags, or with only the identified units, depending on output_scope.
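To show how pu, pa and pc interact, here is a minimal sketch of the HB method as it is usually described in the literature, in plain Python. It assumes that pu, pa and pc correspond to the U, A and C parameters of that description, that all values are positive, and that percentiles use linear interpolation. The helper names and the data are hypothetical, and the sketch is not taken from the vaskify source.

```python
import statistics


def percentile(values, p):
    """Linear-interpolation percentile for p in [0, 1]."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def hb_flags(previous, current, pu=0.5, pa=0.05, pc=20, percentiles=(0.25, 0.75)):
    """Return one boolean per unit: True if the unit falls outside the HB bounds."""
    ratios = [cur / prev for prev, cur in zip(previous, current)]
    r_med = statistics.median(ratios)
    # Transform the ratios so increases and decreases are symmetric around 0.
    s = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]
    # Weight by unit size: pu controls how much large units are emphasised.
    e = [si * max(prev, cur) ** pu for si, prev, cur in zip(s, previous, current)]
    e_q1 = percentile(e, percentiles[0])
    e_med = statistics.median(e)
    e_q3 = percentile(e, percentiles[1])
    # pa guards against quartiles that lie very close to the median.
    d_q1 = max(e_med - e_q1, abs(pa * e_med))
    d_q3 = max(e_q3 - e_med, abs(pa * e_med))
    # pc sets how many quartile distances away the bounds are placed.
    lower = e_med - pc * d_q1
    upper = e_med + pc * d_q3
    return [ei < lower or ei > upper for ei in e]


# Hypothetical turnover for seven units in period t-1 and period t.
previous = [100, 200, 150, 300, 120, 80, 250]
current = [105, 210, 160, 310, 125, 84, 250000]
flags = hb_flags(previous, current)
```

Only the last unit, whose value grows by a factor of 1000 while the others grow by roughly 5 %, ends up outside the bounds.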
- quartile_error(x_var, y_var=None, time_var=None, time_periods=None, strata_var='', pkl=3, pku=3, percentiles=(0.25, 0.75), flag='flag_quartile', output_format='infer', output_scope='all')
Detect and flag potential errors based on quartile ranges.
This method identifies outliers using a quartile-based approach applied to ratios between variables. The method currently supports only wide-format data. Observations with missing values or non-positive values in the required variables are excluded before calculations.
The method supports both single ratios (one x- and one y-variable) and multiple ratios (lists of variables), where lower and upper bounds are computed using specified percentiles and scaling parameters.
- Parameters:
x_var (str | list[str]) – str or list of str for the name(s) of the numerator variable(s). If a list is provided, y_var must also be a list of the same length.
y_var (str | list[str] | None) – str or list of str or None for the name(s) of the denominator variable(s). If None, a temporary constant denominator is used. Default is None.
time_var (str | None) – str or None for the name of a time variable. Currently not supported; only wide-format input is implemented. If provided, an error is logged. Default is None.
time_periods (list[str] | None) – list of str or None. Reserved for future use. Currently not applied.
strata_var (str) – Optional str variable defining strata within which quartiles are calculated. Default is an empty string (no stratification).
pkl (float) – Float scaling factor applied to the lower quartile limit. Default is 3.
pku (float) – Float scaling factor applied to the upper quartile limit. Default is 3.
percentiles (tuple[float, float]) – Tuple of floats for the lower and upper percentiles used to compute quartiles. Default is (0.25, 0.75).
flag (str) – str name of the output flag variable indicating detected outliers. Default is “flag_quartile”.
output_format (str) – str reserved for future use. Currently not applied; only wide format is returned.
output_scope (str) – {“all”, “outliers”} to determine whether all observations are returned or only those flagged as outliers. Default is “all”.
- Return type:
DataFrame
- Returns:
A pandas DataFrame containing the original data along with calculated quartile limits, ratios, and an indicator flag for quartile-based outliers. If output_scope=“outliers”, only flagged observations are returned.
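The exact formula for the limits is not given in this reference. As a rough illustration only, the sketch below assumes Tukey-style fences, where the lower limit is the lower percentile minus pkl times the interquartile spread and the upper limit is the upper percentile plus pku times that spread. The helper names and the data are hypothetical. Pairs with missing or non-positive values are dropped first, as described above.

```python
def percentile(values, p):
    """Linear-interpolation percentile for p in [0, 1]."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def quartile_flags(x, y, pkl=3, pku=3, percentiles=(0.25, 0.75)):
    """Map each usable row index to True if its ratio x/y is outside the limits."""
    # Keep only pairs where both values are present and positive.
    valid = [
        (i, xi / yi)
        for i, (xi, yi) in enumerate(zip(x, y))
        if xi is not None and yi is not None and xi > 0 and yi > 0
    ]
    ratios = [r for _, r in valid]
    q_low = percentile(ratios, percentiles[0])
    q_high = percentile(ratios, percentiles[1])
    spread = q_high - q_low
    # Assumed limit formula (Tukey-style fences), not confirmed by the source.
    lower = q_low - pkl * spread
    upper = q_high + pku * spread
    return {i: (r < lower or r > upper) for i, r in valid}


# Hypothetical turnover (numerator) and number of employees (denominator).
turnover = [100, 220, 310, 405, 5000, None, 90]
employees = [10, 20, 30, 40, 50, 5, 0]
flags = quartile_flags(turnover, employees)
```

Rows 5 and 6 are excluded before the quartiles are computed, so they do not appear in the result at all.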
- thousand_error(y_var, time_var=None, lower_bound=-2.5, upper_bound=2.5, flag='flag_thousand', impute=False, impute_var='', output_format='wide', output_scope='all')
Detect thousand errors based on a previous period.
- Parameters:
y_var (str | list[str]) – The variable(s) of interest to check. In long format, a single variable name or list of variable names. In wide format, a prefix or list of prefixes (e.g. ‘employees’) matching columns.
time_var (str | None) – String variable indicating the time period. This should be in an ISO 8601 standard format, for example ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’, or an SSB standard like ‘YYYY-Qq’. Set to None for wide-format data.
lower_bound (float) – Float for the lower bound log factor for defining an outlier.
upper_bound (float) – Float for the upper bound log factor for defining an outlier.
flag (str) – String for the name of the flag variable to add to the data. Default is ‘flag_thousand’.
impute (bool) – Boolean for whether to impute the flagged observations. Default is False.
impute_var (str) – String for the name of the imputed variable.
output_format (str) – String for the data format to return: ‘wide’ (default), ‘long’ or ‘infer’. For ‘infer’, the function returns the same format as the input data.
output_scope (str) – String for whether to return all the data (‘all’) or just the identified outlier units (‘outliers’).
- Return type:
DataFrame
- Returns:
Data frame containing a flag variable for identified outliers, or a data frame containing only the outliers (depending on output_scope).
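One way to read the log-factor bounds is as limits on the base-10 logarithm of the ratio between the current and previous period, so that a value reported in units instead of thousands (a factor of 1000, log ratio 3) falls above the default upper bound of 2.5. The base of the logarithm and the handling of non-positive values are assumptions, and `thousand_flags` is a hypothetical helper rather than part of vaskify.

```python
import math


def thousand_flags(previous, current, lower_bound=-2.5, upper_bound=2.5):
    """Flag units whose log10 ratio between periods lies outside the bounds."""
    flags = []
    for prev, cur in zip(previous, current):
        if prev <= 0 or cur <= 0:
            # The log ratio is undefined here; this sketch leaves such units unflagged.
            flags.append(False)
            continue
        log_ratio = math.log10(cur / prev)
        flags.append(log_ratio < lower_bound or log_ratio > upper_bound)
    return flags


# Hypothetical values for four units in the previous and current period.
previous = [1200, 450, 98000, 30]
current = [1250, 450000, 97, 31]
flags = thousand_flags(previous, current)
```

The second unit jumps by a factor of 1000 and the third drops by roughly the same factor, so both are flagged; the other two change by only a few percent.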