Reference

vaskify package

vaskify.createdata module

create_test_data(n=5, n_periods=5, freq='monthly', seed=None)

Generate test data with columns: NACE, number of employees, turnover, time period.

Parameters:

n (int) – Number of unique companies to create.
n_periods (int) – Number of time periods to create.
freq (str) – Frequency of the time periods: ‘monthly’, ‘quarterly’ or ‘yearly’.
seed (int) – Random seed for reproducibility.

Returns:

Test data in long format.

Return type:

pd.DataFrame
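
To illustrate what "long format" means here (one row per company and time period), the following is a minimal standard-library sketch, not the package's implementation. The helper names period_labels and make_long_rows, the column names, the start year and the value ranges are all assumptions made for illustration; the real function returns a pandas DataFrame whose exact column names are not documented above.

```python
import random


def period_labels(n_periods, freq="monthly", start_year=2020):
    """Build period labels such as '2020-01', '2020-Q1' or '2020'."""
    per_year = {"monthly": 12, "quarterly": 4, "yearly": 1}[freq]
    labels = []
    for i in range(n_periods):
        year, step = start_year + i // per_year, i % per_year + 1
        if freq == "monthly":
            labels.append(f"{year}-{step:02d}")
        elif freq == "quarterly":
            labels.append(f"{year}-Q{step}")
        else:
            labels.append(str(year))
    return labels


def make_long_rows(n=5, n_periods=5, freq="monthly", seed=None):
    """Return one row (dict) per company and period: the 'long' layout."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    rows = []
    for company in range(1, n + 1):
        # Hypothetical industry code; real NACE codes follow a fixed classification.
        nace = f"{rng.randint(1, 99):02d}.{rng.randint(0, 99):02d}"
        for period in period_labels(n_periods, freq):
            rows.append({
                "id": company,
                "nace": nace,
                "period": period,
                "employees": rng.randint(1, 250),
                "turnover": round(rng.uniform(100, 10_000), 1),
            })
    return rows


rows = make_long_rows(n=3, n_periods=5, freq="quarterly", seed=42)
```

With n=3 and n_periods=5 the sketch yields 15 rows, and the quarterly labels roll over from '2020-Q4' to '2021-Q1'.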

vaskify.detect module

class Detect(data, id_nr, logger_level='warning')

Bases: object

Class for data editing.

Parameters:

data (DataFrame)
id_nr (str)
logger_level (str)

accumulation_error(y_var, time_var, error=0.5, flag='flag_accumulation', impute=False, impute_var='', output_format='data')

Detect accumulation errors based on previous periods.

Parameters:

y_var (str) – The variable of interest to check.
time_var (str) – String variable indicating the time period. This should be in an ISO 8601 standard format, for example ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’, or an SSB standard like ‘YYYY-Qq’.
error (float) – Float for the allowed error factor.
flag (str) – String for the name of the flag variable to add to the data. Default is ‘flag_accumulation’.
impute (bool) – Boolean for whether to impute the flagged observations. Default is False. (NOT IMPLEMENTED)
impute_var (str) – String for the name of the imputed variable.
output_format (str) – String for whether to return a data frame ‘data’, or just the identified outlier units ‘outliers’.

Return type:

DataFrame

Returns:

Data frame containing a flag variable for identified outliers or a dataframe containing only the outliers.

change_logging_level(logger_level)

Change the logging print level.

Parameters:

logger_level (str) – Detail level for information output. Choose between ‘debug’, ‘info’, ‘warning’, ‘error’ and ‘critical’.

Return type:

None
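
The five level names match those of Python's standard logging module. As a rough sketch of how such a name could be applied to a logger, assuming (this is not confirmed above) that the class uses the standard library internally, consider the following. The helper set_level and the logger name are hypothetical.

```python
import logging

# Map the documented level names to the standard-library constants.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}


def set_level(logger, logger_level):
    """Set the logger's level from one of the five documented names."""
    try:
        level = LEVELS[logger_level.lower()]
    except KeyError:
        raise ValueError(f"Unknown logger level: {logger_level!r}") from None
    logger.setLevel(level)


logger = logging.getLogger("vaskify_sketch")  # hypothetical logger name
set_level(logger, "debug")
```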

hb(y_var, time_var, time_periods=None, strata_var='', pu=0.5, pa=0.05, pc=20, percentiles=(0.25, 0.75), flag='flag_hb', output_format='wide')

Outlier detection using the Hidiroglou-Berthelot (HB) method.

Detects possible outliers of a variable in period t by comparing it with values from period t-1.

Parameters:

y_var (str) – String for the name of the variable of interest to check.
time_var (str) – String variable indicating the time period. This should be in an ISO 8601 standard format, for example ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’, or an SSB standard like ‘YYYY-Qq’.
time_periods (list[str] | None) – List of strings for the two time periods to compare. Default None, in which case it is assumed that the time variable contains exactly two time periods.
strata_var (str) – String variable for stratification. Default is blank (“”).
pu (float) – Parameter that adjusts for different levels of the variable. Default value 0.5.
pa (float) – Parameter that adjusts for small differences between the median and the 1st or 3rd quartile. Default value 0.05.
pc (float) – Parameter that controls the width of the confidence interval. Default value 20.
percentiles (tuple[float, float]) – Tuple for percentile values to use.
flag (str) – String variable name to use to indicate outliers.
output_format (str) – String for format to return. Can be ‘wide’, ‘long’ or ‘outliers’.

Return type:

DataFrame

Returns:

Data frame with flags, or with only the identified units.
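
The roles of pu, pa, pc and percentiles are easier to see in a small worked sketch of the textbook HB calculation. The code below uses only the standard library and plain lists; it is not the package's implementation, ignores stratification, and assumes all values are strictly positive. The names hb_flags and quantile are hypothetical, and the linear-interpolation quantile rule is an assumption.

```python
from statistics import median


def quantile(values, q):
    """Linearly interpolated quantile of a list of numbers."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def hb_flags(previous, current, pu=0.5, pa=0.05, pc=20, percentiles=(0.25, 0.75)):
    """Flag units whose change from period t-1 to t is an HB outlier."""
    ratios = [c / p for p, c in zip(previous, current)]
    r_med = median(ratios)
    # Transform the ratios so that increases and decreases are treated symmetrically.
    s = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]
    # Scale by unit size: pu controls how much weight large units receive.
    effects = [si * max(p, c) ** pu for si, p, c in zip(s, previous, current)]
    e_med = median(effects)
    e_lo, e_hi = (quantile(effects, q) for q in percentiles)
    # pa guards against near-zero spreads when effects cluster around the median.
    d_lo = max(e_med - e_lo, abs(pa * e_med))
    d_hi = max(e_hi - e_med, abs(pa * e_med))
    # pc sets the width of the acceptance interval around the median effect.
    lower, upper = e_med - pc * d_lo, e_med + pc * d_hi
    return [e < lower or e > upper for e in effects]


previous = [100, 110, 95, 105, 100, 98, 102, 100]
current = [102, 108, 97, 106, 99, 100, 101, 5000]
flags = hb_flags(previous, current)
```

In this example only the last unit, which jumps from 100 to 5000, falls outside the acceptance interval; the remaining units change by a few percent and are not flagged.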

thousand_error(y_var, time_var, lower_bound=-2.5, upper_bound=2.5, flag='flag_thousand', impute=False, impute_var='', output_format='data')

Detect thousand errors based on a previous period.

Parameters:

y_var (str) – The variable of interest to check.
time_var (str) – String variable indicating the time period. This should be in an ISO 8601 standard format, for example ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’, or an SSB standard like ‘YYYY-Qq’.
lower_bound (float) – Float variable for the lower bound log factor for defining an outlier.
upper_bound (float) – Float variable for the upper bound log factor for defining an outlier.
flag (str) – String for the name of the flag variable to add to the data. Default is ‘flag_thousand’.
impute (bool) – Boolean for whether to impute the flagged observations. Default is False.
impute_var (str) – String for the name of the imputed variable.
output_format (str) – String for whether to return a data frame ‘data’, or just the identified outlier units ‘outliers’.

Return type:

DataFrame

Returns:

Data frame containing a flag variable for identified outliers or a dataframe containing only the outliers.
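
One way to read the two bounds is as limits on the logarithm of the ratio between the current and the previous value. The sketch below illustrates that reading with plain lists. It is not the package's implementation: the base-10 logarithm is an assumption (chosen because a factor of 1000 corresponds to 3), the helper name thousand_flags is hypothetical, and values are assumed to be strictly positive.

```python
import math


def thousand_flags(previous, current, lower_bound=-2.5, upper_bound=2.5):
    """Flag units whose log10 ratio to the previous period is outside the bounds."""
    flags = []
    for p, c in zip(previous, current):
        log_ratio = math.log10(c / p)  # about +3 or -3 for a factor-1000 mistake
        flags.append(log_ratio < lower_bound or log_ratio > upper_bound)
    return flags


previous = [1200.0, 1200.0, 1200.0]
current = [1300.0, 1_150_000.0, 1.2]  # normal change, reported in units, reported in millions
flags = thousand_flags(previous, current)
```

Here the first unit changes by under 10 percent and passes, while the second and third differ from the previous period by roughly a factor of 1000 in either direction and are flagged.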