Reference

vaskify package

vaskify.createdata module

create_test_data(n=5, n_periods=5, freq='monthly', seed=None)

Generate test data with columns: NACE, number of employees, turnover, time period.

Parameters:
  • n (int) – Number of unique companies to create.

  • n_periods (int) – Number of time periods to create.

  • freq (str) – Frequency of the time periods: ‘monthly’, ‘quarterly’ or ‘yearly’.

  • seed (int) – Random seed for reproducibility.

Returns:

Test data in long format.

Return type:

pd.DataFrame
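To illustrate what "long format" means here, the following is a minimal standard-library sketch, not the package's implementation. It produces one row per company per period. The names `period_labels` and `sketch_test_data`, the column keys, the start year and the placeholder NACE codes are all assumptions made for illustration.

```python
import random


def period_labels(start_year, n_periods, freq):
    """Build period labels such as '2020-01', '2020-Q1' or '2020'."""
    per_year = {"monthly": 12, "quarterly": 4, "yearly": 1}[freq]
    labels = []
    for i in range(n_periods):
        year = start_year + i // per_year
        step = i % per_year + 1
        if freq == "monthly":
            labels.append(f"{year}-{step:02d}")
        elif freq == "quarterly":
            labels.append(f"{year}-Q{step}")
        else:
            labels.append(str(year))
    return labels


def sketch_test_data(n=5, n_periods=5, freq="monthly", seed=None):
    """Return long-format rows: one dict per company per period."""
    rng = random.Random(seed)  # seeded generator gives reproducible rows
    rows = []
    for company in range(1, n + 1):
        nace = rng.choice(["01.1", "47.1", "62.0"])  # placeholder codes
        employees = rng.randint(1, 200)
        for period in period_labels(2020, n_periods, freq):
            rows.append(
                {
                    "id": company,  # hypothetical column names throughout
                    "nace": nace,
                    "employees": employees,
                    "turnover": round(employees * rng.uniform(500, 1500)),
                    "period": period,
                }
            )
    return rows


rows = sketch_test_data(n=2, n_periods=3, freq="quarterly", seed=1)
```

With `n=2` and `n_periods=3` the sketch yields six rows, and calling it again with the same `seed` returns identical rows.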

vaskify.detect module

class Detect(data, id_nr, logger_level='warning')

Bases: object

Class for data editing.

Parameters:
  • data (DataFrame)

  • id_nr (str)

  • logger_level (str)

accumulation_error(y_var, time_var, error=0.5, flag='flag_accumulation', impute=False, impute_var='', output_format='data')

Detect accumulation errors based on previous periods.

Parameters:
  • y_var (str) – The variable of interest to check.

  • time_var (str) – String variable indicating the time period. This should be in an ISO 8601 standard format, for example ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’, or an SSB standard format such as ‘YYYY-Qq’.

  • error (float) – Float for the allowed error factor.

  • flag (str) – String for the name of the flag variable to add to the data. Default is ‘flag_accumulation’.

  • impute (bool) – Boolean for whether to impute the flagged observations. Default is False. (NOT IMPLEMENTED)

  • impute_var (str) – String for the name of the imputed variable.

  • output_format (str) – String for whether to return a data frame ‘data’, or just the identified outlier units ‘outliers’.

Return type:

DataFrame

Returns:

Data frame containing a flag variable for identified outliers or a dataframe containing only the outliers.

change_logging_level(logger_level)

Change the logging print level.

Parameters:

logger_level (str) – Detail level for information output. Choose between ‘debug’, ‘info’, ‘warning’, ‘error’ and ‘critical’.

Return type:

None
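As a rough illustration of how the level names above could map onto Python's standard `logging` module, here is a minimal sketch. It is not the class's own code: the helper `set_level`, the `LEVELS` table and the logger name are hypothetical.

```python
import logging

# Accepted names mapped to the standard logging constants.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}


def set_level(logger, logger_level):
    """Set the logger's level from one of the accepted level names."""
    try:
        logger.setLevel(LEVELS[logger_level.lower()])
    except KeyError:
        raise ValueError(f"Unknown logger_level: {logger_level!r}") from None


logger = logging.getLogger("detect_sketch")  # hypothetical logger name
set_level(logger, "info")
```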

hb(y_var, time_var, time_periods=None, strata_var='', pu=0.5, pa=0.05, pc=20, percentiles=(0.25, 0.75), flag='flag_hb', output_format='wide')

Outlier detection using the Hidiroglou-Berthelot (HB) method.

Detects possible outliers of a variable in period t by comparing it with values from period t-1.

Parameters:
  • y_var (str) – String for the name of the variable of interest to check.

  • time_var (str) – String variable indicating the time period. This should be in an ISO 8601 standard format, for example ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’, or an SSB standard format such as ‘YYYY-Qq’.

  • time_periods (list[str] | None) – List of strings for the two time periods to compare. Default is None, in which case the time variable is assumed to contain exactly two time periods.

  • strata_var (str) – String variable for stratification. Default is blank (“”).

  • pu (float) – Parameter that adjusts for different levels of the variables. Default value 0.5.

  • pa (float) – Parameter that adjusts for small differences between the median and the 1st or 3rd quartile. Default value 0.05.

  • pc (float) – Parameter that controls the width of the confidence interval. Default value 20.

  • percentiles (tuple[float, float]) – Tuple of the lower and upper percentile values to use. Default is (0.25, 0.75).

  • flag (str) – String variable name to use to indicate outliers.

  • output_format (str) – String for the format to return. Can be ‘wide’, ‘long’ or ‘outliers’.

Return type:

DataFrame

Returns:

Data frame with flags, or a data frame containing only the identified units.
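The roles of pu, pa and pc are easier to see in a small worked sketch of the HB method as it is usually formulated. This is a plain-Python illustration under stated assumptions, not the package's implementation: the function names `quantile` and `hb_flags` are hypothetical, the quantile interpolation rule is a choice made here, all values are assumed to be strictly positive, and stratification is left out.

```python
def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list, 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def hb_flags(prev, curr, pu=0.5, pa=0.05, pc=20, percentiles=(0.25, 0.75)):
    """Return 0/1 flags, 1 where the change from t-1 to t looks like an outlier."""
    # Ratio of the value in period t to the value in period t-1.
    ratios = [c / p for p, c in zip(prev, curr)]
    r_med = quantile(ratios, 0.5)
    # Transform the ratios so they are symmetric around their median.
    s = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]
    # Size effect: pu controls how much weight larger units receive.
    e = [si * max(p, c) ** pu for si, p, c in zip(s, prev, curr)]
    e_lo = quantile(e, percentiles[0])
    e_med = quantile(e, 0.5)
    e_hi = quantile(e, percentiles[1])
    # pa keeps the distances from collapsing when quartiles sit near the median.
    d_lo = max(e_med - e_lo, abs(pa * e_med))
    d_hi = max(e_hi - e_med, abs(pa * e_med))
    # pc sets the width of the acceptance interval.
    lower, upper = e_med - pc * d_lo, e_med + pc * d_hi
    return [int(ei < lower or ei > upper) for ei in e]


prev = [100, 200, 150, 300, 250, 120]
curr = [105, 190, 160, 310, 240, 120000]
flags = hb_flags(prev, curr)  # only the last unit falls outside the interval
```

In this toy data the last unit grows by a factor of 1000 while the others change by a few percent, so it is the only one flagged.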

thousand_error(y_var, time_var, lower_bound=-2.5, upper_bound=2.5, flag='flag_thousand', impute=False, impute_var='', output_format='data')

Detect thousand errors based on a previous period.

Parameters:
  • y_var (str) – The variable of interest to check.

  • time_var (str) – String variable indicating the time period. This should be in an ISO 8601 standard format, for example ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’, or an SSB standard format such as ‘YYYY-Qq’.

  • lower_bound (float) – Float variable for the lower bound log factor for defining an outlier.

  • upper_bound (float) – Float variable for the upper bound log factor for defining an outlier.

  • flag (str) – String for the name of the flag variable to add to the data. Default is ‘flag_thousand’.

  • impute (bool) – Boolean for whether to impute the flagged observations. Default is False.

  • impute_var (str) – String for the name of the imputed variable.

  • output_format (str) – String for whether to return a data frame ‘data’, or just the identified outlier units ‘outliers’.

Return type:

DataFrame

Returns:

Data frame containing a flag variable for identified outliers or a dataframe containing only the outliers.
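A thousand error typically arises when a unit reports a value in the wrong unit, for example in kroner instead of thousands of kroner. One way to read the bounds above is as limits on the log ratio between the current and previous value. The sketch below assumes a base-10 logarithm, which the reference does not state, and is not the package's implementation; the function name `thousand_flags` and the handling of non-positive values are hypothetical.

```python
import math


def thousand_flags(prev, curr, lower_bound=-2.5, upper_bound=2.5):
    """Return 0/1 flags, 1 where log10(curr / prev) falls outside the bounds."""
    flags = []
    for p, c in zip(prev, curr):
        if p <= 0 or c <= 0:
            # The log ratio is undefined here; this sketch leaves it unflagged.
            flags.append(0)
            continue
        log_ratio = math.log10(c / p)  # ASSUMPTION: base-10 logarithm
        flags.append(int(log_ratio < lower_bound or log_ratio > upper_bound))
    return flags


flags = thousand_flags([120, 80, 5000], [125000, 82, 5])
```

Under this reading, a value that jumps or drops by a factor of roughly 1000 (a log ratio near 3 or -3) lies outside the default bounds of ±2.5 and is flagged, while ordinary period-to-period changes are not.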