Reference
vaskify package
vaskify.createdata module
- create_test_data(n=5, n_periods=5, freq='monthly', seed=None)
Generate test data with columns: NACE, number of employees, turnover, time period.
- Parameters:
n (int) – Number of unique companies to create.
n_periods (int) – Number of time periods to create.
freq (str) – Frequency of the time periods: ‘monthly’, ‘quarterly’ or ‘yearly’.
seed (int) – Random seed for reproducibility.
- Returns:
Test data in long format.
- Return type:
pd.DataFrame
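To illustrate what "long format" means here (one row per company and time period), the following is a minimal standard-library sketch. It is not vaskify's implementation: the function names `period_labels` and `sketch_test_data`, the column names, the value ranges and the start year are all hypothetical, and the period labels simply follow the 'YYYY', 'YYYY-MM' and 'YYYY-Qq' formats mentioned for `time_var` in the Detect class.

```python
import random


def period_labels(n_periods, freq="monthly", start_year=2020):
    """Build period labels such as '2020-01', '2020-Q1' or '2020' (assumed formats)."""
    per_year = {"monthly": 12, "quarterly": 4, "yearly": 1}[freq]
    labels = []
    for i in range(n_periods):
        year = start_year + i // per_year
        idx = i % per_year + 1
        if freq == "monthly":
            labels.append(f"{year}-{idx:02d}")
        elif freq == "quarterly":
            labels.append(f"{year}-Q{idx}")
        else:
            labels.append(str(year))
    return labels


def sketch_test_data(n=5, n_periods=5, freq="monthly", seed=None):
    """Return long-format rows: one dict per (company, period) pair."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    rows = []
    for company in range(1, n + 1):
        for period in period_labels(n_periods, freq):
            rows.append(
                {
                    "id": company,  # hypothetical column names
                    "period": period,
                    "employees": rng.randint(1, 250),
                    "turnover": rng.randint(100, 10_000),
                }
            )
    return rows


rows = sketch_test_data(n=2, n_periods=3, freq="quarterly", seed=42)
for row in rows:
    print(row)
```

With `n=2` and `n_periods=3` the sketch yields six rows, which is the shape a long-format table with two companies and three periods would have.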
vaskify.detect module
- class Detect(data, id_nr, logger_level='warning')
Bases:
object
Class for data editing.
- Parameters:
data (DataFrame)
id_nr (str)
logger_level (str)
- accumulation_error(y_var, time_var, error=0.5, flag='flag_accumulation', impute=False, impute_var='', output_format='data')
Detect accumulation errors based on previous periods.
- Parameters:
y_var (str) – The variable of interest to check.
time_var (str) – Name of the variable indicating the time period. Values should follow an ISO 8601 format, for example ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’, or an SSB standard such as ‘YYYY-Qq’.
error (float) – Allowed error factor.
flag (str) – Name of the flag variable to add to the data. Default is ‘flag_accumulation’.
impute (bool) – Whether to impute the flagged observations. Default is False. (NOT IMPLEMENTED)
impute_var (str) – Name of the imputed variable.
output_format (str) – Whether to return the full data frame (‘data’) or only the identified outlier units (‘outliers’).
- Return type:
DataFrame
- Returns:
Data frame containing a flag variable for identified outliers, or a data frame containing only the outliers.
- change_logging_level(logger_level)
Change the logging print level.
- Parameters:
logger_level (str) – Detail level for information output. Choose between ‘debug’, ‘info’, ‘warning’, ‘error’ and ‘critical’.
- Return type:
None
- hb(y_var, time_var, time_periods=None, strata_var='', pu=0.5, pa=0.05, pc=20, percentiles=(0.25, 0.75), flag='flag_hb', output_format='wide')
Outlier detection using the Hidiroglou-Berthelot (HB) method.
Detects possible outliers of a variable in period t by comparing it with values from period t-1.
- Parameters:
y_var (str) – Name of the variable of interest to check.
time_var (str) – Name of the variable indicating the time period. Values should follow an ISO 8601 format, for example ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’, or an SSB standard such as ‘YYYY-Qq’.
time_periods (list[str] | None) – The two time periods to compare. Default is None, in which case the time variable is assumed to contain exactly two time periods.
strata_var (str) – Name of the stratification variable. Default is blank (“”).
pu (float) – Parameter that adjusts for different levels of the variable. Default value 0.5.
pa (float) – Parameter that adjusts for small differences between the median and the 1st or 3rd quartile. Default value 0.05.
pc (float) – Parameter that controls the width of the confidence interval. Default value 20.
percentiles (tuple[float, float]) – Percentile values to use. Default is (0.25, 0.75).
flag (str) – Name of the variable used to indicate outliers. Default is ‘flag_hb’.
output_format (str) – Format to return. Can be ‘wide’, ‘long’ or ‘outliers’.
- Return type:
DataFrame
- Returns:
Data frame with flags, or with only the identified units.
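The roles of pu, pa, pc and percentiles are easier to see in code. The following is a minimal plain-Python sketch of the HB method as it is commonly described, not vaskify's own implementation: the names `hb_flags` and `quantile` are hypothetical, the linear-interpolation quantile convention is an assumption, stratification is left out, and all values are assumed to be strictly positive.

```python
from statistics import median


def quantile(values, q):
    """Quantile by linear interpolation between order statistics (assumed convention)."""
    s = sorted(values)
    pos = (len(s) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def hb_flags(prev, curr, pu=0.5, pa=0.05, pc=20, percentiles=(0.25, 0.75)):
    """Flag units whose change from period t-1 (prev) to t (curr) is extreme."""
    # Ratio of the current to the previous value for each unit.
    ratios = [c / p for p, c in zip(prev, curr)]
    r_med = median(ratios)
    # Symmetric transformation of the ratios around their median.
    s = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]
    # Effect: pu controls how much the size of the unit matters.
    e = [si * max(p, c) ** pu for si, p, c in zip(s, prev, curr)]
    e_med = median(e)
    e_lo = quantile(e, percentiles[0])
    e_hi = quantile(e, percentiles[1])
    # pa keeps the distances from collapsing when effects cluster at the median.
    d_lo = max(e_med - e_lo, abs(pa * e_med))
    d_hi = max(e_hi - e_med, abs(pa * e_med))
    # pc scales the width of the acceptance interval.
    lower = e_med - pc * d_lo
    upper = e_med + pc * d_hi
    return [not (lower <= ei <= upper) for ei in e]


prev = [100, 200, 150, 300, 250, 120]
curr = [105, 190, 160, 310, 240, 12000]
print(hb_flags(prev, curr))
```

In this made-up example only the last unit, whose value grows by a factor of 100, falls outside the acceptance interval.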
- thousand_error(y_var, time_var, lower_bound=-2.5, upper_bound=2.5, flag='flag_thousand', impute=False, impute_var='', output_format='data')
Detect thousand errors based on a previous period.
- Parameters:
y_var (str) – The variable of interest to check.
time_var (str) – Name of the variable indicating the time period. Values should follow an ISO 8601 format, for example ‘YYYY’, ‘YYYY-MM’ or ‘YYYY-MM-DD’, or an SSB standard such as ‘YYYY-Qq’.
lower_bound (float) – Lower bound log factor for defining an outlier.
upper_bound (float) – Upper bound log factor for defining an outlier.
flag (str) – Name of the flag variable to add to the data. Default is ‘flag_thousand’.
impute (bool) – Whether to impute the flagged observations. Default is False.
impute_var (str) – Name of the imputed variable.
output_format (str) – Whether to return the full data frame (‘data’) or only the identified outlier units (‘outliers’).
- Return type:
DataFrame
- Returns:
Data frame containing a flag variable for identified outliers, or a data frame containing only the outliers.
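One plausible reading of the "log factor" bounds is sketched below. It assumes that the bounds apply to the base-10 logarithm of the ratio between the current and the previous value, so that a value reported in units instead of thousands (a ratio near 1000, log10 near 3) exceeds the default upper bound of 2.5. Both the base of the logarithm and the function name `thousand_flags` are assumptions for illustration, not taken from vaskify.

```python
import math


def thousand_flags(prev, curr, lower_bound=-2.5, upper_bound=2.5):
    """Flag units whose log10 ratio curr/prev falls outside the bounds (assumed rule)."""
    flags = []
    for p, c in zip(prev, curr):
        if p <= 0 or c <= 0:
            # The log ratio is undefined here; this sketch leaves such units unflagged.
            flags.append(False)
            continue
        log_ratio = math.log10(c / p)
        flags.append(log_ratio < lower_bound or log_ratio > upper_bound)
    return flags


prev = [120, 80, 95000]
curr = [118000, 82, 97]
print(thousand_flags(prev, curr))
```

In this made-up example the first unit jumps by roughly a factor of 1000 and the third drops by roughly the same factor, so both are flagged, while the second unit is not.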