Estimation using struktuR

Introduction

The struktuR package is used to estimate totals and their uncertainty from sample survey data, with particular focus on business surveys. It contains functions which replicates and builds on the estimation and functionality used in Statistic Norway’s internally developed SAS-application Struktur (Using SAS-Struktur, Norsk). Documentation on this earlier application will be referenced here to aid those wishing to move their statistical production (or parts of it) to R.

The main purpose of the functions is to:

Estimate totals for statistic variables for the country and within strata and groups.
Provide estimations for variance to indicate the quality of the estimated totals.
Identify outlier/extreme observations that may be having a overly high influence on the statistics.

The estimation method is a model-based estimate using a homogenous, rate or regression model. These are described further in more details later in this document.

Package installation

For internal Statistic Norway users, the package is already installed on many of the production servers and this step may be skipped. For other users the package can be installed from github using the devtools function install_github. This step only needs to be run one time.

devtools::install_github("statisticsnorway/struktuR")

To access and use the functions in the package we need to run library each time we start a new R session.

library(struktuR)

Data requirements

The purpose of the functions used for estimation is to take information from a sample of units (for example a sample of companies) and make inferences on the remaining units in the population to find an estimation for the total. For this we need data on both the sample and population. This can be two separate data sets or one full/population data set with information on the statistics variable for those units in the sample. When there are two data sets, the following variables should be included:

Population data set:

ID variable which is consistent in both the population and sample data sets (id).
An explanatory variable when using regression and rate models (x). This is a variable which the statistic variable is correlated with.
A strata variable which divides the population into groups which are similar to each other. This is the smallest grouping variable and which seperate models will be run on.
Domain variables (or publication groups) for producing statistical totals for. These should be able to be created by joining strata groups together.

Sample data set:

ID variable (id)
The explanatory variable when using regression and rate models, but can be missing (x)
The statistic variable(s) we are interested in estimating (y)

Example data There are two synthetic data sets in the package for examples: pop_data, sample_data. pop_data represents a population data set with 10000 rows, each one representing a company. The variable employees is the number of employees in the company and provides an explanatory variable used in many of Statistic Norway’s current estimation models. Variables emplyees_f and employees_m give the breakdown for number of male and female employees. Turnover provides an alternative explanatory variable. Variables size and industry providing options for stratification.

head(pop_data)

id	employees	employees_f	employees_m	turnover	size	industry
1	0	0	0	15396.54	small	B
2	75	15	60	78814.71	mid	B
3	55	42	13	97128.26	mid	B
4	56	32	24	60414.60	mid	B
5	110	13	97	237306.65	big	B
6	172	50	122	473721.22	big	B

The data set sample_data contains 1000 rows representing a sample of companies. In addition to the variables listed previously it contains a job_vacancies representing the number of job positions advertised for the year as an example statistic variable. Variables sick_days, sick_days_f and sick_days_m give addition statistic variables for the number of days lost to sick leave, broken down into male and female.

head(sample_data)

id	employees	employees_f	employees_m	turnover	size	industry	job_vacancies	sick_days	sick_days_f	sick_days_m
5	132	13	97	237306.653	big	B	1	515	278	237
9	28	11	10	63848.919	mid	B	8	134	127	7
12	35	6	1	11462.578	small	B	3	21	17	4
14	155	1	170	7934.057	big	B	16	632	540	92
25	167	81	61	408964.032	big	B	9	337	200	137
55	124	31	45	118718.363	mid	B	7	525	151	374

Model choice

The main function for estimation in the R package is called struktur_model and uses the parameter method to define the model of choice. The default setting is set to run a rate model but can also be a homogenous or regression model. This is similar to the choice given in the SAS-Struktur application

A homogenous model is useful when there are no available additional variables that are correlated with the statistics variable. A rate model is used when there is one variable available for all units in the population which has a good correlation with the statistics variable. A regression model can also be used when there is a variable available in the population but can also be adapted for cases with multiple explanatory variables. It also has an assumption that the residuals in the model are not related to the explanatory variables.

The struktuR package currently only supports the use of a rate model for estimation. Contact Statistic’s Norway’s Methods department if you wish to test either of the other models.

Rate model

A rate model can be described with the general formula: \[ y_{hi} = \beta_{h}x_{hi} + \epsilon_{hi} \] where \(i = 1, 2, ..., N_h\) and is the units in the sample in the stratum, \(h\), \(y\) is the statistics variable available in the sample and \(x\) is the explanatory variable.

For a rate model, the distribution of the error term, or residuals, is assumed to correlate with the explanatory variable, \(x\). Thereby \[ Var(\epsilon_{hi})= x_{hi}\sigma_h^2 \] An easy way to see if a rate model might be a good choice, is to start with plotting the statistics variable against the explanatory variable. If there appears to be a linear pattern, but the points vary more from the linear correlation when they themselves are larger it will result in a funnel type shape. This can be seen in the example:

Another difference is that rate models do not have an intercept term whereas regression models often do. This means that when the explanatory variable (x) is 0, then it should make sense for the statistic variable (y) to also be 0.

Estimation

The main function, struktur_model is used to perform the modeling and estimation. This function will fit the chosen model within strata groups then predict it for the population as a whole. The population and sample datasets must be provided, along with explanatory variable, statistic variable(s), id and strata variables. Here we use industry groups in the example data as the strata variable.

results <- struktur_model(pop_data, sample_data, 
                          x = "employees", 
                          y = "job_vacancies", 
                          id = "id",
                          strata = "industry")

id	employees	employees_f	employees_m	turnover	size	industry	job_vacancies	s_flag	.strata	employees_pop	employees_utv	job_vacancies_pop	job_vacancies_utv	job_vacancies_beta	job_vacancies_beta_ex	job_vacancies_rstud	job_vacancies_hat	job_vacancies_G	job_vacancies_imp	job_vacancies_flag
1	0	0	0	15396.54	small	B	NA	0	B	165769	15997	2000	193	0.1459024	NA	NA	NA	NA	0.000000	pred
2	75	15	60	78814.71	mid	B	NA	0	B	165769	15997	2000	193	0.1459024	NA	NA	NA	NA	10.942677	pred
3	55	42	13	97128.26	mid	B	NA	0	B	165769	15997	2000	193	0.1459024	NA	NA	NA	NA	8.024630	pred
4	56	32	24	60414.60	mid	B	NA	0	B	165769	15997	2000	193	0.1459024	NA	NA	NA	NA	8.170532	pred
5	110	13	97	237306.65	big	B	1	1	B	165769	15997	2000	193	0.1459024	0.1470533	-2.085257	0.008251547	-0.190207	1.000000	mod
6	172	50	122	473721.22	big	B	NA	0	B	165769	15997	2000	193	0.1459024	NA	NA	NA	NA	25.095205	pred

The result is a population dataset with added columns. It acts as a mass-imputed file where totals and variances can be calculated from.

The general explanatory, x, variables added include:

employees_pop gives the sum of the explanatory (employees) for the stratum in the population.
employees_utv gives the sum of the explanatory (employees) for the stratum in the sample.

The prefix employees will be replaced with the name of the explanatory variable chosen when run using a different data set.

Statistic variables added include:

job_vacancies_pop gives the count/number of observations in the population for the stratum.
job_vacancies_utv gives the count/number of observations in the sample for the stratum.
job_vacancies_beta gives the estimate of the \(\beta\) value from the model for the stratum.
job_vacancies_beta_ex gives the estimate of the \(\beta\) value from the model for the stratum if that specific observation is removed from the model.
job_vacancies_rstud, job_vacancies_hat, job_vacancies_G provide the studentized residuals, hat values and G-value used for identifying outliers and calculating variances
job_vacancies_imp provides the model based imputation for the statistic variable.
job_vacancies_flag provides an indicator on whether the value of job_vacancies_imp is predicted (pred), is observed and contributes to the model mod) or is observed but excluded from the model (ex).

Again, the prefix job_vacancies will be replaced with the name of the statistical variable chosen when run on a different data set.

An additional parameter exclude in the function, allows you to include a list of ID’s to exclude from the modelling. This can be useful is there are some extreme values which will heavily influence the results but are most likely correct values.

View results and uncertainty

To summarize and view the results you can use the function get_results with the first input parameter being the output data set from the main struktur_modelfunction. This will give estimates for the total of the statistical variable in addition to the variance (job_vacanices_var), upper and lower bounds (job_vacancies_UB and job_vacancies_LB). The coefficient og variation is given for three robust variance methods (job_vacancies_CV1, job_vacancies_CV2 and job_vacancies_CV3) with variance 2 being the default used as the varience for the upper and lower bound varaibles. These are all described in the Using SAS-struktur guide (page 25). The results are provided per strata if a grouping/domain variable is not provided.

If the exclude parameter is use in the struktur_model function an additional “suprise strata” is created containing all the observations to exclude from the model. This strata will be shown in the results.

results_table <- get_results(results)
results_table

.strata	employees_pop	employees_utv	job_vacancies_pop	job_vacancies_utv	job_vacancies_est	job_vacancies_var	job_vacancies_UB	job_vacancies_LB	job_vacancies_CV1	job_vacancies_CV2	job_vacancies_CV3
B	165769	15997	2000	193	24186.09	1477434	26568.46	21803.71	5.000508	5.025606	5.050884
C	166222	16484	2000	196	23535.68	1881763	26224.36	20847.00	5.801299	5.828485	5.855847
D	160949	15037	2000	191	25720.59	1604171	28203.04	23238.13	4.897947	4.924300	4.950855
E	164038	19319	2000	219	23520.12	1281820	25739.19	21301.06	4.792041	4.813643	4.835400
F	161212	15138	2000	191	22811.21	2073611	25633.62	19988.80	6.277441	6.312702	6.348270

View domain estimations

We generally don’t publish statistics at the strata-level. This is usually most useful for the modeling/estimation process. To view results for total estimates and uncertainty for publishing domains we can use the get_results function with the group parameter. For example to view results for the entire population/country level and industry levels:

results$country <- 1
get_results(results, group = c("country", "industry"))

group_name	group	job_vacancies_est	job_vacancies_var	job_vacancies_cv
country	1	119773.69	8318798	2.408069
industry	B	24186.09	1477434	5.025606
industry	C	23535.68	1881763	5.828485
industry	D	25720.59	1604171	4.924300
industry	E	23520.12	1281820	4.813643
industry	F	22811.21	2073611	6.312702

In this case any observations that were excluded from the model and are placed in the “surprise strata” will now be included in the total (but do not contribute to the variance).

View outliers

To look for extreme values we can use the get_extremes function to create a table og potential problematic values. We simple run this function on the output data set from the main struktur_model function.

outliers <- get_extremes(data = results)
head(outliers)

id	.strata	employees	job_vacancies	job_vacancies_utv	job_vacancies_est	job_vacancies_est_ex	job_vacancies_G	job_vacancies_rstud	job_vacancies_G_grense	job_vacancies_rstud_grense
8897	F	209	68	191	22811.21	22426.30	0.3732072	3.011318	0.1447149	2
2583	C	245	74	196	23535.68	23139.00	0.3687849	2.977828	0.1428571	2
4874	D	214	74	191	25720.59	25351.70	0.3667711	2.815034	0.1447149	2
9394	F	393	90	191	22811.21	22464.15	0.3321278	1.985633	0.1447149	2
9729	F	239	66	191	22811.21	22478.08	0.3206109	2.479199	0.1447149	2
885	B	301	71	193	24186.09	23875.72	0.3101890	2.303136	0.1439632	2

This produces a table with possible outliers. The ID and strata variables are provided along the the explanatory and statistical varaibles. The estimate for the strata total that the observation is given (job_vacancies_est) and also the strata estimate if the observation was excluded from the model, but the observed value was included in the total (job_vacancies_est_ex).

The Gand rstud values are given for the observations and the threshold values used. These give an indication of how close they are to an average, or expected value given the size of the explanatory variable value. The methods for these are described further in Using SAS-Struktur.

Only observations with values above the threshold for G or rstud are given in the table.

To view outliers in a figure you can use the function plot_extreme, inputting the dataset given from the get_extremefunction. This is a way to get a quick view of the potential outliers. The default plot gives the G-values.

plot_extreme(outliers)

Alternatively, you can plot a comparison of the strata estimates with and without that point.

plot_extreme(outliers, type = "estimate")

Susie Jentoft

27.05.2022