Skip to contents

Introduction

The merge_lonely_strata is one of the R functions in the struktuR R package, which is documented here (English). Lonely strata are defined to be non take-all strata with only one observation in sample data. Variance estimation would not be possible within such strata, which is a common problem, particularly in business surveys, where many strata, some of which may be quite small, are often used.

The main purpose of this function is to:

  • Create estimation strata (or artificial superstrata) by merging lonely strata.

This function uses the collapse.strata function from the ReGenesees package. The idea is to merge lonely strata by using population means of an auxiliary variable, denoted by xx, as a similarity measure. A lonely stratum is merged to another lonely stratum or a larger stratum which is the most similar to that lonely stratum in terms of the population mean of the xx-variable. If strata with only one observation are take-all strata, then these strata are ignored in strata merging since corresponding variances are set to zero for such strata in the struktuR as well as in the Python version Statstruk, and SAS-application Struktur (Using SAS-Struktur, Norsk).

This function should only be used for sample surveys with (stratified) single-stage sampling design where simple random sampling is used within each stratum to select sampling units.

Strata merge may not always be successful due to the reasons described in the following section of this document.

Package installation

For internal Statistic Norway users, the package is already installed on many of the production servers and this step may be skipped. For other users the package can be installed from github using the devtools function install_github. This step only needs to be run one time.

devtools::install_github("statisticsnorway/ssb-struktur")

To access and use the functions in the package we need to run library each time we start a new R session.

library(struktuR)

Data requirements

We need data on both the sample and population. The following variables should be included:

Population data set:

  • ID variable (id) which is consistent in both the population and sample data sets.
  • An explanatory variable when using regression and rate models (x). This is a variable which the statistic variable is correlated with.
  • A strata variable (strata) which divides the population into groups which are similar to each other. This is the smallest grouping variable and which separate models will be run on.
  • Domain variables (or publication groups: group) for producing statistical totals for. These should be able to be created by joining strata groups together.
  • Blocking variables (block) that are used to constrain strata merge, in a way that estimation strata can only be created within blocks. This parameter may contain one or more blocking variables.

Sample data set:

  • ID variable (id)
  • The explanatory variable when using regression and rate models, but can be missing (x)
  • The statistic variable(s) we are interested in estimating (y)

Example data

There are two synthetic data sets in the struktuR package that contain lonely strata: pop_data_1obs, sample_data_1obs. pop_data_1obs represents a population data set with 10005 rows, each one representing a company. We will use the variable employees, which refers to the number of employees in the company, or turnover, as the xx-variable. Variables size and industry will be used to specify the values of the strata, block, and/or group parameters of the function.

head(pop_data_1obs)
id employees employees_f employees_m turnover size industry
1 0 0 0 15396.54 small B
2 75 15 60 78814.71 mid B
3 55 42 13 97128.26 mid B
4 56 32 24 60414.60 mid B
5 110 13 97 237306.65 big B
6 172 50 122 473721.22 big B

The data set sample_data_1obs contains 1001 rows representing a sample of companies. In addition to the variables in the population data set, we do have some variables among which job_vacancies will be used as the yy-variable, representing the number of job positions advertised for the year.

head(sample_data_1obs)
id employees employees_f employees_m turnover size industry job_vacancies sick_days sick_days_f sick_days_m
5 132 13 97 237306.653 big B 1 515 278 237
9 28 11 10 63848.919 mid B 8 134 127 7
12 35 6 1 11462.578 small B 3 21 17 4
14 155 1 170 7934.057 big B 16 632 540 92
25 167 81 61 408964.032 big B 9 337 200 137
55 124 31 45 118718.363 mid B 7 525 151 374

Let’s see which strata are lonely strata by running the following code.

table(sample_data_1obs$industry)
As it can be noticed, the industry group G has only one observation in the sample data set.
Var1 Freq
B 195
C 197
D 193
E 219
F 196
G 1

The same stratum has five observations in the population data set. Hence, it is, in fact, a lonely stratum that we can handle by using the merge_lonely_strata function.

pop_data_1obs[pop_data_1obs$industry=="G", ]
id employees employees_f employees_m turnover size industry
10001 10001 50 30 20 10000 mid G
10002 10002 79 40 39 15000 mid G
10003 10003 63 30 33 9000 mid G
10004 10004 30 10 20 13000 mid G
10005 10005 50 30 20 12000 mid G

Parameter specifications

The default values of the parameters block and group are set to NULL. However, it is highly recommended to provide these variables. When specifying blocking variable(s), one must pay attention to that design strata are the most detailed groups within each block. In other words, each block must be either a design stratum or a group of design strata. Otherwise, the function would provide an error message. It is also recommended that blocks are set in a way that the sampling fractions do not vary a lot within each block. If possible, one may consider grouping design strata with similar sampling fractions to reduce bias that may arise merging strata with different fractions. Bias may occur due to assuming equal weights for all units within each estimation stratum.

Specification of the the publication domains (group) is also important to check the suitability of domain variables given (estimation) strata. Because, one may have issues in the estimation process with struktur_model later on if one or more domain variables cut across strata or estimation strata.

An additional parameter exclude in the function, allows one to include a list of IDs to exclude from the modelling in the estimation stage. This can be useful if there are some extreme values which will heavily influence the results but are most likely correct values. If a singleton stratum includes an ID that is specified to be excluded, then that stratum is not considered as lonely stratum since its corresponding variance is set to zero in the estimation with the struktuR package and the Pyhton version of it.

Usage of the function

At first, we will demonstrate how the function merge_lonely_strata works. Population and sample datasets, one auxiliary variable (xx-variable), one statistic variable (yy-variable), an id variable and strata variable(s) must all be provided. Then, some examples regarding the effect of blocking variable(s) will be given. Finally, we will some examples regarding the use of publication domain variables. Note that blocking and group variables are optional but it would be good to specify them for the reasons explained above.

Example 1

In this example, we shall use industry groups as the stratification variable and the number of employees as the xx-variable. Here, we will not specify any blocking constraint, neither group variable(s).

results <- merge_lonely_strata(pop_data_1obs, sample_data_1obs, 
                          x = "employees", 
                          y = "job_vacancies", 
                          id = "id",
                          strata = "industry")
# 
# # All lonely strata (1) successfully collapsed!
# Warning in merge_lonely_strata(pop_data_1obs, sample_data_1obs, x =
# "employees", : Consider providing non-NULL groups to check the suitability of
# group variables given estimation strata.
industry id employees employees_f employees_m turnover size est_strata merged
B 1 0 0 0 15396.54 small B 0
B 2 75 15 60 78814.71 mid B 0
B 3 55 42 13 97128.26 mid B 0
B 4 56 32 24 60414.60 mid B 0
B 5 110 13 97 237306.65 big B 0
B 6 172 50 122 473721.22 big B 0

The output of the function is a population dataset with additional columns:

  • est_strata refers to the estimation strata which can be design strata or superstrata that are formed newly.
  • merged is an indicator variable taking value 1 if merge took place or value 0 otherwise.

Value of each newly created superstratum is formed by binding values of the two strata that are merged with underscore as shown in the result table below. In this example, the lonely stratum “G” is merged to stratum “D”, so the new estimation strata has a value of “D_G”.

head(results[results$merged==1, ])
# 
# # All lonely strata (1) successfully collapsed!
# Warning in merge_lonely_strata(pop_data_1obs, sample_data_1obs, x =
# "employees", : Consider providing non-NULL groups to check the suitability of
# group variables given estimation strata.
industry id employees employees_f employees_m turnover size est_strata merged
4001 D 4543 129 108 21 93900.10 big D_G 1
4002 D 4535 82 17 65 107087.07 mid D_G 1
4003 D 4536 240 144 96 357626.36 big D_G 1
4004 D 4537 53 13 40 62149.90 mid D_G 1
4005 D 4538 13 11 2 22390.25 mid D_G 1
4006 D 4539 53 34 19 20276.47 mid D_G 1

Example 2

In the examples below, we shall use both size and industry groups as the stratification variables and turnover as the xx-variable. In this example, We will examine strata merge without blocking variable(s).

results <- merge_lonely_strata(pop_data_1obs, sample_data_1obs, 
                          x = "turnover", 
                          y = "job_vacancies", 
                          id = "id",
                          strata = c("size", "industry"))

As it can seen from the Table in the previous section, the company in the lonely stratum “G” is a medium-size company, so its size group is “mid”. With strata merge, stratum “mid_G” is merged to stratum “small_D”. When we do not specify a blocking variable, it is possible to have such merging. If sampling fractions differ quite much among small- and medium size groups, then this may bring about a bias problem as mentioned previously. To reduce bias, one may put a blocking constraint when forming superstrata. This is demonstrated below.

head(results[results$merged==1, ])
# 
# # All lonely strata (1) successfully collapsed!
# Warning in merge_lonely_strata(pop_data_1obs, sample_data_1obs, x = "turnover",
# : Consider providing non-NULL groups to check the suitability of group
# variables given estimation strata.
id employees employees_f employees_m turnover size industry est_strata merged
9176 10001 50 30 20 10000.0000 mid G small_D_mid_G 1
9177 10002 79 40 39 15000.0000 mid G small_D_mid_G 1
9178 10003 63 30 33 9000.0000 mid G small_D_mid_G 1
9179 10004 30 10 20 13000.0000 mid G small_D_mid_G 1
9180 10005 50 30 20 12000.0000 mid G small_D_mid_G 1
9499 4906 8 7 1 301.7372 small D small_D_mid_G 1

Example 3

In this example, we shall put a blocking constraint as such that strata merge will only take place within blocks, which means that strata from different blocks cannot be merged even they are similar to each other in terms of the population mean of the xx-variable (turnover here).

To avoid merging strata from different size groups, we may use size groups as the value of the block parameter as follows:

results <- merge_lonely_strata(pop_data_1obs, sample_data_1obs, 
                          x = "turnover", 
                          y = "job_vacancies", 
                          id = "id",
                          strata = c("size", "industry"),
                          block = "size")

Since we avoid strata merge across blocks, the lonely stratum “mid_G” is now merged to stratum “mid_C” instead of “small_D”. The new super strata is called “mid_C_mid_G” as seen below.

head(results[results$merged==1, ])
# 
# # All lonely strata (1) successfully collapsed!
# Warning in merge_lonely_strata(pop_data_1obs, sample_data_1obs, x = "turnover",
# : Consider providing non-NULL groups to check the suitability of group
# variables given estimation strata.
id employees employees_f employees_m turnover size industry est_strata merged
4425 3465 61 4 57 74122.764 mid C mid_C_mid_G 1
4426 3862 56 41 15 114993.527 mid C mid_C_mid_G 1
4427 2018 18 2 16 5720.938 mid C mid_C_mid_G 1
4428 3903 85 41 44 92782.059 mid C mid_C_mid_G 1
4429 3942 14 1 13 17956.721 mid C mid_C_mid_G 1
4430 2042 85 41 44 26477.678 mid C mid_C_mid_G 1

Example 4

In this example, we demonstrate how to specify the grouping variable. At first, we will use an inappropriate group variable, and then we will fix this by creating a suitable one depending on which design strata are merged.

Suppose that we wish to produce statistics at country level and by industry groups. If so, we need put a variable corresponding to country and industry groups as the value of the group parameter.

pop_data_1obs$country <- 1
results <- merge_lonely_strata(pop_data_1obs, sample_data_1obs, 
                          x = "turnover", 
                          y = "job_vacancies", 
                          id = "id",
                          strata = c("size", "industry"),
                          block = "size",
                          group = c("country", "industry"))

The code above will produce a warning message since one or more group variables(s) cut across strata or super strata (estimation strata). This is due to the fact that two industry groups, that is, “C” and “G” are merged, and so the publication domain needs to be minimum at estimation strata level. In other words, design strata or super strata need to be the most detailed groups within each publication domain.

# There were 10 observations that were missing values for job_vacancies. These were removed from the sample data
# The following strata, being neither surprise nor full count, had only 1 observation in the sample:  mid_G . Strata merge was attempted.
# 
# # All lonely strata (1) successfully collapsed!
# Warning in merge_lonely_strata(pop_data_1obs, sample_data_1obs, x = "turnover",
# : One or more group variables cut across estimation strata. Set group variables
# in a way that estimation strata are the most detailed levels within each group.
# The variable est_strata should be used for the variable job_vacancies in the next steps of the estimation functions as the value of the strata parameter.

The function would still produce an output data. However, this will be an issue when newly created estimation strata are used as the stratification variable with an inappropriate publication domain variable in the struktuR R package. One can use the output of the function merge_lonely_strata as a diagnostic to decide suitable publication domains given strata or estimation strata.

To remedy the issue in this example, we will simply create a new domain variable by merging two design strata as follows.

pop_data_1obs$domain <- pop_data_1obs$industry
pop_data_1obs$domain[pop_data_1obs$industry %in% c("C", "G")] <- "C_G"
results <- merge_lonely_strata(pop_data_1obs, sample_data_1obs, 
                          x = "turnover", 
                          y = "job_vacancies", 
                          id = "id",
                          strata = c("size", "industry"),
                          block = "size",
                          group = c("country", "domain"))

Restrictions

  • The use of this function will require some expert judgement when it comes to the specification of the block and group parameters. There may be some back and forwards. Users are advised to modify their choices depending on whether strata merge was successful and if so, which strata were merged.

  • This function can only be used for a single statistic variable (yy-variable). As a result of this, one may obtain different strata merge for different yy-variables, and so publication domains per variable may differ as well. This is because of the fact that the identification of lonely strata is done based on the number of observations within strata with non-null yy-values. If missing patterns by strata vary a lot from variable to variable, then a lonely stratum for one yy-variable may not be a lonely stratum for another statistic variable.