Merge lonely strata
Melike Oguz Alper
10.09.2025
merge_lonely_strata.Rmd
Introduction
The merge_lonely_strata
is one of the R functions in the
struktuR
R package, which is documented here
(English). Lonely strata are defined to be non take-all strata with only
one observation in sample data. Variance estimation would not be
possible within such strata, which is a common problem, particularly in
business surveys, where many strata, some of which may be quite small,
are often used.
The main purpose of this function is to:
- Create estimation strata (or artificial superstrata) by merging lonely strata.
This function uses the collapse.strata
function from the
ReGenesees
package. The idea is to merge lonely strata by
using population means of an auxiliary variable, denoted by
,
as a similarity measure. A lonely stratum is merged to another lonely
stratum or a larger stratum which is the most similar to that lonely
stratum in terms of the population mean of the
-variable.
If strata with only one observation are take-all strata, then these
strata are ignored in strata merging since corresponding variances are
set to zero for such strata in the struktuR
as well as in
the Python version Statstruk,
and SAS-application Struktur (Using
SAS-Struktur, Norsk).
This function should only be used for sample surveys with (stratified) single-stage sampling design where simple random sampling is used within each stratum to select sampling units.
Strata merge may not always be successful due to the reasons described in the following section of this document.
Package installation
For internal Statistic Norway users, the package is already installed
on many of the production servers and this step may be skipped. For
other users the package can be installed from github using the
devtools
function install_github
. This step
only needs to be run one time.
devtools::install_github("statisticsnorway/ssb-struktur")
To access and use the functions in the package we need to run
library
each time we start a new R session.
library(struktuR)
Data requirements
We need data on both the sample and population. The following variables should be included:
Population data set:
- ID variable (
id
) which is consistent in both the population and sample data sets. - An explanatory variable when using regression and rate models
(
x
). This is a variable which the statistic variable is correlated with. - A strata variable (
strata
) which divides the population into groups which are similar to each other. This is the smallest grouping variable and which separate models will be run on. - Domain variables (or publication groups:
group
) for producing statistical totals for. These should be able to be created by joining strata groups together. - Blocking variables (
block
) that are used to constrain strata merge, in a way that estimation strata can only be created within blocks. This parameter may contain one or more blocking variables.
Sample data set:
- ID variable (
id
) - The explanatory variable when using regression and rate models, but
can be missing (
x
) - The statistic variable(s) we are interested in estimating
(
y
)
Example data
There are two synthetic data sets in the struktuR
package that contain lonely strata: pop_data_1obs, sample_data_1obs.
pop_data_1obs
represents a population data set with 10005
rows, each one representing a company. We will use the variable
employees
, which refers to the number of employees in the
company, or turnover
, as the
-variable.
Variables size
and industry
will be used to
specify the values of the strata
, block
,
and/or group
parameters of the function.
head(pop_data_1obs)
id | employees | employees_f | employees_m | turnover | size | industry |
---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 15396.54 | small | B |
2 | 75 | 15 | 60 | 78814.71 | mid | B |
3 | 55 | 42 | 13 | 97128.26 | mid | B |
4 | 56 | 32 | 24 | 60414.60 | mid | B |
5 | 110 | 13 | 97 | 237306.65 | big | B |
6 | 172 | 50 | 122 | 473721.22 | big | B |
The data set sample_data_1obs
contains 1001 rows
representing a sample of companies. In addition to the variables in the
population data set, we do have some variables among which
job_vacancies
will be used as the
-variable,
representing the number of job positions advertised for the year.
head(sample_data_1obs)
id | employees | employees_f | employees_m | turnover | size | industry | job_vacancies | sick_days | sick_days_f | sick_days_m |
---|---|---|---|---|---|---|---|---|---|---|
5 | 132 | 13 | 97 | 237306.653 | big | B | 1 | 515 | 278 | 237 |
9 | 28 | 11 | 10 | 63848.919 | mid | B | 8 | 134 | 127 | 7 |
12 | 35 | 6 | 1 | 11462.578 | small | B | 3 | 21 | 17 | 4 |
14 | 155 | 1 | 170 | 7934.057 | big | B | 16 | 632 | 540 | 92 |
25 | 167 | 81 | 61 | 408964.032 | big | B | 9 | 337 | 200 | 137 |
55 | 124 | 31 | 45 | 118718.363 | mid | B | 7 | 525 | 151 | 374 |
Let’s see which strata are lonely strata by running the following code.
table(sample_data_1obs$industry)
Var1 | Freq |
---|---|
B | 195 |
C | 197 |
D | 193 |
E | 219 |
F | 196 |
G | 1 |
The same stratum has five observations in the population data set.
Hence, it is, in fact, a lonely stratum that we can handle by using the
merge_lonely_strata
function.
pop_data_1obs[pop_data_1obs$industry=="G", ]
id | employees | employees_f | employees_m | turnover | size | industry | |
---|---|---|---|---|---|---|---|
10001 | 10001 | 50 | 30 | 20 | 10000 | mid | G |
10002 | 10002 | 79 | 40 | 39 | 15000 | mid | G |
10003 | 10003 | 63 | 30 | 33 | 9000 | mid | G |
10004 | 10004 | 30 | 10 | 20 | 13000 | mid | G |
10005 | 10005 | 50 | 30 | 20 | 12000 | mid | G |
Parameter specifications
The default values of the parameters block
and
group
are set to NULL
. However, it is highly
recommended to provide these variables. When specifying blocking
variable(s), one must pay attention to that design strata are the most
detailed groups within each block. In other words, each block must be
either a design stratum or a group of design strata. Otherwise, the
function would provide an error message. It is also recommended that
blocks are set in a way that the sampling fractions do not vary a lot
within each block. If possible, one may consider grouping design strata
with similar sampling fractions to reduce bias that may arise merging
strata with different fractions. Bias may occur due to assuming equal
weights for all units within each estimation stratum.
Specification of the the publication domains (group
) is
also important to check the suitability of domain variables given
(estimation) strata. Because, one may have issues in the estimation
process with struktur_model
later on if one or more domain
variables cut across strata or estimation strata.
An additional parameter exclude
in the function, allows
one to include a list of IDs to exclude from the modelling in the
estimation stage. This can be useful if there are some extreme values
which will heavily influence the results but are most likely correct
values. If a singleton stratum includes an ID that is specified to be
excluded, then that stratum is not considered as lonely stratum since
its corresponding variance is set to zero in the estimation with the
struktuR
package and the Pyhton version of it.
Usage of the function
At first, we will demonstrate how the function
merge_lonely_strata
works. Population and sample datasets,
one auxiliary variable
(-variable),
one statistic variable
(-variable),
an id variable and strata variable(s) must all be provided. Then, some
examples regarding the effect of blocking variable(s) will be given.
Finally, we will some examples regarding the use of publication domain
variables. Note that blocking and group variables are optional but it
would be good to specify them for the reasons explained above.
Example 1
In this example, we shall use industry groups as the stratification variable and the number of employees as the -variable. Here, we will not specify any blocking constraint, neither group variable(s).
results <- merge_lonely_strata(pop_data_1obs, sample_data_1obs,
x = "employees",
y = "job_vacancies",
id = "id",
strata = "industry")
#
# # All lonely strata (1) successfully collapsed!
# Warning in merge_lonely_strata(pop_data_1obs, sample_data_1obs, x =
# "employees", : Consider providing non-NULL groups to check the suitability of
# group variables given estimation strata.
industry | id | employees | employees_f | employees_m | turnover | size | est_strata | merged |
---|---|---|---|---|---|---|---|---|
B | 1 | 0 | 0 | 0 | 15396.54 | small | B | 0 |
B | 2 | 75 | 15 | 60 | 78814.71 | mid | B | 0 |
B | 3 | 55 | 42 | 13 | 97128.26 | mid | B | 0 |
B | 4 | 56 | 32 | 24 | 60414.60 | mid | B | 0 |
B | 5 | 110 | 13 | 97 | 237306.65 | big | B | 0 |
B | 6 | 172 | 50 | 122 | 473721.22 | big | B | 0 |
The output of the function is a population dataset with additional columns:
-
est_strata
refers to the estimation strata which can be design strata or superstrata that are formed newly. -
merged
is an indicator variable taking value 1 if merge took place or value 0 otherwise.
Value of each newly created superstratum is formed by binding values of the two strata that are merged with underscore as shown in the result table below. In this example, the lonely stratum “G” is merged to stratum “D”, so the new estimation strata has a value of “D_G”.
head(results[results$merged==1, ])
#
# # All lonely strata (1) successfully collapsed!
# Warning in merge_lonely_strata(pop_data_1obs, sample_data_1obs, x =
# "employees", : Consider providing non-NULL groups to check the suitability of
# group variables given estimation strata.
industry | id | employees | employees_f | employees_m | turnover | size | est_strata | merged | |
---|---|---|---|---|---|---|---|---|---|
4001 | D | 4543 | 129 | 108 | 21 | 93900.10 | big | D_G | 1 |
4002 | D | 4535 | 82 | 17 | 65 | 107087.07 | mid | D_G | 1 |
4003 | D | 4536 | 240 | 144 | 96 | 357626.36 | big | D_G | 1 |
4004 | D | 4537 | 53 | 13 | 40 | 62149.90 | mid | D_G | 1 |
4005 | D | 4538 | 13 | 11 | 2 | 22390.25 | mid | D_G | 1 |
4006 | D | 4539 | 53 | 34 | 19 | 20276.47 | mid | D_G | 1 |
Example 2
In the examples below, we shall use both size and industry groups as the stratification variables and turnover as the -variable. In this example, We will examine strata merge without blocking variable(s).
results <- merge_lonely_strata(pop_data_1obs, sample_data_1obs,
x = "turnover",
y = "job_vacancies",
id = "id",
strata = c("size", "industry"))
As it can seen from the Table in the previous section, the company in the lonely stratum “G” is a medium-size company, so its size group is “mid”. With strata merge, stratum “mid_G” is merged to stratum “small_D”. When we do not specify a blocking variable, it is possible to have such merging. If sampling fractions differ quite much among small- and medium size groups, then this may bring about a bias problem as mentioned previously. To reduce bias, one may put a blocking constraint when forming superstrata. This is demonstrated below.
head(results[results$merged==1, ])
#
# # All lonely strata (1) successfully collapsed!
# Warning in merge_lonely_strata(pop_data_1obs, sample_data_1obs, x = "turnover",
# : Consider providing non-NULL groups to check the suitability of group
# variables given estimation strata.
id | employees | employees_f | employees_m | turnover | size | industry | est_strata | merged | |
---|---|---|---|---|---|---|---|---|---|
9176 | 10001 | 50 | 30 | 20 | 10000.0000 | mid | G | small_D_mid_G | 1 |
9177 | 10002 | 79 | 40 | 39 | 15000.0000 | mid | G | small_D_mid_G | 1 |
9178 | 10003 | 63 | 30 | 33 | 9000.0000 | mid | G | small_D_mid_G | 1 |
9179 | 10004 | 30 | 10 | 20 | 13000.0000 | mid | G | small_D_mid_G | 1 |
9180 | 10005 | 50 | 30 | 20 | 12000.0000 | mid | G | small_D_mid_G | 1 |
9499 | 4906 | 8 | 7 | 1 | 301.7372 | small | D | small_D_mid_G | 1 |
Example 3
In this example, we shall put a blocking constraint as such that strata merge will only take place within blocks, which means that strata from different blocks cannot be merged even they are similar to each other in terms of the population mean of the -variable (turnover here).
To avoid merging strata from different size groups, we may use size
groups as the value of the block
parameter as follows:
results <- merge_lonely_strata(pop_data_1obs, sample_data_1obs,
x = "turnover",
y = "job_vacancies",
id = "id",
strata = c("size", "industry"),
block = "size")
Since we avoid strata merge across blocks, the lonely stratum “mid_G” is now merged to stratum “mid_C” instead of “small_D”. The new super strata is called “mid_C_mid_G” as seen below.
head(results[results$merged==1, ])
#
# # All lonely strata (1) successfully collapsed!
# Warning in merge_lonely_strata(pop_data_1obs, sample_data_1obs, x = "turnover",
# : Consider providing non-NULL groups to check the suitability of group
# variables given estimation strata.
id | employees | employees_f | employees_m | turnover | size | industry | est_strata | merged | |
---|---|---|---|---|---|---|---|---|---|
4425 | 3465 | 61 | 4 | 57 | 74122.764 | mid | C | mid_C_mid_G | 1 |
4426 | 3862 | 56 | 41 | 15 | 114993.527 | mid | C | mid_C_mid_G | 1 |
4427 | 2018 | 18 | 2 | 16 | 5720.938 | mid | C | mid_C_mid_G | 1 |
4428 | 3903 | 85 | 41 | 44 | 92782.059 | mid | C | mid_C_mid_G | 1 |
4429 | 3942 | 14 | 1 | 13 | 17956.721 | mid | C | mid_C_mid_G | 1 |
4430 | 2042 | 85 | 41 | 44 | 26477.678 | mid | C | mid_C_mid_G | 1 |
Example 4
In this example, we demonstrate how to specify the grouping variable. At first, we will use an inappropriate group variable, and then we will fix this by creating a suitable one depending on which design strata are merged.
Suppose that we wish to produce statistics at country level and by
industry groups. If so, we need put a variable corresponding to country
and industry groups as the value of the group
parameter.
pop_data_1obs$country <- 1
results <- merge_lonely_strata(pop_data_1obs, sample_data_1obs,
x = "turnover",
y = "job_vacancies",
id = "id",
strata = c("size", "industry"),
block = "size",
group = c("country", "industry"))
The code above will produce a warning message since one or more group variables(s) cut across strata or super strata (estimation strata). This is due to the fact that two industry groups, that is, “C” and “G” are merged, and so the publication domain needs to be minimum at estimation strata level. In other words, design strata or super strata need to be the most detailed groups within each publication domain.
# There were 10 observations that were missing values for job_vacancies. These were removed from the sample data
# The following strata, being neither surprise nor full count, had only 1 observation in the sample: mid_G . Strata merge was attempted.
#
# # All lonely strata (1) successfully collapsed!
# Warning in merge_lonely_strata(pop_data_1obs, sample_data_1obs, x = "turnover",
# : One or more group variables cut across estimation strata. Set group variables
# in a way that estimation strata are the most detailed levels within each group.
# The variable est_strata should be used for the variable job_vacancies in the next steps of the estimation functions as the value of the strata parameter.
The function would still produce an output data. However, this will
be an issue when newly created estimation strata are used as the
stratification variable with an inappropriate publication domain
variable in the struktuR
R package. One can use the output
of the function merge_lonely_strata
as a diagnostic to
decide suitable publication domains given strata or estimation
strata.
To remedy the issue in this example, we will simply create a new domain variable by merging two design strata as follows.
pop_data_1obs$domain <- pop_data_1obs$industry
pop_data_1obs$domain[pop_data_1obs$industry %in% c("C", "G")] <- "C_G"
results <- merge_lonely_strata(pop_data_1obs, sample_data_1obs,
x = "turnover",
y = "job_vacancies",
id = "id",
strata = c("size", "industry"),
block = "size",
group = c("country", "domain"))
Restrictions
The use of this function will require some expert judgement when it comes to the specification of the
block
andgroup
parameters. There may be some back and forwards. Users are advised to modify their choices depending on whether strata merge was successful and if so, which strata were merged.This function can only be used for a single statistic variable (-variable). As a result of this, one may obtain different strata merge for different -variables, and so publication domains per variable may differ as well. This is because of the fact that the identification of lonely strata is done based on the number of observations within strata with non-null -values. If missing patterns by strata vary a lot from variable to variable, then a lonely stratum for one -variable may not be a lonely stratum for another statistic variable.