Internally a dummy/model matrix is created according to the model specification.
This model matrix is used in the aggregation process via matrix multiplication and/or the function aggregate_multiple_fun
.
Usage
model_aggregate(
data,
sum_vars = NULL,
fun_vars = NULL,
fun = NULL,
hierarchies = NULL,
formula = NULL,
dim_var = NULL,
total = NULL,
input_in_output = NULL,
remove_empty = NULL,
avoid_hierarchical = NULL,
preagg_var = NULL,
dummy = TRUE,
pre_aggregate = dummy,
aggregate_pkg = "base",
aggregate_na = TRUE,
aggregate_base_order = FALSE,
list_return = FALSE,
pre_return = FALSE,
verbose = TRUE,
mm_args = NULL,
...
)
Arguments
- data
Input data containing data to be aggregated, typically a data frame, tibble, or data.table. If data is not a classic data frame, it will be coerced to one internally.
- sum_vars
Variables to be summed. This will be done via matrix multiplication.
- fun_vars
Variables to be aggregated by supplied functions. This will be done via
aggregate_multiple_fun
anddummy_aggregate
andfun_vars
is specified as the parametervars
.- fun
The
fun
parameter toaggregate_multiple_fun
- hierarchies
The
hierarchies
parameter toModelMatrix
- formula
The
formula
parameter toModelMatrix
- dim_var
The
dimVar
parameter toModelMatrix
- total
When non-NULL, the
total
parameter toModelMatrix
. Thus, the actual default value is"Total"
.- input_in_output
When non-NULL, the
inputInOutput
parameter toModelMatrix
. Thus, the actual default value isTRUE
.- remove_empty
When non-NULL, the
removeEmpty
parameter toModelMatrix
. Thus, the actual default value isTRUE
with formula input without hierarchy and otherwiseFALSE
(seeModelMatrix
).- avoid_hierarchical
When non-NULL, the
avoidHierarchical
parameter toFormula2ModelMatrix
, which is an underlying function ofModelMatrix
.- preagg_var
Extra variables to be used as grouping elements in the pre-aggregate step
- dummy
The
dummy
parameter todummy_aggregate
. WhenTRUE
, only 0s and 1s are assumed in the generated model matrix. WhenFALSE
, non-0s in this matrix are passed as an additional first input parameter to thefun
functions.- pre_aggregate
Whether to pre-aggregate data to reduce the dimension of the model matrix. Note that all original
fun_vars
observations are retained in the aggregated dataset andpre_aggregate
does not affect the final result. However,pre_aggregate
must be set toFALSE
when thedummy_aggregate
parameterdummy
is set toFALSE
since thenunlist
will not be run. An exception to this is if thefun
functions are written to handle list data.- aggregate_pkg
Package used to pre-aggregate. Parameter
pkg
toaggregate_by_pkg
.- aggregate_na
Whether to include NAs in the grouping variables while preAggregate. Parameter
include_na
toaggregate_by_pkg
.- aggregate_base_order
Parameter
base_order
toaggregate_by_pkg
, used when pre-aggregate. The default is set toFALSE
to avoid unnecessary sorting operations. WhenTRUE
, an attempt is made to return the same result withdata.table
as with base R. This cannot be guaranteed due to potential variations in sorting behavior across different systems.- list_return
Whether to return a list of separate components including the model matrix
x
.- pre_return
Whether to return the pre-aggregate data as a two-component list. Can also be combined with
list_return
(see examples).- verbose
Whether to print information during calculations.
- mm_args
List of further arguments passed to
ModelMatrix
.- ...
Further arguments passed to
dummy_aggregate
.
Details
With formula input, limited output can be achieved by formula_selection
(see example).
An attribute called startCol
has been added to the output data frame to make this functionality work.
Examples
z <- SSBtoolsData("sprt_emp_withEU")
z$age[z$age == "Y15-29"] <- "young"
z$age[z$age == "Y30-64"] <- "old"
names(z)[names(z) == "ths_per"] <- "ths"
z$y <- 1:18
my_range <- function(x) c(min = min(x), max = max(x))
out <- model_aggregate(z,
formula = ~age:year + geo,
sum_vars = c("y", "ths"),
fun_vars = c(sum = "ths", mean = "y", med = "y", ra = "ths"),
fun = c(sum = sum, mean = mean, med = median, ra = my_range))
#> [pre_aggregate 18*6->18*7] [ModelMatrix] [crossprod] [dummy_aggregate] [cbind]
out
#> age year geo y ths ths_sum y_mean y_med ths_ra.min ths_ra.max
#> 1 Total Total Total 171 680.8 680.8 9.5 9.5 1.5 122.1
#> 2 Total Total Iceland 57 10.6 10.6 9.5 9.5 1.5 1.9
#> 3 Total Total Portugal 63 108.8 108.8 10.5 10.5 11.6 25.8
#> 4 Total Total Spain 51 561.4 561.4 8.5 8.5 63.4 122.1
#> 5 old 2014 Total 15 142.0 142.0 5.0 5.0 1.5 120.3
#> 6 old 2015 Total 33 145.5 145.5 11.0 11.0 1.6 119.6
#> 7 old 2016 Total 51 149.8 149.8 17.0 17.0 1.9 122.1
#> 8 young 2014 Total 6 80.3 80.3 2.0 2.0 1.8 66.9
#> 9 young 2015 Total 24 79.5 79.5 8.0 8.0 1.9 63.4
#> 10 young 2016 Total 42 83.7 83.7 14.0 14.0 1.9 69.1
# Limited output can be achieved by formula_selection
formula_selection(out, ~geo)
#> age year geo y ths ths_sum y_mean y_med ths_ra.min ths_ra.max
#> 1 Total Total Total 171 680.8 680.8 9.5 9.5 1.5 122.1
#> 2 Total Total Iceland 57 10.6 10.6 9.5 9.5 1.5 1.9
#> 3 Total Total Portugal 63 108.8 108.8 10.5 10.5 11.6 25.8
#> 4 Total Total Spain 51 561.4 561.4 8.5 8.5 63.4 122.1
# Using the single unnamed variable feature.
model_aggregate(z, formula = ~age, fun_vars = "y",
fun = c(sum = sum, mean = mean, med = median, n = length))
#> [pre_aggregate 18*6->2*2] [ModelMatrix] [dummy_aggregate] [cbind]
#> age sum mean med n
#> 1 Total 171 9.5 9.5 18
#> 2 old 99 11.0 11.0 9
#> 3 young 72 8.0 8.0 9
# To illustrate list_return and pre_return
for (pre_return in c(FALSE, TRUE)) for (list_return in c(FALSE, TRUE)) {
cat("\n=======================================\n")
cat("list_return =", list_return, ", pre_return =", pre_return, "\n\n")
out <- model_aggregate(z, formula = ~age:year,
sum_vars = c("ths", "y"),
fun_vars = c(mean = "y", ra = "y"),
fun = c(mean = mean, ra = my_range),
list_return = list_return,
pre_return = pre_return)
cat("\n")
print(out)
}
#>
#> =======================================
#> list_return = FALSE , pre_return = FALSE
#>
#> [pre_aggregate 18*6->6*5] [ModelMatrix] [crossprod] [dummy_aggregate] [cbind]
#>
#> age year ths y y_mean y_ra.min y_ra.max
#> 1 Total Total 680.8 171 9.5 1 18
#> 2 old 2014 142.0 15 5.0 4 6
#> 3 old 2015 145.5 33 11.0 10 12
#> 4 old 2016 149.8 51 17.0 16 18
#> 5 young 2014 80.3 6 2.0 1 3
#> 6 young 2015 79.5 24 8.0 7 9
#> 7 young 2016 83.7 42 14.0 13 15
#>
#> =======================================
#> list_return = TRUE , pre_return = FALSE
#>
#> [pre_aggregate 18*6->6*5] [ModelMatrix] [crossprod] [dummy_aggregate]
#>
#> $cross_table
#> age year
#> 1 Total Total
#> 2 old 2014
#> 3 old 2015
#> 4 old 2016
#> 5 young 2014
#> 6 young 2015
#> 7 young 2016
#>
#> $sum_data
#> ths y
#> Total-Total 680.8 171
#> old-2014 142.0 15
#> old-2015 145.5 33
#> old-2016 149.8 51
#> young-2014 80.3 6
#> young-2015 79.5 24
#> young-2016 83.7 42
#>
#> $fun_data
#> y_mean y_ra.min y_ra.max
#> 1 9.5 1 18
#> 2 5.0 4 6
#> 3 11.0 10 12
#> 4 17.0 16 18
#> 5 2.0 1 3
#> 6 8.0 7 9
#> 7 14.0 13 15
#>
#> $x
#> 6 x 7 sparse Matrix of class "dgCMatrix"
#> Total-Total old-2014 old-2015 old-2016 young-2014 young-2015 young-2016
#> [1,] 1 1 . . . . .
#> [2,] 1 . . . 1 . .
#> [3,] 1 . 1 . . . .
#> [4,] 1 . . . . 1 .
#> [5,] 1 . . 1 . . .
#> [6,] 1 . . . . . 1
#>
#>
#> =======================================
#> list_return = FALSE , pre_return = TRUE
#>
#> [pre_aggregate 18*6->6*5]
#>
#> $pre_data
#> age year y
#> 1 old 2014 4, 5, 6
#> 2 young 2014 1, 2, 3
#> 3 old 2015 10, 11, 12
#> 4 young 2015 7, 8, 9
#> 5 old 2016 16, 17, 18
#> 6 young 2016 13, 14, 15
#>
#> $pre_sum
#> ths y
#> 1 142.0 15
#> 2 80.3 6
#> 3 145.5 33
#> 4 79.5 24
#> 5 149.8 51
#> 6 83.7 42
#>
#>
#> =======================================
#> list_return = TRUE , pre_return = TRUE
#>
#> [pre_aggregate 18*6->6*5] [ModelMatrix] [crossprod] [dummy_aggregate]
#>
#> $pre_data
#> age year y
#> 1 old 2014 4, 5, 6
#> 2 young 2014 1, 2, 3
#> 3 old 2015 10, 11, 12
#> 4 young 2015 7, 8, 9
#> 5 old 2016 16, 17, 18
#> 6 young 2016 13, 14, 15
#>
#> $pre_sum
#> ths y
#> 1 142.0 15
#> 2 80.3 6
#> 3 145.5 33
#> 4 79.5 24
#> 5 149.8 51
#> 6 83.7 42
#>
#> $cross_table
#> age year
#> 1 Total Total
#> 2 old 2014
#> 3 old 2015
#> 4 old 2016
#> 5 young 2014
#> 6 young 2015
#> 7 young 2016
#>
#> $sum_data
#> ths y
#> Total-Total 680.8 171
#> old-2014 142.0 15
#> old-2015 145.5 33
#> old-2016 149.8 51
#> young-2014 80.3 6
#> young-2015 79.5 24
#> young-2016 83.7 42
#>
#> $fun_data
#> y_mean y_ra.min y_ra.max
#> 1 9.5 1 18
#> 2 5.0 4 6
#> 3 11.0 10 12
#> 4 17.0 16 18
#> 5 2.0 1 3
#> 6 8.0 7 9
#> 7 14.0 13 15
#>
#> $x
#> 6 x 7 sparse Matrix of class "dgCMatrix"
#> Total-Total old-2014 old-2015 old-2016 young-2014 young-2015 young-2016
#> [1,] 1 1 . . . . .
#> [2,] 1 . . . 1 . .
#> [3,] 1 . 1 . . . .
#> [4,] 1 . . . . 1 .
#> [5,] 1 . . 1 . . .
#> [6,] 1 . . . . . 1
#>
# To illustrate preagg_var
model_aggregate(z, formula = ~age:year,
sum_vars = c("ths", "y"),
fun_vars = c(mean = "y", ra = "y"),
fun = c(mean = mean, ra = my_range),
preagg_var = "eu",
pre_return = TRUE)[["pre_data"]]
#> [pre_aggregate 18*6->12*6]
#> age year eu y
#> 1 old 2014 EU 4, 6
#> 2 young 2014 EU 1, 3
#> 3 old 2015 EU 10, 12
#> 4 young 2015 EU 7, 9
#> 5 old 2016 EU 16, 18
#> 6 young 2016 EU 13, 15
#> 7 old 2014 nonEU 5
#> 8 young 2014 nonEU 2
#> 9 old 2015 nonEU 11
#> 10 young 2015 nonEU 8
#> 11 old 2016 nonEU 17
#> 12 young 2016 nonEU 14
# To illustrate hierarchies
geo_hier <- SSBtoolsData("sprt_emp_geoHier")
model_aggregate(z, hierarchies = list(age = "All", geo = geo_hier),
sum_vars = "y",
fun_vars = c(sum = "y"))
#> [pre_aggregate 18*6->6*4] [ModelMatrix] [crossprod] [dummy_aggregate] [cbind]
#> age geo y y_sum
#> 1 All EU 114 114
#> 2 All Europe 171 171
#> 3 All nonEU 57 57
#> 4 All Iceland 57 57
#> 5 All Portugal 63 63
#> 6 All Spain 51 51
#> 7 old EU 66 66
#> 8 old Europe 99 99
#> 9 old nonEU 33 33
#> 10 old Iceland 33 33
#> 11 old Portugal 36 36
#> 12 old Spain 30 30
#> 13 young EU 48 48
#> 14 young Europe 72 72
#> 15 young nonEU 24 24
#> 16 young Iceland 24 24
#> 17 young Portugal 27 27
#> 18 young Spain 21 21
#### Special non-dummy cases illustrated below ####
# Extend the hierarchy to make non-dummy model matrix
geo_hier2 <- rbind(data.frame(mapsFrom = c("EU", "Spain"),
mapsTo = "EUandSpain", sign = 1), geo_hier[, -4])
# Warning since non-dummy
# y and y_sum are different
model_aggregate(z, hierarchies = list(age = "All", geo = geo_hier2),
sum_vars = "y",
fun_vars = c(sum = "y"))
#> [pre_aggregate 18*6->6*4] [ModelMatrix] [crossprod] [dummy_aggregate
#> Warning: All non-0s in x are treated as 1s. Use dummy = FALSE?
#> ] [cbind]
#> age geo y y_sum
#> 1 All EUandSpain 165 114
#> 2 All EU 114 114
#> 3 All Europe 171 171
#> 4 All nonEU 57 57
#> 5 All Iceland 57 57
#> 6 All Portugal 63 63
#> 7 All Spain 51 51
#> 8 old EUandSpain 96 66
#> 9 old EU 66 66
#> 10 old Europe 99 99
#> 11 old nonEU 33 33
#> 12 old Iceland 33 33
#> 13 old Portugal 36 36
#> 14 old Spain 30 30
#> 15 young EUandSpain 69 48
#> 16 young EU 48 48
#> 17 young Europe 72 72
#> 18 young nonEU 24 24
#> 19 young Iceland 24 24
#> 20 young Portugal 27 27
#> 21 young Spain 21 21
# No warning since dummy since unionComplement = TRUE (see ?HierarchyCompute)
# y and y_sum are equal
model_aggregate(z, hierarchies = list(age = "All", geo = geo_hier2),
sum_vars = "y",
fun_vars = c(sum = "y"),
mm_args = list(unionComplement = TRUE))
#> [pre_aggregate 18*6->6*4] [ModelMatrix] [crossprod] [dummy_aggregate] [cbind]
#> age geo y y_sum
#> 1 All EUandSpain 114 114
#> 2 All EU 114 114
#> 3 All Europe 171 171
#> 4 All nonEU 57 57
#> 5 All Iceland 57 57
#> 6 All Portugal 63 63
#> 7 All Spain 51 51
#> 8 old EUandSpain 66 66
#> 9 old EU 66 66
#> 10 old Europe 99 99
#> 11 old nonEU 33 33
#> 12 old Iceland 33 33
#> 13 old Portugal 36 36
#> 14 old Spain 30 30
#> 15 young EUandSpain 48 48
#> 16 young EU 48 48
#> 17 young Europe 72 72
#> 18 young nonEU 24 24
#> 19 young Iceland 24 24
#> 20 young Portugal 27 27
#> 21 young Spain 21 21
# Non-dummy again, but no warning since dummy = FALSE
# Then pre_aggregate is by default set to FALSE (error when TRUE)
# fun with extra argument needed (see ?dummy_aggregate)
# y and y_sum2 are equal
model_aggregate(z, hierarchies = list(age = "All", geo = geo_hier2),
sum_vars = "y",
fun_vars = c(sum2 = "y"),
fun = c(sum2 = function(x, y) sum(x * y)),
dummy = FALSE)
#> [ModelMatrix] [crossprod] [dummy_aggregate] [cbind]
#> age geo y y_sum2
#> 1 All EUandSpain 165 165
#> 2 All EU 114 114
#> 3 All Europe 171 171
#> 4 All nonEU 57 57
#> 5 All Iceland 57 57
#> 6 All Portugal 63 63
#> 7 All Spain 51 51
#> 8 old EUandSpain 96 96
#> 9 old EU 66 66
#> 10 old Europe 99 99
#> 11 old nonEU 33 33
#> 12 old Iceland 33 33
#> 13 old Portugal 36 36
#> 14 old Spain 30 30
#> 15 young EUandSpain 69 69
#> 16 young EU 48 48
#> 17 young Europe 72 72
#> 18 young nonEU 24 24
#> 19 young Iceland 24 24
#> 20 young Portugal 27 27
#> 21 young Spain 21 21