Internally a dummy/model matrix is created according to the model specification.
This model matrix is used in the aggregation process via matrix multiplication and/or the function aggregate_multiple_fun.
Usage
model_aggregate(
data,
sum_vars = NULL,
fun_vars = NULL,
fun = NULL,
hierarchies = NULL,
formula = NULL,
dim_var = NULL,
total = NULL,
input_in_output = NULL,
remove_empty = NULL,
avoid_hierarchical = NULL,
preagg_var = NULL,
dummy = TRUE,
pre_aggregate = dummy,
aggregate_pkg = "base",
aggregate_na = TRUE,
aggregate_base_order = FALSE,
list_return = FALSE,
pre_return = FALSE,
verbose = TRUE,
mm_args = NULL,
...
)Arguments
- data
Input data containing data to be aggregated, typically a data frame, tibble, or data.table. If data is not a classic data frame, it will be coerced to one internally.
- sum_vars
Variables to be summed. This will be done via matrix multiplication.
- fun_vars
Variables to be aggregated by supplied functions. This will be done via
aggregate_multiple_funanddummy_aggregateandfun_varsis specified as the parametervars.- fun
The
funparameter toaggregate_multiple_fun- hierarchies
The
hierarchiesparameter toModelMatrix- formula
The
formulaparameter toModelMatrix- dim_var
The
dimVarparameter toModelMatrix- total
When non-NULL, the
totalparameter toModelMatrix. Thus, the actual default value is"Total".- input_in_output
When non-NULL, the
inputInOutputparameter toModelMatrix. Thus, the actual default value isTRUE.- remove_empty
When non-NULL, the
removeEmptyparameter toModelMatrix. Thus, the actual default value isTRUEwith formula input without hierarchy and otherwiseFALSE(seeModelMatrix).- avoid_hierarchical
When non-NULL, the
avoidHierarchicalparameter toFormula2ModelMatrix, which is an underlying function ofModelMatrix.- preagg_var
Extra variables to be used as grouping elements in the pre-aggregate step
- dummy
The
dummyparameter todummy_aggregate. WhenTRUE, only 0s and 1s are assumed in the generated model matrix. WhenFALSE, non-0s in this matrix are passed as an additional first input parameter to thefunfunctions.- pre_aggregate
Whether to pre-aggregate data to reduce the dimension of the model matrix. Note that all original
fun_varsobservations are retained in the aggregated dataset andpre_aggregatedoes not affect the final result. However,pre_aggregatemust be set toFALSEwhen thedummy_aggregateparameterdummyis set toFALSEsince thenunlistwill not be run. An exception to this is if thefunfunctions are written to handle list data.- aggregate_pkg
Package used to pre-aggregate. Parameter
pkgtoaggregate_by_pkg.- aggregate_na
Whether to include NAs in the grouping variables while preAggregate. Parameter
include_natoaggregate_by_pkg.- aggregate_base_order
Parameter
base_ordertoaggregate_by_pkg, used when pre-aggregate. The default is set toFALSEto avoid unnecessary sorting operations. WhenTRUE, an attempt is made to return the same result withdata.tableas with base R. This cannot be guaranteed due to potential variations in sorting behavior across different systems.- list_return
Whether to return a list of separate components including the model matrix
x.- pre_return
Whether to return the pre-aggregate data as a two-component list. Can also be combined with
list_return(see examples).- verbose
Whether to print information during calculations.
- mm_args
List of further arguments passed to
ModelMatrix.- ...
Further arguments passed to
dummy_aggregate.
Details
With formula input, limited output can be achieved by formula_selection (see example).
An attribute called startCol has been added to the output data frame to make this functionality work.
Examples
z <- SSBtoolsData("sprt_emp_withEU")
z$age[z$age == "Y15-29"] <- "young"
z$age[z$age == "Y30-64"] <- "old"
names(z)[names(z) == "ths_per"] <- "ths"
z$y <- 1:18
my_range <- function(x) c(min = min(x), max = max(x))
out <- model_aggregate(z,
formula = ~age:year + geo,
sum_vars = c("y", "ths"),
fun_vars = c(sum = "ths", mean = "y", med = "y", ra = "ths"),
fun = c(sum = sum, mean = mean, med = median, ra = my_range))
#> [pre_aggregate 18*6->18*7] [ModelMatrix] [crossprod] [dummy_aggregate] [cbind]
out
#> age year geo y ths ths_sum y_mean y_med ths_ra.min ths_ra.max
#> 1 Total Total Total 171 680.8 680.8 9.5 9.5 1.5 122.1
#> 2 Total Total Iceland 57 10.6 10.6 9.5 9.5 1.5 1.9
#> 3 Total Total Portugal 63 108.8 108.8 10.5 10.5 11.6 25.8
#> 4 Total Total Spain 51 561.4 561.4 8.5 8.5 63.4 122.1
#> 5 old 2014 Total 15 142.0 142.0 5.0 5.0 1.5 120.3
#> 6 old 2015 Total 33 145.5 145.5 11.0 11.0 1.6 119.6
#> 7 old 2016 Total 51 149.8 149.8 17.0 17.0 1.9 122.1
#> 8 young 2014 Total 6 80.3 80.3 2.0 2.0 1.8 66.9
#> 9 young 2015 Total 24 79.5 79.5 8.0 8.0 1.9 63.4
#> 10 young 2016 Total 42 83.7 83.7 14.0 14.0 1.9 69.1
# Limited output can be achieved by formula_selection
formula_selection(out, ~geo)
#> age year geo y ths ths_sum y_mean y_med ths_ra.min ths_ra.max
#> 1 Total Total Total 171 680.8 680.8 9.5 9.5 1.5 122.1
#> 2 Total Total Iceland 57 10.6 10.6 9.5 9.5 1.5 1.9
#> 3 Total Total Portugal 63 108.8 108.8 10.5 10.5 11.6 25.8
#> 4 Total Total Spain 51 561.4 561.4 8.5 8.5 63.4 122.1
# Using the single unnamed variable feature.
model_aggregate(z, formula = ~age, fun_vars = "y",
fun = c(sum = sum, mean = mean, med = median, n = length))
#> [pre_aggregate 18*6->2*2] [ModelMatrix] [dummy_aggregate] [cbind]
#> age sum mean med n
#> 1 Total 171 9.5 9.5 18
#> 2 old 99 11.0 11.0 9
#> 3 young 72 8.0 8.0 9
# To illustrate list_return and pre_return
for (pre_return in c(FALSE, TRUE)) for (list_return in c(FALSE, TRUE)) {
cat("\n=======================================\n")
cat("list_return =", list_return, ", pre_return =", pre_return, "\n\n")
out <- model_aggregate(z, formula = ~age:year,
sum_vars = c("ths", "y"),
fun_vars = c(mean = "y", ra = "y"),
fun = c(mean = mean, ra = my_range),
list_return = list_return,
pre_return = pre_return)
cat("\n")
print(out)
}
#>
#> =======================================
#> list_return = FALSE , pre_return = FALSE
#>
#> [pre_aggregate 18*6->6*5] [ModelMatrix] [crossprod] [dummy_aggregate] [cbind]
#>
#> age year ths y y_mean y_ra.min y_ra.max
#> 1 Total Total 680.8 171 9.5 1 18
#> 2 old 2014 142.0 15 5.0 4 6
#> 3 old 2015 145.5 33 11.0 10 12
#> 4 old 2016 149.8 51 17.0 16 18
#> 5 young 2014 80.3 6 2.0 1 3
#> 6 young 2015 79.5 24 8.0 7 9
#> 7 young 2016 83.7 42 14.0 13 15
#>
#> =======================================
#> list_return = TRUE , pre_return = FALSE
#>
#> [pre_aggregate 18*6->6*5] [ModelMatrix] [crossprod] [dummy_aggregate]
#>
#> $cross_table
#> age year
#> 1 Total Total
#> 2 old 2014
#> 3 old 2015
#> 4 old 2016
#> 5 young 2014
#> 6 young 2015
#> 7 young 2016
#>
#> $sum_data
#> ths y
#> Total-Total 680.8 171
#> old-2014 142.0 15
#> old-2015 145.5 33
#> old-2016 149.8 51
#> young-2014 80.3 6
#> young-2015 79.5 24
#> young-2016 83.7 42
#>
#> $fun_data
#> y_mean y_ra.min y_ra.max
#> 1 9.5 1 18
#> 2 5.0 4 6
#> 3 11.0 10 12
#> 4 17.0 16 18
#> 5 2.0 1 3
#> 6 8.0 7 9
#> 7 14.0 13 15
#>
#> $x
#> 6 x 7 sparse Matrix of class "dgCMatrix"
#> Total-Total old-2014 old-2015 old-2016 young-2014 young-2015 young-2016
#> [1,] 1 1 . . . . .
#> [2,] 1 . . . 1 . .
#> [3,] 1 . 1 . . . .
#> [4,] 1 . . . . 1 .
#> [5,] 1 . . 1 . . .
#> [6,] 1 . . . . . 1
#>
#>
#> =======================================
#> list_return = FALSE , pre_return = TRUE
#>
#> [pre_aggregate 18*6->6*5]
#>
#> $pre_data
#> age year y
#> 1 old 2014 4, 5, 6
#> 2 young 2014 1, 2, 3
#> 3 old 2015 10, 11, 12
#> 4 young 2015 7, 8, 9
#> 5 old 2016 16, 17, 18
#> 6 young 2016 13, 14, 15
#>
#> $pre_sum
#> ths y
#> 1 142.0 15
#> 2 80.3 6
#> 3 145.5 33
#> 4 79.5 24
#> 5 149.8 51
#> 6 83.7 42
#>
#>
#> =======================================
#> list_return = TRUE , pre_return = TRUE
#>
#> [pre_aggregate 18*6->6*5] [ModelMatrix] [crossprod] [dummy_aggregate]
#>
#> $pre_data
#> age year y
#> 1 old 2014 4, 5, 6
#> 2 young 2014 1, 2, 3
#> 3 old 2015 10, 11, 12
#> 4 young 2015 7, 8, 9
#> 5 old 2016 16, 17, 18
#> 6 young 2016 13, 14, 15
#>
#> $pre_sum
#> ths y
#> 1 142.0 15
#> 2 80.3 6
#> 3 145.5 33
#> 4 79.5 24
#> 5 149.8 51
#> 6 83.7 42
#>
#> $cross_table
#> age year
#> 1 Total Total
#> 2 old 2014
#> 3 old 2015
#> 4 old 2016
#> 5 young 2014
#> 6 young 2015
#> 7 young 2016
#>
#> $sum_data
#> ths y
#> Total-Total 680.8 171
#> old-2014 142.0 15
#> old-2015 145.5 33
#> old-2016 149.8 51
#> young-2014 80.3 6
#> young-2015 79.5 24
#> young-2016 83.7 42
#>
#> $fun_data
#> y_mean y_ra.min y_ra.max
#> 1 9.5 1 18
#> 2 5.0 4 6
#> 3 11.0 10 12
#> 4 17.0 16 18
#> 5 2.0 1 3
#> 6 8.0 7 9
#> 7 14.0 13 15
#>
#> $x
#> 6 x 7 sparse Matrix of class "dgCMatrix"
#> Total-Total old-2014 old-2015 old-2016 young-2014 young-2015 young-2016
#> [1,] 1 1 . . . . .
#> [2,] 1 . . . 1 . .
#> [3,] 1 . 1 . . . .
#> [4,] 1 . . . . 1 .
#> [5,] 1 . . 1 . . .
#> [6,] 1 . . . . . 1
#>
# To illustrate preagg_var
model_aggregate(z, formula = ~age:year,
sum_vars = c("ths", "y"),
fun_vars = c(mean = "y", ra = "y"),
fun = c(mean = mean, ra = my_range),
preagg_var = "eu",
pre_return = TRUE)[["pre_data"]]
#> [pre_aggregate 18*6->12*6]
#> age year eu y
#> 1 old 2014 EU 4, 6
#> 2 young 2014 EU 1, 3
#> 3 old 2015 EU 10, 12
#> 4 young 2015 EU 7, 9
#> 5 old 2016 EU 16, 18
#> 6 young 2016 EU 13, 15
#> 7 old 2014 nonEU 5
#> 8 young 2014 nonEU 2
#> 9 old 2015 nonEU 11
#> 10 young 2015 nonEU 8
#> 11 old 2016 nonEU 17
#> 12 young 2016 nonEU 14
# To illustrate hierarchies
geo_hier <- SSBtoolsData("sprt_emp_geoHier")
model_aggregate(z, hierarchies = list(age = "All", geo = geo_hier),
sum_vars = "y",
fun_vars = c(sum = "y"))
#> [pre_aggregate 18*6->6*4] [ModelMatrix] [crossprod] [dummy_aggregate] [cbind]
#> age geo y y_sum
#> 1 All EU 114 114
#> 2 All Europe 171 171
#> 3 All nonEU 57 57
#> 4 All Iceland 57 57
#> 5 All Portugal 63 63
#> 6 All Spain 51 51
#> 7 old EU 66 66
#> 8 old Europe 99 99
#> 9 old nonEU 33 33
#> 10 old Iceland 33 33
#> 11 old Portugal 36 36
#> 12 old Spain 30 30
#> 13 young EU 48 48
#> 14 young Europe 72 72
#> 15 young nonEU 24 24
#> 16 young Iceland 24 24
#> 17 young Portugal 27 27
#> 18 young Spain 21 21
#### Special non-dummy cases illustrated below ####
# Extend the hierarchy to make non-dummy model matrix
geo_hier2 <- rbind(data.frame(mapsFrom = c("EU", "Spain"),
mapsTo = "EUandSpain", sign = 1), geo_hier[, -4])
# Warning since non-dummy
# y and y_sum are different
model_aggregate(z, hierarchies = list(age = "All", geo = geo_hier2),
sum_vars = "y",
fun_vars = c(sum = "y"))
#> [pre_aggregate 18*6->6*4] [ModelMatrix] [crossprod] [dummy_aggregate
#> Warning: All non-0s in x are treated as 1s. Use dummy = FALSE?
#> ] [cbind]
#> age geo y y_sum
#> 1 All EUandSpain 165 114
#> 2 All EU 114 114
#> 3 All Europe 171 171
#> 4 All nonEU 57 57
#> 5 All Iceland 57 57
#> 6 All Portugal 63 63
#> 7 All Spain 51 51
#> 8 old EUandSpain 96 66
#> 9 old EU 66 66
#> 10 old Europe 99 99
#> 11 old nonEU 33 33
#> 12 old Iceland 33 33
#> 13 old Portugal 36 36
#> 14 old Spain 30 30
#> 15 young EUandSpain 69 48
#> 16 young EU 48 48
#> 17 young Europe 72 72
#> 18 young nonEU 24 24
#> 19 young Iceland 24 24
#> 20 young Portugal 27 27
#> 21 young Spain 21 21
# No warning since dummy since unionComplement = TRUE (see ?HierarchyCompute)
# y and y_sum are equal
model_aggregate(z, hierarchies = list(age = "All", geo = geo_hier2),
sum_vars = "y",
fun_vars = c(sum = "y"),
mm_args = list(unionComplement = TRUE))
#> [pre_aggregate 18*6->6*4] [ModelMatrix] [crossprod] [dummy_aggregate] [cbind]
#> age geo y y_sum
#> 1 All EUandSpain 114 114
#> 2 All EU 114 114
#> 3 All Europe 171 171
#> 4 All nonEU 57 57
#> 5 All Iceland 57 57
#> 6 All Portugal 63 63
#> 7 All Spain 51 51
#> 8 old EUandSpain 66 66
#> 9 old EU 66 66
#> 10 old Europe 99 99
#> 11 old nonEU 33 33
#> 12 old Iceland 33 33
#> 13 old Portugal 36 36
#> 14 old Spain 30 30
#> 15 young EUandSpain 48 48
#> 16 young EU 48 48
#> 17 young Europe 72 72
#> 18 young nonEU 24 24
#> 19 young Iceland 24 24
#> 20 young Portugal 27 27
#> 21 young Spain 21 21
# Non-dummy again, but no warning since dummy = FALSE
# Then pre_aggregate is by default set to FALSE (error when TRUE)
# fun with extra argument needed (see ?dummy_aggregate)
# y and y_sum2 are equal
model_aggregate(z, hierarchies = list(age = "All", geo = geo_hier2),
sum_vars = "y",
fun_vars = c(sum2 = "y"),
fun = c(sum2 = function(x, y) sum(x * y)),
dummy = FALSE)
#> [ModelMatrix] [crossprod] [dummy_aggregate] [cbind]
#> age geo y y_sum2
#> 1 All EUandSpain 165 165
#> 2 All EU 114 114
#> 3 All Europe 171 171
#> 4 All nonEU 57 57
#> 5 All Iceland 57 57
#> 6 All Portugal 63 63
#> 7 All Spain 51 51
#> 8 old EUandSpain 96 96
#> 9 old EU 66 66
#> 10 old Europe 99 99
#> 11 old nonEU 33 33
#> 12 old Iceland 33 33
#> 13 old Portugal 36 36
#> 14 old Spain 30 30
#> 15 young EUandSpain 69 69
#> 16 young EU 48 48
#> 17 young Europe 72 72
#> 18 young nonEU 24 24
#> 19 young Iceland 24 24
#> 20 young Portugal 27 27
#> 21 young Spain 21 21
