Difference and Sum Groups

This function is a wrapper around RowGroups() for the specific case where the input contains two columns. It calls RowGroups() with returnGroups = TRUE, and extends the resulting data frame of unique code combinations with additional information about common groups, difference groups, and sum groups.

Usage

diff_groups(
  x,
  ...,
  hiddenNA = TRUE,
  sep_common = "_=_",
  sep_diff = "_-_",
  sep_sum = c("_=_", "_+_"),
  outputNA = "NA",
  diff_extra = FALSE
)

Arguments

x: A data frame with exactly two columns.
...: Additional arguments passed to RowGroups().
hiddenNA: Logical. When TRUE (default), missing codes (NA) are treated as hidden categories — they are not available for computing difference and sum groups. See Note for details on how this differs from the NAomit parameter in RowGroups().
sep_common: A character string used in the common column to separate codes that are identical across the two input columns.
sep_diff: A character string used in the diff_1_2 and diff_2_1 columns to indicate difference groups. Each string starts with the parent code, followed by one or more child codes from the other column that are subtracted from it.
sep_sum: A character vector of one or two elements used in the sum_1_2 and sum_2_1 columns to describe relationships where a code in one column represents the sum of several codes in the other. The first element (sep_sum[1]) acts as an equality sign, and the second element (sep_sum[2]) acts as a plus sign. If sep_sum has length 1, the same value is used for both positions.
outputNA: Character string used to represent NA values within the newly constructed text strings in the additional output columns. Only relevant when hiddenNA = FALSE.
diff_extra: Logical. When TRUE, additional difference-group variables are returned when found.
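
To make the role of the three separators concrete, here is a minimal base R sketch that builds strings of the kind described above. The codes "Total", "a" and "b" are hypothetical, and the layout simply follows the argument descriptions; the exact strings produced by diff_groups() may differ.

```r
# Default separators, as listed in the Usage section
sep_common <- "_=_"
sep_diff   <- "_-_"
sep_sum    <- c("_=_", "_+_")

# A length-1 sep_sum is used for both positions
sep_sum <- rep_len(sep_sum, 2)

# Hypothetical codes, made up for illustration only
parent   <- "Total"
children <- c("a", "b")

# Identical codes in both columns, joined with sep_common
common_label <- paste("B", "B", sep = sep_common)

# Parent first, then the subtracted child codes, joined with sep_diff
diff_label <- paste(c(parent, children), collapse = sep_diff)

# Parent, an "equals" separator, then the children joined with a "plus" separator
sum_label <- paste(parent, paste(children, collapse = sep_sum[2]), sep = sep_sum[1])
```

With the defaults, the "equals" element of sep_sum and sep_common are the same string, so changing one without the other makes the two kinds of label easier to tell apart.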

Value

A list (as returned by RowGroups()), where the groups data frame is extended with additional descriptive columns indicating common, difference, and sum relationships between the two code columns.

Details

The returned list contains the same elements as the list returned by RowGroups(), but with an extended groups data frame. The additional columns describe relationships between the two input columns as follows:

is_common: TRUE when the two codes on the row are identical.
is_child_1, is_child_2: TRUE when the code in the corresponding column (1 or 2) is a subset or subgroup of a code in the other column.
common: identical code pairs, formatted using sep_common.
diff_1_2, diff_2_1: difference groups. The first element is the parent from the source column, followed by one or more child codes from the opposite column, joined using sep_diff.
sum_1_2, sum_2_1: sum groups where a parent code in one column equals the sum of several codes in the other, formatted using sep_sum.
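
The logical columns can be pictured with a small base R sketch. This is only one reading of the descriptions above, not the package's implementation: a code is treated as a child when it is paired with exactly one code in the other column and that code is paired with several. The data and the helper count_partners() are hypothetical.

```r
# Hypothetical unique code combinations
pairs <- data.frame(
  code_1 = c("A", "A", "B", "C"),
  code_2 = c("a1", "a2", "B", "C"),
  stringsAsFactors = FALSE
)

# Identical codes on the same row
is_common <- pairs$code_1 == pairs$code_2

# Hypothetical helper: for each row, the number of distinct partner codes
# that the row's code is paired with in the other column
count_partners <- function(code, partner) {
  n <- tapply(partner, code, function(v) length(unique(v)))
  as.vector(n[code])
}

partners_of_1 <- count_partners(pairs$code_1, pairs$code_2)
partners_of_2 <- count_partners(pairs$code_2, pairs$code_1)

# Assumed rule: a child has a single parent, and that parent has several children
is_child_1 <- partners_of_1 == 1 & partners_of_2 > 1
is_child_2 <- partners_of_2 == 1 & partners_of_1 > 1
```

In this toy data, "a1" and "a2" come out as children of "A", while "B" and "C" are common codes.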

Note

The parameter NAomit from RowGroups() can still be set via ..., but using it will remove rows containing NA before processing. The relationships found will then reflect the reduced data, which is usually not the intended behaviour when identifying relationships between code sets.
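
The effect of dropping NA rows before processing can be seen without calling the function at all. The sketch below uses complete.cases() as a stand-in for the row removal described above; the data are hypothetical.

```r
# Hypothetical code pairs where "A" is paired with "a1" and with a missing code
pairs <- data.frame(
  code_1 = c("A", "A", "B"),
  code_2 = c("a1", NA, "B"),
  stringsAsFactors = FALSE
)

# Rows per code_1 value in the full data
n_before <- table(pairs$code_1)

# Stand-in for removing rows that contain NA before processing
reduced <- pairs[stats::complete.cases(pairs), ]

# Rows per code_1 value after the removal
n_after <- table(reduced$code_1)
```

After the removal, "A" is paired only with "a1", so the reduced data suggest a one-to-one relationship that the full data do not support.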

Examples
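
A minimal usage sketch, assuming diff_groups() is already available in the session (for example after attaching the package that provides RowGroups()). The input data and the outputNA value "missing" are made up, and no output is shown because it depends on the installed version.

```r
# Hypothetical code pairs; column names and codes are made up
pairs <- data.frame(
  code_1 = c("A", "A", "B", "C"),
  code_2 = c("a1", "a2", "B", NA),
  stringsAsFactors = FALSE
)

# Run only when diff_groups() can be found in the current session
if (exists("diff_groups", mode = "function")) {
  # Default: NA is treated as a hidden category
  res <- diff_groups(pairs)
  print(res$groups)

  # Let NA take part in the constructed strings, written as "missing"
  res_na <- diff_groups(pairs, hiddenNA = FALSE, outputNA = "missing")
  print(res_na$groups)
}
```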