Skip to contents

Calculates the standardized mean difference for a variable directly from a data frame, automatically detecting whether the variable is continuous or categorical.

Usage

calculate_smd_from_data(
  data,
  var,
  trt_var,
  ref_group = NULL,
  method = c("auto", "cohens_d", "hedges_g", "arcsine", "logit", "raw"),
  conf_level = 0.95,
  continuous_threshold = 10
)

Arguments

data

A data frame containing the analysis data.

var

Character. Name of the variable to calculate SMD for.

trt_var

Character. Name of the treatment group variable.

ref_group

Character or NULL. Value of the reference (control) group. If NULL, uses the first level of the treatment variable.

method

Character. Method for SMD calculation:

  • "cohens_d": Cohen's d for continuous variables

  • "hedges_g": Hedges' g (bias-corrected) for continuous variables

  • "arcsine": Arcsine transformation for binary/categorical

  • "logit": Logit transformation for binary/categorical variables

  • "raw": Raw proportions/means without transformation

  • "auto" (default): Automatically selects based on variable type

conf_level

Numeric. Confidence level for CI (default: 0.95)

continuous_threshold

Integer. Minimum number of unique values to treat numeric variables as continuous (default: 10). Used only when method = "auto".

Value

A named list with components:

  • smd: The standardized mean difference. For multi-level categorical variables, returns the SMD with the maximum absolute value, preserving sign.

  • ci_lower: Lower bound of confidence interval

  • ci_upper: Upper bound of confidence interval

  • method: Method used

  • var_type: Detected variable type ("continuous" or "categorical")

  • se: Standard error of the SMD

Details

When method = "auto":

  • Numeric variables with > continuous_threshold unique values are treated as continuous (using Cohen's d)

  • Numeric variables with <= continuous_threshold unique values are treated as categorical

  • Character/factor variables are treated as categorical (using arcsine)

For categorical variables with more than 2 levels, the function calculates the maximum absolute SMD across all pairwise level comparisons.

Method-specific considerations:

  • "logit": Useful for binary variables but requires boundary handling for proportions at 0 or 1 (adds 0.5/N continuity correction). Results are on the logit scale; back-transformation is not straightforward.

  • "raw": Appropriate when no transformation is desired. Calculates SMD directly from raw proportions for binary variables, standardized by the pooled standard deviation of the binary variable.

Examples

if (FALSE) { # \dontrun{
# Create example data
adsl <- data.frame(
  AGE = c(rnorm(100, 55, 12), rnorm(100, 54, 11)),
  SEX = c(sample(c("M", "F"), 100, replace = TRUE, prob = c(0.4, 0.6)),
          sample(c("M", "F"), 100, replace = TRUE, prob = c(0.45, 0.55))),
  TRT01P = rep(c("Treatment", "Placebo"), each = 100)
)

# Continuous variable
calculate_smd_from_data(adsl, "AGE", "TRT01P", ref_group = "Placebo")

# Categorical variable
calculate_smd_from_data(adsl, "SEX", "TRT01P", ref_group = "Placebo")
} # }