The hypothesize package provides a consistent API for hypothesis testing in R. It is designed around three principles from Structure and Interpretation of Computer Programs (SICP):

- Data abstraction: tests are manipulated only through a small, uniform interface of accessors.
- The closure property: combining tests yields another test.
- Higher-order functions: operations that take tests and return transformed tests.
This vignette introduces the package through examples, building from simple primitives to more sophisticated compositions.
Every hypothesis test in this package implements the same interface:
- pval(x) — extract the p-value
- test_stat(x) — extract the test statistic
- dof(x) — extract the degrees of freedom
- is_significant_at(x, alpha) — check significance at level α

This abstraction means you can work with any test the same way, regardless of its internal implementation.
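Because every test answers to the same accessors, code written against this interface works for any test. For example, a small reporting helper (a sketch for illustration, not part of the package) needs nothing test-specific:

# Illustrative helper built purely on the generic accessors
# (not part of hypothesize)
summarize_test <- function(x, alpha = 0.05) {
  sprintf("stat = %.3f, p = %.4f, significant at %g: %s",
          test_stat(x), pval(x), alpha, is_significant_at(x, alpha))
}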
The z-test is the most basic hypothesis test. It tests whether a population mean equals a hypothesized value when the population standard deviation is known:
# A manufacturer claims widgets weigh 100g on average
# We test 30 widgets (population SD is known to be 5g from historical data)
set.seed(42)
weights <- rnorm(30, mean = 102, sd = 5)
# Test H0: mu = 100 vs H1: mu != 100
z <- z_test(weights, mu0 = 100, sigma = 5)
z
#> Hypothesis test ( z_test )
#> -----------------------------
#> Test statistic: 2.566556
#> P-value: 0.01027141
#> Degrees of freedom: Inf
#> Significant at 5% level: TRUE

We can extract components:
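pval(z)
#> [1] 0.01027141
test_stat(z)
#> [1] 2.566556
dof(z)
#> [1] Inf

The statistic is the textbook z-score; assuming z_test() follows the standard formula, it can be reproduced by hand:

# (xbar - mu0) / (sigma / sqrt(n))
(mean(weights) - 100) / (5 / sqrt(30))
#> [1] 2.566556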
The Wald test generalizes the z-test to any asymptotically normal estimator. If you have an estimate and its standard error, you can test whether the true parameter equals a hypothesized value:
# Suppose we fit a regression and got coefficient = 2.3 with SE = 0.8
# Test H0: beta = 0 (no effect) vs H1: beta != 0
w <- wald_test(estimate = 2.3, se = 0.8, null_value = 0)
w
#> Hypothesis test ( wald_test )
#> -----------------------------
#> Test statistic: 8.265625
#> P-value: 0.004040275
#> Degrees of freedom: 1
#> Significant at 5% level: TRUE
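The p-value is the upper tail of a chi-squared distribution with 1 degree of freedom evaluated at the Wald statistic; a quick check by hand (base R, not package code) reproduces it:

pchisq(((2.3 - 0) / 0.8)^2, df = 1, lower.tail = FALSE)
#> [1] 0.004040275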
is_significant_at(w, 0.05)
#> [1] TRUE

The Wald statistic is z², which follows a chi-squared distribution with 1 df:
w$z # The z-score
#> [1] 2.875
test_stat(w) # z² (the Wald statistic)
#> [1] 8.265625

The likelihood ratio test (LRT) compares nested models by examining the ratio of their maximized likelihoods:
# Suppose we fit two models:
# Null model (simpler): log-likelihood = -250
# Alt model (complex): log-likelihood = -240
# The alt model has 3 extra parameters
test <- lrt(null_loglik = -250, alt_loglik = -240, dof = 3)
test
#> Hypothesis test ( likelihood_ratio_test )
#> -----------------------------
#> Test statistic: 20
#> P-value: 0.0001697424
#> Degrees of freedom: 3
#> Significant at 5% level: TRUE
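The statistic is twice the gap in log-likelihoods, referred to a chi-squared distribution with as many degrees of freedom as extra parameters; both numbers reproduce by hand:

2 * ((-240) - (-250)) # the statistic
#> [1] 20
pchisq(20, df = 3, lower.tail = FALSE) # the p-value
#> [1] 0.0001697424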
# Is the more complex model justified?
is_significant_at(test, 0.05)
#> [1] TRUE

Hypothesis tests and confidence intervals are two sides of the same coin. A 95% confidence interval contains exactly those parameter values that would not be rejected at the 5% level.
For tests that store their estimates, we can extract confidence intervals:
w <- wald_test(estimate = 5.2, se = 1.1)
confint(w) # 95% CI
#> lower upper
#> 3.04404 7.35596
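For a Wald-type test the interval is simply the estimate plus or minus a normal quantile times the standard error; the default 95% bounds reproduce by hand:

5.2 + c(-1, 1) * qnorm(0.975) * 1.1
#> [1] 3.04404 7.35596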
confint(w, level = 0.99) # 99% CI
#> lower upper
#> 2.366588 8.033412

Notice the duality: the CI contains 0 if and only if the test is not significant:
# Is 0 in the 95% CI?
ci <- confint(w)
ci["lower"] < 0 && 0 < ci["upper"]
#> [1] FALSE
# Is the test significant at 5%?
is_significant_at(w, 0.05)
#> [1] TRUE

A key principle from SICP is the closure property: when you combine things, you get back the same kind of thing. In our case, combining hypothesis tests yields a hypothesis test.
Fisher’s method combines p-values from independent tests:
# Three independent studies test the same hypothesis
# None is individually significant, but together...
p1 <- 0.08
p2 <- 0.12
p3 <- 0.06
combined <- fisher_combine(p1, p2, p3)
combined
#> Hypothesis test ( fisher_combined_test )
#> -----------------------------
#> Test statistic: 14.91881
#> P-value: 0.02089771
#> Degrees of freedom: 6
#> Significant at 5% level: TRUE
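Fisher's statistic is -2 times the sum of the log p-values, referred to a chi-squared distribution with 2k degrees of freedom (k = 3 tests here, hence 6 df); checking by hand:

p <- c(0.08, 0.12, 0.06)
-2 * sum(log(p)) # the statistic
#> [1] 14.91881
pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
#> [1] 0.02089771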
is_significant_at(combined, 0.05)
#> [1] TRUE

You can also combine actual test objects:
t1 <- wald_test(estimate = 1.2, se = 0.8)
t2 <- wald_test(estimate = 0.9, se = 0.6)
t3 <- wald_test(estimate = 1.5, se = 1.0)
fisher_combine(t1, t2, t3)
#> Hypothesis test ( fisher_combined_test )
#> -----------------------------
#> Test statistic: 12.07678
#> P-value: 0.06027762
#> Degrees of freedom: 6
#> Significant at 5% level: FALSE

Because the result is itself a hypothesis_test, you can use all the same methods on it:
combined <- fisher_combine(0.05, 0.10, 0.15)
pval(combined)
#> [1] 0.02556195
test_stat(combined)
#> [1] 14.39087
dof(combined)
#> [1] 6

When performing multiple tests, we need to adjust p-values to control error rates. The adjust_pval() function transforms tests by adjusting their p-values:
# Perform 5 tests
tests <- list(
  wald_test(estimate = 2.5, se = 1.0),
  wald_test(estimate = 1.8, se = 0.9),
  wald_test(estimate = 1.2, se = 0.7),
  wald_test(estimate = 0.9, se = 0.8),
  wald_test(estimate = 2.1, se = 1.1)
)
# Original p-values
sapply(tests, pval)
#> [1] 0.01241933 0.04550026 0.08647627 0.26058903 0.05625037
# Bonferroni-adjusted p-values
adjusted <- adjust_pval(tests, method = "bonferroni")
sapply(adjusted, pval)
#> [1] 0.06209665 0.22750132 0.43238133 1.00000000 0.28125183
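Bonferroni multiplies each p-value by the number of tests (capped at 1); as a cross-check, base R's p.adjust() gives the same numbers:

p.adjust(sapply(tests, pval), method = "bonferroni")
#> [1] 0.06209665 0.22750132 0.43238133 1.00000000 0.28125183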
# Less conservative: Benjamini-Hochberg (FDR control)
adjusted_bh <- adjust_pval(tests, method = "BH")
sapply(adjusted_bh, pval)
#> [1] 0.06209665 0.09375061 0.10809533 0.26058903 0.09375061

The adjusted tests retain all original properties:
adj <- adjusted[[1]]
test_stat(adj) # Same as original
#> [1] 6.25
adj$original_pval # Can access unadjusted p-value
#> [1] 0.01241933
adj$adjustment_method # Method used
#> [1] "bonferroni"Because adjusted tests are still hypothesis tests, they compose naturally:
# First adjust, then combine
adjusted <- adjust_pval(tests[1:3], method = "bonferroni")
fisher_combine(adjusted[[1]], adjusted[[2]], adjusted[[3]])
#> Hypothesis test ( fisher_combined_test )
#> -----------------------------
#> Test statistic: 13.26117
#> P-value: 0.03907096
#> Degrees of freedom: 6
#> Significant at 5% level: TRUE

The hypothesis_test() constructor makes it easy to add new test types:
# Example: Create a simple chi-squared goodness-of-fit test wrapper
chisq_gof <- function(observed, expected) {
  # Pearson's chi-squared statistic
  stat <- sum((observed - expected)^2 / expected)
  df <- length(observed) - 1
  p.value <- pchisq(stat, df = df, lower.tail = FALSE)
  # Wrap everything in the package's base constructor
  hypothesis_test(
    stat = stat,
    p.value = p.value,
    dof = df,
    superclasses = "chisq_gof_test",
    observed = observed,
    expected = expected
  )
}
# Use it
result <- chisq_gof(
  observed = c(45, 35, 20),
  expected = c(40, 40, 20)
)
result
#> Hypothesis test ( chisq_gof_test )
#> -----------------------------
#> Test statistic: 1.25
#> P-value: 0.5352614
#> Degrees of freedom: 2
#> Significant at 5% level: FALSE
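Because the result is a genuine hypothesis_test, it also composes with everything shown earlier; for example, it can be passed straight to fisher_combine() alongside any other test:

# The custom test composes like a built-in one
fisher_combine(result, wald_test(estimate = 1.2, se = 0.8))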
# It works with all the standard functions
pval(result)
#> [1] 0.5352614
is_significant_at(result, 0.05)
#> [1] FALSE

The hypothesize package demonstrates that a small set of well-designed primitives and combinators can provide a powerful, extensible framework for hypothesis testing:
| Function | Purpose | SICP Principle |
|---|---|---|
| hypothesis_test() | Base constructor | Data abstraction |
| z_test(), wald_test(), lrt() | Primitive tests | Building blocks |
| pval(), test_stat(), dof() | Accessors | Abstraction barrier |
| fisher_combine() | Combine tests | Closure property |
| adjust_pval() | Transform tests | Higher-order functions |
| confint() | Extract CI | Duality |
This design makes hypothesis testing composable, extensible, and easy to reason about.