Introduction

The algebraic.dist package provides an algebra over probability distributions. You can combine distribution objects with standard arithmetic operators (+, -, *, /, ^) and mathematical functions (exp, log, sqrt, …), and the package will either simplify the result to a known closed-form distribution or fall back to Monte Carlo sampling when no closed form exists.

This means you can write expressions like normal(0, 1) + normal(2, 3) and get back a normal object, without manually deriving the result. When the package cannot simplify, the result is an edist (expression distribution) that supports all the usual methods – mean, vcov, sampler, cdf, and more – via MC approximation.
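As a quick taste (a sketch, assuming the package is attached; the expected values follow from the simplification rules documented later in this vignette):

```r
library(algebraic.dist)

# Means add and variances add for independent normals, so the
# package should simplify this to Normal(mu = 2, var = 4).
z <- normal(0, 1) + normal(2, 3)
is_normal(z)  # expected: TRUE
mean(z)       # expected: 2
vcov(z)       # expected: 4
```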

Basic Arithmetic

Arithmetic operators on dist objects create edist objects, which are automatically simplified whenever a closed-form rule applies. When no rule matches, the edist is kept as-is.

x <- normal(0, 1)
y <- exponential(1)
z <- x + y

Since the sum of a normal and an exponential has no simple closed form, z remains an edist:

is_edist(z)
#> [1] TRUE
class(z)
#> [1] "+_normal_exponential" "x_normal_exponential" "y_normal_exponential"
#> [4] "edist"                "dist"

The edist still supports all distribution methods. Moments are computed via Monte Carlo:

set.seed(1)
mean(z)
#> [1] 0.9944453
vcov(z)
#> [1] 2.023444

The theoretical mean is 0 + 1 = 1 and variance is 1 + 1 = 2, so the MC estimates should be close.

Automatic Simplification

The real power of the package is automatic simplification. When you compose distributions using operations that have known closed-form results, the package returns the simplified distribution directly – no sampling needed.

Normal algebra

The sum and difference of independent normals is normal. Scalar multiplication and location shifts are also handled.

Sum of normals:

z <- normal(1, 4) + normal(2, 5)
is_normal(z)
#> [1] TRUE
z
#> Normal distribution (mu = 3, var = 9)
mean(z)
#> [1] 3
vcov(z)
#> [1] 9

The result is Normal(mu = 3, var = 9), since means add and variances add for independent normals.

Difference of normals:

z <- normal(1, 4) - normal(2, 5)
z
#> Normal distribution (mu = -1, var = 9)
mean(z)
#> [1] -1
vcov(z)
#> [1] 9

Normal(mu = -1, var = 9) – means subtract, but variances still add.

Scalar multiplication:

z <- 3 * normal(2, 1)
z
#> Normal distribution (mu = 6, var = 9)
mean(z)
#> [1] 6
vcov(z)
#> [1] 9

If X ~ Normal(mu, v), then c * X ~ Normal(c * mu, c^2 * v).

Location shift:

z <- normal(5, 2) + 10
z
#> Normal distribution (mu = 15, var = 2)
mean(z)
#> [1] 15
vcov(z)
#> [1] 2

Adding a constant shifts the mean but preserves the variance.

Exponential and Gamma algebra

Sums of independent exponentials (with the same rate) produce gamma distributions. Gamma distributions with the same rate also combine.

Sum of exponentials:

z <- exponential(2) + exponential(2)
is_gamma_dist(z)
#> [1] TRUE
z
#> Gamma distribution (shape = 2, rate = 2)
params(z)
#> shape  rate 
#>     2     2

Two independent Exp(2) variables sum to Gamma(shape = 2, rate = 2).

Gamma plus exponential:

z <- gamma_dist(3, 2) + exponential(2)
z
#> Gamma distribution (shape = 4, rate = 2)
params(z)
#> shape  rate 
#>     4     2

Gamma(3, 2) + Exp(2) = Gamma(4, 2).

Sum of gammas (same rate):

z <- gamma_dist(3, 2) + gamma_dist(4, 2)
z
#> Gamma distribution (shape = 7, rate = 2)
params(z)
#> shape  rate 
#>     7     2

Gamma(3, 2) + Gamma(4, 2) = Gamma(7, 2) – shapes add when rates match.

Transform rules

Certain mathematical transformations of distributions have known closed-form results.

Exponential of a normal is log-normal:

z <- exp(normal(0, 1))
is_lognormal(z)
#> [1] TRUE
z
#> Log-normal distribution (meanlog = 0, sdlog = 1)

Logarithm of a log-normal is normal:

z <- log(lognormal(0, 1))
is_normal(z)
#> [1] TRUE
z
#> Normal distribution (mu = 0, var = 1)

Standard normal squared is chi-squared:

z <- normal(0, 1)^2
is_chi_squared(z)
#> [1] TRUE
z
#> Chi-squared distribution (df = 1)

This rule applies only when the normal is standard, i.e. mean 0 and variance 1.

Other simplification rules

Chi-squared addition:

z <- chi_squared(3) + chi_squared(5)
is_chi_squared(z)
#> [1] TRUE
z
#> Chi-squared distribution (df = 8)
params(z)
#> df 
#>  8

Degrees of freedom add: Chi-squared(3) + Chi-squared(5) = Chi-squared(8).

Poisson addition:

z <- poisson_dist(2) + poisson_dist(3)
is_poisson_dist(z)
#> [1] TRUE
z
#> Poisson distribution (lambda = 5)
params(z)
#> lambda 
#>      5

Rates add: Poisson(2) + Poisson(3) = Poisson(5).

Product of log-normals:

z <- lognormal(0, 1) * lognormal(1, 2)
is_lognormal(z)
#> [1] TRUE
z
#> Log-normal distribution (meanlog = 1, sdlog = 2.23607)
params(z)
#>  meanlog    sdlog 
#> 1.000000 2.236068

The product of independent log-normals is log-normal because log(X * Y) = log(X) + log(Y), and the sum of independent normals is normal. So the meanlog values add and the sdlog values combine as sqrt(sdlog1^2 + sdlog2^2).
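This parameter combination is easy to sanity-check in base R, without the package (rlnorm parameterizes by meanlog and sdlog):

```r
set.seed(6)
n <- 1e5
x <- rlnorm(n, meanlog = 0, sdlog = 1)
y <- rlnorm(n, meanlog = 1, sdlog = 2)
lp <- log(x * y)
mean(lp)  # should be near 1 (= 0 + 1, the meanlogs add)
sd(lp)    # should be near sqrt(1^2 + 2^2) = 2.236
```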

Uniform scaling and shifting:

z <- 2 * uniform_dist(0, 1)
is_uniform_dist(z)
#> [1] TRUE
z
#> Uniform distribution (min = 0, max = 2)
params(z)
#> min max 
#>   0   2

Scaling a Uniform(0, 1) by 2 gives Uniform(0, 2).

z <- uniform_dist(0, 1) + 5
is_uniform_dist(z)
#> [1] TRUE
z
#> Uniform distribution (min = 5, max = 6)
params(z)
#> min max 
#>   5   6

Shifting Uniform(0, 1) by 5 gives Uniform(5, 6).

When Simplification Cannot Help

When no algebraic rule matches, the result remains an unsimplified edist. All distribution methods still work – they just use Monte Carlo under the hood.

z <- normal(0, 1) * exponential(1)
is_edist(z)
#> [1] TRUE
set.seed(2)
mean(z)
#> [1] 0.02069384

The true mean of X * Y where X ~ Normal(0, 1) and Y ~ Exp(1) is E[X] * E[Y] = 0 * 1 = 0 (by independence), so the MC estimate should be near zero.
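A one-line base-R cross-check of the independence argument:

```r
set.seed(7)
n <- 1e5
# E[XY] = E[X] * E[Y] = 0 for independent X ~ Normal(0, 1), Y ~ Exp(1);
# the MC standard error is sqrt(Var(XY) / n) = sqrt(2 / n), about 0.0045 here.
mean(rnorm(n) * rexp(n))  # should be near 0
```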

Realization and MC Fallback

You can explicitly convert any distribution to an empirical distribution via realize(). This draws n samples and wraps them in an empirical_dist that supports exact (discrete) methods for CDF, density, conditional, and more.

set.seed(3)
z <- normal(0, 1) * exponential(1)
rd <- realize(z, n = 10000)
rd
#> Realized distribution (10000 samples from: Expression distribution: x * y) 
#>   source:
#> Expression distribution: x * y 
#>    x  ~  Normal distribution (mu = 0, var = 1) 
#>    y  ~  Exponential distribution (rate = 1)
mean(rd)
#> [1] -0.003321398
vcov(rd)
#>          [,1]
#> [1,] 2.048537

For edist objects, methods like cdf(), density(), and inv_cdf() automatically materialize samples behind the scenes (with caching, so repeated calls share the same sample). You do not need to call realize() manually for these:

set.seed(4)
z <- normal(0, 1) + exponential(1)
Fz <- cdf(z)
Fz(1)
#> [1] 0.537
Fz(2)
#> [1] 0.7914

Min, Max, and Summary Operations

The Summary group generic is implemented for distributions. The sum() and prod() generics reduce via + and *, so they benefit from all the simplification rules. The min() and max() generics create elementwise edist expressions.

Min of exponentials has a special rule – the minimum of independent exponentials is exponential with the sum of rates:

z <- min(exponential(1), exponential(2))
is_exponential(z)
#> [1] TRUE
z
#> Exponential distribution (rate = 3)
params(z)
#> rate 
#>    3

This is the well-known result: if X ~ Exp(r1) and Y ~ Exp(r2), then min(X, Y) ~ Exp(r1 + r2).
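The rule can be verified directly in base R by checking elementwise minima against the properties of Exp(3):

```r
set.seed(8)
n <- 1e5
m <- pmin(rexp(n, rate = 1), rexp(n, rate = 2))
mean(m)        # should be near 1/3, the mean of Exp(3)
mean(m > 0.5)  # should be near exp(-3 * 0.5) = 0.223, the Exp(3) survival at 0.5
```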

Sum reduces via the + operator, so simplification rules apply:

z <- sum(normal(0, 1), normal(2, 3), normal(-1, 2))
is_normal(z)
#> [1] TRUE
z
#> Normal distribution (mu = 1, var = 6)
mean(z)
#> [1] 1
vcov(z)
#> [1] 6

Three normals sum to a single normal: mean = 0 + 2 - 1 = 1, var = 1 + 3 + 2 = 6.

Max and other combinations that lack closed forms remain as edist:

set.seed(5)
z <- max(exponential(1), exponential(2))
is_edist(z)
#> [1] TRUE
mean(z)
#> [1] 1.170383

Limiting Distributions

The package provides functions for common asymptotic results from probability theory.

Central Limit Theorem

clt(base_dist) returns the limiting distribution of the standardized sample mean, sqrt(n) * (Xbar_n - mu). For a univariate distribution with variance sigma^2, this is Normal(0, sigma^2).

z <- clt(exponential(1))
z
#> Normal distribution (mu = 0, var = 1)
mean(z)
#> [1] 0
vcov(z)
#> [1] 1

For Exp(1), the mean is 1 and the variance is 1, so the CLT limit is Normal(0, 1).

Law of Large Numbers

lln(base_dist) returns the degenerate distribution at the population mean – the limit of Xbar_n as n -> infinity. It is represented as a normal with zero variance.

d <- lln(exponential(1))
d
#> Normal distribution (mu = 1, var = 0)
mean(d)
#> [1] 1
vcov(d)
#> [1] 0

Delta Method

delta_clt(base_dist, g, dg) returns the limiting distribution of sqrt(n) * (g(Xbar_n) - g(mu)) under the delta method. You supply the function g and its derivative dg.

z <- delta_clt(exponential(1), g = exp, dg = exp)
z
#> Normal distribution (mu = 0, var = 7.38906)
mean(z)
#> [1] 0
vcov(z)
#> [1] 7.389056

For g = exp applied to Exp(1) (mean = 1, var = 1), the delta method gives Normal(0, (exp(1))^2 * 1) = Normal(0, e^2).
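A small base-R simulation (independent of the package) illustrates this limit; the replication counts here are arbitrary choices:

```r
set.seed(9)
n <- 1e4
# Replicated values of sqrt(n) * (g(Xbar_n) - g(mu)) for g = exp, base Exp(1)
reps <- replicate(2000, {
  x <- rexp(n, rate = 1)
  sqrt(n) * (exp(mean(x)) - exp(1))
})
mean(reps)  # should be near 0
var(reps)   # should be near exp(1)^2 = 7.389, the delta-method variance
```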

Moment-Matching Normal Approximation

normal_approx(x) constructs a normal distribution that matches the mean and variance of any input distribution:

g <- gamma_dist(5, 2)
n <- normal_approx(g)
n
#> Normal distribution (mu = 2.5, var = 1.25)
mean(n)
#> [1] 2.5
vcov(n)
#> [1] 1.25

The Gamma(5, 2) has mean 2.5 and variance 1.25, so the normal approximation is Normal(2.5, 1.25).

Empirical CLT Convergence

To see the CLT in action, we can draw replicated standardized sample means at increasing sample sizes and watch them converge to the predicted distribution:

set.seed(42)
base <- exponential(rate = 1)
ns <- c(10, 100, 1000)
for (n in ns) {
  samp_means <- replicate(5000, {
    x <- sampler(base)(n)
    sqrt(n) * (mean(x) - mean(base))
  })
  cat(sprintf("n=%4d: mean=%.3f, var=%.3f\n",
              n, mean(samp_means), var(samp_means)))
}
#> n=  10: mean=-0.006, var=1.002
#> n= 100: mean=-0.003, var=1.022
#> n=1000: mean=0.007, var=0.995

Compare with the CLT prediction:

z <- clt(base)
cat(sprintf("CLT:    mean=%.3f, var=%.3f\n", mean(z), vcov(z)))
#> CLT:    mean=0.000, var=1.000

As n grows, the empirical mean converges to 0 and the empirical variance converges to 1, matching the CLT limit of Normal(0, 1).