Introduction

To demonstrate the R package algebraic.dist, we consider the multivariate normal distribution (MVN) and its empirical approximation.

We start by loading the package:

Defining the data generating process

We define the parameters of the data generating process (DGP) with:

# we define the parameters of the MVN
M <- mvn(mu = 1:5, sigma = diag(1:5))
print(M)
#> Multivariate normal distribution with mean:
#> [1] 1 2 3 4 5
#> and variance-covariance matrix:
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    2    0    0    0
#> [3,]    0    0    3    0    0
#> [4,]    0    0    0    4    0
#> [5,]    0    0    0    0    5

When we print out the parameters, we get a large vector that includes both the elements of the mean vector (\(\mu\)) and the elements of the variance-covariance matrix (\(\Sigma\)).

params(M)
#>     mu1     mu2     mu3     mu4     mu5  sigma1  sigma2  sigma3  sigma4  sigma5 
#>       1       2       3       4       5       1       0       0       0       0 
#>  sigma6  sigma7  sigma8  sigma9 sigma10 sigma11 sigma12 sigma13 sigma14 sigma15 
#>       0       2       0       0       0       0       0       3       0       0 
#> sigma16 sigma17 sigma18 sigma19 sigma20 sigma21 sigma22 sigma23 sigma24 sigma25 
#>       0       0       0       4       0       0       0       0       0       5

Each observation is an i.i.d. vector from the MVN. We sample from the MVN with:

# we generate a sample of size n
n <- 10000
# we sample from the MVN
data <- sampler(M)(n)

We have a sample of size \(n=10000\) from the DGP. We show some observations from this sample with:

head(data, n=6)
#>        [,1]   [,2]  [,3] [,4]  [,5]
#> [1,]  1.018  2.875 1.966 1.84 1.920
#> [2,]  2.754  1.049 2.824 2.71 0.694
#> [3,] -0.816  3.614 3.716 2.69 9.839
#> [4,]  0.837  3.573 0.499 2.09 4.872
#> [5,]  1.881  3.709 4.856 8.04 5.531
#> [6,]  0.976 -0.927 1.493 2.67 5.932

Now, we also construct a empirical distribution from the sample with:

emp <- empirical_dist(data)
print(emp)
#> Empirical Distribution 
#> Mean:
#> [1] 1.01 2.02 2.99 4.00 5.00
#> Covariance:
#>         [,1]     [,2]     [,3]   [,4]    [,5]
#> [1,] 1.00960  0.01305  0.00684 0.0220 0.00859
#> [2,] 0.01305  1.94926 -0.00593 0.0461 0.04928
#> [3,] 0.00684 -0.00593  3.10654 0.0139 0.08274
#> [4,] 0.02201  0.04607  0.01393 4.0401 0.05753
#> [5,] 0.00859  0.04928  0.08274 0.0575 4.94639
#> Number of observations: 10000

Since this is the empirical distribution, it is parameter-free:

params(emp)
#> NULL

Let’s show the supports of the empirical distribution and the MVN:

# generate a data frame with the dimension, supremum,
# and infimum of the MVN and empirical distribution
df <- data.frame(supremum.mvn = supremum(sup(M)),
                 supremum.emp = supremum(sup(emp)),
                 infimum.mvn = infimum(sup(M)),
                 infimum.emp = infimum(sup(emp)))
row.names(df) <- paste0("param",1:dim(sup(M)))
print(df)
#>        supremum.mvn supremum.emp infimum.mvn infimum.emp
#> param1          Inf         4.42        -Inf       -3.26
#> param2          Inf         7.54        -Inf       -2.93
#> param3          Inf         9.56        -Inf       -3.56
#> param4          Inf        11.78        -Inf       -3.95
#> param5          Inf        13.35        -Inf       -3.97

Let’s compare the mean and covariance-variance matrices of both the MVN and the empirical distribution of the MVN. First, let’s look at the means.

mean(M); mean(emp)
#> [1] 1 2 3 4 5
#> [1] 1.01 2.02 2.99 4.00 5.00

As expected, pretty close. Now let’s look at the variance-covariance:

vcov(M); vcov(emp)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    2    0    0    0
#> [3,]    0    0    3    0    0
#> [4,]    0    0    0    4    0
#> [5,]    0    0    0    0    5
#>         [,1]     [,2]     [,3]   [,4]    [,5]
#> [1,] 1.00960  0.01305  0.00684 0.0220 0.00859
#> [2,] 0.01305  1.94926 -0.00593 0.0461 0.04928
#> [3,] 0.00684 -0.00593  3.10654 0.0139 0.08274
#> [4,] 0.02201  0.04607  0.01393 4.0401 0.05753
#> [5,] 0.00859  0.04928  0.08274 0.0575 4.94639

The true variances of the population defined by the MVN \(M\) is the diagonal of the variance-covariance matrix:

diag(vcov(M)); diag(vcov(emp))
#> [1] 1 2 3 4 5
#> [1] 1.01 1.95 3.11 4.04 4.95

Let’s compute the variances using the general expectation method:

mu.emp <- mean(emp)
expectation(emp, function(x) (x - mu.emp)^2)
#> [1] 1.01 1.95 3.11 4.04 4.95
expectation(M, function(x) (x - mean(M))^2, control = list(n = n))
#> [1] 0.998 1.969 2.997 3.941 4.926

We see that these are pretty good estimates, as the expectation is actually a Monte Carlo approximation. We can see the CI’s with:

expectation(emp, function(x) (x - mu.emp)^2, control = list(compute_stats = TRUE))
#> $value
#> [1] 1.01 1.95 3.11 4.04 4.95
#> 
#> $ci
#>       [,1] [,2]
#> [1,] 0.981 1.04
#> [2,] 1.895 2.00
#> [3,] 3.019 3.19
#> [4,] 3.926 4.15
#> [5,] 4.808 5.08
#> 
#> $n
#> [1] 10000
expectation(M, function(x) (x - mean(M))^2, control = list(n = n, compute_stats = TRUE))
#> $value
#> [1] 1.00 1.94 2.89 3.97 4.95
#> 
#> $ci
#>       [,1] [,2]
#> [1,] 0.975 1.03
#> [2,] 1.891 2.00
#> [3,] 2.813 2.97
#> [4,] 3.859 4.08
#> [5,] 4.811 5.09
#> 
#> $n
#> [1] 10000

Next, we use the rmap function on the MVN and the empirical distribution to compute the distribution of \((X - E(X))^2\). If we take the mean of this, we should get the variance as shown above:

mu.emp <- mean(emp)
mean(rmap(emp, function(x) (x - mu.emp)^2))
#> [1] 1.01 1.95 3.11 4.04 4.95
mean(rmap(M, function(x) (x - mean(M))^2, n = n))
#> [1] 0.978 2.009 3.052 4.069 4.895

These are both reasonable estimates of the variance.

The PDF of the empirical is not very useful – it gets \(1/n\) for each observation:

x <- sampler(emp)(1)
y <- sampler(M)(1)
data.frame(
  ob = c("empirical(x)", "MVN(y)"),
  pdf.emp = c(density(emp)(x), density(emp)(y)),
  pdf.mvn = c(density(M)(x), density(M)(y)))
#>             ob pdf.emp   pdf.mvn
#> 1 empirical(x)  0.0001 0.0000579
#> 2       MVN(y)  0.0000 0.0001495

Let’s plot the CDF of marginal distribution of \(X_1\):

X1.emp <- marginal(emp, 1)
F1.emp <- cdf(X1.emp)
curve(F1.emp(x), from = infimum(sup(X1.emp)), to = supremum(sup(X1.emp)), col = "blue", lty = 1)
X1 <- marginal(M, 1)
F1 <- cdf(X1)
curve(F1(x), from = infimum(sup(X1.emp)), to = supremum(sup(X1.emp)), add = TRUE, col = "red", lty = 2)
legend("topleft", legend = c("empirical", "MVN"), col = c("blue", "red"), lty = c(1, 2))

Given the sample size, the empirical distribution’s CDF is essentially the same as the MVN’s CDF (they’re right on top of each other). Let’s zoom in much closer so we can ee that the empirical CDF is a step function:

curve(F1.emp(x), from = 1, to = 1.0005, col = "blue", lty = 1, type = "s")

Let’s compute some expectations of \(X_1\):

cat("E(X1) = ", expectation(X1, function(x) x), "( expected ", mean(X1), ")\n",
    "Var(X1) = ", expectation(X1,
                              function(x) {
                                  (x - expectation(X1,
                                                   function(x) x)
                                  )^2 
                              }),
    "( expected ", vcov(X1), ")\n",
    "Skewness(X1) = ", expectation(X1,
                                   function(x) {
                                       (x - expectation(X1, function(x) x))^3 / 
                                       expectation(X1, function(x) x)^3 
                                   }),
    "( expected 0 )\n",
    "E(X^2) = ", expectation(X1, function(x) x^2),
    "( expected ", vcov(X1) + mean(X1)^2, ")")
#> E(X1) =  1 ( expected  1 )
#>  Var(X1) =  1 ( expected  1 )
#>  Skewness(X1) =  -0.000000000000341 ( expected 0 )
#>  E(X^2) =  2 ( expected  2 )

Let’s show a scatter plot of the joint distribution of \(X_2\) and \(X_4\):

dataX2X4.emp <- sampler(marginal(emp, c(2, 4)))(10000)
dataX2X4.mvn <- sampler(marginal(M, c(2, 4)))(10000)

# scatter plot a 2d sample. use xlab and ylab to label the axes
plot(dataX2X4.emp[,1], dataX2X4.emp[,2], pch = 20, col = "blue", xlab = "X2", ylab = "X4")
# overlay in green the MVN data
points(dataX2X4.mvn[,1], dataX2X4.mvn[,2], pch = 20, col = "green")
legend("topright", legend = c("empirical", "MVN"), col = c("blue", "green"), pch = c(20, 20))
title("Scatter plot of X2 and X4")

Let’s look at the MVN’s multivariate CDF:

cdf(M)(c(1,2,3,4,0))
#> [1] 0.000792

Now we show the mean of the conditional distribution, \(X | X_1 + X_2 < 0\):

mean(conditional(emp, function(x) x[1] + x[2] < 0))
#> [1] -0.294 -0.329  3.095  3.996  4.871
mean(conditional(M, function(x) x[1] + x[2] < 0))
#> [1] -0.245 -0.467  2.910  4.133  4.972

I didn’t do the analysis of what this distribution’s mean should theoretically be, if it’s even practical to derive, but it doesn’t look unreasonable. We see that the mean of the first two components are negative, which makes sense: the sum of the first two components is negative, so the mean of the first two components should be negative.

Working with edist Objects

The edist object is a key component of the algebraic.dist package. It represents a distribution that is an algebraic expression of other distributions.

Creating edist Objects

You can create an edist object using the edist function. Here’s an example:

e <- edist(expression(x + y),
           list("x" = normal(mu = 0, var = 1),
                "y" = exponential(rate = 1)))

This creates an edist object representing the distribution of the sum of a normal random variable and an exponential random variable.

Printing edist Objects

You can print an edist object to see its expression and the distributions of its variables:

print(e)
#> Distribution expression(x + y) 
#>    x  ~ Normal distribution with mean 0 and variance 1 
#> 
#>    y  ~ Exponential distribution with failure rate 1

Mean and Variance of edist Objects

You can compute the mean and variance of an edist object using the mean and vcov functions:

mean(e)
#> [1] 0.985
vcov(e)
#> [1] 2.06

Let’s have a look at the parameters of the edist object:

params(e)
#>   x.mu  x.var y.rate 
#>      0      1      1

Sampling from edist Objects

You can generate a sample from an edist object using the sampler method:

s <- sampler(e)
samp <- s(10)  # Generate a sample of size 10
print(samp)
#>  [1] -0.458  2.000 -0.240  1.905  1.736  2.379  3.208  1.627  1.571  0.106

This concludes our overview of the edist object. It’s a powerful tool for working with algebraic expressions of distributions, and we hope you find it useful in your statistical analyses.