active library

md.tools

Masked data tools

Started 2022 R

R package: `md.tools`

A miscellaneous set of tools for working with masked data and common features of masked data. The tool set takes inspiration from functional programming, with inputs and outputs defined over masked data frames of type tbl_md (or just data frames), making it consistent with the tidyverse way of doing things.

We provide a set of simple functions on masked data frames, which may be used to compose more complicated functions, particularly when using the pipe operator %>%.

Installation

You can install the development version of md.tools from GitHub with:

# install.packages("devtools")
devtools::install_github("queelius/md.tools")

We load the libraries md.tools and tidyverse with:

library(tidyverse)
library(md.tools)

Matrices

Much of md.tools is devoted to working with matrices encoded in the columns of data frames. We could store a matrix directly in a single column, but we prefer to work with columns defined over primitive types such as Boolean.

Consider the Boolean matrix C of size 5-by-3:

n <- 5
m <- 3
C <- matrix(sample(c(TRUE, FALSE), size = m * n, replace = TRUE), nrow = n)

We may represent this in a data frame of 5 rows with the columns c1, c2, and c3 with:

md <- md_encode_matrix(C,"c")
print(md)
#> # A tibble: 5 × 3
#>   c1    c2    c3   
#>   <lgl> <lgl> <lgl>
#> 1 FALSE TRUE  TRUE 
#> 2 TRUE  TRUE  TRUE 
#> 3 TRUE  TRUE  TRUE 
#> 4 FALSE FALSE FALSE
#> 5 FALSE FALSE FALSE

We may also decode a matrix stored in a data frame with:

print(all(C == md_decode_matrix(md,"c")))
#> [1] TRUE
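
To make the encoding concrete, here is a minimal base-R sketch of the same idea: one column per matrix column, named by a prefix plus the column index. The helpers `encode_matrix_sketch` and `decode_matrix_sketch` are hypothetical names used only for illustration; they are not the md.tools implementation and may differ from it in details such as the returned class.

```r
# Hypothetical helpers for illustration; not the md.tools implementation.
encode_matrix_sketch <- function(mat, prefix) {
  df <- as.data.frame(mat)
  # Name the columns prefix1, prefix2, ..., one per matrix column.
  names(df) <- paste0(prefix, seq_len(ncol(mat)))
  df
}

decode_matrix_sketch <- function(df, prefix) {
  # Pick out columns named prefix<integer> and order them by that integer.
  pattern <- paste0("^", prefix, "[0-9]+$")
  cols <- grep(pattern, names(df), value = TRUE)
  idx <- as.integer(sub(prefix, "", cols, fixed = TRUE))
  mat <- as.matrix(df[, cols[order(idx)], drop = FALSE])
  dimnames(mat) <- NULL
  mat
}

C_demo <- matrix(c(FALSE, TRUE, TRUE,
                   TRUE,  TRUE, TRUE), nrow = 2, byrow = TRUE)
df_demo <- encode_matrix_sketch(C_demo, "c")
round_trip <- decode_matrix_sketch(df_demo, "c")
```

Matching on the full pattern `prefix<integer>` means that an unrelated column whose name merely starts with the prefix (for example, `contains`) is not mistaken for part of the matrix.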

We may want to work with a Boolean matrix as a list. The function md_boolean_matrix_to_list uses the following transformation:

Given an $n$-by-$m$ Boolean matrix, if the $(j,k)$-th element is TRUE, then the $j$-th vector in the list contains the integer $k$.
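
The rule above can be sketched in a few lines of base R. The helper name `boolean_matrix_to_list_sketch` is hypothetical and only illustrates the stated transformation; the package function may handle inputs and outputs differently.

```r
# Hypothetical helper illustrating the stated rule; not the package's code.
boolean_matrix_to_list_sketch <- function(mat) {
  # For row j, which() returns the column indices k where mat[j, k] is TRUE.
  lapply(seq_len(nrow(mat)), function(j) which(mat[j, ]))
}

C_demo <- matrix(c(FALSE, TRUE,  TRUE,
                   TRUE,  TRUE,  TRUE,
                   FALSE, FALSE, FALSE), nrow = 3, byrow = TRUE)
sets <- boolean_matrix_to_list_sketch(C_demo)
```

A row with no TRUE entries maps to an empty integer vector, which corresponds to an empty candidate set.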

We can display md with each candidate set rendered as a set of integers with:

md %>% md_boolean_matrix_to_charsets("c", "candidate set")
#> # A tibble: 5 × 4
#>   c1    c2    c3    `candidate set`
#>   <lgl> <lgl> <lgl> <chr>          
#> 1 FALSE TRUE  TRUE  {2, 3}         
#> 2 TRUE  TRUE  TRUE  {1, 2, 3}      
#> 3 TRUE  TRUE  TRUE  {1, 2, 3}      
#> 4 FALSE FALSE FALSE {}             
#> 5 FALSE FALSE FALSE {}

We allow converting among three representations: lists of integer vectors, Boolean matrices, and charsets (sets represented as strings, e.g., “{1, 2}”). Each conversion has an inverse; for example, the inverse of md_boolean_matrix_to_list is md_list_to_boolean_matrix.
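
As a rough illustration of how these representations relate, the following base-R sketch converts a list of integer vectors to charset strings, back again, and to a Boolean matrix. All three helper names are hypothetical, and the string format is assumed to match the printed output shown earlier.

```r
# Hypothetical helpers; the md.tools functions may differ in names and details.
list_to_charsets_sketch <- function(sets) {
  vapply(sets,
         function(s) paste0("{", paste(s, collapse = ", "), "}"),
         character(1))
}

charsets_to_list_sketch <- function(x) {
  # Drop braces and spaces, then split the remaining text on commas.
  inner <- gsub("[{} ]", "", x)
  lapply(strsplit(inner, ",", fixed = TRUE), as.integer)
}

list_to_boolean_matrix_sketch <- function(sets, m) {
  # Row j has TRUE in column k exactly when k is in the j-th set (assumes m > 1).
  t(vapply(sets, function(s) seq_len(m) %in% s, logical(m)))
}

sets <- list(c(2L, 3L), 1:3, integer(0))
chars <- list_to_charsets_sketch(sets)
sets_again <- charsets_to_list_sketch(chars)
mat <- list_to_boolean_matrix_sketch(sets, m = 3)
```

Note that the number of columns `m` must be supplied when building the matrix, since a list of sets does not by itself record how many components exist.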

Decorators

We now consider some data frame transformations that add columns containing information that may be inferred from what is already in the data frame. For this reason, we have chosen to call them decorators.

In a masked data frame, we may have a column k that stores the failed component. We simulate failed components and mark them as latent with:

md <- md %>%
  mutate(k=sample(1:m,n,replace=TRUE)) %>%
  md_mark_latent("k")
print(md)
#> Latent variables:  k 
#> # A tibble: 5 × 4
#>   c1    c2    c3        k
#>   <lgl> <lgl> <lgl> <int>
#> 1 FALSE TRUE  TRUE      1
#> 2 TRUE  TRUE  TRUE      1
#> 3 TRUE  TRUE  TRUE      2
#> 4 FALSE FALSE FALSE     3
#> 5 FALSE FALSE FALSE     3

We may additionally have a candidate set encoded by the Boolean columns c1,…,cm, in which case we may infer whether the candidate set contains the failed component k with:

md <- md %>% md_set_contains("c", "k", "contains")
print(md)
#> Latent variables:  k 
#> # A tibble: 5 × 5
#>   c1    c2    c3        k contains
#>   <lgl> <lgl> <lgl> <int> <lgl>   
#> 1 FALSE TRUE  TRUE      1 FALSE   
#> 2 TRUE  TRUE  TRUE      1 TRUE    
#> 3 TRUE  TRUE  TRUE      2 TRUE    
#> 4 FALSE FALSE FALSE     3 FALSE   
#> 5 FALSE FALSE FALSE     3 FALSE

We see that there is a new column, contains, that tells us whether the candidate set actually contains the failed component. This column gives no new information; it only presents information that is already there in a potentially more convenient format.
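
To see why this column is purely derived, the membership test can be sketched directly in base R: for each row j, look up entry (j, k_j) of the Boolean candidate matrix. This is an illustrative sketch on a toy matrix, not the implementation of md_set_contains.

```r
# Toy candidate matrix and failed-component indices (illustrative values).
C_demo <- matrix(c(FALSE, TRUE,  TRUE,
                   TRUE,  TRUE,  TRUE,
                   FALSE, FALSE, FALSE), nrow = 3, byrow = TRUE)
k_demo <- c(1L, 2L, 3L)

# Indexing with a two-column matrix selects element (row, column) pairwise.
contains_demo <- C_demo[cbind(seq_along(k_demo), k_demo)]
```

Here the first row has k = 1 but c1 is FALSE, so the candidate set does not contain the failed component, while the second row does.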

Given the same data frame and candidate set, we may determine the cardinality of the candidate sets with:

md <- md %>% md_set_size("c")
print(md)
#> Latent variables:  k 
#> # A tibble: 5 × 6
#>   c1    c2    c3        k contains size_c
#>   <lgl> <lgl> <lgl> <int> <lgl>     <int>
#> 1 FALSE TRUE  TRUE      1 FALSE         2
#> 2 TRUE  TRUE  TRUE      1 TRUE          3
#> 3 TRUE  TRUE  TRUE      2 TRUE          3
#> 4 FALSE FALSE FALSE     3 FALSE         0
#> 5 FALSE FALSE FALSE     3 FALSE         0
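
The cardinality is likewise a simple function of the Boolean columns: it is the number of TRUE entries in each row. A minimal base-R sketch on a toy matrix (not the implementation of md_set_size):

```r
C_demo <- matrix(c(FALSE, TRUE,  TRUE,
                   TRUE,  TRUE,  TRUE,
                   FALSE, FALSE, FALSE), nrow = 3, byrow = TRUE)

# rowSums() counts TRUE as 1 and FALSE as 0; convert the result to integer.
size_demo <- as.integer(rowSums(C_demo))
```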

We may unmark a column variable as latent with:

md <- md %>% md_unmark_latent("k")
print(md)
#> # A tibble: 5 × 6
#>   c1    c2    c3        k contains size_c
#>   <lgl> <lgl> <lgl> <int> <lgl>     <int>
#> 1 FALSE TRUE  TRUE      1 FALSE         2
#> 2 TRUE  TRUE  TRUE      1 TRUE          3
#> 3 TRUE  TRUE  TRUE      2 TRUE          3
#> 4 FALSE FALSE FALSE     3 FALSE         0
#> 5 FALSE FALSE FALSE     3 FALSE         0

The latent variable specification is metadata about the masked data frame, but it does not necessarily impose any requirements on algorithms applied to it.

More generally, a masked data frame may carry much more metadata, and we provide some tools for working with it. However, for the most part, you are expected to handle the metadata yourself. The metadata is stored in the attributes, so under the hood a masked data frame is just a data frame and may be treated as one.
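
Because the metadata lives in attributes, it can be inspected and modified with base R's `attr()`. The sketch below assumes the latent variables are recorded in an attribute named `latent`; how md_mark_latent and md_unmark_latent manage that attribute internally is an assumption here, not something this example verifies.

```r
df <- data.frame(c1 = c(TRUE, FALSE), k = c(1L, 2L))

# Assumption: latent variable names are kept in an attribute called "latent".
attr(df, "latent") <- "k"
latent_marked <- attr(df, "latent")

# The object is still an ordinary data frame and can be used as one.
still_df <- is.data.frame(df)
mean_k <- mean(df$k)

# Removing "k" from the attribute mimics unmarking it as latent.
attr(df, "latent") <- setdiff(attr(df, "latent"), "k")
latent_unmarked <- attr(df, "latent")
```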

Metadata

To read and write data frames for sharing with others, including yourself, we prefer to work with plaintext files like CSV files, where each row corresponds to some set of measurements of some experimental unit.

However, we may also want to store metadata about the experiment that generated the data, or we may wish to store more information about the experimental units that does not naturally fit into the data frame model.

To store metadata, we take the general approach of storing JSON (JavaScript Object Notation) in the comments of the tabular data file (such as a CSV file), where a comment is, by default, anything after the # character on a line.

data <- md_read_csv_with_meta("./data-raw/exp_series_md_1.csv")
#> Rows: 1000 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (5): t, k, t1, t2, t3
#> lgl (3): x1, x2, x3
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(data)
#> Latent variables:  k t1 t2 t3 
#> # A tibble: 1,000 × 8
#>          t     k      t1      t2     t3 x1    x2    x3   
#>      <dbl> <dbl>   <dbl>   <dbl>  <dbl> <lgl> <lgl> <lgl>
#>  1 0.144       2 0.281   0.144   0.266  TRUE  TRUE  FALSE
#>  2 0.0105      1 0.0105  0.0141  0.0633 TRUE  FALSE TRUE 
#>  3 0.0363      2 0.105   0.0363  0.545  TRUE  TRUE  FALSE
#>  4 0.00972     1 0.00972 0.251   0.0960 TRUE  FALSE TRUE 
#>  5 0.0377      3 0.0937  0.0943  0.0377 TRUE  FALSE TRUE 
#>  6 0.0958      3 0.283   0.391   0.0958 FALSE TRUE  TRUE 
#>  7 0.169       3 0.197   1.01    0.169  FALSE TRUE  TRUE 
#>  8 0.270       3 0.322   0.371   0.270  FALSE TRUE  TRUE 
#>  9 0.299       3 0.390   0.401   0.299  TRUE  FALSE TRUE 
#> 10 0.00794     2 0.524   0.00794 0.120  FALSE TRUE  TRUE 
#> # ℹ 990 more rows
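
The idea of carrying JSON in CSV comments can be sketched with base R alone. The file layout below (a single `#` line holding a JSON object, followed by the table) is an assumed toy format, and the steps shown are not the implementation of md_read_csv_with_meta. JSON parsing uses `jsonlite::fromJSON` only if that package happens to be installed.

```r
# Toy file contents; the layout is an assumption for illustration.
lines <- c(
  '# {"comment": "toy example", "latent": ["k"]}',
  "t,k,x1",
  "0.5,1,TRUE",
  "0.2,2,FALSE"
)

# Collect the comment lines and strip the leading '#' to recover the JSON text.
is_comment <- grepl("^\\s*#", lines)
json_text <- paste(sub("^\\s*#\\s?", "", lines[is_comment]), collapse = "\n")

# Read the tabular part, telling read.csv to skip anything after '#'.
tbl <- read.csv(text = paste(lines, collapse = "\n"), comment.char = "#")

# Optionally parse the JSON and attach each field as an attribute.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  meta <- jsonlite::fromJSON(json_text)
  for (nm in names(meta)) attr(tbl, nm) <- meta[[nm]]
}
```

Keeping the metadata in comments means that tools unaware of the convention can still read the file as plain CSV, provided they are told to ignore comment lines.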

We may view all of the metadata stored in data with:

meta <- attributes(data)
meta["row.names"] <- NULL
print(meta)
#> $class
#> [1] "tbl_md"     "tbl_df"     "tbl"        "data.frame"
#> 
#> $names
#> [1] "t"  "k"  "t1" "t2" "t3" "x1" "x2" "x3"
#> 
#> $comment
#> [1] "this is a simulation test."
#> 
#> $param
#> [1] 3 4 5
#> 
#> $components
#>        family param
#> 1 exponential     3
#> 2 exponential     4
#> 3 exponential     5
#> 
#> $candidate_conditions
#> [1] "C1" "C2" "C3"
#> 
#> $latent
#> [1] "k"  "t1" "t2" "t3"
#> 
#> $observable
#> [1] "t"  "x1" "x2" "x3"
#> 
#> $num_comp
#> [1] 3
#> 
#> $sample_size
#> [1] 1000

Note that we removed the row.names attribute, since it’s quite long and uninformative.

Much of the metadata for data describes how the data was generated. In particular, we see that this data is the result of a simulation of a series system with $3$ exponentially distributed component lifetimes parameterized by $\lambda = (3,4,5)'$, together with a candidate model consistent with conditions $C_1$, $C_2$, and $C_3$ for series systems in which the component cause of failure is masked by candidate sets.
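
The latent part of such a data set can be sketched in base R: draw independent exponential lifetimes with the rates above, take the system lifetime as their minimum, and record which component attained it. This is only an illustrative sketch, not the script that produced the file; it does not generate the candidate sets x1, x2, x3, since the conditions $C_1$, $C_2$, and $C_3$ are not spelled out here.

```r
set.seed(1)               # arbitrary seed, for reproducibility of the sketch
rates <- c(3, 4, 5)       # component failure rates from the metadata
n_sim <- 1000             # sample size from the metadata

# One column of exponential lifetimes per component: t1, t2, t3.
lifetimes <- sapply(rates, function(r) rexp(n_sim, rate = r))
colnames(lifetimes) <- paste0("t", seq_along(rates))

sim <- data.frame(
  t = apply(lifetimes, 1, min),        # a series system fails at the first failure
  k = apply(lifetimes, 1, which.min),  # index of the component that failed
  lifetimes
)
```

For independent exponential lifetimes, the system lifetime is itself exponential with rate $3 + 4 + 5 = 12$, and component $j$ is the cause of failure with probability $\lambda_j / 12$, so `mean(sim$t)` should be close to $1/12$.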