Through a single-user case study of 449 ChatGPT conversations, we introduce a "cognitive MRI": an application of network analysis that reveals the thought topology hidden in linear conversation logs. We construct semantic similarity networks from user-weighted embeddings to identify knowledge communities and the bridge conversations that enable cross-domain flow. Our analysis reveals a heterogeneous topology: theoretical domains exhibit hub-and-spoke structures, while practical domains show tree-like hierarchies. We identify three distinct bridge types that facilitate knowledge integration across communities.
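The pipeline the abstract describes (embed each conversation, link pairs whose similarity clears a threshold, then detect communities) might be sketched as follows. This is an illustrative reconstruction, not the paper's code: the embeddings are assumed given, and community detection is simplified to connected components.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_graph(embeddings, threshold=0.8):
    """Adjacency sets: an edge joins conversations whose cosine similarity
    meets the threshold."""
    edges = {i: set() for i in range(len(embeddings))}
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) >= threshold:
            edges[i].add(j)
            edges[j].add(i)
    return edges

def communities(edges):
    """Connected components, standing in for a real community-detection step."""
    seen, comps = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(edges[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps
```

A bridge conversation would then be one whose edges span two of the returned components.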
Raw Papers
Auto-generated list of all papers from /static/latex/
We present Infinigram, a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models that require expensive training and fine-tuning, Infinigram provides instant training—the corpus is the model. Given a context, Infinigram finds the longest matching suffix in the training corpus and estimates next-token probabilities from observed continuations. This approach offers O(m log n) query time, complete explainability since every prediction traces to specific corpus evidence, and the ability to ground LLM outputs through probability mixing without retraining. We introduce a theoretical framework that views inductive biases as projections—transformations applied to queries or training data that enable generalization.
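The core lookup can be sketched in a few lines. This sketch uses a naive linear scan over a token list rather than the suffix arrays the abstract describes, so it runs in O(nm) instead of O(m log n); it illustrates only the longest-suffix-match estimate.

```python
def next_token_probs(corpus, context):
    """Find the longest suffix of `context` that occurs in `corpus`, then
    estimate next-token probabilities from the observed continuations.
    `corpus` and `context` are token lists."""
    for k in range(len(context), 0, -1):
        suffix = context[-k:]
        counts = {}
        # Scan every position that leaves room for a continuation token.
        for i in range(len(corpus) - k):
            if corpus[i:i + k] == suffix:
                nxt = corpus[i + k]
                counts[nxt] = counts.get(nxt, 0) + 1
        if counts:
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}
    return {}  # no suffix of the context occurs in the corpus
```

Every returned probability traces to concrete corpus positions, which is the explainability property the abstract claims.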
This technical report introduces the likelihood.model.series.md R package for maximum likelihood estimation in series systems with masked component cause of failure data. The package provides a unified framework for exponential and Weibull series systems, implementing log-likelihood functions, score vectors, and Hessian matrices under specific masking conditions (C1, C2, C3). We describe the mathematical foundation, software architecture, and integration with the broader likelihood.model ecosystem. The package enables practitioners to perform parameter estimation, construct confidence intervals, and conduct hypothesis tests for series system reliability problems where component failure causes are only partially observable.
When can reliability engineers safely use a simpler model for series system analysis? This paper provides a definitive answer: for well-designed systems with similar component failure characteristics, a reduced homogeneous-shape Weibull model is statistically indistinguishable from the full heteroge...
We develop likelihood-based inference methods for series systems with masked failure data when the traditional conditions governing candidate set formation are relaxed. While existing methods require that masking be non-informative (Condition C2) and parameter-independent (Condition C3), practical d...
We consider the problem of estimating component failure rates in series systems when observations consist of system failure times paired with partial information about the failed component. For the case where component lifetimes follow exponential distributions, we derive closed-form expressions for the maximum likelihood estimator, the Fisher information matrix, and establish sufficient statistics. The asymptotic sampling distribution of the estimator is characterized and confidence intervals are provided. A detailed analysis of a three-component system demonstrates the theoretical results.
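For the special case where failure causes are fully observed (no masking), the closed-form MLE is easy to state and check by simulation: each component's rate estimate is its failure count divided by the total time on test. The sketch below covers that unmasked case only; it is not the paper's masked-data likelihood.

```python
import random

def simulate_series(rates, n, seed=0):
    """Simulate n failures of a series system with exponential components.
    Each observation is (system failure time, index of the failed component):
    the system fails when its first component fails."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        times = [rng.expovariate(lam) for lam in rates]
        t = min(times)
        data.append((t, times.index(t)))
    return data

def mle_rates(data, m):
    """Closed-form MLE with fully observed causes:
    lambda_j = (# failures due to component j) / (total time on test)."""
    total_time = sum(t for t, _ in data)
    counts = [0] * m
    for _, j in data:
        counts[j] += 1
    return [c / total_time for c in counts]
```

With masked causes, the count for component j is no longer observed directly, which is what motivates the candidate-set likelihood the paper develops.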
This technical report provides a rigorous specification of Monte Carlo Tree Search (MCTS) applied to Large Language Model (LLM) reasoning. We present formal definitions of the search tree structure, the four phases of MCTS (Selection, Expansion, Simulation, Backpropagation), and their adaptation for step-by-step reasoning with language models. Key design choices—including tree-building rollouts and terminal-only evaluation—are explicitly documented and justified. The goal is to establish a canonical reference specification that authentically captures the MCTS algorithm while adapting it for the unique characteristics of LLM-based reasoning tasks.
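The Selection and Backpropagation phases can be sketched as follows, assuming the standard UCB1 scoring rule. The node fields and reward convention here are illustrative, not the report's exact specification.

```python
from math import log, sqrt

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0   # sum of backed-up rewards

def uct_select(node, c=1.414):
    """Selection phase: descend to the child maximizing the UCB1 score
    value/visits + c * sqrt(ln(parent visits) / visits).
    Unvisited children score infinity, so each is tried once."""
    while node.children:
        def score(ch, parent=node):
            if ch.visits == 0:
                return float("inf")
            return ch.value / ch.visits + c * sqrt(log(parent.visits) / ch.visits)
        node = max(node.children, key=score)
    return node

def backpropagate(node, reward):
    """Backpropagation phase: add the terminal reward along the path to the root."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent
```

In the LLM setting, a child's state would be the parent's reasoning prefix extended by one sampled step, and the reward would come from terminal-only evaluation as the abstract notes.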
We present a novel privacy framework based on rank-deficient observation functions that create computational indistinguishability between distinct inputs. Unlike differential privacy, which adds calibrated noise, our approach leverages lossy compression and many-to-one mappings to achieve privacy. W...
We present Bernoulli sets, a comprehensive statistical framework for probabilistic set data structures that provides rigorous foundations for approximating set membership queries under uncertainty. Our framework unifies the treatment of fundamental data structures like Bloom filters, Count-Min sketc...
We present cipher maps, a comprehensive theoretical framework unifying oblivious function approximation, algebraic cipher types, and Bernoulli data models. Building on the mathematical foundations of cipher functors that lift monoids into cipher monoids, we develop oblivious Bernoulli maps that prov...
Problem: Traditional encrypted search systems face a fundamental tension: deterministic schemes leak access patterns enabling inference attacks, while probabilistic structures like Bloom filters provide space efficiency but fail to hide what is being queried.
We present a universal construction for space-optimal probabilistic data structures based on hash functions and Bernoulli types. Our framework unifies Bloom filters, Count-Min sketches, HyperLogLog, and other probabilistic structures under a common mathematical foundation, proving that they all aris...
We present Hash-Based Oblivious Sets (HBOS), a practical framework for privacy-preserving set operations that combines cryptographic hash functions with probabilistic data structures. Unlike traditional approaches using fully homomorphic encryption or secure multi-party computation, HBOS achieves mi...
We present a unified type-theoretic framework for approximate computing based on the fundamental distinction between latent (true) and observed (approximate) values. This duality naturally arises not only in probabilistic data structures like Bloom filters, but equally in algorithms themselves—from ...
We present DagShell, a virtual POSIX-compliant filesystem implemented as a content-addressable directed acyclic graph (DAG). Unlike traditional filesystems that mutate data in place, DagShell treats all filesystem objects as immutable nodes identified by SHA256 content hashes, similar to Git’s objec...
We present JSONL Algebra (ja), a command-line tool and interactive REPL that applies relational algebra operations to semi-structured JSONL (JSON Lines) data. Unlike traditional database systems that require rigid schemas, ja embraces the flexibility of JSON while providing the expressive power of r...
We present alea, a C++20 header-only library that implements probability distributions and Monte Carlo methods using generic programming and type erasure. The library provides: (1) a type-erased interface for probability distributions that enables runtime polymorphism while maintaining type safety; ...
We present Alga, a header-only C++20 template library that models parsers as composable algebraic structures. By treating parsers as elements of monoids and leveraging monadic composition patterns, Alga provides a mathematically rigorous yet practically efficient framework for text processing. The l...
We present algebraic cipher types, a functorial framework that lifts monoids and other algebraic structures into cryptographically secure representations while preserving their computational properties. Our cipher functor cA provides a systematic construction that maps a monoid (S,∗,e) to a cipher m...
We present a comprehensive algebraic framework for language model composition that transforms how we build and reason about language systems. Our framework introduces a rich set of operators—mixture (+), scalar (*), maximum (|), minimum (&), exclusive-or (⊕), temperature (**), threshold (>>), tr
Fuzzy logic systems have traditionally relied on domain experts to define membership functions and inference rules, creating a significant barrier to deployment in domains where expert knowledge is limited or expensive to obtain. We present a novel approach to fuzzy system design through fuzzy soft ...
We present DreamLog, a neural-symbolic system that integrates logic programming with large language models (LLMs) through a biologically-inspired architecture featuring wake-sleep cycles, compression-based learning, and recursive knowledge generation. DreamLog addresses the fundamental challenge of ...
We present a novel framework that addresses the fundamental challenge of optimizing prompts for large language models (LLMs) whose internal dynamics are unknown and unobservable. Unlike approaches that assume knowledge of the LLM’s behavior, we formulate prompting as a Partially Observable Markov De...
Streaming data processing requires algorithms that compute statistical aggregates in a single pass with constant memory—a challenge complicated by floating-point precision loss and the need to compute multiple statistics simultaneously. We present accumux, a C++ library that solves these challenges ...
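The core numerical idea for fighting floating-point precision loss in a single pass, compensated (Kahan) summation, fits in a few lines. accumux itself is C++; this Python sketch shows only the algorithm, not the library's composable accumulator API.

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: carry the low-order bits lost at
    each addition in a running correction term c."""
    total = 0.0
    c = 0.0
    for x in values:
        y = x - c            # apply the correction from the previous step
        t = total + y
        c = (t - total) - y  # the part of y that did not make it into t
        total = t
    return total
```

Computing several such accumulators (sum, count, min, max) over the same loop body is what yields multiple statistics in one pass with constant memory.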
We present Algebraic Hashing, a header-only C++20 library that enables systematic composition of hash functions through XOR operations. The library explores algebraic properties of hash function composition, providing practical implementations with compile-time composition via template metaprogrammi...
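The XOR-composition idea can be illustrated as follows. FNV-1a stands in for whatever hash family the library actually uses, and the function names here are hypothetical; the library does this at compile time with templates rather than at runtime.

```python
def fnv1a(data: bytes) -> int:
    """FNV-1a, a simple 64-bit non-cryptographic hash."""
    h = 0xcbf29ce484222325
    for b in data:
        h ^= b
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def salted(salt: bytes):
    """A distinct hash function per salt, derived from one base hash."""
    return lambda x: fnv1a(salt + x)

def xor_compose(h1, h2):
    """Compose two hash functions by XOR-ing their outputs. XOR gives the
    composition nice algebraic structure: it is associative and commutative,
    the all-zero hash is the identity, and every hash is its own inverse."""
    return lambda x: h1(x) ^ h2(x)
```

Self-inverse composition (h ⊕ h = 0) is the property that makes XOR-combined hashes easy to reason about algebraically.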
We present apertures, a coordination mechanism for distributed computation based on partial evaluation with explicit holes. Apertures (denoted ?variable) represent unknown expressions that enable pausable and resumable evaluation across multiple parties. Unlike security-focused approaches, we make n...
We present maph (Map based on Perfect Hash), a high-performance key-value storage system that achieves sub-microsecond latency through a novel combination of memory-mapped I/O, approximate perfect hashing, and lock-free atomic operations. Unlike traditional key-value stores that suffer from kernel/u...
We present PFC (Prefix-Free Codecs), a header-only C++20 library that achieves full-featured data compression without marshaling overhead through a novel combination of prefix-free codes and generic programming principles. The library implements a zero-copy invariant: in-memory representation equals...
We present Xtk (Expression Toolkit), a powerful and extensible system for symbolic expression manipulation through rule-based term rewriting. Xtk provides a simple yet expressive framework for pattern matching, expression transformation, and symbolic computation. The system employs an Abstract Syntax Tree (AST) representation using nested Python lists, enabling intuitive expression construction while maintaining formal rigor. We demonstrate that Xtk’s rule-based approach is Turing-complete and show its applicability to diverse domains including symbolic differentiation, algebraic simplification, theorem proving via tree search algorithms, and expression optimization. The toolkit includes an extensive library of predefined mathematical rules spanning calculus, algebra, trigonometry, and logic, along with an interactive REPL for exploratory computation. We present the theoretical foundations of the system, describe its implementation architecture, analyze its computational complexity, and provide comprehensive examples demonstrating its practical applications.
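A miniature version of the nested-list AST and rule-based rewriting might look like this. The pattern syntax (variables prefixed with "?") and the function names are illustrative, not Xtk's actual API.

```python
def match(pattern, expr, bindings=None):
    """Match a pattern against an expression; strings starting with '?'
    are pattern variables. Returns a bindings dict, or None on failure."""
    if bindings is None:
        bindings = {}
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:
            return bindings if bindings[pattern] == expr else None
        return {**bindings, pattern: expr}
    if isinstance(pattern, list) and isinstance(expr, list):
        if len(pattern) != len(expr):
            return None
        for p, e in zip(pattern, expr):
            bindings = match(p, e, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == expr else None

def substitute(template, bindings):
    """Instantiate a rule's right-hand side with the matched bindings."""
    if isinstance(template, str) and template.startswith("?"):
        return bindings[template]
    if isinstance(template, list):
        return [substitute(t, bindings) for t in template]
    return template

def rewrite(expr, rules):
    """One pass: apply matching rules at the root, then recurse into children."""
    for lhs, rhs in rules:
        b = match(lhs, expr)
        if b is not None:
            expr = substitute(rhs, b)
    if isinstance(expr, list):
        return [rewrite(e, rules) for e in expr]
    return expr

def simplify(expr, rules):
    """Rewrite to a fixed point (assumes the rule set terminates)."""
    prev = None
    while expr != prev:
        prev, expr = expr, rewrite(expr, rules)
    return expr
```

For example, the single rule (+ ?x 0) → ?x suffices to strip additive identities from an arbitrarily nested expression.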
This paper presents a novel approach to preventing ransomware damage through in-operation off-site backup systems designed to achieve an exceptionally low false-negative (missed-detection) rate of 10⁻⁸.
We present Fisher Flow (FF), a framework for sequential statistical inference that propagates Fisher information rather than probability distributions. Fisher Flow provides a computationally efficient alternative to Bayesian updating while maintaining rigorous uncertainty quantification. The key ins...
We present the Stepanov library, a header-only C++20/23 library that demonstrates the power of generic programming through mathematical abstractions and efficient implementations. Building on Alex Stepanov’s foundational principles, this library advances the state of generic programming by introduci...
How do you store infinity in 256 bits? This paper explores the fundamental deception at the heart of computational cryptography: using finite information to simulate infinite randomness. We prove why true random oracles are impossible, then show how lazy evaluation creates a beautiful lie—a finite a...
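The "lazy lie" can be sketched as a memoized table that samples an output only when a point is first queried, so the finite table grows just with the queries actually made. In a real construction a PRF keyed by the 256-bit seed would replace the fresh randomness used here.

```python
import os

class LazyRandomOracle:
    """Simulate a random oracle lazily: outputs are sampled on first query
    and memoized so repeated queries are consistent. The table is finite,
    yet to any polynomially bounded querier the function looks like an
    infinite random table."""
    def __init__(self, out_bytes=32):
        self.out_bytes = out_bytes
        self.table = {}

    def query(self, x: bytes) -> bytes:
        if x not in self.table:
            self.table[x] = os.urandom(self.out_bytes)
        return self.table[x]
```

Replacing `os.urandom` with a keyed PRF trades true randomness for a 256-bit secret, which is exactly the finite-for-infinite substitution the abstract calls a deception.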
This paper explores model selection for reliability estimation of components in multi-component series systems using Weibull distributions. We assess the sensitivity of a likelihood model based on right-censored and masked failure data to deviations from well-designed series systems where components...
This paper investigates maximum likelihood techniques to estimate component reliability from masked failure data in series systems. A likelihood model accounts for right-censoring and candidate sets indicative of masked failure causes. Extensive simulation studies assess the accuracy and precision o...
We present a time series analysis of confidentiality degradation in encrypted search systems subject to known-plaintext attacks. By modeling the adversary's accuracy as a time-dependent confidentiality measure, we develop forecasting models based on ARIMA and dynamic regression techniques. Our analy...
hypothesize is an R package that provides a simple, consistent API for hypothesis testing. It is designed to be easy to use and is built on top of the likelihood.model package, which provides a consistent interface for likelihood models.
Facilitates building likelihood models. It defines a number of generic methods for working with likelihoods, along with a few functions that are strictly functions of these generic methods. It is designed to interoperate well with the algebraic.mle R package. It also comes with a likelihood contributions model: a likelihood model composed of multiple likelihood contributions under an i.i.d. data assumption.
We define a set of functions for solving common MLE problems.
A set of functions and methods for operating on, sampling from, and otherwise working with distributions.
We define an algebra over maximum likelihood estimators: a set of operators that are closed over maximum likelihood estimators, along with various other informative functions.
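One operator such an algebra typically includes is combining independent MLEs of the same parameter by inverse-variance (Fisher information) weighting; the combined variance is the inverse of the summed information. The sketch below is illustrative, and the function name is hypothetical rather than algebraic.mle's API.

```python
def combine_mle(estimates):
    """Combine independent MLEs of the same scalar parameter.
    `estimates` is a list of (point estimate, variance) pairs.
    Weights are inverse variances, i.e. (estimated) Fisher information;
    the result is the information-weighted mean and its variance."""
    info = sum(1.0 / v for _, v in estimates)       # total information
    point = sum(e / v for e, v in estimates) / info # weighted mean
    return point, 1.0 / info
```

Closure is what makes this an algebra: the combined estimator is again an (asymptotically efficient) MLE-like estimator with its own variance, so it can be fed back into further combinations.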
Lets you construct differentiable expressions using automatic differentiation.
A set of functions and methods for operating on, sampling from, and estimating distributions parameterized by dynamic failure rates that may be correlated with any set of predictors or covariates.
This paper applies a resilience-engineering approach to studying how effective encrypted search will be. One concern with encrypted search is frequency attacks, in which adversaries guess the meaning of encrypted words by observing a large number of encrypted words in search queries and mapping them to guessed plaintext words using a known histogram. It is therefore important for defenders to know how many encrypted words an adversary must observe before correctly guessing them. However, determining this takes defenders a long time because of the large volume of encrypted words involved. We developed and evaluated the Moving Average Bootstrap (MAB) method for estimating the number of encrypted words (N*) an adversary needs to observe before correctly guessing a certain percentage of the observed words with a certain confidence. Our experiments indicate that the MAB method lets defenders estimate N* using only 5% of the time required without MAB. Because of this significant reduction in estimation time, MAB will contribute to the safety of encrypted search.
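The flavor of a moving-average bootstrap estimate can be sketched as follows: smooth the adversary's accuracy curve with a moving window, and report the smallest sample size whose window clears the target accuracy with bootstrap confidence. This is a simplified illustration with hypothetical names and a unit sampling grid, not the authors' MAB method.

```python
import random

def bootstrap_lower_bound(samples, level=0.95, reps=2000, seed=0):
    """Percentile-bootstrap lower confidence bound on the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        sum(rng.choice(samples) for _ in range(n)) / n
        for _ in range(reps)
    )
    return means[int((1 - level) * reps)]

def estimate_n_star(accuracy_by_n, target=0.5, window=5, level=0.95):
    """Smallest N whose moving-average window of adversary accuracies has a
    bootstrap lower bound at or above the target accuracy; None if never."""
    for i in range(window, len(accuracy_by_n) + 1):
        win = accuracy_by_n[i - window:i]
        if bootstrap_lower_bound(win, level) >= target:
            return i
    return None
```

The time savings the paper reports would come from needing far fewer accuracy measurements once the smoothed curve stabilizes, rather than simulating the full attack to completion.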
This thesis studies Encrypted Search, a technique for securely searching documents on untrusted systems. The author designs and compares several secure index types to determine their performance in terms of time complexity, space complexity, and retrieval accuracy. The study also examines the risk to confidentiality posed by storing frequency and proximity information and suggests ways to mitigate that risk. Additionally, the author investigates the impact of false positives and secure index poisoning on performance and confidentiality. Finally, the thesis explores techniques to protect against adversaries with access to encrypted query histories, such as query obfuscation.