Traditional fuzzy logic systems are powerful. They encode expert knowledge as interpretable rules like “IF temperature IS HIGH AND humidity IS LOW THEN fan speed IS FAST.” The problem is someone has to write those rules.
What if the rules could discover themselves?
The Expert Knowledge Bottleneck
Every classical fuzzy system needs three things from a human expert:
- **Membership functions:** Where does “HIGH” start? Where does “LOW” end?
- **Rule structure:** Which combinations of inputs matter?
- **Rule existence:** How many rules are there? Which ones are relevant?
This is expensive. Experts are hard to find, struggle to articulate their reasoning precisely, and can’t easily update systems as conditions change. In emerging domains, relevant expertise might not even exist.
Previous approaches have chipped away at parts of this problem. ANFIS [1] learns membership function parameters but needs a predefined rule structure. Genetic fuzzy systems [2] can evolve rule bases but lose gradient information. The Wang–Mendel method [3] generates rules from data but still needs hand-designed membership functions.
None of them make the entire system learnable end-to-end.
The Key Insight: Make “IF” Differentiable
The core idea is simple: treat a fuzzy rule’s existence as a continuous parameter.
In a traditional system, a rule either exists or it doesn’t; the choice is binary. We replace this with a soft switch: a sigmoid gate \(\gamma_r = \sigma(s_r)\) that smoothly interpolates between “this rule exists” (\(\gamma_r \to 1\)) and “this rule doesn’t exist” (\(\gamma_r \to 0\)).
This transforms rule discovery from a discrete search problem into a differentiable optimization problem. Gradient descent can now tell the system not just how to tune a rule, but whether the rule should exist at all.
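As a minimal sketch of the idea (variable names are illustrative, not from the demo’s source), the gate is just a sigmoid of a learnable scalar:

```typescript
// Soft rule switch: a learnable scalar s_r passed through a sigmoid.
const sigmoid = (s: number): number => 1 / (1 + Math.exp(-s));

// A large negative s_r drives the gate toward 0 (rule absent);
// a large positive s_r drives it toward 1 (rule present).
const gateOff = sigmoid(-6); // ≈ 0.0025
const gateOn = sigmoid(6);   // ≈ 0.9975
```

Because the sigmoid’s derivative \(\gamma(1-\gamma)\) is nonzero everywhere, gradient descent can always move a rule between “off” and “on”, which a hard 0/1 choice would not allow.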
Architecture
A fuzzy soft circuit has three differentiable stages:
1. Fuzzification
Each input \(x_i\) is mapped through \(k\) learnable Gaussian membership functions:
\[ \mu_{i,j}(x_i) = \exp\!\left(-\frac{(x_i - c_{i,j})^2}{w_{i,j}^2}\right) \]
The centers \(c_{i,j}\) and widths \(w_{i,j}\) are learnable. We parameterize widths as \(w = e^{\hat{w}}\) to ensure positivity. No one decides where “HIGH” starts; the system figures it out.
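A sketch of this stage in TypeScript, the demo’s own language (function and parameter names are illustrative, not taken from the actual implementation):

```typescript
// Fuzzification: map each input through its learnable Gaussian
// membership functions. centers[i][j] and logWidths[i][j] are the
// parameters for membership function j of input i.
function fuzzify(
  x: number[],
  centers: number[][],
  logWidths: number[][]
): number[][] {
  return x.map((xi, i) =>
    centers[i].map((c, j) => {
      const w = Math.exp(logWidths[i][j]); // width is always positive
      const d = (xi - c) / w;
      return Math.exp(-d * d); // Gaussian membership in (0, 1]
    })
  );
}
```

An input sitting exactly on a center gets membership 1; membership decays smoothly with distance, at a rate set by the learned width.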
2. Soft Rule Evaluation
For each potential rule \(r\), two things are learned:
**Antecedent relevance:** a weight vector determines which fuzzy features matter for this rule. We use a gated product that smoothly interpolates between “this feature participates” and “this feature is ignored”:
\[ a_r = \prod_{i} \bigl(\rho_{r,i}\,\mu_i + (1 - \rho_{r,i})\bigr) \]
where \(\rho_{r,i} = \sigma(w_{r,i})\). When \(\rho \to 1\), the term becomes \(\mu_i\) (the feature matters). When \(\rho \to 0\), the term becomes 1, the neutral element of multiplication, so the feature is ignored.
**Rule switch:** the soft gate:
\[ R_r = a_r \cdot \sigma(s_r) \]
During training, unnecessary rules naturally have their switches driven toward zero.
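A sketch of one rule’s evaluation, using a linear gate \(\rho\,\mu + (1-\rho)\) per feature, which realizes the limits described above (term \(\to \mu\) as \(\rho \to 1\), term \(\to 1\) as \(\rho \to 0\)). All names are illustrative:

```typescript
const sigmoid = (s: number): number => 1 / (1 + Math.exp(-s));

// Soft evaluation of one rule. mu holds the fuzzy feature values this
// rule sees, featLogits[i] parameterizes rho_{r,i}, and switchLogit
// parameterizes the rule switch s_r.
function ruleActivation(
  mu: number[],
  featLogits: number[],
  switchLogit: number
): number {
  let a = 1;
  for (let i = 0; i < mu.length; i++) {
    const rho = sigmoid(featLogits[i]);
    // rho → 1: term → mu[i] (feature matters);
    // rho → 0: term → 1 (neutral under multiplication, feature ignored)
    a *= rho * mu[i] + (1 - rho);
  }
  return a * sigmoid(switchLogit); // gate by the rule switch
}
```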
3. Defuzzification
Outputs are a normalized weighted combination of rule consequents:
\[ y_j = \frac{\sum_r R_r \cdot \sigma(v_{r,j}) \cdot \sigma(q_{r,j})}{\sum_r R_r \cdot \sigma(v_{r,j}) + \epsilon} \]
Here \(\sigma(q_{r,j})\) is rule \(r\)’s consequent value for output \(j\), \(\sigma(v_{r,j})\) weights that rule’s contribution, and \(\epsilon\) guards against division by zero. The entire pipeline is differentiable. We minimize MSE over training data:
\[ \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{y}^{(i)} - \hat{\mathbf{y}}^{(i)}\|^2 \]
Semantic-Free Representation
A philosophical point that matters in practice: during training, nothing has a name. Variables are \(x_0, x_1, \ldots\), membership functions are \(\mu_{i,0}, \mu_{i,1}, \ldots\), rules are \(R_0, R_1, \ldots\).
The system discovers pure numerical patterns. Semantic interpretation (labeling a membership function as “HIGH” because its center is at 0.85) is a post-hoc human activity, not a training requirement.
This is important because it means the system isn’t constrained by human preconceptions about what the categories should be. It can discover non-obvious partitions of the input space.
Interactive Demo
Select a target function, configure the circuit, and watch it discover rules from data in real time.
Reading the panels
The demo has six panels. Here’s what each one shows.
Learned Surface (top right). This is the function the circuit has learned so far. The two axes are input_0 (horizontal) and input_1 (vertical), both ranging from 0 to 1. Color encodes the output value: dark purple is near 0, bright yellow is near 1. The small circles are training data points, colored by their target value. Before training the surface is a flat wash (the circuit outputs roughly 0.5 everywhere). As training progresses, the colors should shift to match the dots.
Membership functions (middle row, two panels). Each panel shows the Gaussian curves for one input variable (always inputs 0 and 1). These are the “categories” the system is learning. At initialization, the curves are evenly spaced across [0, 1]. During training they slide, stretch, and compress as the system discovers which partitions of the input space are useful. The key thing to notice: nobody told the system where to put these curves. It figured out the boundaries on its own.
Rule Switches (bottom left). Each horizontal bar represents one potential rule. The fill shows the switch value, from 0 (rule is off) to 1 (rule is fully active). Before training, switches start at random values near 0.5. During training, useful rules get pushed toward 1.0, and unnecessary rules get pushed toward 0.0. This is the automatic pruning in action: the system is deciding which rules should exist.
Training Loss (bottom right). The loss curve on a log scale. This shows how well the circuit’s output matches the training data over time. A steep initial drop followed by a plateau is normal. If the loss stalls at a high value, try increasing the learning rate or adding more rules.
Discovered Rules (below the panels). A text readout of the active rules (those with switch > 0.3). Each rule shows which input membership functions it uses and what output it produces. Early in training these will be messy (many antecedents per rule). As training converges, rules simplify: irrelevant antecedents get pruned and consequents sharpen toward 0 or 1.
What to try
Start with the XOR dataset. It needs at least 4 rules to capture the pattern (output is high when exactly one input is high). Press Train and watch:
- The surface starts as a uniform wash
- Within ~100 epochs, color contrast appears in the corners
- Rule switches diverge: some go to ~1.0, others drop toward 0
- By ~500 epochs, the surface matches the XOR pattern and 2 rules have been pruned
Then try Step (a sharp diagonal boundary, tests whether the system can learn a discontinuity) or Peaks (two localized bumps, tests whether the membership functions move to the right places).
For higher-dimensional problems, try Sparse (5D). The target function only depends on inputs 0 and 3, but the circuit doesn’t know that. It has to discover which of the 5 inputs matter and which to ignore. Watch the discovered rules: the surviving ones should only reference inputs 0 and 3. The surface plot shows a 2D slice (inputs 0 vs 1, others fixed at 0.5), so it won’t look like the full picture, but the loss will still drop.
Random (5D) is the hardest. All 5 inputs contribute to a complex nonlinear function. The parameter count jumps significantly (the stats readout updates when you switch datasets), and training is slower per step because finite-difference gradients cost two loss evaluations per parameter. Crank the steps/frame slider up and give it time.
What the System Discovers
On the XOR dataset (\(f(x_0, x_1) = x_0(1-x_1) + (1-x_0)x_1\)), the system typically converges to 4 active rules with 2 pruned:
| Rule | Switch | Pattern | Output |
|---|---|---|---|
| R₀ | 0.97 | \(x_0\) low, \(x_1\) high | ≈ 1.00 |
| R₃ | 1.00 | \(x_0\) low, \(x_1\) low | ≈ 0.00 |
| R₄ | 0.91 | Both high | ≈ 0.00 |
| R₅ | 0.85 | \(x_0\) high, \(x_1\) low | ≈ 1.00 |
That’s the XOR truth table, discovered from 25 data points with zero prior knowledge.
Rules R₁ and R₂ have their switches driven to near zero; the system decided they weren’t needed and effectively deleted them. This is automatic model selection happening inside gradient descent.
Implementation Notes
The implementation is pure TypeScript with zero runtime dependencies. A few design choices worth noting:
Adam optimizer instead of vanilla SGD. The gated product in the antecedent evaluation creates vastly different gradient magnitudes across parameter groups: membership centers get strong gradients, while antecedent weights (buried in a product chain) get tiny ones. Adam’s per-parameter adaptive learning rates handle this naturally.
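For reference, a textbook Adam step over a flat parameter vector looks like this (a generic sketch with the usual default hyperparameters, not the demo’s actual implementation):

```typescript
// One Adam step. m and v are running first- and second-moment
// estimates (initialized to zero), t is the step count starting at 1.
function adamStep(
  params: number[],
  grad: number[],
  m: number[],
  v: number[],
  t: number,
  lr = 0.01,
  beta1 = 0.9,
  beta2 = 0.999,
  eps = 1e-8
): void {
  for (let i = 0; i < params.length; i++) {
    m[i] = beta1 * m[i] + (1 - beta1) * grad[i];
    v[i] = beta2 * v[i] + (1 - beta2) * grad[i] * grad[i];
    const mHat = m[i] / (1 - Math.pow(beta1, t)); // bias correction
    const vHat = v[i] / (1 - Math.pow(beta2, t));
    // Step size adapts per parameter: huge gradients are damped,
    // tiny ones (like the buried antecedent weights) are boosted.
    params[i] -= (lr * mHat) / (Math.sqrt(vHat) + eps);
  }
}
```

The division by \(\sqrt{\hat{v}}\) makes the first step roughly the same size regardless of raw gradient magnitude, which is exactly what the mismatched parameter groups need.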
Numerical gradients via central finite differences. For \(\sim\!100\) parameters and \(\sim\!25\) data points, this is fast enough for real-time browser training. No autodiff library needed:
```typescript
// Central differences: nudge each parameter by ±epsilon and measure
// the change in loss. Two loss evaluations per parameter.
const grad = new Array(params.length).fill(0);
for (let i = 0; i < params.length; i++) {
  params[i] += epsilon;
  const lossPlus = computeLoss(params);
  params[i] -= 2 * epsilon;
  const lossMinus = computeLoss(params);
  params[i] += epsilon; // restore the original value
  grad[i] = (lossPlus - lossMinus) / (2 * epsilon);
}
```
Log-parameterized widths. Membership function widths must be positive, but gradient descent can push them negative. Storing \(\hat{w}\) and computing \(w = e^{\hat{w}}\) ensures positivity without clamping or constraints.
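For completeness, the defuzzification stage admits an equally small sketch. The `v` and `q` arrays follow the symbols in the equation above; all names are illustrative, not from the demo’s source:

```typescript
const sigmoid = (s: number): number => 1 / (1 + Math.exp(-s));

// Defuzzification: each output is a normalized weighted combination of
// rule consequents. R holds gated rule activations; v[r][j] weights
// rule r's contribution to output j, q[r][j] parameterizes its
// consequent value.
function defuzzify(
  R: number[],
  v: number[][],
  q: number[][],
  numOutputs: number,
  eps = 1e-9
): number[] {
  const out: number[] = [];
  for (let j = 0; j < numOutputs; j++) {
    let num = 0;
    let den = eps; // epsilon guards against division by zero
    for (let r = 0; r < R.length; r++) {
      const weight = R[r] * sigmoid(v[r][j]);
      num += weight * sigmoid(q[r][j]); // consequent value in (0, 1)
      den += weight;
    }
    out.push(num / den);
  }
  return out;
}
```

Because the output is a weighted average of sigmoids, it stays in (0, 1), which is why consequents in the demo “sharpen toward 0 or 1” rather than drifting unbounded.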
Comparison with Related Approaches
| Approach | Learns MFs | Learns Rules | Learns Rule Existence | Differentiable |
|---|---|---|---|---|
| Classical Fuzzy | No | No | No | No |
| ANFIS | Yes | No | No | Yes |
| Genetic Fuzzy | Yes | Yes | Yes | No |
| Wang-Mendel | No | Yes | Yes | No |
| Fuzzy Soft Circuits | Yes | Yes | Yes | Yes |
The key differentiator is the last column. Making everything differentiable means we can use the same optimizer for all three learning tasks simultaneously, and the components can co-adapt during training.
Limitations
This is a research demo, not a production system:
- **Scalability:** the gated product over all features creates vanishing gradients for high-dimensional inputs (>10 variables). A sum-based aggregation or attention mechanism would scale better.
- **Rule count:** the number of potential rules is fixed at initialization. L1 regularization on the switch parameters could prune a generous initial budget automatically.
- **Local optima:** different random seeds find different (sometimes worse) rule sets. Ensemble methods or better initialization could help.
- **No theoretical guarantees:** convergence depends on initialization, learning rate, and problem structure.
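The L1 idea from the rule-count limitation is a one-liner to sketch (hypothetical names; not part of the demo):

```typescript
const sigmoid = (s: number): number => 1 / (1 + Math.exp(-s));

// Add an L1 penalty on the switch values so unused rules are actively
// pushed toward zero rather than merely left unused. lambda trades off
// data fit against the number of surviving rules.
function regularizedLoss(
  mse: number,
  switchLogits: number[],
  lambda = 0.01
): number {
  const penalty = switchLogits.reduce((sum, l) => sum + sigmoid(l), 0);
  return mse + lambda * penalty;
}
```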
[1] Jang, J.-S. R. (1993). ANFIS: Adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man, and Cybernetics, 23(3), 665–685.
[2] Cordón, O., Herrera, F., Hoffmann, F., & Magdalena, L. (2001). Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases. World Scientific.
[3] Wang, L.-X., & Mendel, J. M. (1992). Generating fuzzy rules by learning from examples. IEEE Transactions on Systems, Man, and Cybernetics, 22(6), 1414–1427.