A Network Computes Numbers. The Loss Decides What They Mean.
A neural network is a parameterized function. It takes a vector of numbers and returns a vector of numbers, and that is the whole structural commitment. What the output numbers mean (a probability, a class score, a count, a rate) is a separate decision. The network does not make it. The loss does.
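To make that concrete, here is a minimal sketch of one raw output `z` being read three different ways by three different losses. The function names are illustrative assumptions for this post, not scratchnn's API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(z, y):
    """Binary cross-entropy: treats z as a logit, so sigmoid(z) is P(y = 1)."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def poisson_loss(z, y):
    """Poisson negative log-likelihood: treats z as a log-rate, so exp(z) is an expected count."""
    rate = math.exp(z)
    return rate - y * z + math.lgamma(y + 1)

def squared_loss(z, y):
    """Squared error: treats z itself as the predicted value."""
    return 0.5 * (z - y) ** 2

z = 0.0  # the same network output in all three cases
print("as a probability:", sigmoid(z), "loss:", bce_loss(z, 1))
print("as a rate:       ", math.exp(z), "loss:", poisson_loss(z, 1))
print("as a value:      ", z, "loss:", squared_loss(z, 1))
```

The number coming out of the network is identical each time. Only the loss, and the link function it implies, changes what that number claims about the data.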
I wrote a small library, scratchnn, to make this concrete. Pure Python, standard library only, no NumPy in the core, because the point is to read it rather than run it fast. This post is the first stop in a tour through it. The organizing idea for the whole series is simple: every architecture and every loss is a bet about the data. The bets have a name. They are inductive biases, and the right one is what lets a model learn more from less.
The cheapest model, and where it fails
Start with the smallest network: one linear layer, $\mathbf{z} = W\mathbf{x} + \mathbf{b}$. No hidden units, no nonlinearity. With a single output this is one linear unit, and the decision boundary it can draw is exactly a line (a hyperplane, in higher dimensions). Used as a classifier, it fits anything linearly separable, and nothing else.
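In pure Python the whole layer is a few lines. This is a sketch under assumed names (`linear`, with `W` as a list of rows), not necessarily how scratchnn spells it.

```python
def linear(W, b, x):
    """Return z = W x + b, with W a list of rows, b a list of biases, x a list of inputs."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]

# One output unit over two inputs: z = 1.0 * x1 - 1.0 * x2 + 0.5
W = [[1.0, -1.0]]
b = [0.5]
print(linear(W, b, [2.0, 1.0]))
```

The set of inputs where `z` equals zero is the line $x_1 - x_2 + 0.5 = 0$. Everything the unit can decide comes down to which side of that line a point falls on.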
The standard way to see “nothing else” is XOR: the four corners of the unit square, with $(0,1)$ and $(1,0)$ labeled 1 and $(0,0)$ and $(1,1)$ labeled 0. Try to separate them with a sigmoid on top of a single linear unit, $z = w_1 x_1 + w_2 x_2 + b$, and you can write down what success would require, one sign condition on $z$ per corner:
$$b < 0, \quad w_1 + b > 0, \quad w_2 + b > 0, \quad w_1 + w_2 + b < 0.$$

Add the middle two and you get $w_1 + w_2 + 2b > 0$, that is, $w_1 + w_2 + b > -b$, and $-b > 0$ by the first condition. The last condition says the same quantity is below zero. There is no assignment of weights that satisfies all four. The model cannot fit XOR, and if you train it with a cross-entropy loss, every input drifts to probability one half and stays there. Geometrically, one line cannot separate two corners that sit on a diagonal from the other two. This was Minsky and Papert’s argument in 1969, and it stalled the field for over a decade.
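The drift to one half is easy to watch. The sketch below trains a single sigmoid unit on XOR with plain gradient descent on cross-entropy. The starting weights, learning rate, and step count are arbitrary choices for illustration, and the function names are hypothetical rather than scratchnn's.

```python
import math

# XOR: the inputs and their labels
XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_single_unit(w1, w2, b, lr=0.5, steps=5000):
    """Full-batch gradient descent on summed cross-entropy for one sigmoid unit."""
    for _ in range(steps):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in XOR:
            # For sigmoid + cross-entropy, d(loss)/d(logit) is (p - y)
            err = sigmoid(w1 * x1 + w2 * x2 + b) - y
            g1 += err * x1
            g2 += err * x2
            gb += err
        w1 -= lr * g1
        w2 -= lr * g2
        b -= lr * gb
    return w1, w2, b

# Start from deliberately non-zero weights
w1, w2, b = train_single_unit(w1=2.0, w2=-1.0, b=0.5)
probs = [sigmoid(w1 * x1 + w2 * x2 + b) for (x1, x2), _ in XOR]
print([round(p, 3) for p in probs])
```

The loss here is convex, so the starting point does not matter: the weights shrink toward zero and all four predictions settle at one half, which is the best a single line can do on this data.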
What a hidden layer buys
The fix is one hidden layer with a nonlinearity between the linear maps. Without the nonlinearity the layers collapse: two affine maps composed are just one affine map, and stacking buys nothing. With it, the hidden units can build intermediate features that are linearly separable, and the output layer reads them off. For XOR, one hidden unit can detect OR and another AND, and the difference is XOR. A 2-8-1 network with a tanh in the middle learns it cleanly.
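The OR-minus-AND construction can be written down by hand. This sketch uses two tanh hidden units with weights chosen for illustration, not learned, so it is a 2-2-1 network rather than the trained 2-8-1 one. The steepness `k` and the function name `xor_net` are assumptions.

```python
import math

def xor_net(x1, x2, k=10.0):
    """Hand-set 2-2-1 tanh network: class 1 when the output logit is positive."""
    # Hidden unit 1 is close to +1 when at least one input is on (OR)
    h_or = math.tanh(k * (x1 + x2 - 0.5))
    # Hidden unit 2 is close to +1 only when both inputs are on (AND)
    h_and = math.tanh(k * (x1 + x2 - 1.5))
    # Linear readout: positive only when OR fires and AND does not
    logit = h_or - h_and - 1.0
    return 1 if logit > 0 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))
```

In the hidden space the four corners become linearly separable, so the final layer only needs a single line. Gradient descent on a wider network finds some equivalent arrangement on its own.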