Suppose you want to predict the political party affiliation (democrat, republican, or other) of a person based on age, income, and education. A training data set for this problem might look like:

32 48 14 0 1 0 24 28 12 1 0 0 . . .

The first line represents a 32-year-old person who makes $48,000 a year, has 14 years of education, and is a republican (0 1 0).

A neural network would have 3 input nodes (age, income, education), some number of hidden nodes (perhaps 10) determined by trial and error, and 3 output nodes where (1 0 0) is democrat, (0 1 0) is republican, and (0 0 1) is other.
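The one-hot encoding of the party labels can be sketched with a tiny helper (the `one_hot` function and the `parties` list are my own illustration, not from any particular library):

```python
# hypothetical encoding helper: map a party name to a one-hot vector
parties = ["democrat", "republican", "other"]

def one_hot(party):
    # 1 in the position matching the party, 0 elsewhere
    return [1 if p == party else 0 for p in parties]

print(one_hot("republican"))  # [0, 1, 0]
```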

The neural network will generate three raw output values that could be anything, such as (3.0, -1.0, 4.0). Some neural libraries have two error functions for training: cross-entropy and cross-entropy-with-softmax. You can apply softmax to the raw output node values, then apply regular cross-entropy, and then use that error to adjust the network's weights and biases. Or you can skip the explicit softmax and apply cross-entropy-with-softmax directly to the raw values. The result will be the same.
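The equivalence of the two paths can be checked with a minimal sketch. The function names here (`softmax`, `cross_entropy`, `cross_entropy_with_softmax`) are my own; a real library's cross-entropy-with-softmax routine would also use a smarter gradient computation, but the computed error value is the same:

```python
import math

def softmax(zs):
    # numerically stable softmax: subtract the max before exponentiating
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, targets):
    # standard cross-entropy error: -sum(target * log(prob))
    return -sum(t * math.log(p) for p, t in zip(probs, targets))

def cross_entropy_with_softmax(raw_outputs, targets):
    # a combined routine applies softmax internally,
    # so it expects the raw output node values
    return cross_entropy(softmax(raw_outputs), targets)

raw = [3.0, -1.0, 4.0]
target = [0, 0, 1]

# path 1: apply softmax yourself, then plain cross-entropy
err1 = cross_entropy(softmax(raw), target)

# path 2: feed the raw values to cross-entropy-with-softmax
err2 = cross_entropy_with_softmax(raw, target)

print(err1, err2)  # identical values
```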

But a subtle mistake (one I made recently) is to apply softmax to the raw output nodes and then use cross-entropy-with-softmax during training. The result is that softmax is applied twice, which usually isn't a good idea. I'll try to explain with an example.

Suppose the raw output node values are (3.0, -1.0, 4.0) and the target values in the training data are (0, 0, 1). If you apply softmax, the output node values become (0.268, 0.005, 0.727), and then with regular cross-entropy you'd be comparing 0.727 against the target of 1, and you have nice separation between the three probabilities.

But if you apply cross-entropy-with-softmax, the output node values are re-softmaxed and become (0.298, 0.229, 0.472). The probabilities are now much closer together, and so training will likely be slower.

Very interesting stuff! (Well, for geeks anyway.)

*“The Sequence” – Rick Eskridge*