After days of searching, I'm stuck. Everyone talks about activation functions in the forward pass, but almost nothing is said about where they belong in back-propagation.
I coded a fully connected network with 2 hidden layers (sigmoid) for MNIST and got about 90% accuracy. Now I'm adding a convolutional layer at the front (plus a 2x2 max-pool layer), and I apply a ReLU activation to the output of the conv layer. I've coded all of the back-prop except that I have no idea where to apply the derivative of the activation function. In the fully connected network, the derivative was applied to the layer's output and then multiplied by the error, so the update was roughly gradient = f'(O) * E * lr.
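To make the question concrete, here is a minimal sketch of what I mean (the names `O`, `E`, `lr`, the shapes, and the random placeholders are illustrative, not my actual code): the fully connected part shows where I apply the derivative today, and the conv part shows the spot in the backward pass where I don't know what to do.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(o):
    # derivative of the sigmoid, expressed in terms of its output o
    return o * (1.0 - o)

# --- fully connected layer: this part works (illustrative shapes only) ---
lr = 0.1
X = np.random.randn(1, 784)           # one flattened MNIST image
W = np.random.randn(784, 128) * 0.01  # weights of the first hidden layer

O = sigmoid(X @ W)                    # forward pass
E = np.random.randn(1, 128)           # error arriving from the layer above

delta = sigmoid_prime(O) * E          # f'(O) * E  -- this is where I apply the derivative
W -= lr * (X.T @ delta)               # weight update, gradient * lr

# --- conv layer: ReLU is applied in the forward pass, but in the backward
#     pass I don't know where its derivative should be multiplied in ---
def relu(x):
    return np.maximum(0.0, x)

conv_out = relu(np.random.randn(1, 8, 26, 26))  # placeholder for my conv output
# backward: the gradient comes back through the 2x2 max-pool...
# ...and then? Do I multiply by relu's derivative (1 where conv_out > 0,
# 0 elsewhere) here, before computing the kernel gradients, or somewhere else?
```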
If I remove the activation function from the conv layer altogether, I get about 50% accuracy. With the activation applied in the forward pass only (and its derivative nowhere in the backward pass), I get about 10%.
I'm clearly missing something, and the answer is probably going to be embarrassing :). Thanks for your consideration.