In image recognition, there is something about the filter values versus the role of each layer that I don't understand. Many articles express a similar idea: "... to detect edges from raw pixels at the first layer, then use the edges to detect simple shapes at the second layer ... ", and some articles say: "the filters are initialized randomly and automatically learned from the data during training."
My question is: if the filter values in a CNN are not arranged in any particular order (i.e., they are learned starting from random values), how do we know that a CNN (always?) learns edges at the first layer, detects shapes at the second layer, and so on? Thank you very much!
If the filters are learned starting from arbitrary values, which I know they are, how does a CNN always seem to learn an image in terms of edges, then shapes, and so on? It looks as if a CNN finds its own way (or pattern?) of putting the filters in order. My guess is that the 'filtering-pooling' process shrinks the original image at each stage, so the CNN ends up learning the image features hierarchically.
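
One way to make this guess concrete is to compute how large a patch of the original image a single unit can "see" after each layer (its receptive field). The sketch below is only an illustration under assumed settings: the layer names, the 3x3 convolutions with stride 1, and the 2x2 pooling with stride 2 are hypothetical and not taken from any particular network. It does not show that edges are what gets learned, only that early layers are limited to small local patterns while later layers cover larger regions.

```python
def receptive_fields(layers):
    """Return the receptive-field size (in input pixels) after each layer.

    layers: list of (name, kernel_size, stride) tuples, applied in order.
    """
    rf = 1    # a single input pixel covers only itself
    jump = 1  # spacing, in input pixels, between neighbouring units at this depth
    sizes = []
    for name, kernel, stride in layers:
        # Each extra kernel tap widens the view by one "jump" of input pixels.
        rf += (kernel - 1) * jump
        # Striding spreads neighbouring units further apart in input space.
        jump *= stride
        sizes.append((name, rf))
    return sizes


# Hypothetical stack: 3x3 convolution (stride 1) followed by 2x2 pooling (stride 2).
stack = [
    ("conv1", 3, 1),
    ("pool1", 2, 2),
    ("conv2", 3, 1),
    ("pool2", 2, 2),
    ("conv3", 3, 1),
]

for name, rf in receptive_fields(stack):
    print(f"{name}: each unit sees a {rf}x{rf} patch of the input")
```

With these assumed settings, a first-layer unit sees only a 3x3 patch, which is too small to contain much more than a local intensity change, while a third-layer unit sees an 18x18 patch. The order of the filters within a layer remains arbitrary; what is fixed is how much of the image each layer can cover.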