I was trying to understand how K.layers.Dropout might be implemented, since in the literature it's always referred to as sampling an independent random 0/1 mask for each element.
Given that the literature is pretty clear to me, I switched to coding it, and I stumbled upon an issue: since TF uses graphs, we don't know the batch size at trace time. In particular:
```python
import tensorflow as tf
from tensorflow import keras as K

class CustomLayer(K.layers.Layer):
    def call(self, inputs):
        tf.print(inputs.shape)
        return inputs
```
will indeed print (assuming eager execution is turned off) `None` as the first dimension.
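For reference, this is how I trigger it (a minimal setup; the model here is just an assumed example, `predict` traces the layer in graph mode):

```python
import tensorflow as tf
from tensorflow import keras as K

class CustomLayer(K.layers.Layer):
    def call(self, inputs):
        tf.print(inputs.shape)  # static shape, fixed at trace time
        return inputs

model = K.Sequential([K.Input(shape=(4,)), CustomLayer()])
model.predict(tf.zeros((8, 4)))  # typically prints TensorShape([None, 4])
```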
Having said that, how is TF able to sample an independent mask for each sample in each minibatch?
At the moment, my best guess is that they are using something like `tf.vectorized_map` to get the performance they get while still drawing a random mask for each element in the minibatch.
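Something along these lines is what I had in mind (purely a hypothetical sketch of my guess, not anything I found in the TF source; `rate` and `inputs` are placeholders):

```python
import tensorflow as tf

rate = 0.5                            # hypothetical dropout rate
inputs = tf.random.normal((32, 10))   # placeholder minibatch

# My guess: draw a fresh 0/1 mask per sample via tf.vectorized_map
# (inverted-dropout scaling omitted for brevity).
masked = tf.vectorized_map(
    lambda row: row * tf.cast(tf.random.uniform(tf.shape(row)) >= rate, row.dtype),
    inputs,
)
```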
I traced the code for `tf.keras.layers.Dropout.call` in an effort to answer this question (TensorFlow 2.9). In summary, a random uniform distribution is sampled from [0, 1) with the same shape as the input (including the batch dimension). This allows the method to use an independent mask for each sample. The noise array is then made into a boolean mask based on the dropout rate. This is all assuming that one keeps `noise_shape=None` when instantiating the `Dropout` layer. I have copied the relevant lines below.
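Condensed into a self-contained sketch (paraphrased, not the verbatim 2.9 source; the public `tf.random.uniform` stands in for the internal sampler the library actually uses):

```python
import tensorflow as tf

def dropout_sketch(x, rate):
    """Paraphrase of the core logic in tf.nn's internal _dropout helper."""
    # Scale the surviving activations so the expected value is unchanged.
    scale = 1.0 / (1.0 - rate)
    ret = x * tf.cast(scale, x.dtype)

    # With noise_shape=None, the noise shape is the full runtime shape of x,
    # obtained with the graph-compatible tf.shape (batch dimension included).
    noise_shape = tf.shape(x)

    # Sample U[0, 1) elementwise and keep entries whose sample is >= rate:
    # an independent Bernoulli mask for every element of every sample.
    random_tensor = tf.random.uniform(noise_shape, dtype=x.dtype)
    keep_mask = random_tensor >= rate
    return ret * tf.cast(keep_mask, x.dtype)
```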
In the case that `noise_shape=None` in the `Dropout` layer, `_get_noise_shape` will return the shape of the input `x`. This is done with the graph-compatible method `tf.shape`, which evaluates the shape of the tensor at runtime (see the short demo after the list below).

Here is an overview of the process for the TensorFlow / Keras v2 API:
1. Instantiate a `tf.keras.layers.Dropout` layer (with `noise_shape=None`).
2. Invoke the `Dropout.call` instance method on an input `x`.
3. `Dropout.call` calls `self._random_generator.dropout`, i.e. `BaseRandomLayer._random_generator.dropout`, which calls `tf.nn.experimental.stateless_dropout`.
4. In `BaseRandomLayer._random_generator.dropout`: the v2 API will use `stateless_dropout` and the v1 API will use `tf.nn.dropout`.
5. Either path ends in `_dropout`, which then constructs the noise array to be the same shape as the input tensor `x`.
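And here is the `tf.shape` point from above in isolation (a minimal sketch; the `tf.function` wrapper just stands in for graph mode):

```python
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec(shape=[None, 4], dtype=tf.float32)])
def show_shapes(x):
    tf.print(x.shape)      # static shape, fixed at trace time: TensorShape([None, 4])
    tf.print(tf.shape(x))  # runtime shape op, evaluated per call: e.g. [8 4]
    return x

show_shapes(tf.zeros((8, 4)))  # batch size recovered at runtime despite the None
```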