I'm facing some issues with the behavior of the training loop when using TensorFlow's Estimator and Dataset APIs together. The code is as follows (TF 2.3):
```python
NUM_EXAMPLES = X_train.shape[0]  # dataset has 8000 elements
BATCH_SIZE = NUM_EXAMPLES
STEPS = None
N_EPOCHS = 100

def make_input_fn(X, y, n_epochs=N_EPOCHS, shuffle=True):
    dataset = tf.data.Dataset.from_tensor_slices((X.to_dict(orient='list'), y))
    if shuffle:
        dataset = dataset.shuffle(NUM_EXAMPLES)
    # n_epochs passes over the data, split into batches of BATCH_SIZE
    return dataset.repeat(n_epochs).batch(BATCH_SIZE)

estimator = tf.estimator.BoostedTreesClassifier(
    feature_columns=feature_cols,
    n_batches_per_layer=1,
    n_trees=50,
    max_depth=6,
    l2_regularization=0.1,
    config=tf.estimator.RunConfig(
        model_dir=model_dir,
        save_checkpoints_steps=100,
    ),
)

estimator.train(input_fn=lambda: make_input_fn(X_train, y_train), steps=STEPS)
```
I just don't understand the behavior I'm seeing, though. The number of steps the estimator trains for seems to be capped at 300, regardless of what I set for batch_size, training steps, or number of epochs.
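One way to confirm the cap is to read the global step back from the trained estimator ('global_step' is the standard TF variable name for this; a quick check, not part of my training code):

```python
# After estimator.train(...) returns, the global step shows where training stopped.
print(estimator.get_variable_value('global_step'))  # prints 300 in every capped run
```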
My dataset has 8K training elements. When I choose n_epochs=100 with batch_size=1000 and steps=None, I expect TensorFlow to run 100 (n_epochs) * 8 (steps per epoch) = 800 steps, but no, it runs 300.
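For reference, this is the arithmetic behind my expected step counts (a minimal sketch; using ceil for a final partial batch is my assumption):

```python
import math

def expected_steps(num_examples, batch_size, n_epochs):
    # repeat(n_epochs) yields n_epochs passes over the data, each split into
    # ceil(num_examples / batch_size) batches; with steps=None, train()
    # should run until this dataset is exhausted.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return n_epochs * steps_per_epoch

print(expected_steps(8000, 1000, 100))  # 800
print(expected_steps(8000, 10, 600))    # 480000 (i.e. 600 * 800)
```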
Below is a summary of multiple experiments with different N_EPOCHS, BATCH_SIZE, and STEPS values. The first three rows look fine to me, but not the rest.
| # | steps | N_EPOCHS | BATCH_SIZE | TF training steps (est.train) | My expected # steps |
|---|---|---|---|---|---|
| 1 | None | 100 | 8000 | 100 | 100 |
| 2 | None | 200 | 8000 | 200 | 200 |
| 3 | None | 300 | 8000 | 300 | 300 |
| 4 | None | 400 | 8000 | 300 | 400 |
| 5 | None | 100 | 1000 | 300 | 800 |
| 6 | None | 600 | 10 | 300 | 600 * 800 |
| 7 | 400 | 400 | 8000 | 300 | 400 |
As can be seen, from the 4th row onwards my expectation does not match the number of steps TensorFlow actually runs. This means that when I lower batch_size to, say, 10, training stops after just 300 size-10 batches, i.e. it never even finishes a single pass over the data, which seems wrong. I'm failing to see what is incorrect in my implementation even after going through the docs, so any help is much appreciated!
Also, it doesn't matter whether I use train_and_evaluate with a TrainSpec/EvalSpec or call train directly; I use train here for simplicity (the spec-based variant is sketched after the log). The log for the train call in the 7th experiment is as follows:
```
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmp/estimator-run-1609770778/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 0.69314593, step = 0
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 100...
INFO:tensorflow:Saving checkpoints for 100 into /tmp/estimator-run-1609770778/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 100...
INFO:tensorflow:global_step/sec: 2.88621
INFO:tensorflow:loss = 0.64561623, step = 99 (34.648 sec)
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 200...
INFO:tensorflow:Saving checkpoints for 200 into /tmp/estimator-run-1609770778/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 200...
INFO:tensorflow:global_step/sec: 2.9199
INFO:tensorflow:loss = 0.6292723, step = 199 (34.248 sec)
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 300...
INFO:tensorflow:Saving checkpoints for 300 into /tmp/estimator-run-1609770778/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 300...
INFO:tensorflow:global_step/sec: 2.83013
INFO:tensorflow:loss = 0.6164282, step = 299 (35.334 sec)
INFO:tensorflow:Loss for final step: 0.6164282.
```
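For completeness, this is roughly the spec-based variant mentioned above, which shows the same 300-step cap (a sketch; X_eval and y_eval stand in for a held-out split not shown here):

```python
train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: make_input_fn(X_train, y_train),
    max_steps=STEPS,
)
eval_spec = tf.estimator.EvalSpec(
    input_fn=lambda: make_input_fn(X_eval, y_eval, n_epochs=1, shuffle=False),
)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```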
I think a clue here is n_trees * max_depth = 50 * 6 = 300 (one of TensorFlow's own tests has a line pointing the same way).
I don't know the exact logic, but I'd guess it grows one layer of one tree with each batch. You set the number of trees and the maximum depth, and training can't continue once all of those trees are fully built, so the global step stops at n_trees * max_depth no matter how much data the input_fn can still supply.
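Under that hypothesis the cap follows directly from the hyperparameters (a sketch of the conjectured formula, not TensorFlow's actual code; scaling by n_batches_per_layer is my assumption):

```python
def conjectured_step_cap(n_trees, max_depth, n_batches_per_layer=1):
    # One tree layer is grown per n_batches_per_layer training steps;
    # training stops once every tree has reached max_depth.
    return n_trees * max_depth * n_batches_per_layer

print(conjectured_step_cap(n_trees=50, max_depth=6))  # 300, matching every run above
```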