I'm facing some issues with the behavior of the training loop when using TensorFlow's Estimator and Dataset APIs together. The code is as follows (TF 2.3):
```python
NUM_EXAMPLES = X_train.shape[0]  # dataset has 8000 elements
BATCH_SIZE = NUM_EXAMPLES
STEPS = None
N_EPOCHS = 100

def make_input_fn(X, y, n_epochs=N_EPOCHS, shuffle=True):
    dataset = tf.data.Dataset.from_tensor_slices((X.to_dict(orient='list'), y))
    if shuffle:
        dataset = dataset.shuffle(NUM_EXAMPLES)
    # n_epochs passes over the data, split into batches of BATCH_SIZE
    return dataset.repeat(n_epochs).batch(BATCH_SIZE)

estimator = tf.estimator.BoostedTreesClassifier(
    feature_columns=feature_cols,
    n_batches_per_layer=1,
    n_trees=50,
    max_depth=6,
    l2_regularization=0.1,
    config=tf.estimator.RunConfig(
        model_dir=model_dir,
        save_checkpoints_steps=100,
    ),
)

estimator.train(input_fn=lambda: make_input_fn(X_train, y_train), steps=STEPS)
```
I just don't understand the behavior I'm seeing, though. The number of steps the estimator trains for seems to be capped at 300, regardless of what I set for batch_size, training steps, or number of epochs.
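One way to confirm the cap is to read the global step back from the trained estimator ('global_step' is the standard TF variable name for this; a quick check, not part of my training code):

```python
# After estimator.train(...) returns, the global step shows where training stopped.
print(estimator.get_variable_value('global_step'))  # prints 300 in every capped run
```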
My dataset has 8K training elements. When I choose n_epochs=100 with batch_size=1000 and steps=None, I expect TensorFlow to run 100 (n_epochs) * 8 (steps per epoch) = 800 steps, but no, it runs 300.
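For reference, this is the arithmetic behind my expected step counts (a minimal sketch; using ceil for a final partial batch is my assumption):

```python
import math

def expected_steps(num_examples, batch_size, n_epochs):
    # repeat(n_epochs) yields n_epochs passes over the data, each split into
    # ceil(num_examples / batch_size) batches; with steps=None, train()
    # should run until this dataset is exhausted.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return n_epochs * steps_per_epoch

print(expected_steps(8000, 1000, 100))  # 800
print(expected_steps(8000, 10, 600))    # 480000 (i.e. 600 * 800)
```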
Below is a summary of multiple experiments with different N_EPOCHS, BATCH_SIZE, and STEPS values. The first three rows look fine to me, but not the rest.
| # | steps | N_EPOCHS | BATCH_SIZE | TF training steps (est.train) | My expected # steps |
|---|---|---|---|---|---|
| 1 | None | 100 | 8000 | 100 | 100 |
| 2 | None | 200 | 8000 | 200 | 200 |
| 3 | None | 300 | 8000 | 300 | 300 |
| 4 | None | 400 | 8000 | 300 | 400 |
| 5 | None | 100 | 1000 | 300 | 800 |
| 6 | None | 600 | 10 | 300 | 600 * 800 |
| 7 | 400 | 400 | 8000 | 300 | 400 |
As can be seen, from the 4th row onwards my expectation does not match the number of steps TensorFlow actually runs. This means that when I lower batch_size to, say, 10, training stops after just 300 size-10 batches, i.e. it never even finishes a single pass over the data, which seems wrong. I'm failing to see what is incorrect in my implementation even after going through the docs, so any help is much appreciated!
Also, it doesn't matter whether I use train_and_evaluate with a TrainSpec/EvalSpec or call train directly; I use train here for simplicity (the spec-based variant is sketched after the log). The log for the train call in the 7th experiment is as follows:
```
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmp/estimator-run-1609770778/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 0.69314593, step = 0
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 100...
INFO:tensorflow:Saving checkpoints for 100 into /tmp/estimator-run-1609770778/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 100...
INFO:tensorflow:global_step/sec: 2.88621
INFO:tensorflow:loss = 0.64561623, step = 99 (34.648 sec)
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 200...
INFO:tensorflow:Saving checkpoints for 200 into /tmp/estimator-run-1609770778/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 200...
INFO:tensorflow:global_step/sec: 2.9199
INFO:tensorflow:loss = 0.6292723, step = 199 (34.248 sec)
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 300...
INFO:tensorflow:Saving checkpoints for 300 into /tmp/estimator-run-1609770778/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 300...
INFO:tensorflow:global_step/sec: 2.83013
INFO:tensorflow:loss = 0.6164282, step = 299 (35.334 sec)
INFO:tensorflow:Loss for final step: 0.6164282.
```
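For completeness, this is roughly the spec-based variant mentioned above, which shows the same 300-step cap (a sketch; X_eval and y_eval stand in for a held-out split not shown here):

```python
train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: make_input_fn(X_train, y_train),
    max_steps=STEPS,
)
eval_spec = tf.estimator.EvalSpec(
    input_fn=lambda: make_input_fn(X_eval, y_eval, n_epochs=1, shuffle=False),
)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```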
I think a clue here is n_trees * max_depth = 50 * 6 = 300 (one of TensorFlow's own tests has a line pointing the same way).
I don't know the exact logic, but I'd guess it grows one layer of one tree with each batch. You set the number of trees and the maximum depth, and training can't continue once all of those trees are fully built, so the global step stops at n_trees * max_depth no matter how much data the input_fn can still supply.
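Under that hypothesis the cap follows directly from the hyperparameters (a sketch of the conjectured formula, not TensorFlow's actual code; scaling by n_batches_per_layer is my assumption):

```python
def conjectured_step_cap(n_trees, max_depth, n_batches_per_layer=1):
    # One tree layer is grown per n_batches_per_layer training steps;
    # training stops once every tree has reached max_depth.
    return n_trees * max_depth * n_batches_per_layer

print(conjectured_step_cap(n_trees=50, max_depth=6))  # 300, matching every run above
```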