Continuous Training on GPU With TensorFlow Slows Down Over Time
I'm running into an issue where my model training slows down dramatically over time: the time per step climbs from roughly 150 us/step to around 700 us/step as the runs go on. Here is what happens:
Epoch 00001: val_loss did not improve from 0.03340
Run 27 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 2s 156us/step - loss: 0.0420 - binary_accuracy: 0.9459 - accuracy: 0.9848 - val_loss: 0.0362 - val_binary_accuracy: 0.9501 - val_accuracy: 0.9876
Epoch 00001: val_loss did not improve from 0.03340
Run 28 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 2s 150us/step - loss: 0.0422 - binary_accuracy: 0.9431 - accuracy: 0.9851 - val_loss: 0.0395 - val_binary_accuracy: 0.9418 - val_accuracy: 0.9863
Epoch 00001: val_loss did not improve from 0.03340
Run 29 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 6s 474us/step - loss: 0.0454 - binary_accuracy: 0.9479 - accuracy: 0.9833 - val_loss: 0.0395 - val_binary_accuracy: 0.9475 - val_accuracy: 0.9856
Epoch 00001: val_loss did not improve from 0.03340
Run 30 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 701us/step - loss: 0.0462 - binary_accuracy: 0.9406 - accuracy: 0.9830 - val_loss: 0.0339 - val_binary_accuracy: 0.9502 - val_accuracy: 0.9882
Epoch 00001: val_loss did not improve from 0.03340
Run 31 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 646us/step - loss: 0.0457 - binary_accuracy: 0.9462 - accuracy: 0.9836 - val_loss: 0.0375 - val_binary_accuracy: 0.9417 - val_accuracy: 0.9861
Epoch 00001: val_loss did not improve from 0.03340
Run 32 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 640us/step - loss: 0.0471 - binary_accuracy: 0.9313 - accuracy: 0.9827 - val_loss: 0.0373 - val_binary_accuracy: 0.9446 - val_accuracy: 0.9868
Epoch 00001: val_loss did not improve from 0.03340
Run 33 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 669us/step - loss: 0.0423 - binary_accuracy: 0.9458 - accuracy: 0.9852 - val_loss: 0.0356 - val_binary_accuracy: 0.9510 - val_accuracy: 0.9873
Epoch 00001: val_loss did not improve from 0.03340
Run 34 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 648us/step - loss: 0.0441 - binary_accuracy: 0.9419 - accuracy: 0.9841 - val_loss: 0.0407 - val_binary_accuracy: 0.9357 - val_accuracy: 0.9849
Epoch 00001: val_loss did not improve from 0.03340
Run 35 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 9s 713us/step - loss: 0.0460 - binary_accuracy: 0.9473 - accuracy: 0.9829 - val_loss: 0.0423 - val_binary_accuracy: 0.9604 - val_accuracy: 0.9840
Epoch 00001: val_loss did not improve from 0.03340
Run 36 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 7s 557us/step - loss: 0.0508 - binary_accuracy: 0.9530 - accuracy: 0.9810 - val_loss: 0.0470 - val_binary_accuracy: 0.9323 - val_accuracy: 0.9820
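To make the setup concrete, here is a simplified sketch of the loop that produces this log. It is not my full script: load_chunk is a placeholder for my real data loading, the checkpoint path is illustrative, and the model is built and compiled once before the loop. The structure (40 runs per epoch, one model.fit call with epochs=1 per run, a ModelCheckpoint monitoring val_loss) matches what the log prints.

from tensorflow.keras.callbacks import ModelCheckpoint

# Save the best weights seen so far; "val_loss did not improve" in the
# log comes from this callback.
checkpoint = ModelCheckpoint('weights.best.hdf5', monitor='val_loss',
                             save_best_only=True, verbose=1)

for epoch in range(100):                      # "Epoch 61 of 100" in the log
    for run in range(40):                     # "Run 27 of 40" in the log
        print('Run %d of 40 | Epoch %d of 100' % (run + 1, epoch + 1))
        x, y = load_chunk(run)                # placeholder; shapes (15000, 4410), (15000, 12)
        print(x.shape, y.shape)
        model.fit(x, y,
                  epochs=1,                   # one epoch per call ("Epoch 1/1")
                  validation_split=0.2,       # 12000 train / 3000 validation
                  callbacks=[checkpoint])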
My GPU usage doesn't decrease while this happens; it actually increases.
My CPU usage and clocks, and my GPU clocks (core and memory), all stay roughly the same. My RAM usage also stays roughly the same.
The only strange part is that my overall power draw drops (the monitoring graph shows it in percent).
I've read that this can be caused by the beta_1 parameter of the Adam optimizer, and that setting it to 0.99 should fix it, but the issue persists even after that change.
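Concretely, the change amounts to something like this at compile time (a minimal sketch; the loss and metric names are chosen to match what the training log reports, and may differ slightly from my actual compile call):

from tensorflow.keras.optimizers import Adam

# The suggested tweak: raise beta_1 from its default of 0.9 to 0.99
# and recompile the same model used in the loop above.
model.compile(optimizer=Adam(beta_1=0.99),
              loss='binary_crossentropy',
              metrics=['binary_accuracy', 'accuracy'])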
Is there any other reason why this would be happening? It looks like something on the computation side, as there are no indicators of hardware/driver issues.
Source: https://stackoverflow.com/questions/66157475/why-does-keras-training-slow-down-after-a-while