Continuous Training on GPU With TensorFlow Slows Down Over Time

I'm running into an issue where my model training slows down dramatically over successive runs.

Here is what happens:

    Epoch 00001: val_loss did not improve from 0.03340
    Run 27 of 40 | Epoch 61 of 100
    (15000, 4410) (15000, 12)
    Train on 12000 samples, validate on 3000 samples
    Epoch 1/1
    12000/12000 [==============================] - 2s 156us/step - loss: 0.0420 - binary_accuracy: 0.9459 - accuracy: 0.9848 - val_loss: 0.0362 - val_binary_accuracy: 0.9501 - val_accuracy: 0.9876
    Epoch 00001: val_loss did not improve from 0.03340
    Run 28 of 40 | Epoch 61 of 100
    (15000, 4410) (15000, 12)
    Train on 12000 samples, validate on 3000 samples
    Epoch 1/1
    12000/12000 [==============================] - 2s 150us/step - loss: 0.0422 - binary_accuracy: 0.9431 - accuracy: 0.9851 - val_loss: 0.0395 - val_binary_accuracy: 0.9418 - val_accuracy: 0.9863
    Epoch 00001: val_loss did not improve from 0.03340
    Run 29 of 40 | Epoch 61 of 100
    (15000, 4410) (15000, 12)
    Train on 12000 samples, validate on 3000 samples
    Epoch 1/1
    12000/12000 [==============================] - 6s 474us/step - loss: 0.0454 - binary_accuracy: 0.9479 - accuracy: 0.9833 - val_loss: 0.0395 - val_binary_accuracy: 0.9475 - val_accuracy: 0.9856
    Epoch 00001: val_loss did not improve from 0.03340
    Run 30 of 40 | Epoch 61 of 100
    (15000, 4410) (15000, 12)
    Train on 12000 samples, validate on 3000 samples
    Epoch 1/1
    12000/12000 [==============================] - 8s 701us/step - loss: 0.0462 - binary_accuracy: 0.9406 - accuracy: 0.9830 - val_loss: 0.0339 - val_binary_accuracy: 0.9502 - val_accuracy: 0.9882
    Epoch 00001: val_loss did not improve from 0.03340
    Run 31 of 40 | Epoch 61 of 100
    (15000, 4410) (15000, 12)
    Train on 12000 samples, validate on 3000 samples
    Epoch 1/1
    12000/12000 [==============================] - 8s 646us/step - loss: 0.0457 - binary_accuracy: 0.9462 - accuracy: 0.9836 - val_loss: 0.0375 - val_binary_accuracy: 0.9417 - val_accuracy: 0.9861
    Epoch 00001: val_loss did not improve from 0.03340
    Run 32 of 40 | Epoch 61 of 100
    (15000, 4410) (15000, 12)
    Train on 12000 samples, validate on 3000 samples
    Epoch 1/1
    12000/12000 [==============================] - 8s 640us/step - loss: 0.0471 - binary_accuracy: 0.9313 - accuracy: 0.9827 - val_loss: 0.0373 - val_binary_accuracy: 0.9446 - val_accuracy: 0.9868
    Epoch 00001: val_loss did not improve from 0.03340
    Run 33 of 40 | Epoch 61 of 100
    (15000, 4410) (15000, 12)
    Train on 12000 samples, validate on 3000 samples
    Epoch 1/1
    12000/12000 [==============================] - 8s 669us/step - loss: 0.0423 - binary_accuracy: 0.9458 - accuracy: 0.9852 - val_loss: 0.0356 - val_binary_accuracy: 0.9510 - val_accuracy: 0.9873
    Epoch 00001: val_loss did not improve from 0.03340
    Run 34 of 40 | Epoch 61 of 100
    (15000, 4410) (15000, 12)
    Train on 12000 samples, validate on 3000 samples
    Epoch 1/1
    12000/12000 [==============================] - 8s 648us/step - loss: 0.0441 - binary_accuracy: 0.9419 - accuracy: 0.9841 - val_loss: 0.0407 - val_binary_accuracy: 0.9357 - val_accuracy: 0.9849
    Epoch 00001: val_loss did not improve from 0.03340
    Run 35 of 40 | Epoch 61 of 100
    (15000, 4410) (15000, 12)
    Train on 12000 samples, validate on 3000 samples
    Epoch 1/1
    12000/12000 [==============================] - 9s 713us/step - loss: 0.0460 - binary_accuracy: 0.9473 - accuracy: 0.9829 - val_loss: 0.0423 - val_binary_accuracy: 0.9604 - val_accuracy: 0.9840
    Epoch 00001: val_loss did not improve from 0.03340
    Run 36 of 40 | Epoch 61 of 100
    (15000, 4410) (15000, 12)
    Train on 12000 samples, validate on 3000 samples
    Epoch 1/1
    12000/12000 [==============================] - 7s 557us/step - loss: 0.0508 - binary_accuracy: 0.9530 - accuracy: 0.9810 - val_loss: 0.0470 - val_binary_accuracy: 0.9323 - val_accuracy: 0.9820
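As the log shows, the per-step time creeps up from roughly 150 us/step to around 700 us/step. The structure of the loop is roughly the sketch below (simplified: the Dense layers and the random placeholder data are stand-ins for my real model and loader; only the shapes, run/epoch counts, and checkpoint callback are taken from the log):

    # Simplified sketch of the training loop. The architecture and data here are
    # placeholders; the shapes, counts, and callback mirror the log output above.
    import numpy as np
    from tensorflow import keras

    NUM_EPOCHS = 100   # outer passes ("Epoch 61 of 100" in the log)
    NUM_RUNS = 40      # data chunks per pass ("Run 27 of 40" in the log)

    model = keras.Sequential([
        keras.layers.Dense(256, activation="relu", input_shape=(4410,)),  # placeholder architecture
        keras.layers.Dense(12, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",                 # stand-in loss for this sketch
                  metrics=["binary_accuracy", "accuracy"])

    checkpoint = keras.callbacks.ModelCheckpoint(
        "best.h5", monitor="val_loss", save_best_only=True, verbose=1)

    def load_chunk(run):
        """Placeholder for the real loader: one (15000, 4410) / (15000, 12) chunk."""
        x = np.random.rand(15000, 4410).astype("float32")
        y = (np.random.rand(15000, 12) > 0.5).astype("float32")
        return x, y

    for epoch in range(NUM_EPOCHS):
        for run in range(NUM_RUNS):
            x, y = load_chunk(run)
            print(f"Run {run + 1} of {NUM_RUNS} | Epoch {epoch + 1} of {NUM_EPOCHS}")
            print(x.shape, y.shape)
            # One Keras epoch per chunk; 12000 train / 3000 validation as in the log.
            model.fit(x, y, epochs=1, validation_split=0.2, callbacks=[checkpoint])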

My GPU usage doesn't decrease (it actually increases):

[screenshot: GPU usage over time]

My CPU usage, CPU clocks, and GPU clocks (core and memory) all remain about the same, and my RAM usage also stays roughly constant.

The only strange part is that my overall power draw drops (the chart is in percent):

[screenshot: overall power draw (percent) over time]

I've read somewhere that this can be caused by the beta_1 parameter of the Adam optimizer, and that setting it to 0.99 should fix the issue, but even with that change the slowdown persists.
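For reference, the beta_1 change amounts to the snippet below (simplified: the loss is a stand-in, the metrics are the ones from the log):

    from tensorflow.keras.optimizers import Adam

    # Raise beta_1 from its default of 0.9 to the suggested 0.99.
    optimizer = Adam(beta_1=0.99)
    model.compile(optimizer=optimizer,
                  loss="binary_crossentropy",   # stand-in loss for this sketch
                  metrics=["binary_accuracy", "accuracy"])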

Is there any other reason why this would be happening? It looks like something on the computation side, as there are no indicators of hardware/driver issues.
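One check I can think of (a sketch only, assuming the TF 1.x-style tf.keras stack that the log format suggests, and not something already in my code) is whether the default graph keeps growing across runs, since ops accumulating inside the loop would make each fit() call slower:

    import tensorflow as tf

    # Diagnostic sketch: print this once per run. If the count keeps growing,
    # new ops are being added to the default graph inside the loop, and each
    # fit() call has more graph to work through than the last.
    num_ops = len(tf.compat.v1.get_default_graph().get_operations())
    print("ops in default graph:", num_ops)

    # If it does grow, clearing the session between runs (and rebuilding or
    # reloading the model afterwards) usually stops the creep.
    tf.keras.backend.clear_session()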


Source: https://stackoverflow.com/questions/66157475/why-does-keras-training-slow-down-after-a-while
