As mentioned before, in data parallelism each worker grabs some data from the training set and computes its own gradient. We then need a way to synchronize those gradients before updating the model, so that every worker ends up with the same model parameters.
In synchronous SGD, every worker computes its gradient and waits until all gradients have been computed; the gradients are then aggregated, the model is updated, and the updated model is distributed back to all the workers:

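To make the synchronization step concrete, here is a minimal single-process sketch that simulates synchronous SGD on a toy linear-regression problem. The names (`num_workers`, `shards`, the learning rate, and the dataset) are illustrative assumptions, not taken from the text; in a real distributed setup the gradient averaging would be an all-reduce across machines rather than a Python loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: y = 3x + noise (illustrative, not from the book)
X = rng.normal(size=(128, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=128)

w = np.zeros(1)      # the shared model parameter
lr = 0.1             # learning rate (assumed value)
num_workers = 4      # simulated workers

for step in range(50):
    # Each worker grabs its own shard of the training data...
    shards = np.array_split(rng.permutation(len(X)), num_workers)
    grads = []
    for idx in shards:
        pred = X[idx] @ w
        # ...and computes a local gradient of the mean squared error.
        grads.append(2 * X[idx].T @ (pred - y[idx]) / len(idx))
    # Synchronization barrier: wait for all gradients, then average them
    # (this stands in for the all-reduce of a real distributed system).
    g = np.mean(grads, axis=0)
    # A single update is applied, and the same updated model is
    # "redistributed" to every worker for the next step.
    w -= lr * g

print(w)  # converges toward [3.]
```

Because every worker waits at the averaging step, all of them always start the next iteration from identical parameters, which is exactly the property that distinguishes synchronous SGD from its asynchronous counterpart.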