Another solution is to use the stale parameters. As long as we use a proper staleness parameter, it's able to converge faster thanks to the reduced synchronization cost.
The impact of convergence of SGD seems like race conditions, but maybe those are not so important because machine learning allows certain errors?
This approach seems like it will tolerate a large amount of errors due to each cluster computing on very different set of data. Is that something that we allow for?
@418_touhenying @chuangxuean, this is allowed, it will still most likely to converge even the parameter server applies subgradients on a state model to a fresher model. Actually there are three typical kinds of synchronization protocol, namely bulk synchronization, asynchronize and bounded delay synchronization. You should refer to this paper for more information on a detailed discussion of communications of parameter servers.