In this section, we will explore the mechanisms through which computation in TensorFlow can be distributed. The first step in running distributed TensorFlow is to specify the architecture of the cluster using tf.train.ClusterSpec:
import tensorflow as tf
cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223",
                                           "localhost:2224"]})
Nodes are typically divided into two jobs: parameter servers (ps), which host variables, and workers, which perform the heavy computation. In the preceding code, we define one parameter server and two workers, giving the hostname and port of each node.
Next, we have to build a tf.train.Server for each of the parameter servers and workers defined previously:
ps = tf.train.Server(cluster, job_name="ps", task_index=0)
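To round out the picture, here is a minimal sketch of the remaining steps, assuming TensorFlow 1.x and, purely for illustration, all three servers started in a single process; in a real deployment each task would run its own tf.train.Server in a separate process:
# Create one server per worker task (each would normally live in its own process).
worker0 = tf.train.Server(cluster, job_name="worker", task_index=0)
worker1 = tf.train.Server(cluster, job_name="worker", task_index=1)

# Pin a variable to the parameter server and an op to a worker.
with tf.device("/job:ps/task:0"):
    w = tf.Variable(2.0, name="w")

with tf.device("/job:worker/task:0"):
    y = w * 3.0

# Evaluate the graph through one of the worker servers.
with tf.Session(worker0.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y))  # 6.0
The device strings "/job:ps/task:0" and "/job:worker/task:0" refer back to the jobs and task indices declared in the ClusterSpec, which is how variables end up hosted on the parameter server while computation runs on the workers.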