LinkedIn has open-sourced a project for scaling and managing deep learning jobs in TensorFlow, using the YARN (Yet Another Resource Negotiator) job scheduling system in Hadoop.
The Tony project came about after LinkedIn tried two existing open source solutions for running scheduled TensorFlow jobs on Hadoop and found both wanting. One, TensorFlowOnSpark, runs TensorFlow via Apache Spark’s job engine, but it couples too tightly with Spark. The other, TensorFlowOnYARN, provided the same basic functionality as Tony, but it is unmaintained and didn’t provide fault tolerance.
Deep learning models in TensorFlow need some form of job management. Training a model can take hours or days, and the training process needs some guarantee that it can run to completion.
Tony uses YARN’s resource and task scheduling system to set up TensorFlow jobs across a Hadoop cluster, according to LinkedIn’s press notes. Tony can also schedule GPU-based TensorFlow jobs through Hadoop, request different kinds of resources (GPUs versus CPUs), allocate memory differently for TensorFlow nodes, and ensure that job outputs are saved periodically to HDFS so that jobs that crash or are interrupted can resume from where they left off.
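The save-periodically-and-resume behavior described above is the standard checkpointing pattern. The following is a minimal sketch of that pattern in plain Python, with a local file standing in for an HDFS path and a counter standing in for real training work; the function and file names are hypothetical, not Tony’s actual API:

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical local stand-in for an HDFS path


def load_step():
    """Resume from the last saved step, or start from zero."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["step"]
    return 0


def save_step(step):
    """Persist progress so an interrupted job can pick up where it left off."""
    with open(CHECKPOINT, "w") as f:
        json.dump({"step": step}, f)


def train(total_steps, save_every=10):
    step = load_step()  # resume point: 0 on a fresh run
    while step < total_steps:
        step += 1  # one unit of (stand-in) training work
        if step % save_every == 0:
            save_step(step)  # periodic checkpoint
    save_step(step)  # final checkpoint
    return step
```

If the process dies mid-run, the next invocation of `train` reloads the last checkpoint and continues rather than restarting from step zero; in a real Tony job the equivalent state would be TensorFlow checkpoint files written to HDFS.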
Tony splits its work among three internal components: a client, an application master, and a task executor. The client accepts incoming TensorFlow jobs; the application master negotiates with YARN’s resource manager to provision the job on YARN; and the task executor is what’s actually launched on the YARN cluster to run the TensorFlow job.
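The three-component flow can be sketched as a toy simulation. The class and job names below are hypothetical illustrations of the roles described above, not Tony’s actual classes:

```python
class ResourceManager:
    """Stand-in for YARN's resource manager, which hands out containers."""

    def allocate(self, num_tasks):
        return [f"container-{i}" for i in range(num_tasks)]


class TaskExecutor:
    """Runs one TensorFlow task inside an allocated container."""

    def __init__(self, container):
        self.container = container

    def run(self, job):
        return f"{job} ran on {self.container}"


class ApplicationMaster:
    """Negotiates with the resource manager, then launches task executors."""

    def __init__(self, resource_manager):
        self.resource_manager = resource_manager

    def launch(self, job, num_tasks):
        containers = self.resource_manager.allocate(num_tasks)
        return [TaskExecutor(c).run(job) for c in containers]


class Client:
    """Accepts an incoming TensorFlow job and hands it to the master."""

    def __init__(self, master):
        self.master = master

    def submit(self, job, num_tasks=2):
        return self.master.launch(job, num_tasks)


client = Client(ApplicationMaster(ResourceManager()))
results = client.submit("mnist-training", num_tasks=2)
```

In the real system each of these roles runs as a separate process on the cluster and the "containers" are YARN containers, but the division of labor is the same: the client only submits, the master only negotiates and launches, and the executors do the actual TensorFlow work.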
LinkedIn claims there is no discernible overhead for TensorFlow jobs when using Tony, because Tony “is in the layer [that] orchestrates distributed TensorFlow and does not interfere with the actual execution of the TensorFlow job.”
Tony also works with the TensorBoard application for visualizing, optimizing, and debugging TensorFlow apps.