Improve Machine Learning Performance by Dropping the Zeros

Update: August 6, 2023
Improve Machine Learning Performance by Dropping the Zeros

KAUST researchers have found a way to significantly increase the speed of training. Large machine learning models can be trained significantly faster by observing how frequently zero results are produced in distributed machine learning that uses large training datasets.

AI models develop their “intelligence” by being trained on datasets that have been labelled to tell the model how to differentiate between different inputs and then respond accordingly. The more labelled data that goes in, the better the model becomes at performing whatever task it has been assigned to do. For complex deep learning applications, such as self-driving vehicles, this requires enormous input datasets and very long training times, even when using powerful and expensive highly parallel supercomputing platforms.

During training, small learning tasks are assigned to tens or hundreds of computing nodes, which then share their results over a communications network before running the next task. One of the biggest sources of computing overhead in such parallel computing tasks is actually this communication among computing nodes at each model step.

“Communication is a major performance bottleneck in distributed deep learning,” explains the KAUST team. “Along with the fast-paced increase in model size, we also see an increase in the proportion of zero values that are produced during the learning process, which we call sparsity. Our idea was to exploit this sparsity to maximize effective bandwidth usage by sending only non-zero data blocks.”

Building on an earlier KAUST development called SwitchML, which optimized internode communications by running efficient aggregation code on the network switches that process data transfer, Fei, Marco Canini and their colleagues went a step further by identifying zero results and developing a way to drop transmission without interrupting the synchronization of the parallel computing process.

“Exactly how to exploit sparsity to accelerate distributed training is a challenging problem, says the team. “All nodes need to process data blocks at the same location in a time slot, so we have to coordinate the nodes to ensure that only data blocks in the same location are aggregated. To overcome this, we created an aggregator process to coordinate the workers, instructing them on which block to send next.”

The team demonstrated their OmniReduce scheme on a testbed consisting of an array of graphics processing units (GPU) and achieved an eight-fold speed-up for typical deep learning tasks.

ELE Times
+ posts
  • BD Soft Ties up with Data Resolve, Strengthens its Offerings in Cyber Security & Enterprise Intelligence
  • Combined Approach Finds Best Direct Trajectory for Robot Path Generation
  • One Material with Two Functions Could Lead to Faster Memory
  • New Technology Could Bring the Fastest Version of 5G to your Home and Workplace