IEEE International Conference on Computer Communications
20–23 May 2024 // Vancouver, Canada


Sponsor’s Event: Workshop on Network Optimization for Large AI Models

Monday, May 20, 2024 ● 14:00 – 18:00 ● Room: Georgia B

 


14:00–16:00

Invited Talks/Keynotes

 

Talk 1: Network Bursts and Bottlenecks: Challenges of Distributed DNN Training Traffic

Speaker: Jörg Liebeherr, Professor, Department of Electrical and Computer Engineering, University of Toronto

Abstract:

Network traffic from distributed training of deep neural network (DNN) models has unique and peculiar properties. On the one hand, since the overall traffic pattern is repeated in each round of training, the traffic is highly predictable. On the other hand, the transmissions of large tensors create a high degree of traffic burstiness, which may result in microbursts and congestion events. This presentation uses traffic measurements of distributed training of a DNN to characterize the network traffic and analyze its properties. We show that synchronization barriers and application orchestration can be effective at preventing network congestion even with a large number of worker nodes.
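For rough intuition about the burstiness described above (a back-of-the-envelope sketch with assumed numbers, not data from the talk), the snippet below estimates the per-round traffic of data-parallel training with a ring all-reduce: the same burst recurs every round, yet within a round the link swings between idle and fully loaded.

```python
# Back-of-the-envelope sketch (illustrative assumptions, not measurements from
# the talk): per-round traffic of data-parallel training with a ring all-reduce,
# showing why the pattern repeats every iteration yet arrives in large bursts.

model_params = 1.5e9      # assumed model size (parameters)
bytes_per_param = 2       # fp16 gradients
workers = 64              # assumed number of worker nodes
link_gbps = 400           # assumed per-node network bandwidth (Gbit/s)
compute_ms = 300          # assumed compute time per iteration (ms)

grad_bytes = model_params * bytes_per_param
# Ring all-reduce: each worker sends (and receives) about 2*(N-1)/N of the gradient size.
per_worker_bytes = 2 * (workers - 1) / workers * grad_bytes
burst_ms = per_worker_bytes * 8 / (link_gbps * 1e9) * 1e3

print(f"per-worker traffic per round: {per_worker_bytes / 1e9:.2f} GB")
print(f"burst duration at line rate:  {burst_ms:.0f} ms (vs ~{compute_ms} ms of compute)")
# The burst is predictable round to round, but bursty within each round.
```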

Talk 2: Communication-Efficient Online Distributed Optimization for Federated Learning

Speaker: Ben Liang, Professor, Department of Electrical and Computer Engineering, University of Toronto

Abstract:

We consider federated learning where multiple devices collaboratively train a global model with the assistance of a coordinating server. This is naturally captured by an online distributed optimization problem, where the sequence of objective functions varies unpredictably over time. Since practical machine learning often involves large datasets and high-dimensional parameters, the required information exchange between the devices and the server, repeated over many steps of the online optimization algorithm, can impose overwhelming strain on the communication capacity. In this talk, we present several recent case studies on communication-efficient online distributed optimization for federated learning. We first discuss a joint computation-communication optimization framework that encourages temporal similarity in a device’s local-model sequences to reduce the communication overhead. We then consider dynamic resource allocation among all devices to alleviate the impact of stragglers on learning latency. We finally study over-the-air computation where the devices adaptively update their local-model sequences considering the learning accuracy, the time-varying channel states, and the transmit power budget.
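As background for this setting (a generic sketch, not the specific algorithms presented in the talk), the snippet below shows one simple way a device can exploit temporal similarity in its local-model sequence: it uploads only when the local model has changed appreciably since its last transmission. The toy objective, step size, and threshold are assumptions.

```python
# Generic sketch (not the algorithms from the talk): online federated updates in
# which a device skips an upload when its local model has barely changed since
# its last transmission, trading a little accuracy for less communication.
import numpy as np

dim, devices, rounds = 100, 8, 50
threshold = 0.05                      # assumed relative-change threshold
global_model = np.zeros(dim)
last_sent = [np.zeros(dim) for _ in range(devices)]
uploads = 0

def local_gradient(model, t, k):
    # stand-in for the gradient of a time-varying local objective on device k
    target = np.sin(0.1 * t + k) * np.ones(dim)
    return model - target

for t in range(rounds):
    received = []
    for k in range(devices):
        local = global_model - 0.1 * local_gradient(global_model, t, k)
        change = np.linalg.norm(local - last_sent[k])
        if change > threshold * (np.linalg.norm(last_sent[k]) + 1e-9):
            received.append(local)         # upload only when the change is significant
            last_sent[k] = local
            uploads += 1
        else:
            received.append(last_sent[k])  # server reuses the device's stale copy
    global_model = np.mean(received, axis=0)

print(f"uploads: {uploads} of {devices * rounds} possible "
      f"({100 * uploads / (devices * rounds):.0f}%)")
```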

Talk 3: Optimizing Network Communications of Large-Scale AI Workloads Using Datacenter Multicast

Speaker: Mohamed Hefeeda, Professor, School of Computing Science, Simon Fraser University

Abstract:

Machine learning models are continually getting bigger. For example, recent large language models (LLMs), e.g., GPT-4 and Gemini, were reported to have more than a trillion parameters each. Training such models involves iteratively exchanging massive amounts of data among thousands of computing nodes. In many cases, especially for large clusters, the network becomes the bottleneck, slowing down the entire training process and wasting precious GPU resources. In this talk, we will discuss the communication requirements of training large-scale machine learning models and how we can potentially realize substantial savings in network and processing resources using multicast. Current datacenter multicast systems, however, do not scale, as they impose considerable state and communication overheads. We will then present a new multicast system, called Orca, that addresses the challenges of multicast in datacenter networks. Orca divides the state and tasks of the data plane among switches and servers, and it partially offloads the management of multicast sessions to servers. It significantly reduces the state at switches, minimizes the bandwidth overhead, incurs small and constant processing overhead, and does not limit the size of multicast sessions. Through implementation in a testbed and large-scale simulations, we show that Orca substantially outperforms the closest work in the literature. For example, Orca reduces the switch state by up to two orders of magnitude and the communication overhead by up to 19X compared to the state of the art. We also show that Orca can considerably reduce the communication time of training tasks by optimizing the data transfer among computing nodes.
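For a sense of scale (illustrative arithmetic only, not Orca's design), the snippet below compares the traffic a sender must inject to deliver one tensor to N receivers with unicast versus ideal in-network multicast; the tensor size and receiver count are assumptions.

```python
# Illustrative arithmetic only (not Orca itself): traffic injected at the sender
# to broadcast the same tensor to N receivers with unicast vs. ideal multicast.
# Parameter values are assumptions.

tensor_gb = 2.0        # assumed tensor size to broadcast (GB)
receivers = 1024       # assumed number of receiving nodes

unicast_gb = tensor_gb * receivers   # sender emits one copy per receiver
multicast_gb = tensor_gb             # switches replicate packets in the fabric
                                     # (ideal case; real systems add some overhead)

print(f"unicast  : {unicast_gb:.0f} GB injected at the sender")
print(f"multicast: {multicast_gb:.0f} GB injected at the sender "
      f"(~{unicast_gb / multicast_gb:.0f}x less)")
```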

 

Talk 4: Computer to Data Centers: Data and Computation Placement for Learning-Centric Applications

Speaker: Jianping Pan, Professor, Department of Computer Science, University of Victoria

Abstract:

Learning-centric applications (LCAs), fueled by data mining, machine learning, and artificial intelligence (AI) and their applications in various domains including large language models, have imposed new challenges on the data centers where massive training and large-scale inference are now carried out. With the emergence of AI clusters, the traditional network protocols adopted in most current data centers cannot fully leverage the greatly improved computation, storage, and acceleration capabilities to meet the needs of many demanding LCAs. In this talk, we will examine this problem and offer some insights from the viewpoint of network topology and data and computation placement, with the hope of attracting attention from, and fostering collaboration among, researchers and practitioners in computer architecture, operating systems, computer networks, high-performance computing, and LCA operation.

 

16:00–16:20 

Coffee Break  

 

16:20–18:00 

Panel: Network Optimization for Large-Scale AI Clusters

 

With the rapid growth in the size of modern AI models, large-scale distributed clusters, on the order of 1K, 10K, or even 100K accelerator cards, have been widely deployed to meet memory and computation requirements. Communication time increases as the cluster grows, and the network can become a bottleneck, resulting in sub-linear scaling of distributed training. Designing high-performance network systems to optimize communication in AI clusters is therefore both critical and challenging. Such optimization includes, but is not limited to, efficient network topologies, routing algorithms, traffic engineering, communication protocols, collective scheduling, and fast fault discovery/recovery. This panel will discuss insights, challenges, and opportunities in network optimization of large-scale clusters for distributed AI training and inference.
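As a rough illustration of the sub-linear scaling mentioned above (assumed numbers, not measurements), the sketch below models per-round time as fixed per-card compute plus a ring all-reduce whose latency term grows with the number of cards.

```python
# Simple scaling model with assumed numbers (not measurements): with data-parallel
# training and a ring all-reduce, the per-round communication time grows with the
# number of cards, so throughput scales sub-linearly. Real clusters use tree or
# hierarchical collectives that mitigate, but do not remove, this effect.

compute_s = 0.30          # assumed per-card compute time per round (fixed per-card batch)
grad_gb = 3.0             # assumed gradient volume per card (GB)
link_gbps = 400           # assumed per-card network bandwidth (Gbit/s)
step_latency_s = 20e-6    # assumed latency of one all-reduce step

for n in (64, 1024, 16384, 131072):
    bw_term = 2 * (n - 1) / n * grad_gb * 8 / link_gbps   # bandwidth-bound part
    lat_term = 2 * (n - 1) * step_latency_s               # latency part grows with n
    comm_s = bw_term + lat_term
    efficiency = compute_s / (compute_s + comm_s)
    print(f"{n:>6} cards: comm {comm_s*1e3:7.1f} ms/round, scaling efficiency {efficiency:5.1%}")
```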

 

18:00–18:30

Networking Break
