Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training (2025)

∗ Indicates equal contribution. † Indicates corresponding author.

Ao Sun1,∗  Weilin Zhao2,∗  Xu Han2,†  Cheng Yang1,†  Xinrong Zhang2  Zhiyuan Liu2  Chuan Shi1  Maosong Sun2
1 Beijing University of Posts and Telecommunications, Beijing, China. 
2 NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China. 
{maydomine, yangcheng}@bupt.edu.cn
{zwl23,zxr19}@mails.tsinghua.edu.cn
hanxu2022@tsinghua.edu.cn

Abstract

The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As LLMs' training sequence lengths extend to 32k or even 128k, current pipeline parallel methods face severe bottlenecks, including high memory footprints and substantial pipeline bubbles, greatly hindering model scalability and training throughput. To enhance memory efficiency and training throughput, in this work we introduce an efficient sequence-level one-forward-one-backward (1F1B) pipeline scheduling method tailored for training LLMs on long sequences, named Seq1F1B. Seq1F1B decomposes batch-level schedulable units into finer sequence-level units, reducing bubble size and memory footprint. Considering that Seq1F1B may produce slight extra bubbles if sequences are split evenly, we design a computation-wise strategy to partition input sequences and mitigate this side effect. Compared to competitive pipeline baselines such as Megatron 1F1B pipeline parallelism, our method achieves higher training throughput with a smaller memory footprint. Notably, Seq1F1B efficiently trains an LLM with 30B parameters on sequences up to 64k tokens using 64 NVIDIA A100 GPUs without recomputation strategies, a feat unachievable with existing methods. Our source code is based on Megatron-LM and is available at: https://github.com/MayDomine/Seq1F1B.git.

1 Introduction

In recent years, there has been a growing interest in large language models (LLMs), which have revolutionized various tasks in natural language processing Touvron et al. [2023], Reid et al. [2024], Jiang et al. [2024], Anil et al. [2023]. Efficient distributed training strategies Korthikanti et al. [2023], Rasley et al. [2020], Rajbhandari et al. [2020], Shoeybi et al. [2019], Narayanan et al. [2021a] play a crucial role in training these large models. Among these strategies, pipeline parallelism Shoeybi et al. [2019], Huang et al. [2019], Yang et al. [2021], Qi et al. [2024], Li et al. [2021] stands out due to its low communication bandwidth requirements and high scalability when integrated with other distributed training strategies.

Fundamentally, pipeline parallelism involves partitioning the model into multiple stages, with each computing device responsible for processing a stage consisting of consecutive layers. This setup inherently leads to "bubbles", the idle time caused by the dependencies between the computation of sharded layers. Several scheduling strategies, such as GPipe Huang et al. [2019], have been proposed to address this problem, significantly reducing pipeline bubbles by splitting each mini-batch of training examples into several micro-batches. These strategies come at the expense of increased memory usage, as each stage must store all hidden states of the micro-batches generated during the forward passes until the backward passes are completed. Based on GPipe, TeraPipe Li et al. [2021] further splits each micro-batch along the sequence dimension into micro-sequences to further reduce pipeline bubbles, but it still suffers from the high memory demand for storing the hidden states of micro-batches.

To address the issue of high memory demand, one-forward-one-backward (1F1B) scheduling strategies have been proposed Harlap et al. [2018], Fan et al. [2021], Narayanan et al. [2021a], which rewrite GPipe to schedule backward passes in advance and give them higher execution priority than forward passes, without affecting the final results. By adopting 1F1B strategies, the memory demand for storing hidden states can be significantly reduced without adding extra pipeline bubbles. Other methods, such as the zero-bubble pipeline Qi et al. [2024] and 1F1B-I (1F1B with interleaved stages) Narayanan et al. [2021a], seek to further reduce the bubbles of 1F1B strategies, but at the cost of more memory overhead and communication. Generally, optimizing pipeline parallelism remains a trade-off between bubble ratio and memory overhead.

On the other hand, some recent efforts Buckman and Gelada, Reid et al. [2024] have noticed that long-sequence training benefits LLMs in many aspects, leading to increasingly longer training contexts for LLMs. However, LLMs cannot simply be trained on longer sequences due to the quadratic time and memory complexity of Transformer attention modules in terms of sequence length Vaswani et al. [2017]. Several efforts Dao et al. [2022], Ding et al. [2023], Jacobs et al. [2023] to build efficient attention modules have been proposed to address this issue. Even so, the challenge caused by long sequences extends beyond attention modules. In distributed training scenarios, long sequences may cause various parallel methods to fail. Pipeline parallel methods such as GPipe Huang et al. [2019] and 1F1B Harlap et al. [2018], Fan et al. [2021], Narayanan et al. [2021a], Qi et al. [2024] can only use micro-batches as the minimal scheduling units. In extreme cases, a single micro-batch consisting of long sequences can lead to memory overflow. Long sequences thus make the high memory demand of pipeline parallelism even more serious.

A straightforward approach to solving this problem is to split micro-batches along the sequence dimension. However, such a simple modification is challenging for existing 1F1B scheduling strategies such as the zero-bubble pipeline Qi et al. [2024] and 1F1B-I Narayanan et al. [2021a], because there are computation dependencies between forward and backward passes across micro-sequences, making a direct split along the sequence dimension infeasible.

To address this challenge, we introduce Seq1F1B, an efficient sequence-level 1F1B schedule with higher efficiency and lower memory demands than traditional 1F1B methods. In detail, we introduce a partially ordered scheduling queue in Seq1F1B to replace the first-in-first-out (FIFO) scheduling queue in 1F1B, rewriting the scheduling strategy to preserve the exact forward and backward semantics while providing synchronous pipeline parallelism. To further improve Seq1F1B, we propose a strategy for balancing the workload across sub-sequence computations. More specifically, we balance sub-sequence computation with a solution based on the floating-point operations of each sub-sequence. This design addresses the imbalance of computational workload caused by the attention mechanism by splitting sequences according to computational workload rather than simply dividing them evenly along the sequence dimension.

Extensive experiments demonstrate that Seq1F1B significantly outperforms both the 1F1B and 1F1B-I scheduling strategies in terms of memory efficiency and training throughput for training LLMs, with sequence lengths ranging from 16k to 128k and model sizes ranging from 2.7B to 30B. The experimental results also show that the efficiency of Seq1F1B becomes more pronounced as the sequence length increases, and that Seq1F1B supports efficiently training a GPT model with 30B parameters on sequences up to 64k tokens using 64 NVIDIA A100 GPUs without any recomputation strategies, which is unachievable with existing pipeline parallel methods.

2 Related Work

Training large language models requires a mixture of parallel strategies, the most important of which are data parallelism, tensor parallelism, and pipeline parallelism. Data parallelism scales training by distributing data across multiple devices Goyal et al. [2017], Zinkevich et al. [2010], Li et al. [2020], Xing et al. [2015], with each device hosting a model replica and synchronizing gradients. The zero redundancy optimizer (ZeRO) Rajbhandari et al. [2020], Rasley et al. [2020], Ren et al. [2021] enhances data parallelism's memory efficiency by partitioning model parameters across devices at the cost of significant communication. Tensor parallelism Shoeybi et al. [2019], Korthikanti et al. [2023] parallelizes computation by partitioning matrix multiplications. In this way, tensor parallelism effectively enhances computation efficiency but introduces high communication costs for aggregating the results of matrix multiplications, making it commonly used among the workers of a single node. Since this paper focuses on improving pipeline parallelism, we describe pipeline parallelism in more detail next.

For pipeline parallelism, schedules can be broadly categorized into two main types: synchronous and asynchronous. Asynchronous schedules such as asynchronous PipeDream Harlap et al. [2018] and PipeMare Yang et al. [2021] can achieve bubble-free pipelines but suffer from degraded performance of the final trained models because they use outdated parameters to compute gradient updates. In view of this, our work focuses on synchronous pipeline schedules, as they ensure consistent semantics across different model parallel strategies. GPipe Huang et al. [2019], Li et al. [2021] and 1F1B Fan et al. [2021], Narayanan et al. [2021a, b] are the most commonly used pipeline schedules under synchronous settings, and many other works are built upon these two foundation schedules.

The original GPipe Huang et al. [2019] simply divides a mini-batch into several micro-batches. The scheduling process of GPipe has only two phases: the forward and the backward phases. Only after all forward passes for the micro-batches within a batch are complete will the backward passes be executed. During the forward phase, the hidden states of each micro-batch are enqueued into a first-in-first-out (FIFO) queue Q. During the backward phase, these hidden states are dequeued for their corresponding backward passes. Since the backward phase happens after all hidden states are queued, GPipe exhibits O(M) memory consumption, where M represents the number of micro-batches.

Based on GPipe, other methods such as TeraPipe Li et al. [2021] and Chimera Li and Hoefler [2021] further optimize the bubble ratio of GPipe through different techniques. TeraPipe relies on an observation about causal language modeling: the computation of a given input token only depends on its previous tokens. Specifically, TeraPipe divides GPipe's micro-batch into multiple token spans and replaces the FIFO queue with a last-in-first-out (LIFO) queue to ensure the correct computation of gradients during attention backward passes. By leveraging finer scheduling units, TeraPipe effectively reduces the bubble ratio while being more memory-efficient than GPipe. Chimera adopts bidirectional pipeline parallelism, where each computing device is responsible for the workload of multiple stages. While this approach reduces the bubble ratio, each device has to store redundant parameters (as stages are not evenly distributed across devices), leading to increased memory usage.

Different from GPipe, which performs backward passes after completing all forward passes, 1F1B Narayanan et al. [2021a], Fan et al. [2021] alternates between forward and backward passes (adopting a one-forward-one-backward pattern) to keep the number of hidden states in the FIFO queue Q constant. Regardless of the number of micro-batches, 1F1B thus mitigates excessive memory usage. Based on 1F1B, the 1F1B-I Narayanan et al. [2021a] schedule enlarges the number of pipeline stages and assigns each device multiple stages. By interleaving stages among devices, 1F1B-I reduces the bubble ratio at the cost of adding more communication operators and slightly increasing memory consumption. The zero-bubble pipeline Qi et al. [2024] splits each backward pass into separate computations of input gradients and weight gradients, and achieves higher pipeline efficiency by delaying weight gradient computation and using dynamic programming to optimize the schedule. This approach nearly achieves zero-bubble pipeline efficiency but incurs a larger memory footprint due to the delayed weight gradients.

3 Methodology

In this section, we detail how our method works, beginning with a preliminary overview of the characteristics of the 1F1B schedule and of language modeling. We then show why it is feasible to schedule 1F1B micro-batches at the sequence level, explain how Seq1F1B works in detail, and describe how it preserves the exact semantics of the original language modeling objective.

Building on this, we discuss how different sequence-splitting strategies impact the scheduling order in pipeline parallelism, and we construct an optimal partition based on the theoretical computational load to address the load-balancing issues associated with sequence splitting, thereby enhancing the efficiency of our method.

3.1 Preliminary

1F1B includes three phases during one iteration: the warm-up, steady, and cooling-down phases. Assume a 1F1B scheduling scenario with P workers, each responsible for one pipeline stage, such that the pipeline parallel size is P and the number of micro-batches is M. Each worker, denoted as i, executes forward passes during the warm-up phase. The number of warm-up micro-batches for each worker is determined by Eq. 1.

w_i = \begin{cases} P - i - 1 & \text{if } M > P \\ M & \text{if } M \leq P \end{cases} \qquad (1)

When w_i equals M, 1F1B degrades to the behavior of GPipe. Otherwise, during the warm-up phase, a worker responsible for an earlier stage performs one more forward pass than the worker responsible for the subsequent stage. Each forward pass produces a hidden state that is enqueued in a FIFO queue Q to be used later for gradient computation during the backward pass. In the steady phase, each worker performs one forward pass and enqueues the resulting hidden state into Q. Following each forward pass, a hidden state is dequeued from Q and immediately used for a backward pass to compute gradients, which is where the "one-forward-one-backward" (1F1B) name comes from. Note that the bubble ratio is minimal during the steady phase, and the number of one-forward-one-backward passes in this phase is M - w_i. Thus, as M increases, the proportion of the steady phase increases, which reduces the bubble ratio. After the steady phase, the 1F1B schedule enters the cooling-down phase, which is symmetric to the warm-up phase and involves performing the same number of backward passes as in the warm-up.

The primary optimization of 1F1B is to ensure that the memory consumption of the hidden states is independent of M. The peak memory consumption for the hidden states is determined by the number of items in the queue Q at the end of the warm-up phase, where each worker holds w_i hidden states. Assuming the total memory consumption of all hidden states is A, the peak memory consumption of worker i is w_i · A/P. During the steady and cooling-down phases, this consumption does not increase.
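To make Eq. 1 and the resulting peak-memory bound concrete, the short sketch below computes the warm-up count and the per-worker peak hidden-state memory for a 1F1B schedule. It is a minimal illustration of the formulas above, not the Megatron-LM implementation; the function names and the zero-based stage index are our own assumptions.

```python
def warmup_microbatches(i: int, P: int, M: int) -> int:
    """Number of warm-up forward passes for worker i (Eq. 1).

    i: pipeline stage index (assumed zero-based), P: pipeline size,
    M: number of micro-batches.
    """
    return P - i - 1 if M > P else M


def peak_hidden_memory(i: int, P: int, M: int, A: float) -> float:
    """Peak hidden-state memory of worker i, i.e. w_i * A / P, where A is the
    total hidden-state memory of one micro-batch across all P stages."""
    return warmup_microbatches(i, P, M) * A / P


if __name__ == "__main__":
    P, M = 8, 32
    for i in range(P):
        print(f"stage {i}: warm-up = {warmup_microbatches(i, P, M)}, "
              f"peak memory fraction = {peak_hidden_memory(i, P, M, A=1.0):.3f}")
```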

[Figure 1: Timeline comparison of the 1F1B schedule and the Seq1F1B schedule.]
[Figure 2: Timeline comparison of the 1F1B-I schedule (top) and the Seq1F1B-I schedule (bottom).]

Language modeling is the most common unsupervised objective for training language models. Under this objective, each token is predicted sequentially while conditioned on the preceding tokens, embodying the principle of sequential generation, as formulated in Eq. 2.

P(\mathbf{x}) = \prod_{t=1}^{T} P(x_t \mid x_1, x_2, \ldots, x_{t-1}) \qquad (2)

In the context of language modeling with Transformers, the unidirectional attention mechanism ensures that each token in a sequence can only attend to itself and its preceding tokens.

Given a sequence of tokens x_0, x_1, \ldots, x_n, the output of the attention mechanism for each token can be computed as follows. Each token t_i is associated with a query vector q_i, a key vector k_i, and a value vector v_i, which are used for the attention computation. The output for each token t_i, denoted as O_i, is computed by attending over all previous tokens up to t_i, as formulated in Eq. 3.

O_i = \text{softmax}\left(\frac{q_i \cdot [k_0, \ldots, k_i]^{T}}{\sqrt{d_k}}\right)[v_0, \ldots, v_i] \qquad (3)

Based on these characteristics, it becomes clear that to partition Transformer computation along the sequence dimension, the attention mechanism must retain the key and value vectors of all preceding tokens. The forward and backward passes must also follow a specific order: the forward computation of each token can only start after its predecessors' computations are complete, while the backward computation of each token requires the gradients from subsequent tokens to be available.
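As a concrete illustration of Eq. 3, and of why a sequence-level partition must keep the keys and values of earlier segments around, the sketch below computes causal attention for one sequence segment given the cached keys and values of all preceding tokens. This is a plain NumPy sketch of the math, not the attention kernel used in the paper; names such as `segment_attention` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def segment_attention(q_seg, k_seg, v_seg, k_cache, v_cache):
    """Causal attention for one sequence segment (Eq. 3).

    q_seg, k_seg, v_seg: (s, d) projections of the current segment.
    k_cache, v_cache:    (t, d) keys/values of all preceding tokens.
    Returns the (s, d) attention outputs for the segment.
    """
    k_all = np.concatenate([k_cache, k_seg], axis=0)   # (t + s, d)
    v_all = np.concatenate([v_cache, v_seg], axis=0)
    t, s, d = k_cache.shape[0], q_seg.shape[0], q_seg.shape[1]
    scores = q_seg @ k_all.T / np.sqrt(d)              # (s, t + s)
    # Causal mask within the segment: token i attends to the cache plus
    # segment tokens at positions <= i.
    mask = np.arange(t + s)[None, :] > (t + np.arange(s))[:, None]
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v_all


# Example: the second segment of a sequence attends to the first segment's KV cache.
d = 16
k_cache, v_cache = np.random.randn(8, d), np.random.randn(8, d)
q, k, v = (np.random.randn(4, d) for _ in range(3))
out = segment_attention(q, k, v, k_cache, v_cache)     # shape (4, 16)
```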

3.2 Seq1F1B

As Figure 1 illustrates, the original 1F1B schedule cannot accommodate splitting micro-batches along the sequence dimension, because the last stage needs to immediately execute a backward pass after forwarding a micro-batch. A straightforward adaptation is to divide each original 1F1B micro-batch into k segments and then execute a kFkB pipeline Li et al. [2021]. Although this schedule can reduce some of the bubbles in 1F1B, it does not save memory.

To achieve a more efficient sequence-level 1F1B pipeline schedule, we propose Seq1F1B, a handcrafted 1F1B schedule for sequence-level inputs. Specifically, Seq1F1B partitions the model into consecutive sets of layers and assigns each worker the corresponding set (a.k.a. pipeline stage). Seq1F1B then initializes the schedule. Similar to 1F1B, the schedule is divided into three phases: warm-up, steady, and cooling-down.

w_i = \begin{cases} P - i - 2 + k & \text{if } M > P \\ M & \text{if } M \leq P \end{cases} \qquad (4)

During the warm-up phase, the number of warm-up micro-batches of each worker i is calculated according to Eq. 4, in which k represents the number of splits in the sequence. This equation ensures that the last stage can perform a backward pass on the last sequence segment of the first micro-batch when entering the steady phase, and that the worker responsible for each stage performs one more forward pass than the worker responsible for the subsequent stage. Here, we construct a partially ordered queue Q_s, where each pop returns the tail segment of the earliest micro-batch that has been enqueued. This satisfies the first-in-first-out principle in the batch dimension and the first-in-last-out principle in the sequence dimension. In each iteration of the warm-up phase, workers execute one forward pass and enqueue the corresponding hidden states onto Q_s.

In the steady phase, after each worker completes a forward pass, it dequeues from Q_s and performs a backward pass on the dequeued hidden states, following the standard 1F1B process, except that the unit of each forward and backward pass is now a sequence segment.

In the cooling-down phase, workers dequeue the remaining warm-up hidden states from Q_s and perform backward passes sequentially.
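A minimal sketch of the partially ordered queue Q_s described above: hidden states are pushed per (micro-batch, segment) pair, and pops are FIFO over the micro-batch dimension and LIFO over the sequence dimension, so each pop returns the last-enqueued segment of the earliest outstanding micro-batch. This is our own illustration of the data structure, not code from the released implementation.

```python
from collections import OrderedDict


class PartiallyOrderedQueue:
    """FIFO over micro-batches, LIFO over sequence segments (Q_s)."""

    def __init__(self):
        # micro-batch id -> stack of (segment id, hidden state)
        self._batches = OrderedDict()

    def push(self, batch_id, segment_id, hidden_state):
        self._batches.setdefault(batch_id, []).append((segment_id, hidden_state))

    def pop(self):
        # Earliest micro-batch that still has stored segments ...
        batch_id, segments = next(iter(self._batches.items()))
        segment_id, hidden_state = segments.pop()   # ... latest segment first
        if not segments:
            del self._batches[batch_id]
        return batch_id, segment_id, hidden_state


q = PartiallyOrderedQueue()
for b in range(2):            # two micro-batches
    for s in range(3):        # three sequence segments each
        q.push(b, s, hidden_state=None)
print([q.pop()[:2] for _ in range(6)])
# [(0, 2), (0, 1), (0, 0), (1, 2), (1, 1), (1, 0)]
```

Popping segments in reverse order within a micro-batch is what lets the backward pass of the last segment run first, matching the gradient dependencies of causal attention.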

From the timeline shown in Figure 1, it is evident that the Seq1F1B schedule offers shorter execution time and significantly fewer bubbles compared to the original 1F1B schedule. Meanwhile, each worker now has lower memory consumption, since a micro-sequence is smaller than a micro-batch. Another observation is that optimizations similar to the zero-bubble pipeline can also be applied to Seq1F1B by delaying the gradient computation associated with weights in the backward pass. We discuss this in the following sections.

3.3 Seq1F1B-I

[Figure 3: Seq1F1B schedule integrated with the zero-bubble pipeline (ZB1P).]

1F1B-I Narayanan et al. [2021a] achieves better efficiency by modifying the 1F1B schedule to support interleaved stages among workers. In 1F1B-I, each worker is assigned multiple stages. Suppose we have P workers and V stages {s_1, s_2, \ldots, s_V} in our pipeline, where V is a multiple of P. Each worker i handles n stages {s_i, s_{i+P}, s_{i+2P}, \ldots, s_{i+(n-1)P}}, where n = V/P. The number of warm-up micro-batches of each worker i in 1F1B-I is given in Eq. 5.

w_i = (P - i - 1) \times 2 + (n - 1) \times P \qquad (5)

After completing P iterations of forward/backward passes, each worker switches its context to the next stage it is responsible for. The top part of Figure 2 shows a 1F1B-I pipeline with P = 4 and V = 8, in which each worker handles 2 stages. The 1F1B-I schedule reduces the bubble ratio by interleaving stages among workers. However, this interleaving slightly increases memory consumption, as the number of warm-up micro-batches w_i is larger compared to 1F1B.

Similar to 1F1B-I, Seq1F1B-I further modifies 1F1B-I to achieve sequence-level scheduling, as shown in the bottom part of Figure 2. Compared with 1F1B-I, Seq1F1B-I effectively reduces pipeline bubbles and maintains a smaller memory footprint for hidden states. Seq1F1B-I defines the number of warm-up micro-batches as in Eq. 6,

w_i = (P - i - 1) \times 2 + (n - 1) \times P + k - 1 \qquad (6)

in which k represents the number of splits in the sequence. By using the partially ordered queue, Seq1F1B-I maintains a strict order of forward/backward passes and ensures consistent semantics of gradient updates. In terms of pipeline bubbles, Seq1F1B-I outperforms both Seq1F1B and 1F1B-I. In terms of memory demands, Seq1F1B-I requires slightly more memory than Seq1F1B but significantly less than 1F1B-I.
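To make the warm-up counts of Eq. 5 and Eq. 6 concrete, the sketch below evaluates them side by side for the configuration drawn in Figure 2 (P = 4 workers, V = 8 stages, so n = 2) with k = 4 sequence splits. It is a small illustration under those assumptions, with helper names of our own choosing.

```python
def warmup_1f1b_i(i: int, P: int, n: int) -> int:
    """Warm-up micro-batches of worker i under 1F1B-I (Eq. 5)."""
    return (P - i - 1) * 2 + (n - 1) * P


def warmup_seq1f1b_i(i: int, P: int, n: int, k: int) -> int:
    """Warm-up micro-sequences of worker i under Seq1F1B-I (Eq. 6),
    where k is the number of sequence splits."""
    return (P - i - 1) * 2 + (n - 1) * P + k - 1


P, V, k = 4, 8, 4
n = V // P
for i in range(P):
    print(f"worker {i}: 1F1B-I = {warmup_1f1b_i(i, P, n)}, "
          f"Seq1F1B-I = {warmup_seq1f1b_i(i, P, n, k)}")
```

Although the Seq1F1B-I count is larger by k - 1, each counted unit is a micro-sequence (one of k splits of a micro-batch), so each stored hidden state is correspondingly smaller, which is consistent with the memory comparison above.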

3.4 Integration with Zero-bubble-pipeline

As Figure 3 illustrates, Seq1F1B can integrate with the ZB1P method, further reducing bubbles while lowering memory demands by splitting sequences. Such integration outperforms plain ZB1P in both memory demands and pipeline bubbles, since sequence-level pipelines naturally have fewer bubbles. Furthermore, Seq1F1B can also integrate with the ZB2P and ZBV methods. Theoretically, introducing a zero-bubble pipeline into Seq1F1B should be even more efficient. Even so, such a fine-grained handcrafted schedule may suffer performance degradation under some settings. We hope our work inspires future work to solve this problem.

3.5 Workload Balance

In this section, we detail the sequence partition strategy and workload balance considerations. Previous works, such as Li et al. [2021], have discussed strategies for sequence partitioning. To achieve efficient pipeline scheduling, it is crucial that the processing times for all subsequences are approximately equal to avoid pipeline bubbles. Based on this premise, we design a computation-wise partition strategy that estimates the FLOPs of sequences and constructs a theoretical solution aiming to make the FLOPs of all subsequences as close as possible.

For an input sequence S = (x_1, x_2, \cdots, x_n), we divide it into k segments S = [S_1, \cdots, S_k], with each segment having a length of n_i, where \sum_{i=1}^{k} n_i = n. We expect the computational load of each segment to be roughly the same, that is,

\text{FLOPs}(S_1) = \text{FLOPs}(S_2) = \cdots = \text{FLOPs}(S_k) = \frac{\text{FLOPs}(S)}{k}. \qquad (7)

Specifically, we use the method proposed in Hoffmann et al. [2022] to estimate the FLOPs for each subsequence, as formulated in Eq. 8,

\text{FLOPs}(S_i) = 2\, n_i P + 2\, L\, n_i \left(\sum_{j=0}^{i} n_j\right) d, \quad \forall i = 1 \dots k; \qquad \text{FLOPs}(S) = 2\, n P + 2\, L\, n^{2} d, \qquad (8)

in which L is the number of layers, d is the hidden dimension of the model, and P is the total number of parameters in the model. We have k variables in Eq. 8 and k equations in Eq. 7. Therefore, we can solve the system of equations to obtain the optimal segmentation.
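As an illustration of the computation-wise partition, the sketch below balances segments using a token-level prefix-cost function derived from the same FLOPs model as Eq. 8 (a linear term in the number of parameters plus an attention term quadratic in the prefix length), choosing each boundary so that the cumulative cost reaches i/k of the total. This is our own greedy approximation of the closed-form solution of Eq. 7, with hypothetical function names; the released code may partition differently.

```python
def prefix_cost(t, L, d, P_params):
    """Approximate FLOPs of causally processing the first t tokens:
    2*t*P for the parameter-dependent terms plus ~2*L*d*t^2 for attention
    (the model of Eq. 8 applied to a prefix of length t)."""
    return 2 * t * P_params + 2 * L * d * t * t


def computation_wise_split(n, k, L, d, P_params):
    """Split an n-token sequence into k segments with roughly equal FLOPs."""
    total = prefix_cost(n, L, d, P_params)
    lengths, start = [], 0
    for i in range(1, k + 1):
        if i == k:
            end = n                         # last segment takes the remainder
        else:
            target = total * i / k
            end = start + 1
            # smallest prefix end whose cumulative cost reaches the target
            while end < n and prefix_cost(end, L, d, P_params) < target:
                end += 1
        lengths.append(end - start)
        start = end
    return lengths


# Example: a 32k sequence split into 4 segments for a 2.7B-scale model
# (L=32 layers, hidden size d=2560). Earlier segments come out longer because
# later tokens attend to more keys/values and are therefore more expensive.
print(computation_wise_split(n=32768, k=4, L=32, d=2560, P_params=2_700_000_000))
```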

4 Experiments

4.1 Experimental Settings

Table 1: Model configurations and parallelism settings used in our experiments.

Model Size | Layers | Attention Heads | Hidden Size | Sequence Length | PP Size | TP Size | Micro-batches
2.7B | 32 | 32 | 2560 | 16k / 24k / 32k | 8 | 1 | 32 / 64
7B | 32 | 32 | 4096 | 32k / 64k / 128k | 4 | 8 | 16 / 32
13B | 40 | 40 | 5120 | 32k / 64k / 128k | 4 | 8 | 16 / 32
30B | 64 | 64 | 6144 | 32k / 48k / 64k | 8 | 8 | 32 / 64
[Figure 4: Memory consumption comparison of Seq1F1B, 1F1B, and 1F1B-I.]

In our experiments, we evaluate our methods against 1F1B and 1F1B-I under varying sequence lengths, numbers of micro-batches, numbers of GPUs, pipeline parallel sizes, and tensor parallel sizes. The compared methods are as follows:

  • Seq1F1B: Seq1F1B with computation-wise sequence partition strategy.

  • Seq1F1B-I: Seq1F1B with interleaved stages and computation-wise sequence partition strategy.

  • 1F1B/1F1B-I: 1F1B and 1F1B with interleaved stages in Megatron implementation.

  • Seq1F1B w/o cwp: Seq1F1B without computation-wise sequence partition strategy.

  • Seq1F1B-I w/o cwp: Seq1F1B-I without computation-wise sequence partition strategy.

All assessments are based on the GPT architecture, and the model configurations are listed in Table 1. All experiments focus on long-sequence training, whose importance has been highlighted by much prior work. For hyperparameter configurations, we set the number of sequence splits to four, with each worker managing two stages in the interleaved settings. Our implementation is based on the open-source Megatron-LM project Narayanan et al. [2021a] and ensures reproducibility. We adopt the tensor parallelism of Megatron-V3 Korthikanti et al. [2023] in all experiments since it is necessary for long-sequence training.

Our experiments cover three cluster settings: 1) one node with 8 NVIDIA A100 SXM 80GB GPUs interconnected by NVLink; 2) four nodes interconnected by a RoCE RDMA network, each with 8 NVIDIA A100 SXM 80GB GPUs interconnected by NVLink; and 3) eight nodes interconnected by a RoCE RDMA network, each with 8 NVIDIA A100 SXM 80GB GPUs interconnected by NVLink. Each measurement is repeated 100 times, and the standard deviation is reported.

4.2 Main Results

Table 2: Training throughput (thousand tokens/s) and TFLOPS per device for the 2.7B model. Columns are labeled sequence length / number of micro-batches.

Throughput (thousand tokens/s)
Method | 16384/16 | 16384/32 | 24576/16 | 24576/32 | 32768/16 | 32768/32
1F1B | 32.0±0.0 | 37.1±0.0 | 27.0±0.0 | 31.4±0.0 | OOM | OOM
1F1B-I | 36.4±0.0 | 39.7±0.0 | OOM | OOM | OOM | OOM
Seq1F1B | 37.3±0.0 | 38.9±0.3 | 32.6±0.0 | 34.2±0.0 | 28.8±0.0 | 30.1±0.2
Seq1F1B-I | 38.0±0.0 | 38.9±0.0 | 33.3±0.0 | 34.3±0.0 | 29.5±0.0 | 30.3±0.0

TFLOPS per device
Method | 16384/16 | 16384/32 | 24576/16 | 24576/32 | 32768/16 | 32768/32
1F1B | 96.9±0.0 | 112.3±0.0 | 95.5±0.1 | 111.1±0.1 | OOM | OOM
1F1B-I | 110.3±0.1 | 120.2±0.1 | OOM | OOM | OOM | OOM
Seq1F1B | 113.1±0.0 | 117.8±0.8 | 115.2±0.1 | 120.9±0.1 | 116.5±0.1 | 122.0±1.0
Seq1F1B-I | 115.2±0.0 | 118.0±0.0 | 118.0±0.1 | 121.3±0.1 | 119.4±0.0 | 122.7±0.0

Table 3: Training throughput (thousand tokens/s) and TFLOPS per device for the 7B model. Columns are labeled sequence length / number of micro-batches.

Throughput (thousand tokens/s)
Method | 32768/8 | 32768/16 | 65536/8 | 65536/16 | 131072/8 | 131072/16
1F1B | 48.2±0.1 | 55.3±0.2 | 37.3±0.0 | 43.1±0.0 | OOM | OOM
1F1B-I | 53.0±0.3 | 56.3±0.4 | 41.7±0.1 | 44.7±0.0 | OOM | OOM
Seq1F1B | 53.5±0.3 | 55.8±0.1 | 43.3±0.0 | 45.0±0.1 | 30.4±0.0 | 31.6±0.0
Seq1F1B-I | 47.2±0.9 | 46.2±0.8 | 40.9±0.4 | 41.0±0.3 | 30.0±0.0 | 30.4±0.0

TFLOPS per device
Method | 32768/8 | 32768/16 | 65536/8 | 65536/16 | 131072/8 | 131072/16
1F1B | 99.7±0.2 | 114.5±0.4 | 107.5±0.0 | 124.0±0.1 | OOM | OOM
1F1B-I | 109.5±0.7 | 116.5±0.8 | 120.0±0.2 | 128.7±0.1 | OOM | OOM
Seq1F1B | 110.6±0.5 | 115.3±0.2 | 124.6±0.1 | 129.7±0.5 | 136.7±0.1 | 142.1±0.0
Seq1F1B-I | 97.7±1.8 | 95.5±1.6 | 117.8±1.3 | 118.0±0.8 | 135.1±0.2 | 136.6±0.2

Table 4: Training throughput (thousand tokens/s) and TFLOPS per device for the 13B model. Columns are labeled sequence length / number of micro-batches.

Throughput (thousand tokens/s)
Method | 32768/8 | 32768/16 | 49152/8 | 49152/16 | 65536/8 | 65536/16
1F1B | 28.9±0.1 | 33.4±0.1 | 25.3±0.1 | 29.3±0.1 | 22.6±0.1 | 30.0±0.0
1F1B-I | 32.2±0.2 | 34.4±0.1 | 28.2±0.2 | 30.6±0.1 | OOM | OOM
Seq1F1B | 32.9±0.1 | 34.3±0.1 | 29.5±0.1 | 30.8±0.0 | 26.7±0.0 | 27.8±0.0
Seq1F1B-I | 29.7±0.4 | 29.8±0.3 | 28.0±0.2 | 28.3±0.1 | 26.4±0.1 | 26.8±0.1

TFLOPS per device
Method | 32768/8 | 32768/16 | 49152/8 | 49152/16 | 65536/8 | 65536/16
1F1B | 106.7±0.2 | 123.0±0.5 | 109.5±0.5 | 126.2±0.6 | 111.9±0.5 | 135.1±0.2
1F1B-I | 118.6±0.6 | 126.9±0.4 | 121.9±0.7 | 132.2±0.4 | OOM | OOM
Seq1F1B | 121.2±0.2 | 126.6±0.3 | 127.3±0.4 | 133.1±0.2 | 132.5±0.0 | 137.9±0.0
Seq1F1B-I | 109.7±1.4 | 110.0±1.1 | 121.0±1.1 | 122.1±0.4 | 130.6±0.3 | 132.8±0.3

Table 5: Training throughput (thousand tokens/s) and TFLOPS per device for the 30B model. Columns are labeled sequence length / number of micro-batches.

Throughput (thousand tokens/s)
Method | 32768/8 | 32768/16 | 49152/8 | 49152/16 | 65536/8 | 65536/16
1F1B | 26.4±0.1 | 31.2±0.2 | OOM | OOM | OOM | OOM
1F1B-I | OOM | OOM | OOM | OOM | OOM | OOM
Seq1F1B | 31.3±0.1 | 33.1±0.2 | 28.2±0.1 | 29.6±0.1 | 25.5±0.0 | 26.8±0.0
Seq1F1B-I | 28.0±0.4 | 28.4±0.2 | 26.5±0.2 | 27.1±0.2 | 24.8±0.1 | 25.2±0.1

TFLOPS per device
Method | 32768/8 | 32768/16 | 49152/8 | 49152/16 | 65536/8 | 65536/16
1F1B | 104.8±0.3 | 123.9±0.7 | OOM | OOM | OOM | OOM
1F1B-I | OOM | OOM | OOM | OOM | OOM | OOM
Seq1F1B | 124.5±0.2 | 131.5±0.6 | 129.4±0.3 | 135.6±0.3 | 132.6±0.0 | 139.2±0.0
Seq1F1B-I | 111.1±1.6 | 113.0±1.0 | 121.5±1.1 | 124.2±0.8 | 128.6±0.3 | 130.9±0.6

In Figure 4, we compare the memory consumption of our method with that of 1F1B and 1F1B-I. As can be seen, our method consistently requires less memory across all settings and, notably, can support training a 30B model on a 64xA100 cluster, which is impossible for the traditional combination of pipeline and tensor parallelism. Additionally, we record TFLOPS (teraFLOPS) per GPU in our experiments to measure the hardware utilization of different methods. As Tables 2, 3, 4, and 5 show, Seq1F1B outperforms 1F1B and 1F1B-I under almost all settings in both training throughput and TFLOPS.

However, as observed in Tables 3, 4, and 5, Seq1F1B-I may suffer performance degradation under multi-node settings. This could be due to the overly fine-grained interleaving of stage partitioning and input sequence partitioning, which also implies more communication calls in tensor parallelism (although the total communication volume remains unchanged), potentially leading to a decrease in performance. Another observation is that the efficiency of Seq1F1B becomes more pronounced as the sequence length increases. This is because the computation time for each micro-sequence grows with longer sequences, thereby enhancing the benefits derived from sequence partitioning.

4.3 Ablation Results

We also conducted all experiments using Seq1F1B without computation-wise partitioning (Seq1F1B w/o cwp) and Seq1F1B-I without computation-wise partitioning (Seq1F1B-I w/o cwp) to evaluate the effectiveness of our computation-wise partition strategy. Under identical settings, employing the computation-wise partition strategy leads to performance enhancements ranging from approximately 10-30% for Seq1F1B compared to simply splitting the sequence.

Across all experimental scales, Seq1F1B consistently surpassed Seq1F1B w/o cwp in performance. Table 6 highlights the ablation performance for the 2.7B model with a sequence length of 32k, demonstrating a performance boost of approximately 28% due to the computation-wise partitioning.

Table 6: Ablation of the computation-wise partition (cwp) strategy on the 2.7B model with a 32k sequence length.

Method | TFLOPS/device | Speedup
Seq1F1B w/o cwp | 94.8±0.1 | -
Seq1F1B | 122.0±1.0 | 1.28x
Seq1F1B-I w/o cwp | 103.5±0.1 | -
Seq1F1B-I | 122.7±0.0 | 1.18x

5 Conclusion

In this paper, we present Seq1F1B, an efficient 1F1B pipeline parallel scheduling method oriented to training Transformer-based LLMs on long sequences, which decomposes the batch-level schedulable units used by typical 1F1B methods into finer-grained sequence-level units. To achieve a better workload balance in the sequence-level pipeline, we design a computation-wise sequence partition strategy that partitions sequences effectively. Meanwhile, Seq1F1B can integrate with other pipeline parallel methods such as 1F1B with interleaved stages or the zero-bubble pipeline. Our evaluations demonstrate that Seq1F1B outperforms the 1F1B and 1F1B-I scheduling strategies in both memory efficiency and training throughput under variable sequence lengths and model sizes. Moreover, Seq1F1B supports the efficient training of a 30B GPT model on sequences up to 64k tokens using 64xA100 GPUs, without recomputation strategies, which is unachievable with existing pipeline parallel methods. In the future, we will thoroughly combine our method with other distributed methods to achieve better LLM training acceleration. In addition, we will systematically release our code to support the community in training LLMs on longer sequences more efficiently.

References

  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Reid et al. [2024] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  • Jiang et al. [2024] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  • Anil et al. [2023] Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  • Korthikanti et al. [2023] Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023.
  • Rasley et al. [2020] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of KDD, pages 3505–3506, 2020.
  • Rajbhandari et al. [2020] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of SC20, 2020.
  • Shoeybi et al. [2019] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
  • Narayanan et al. [2021a] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021a.
  • Huang et al. [2019] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems, 32, 2019.
  • Yang et al. [2021] Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher Aberger, and Christopher De Sa. PipeMare: Asynchronous pipeline parallel DNN training. Proceedings of Machine Learning and Systems, 3:269–296, 2021.
  • Qi et al. [2024] Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. Zero bubble (almost) pipeline parallelism. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=tuzTN0eIO5.
  • Li et al. [2021] Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. TeraPipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning, pages 6543–6552. PMLR, 2021.
  • Harlap et al. [2018] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. PipeDream: Fast and efficient pipeline parallel DNN training. arXiv preprint arXiv:1806.03377, 2018.
  • Fan et al. [2021] Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, et al. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
  • Buckman and Gelada. Jacob Buckman and Carles Gelada. Compute-optimal Context Size.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of NeurIPS, 2017.
  • Dao et al. [2022] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Proceedings of NeurIPS, pages 16344–16359, 2022.
  • Ding et al. [2023] Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, and Furu Wei. LongNet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486, 2023.
  • Jacobs et al. [2023] Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, and Yuxiong He. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023.
  • Goyal et al. [2017] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • Zinkevich et al. [2010] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex Smola. Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems, 23, 2010.
  • Li et al. [2020] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. PyTorch distributed: Experiences on accelerating data parallel training. Proceedings of the VLDB Endowment, 13(12):3005–3018, 2020.
  • Xing et al. [2015] Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(02):49–67, 2015.
  • Ren et al. [2021] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing billion-scale model training. In Proceedings of ATC, pages 551–564, 2021.
  • Narayanan et al. [2021b] Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient pipeline-parallel DNN training. In International Conference on Machine Learning, pages 7937–7947. PMLR, 2021b.
  • Li and Hoefler [2021] Shigang Li and Torsten Hoefler. Chimera: Efficiently training large-scale neural networks with bidirectional pipelines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14, 2021.
  • Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

