Performance issue when batch_size is 32 #3

@letheantest

Hello, we are trying to use SpecInfer to accelerate model inference, but we have run into a performance issue. As the batch size increases from 1 to 16, system throughput improves steadily. When the batch size reaches 32, however, throughput drops significantly, which we did not expect. Our configuration is as follows:

Environment Setup

We use the provided Docker image (ghcr.io/flexflow/flexflow-cuda-11.8:latest) and build from source following the docs (https://flexflow.readthedocs.io/en/latest/).
We test two supported models, Llama2-70B and OPT-13B, on this dataset (https://huggingface.co/datasets/gbharti/finance-alpaca).

Test Script

We run model inference following the Quickstart guide in the repo.

import flexflow.serve as ff
import argparse
import json
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', type=int)
    parser.add_argument('--memory_per_gpu', type=int)
    parser.add_argument('--zero_copy_memory_per_node', type=int)
    parser.add_argument('--tensor_parallelism_degree', type=int)
    parser.add_argument('--pipeline_parallelism_degree', type=int)
    parser.add_argument('--llm', type=str)
    parser.add_argument('--ssm', type=str)
    parser.add_argument('--prompts_file', type=str)
    parser.add_argument('--max_requests_per_batch', type=int)
    parser.add_argument('--max_seq_length', type=int)
    parser.add_argument('--max_tokens_per_batch', type=int)
    args = parser.parse_args()

    ff.init(num_gpus=args.num_gpus,
            memory_per_gpu=args.memory_per_gpu,
            zero_copy_memory_per_node=args.zero_copy_memory_per_node,
            tensor_parallelism_degree=args.tensor_parallelism_degree,
            pipeline_parallelism_degree=args.pipeline_parallelism_degree
        )
    # Specify the LLM
    llm = ff.LLM(args.llm)

    # Specify a list of SSMs (just one in this case)
    ssms = []
    if args.ssm:  # skip when --ssm is omitted (None) or empty
        ssm_names = args.ssm.split(',')
        for ssm_name in ssm_names:
            ssm = ff.SSM(ssm_name)
            ssms.append(ssm)

    # Create the sampling configs
    generation_config = ff.GenerationConfig(
        do_sample=False, temperature=0, topp=1, topk=1
    )

    # Compile the SSMs for inference and load the weights into memory
    for ssm in ssms:
        ssm.compile(generation_config,
                    max_requests_per_batch=args.max_requests_per_batch,
                    max_seq_length=args.max_seq_length,
                    max_tokens_per_batch=args.max_tokens_per_batch)

    # Compile the LLM for inference and load the weights into memory
    llm.compile(generation_config, 
                ssms=ssms,
                max_requests_per_batch=args.max_requests_per_batch,
                max_seq_length=args.max_seq_length,
                max_tokens_per_batch=args.max_tokens_per_batch
               )

    # load prompts
    with open(args.prompts_file, 'r') as f:
        prompts = json.load(f)

    llm.start_server()
    result = llm.generate(prompts=prompts)

Test Results

We run the evaluation on four NVIDIA A100 80 GB GPUs connected over NVLink and record the throughput as the batch size (BS) increases from 1 to 32. The results are as follows:

Throughput (tokens/s)

Batch size   Llama2-70B      OPT-13B
BS=1         28.709671931    97.12122162
BS=2         52.22124339     189.1327599
BS=4         106.9214668     362.0640686
BS=8         182.9473744     680.4388029
BS=16        322.7966769     1188.828348
BS=32        298.8251763     437.7545888
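
To make the drop easier to see, here is a small sketch that computes the throughput ratio between consecutive batch sizes from the numbers in the table above. The helper name `step_ratios` is only illustrative and is not part of FlexFlow; a ratio of 2.0 would mean throughput doubled when the batch size doubled.

```python
# Throughput values (tokens/s) copied from the table above.
batch_sizes = [1, 2, 4, 8, 16, 32]
throughput = {
    "Llama2-70B": [28.709671931, 52.22124339, 106.9214668,
                   182.9473744, 322.7966769, 298.8251763],
    "OPT-13B": [97.12122162, 189.1327599, 362.0640686,
                680.4388029, 1188.828348, 437.7545888],
}


def step_ratios(values):
    """Return each measurement divided by the previous one.

    Hypothetical helper for illustration: 2.0 means perfect scaling
    for a doubled batch size, and a value below 1.0 means a regression.
    """
    return [curr / prev for prev, curr in zip(values, values[1:])]


for model, values in throughput.items():
    for bs, ratio in zip(batch_sizes[1:], step_ratios(values)):
        print(f"{model} BS={bs}: {ratio:.2f}x vs previous batch size")
```

With these numbers, every step up to BS=16 gives a ratio between roughly 1.7 and 2.05, while the BS=16 to BS=32 step gives about 0.93 for Llama2-70B and about 0.37 for OPT-13B.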

Any help in resolving this issue would be appreciated!
