Performance issue when batch_size is 32

Hello, we are trying to use SpecInfer to accelerate model inference, but we have run into a performance issue. As the batch size increases from 1 to 16, system throughput improves steadily. At a batch size of 32, however, throughput drops significantly, which we did not expect. Our configuration is as follows:

### Environment Setup
We use the provided Docker image (ghcr.io/flexflow/flexflow-cuda-11.8:latest) and build from source following the docs (https://flexflow.readthedocs.io/en/latest/).
We test two supported models, Llama2-70B and OPT-13B, on this dataset (https://huggingface.co/datasets/gbharti/finance-alpaca).

### Test Script
We run model inference following the Quickstart guide in the repo.

````python
import flexflow.serve as ff
import argparse
import json
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', type=int)
    parser.add_argument('--memory_per_gpu', type=int)
    parser.add_argument('--zero_copy_memory_per_node', type=int)
    parser.add_argument('--tensor_parallelism_degree', type=int)
    parser.add_argument('--pipeline_parallelism_degree', type=int)
    parser.add_argument('--llm', type=str)
    parser.add_argument('--ssm', type=str)
    parser.add_argument('--prompts_file', type=str)
    parser.add_argument('--max_requests_per_batch', type=int)
    parser.add_argument('--max_seq_length', type=int)
    parser.add_argument('--max_tokens_per_batch', type=int)
    args = parser.parse_args()

    ff.init(num_gpus=args.num_gpus,
            memory_per_gpu=args.memory_per_gpu,
            zero_copy_memory_per_node=args.zero_copy_memory_per_node,
            tensor_parallelism_degree=args.tensor_parallelism_degree,
            pipeline_parallelism_degree=args.pipeline_parallelism_degree
        )
    # Specify the LLM
    llm = ff.LLM(args.llm)

    # Specify a list of SSMs (comma-separated names; the list may be empty)
    ssms = []
    if args.ssm:  # also covers --ssm being omitted, in which case args.ssm is None
        for ssm_name in args.ssm.split(','):
            ssm = ff.SSM(ssm_name)
            ssms.append(ssm)

    # Create the sampling configs
    generation_config = ff.GenerationConfig(
        do_sample=False, temperature=0, topp=1, topk=1
    )

    # Compile the SSMs for inference and load the weights into memory
    for ssm in ssms:
        ssm.compile(generation_config,
                    max_requests_per_batch=args.max_requests_per_batch,
                    max_seq_length=args.max_seq_length,
                    max_tokens_per_batch=args.max_tokens_per_batch)

    # Compile the LLM for inference and load the weights into memory
    llm.compile(generation_config, 
                ssms=ssms,
                max_requests_per_batch=args.max_requests_per_batch,
                max_seq_length=args.max_seq_length,
                max_tokens_per_batch=args.max_tokens_per_batch
               )

    # load prompts
    with open(args.prompts_file, 'r') as f:
        prompts = json.load(f)

    llm.start_server()
    result = llm.generate(prompts=prompts)
````
### Test Results

We run the evaluation on four NVIDIA A100 80 GB GPUs connected over NVLink and record the throughput as the batch size increases from 1 to 32. The results are as follows:

| Batch size | Llama2-70B (tokens/s) | OPT-13B (tokens/s) |
| ---------- | --------------------- | ------------------ |
| 1          | 28.709671931          | 97.12122162        |
| 2          | 52.22124339           | 189.1327599        |
| 4          | 106.9214668           | 362.0640686        |
| 8          | 182.9473744           | 680.4388029        |
| 16         | 322.7966769           | 1188.828348        |
| 32         | 298.8251763           | 437.7545888        |
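
To make the regression easier to see, here is a small sketch that computes the speedup over batch size 1 and the relative drop from batch size 16 to 32. The numbers are copied from the table above; the helper names `scaling_report` and `drop_percent` are only illustrative and are not part of FlexFlow.

```python
# Throughput in tokens/s, copied from the results table.
throughput = {
    "Llama2-70B": {1: 28.709671931, 2: 52.22124339, 4: 106.9214668,
                   8: 182.9473744, 16: 322.7966769, 32: 298.8251763},
    "OPT-13B": {1: 97.12122162, 2: 189.1327599, 4: 362.0640686,
                8: 680.4388029, 16: 1188.828348, 32: 437.7545888},
}


def scaling_report(results):
    """Return (batch_size, speedup_vs_bs1, efficiency) for each batch size.

    Efficiency is speedup divided by batch size, so 1.0 means perfectly
    linear scaling relative to batch size 1.
    """
    base = results[1]
    report = []
    for batch_size in sorted(results):
        speedup = results[batch_size] / base
        report.append((batch_size, speedup, speedup / batch_size))
    return report


def drop_percent(results, before, after):
    """Relative throughput loss (in percent) going from `before` to `after`."""
    return 100.0 * (1.0 - results[after] / results[before])


if __name__ == "__main__":
    for model, results in throughput.items():
        print(model)
        for batch_size, speedup, efficiency in scaling_report(results):
            print(f"  BS={batch_size:<2}  speedup={speedup:6.2f}x  "
                  f"efficiency={efficiency:5.2f}")
        print(f"  drop from BS=16 to BS=32: "
              f"{drop_percent(results, 16, 32):.1f}%")
```

With these numbers, throughput falls by roughly 7% for Llama2-70B and roughly 63% for OPT-13B between batch sizes 16 and 32, so the regression is much more severe for the smaller model.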


Any help resolving this issue would be appreciated!
