
Conversation

@DannyYuyang-quic (Contributor)

Summary:

  • Enable multi‑turn conversation and support multiple images, both within a single turn and across turns.
  • Refactor the chat template and runner.

Test plan

# Turn 1
IMAGE1_URL="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
IMAGE2_URL="http://images.cocodataset.org/val2017/000000039769.jpg"
PROMPT1="<image><image>Compare these images above and list the differences."

# Turn 2

PROMPT2="Answer the question: What's the main object in first image?"

# Turn 3
IMAGE3_URL="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
PROMPT3="<image>Caption this image."

# Execute the multi-turn conversation
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"
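As used here, the N-th --prompt value is turn N, and each <image> placeholder consumes the next --image_path entry in order (two in turn 1, one in turn 3).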

@pytorch-bot bot commented Feb 9, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17308

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures

As of commit f96ab48 with merge base ba89c69:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label on Feb 9, 2026
@DannyYuyang-quic (Contributor, Author)

Hi @cccclai,
This PR extends multi-turn conversation support to VLMs.

Encoder quantization is enabled only when each turn contains a single image. For multi‑image conversations, we skip encoder quantization to preserve visual embedding quality: regular quantization does not retain enough precision, causing the decoder to misinterpret image features. Until we have a reliable tuning method for encoder quantization in multi‑image scenarios, we recommend keeping the vision encoder in floating point.
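For illustration, a minimal Python sketch of this gating policy; the function and variable names below are hypothetical, not this PR's actual API:

# Illustrative sketch of the gating policy above; names are hypothetical,
# not this PR's actual API.
def should_quantize_encoder(images_per_turn):
    # Quantize the vision encoder only when every turn carries at most one
    # image; any multi-image turn keeps the encoder in floating point to
    # preserve visual-embedding precision for the decoder.
    return all(n <= 1 for n in images_per_turn)

# Turn 1 has two images, turn 2 none, turn 3 one -> keep the FP encoder.
assert not should_quantize_encoder([2, 0, 1])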

Below are some HTP runtime results for SmolVLM‑500M over a 3‑turn conversation:
cc: @haowhsu-quic

Simulation turns

Turn 1:
Query: "Compare these images above and list the differences."
Image1: "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
Image2: "http://images.cocodataset.org/val2017/000000039769.jpg"

Turn 2:
Query: "Answer the question: What's the main object in first image?"

Turn 3:
Image: "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
Query: "Caption this image."

Answer in 3 turns

PyTorchObserver {"prompt_tokens":151,"generated_tokens":30,"model_load_start_ms":1753724157466,"model_load_end_ms":1753724158128,"inference_start_ms":1753724158141,"inference_end_ms":1753724164592,"prompt_eval_end_ms":1753724163488,"first_token_ms":1753724163488,"aggregate_sampling_time_ms":49,"SCALING_FACTOR_UNITS_PER_SECOND":1000}

PyTorchObserver {"prompt_tokens":21,"generated_tokens":13,"model_load_start_ms":1753724157466,"model_load_end_ms":1753724158128,"inference_start_ms":1753724164595,"inference_end_ms":1753724165818,"prompt_eval_end_ms":1753724165339,"first_token_ms":1753724165339,"aggregate_sampling_time_ms":70,"SCALING_FACTOR_UNITS_PER_SECOND":1000}

PyTorchObserver {"prompt_tokens":80,"generated_tokens":10,"model_load_start_ms":1753724157466,"model_load_end_ms":1753724158128,"inference_start_ms":1753724165831,"inference_end_ms":1753724169033,"prompt_eval_end_ms":1753724168665,"first_token_ms":1753724168665,"aggregate_sampling_time_ms":86,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
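As a rough read of the first observer line: prefill covers the 151 prompt tokens between inference_start_ms and prompt_eval_end_ms, about 5.3 s (≈28 tok/s), and decode then emits 30 tokens in the remaining ≈1.1 s (≈27 tok/s).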
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/outputs.txt: 1 file pulled. 0.8 MB/s (779 bytes in 0.001s)
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/inference_speed.txt: 1 file pulled. 0.0 MB/s (7 bytes in 0.001s)
INFO:root:Device Inference Results[0]:
<|im_start|>User:<fake_token_around_image><global-img><image><fake_token_around_image><fake_token_around_image><global-img><image><fake_token_around_image>Compare these images above and list the differences.<end_of_utterance>
Assistant: The first image shows a cityscape with a statue of liberty in the foreground. The second image shows two tabby cats sleeping on a pink blanket.<end_of_utterance><|im_start|>User:Answer the question: What's the main object in first image?<end_of_utterance>
Assistant: The main object in the first image is a statue of liberty.<end_of_utterance><|im_start|>User:<fake_token_around_image><global-img><image><fake_token_around_image>Caption this image.<end_of_utterance>
Assistant: A flower with a yellow center and many petals.<end_of_utterance>
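For reference, a small Python sketch reconstructing the turn format visible in the transcript above; this is inferred from the output, and the runner's actual chat-template code may differ:

# Reconstructed from the transcript above; the runner's real template handling
# may differ (e.g. in how many <image> tokens each image expands to).
IMG = "<fake_token_around_image><global-img><image><fake_token_around_image>"

def format_turn(user_text, n_images=0):
    # Each user turn: image-token groups first, then the text, then the
    # end-of-utterance marker; the model completes after "Assistant:".
    return ("<|im_start|>User:" + IMG * n_images + user_text
            + "<end_of_utterance>\nAssistant:")

print(format_turn("Caption this image.", n_images=1))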

Script

# Turn 1
IMAGE1_URL="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
IMAGE2_URL="http://images.cocodataset.org/val2017/000000039769.jpg"
PROMPT1="<image><image>Compare these images above and list the differences."

# Turn 2

PROMPT2="Answer the question: What's the main object in first image?"

# Turn 3
IMAGE3_URL="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
PROMPT3="<image>Caption this image."

# Execute the multi-turn conversation
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m SM8750 --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"

@DannyYuyang-quic (Contributor, Author)

@pytorchbot label "release notes: qualcomm"

@pytorch-bot bot added the release notes: qualcomm label on Feb 9, 2026
Summary:
 - Multi‑turn conversation: add conversation support for the VLM scenario
   - add a runtime chat template
 - Multi‑image inputs: accept multiple images per turn and across turns
@DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch from 88163a8 to f96ab48 on February 10, 2026
@meta-codesync bot commented Feb 10, 2026

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D92849100.

@larryliu0820 (Contributor) left a comment

Review automatically exported from Phabricator review in Meta.

@metascroy (Contributor)

Can we unify QNN and CoreML runners? The model definitions are similar: #16463

@larryliu0820 (Contributor) commented Feb 11, 2026

@DannyYuyang-quic and folks,

Thank you for adding this feature! I have a few questions about the high‑level architecture of the runners.

I think you have seen the generic LLM runners and multimodal runners under extension/llm/runner. I'm curious what the blockers are for the Qualcomm versions of these runners to extend the extension/llm/runner base classes.

The reason is that we maintain a JNI layer connecting to Android. I understand this is currently done through the IRunner interface; I'm wondering whether we can reuse more components, which would largely reduce our maintenance burden.

One concrete thing is whether we can do something like

class QNNMultimodalRunner : public executorch::extension::llm::MultimodalRunner {

and reuse common logic. I think the team is happy to change the generic MultimodalRunner to make it easier to extend.

Tell me what you think about this!

@larryliu0820 (Contributor) left a comment

Requesting changes for the comment
