
Conversation

@DannyYuyang-quic (Contributor)

Summary:

  • Enable multi‑turn conversation and support multiple images, both within a single turn and across turns.
  • Refactor the chat template and runner.

Test plan

# Turn 1
IMAGE1_URL="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
IMAGE2_URL="http://images.cocodataset.org/val2017/000000039769.jpg"
PROMPT1="<image><image>Compare these images above and list the differences."

# Turn 2

PROMPT2="Answer the question: What's the main object in first image?"

# Turn 3
IMAGE3_URL="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
PROMPT3="<image>Caption this image."

# Execute the multi-turn conversation
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"
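As used here, the N-th --prompt value is turn N, and each <image> placeholder consumes the next --image_path entry in order (two in turn 1, one in turn 3).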

@pytorch-bot bot commented Feb 9, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17308

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures

As of commit f96ab48 with merge base ba89c69:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label on Feb 9, 2026
@DannyYuyang-quic (Contributor, Author)

Hi @cccclai,
This PR extends multi-turn conversation support to VLMs.

Encoder quantization is enabled only when each turn contains a single image. For multi‑image conversations, we skip encoder quantization to preserve visual embedding quality: regular quantization does not retain enough precision, causing the decoder to misinterpret image features. Until we have a reliable tuning method for encoder quantization in multi‑image scenarios, we recommend keeping the vision encoder in floating point.
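For illustration, a minimal Python sketch of this gating policy; the function and variable names below are hypothetical, not this PR's actual API:

# Illustrative sketch of the gating policy above; names are hypothetical,
# not this PR's actual API.
def should_quantize_encoder(images_per_turn):
    # Quantize the vision encoder only when every turn carries at most one
    # image; any multi-image turn keeps the encoder in floating point to
    # preserve visual-embedding precision for the decoder.
    return all(n <= 1 for n in images_per_turn)

# Turn 1 has two images, turn 2 none, turn 3 one -> keep the FP encoder.
assert not should_quantize_encoder([2, 0, 1])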

Below are some HTP runtime results for SmolVLM‑500M over a 3‑turn conversation:
cc: @haowhsu-quic

Simulation turns

Turn 1:
Query: "Compare these images above and list the differences."
Image1: "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
Image2: "http://images.cocodataset.org/val2017/000000039769.jpg"

Turn 2:
Query: "Answer the question: What's the main object in first image?"

Turn 3:
Image: "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
Query: "Caption this image."

Answer in 3 turns

PyTorchObserver {"prompt_tokens":151,"generated_tokens":30,"model_load_start_ms":1753724157466,"model_load_end_ms":1753724158128,"inference_start_ms":1753724158141,"inference_end_ms":1753724164592,"prompt_eval_end_ms":1753724163488,"first_token_ms":1753724163488,"aggregate_sampling_time_ms":49,"SCALING_FACTOR_UNITS_PER_SECOND":1000}

PyTorchObserver {"prompt_tokens":21,"generated_tokens":13,"model_load_start_ms":1753724157466,"model_load_end_ms":1753724158128,"inference_start_ms":1753724164595,"inference_end_ms":1753724165818,"prompt_eval_end_ms":1753724165339,"first_token_ms":1753724165339,"aggregate_sampling_time_ms":70,"SCALING_FACTOR_UNITS_PER_SECOND":1000}

PyTorchObserver {"prompt_tokens":80,"generated_tokens":10,"model_load_start_ms":1753724157466,"model_load_end_ms":1753724158128,"inference_start_ms":1753724165831,"inference_end_ms":1753724169033,"prompt_eval_end_ms":1753724168665,"first_token_ms":1753724168665,"aggregate_sampling_time_ms":86,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
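As a rough read of the first observer line: prefill covers the 151 prompt tokens between inference_start_ms and prompt_eval_end_ms, about 5.3 s (≈28 tok/s), and decode then emits 30 tokens in the remaining ≈1.1 s (≈27 tok/s).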
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/outputs.txt: 1 file pulled. 0.8 MB/s (779 bytes in 0.001s)
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/inference_speed.txt: 1 file pulled. 0.0 MB/s (7 bytes in 0.001s)
INFO:root:Device Inference Results[0]:
<|im_start|>User:<fake_token_around_image><global-img><image><fake_token_around_image><fake_token_around_image><global-img><image><fake_token_around_image>Compare these images above and list the differences.<end_of_utterance>
Assistant: The first image shows a cityscape with a statue of liberty in the foreground. The second image shows two tabby cats sleeping on a pink blanket.<end_of_utterance><|im_start|>User:Answer the question: What's the main object in first image?<end_of_utterance>
Assistant: The main object in the first image is a statue of liberty.<end_of_utterance><|im_start|>User:<fake_token_around_image><global-img><image><fake_token_around_image>Caption this image.<end_of_utterance>
Assistant: A flower with a yellow center and many petals.<end_of_utterance>
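For reference, a small Python sketch reconstructing the turn format visible in the transcript above; this is inferred from the output, and the runner's actual chat-template code may differ:

# Reconstructed from the transcript above; the runner's real template handling
# may differ (e.g. in how many <image> tokens each image expands to).
IMG = "<fake_token_around_image><global-img><image><fake_token_around_image>"

def format_turn(user_text, n_images=0):
    # Each user turn: image-token groups first, then the text, then the
    # end-of-utterance marker; the model completes after "Assistant:".
    return ("<|im_start|>User:" + IMG * n_images + user_text
            + "<end_of_utterance>\nAssistant:")

print(format_turn("Caption this image.", n_images=1))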

Script

# Turn 1
IMAGE1_URL="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
IMAGE2_URL="http://images.cocodataset.org/val2017/000000039769.jpg"
PROMPT1="<image><image>Compare these images above and list the differences."

# Turn 2

PROMPT2="Answer the question: What's the main object in first image?"

# Turn 3
IMAGE3_URL="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
PROMPT3="<image>Caption this image."

# Execute the multi-turn conversation
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m SM8750 --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"

@DannyYuyang-quic (Contributor, Author)

@pytorchbot label "release notes: qualcomm"

@pytorch-bot bot added the release notes: qualcomm label on Feb 9, 2026
Summary:
 - Multi‑turn conversation: add conversation support for the VLM scenario
   - add a runtime chat template
 - Multi‑image inputs: accept multiple images per turn and across turns
@DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch from 88163a8 to f96ab48 on February 10, 2026
@meta-codesync bot commented Feb 10, 2026

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D92849100.

@larryliu0820 (Contributor) left a comment

Review automatically exported from Phabricator review in Meta.

@metascroy (Contributor)

Can we unify QNN and CoreML runners? The model definitions are similar: #16463

@larryliu0820 (Contributor) commented Feb 11, 2026

@DannyYuyang-quic and folks,

Thank you for adding this feature! I have a few questions about the high‑level architecture of the runners.

I think you have seen the generic LLM runners and multimodal runners under extension/llm/runner. I'm curious what the blockers are for the Qualcomm versions of these runners to extend the extension/llm/runner base classes.

The reason is that we maintain a JNI layer connecting to Android. I understand this is currently done through the IRunner interface; I'm wondering whether we can reuse more components, which would largely reduce our maintenance burden.

One concrete thing is whether we can do something like

class QNNMultimodalRunner : public executorch::extension::llm::MultimodalRunner {

and reuse common logic. I think the team is happy to change the generic MultimodalRunner to make it easier to extend.

Tell me what you think about this!

@larryliu0820 (Contributor) left a comment

Requesting changes for the comment
