Qualcomm AI Engine Direct - [Multimodal] Multi-turn VLM conversation #17308
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17308
Note: links to docs will display an error until the docs builds have been completed.
❌ 6 New Failures as of commit f96ab48 with merge base ba89c69.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @cccclai,
Encoder quantization is only enabled when each turn contains a single image. Below are some HTP runtime results for SmolVLM-500M in a 3-turn conversation:

Simulation turns
[runtime screenshots for Turn 1, Turn 2, and Turn 3]

Answer in 3 turns
[model answers for the three turns]

Script

```bash
# Turn 1
IMAGE1_URL="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
IMAGE2_URL="http://images.cocodataset.org/val2017/000000039769.jpg"
PROMPT1="<image><image>Compare these images above and list the differences."
# Turn 2
PROMPT2="Answer the question: What's the main object in first image?"
# Turn 3
IMAGE3_URL="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
PROMPT3="<image>Caption this image."
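# Prompts map to turns in order; each <image> tag appears to consume the next
# --image_path, so turn 1 uses IMAGE1_URL and IMAGE2_URL, turn 2 adds no new
# image, and turn 3 uses IMAGE3_URL.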
# Execute the multi-turn conversation
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m SM8750 --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"
```
@pytorchbot label "release notes: qualcomm"
Summary:
- Multi-turn conversation: add conversation support for the VLM scenario.
- Add a runtime chat template.
- Multi-image inputs: accept multiple images per turn and across turns.
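To make the "runtime chat template" item concrete, here is a minimal sketch of what replaying a multi-turn VLM conversation through a template can look like. This is an illustration, not the PR's implementation: `Turn`, `apply_chat_template`, the role strings, and the `<end_of_utterance>` token are all assumptions of this sketch.

```cpp
#include <string>
#include <vector>

// Illustrative only: this struct and the token strings below are assumptions
// for the sketch, not identifiers from the PR.
struct Turn {
  std::string role;    // "user" or "assistant"
  std::string content; // user text; may contain <image> placeholders
};

// Replay the conversation history through a chat template so each new turn
// sees the full context, including earlier <image> positions.
std::string apply_chat_template(const std::vector<Turn>& history) {
  std::string prompt;
  for (const Turn& t : history) {
    prompt += (t.role == "user" ? "User: " : "Assistant: ");
    prompt += t.content + "<end_of_utterance>\n";
  }
  prompt += "Assistant:"; // cue the model to generate the next reply
  return prompt;
}
```

The point of this shape is that the runner, not the caller, owns the history and re-renders it each turn, which is what lets a second turn like PROMPT2 refer back to the first image.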
Force-pushed 88163a8 to f96ab48.
larryliu0820
left a comment
Review automatically exported from Phabricator review in Meta.
Can we unify QNN and CoreML runners? The model definitions are similar: #16463
@DannyYuyang-quic and folks, thank you for adding this feature! I have a few questions regarding the high-level architecture of the runners.

I think you have seen the generic LLM runners and multimodal runners under extension/llm/runner. I'm curious what the blockers are for the Qualcomm version of these runners to extend the extension/llm/runner base classes. The reason is that we maintain a JNI layer that connects to Android. I understand this is currently done through the IRunner interface, and I'm thinking that if we can reuse more components, it will largely reduce our maintenance burden. One concrete idea is to do something like the sketch below and reuse the common logic. I think the team is happy to change the generic MultimodalRunner so that it can be extended easily. Tell me what you think about this!
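For concreteness, the kind of reuse being suggested might look like the sketch below. The include path and the MultimodalRunner class come from extension/llm/runner; everything else (the subclass name, the inherited-constructor trick, and which pieces would be overridden) is a hypothetical assumption, and making this work smoothly is exactly the refactor being offered above.

```cpp
#include <executorch/extension/llm/runner/multimodal_runner.h>

// Hypothetical sketch: assumes MultimodalRunner is (or will be made)
// subclassable; the name and override points are illustrative, not real API.
class QnnMultimodalRunner
    : public executorch::extension::llm::MultimodalRunner {
 public:
  using MultimodalRunner::MultimodalRunner;

  // Override only the QNN/HTP-specific steps (e.g., encoder dispatch,
  // KV-cache updates) and inherit tokenization, chat templating, and the
  // generation loop, so the Android JNI layer keeps targeting a single
  // shared IRunner-style surface.
};
```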
larryliu0820
left a comment
Requesting changes per the comment above.
Summary:
Test plan