fix: stream multipart uploads to avoid OOM on large files#477
Conversation
Replace buffered file read + build_multipart_body in build_http_request with streaming build_multipart_stream using tokio_util::io::ReaderStream. Memory usage drops from O(file_size) to O(64 KB) regardless of upload size. Content-Length is pre-computed from file metadata so Google APIs still receive the correct header without buffering. Fixes #244
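The key idea — knowing the body length without buffering the file — comes down to simple arithmetic over the fixed parts of the `multipart/related` body. A minimal std-only sketch (function name and the `--{boundary}--` postamble format are assumptions for illustration, not the PR's actual code):

```rust
// Hypothetical sketch: Content-Length for a multipart/related body can be
// computed from the preamble, the file size (from metadata), and the
// postamble, without ever reading the file into memory.
fn multipart_content_length(
    boundary: &str,
    metadata_json: &str,
    media_mime: &str,
    file_size: u64,
) -> u64 {
    let preamble = format!(
        "--{boundary}\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n{metadata_json}\r\n--{boundary}\r\nContent-Type: {media_mime}\r\n\r\n"
    );
    // Closing boundary terminates the multipart body (assumed format).
    let postamble = format!("\r\n--{boundary}--\r\n");
    preamble.len() as u64 + file_size + postamble.len() as u64
}

fn main() {
    // In the real code the file size would come from tokio::fs::metadata.
    let len = multipart_content_length("BOUNDARY", "{}", "text/plain", 256 * 1024);
    println!("{len}");
}
```

Because all three parts have knowable lengths up front, the header can be set before the first file chunk is read.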
🦋 Changeset detected — latest commit: f267bee. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical out-of-memory issue that occurred when uploading large files by switching from a buffered approach to a streaming mechanism. The change ensures that file uploads are processed in chunks, significantly reducing the memory footprint and improving the stability of the application when dealing with substantial data volumes. This enhancement makes the upload process more robust and scalable without compromising the accuracy of request headers.

Highlights
Code Review
This pull request effectively addresses a critical out-of-memory issue by replacing buffered file uploads with a streaming approach. The use of ReaderStream and pre-calculating Content-Length from file metadata is a solid implementation. My feedback focuses on improving error message clarity to aid in future debugging. Overall, this is an excellent and important improvement.
- Metadata error now says 'Failed to get metadata' instead of the misleading 'Failed to read upload file'
- `File::open` error in the stream now includes the file path for easier debugging
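The second point — attaching the file path to open errors — can be sketched in a few lines of std-only Rust (the function name and error wording here are hypothetical, not the PR's actual code):

```rust
use std::fs::File;

// Hypothetical sketch of the review feedback: wrap File::open errors with the
// offending path so CI logs point directly at the file that failed.
fn open_with_context(path: &str) -> Result<File, String> {
    File::open(path)
        .map_err(|e| format!("Failed to open upload file '{path}': {e}"))
}

fn main() {
    // A missing file now yields an error that names the path.
    match open_with_context("/nonexistent/upload.bin") {
        Ok(_) => println!("opened"),
        Err(e) => println!("{e}"),
    }
}
```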
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #477      +/-   ##
==========================================
+ Coverage   67.71%   67.81%   +0.10%
==========================================
  Files          38       38
  Lines       17044    17136      +92
==========================================
+ Hits        11541    11621      +80
- Misses       5503     5515      +12
```

☔ View full report in Codecov by Sentry.
/gemini review
Code Review
This pull request effectively resolves a critical out-of-memory issue during large file uploads by replacing the buffered implementation with a streaming multipart body. The use of tokio_util::ReaderStream and pre-calculating Content-Length from file metadata is a solid approach. The changes are well-structured, and the new tests provide good coverage for the streaming logic. I have one suggestion to further improve the robustness of file path handling.
Uploads a small text file, verifies the response has a file ID, then cleans up by deleting it. Validates the streaming multipart upload path end-to-end against real Google APIs.
/gemini review
Code Review
This pull request effectively addresses the out-of-memory issue with large file uploads by switching to a streaming multipart body. The implementation using ReaderStream is solid and the pre-computation of Content-Length is correct. I've identified one high-severity security concern regarding the handling of the upload file path and have provided a detailed suggestion to mitigate it.
The upload is via the `+upload` helper command, not `files create --upload`. Also pipe stderr through `tee` so errors are visible in CI logs.
/gemini review
Code Review
This pull request successfully refactors multipart uploads to use streaming, which is a great improvement to prevent out-of-memory errors with large files. The use of ReaderStream and pre-calculating Content-Length is well-implemented. However, I've identified a critical security vulnerability related to path validation for the uploaded file, which could allow an attacker to read arbitrary files from the system. My review includes suggestions to address this by leveraging the existing validation utilities in the codebase.
/gemini review
Code Review
This pull request introduces a significant improvement by replacing buffered multipart uploads with a streaming approach, effectively fixing an out-of-memory issue with large files. The implementation is well-structured and includes new tests. However, I've found a critical issue in how the multipart preamble is constructed, which includes unintended whitespace that will corrupt the request body and cause uploads to fail. The new tests unfortunately replicate this bug and will also need to be corrected.
```rust
let preamble = format!(
    "--{boundary}\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n{metadata_json}\r\n\
    --{boundary}\r\nContent-Type: {media_mime}\r\n\r\n"
);
```
The multi-line `format!` macro for `preamble` includes leading whitespace from the source code indentation on the second line. This will create a malformed `multipart/related` body because the boundary separator will be `    --{boundary}` instead of `--{boundary}`. This will cause the upload to fail.
Suggested change:

```diff
-let preamble = format!(
-    "--{boundary}\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n{metadata_json}\r\n\
-    --{boundary}\r\nContent-Type: {media_mime}\r\n\r\n"
-);
+let preamble = format!("--{boundary}\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n{metadata_json}\r\n--{boundary}\r\nContent-Type: {media_mime}\r\n\r\n");
```
```rust
let preamble = format!(
    "--{boundary}\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n{metadata_json}\r\n\
    --{boundary}\r\nContent-Type: text/plain\r\n\r\n"
);
```
The test calculates the expected preamble length using the same buggy format string as in build_multipart_stream. The leading whitespace on the second line of the format string should be removed to correctly test against a valid multipart body.
Suggested change:

```diff
-let preamble = format!(
-    "--{boundary}\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n{metadata_json}\r\n\
-    --{boundary}\r\nContent-Type: text/plain\r\n\r\n"
-);
+let preamble = format!("--{boundary}\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n{metadata_json}\r\n--{boundary}\r\nContent-Type: text/plain\r\n\r\n");
```
```rust
let preamble = format!(
    "--{boundary}\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n{{}}\r\n\
    --{boundary}\r\nContent-Type: application/octet-stream\r\n\r\n"
);
```
Similar to the other test, the preamble calculation here includes extra whitespace that will not match a correctly formed multipart body. The leading whitespace on the second line of the format string should be removed.
Suggested change:

```diff
-let preamble = format!(
-    "--{boundary}\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n{{}}\r\n\
-    --{boundary}\r\nContent-Type: application/octet-stream\r\n\r\n"
-);
+let preamble = format!("--{boundary}\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n{{}}\r\n--{boundary}\r\nContent-Type: application/octet-stream\r\n\r\n");
```
Summary
Fixes #244 — uploading large files via `--upload` causes an out-of-memory crash because the entire file is read into memory (`tokio::fs::read`), then copied into a second `Vec` by `build_multipart_body`. A 5 GB file requests ~20 GB of contiguous RAM.

This replaces the buffered approach with a streaming `multipart/related` body:

- `build_multipart_stream` yields the body in three chained streams: preamble (`Bytes`) → file chunks via `ReaderStream` → postamble (`Bytes`)
- `Content-Length` is computed from `tokio::fs::metadata`, so Google APIs still receive the correct header without buffering the file
- Zero-copy chunks (`bytes::Bytes`)
- `Result` error propagation for metadata serialization (no `unwrap_or`)

The old `build_multipart_body` is retained under `#[cfg(test)]` for the existing unit tests.

Supersedes #418 — incorporates all review feedback from that PR (`ReaderStream` instead of manual unfold, zero-copy `Bytes`, proper error handling).
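The memory win comes from reading the file in fixed-size chunks instead of all at once. A std-only sketch of the idea (the real code uses `tokio_util::io::ReaderStream` with async I/O; the function name, 64 KB chunk size source, and temp-file setup here are illustrative assumptions):

```rust
use std::io::{Read, Write};

// Hypothetical synchronous sketch of the streaming approach: copy a file into
// a body writer 64 KB at a time, so peak memory is O(chunk), not O(file_size).
fn stream_file<W: Write>(path: &str, mut body: W) -> std::io::Result<u64> {
    let mut file = std::fs::File::open(path)?;
    let mut buf = [0u8; 64 * 1024]; // fixed 64 KB chunk, matching the PR's figure
    let mut total = 0u64;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        body.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    // Demo: a 256 KB file streams through in four 64 KB chunks.
    let path = std::env::temp_dir().join("stream_demo.bin");
    std::fs::write(&path, vec![0u8; 256 * 1024])?;
    let mut sink = Vec::new();
    let n = stream_file(path.to_str().unwrap(), &mut sink)?;
    println!("{n}");
    std::fs::remove_file(&path)?;
    Ok(())
}
```

In the async version, `ReaderStream` plays the role of this loop, yielding `Bytes` chunks that the HTTP client forwards as they arrive.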
Test Plan
- `cargo clippy -- -D warnings` ✅
- `cargo test` — 610/610 pass (2 new tests added)
  - `test_build_multipart_stream_content_length` — verifies the declared `Content-Length` matches the expected preamble + file + postamble arithmetic
  - `test_build_multipart_stream_large_file` — 256 KB file (larger than the 64 KB chunk size) verifies multi-chunk content-length accuracy

New Dependencies
- `tokio-util = { version = "0.7", features = ["io"] }` — provides `ReaderStream`
- `bytes = "1"` — zero-copy byte buffers (already a transitive dependency via reqwest)