
Conversation

@emkornfield
Contributor

@emkornfield emkornfield commented Nov 19, 2025

Which issue does this PR close?

Split off from #1851

What changes are included in this PR?

This change honors the compression setting for the metadata.json file (write.metadata.compression-codec).

Are these changes tested?

Add unit test to verify files are gzipped when the flag is enabled.

BREAKING CHANGE: Make write_to take MetadataLocation
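To make the naming convention concrete, here is a minimal sketch of the suffix handling discussed in this PR: when write.metadata.compression-codec is "gzip", ".gz" is inserted before the ".metadata.json" suffix. The helper name apply_codec_suffix is illustrative only, not the PR's actual API.

```rust
// Illustrative sketch, not the PR's actual API: insert ".gz" before the
// ".metadata.json" suffix when the gzip codec is configured.
fn apply_codec_suffix(location: &str, codec: Option<&str>) -> String {
    match codec {
        Some("gzip") if location.ends_with(".metadata.json") => {
            // Safe: the guard above guarantees the suffix is present.
            let stem = location.strip_suffix(".metadata.json").unwrap();
            format!("{}.gz.metadata.json", stem)
        }
        // "none" or an absent property leaves the location unchanged.
        _ => location.to_string(),
    }
}
```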

@emkornfield
Contributor Author

CC @kevinjqliu @liurenjie1024

key: &str,
default: T,
) -> Result<T, anyhow::Error>
) -> crate::error::Result<T>
Contributor

Suggested change
) -> crate::error::Result<T>
) -> Result<T>

Unnecessary fully qualified name.

Contributor Author

done.

/// The target file size for files.
pub write_target_file_size_bytes: usize,
/// Compression codec for metadata files (JSON)
pub metadata_compression_codec: String,
Contributor

This should be something like Option<String>; "none" should be None in memory.

Contributor Author

done.


// Modify filename to add .gz before .metadata.json
let location = metadata_location.as_ref();
let new_location = if location.ends_with(".metadata.json") {
Contributor

I don't think we should simply replace the suffix with this approach. Currently all metadata location generation logic is in MetadataLocation; we should add a new method like the following and deprecate the old one:

impl MetadataLocation {
    #[deprecated(note = "Use new_with_table instead.")]
    pub fn new_with_table_location(table_location: impl ToString) -> Self {
        // ...
    }

    pub fn new_with_table(table: &Table) -> Self {
        // ...
    }
}

Contributor Author

@emkornfield emkornfield Nov 21, 2025

The table option seemed a little awkward given the control flow around how it is used; instead I added a method that takes table properties. I also added a similar method for incrementing the version number.

I modified this logic a bit and kept it. My concern here is that if we only deprecate new_with_table_location, users can still write out metadata.json files in a format that would be unreadable by other readers.

.map_err(from_aws_sdk_error)?;
let warehouse_location = get_resp.warehouse_location().to_string();
MetadataLocation::new_with_table_location(warehouse_location).to_string()
get_resp.warehouse_location().to_string()
Contributor Author

this is a semantic change because of the update to creation.location below; I need to investigate further what the correct thing to do here is.

Contributor

good point, from the comment above (L451-452) it seems like the "s3tables catalog" generates the metadata file location and does not allow any modification

so maybe we should just use

        let metadata_location =
            MetadataLocation::new_with_properties(metadata_location, None)
                .to_string(); 

so as to not do any path modification before writing

Contributor Author

I think the original code was wrong. As best I can tell from testing, warehouseLocation is actually the table location.

I've updated it to assign to table_location and then assign it below (as far as I can tell, the .location field is meant to be table location and not the metadata.json file).

Contributor

While I think this fix is correct, I don't think it's closely related to this PR. We should move it to a standalone PR.

Contributor Author

@emkornfield emkornfield Feb 2, 2026

We needed to change this anyway due to the reordering of calls, so it seemed better to fix it here. I'll open a separate PR.

Contributor Author

#2115 to make the change isolated.

Contributor

@kevinjqliu kevinjqliu left a comment

Generally LGTM, a few nit comments.
would be good to get a second pair of 👀


/// Creates a completely new metadata location starting at version 0,
/// with compression settings from the table's properties.
/// Only used for creating a new table. For updates, see `with_next_version`.
Contributor

Suggested change
/// Only used for creating a new table. For updates, see `with_next_version`.
/// Only used for creating a new table. For updates, see `with_next_version_and_properties`.

since with_next_version is also deprecated

Contributor Author

I'll wait until I understand @liurenjie1024's concerns about with_next_version_and_properties before updating this.


Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @emkornfield for this pr, it looks much better now; still needs some refinement.

/// Creates a new metadata location for an updated metadata file.
/// Takes table properties to determine compression settings, which may have changed
/// from the previous version.
pub fn with_next_version_and_properties(&self, properties: &HashMap<String, String>) -> Self {
Contributor

We don't need this.

Contributor Author

@emkornfield emkornfield Nov 21, 2025

Can you clarify what you mean by "we don't need this"? I updated crates/iceberg/src/catalog/mod.rs to use it; if new properties change compression settings, I'm not sure how this would be reflected otherwise. I might have missed it, but I don't think there is anything in the spec that prevents users from changing compression properties.

Contributor

If the properties have been changed, we should build a new instance of the MetadataLocation struct.

Contributor Author

If the properties have been changed, we should build a new instance of the MetadataLocation struct.

My main concern here is that I don't think clients of MetadataLocation would actually check for this when producing a new version of the metadata (i.e. the path of least resistance is to call with_next_version on whatever MetadataLocation they used for creation). As a solution, I've created a static factory next_version which combines the two steps for the only call site that exists for with_next_version. If you think it is better not to deprecate with_next_version, I can also revert this change.
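A rough sketch of the two combined steps being discussed (the five-digit-version file-name pattern is the Iceberg convention; the helper name and signature here are assumptions for illustration, not the PR's API):

```rust
// Hypothetical helper: Iceberg metadata files are conventionally named
// "<5-digit version>-<uuid>.metadata.json". A new commit bumps the version
// AND re-derives the compression suffix from the current table properties,
// so a codec change is picked up on the next write.
fn next_metadata_file_name(current_version: u64, uuid: &str, gzip: bool) -> String {
    let suffix = if gzip { ".gz.metadata.json" } else { ".metadata.json" };
    format!("{:05}-{}{}", current_version + 1, uuid, suffix)
}
```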


/// Compression codec for metadata files (JSON).
#[derive(Debug, PartialEq)]
pub enum MetadataCompressionCodec {
Contributor

We already have a CompressionCodec, we should move it to spec module and reuse it.

Contributor Author

Note that that compression codec doesn't actually include GZIP, which means extra validation for this code path. I'm not opposed to it, but given that it will take extra logic to validate options both for this PR and for the existing code in puffin (to disallow writing GZIP), and we also have a different concept for Avro files in #1851, it seems better to do the refactoring in a separate PR once everything is merged (conveniently reaching the rule of three?).

Contributor

I have concerns with this approach. OSS software development is somewhat different from enterprise software development: if everyone just moves forward with a lot of divergences and leaves the cleanup to others, the codebase will become a mess. I think a better approach would be to have another PR that moves the compression codec enum into the spec module; then we can move on with this PR after it is merged.

Contributor Author

OK, thank you for the guidance.

Contributor Author

Created #2081

Contributor Author

Updated to use that code.
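For illustration, parsing the property into the enum from the excerpt above might look roughly like this. The parse_codec name is an assumption; per the thread, the PR ultimately uses the shared CompressionCodec from #2081 instead.

```rust
/// Compression codec for metadata files (JSON), matching the excerpt above.
#[derive(Debug, PartialEq)]
enum MetadataCompressionCodec {
    None,
    Gzip,
}

// Illustrative sketch: an absent property or "none" means no compression;
// only "gzip" is accepted as a compressed option, everything else errors.
fn parse_codec(value: Option<&str>) -> Result<MetadataCompressionCodec, String> {
    match value.map(|v| v.to_lowercase()).as_deref() {
        None | Some("none") => Ok(MetadataCompressionCodec::None),
        Some("gzip") => Ok(MetadataCompressionCodec::Gzip),
        Some(other) => Err(format!("unsupported metadata compression codec: {}", other)),
    }
}
```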

liurenjie1024 pushed a commit that referenced this pull request Jan 29, 2026

## Which issue does this PR close?

As discussed in the PR for [allowing compressed
metadata.json](#1876 (comment))
writes, we want CompressionCodec in a central place so it can be used in
puffin and for other compression use cases. Happy to move it to `spec/`
as suggested in the original comment, but this seemed more natural.

## What changes are included in this PR?

This moves compression.rs to a top-level module (this seems like the best
location for common code, but please let me know if there is a better place).

- It adds Gzip as a compression option and replaces the current direct use
of the GzEncoder type.
- It adds validation in puffin that Gzip is not currently supported (per
spec).

## Are these changes tested?

Added unit tests to cover additions and existing tests cover the rest.
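A minimal sketch of the puffin guard mentioned above (the enum shape and function name are assumptions; the Puffin spec only allows uncompressed, LZ4, and Zstd blob compression, so a shared codec enum needs a rejection path for Gzip):

```rust
// Illustrative only: a shared codec enum reused across metadata and puffin.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CompressionCodec {
    None,
    Gzip,
    Lz4,
    Zstd,
}

// Guard sketch: puffin blob writes must reject Gzip per the Puffin spec.
fn validate_puffin_codec(codec: CompressionCodec) -> Result<(), String> {
    match codec {
        CompressionCodec::Gzip => Err("gzip is not supported for puffin blobs".to_string()),
        _ => Ok(()),
    }
}
```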
Contributor

@blackmwk blackmwk left a comment

Thanks @emkornfield for this pr, we are quite close!


@emkornfield
Contributor Author

emkornfield commented Feb 2, 2026

@blackmwk thanks for the review. I think we need to close on a direction for this: https://github.com/apache/iceberg-rust/pull/1876/files#r2752454670

  • I'll separate out the S3 fix into its own PR.
  • Use the new compression enum on MetadataLocation.

If we can make breaking changes then:

  1. If write_to takes a metadata location, then we don't need to store the compression codec on MetadataLocation at all (we can make it take a compression codec when generating the final path). (I actually need to double-check this; there might be other places that try to re-serialize the string.)
  2. We can eliminate a lot of the validation logic.

@blackmwk
Contributor

blackmwk commented Feb 3, 2026

Hi, @emkornfield

I'll separate out the S3 fix into its own PR.
Use the new compression enum on MetadataLocation.

I think this is the right direction to go.

If we can make breaking changes then:

If write_to takes a metadata location, then we don't need to store the compression codec on MetadataLocation at all (we can make it take a compression codec when generating the final path). (I actually need to double-check this; there might be other places that try to re-serialize the string.)
We can eliminate a lot of the validation logic.

This is the place where we have to make a breaking change, so I think it's fine.


#[test]
fn test_parse_metadata_file_compression_valid() {
use super::parse_metadata_file_compression;
Contributor Author

need to move these imports out and clean up this code a little bit.

@emkornfield
Contributor Author

@blackmwk thanks for the review; I believe I addressed the comments.



Development

Successfully merging this pull request may close these issues.

[Bug] iceberg-rust does not respect compression settings for metadata & avro
