Contributor

@ChaomingZhangCN ChaomingZhangCN commented Jan 7, 2026

Purpose

#38

This PR covers only basic reading and writing of SST format files; compression encoding and caching are not implemented yet. Putting too much code into a single PR would make it hard to review, so I will continue this work in future PRs.

DONE:
  • cache interface
  • memory slice & memory slice input & memory slice output
  • sst format file write & read
  • segment based bloomfilter & bitset
TODO:
  • block cache: lru?
  • compression: lz4, zstd, etc. (SIMD implementation?)
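
For the "block cache: lru?" item, one possible shape is a plain LRU map keyed by block offset. This is only a minimal sketch: `LruBlockCache`, its byte-vector values, and the evict-by-entry-count policy are hypothetical and are not part of this PR or its cache interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical LRU cache keyed by block offset (not the PR's Cache interface).
class LruBlockCache {
 public:
  using Block = std::vector<uint8_t>;

  explicit LruBlockCache(size_t capacity) : capacity_(capacity) {
    assert(capacity_ > 0);
  }

  // Returns a copy of the cached block and marks it most recently used.
  std::optional<Block> Get(uint64_t offset) {
    auto it = index_.find(offset);
    if (it == index_.end()) return std::nullopt;
    entries_.splice(entries_.begin(), entries_, it->second);
    return it->second->second;
  }

  // Inserts or overwrites a block; evicts the least recently used entry when full.
  void Put(uint64_t offset, Block block) {
    auto it = index_.find(offset);
    if (it != index_.end()) {
      it->second->second = std::move(block);
      entries_.splice(entries_.begin(), entries_, it->second);
      return;
    }
    if (entries_.size() == capacity_) {
      index_.erase(entries_.back().first);
      entries_.pop_back();
    }
    entries_.emplace_front(offset, std::move(block));
    index_[offset] = entries_.begin();
  }

  size_t Size() const { return entries_.size(); }

 private:
  using Entry = std::pair<uint64_t, Block>;
  size_t capacity_;
  std::list<Entry> entries_;  // front = most recently used
  std::unordered_map<uint64_t, std::list<Entry>::iterator> index_;
};
```

A real block cache would more likely bound memory by bytes than by entry count and would need locking for concurrent readers; both are left out here.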

Tests

./build/debug/paimon-common-sst-file-format-test
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from SstFileIOTest
[ RUN ] SstFileIOTest.TestSimple
[ OK ] SstFileIOTest.TestSimple (5 ms)
[ RUN ] SstFileIOTest.TestJavaCompatitable
[ OK ] SstFileIOTest.TestJavaCompatitable (2 ms)
[----------] 2 tests from SstFileIOTest (7 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (8 ms total)
[ PASSED ] 2 tests.

./build/debug/paimon-common-test
[----------] 1 test from BitSetTest
[ RUN ] BitSetTest.TestBitSet
[ OK ] BitSetTest.TestBitSet (0 ms)
[----------] 1 test from BitSetTest (0 ms total)

[----------] 5 tests from BloomFilterTest
[ RUN ] BloomFilterTest.TestOneSegmentBuilder
[ OK ] BloomFilterTest.TestOneSegmentBuilder (0 ms)
[ RUN ] BloomFilterTest.TestEstimatedHashFunctions
[ OK ] BloomFilterTest.TestEstimatedHashFunctions (0 ms)
[ RUN ] BloomFilterTest.TestBloomNumBits
[ OK ] BloomFilterTest.TestBloomNumBits (0 ms)
[ RUN ] BloomFilterTest.TestBloomNumHashFunctions
[ OK ] BloomFilterTest.TestBloomNumHashFunctions (0 ms)
[ RUN ] BloomFilterTest.TestBloomFilter
[ OK ] BloomFilterTest.TestBloomFilter (2 ms)
[----------] 5 tests from BloomFilterTest (2 ms total)

@ChaomingZhangCN ChaomingZhangCN changed the title from "feat: introduce sst file format for btree global index" to "feat: sst file format for btree global index" Jan 7, 2026
@ChaomingZhangCN ChaomingZhangCN changed the title from "feat: sst file format for btree global index" to "feat: Introduce sst file format for btree global index" Jan 7, 2026
@lucasfang lucasfang requested a review from Copilot January 12, 2026 01:40

Copilot AI left a comment

Pull request overview

This PR introduces the SST (Sorted String Table) file format infrastructure for the btree global index. It implements core read/write functionality for SST files including memory management utilities (MemorySlice, MemorySliceInput/Output), data structures (BloomFilter, BitSet), block-based storage (BlockWriter/Reader, BlockIterator), and a basic cache interface. The implementation focuses on fundamental functionality, deferring compression encoding and LRU caching to future work.

Changes:

  • Added memory slice abstractions for efficient I/O operations on memory segments
  • Implemented SST file format with block-based storage including writer, reader, and iterator components
  • Introduced BloomFilter and BitSet data structures for efficient key lookups
  • Added cache interface with placeholder NoCache implementation
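
To illustrate how a Bloom filter sits on top of a bit set, here is a minimal sketch. It is not the PR's segment-based implementation: `SimpleBitSet`, `SimpleBloomFilter`, the `std::hash` plus mixing step, and the double-hashing scheme are all assumptions made for illustration. The sizing helpers use the standard formulas m = -n ln(p) / (ln 2)^2 and k = round(m / n · ln 2).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in; the PR's BitSet is backed by a memory segment.
class SimpleBitSet {
 public:
  explicit SimpleBitSet(size_t num_bits)
      : num_bits_(num_bits), words_((num_bits + 63) / 64, 0) {
    assert(num_bits > 0);
  }
  void Set(size_t i) { words_[i / 64] |= (uint64_t{1} << (i % 64)); }
  bool Get(size_t i) const { return ((words_[i / 64] >> (i % 64)) & 1U) != 0; }
  size_t Size() const { return num_bits_; }

 private:
  size_t num_bits_;
  std::vector<uint64_t> words_;
};

// Number of bits for n expected entries at false-positive probability fpp.
inline size_t OptimalNumBits(size_t n, double fpp) {
  assert(n > 0 && fpp > 0.0 && fpp < 1.0);
  const double ln2 = std::log(2.0);
  return static_cast<size_t>(
      std::ceil(-static_cast<double>(n) * std::log(fpp) / (ln2 * ln2)));
}

// Number of hash functions for n expected entries and num_bits bits.
inline int OptimalNumHashFunctions(size_t n, size_t num_bits) {
  assert(n > 0);
  const double k =
      static_cast<double>(num_bits) / static_cast<double>(n) * std::log(2.0);
  return std::max(1, static_cast<int>(std::round(k)));
}

class SimpleBloomFilter {
 public:
  SimpleBloomFilter(size_t expected_entries, double fpp)
      : bits_(OptimalNumBits(expected_entries, fpp)),
        num_hashes_(OptimalNumHashFunctions(expected_entries, bits_.Size())) {}

  void Add(const std::string& key) {
    const uint64_t h = Mix(std::hash<std::string>{}(key));
    const uint64_t h1 = h & 0xFFFFFFFFu;
    const uint64_t h2 = h >> 32;
    // Double hashing: the i-th probe is h1 + i * h2.
    for (int i = 0; i < num_hashes_; ++i) bits_.Set((h1 + i * h2) % bits_.Size());
  }

  // False means definitely absent; true means possibly present.
  bool MightContain(const std::string& key) const {
    const uint64_t h = Mix(std::hash<std::string>{}(key));
    const uint64_t h1 = h & 0xFFFFFFFFu;
    const uint64_t h2 = h >> 32;
    for (int i = 0; i < num_hashes_; ++i) {
      if (!bits_.Get((h1 + i * h2) % bits_.Size())) return false;
    }
    return true;
  }

 private:
  // splitmix64 finalizer, used here only to spread std::hash output.
  static uint64_t Mix(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
  }

  SimpleBitSet bits_;
  int num_hashes_;
};
```

Because `std::hash` is implementation-defined, a filter built this way would not be portable across platforms or compatible with files written by Java; a file format needs a fixed hash function.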

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated 16 comments.

| File | Description |
| --- | --- |
| src/paimon/common/utils/bloom_filter.{h,cpp} | Segment-based Bloom filter implementation for probabilistic key membership testing |
| src/paimon/common/utils/bloom_filter_test.cpp | Comprehensive tests for Bloom filter including hash functions and multi-segment scenarios |
| src/paimon/common/utils/bit_set.{h,cpp} | Memory segment-backed bit set implementation used by Bloom filter |
| src/paimon/common/utils/bit_set_test.cpp | Tests for BitSet operations including set, get, and clear |
| src/paimon/common/sst/sst_file_writer.{h,cpp} | SST file writer with block and Bloom filter serialization |
| src/paimon/common/sst/sst_file_reader.{h,cpp} | SST file reader with point and range query support using block iterators |
| src/paimon/common/sst/sst_file_io_test.cpp | Integration test for SST file write and read with Bloom filter validation |
| src/paimon/common/sst/block_writer.{h,cpp} | Block writer supporting aligned/unaligned key-value pair storage |
| src/paimon/common/sst/block_reader.{h,cpp} | Block reader with support for both aligned and unaligned block formats |
| src/paimon/common/sst/block_iterator.{h,cpp} | Iterator for sequential and binary search access to block entries |
| src/paimon/common/sst/block_handle.{h,cpp} | Metadata structure for block location and size with serialization |
| src/paimon/common/sst/block_entry.h | Simple key-value pair container for block entries |
| src/paimon/common/sst/block_trailer.{h,cpp} | Block metadata including CRC32 checksum and compression type |
| src/paimon/common/sst/block_cache.{h,cpp} | Block cache for reading SST blocks with position-based caching |
| src/paimon/common/sst/block_aligned_type.h | Enumeration for aligned vs unaligned block storage modes |
| src/paimon/common/sst/bloom_filter_handle.h | Metadata for Bloom filter location in SST files |
| src/paimon/common/memory/memory_slice.{h,cpp} | Slice abstraction over memory segments with comparison operators |
| src/paimon/common/memory/memory_slice_input.{h,cpp} | Input stream over memory slices with variable-length integer support |
| src/paimon/common/memory/memory_slice_output.{h,cpp} | Output stream over memory slices with auto-resize capability |
| src/paimon/common/io/cache/cache.{h,cpp} | Cache interface with NoCache placeholder implementation |
| src/paimon/common/io/cache/cache_key.{h,cpp} | Position-based cache key with hash support |
| src/paimon/common/io/cache/cache_manager.{h,cpp} | Cache manager coordinating data and index caches |
| src/paimon/CMakeLists.txt | Build configuration updates for new SST and memory components |
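
The memory_slice_input entry mentions variable-length integer support. As a rough illustration, the sketch below uses the common scheme of 7 payload bits per byte with the high bit as a continuation flag. Whether this matches the PR's wire format byte for byte is an assumption, and `WriteVarUint32` / `ReadVarUint32` are hypothetical names operating on a plain `std::vector` rather than on the PR's memory slice classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Appends value to out, least significant 7-bit group first.
inline void WriteVarUint32(std::vector<uint8_t>& out, uint32_t value) {
  while (value >= 0x80) {
    // Low 7 bits of value with the continuation bit set.
    out.push_back(static_cast<uint8_t>(value | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
}

// Reads one varint starting at pos and advances pos past the bytes consumed.
// Returns std::nullopt on truncated input or when the encoding exceeds 5 bytes;
// pos may already have advanced in that case.
inline std::optional<uint32_t> ReadVarUint32(const std::vector<uint8_t>& in,
                                             size_t& pos) {
  assert(pos <= in.size());
  uint32_t result = 0;
  for (int shift = 0; shift < 35; shift += 7) {
    if (pos >= in.size()) return std::nullopt;
    const uint8_t byte = in[pos++];
    result |= static_cast<uint32_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return result;
  }
  return std::nullopt;
}
```

Small values such as key and value lengths take a single byte under this scheme, which is the usual reason block formats prefer it over fixed 4-byte lengths.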

Collaborator

@lxy-9602 lxy-9602 left a comment

Great work on the SST implementation! 👍 This is a solid contribution that significantly benefits the B-tree index for point lookups and range queries. Thanks for your effort and the valuable improvement!

ps: could you please make sure all comments have been addressed before we merge?

Collaborator

@lxy-9602 lxy-9602 left a comment

+1

@lxy-9602 lxy-9602 merged commit 2c4994c into alibaba:main Jan 21, 2026
8 checks passed