Skip to content

Fix DoFnInvoker cache collision for generic types#37355

Open
Eliaaazzz wants to merge 3 commits intoapache:masterfrom
Eliaaazzz:fix-37351-dofn-cache-collision
Open

Fix DoFnInvoker cache collision for generic types#37355
Eliaaazzz wants to merge 3 commits intoapache:masterfrom
Eliaaazzz:fix-37351-dofn-cache-collision

Conversation

@Eliaaazzz
Copy link
Contributor

@Eliaaazzz Eliaaazzz commented Jan 20, 2026

This fixes a bug where ByteBuddyDoFnInvokerFactory would return the same cached invoker for different generic instantiations of the same DoFn class.

Changes:

  1. Introduced InvokerCacheKey with TypeDescriptors to ensure unique cache entries.
  2. Updated generateInvokerClass to append type-based hash suffix.
  3. Added regression test (testCacheKeyCollisionProof).

Fixes #37351

Description

This PR addresses a critical bug in the ByteBuddyDoFnInvokerFactory where the caching mechanism for generated DoFnInvoker classes failed to distinguish between different generic instantiations of the same DoFn class.

1. The Problem: Cache Collision on Generic Types

The previous implementation utilized a ConcurrentHashMap<Class<?>, Constructor<?>> to cache the generated invoker constructors. The cache key was solely the DoFn class itself.

In scenarios involving generic DoFn classes (e.g., class MyGenericFn<T> extends DoFn<T, T>), the raw class object is identical regardless of the type parameter T.

  • Scenario: If a pipeline instantiates MyGenericFn<String> and later MyGenericFn<Integer>, the factory would generate an invoker for String and cache it.
  • The Bug: When MyGenericFn<Integer> is requested, the factory hits the cache using the same MyGenericFn.class key and returns the String-specialized invoker.
  • Consequence: This leads to runtime ClassCastException or incorrect Coder inference, as the bytecode for the invoker is specialized for the wrong type.

2. The Solution: Type-Aware Caching & Naming

This PR introduces a composite cache key and updates the class naming strategy to ensure full isolation between generic types.

A. Introduced InvokerCacheKey

I replaced the simple Class<?> key with a new static inner class InvokerCacheKey. This key encapsulates three components:

  1. fnClass: The raw class of the DoFn.
  2. inputType: The TypeDescriptor for the input element.
  3. outputType: The TypeDescriptor for the output element.

By implementing equals() and hashCode() based on all three fields, we ensure that MyGenericFn<String> and MyGenericFn<Integer> map to distinct cache entries.

B. Unique Class Naming Strategy

ByteBuddy cannot define two different classes with the exact same name in the same ClassLoader. To support multiple generated invokers for the same raw DoFn class, I modified generateInvokerClass.

  • Old Strategy: <DoFnClassName>$DoFnInvoker
  • New Strategy: <DoFnClassName>$DoFnInvoker$<TypeHash>

The suffix now includes a hexadecimal hash derived from the input and output TypeDescriptors. This guarantees that the generated bytecode class names are unique for each generic instantiation.

Test Plan

I have added a regression test testCacheKeyCollisionProof to DoFnInvokersTest.java.

  • Methodology: The test defines a local generic class DynamicTypeDoFn<T> and manually forces specific TypeDescriptor returns.
  • Verification: It instantiates the DoFn twice (once for String, once for Integer) and asserts that the generated Invoker classes are not the same (assertNotSame).
  • Result: The test passes with the fix (confirming distinct invokers are generated) and fails without it.

Existing tests passed:

  • DoFnInvokersTest

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

This fixes a bug where ByteBuddyDoFnInvokerFactory would return the same cached invoker for different generic instantiations of the same DoFn class.

Changes:
1. Introduced InvokerCacheKey with TypeDescriptors to ensure unique cache entries.
2. Updated generateInvokerClass to append type-based hash suffix.
3. Added regression test (testCacheKeyCollisionProof).

Fixes apache#37351
@github-actions github-actions bot added the java label Jan 20, 2026
@Eliaaazzz Eliaaazzz force-pushed the fix-37351-dofn-cache-collision branch from 6fcdeef to fe957ea Compare January 20, 2026 09:56
@github-actions
Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@Eliaaazzz Eliaaazzz force-pushed the fix-37351-dofn-cache-collision branch from fe957ea to 4d337b6 Compare January 20, 2026 15:08
This PR fixes a critical issue where ByteBuddyDoFnInvokerFactory failed to distinguish between different generic instantiations of the same DoFn class (e.g., MyFn<String> vs MyFn<Integer>).

1. Cache Key Strategy: Introduced InvokerCacheKey to include input/output TypeDescriptors in the cache lookup.
2. Class Naming: Updated generateInvokerClass to append a type-based hash suffix to ensure unique class names.
3. Robustness (The Fix): Added defensive try-catch blocks when accessing TypeDescriptors.
   - Some internal transforms (like MapElements) throw IllegalStateException if getOutputTypeDescriptor() is called after serialization.
   - In these cases, the factory now gracefully falls back to using Object.class (legacy behavior), ensuring backward compatibility for transforms that do not retain type information at runtime.

Fixes apache#37351
@Eliaaazzz Eliaaazzz force-pushed the fix-37351-dofn-cache-collision branch from 4d337b6 to f0271f0 Compare January 21, 2026 03:48
@github-actions
Copy link
Contributor

Assigning reviewers:

R: @chamikaramj for label java.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

Copy link
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is great. I just had a few bits to polish it up.


@Override
public int hashCode() {
int result = fnClass.hashCode();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using Objects.hash(fnClass, inputType, outputType) along with Objects.equals which makes it very easy to maintain consistency and clarity.


@Override
public String toString() {
return String.format(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here using ToStringHelper keeps a consistent format across all our classes.

try {
inputType = fn.getInputTypeDescriptor();
} catch (Exception e) {
// Some DoFns (like MapElements) throw IllegalStateException if queried after
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested in digging in to this. Will we still have cache collision problems in this case?

// different generic instantiations of the same DoFn class.
// Format: <DoFn class name>$<DoFnInvoker>$<type hash>
TypeDescriptor<Void> voidType = new StableNameTestDoFn().getInputTypeDescriptor();
String expectedTypeSuffix =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than repeat the logic of producing this esoteric suffix, go ahead and make it a package-private, or even public method. (the whole module should be @Internal so public methods are not frozen APIs)

Copy link
Contributor Author

@Eliaaazzz Eliaaazzz Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kennknowles Thanks for the review! I have updated the code to use ToStringHelper and Objects.hash as suggested, and extracted the suffix logic to clean up the tests.

Regarding the cache collision in the fallback case:

You are exactly right—if type lookup fails, different generic instantiations like MyDoFn and MyDoFn will map to the same CacheKey.

However, it's important to highlight that type lookup failure is often an objective side effect of deserialization in a distributed environment. When a DoFn is serialized and transmitted across workers, specific generic type information can be lost due to Java's type erasure or classloader limitations on the worker side. This is an inherent constraint we must handle.

This collision is safe and intentional for two reasons:

Class Isolation: The cache key still includes fnClass, so different DoFn implementations will never share an invoker.

Erasure Compatibility: When we fall back to Object, we generate a "Raw Invoker". Due to Java type erasure, the underlying method in the bytecode acts as processElement(Object). Thus, a single shared "Raw Invoker" is perfectly compatible with any generic instantiation of that class.

I have added a comment in the catch block to explicitly mention that this fallback is a resilient design for cases where type information is lost during deserialization, ensuring the system remains functional even when reflection is limited.

@Eliaaazzz Eliaaazzz force-pushed the fix-37351-dofn-cache-collision branch from bd5a070 to f3a1596 Compare January 24, 2026 02:25
- Replace String.format in toString() with MoreObjects.toStringHelper for consistency.
- Update hashCode() and equals() to use java.util.Objects utility methods.
- Extract type suffix generation logic into a reusable static method to avoid duplication in tests.
@Eliaaazzz Eliaaazzz force-pushed the fix-37351-dofn-cache-collision branch from f3a1596 to 648ab3c Compare January 24, 2026 04:45
@github-actions
Copy link
Contributor

Reminder, please take a look at this pr: @chamikaramj

@github-actions
Copy link
Contributor

github-actions bot commented Feb 4, 2026

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @ahmedabu98 for label java.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: ByteBuddyDoFnInvokerFactory cache collision for DoFns with different generic types

2 participants