[WIP] POC for serializer changes #54075

Yicong-Huang · 2026-01-30T21:47:52Z

To be filled

Extract the struct flattening/wrapping logic from ArrowStreamUDFSerializer into reusable transformers in a new transformers.py module.

…use-flatten-struct' into SPARK-55176/refactor/extract-arrow-to-pandas-converter

Resolved conflicts by:
- Adopting create_converter architecture while keeping arrow_to_pandas compatibility
- Using _load_group_dataframes helper from upstream
- Removing GroupArrowUDFSerializer (not in upstream)
- Keeping input_type (singular) parameter name

…er/mapper layers

This refactoring separates concerns between serialization and data transformation:
- Serializers now only handle data serialization/deserialization
- Wrappers handle UDF wrapping and data format conversion
- Mappers handle UDF result aggregation using transformer utilities

Key changes:

1. Simplified serializer hierarchy by removing redundant classes:
   - Removed ArrowStreamPandasUDFSerializer, ArrowStreamGroupUDFSerializer, ArrowStreamArrowUDFSerializer, ArrowStreamUDTFSerializer
   - Renamed ArrowStreamMapIterSerializer to ArrowStreamGroupSerializer
   - Unified grouped/non-grouped UDF handling in ArrowStreamGroupSerializer
2. Introduced transformer utility classes:
   - ArrowBatchTransformer: Arrow batch operations (wrap_struct, flatten_struct, partial_batch, partial_table, concat_batches, merge_batches, reorder_columns)
   - PandasBatchTransformer: Pandas/Arrow conversions (to_arrow, concat_series_batches)
3. Moved data transformations from serializers to wrappers/mappers:
   - Moved to_arrow conversion from mappers to wrappers for Pandas agg UDFs
   - Wrappers now return RecordBatch directly instead of (result, arrow_type) tuples
   - Mappers simplified to use transformer methods for common operations

Benefits:
- Clearer separation of concerns
- Reduced code duplication through transformer utilities
- Easier to maintain and extend
- Consistent data format handling across UDF types

- Use ArrowBatchTransformer.zip_batches for type coercion instead of manual loop
- Simplify error handling logic
- Handle empty struct case properly
- Unwrap wrapped batches from worker before type coercion, then wrap back for JVM
- 46 of 47 UDTF tests pass (1 known failure unrelated to this change)

github-actions · 2026-01-30T21:48:01Z

⚠️ Pull Request Title Validation

This pull request title does not contain a JIRA issue ID.

Please update the title to either:

Include a JIRA ID: [SPARK-12345] Your description
Mark as minor change: [MINOR] Your description

For minor changes that don't require a JIRA ticket (e.g., typo fixes), please prefix the title with [MINOR].

This comment was automatically generated by GitHub Actions

This commit fixes several issues with pandas UDF handling after the serializer refactoring:

1. Fix parameter initialization order in read_udfs():
   - Move pandas_udf_* parameter defaults BEFORE the if-elif chain
   - Previously they were reset AFTER being set, causing df_for_struct to always be False for scalar pandas UDFs
2. Add struct_in_pandas="dict" to scalar UDF wrappers:
   - wrap_scalar_pandas_udf: enables DataFrame→struct array conversion
   - wrap_pandas_batch_iter_udf: same fix for iter variant
3. Fix grouped map UDF column matching:
   - Use assign_cols_by_name to match DataFrame columns by name when available, otherwise by position
   - Handle empty DataFrame (0 columns) by creating empty struct array
4. Fix Arrow batch handling:
   - zip_batches: convert items to list for pa.RecordBatch.from_arrays
   - mapper: special handling for SQL_ARROW_BATCHED_UDF to return raw result instead of calling zip_batches
5. Fix error handling in create_array:
   - Only catch ArrowInvalid for arrow_cast retry (not ArrowTypeError)
   - Add ArrowTypeError to TypeError handler for proper error messages
   - Update error message format to match expected test output

All 264+ pandas UDF tests pass including struct type tests.
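The initialization-order bug in item 1 is easy to reproduce in isolation. The sketch below is a simplified stand-in, not the actual read_udfs() code: the function names, the `eval_type` strings, and the use of a dict for the flags are all hypothetical.

```python
def resolve_flags_buggy(eval_type: str) -> dict:
    """Defaults assigned AFTER the branch overwrite whatever the branch set."""
    flags = {}
    if eval_type == "scalar_pandas":
        flags["df_for_struct"] = True
    elif eval_type == "grouped_map":
        flags["df_for_struct"] = False
    # Bug: this "default" runs last and clobbers the value chosen above.
    flags["df_for_struct"] = False
    return flags


def resolve_flags_fixed(eval_type: str) -> dict:
    """Defaults assigned BEFORE the branch can be overridden per UDF type."""
    flags = {"df_for_struct": False}
    if eval_type == "scalar_pandas":
        flags["df_for_struct"] = True
    return flags


buggy = resolve_flags_buggy("scalar_pandas")
fixed = resolve_flags_fixed("scalar_pandas")
```

With the defaults moved ahead of the if-elif chain, only the branches that need a non-default value have to mention the flag at all.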

Remove timezone, int_to_decimal_coercion_enabled, and assign_cols_by_name parameters that were stored but never used by the serializer or subclasses.

…to-pandas conversion

- Extract common verification functions: verify_result_length, verify_result_type, verify_is_iterable, verify_element_type
- Simplify wrapper functions using common verification utilities
- Centralize Arrow-to-pandas conversion in read_udfs mapper/func
- Remove unused pandas_udf_* variables from read_udfs
- Fix is_scalar_pandas_iter to not convert Arrow iter UDFs to pandas
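As a rough illustration of how shared verification helpers can shrink the wrappers, here is a pure-Python sketch. Only the four function names come from the commit message; their signatures, the error messages, and the `wrap_scalar` wrapper are assumptions, and plain lists stand in for pandas or Arrow data.

```python
from collections.abc import Iterable, Iterator
from typing import Any, Callable


def verify_result_type(result: Any, expected_type: type, udf_kind: str) -> Any:
    if not isinstance(result, expected_type):
        raise TypeError(
            f"{udf_kind} UDF returned {type(result).__name__}, "
            f"expected {expected_type.__name__}"
        )
    return result


def verify_result_length(result: Any, expected_length: int, udf_kind: str) -> Any:
    if len(result) != expected_length:
        raise ValueError(
            f"{udf_kind} UDF returned {len(result)} rows, expected {expected_length}"
        )
    return result


def verify_is_iterable(result: Any, udf_kind: str) -> Any:
    if not isinstance(result, Iterable):
        raise TypeError(f"{udf_kind} UDF must return an iterable")
    return result


def verify_element_type(elements: Iterable, expected_type: type, udf_kind: str) -> Iterator:
    # Lazily checks each element as the caller consumes the iterator.
    for element in elements:
        yield verify_result_type(element, expected_type, udf_kind)


def wrap_scalar(udf: Callable[[list], Any]) -> Callable[[list], list]:
    """Hypothetical wrapper: all checks are delegated to the shared helpers."""
    def wrapped(batch: list) -> list:
        result = verify_result_type(udf(batch), list, "scalar")
        return verify_result_length(result, len(batch), "scalar")
    return wrapped


doubled = wrap_scalar(lambda xs: [x * 2 for x in xs])([1, 2, 3])
```

Each wrapper then reduces to calling the UDF and chaining the relevant checks, so a change to an error message or a check happens in one place.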

- Separate iter UDF branches by type for clarity (scalar pandas/arrow, map pandas, map arrow)
- Wrapper functions now return (result, arrow_return_type) tuples, with output conversion centralized in mapper/func
- Grouped map pandas UDFs now receive Iterator[DataFrame] directly instead of Iterator[List[Series]]
- Inline concat_series_batches and series_batches_to_dataframe methods

…implify-serializers

Yicong-Huang added 30 commits January 23, 2026 17:40

[SPARK-55162][PYTHON] Extract transformers from ArrowStreamUDFSerializer

c358c44

fix: format

d0c2644

refactor: simplify

fc23683

refactor: use function and apply maps

1464cfb

refactor: move to conversion.py

5f9627b

fix: format

b396e10

fix: keep wrapper

f6dbc95

refactor: use transformer for GroupArrowUDFSerializer

535ad45

refactor: use flatten_struct in ArrowStreamArrowUDTFSerializer

26b0a70

fix: import

8467a59

refactor: extract converter logic out to Conversion

31370e2

fix: format

71c5e49

refactor: simplify and add comments

84bdb21

fix: type annotation

7ce137a

Merge remote-tracking branch 'origin/SPARK-55169/refactor/arrow-udtf-…

aa33b98

…use-flatten-struct' into SPARK-55176/refactor/extract-arrow-to-pandas-converter

refactor: extract to_pandas transformer

6a0e897

refactor: simplify

06aeec7

fix: revert changes

56f6a37

revert: bring back convert_legacy

4bdf46c

fix: comments

eecdd13

fix: type

9efc623

fix: unused import

e5c6ad1

refactor: use spark_type from callsite

1c7c9ed

fix: use classmethod

cbb3a90

fix: simplify

bec3f44

fix: import and doc

cf51876

refactor: extract to_pandas transformer

78f2920

Yicong-Huang marked this pull request as draft January 30, 2026 21:47

github-actions bot added SQL CORE PYTHON CONNECT labels Jan 30, 2026

Yicong-Huang added 7 commits January 31, 2026 00:51

refactor: remove unused ArrowStreamGroupSerializer parameters

bf3ba3f

fix: remove unused imports to fix linting errors

63b6fbe

fix: correct SQL_BATCHED_UDF and legacy Arrow UDF handling

79d7c41

Merge remote-tracking branch 'upstream/master' into SPARK-55175/poc/s…

ba21e3e

…implify-serializers
