[SPARK-55224][PYTHON] Use Spark DataType as ground truth in Pandas-Arrow serialization#53992
Yicong-Huang wants to merge 33 commits into apache:master
Conversation
JIRA Issue Information: Improvement SPARK-55224 (this comment was automatically generated by GitHub Actions)
…tor/spark-type-ground-truth
…tor/spark-type-ground-truth # Conflicts: # python/pyspark/sql/pandas/serializers.py
```
spark_type : DataType, optional
    If None, spark type converted from arrow_type will be used
arrow_cast: bool, optional
    The Spark type to use. If None, pyarrow's inferred type will be used.
```
wait, the spark type here is the return type?
I think it should never be None?
Updated. There are some cases where the Spark type is not available; in those cases we use from_arrow_type to derive it.
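A minimal sketch of this fallback, using a hypothetical toy mapping in place of PySpark's real `from_arrow_type` (the dictionary and the `resolve_spark_type` helper below are illustrative stand-ins, not the actual code):

```python
# Toy stand-in for from_arrow_type(): the real function maps pyarrow types
# to Spark DataTypes. This mapping is a hypothetical subset for illustration.
_ARROW_TO_SPARK = {
    "int64": "LongType",
    "double": "DoubleType",
    "string": "StringType",
}


def resolve_spark_type(spark_type, arrow_type_name):
    """Prefer the caller-provided Spark type; otherwise derive it from Arrow."""
    if spark_type is not None:
        return spark_type
    return _ARROW_TO_SPARK[arrow_type_name]


print(resolve_spark_type(None, "int64"))           # derived from the Arrow type
print(resolve_spark_type("StringType", "string"))  # caller-provided type wins
```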
createDataFrame (conversion.py, connect/session.py): spark_type can be None for non-timestamp columns when the user doesn't provide a schema. This is existing behavior on master:
```python
spark_types = [
    TimestampType() if is_datetime64_dtype(t) ...
    else None  # Non-timestamp columns get None
    for t in data.dtypes
]
```
And later when the type is None (in this case, both spark type and arrow type will be None in master), pyarrow will try to infer.
```python
return pa.Array.from_pandas(
    series, mask=mask, type=None, safe=self._safecheck
)
```
> type : pyarrow.DataType, optional
> If not provided, the Arrow type is inferred from the pandas dtype.
can we factor out the usage of createDataFrame first?
That makes sense, thanks for the suggestion.
I want to make sure I understand what you mean by factoring out the createDataFrame usage.
In this PR I am not changing the createDataFrame behavior. Spark still allows users to pass an optional schema (see doc here), and when it is not provided, we can end up without a Spark type at this stage. That is already part of the current behavior. So even if we refactor and isolate the createDataFrame-related logic, we would still have cases where the Spark type is None before Arrow conversion.
Are you suggesting that instead we should make createDataFrame always let Arrow infer the type first, and then convert that inferred Arrow type back into a Spark type, so that downstream we can assume the Spark type is always defined?
If so, maybe we can do it after this PR, and then change the optional spark type to be required?
one major goal of this refactoring is to eliminate confusion in UDF.
> Are you suggesting that instead we should make createDataFrame always let Arrow infer the type first, and then convert that inferred Arrow type back into a Spark type, so that downstream we can assume the Spark type is always defined?
this can be a solution if it works. Otherwise, we should copy the necessary code to createDataFrame.
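The "infer first, then map back" idea could be sketched roughly like this (a pure-Python toy: `infer_arrow_type`, `arrow_to_spark`, and their mappings are hypothetical stand-ins for pyarrow's inference and `from_arrow_type`, not PySpark code):

```python
def infer_arrow_type(values):
    # Crude stand-in for pyarrow's inference from the first non-null value.
    sample = next((v for v in values if v is not None), None)
    return {int: "int64", float: "double", str: "string"}.get(type(sample), "null")


def arrow_to_spark(arrow_type):
    # Stand-in for from_arrow_type().
    return {"int64": "LongType", "double": "DoubleType", "string": "StringType"}[arrow_type]


def resolve(values, spark_type=None):
    # If no schema was provided, infer the Arrow type and convert it back,
    # so downstream code can always assume spark_type is defined.
    if spark_type is None:
        spark_type = arrow_to_spark(infer_arrow_type(values))
    return spark_type


print(resolve([1, 2, 3]))                         # inferred, then mapped back
print(resolve([1, 2], spark_type="IntegerType"))  # an explicit schema wins
```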
Got it. I went ahead and implemented the above logic. The PR is open here: #54092!
@zhengruifeng this is ready as well!
Combine prefers_large_types from HEAD with struct_in_pandas, ndarray_as_list, df_for_struct parameters from upstream/master.
merged to master
```python
"row"
if (
    eval_type == PythonEvalType.SQL_ARROW_BATCHED_UDF
    or runner_conf.use_legacy_pandas_udf_conversion
```
@Yicong-Huang why add `or runner_conf.use_legacy_pandas_udf_conversion` here?
`use_legacy_pandas_udf_conversion` is supposed to take effect only in SQL_ARROW_BATCHED_UDF.
Suppose the eval type is SQL_SCALAR_PANDAS_UDF and the config `use_legacy_pandas_udf_conversion` is true: would `struct_in_pandas` then be changed from dict to row?
Good catch! You're right: the `or runner_conf.use_legacy_pandas_udf_conversion` is redundant here. When `use_legacy_pandas_udf_conversion=True` and `eval_type=SQL_ARROW_BATCHED_UDF`, the earlier `elif` condition (line 2759) doesn't match, so execution falls through to the `else` branch, and the `or` clause is unnecessary for that case.
And as you pointed out, it has an unintended side effect: if `use_legacy_pandas_udf_conversion=True` while `eval_type` is something else (e.g., SQL_SCALAR_PANDAS_UDF), it would incorrectly change `struct_in_pandas` from `"dict"` to `"row"`.
I've created a follow-up PR to fix this: #54212
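A toy reproduction of the issue discussed above (the eval-type constants and the two helper functions are plain-Python stand-ins for illustration, not the real `PythonEvalType` values or worker.py code):

```python
# Hypothetical sentinel values standing in for PythonEvalType constants.
SQL_ARROW_BATCHED_UDF = "arrow_batched"
SQL_SCALAR_PANDAS_UDF = "scalar_pandas"


def struct_in_pandas_buggy(eval_type, use_legacy):
    # As merged: the `or use_legacy` clause leaks into non-Arrow eval types.
    return "row" if (eval_type == SQL_ARROW_BATCHED_UDF or use_legacy) else "dict"


def struct_in_pandas_fixed(eval_type, use_legacy):
    # Follow-up fix: the legacy flag only matters for SQL_ARROW_BATCHED_UDF,
    # which already selects "row", so the extra clause can simply be dropped.
    return "row" if eval_type == SQL_ARROW_BATCHED_UDF else "dict"


# A pandas scalar UDF with the legacy flag on:
print(struct_in_pandas_buggy(SQL_SCALAR_PANDAS_UDF, True))  # "row" (wrong)
print(struct_in_pandas_fixed(SQL_SCALAR_PANDAS_UDF, True))  # "dict" (expected)
```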
```python
)
ndarray_as_list = (
    eval_type == PythonEvalType.SQL_ARROW_BATCHED_UDF
    or runner_conf.use_legacy_pandas_udf_conversion
```
What changes were proposed in this pull request?

Let `_create_batch` and `_create_array` in PySpark's Pandas serializers use Spark's `DataType` as the single source of truth, deriving Arrow types internally when needed.

Before: Callers in `worker.py` pre-computed `arrow_return_type = to_arrow_type(return_type, ...)` and passed both `arrow_type` and `spark_type` through the serialization pipeline.

After: Callers pass only `spark_type` (a Spark `DataType`). The serializers derive `arrow_type` internally via `to_arrow_type()`.

Key changes:
- `worker.py` updated to yield `return_type` instead of `arrow_return_type`
- Arrow-native serializers (`ArrowStreamArrowUDFSerializer`) unchanged - they still pass `arrow_type` directly

Why are the changes needed?
- `spark_type` is the canonical type representation defined by users; `arrow_type` is derived from it
- `_create_batch` and `_create_array` now follow the same pattern

Does this PR introduce any user-facing change?
No. This is an internal refactoring with no user-facing API changes.
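The before/after shape of the refactor can be sketched like this (simplified, hypothetical signatures; `to_arrow_type` here is a toy mapping, not PySpark's real function):

```python
def to_arrow_type(spark_type):
    # Toy stand-in for pyspark.sql.pandas.types.to_arrow_type().
    return {"LongType": "int64", "StringType": "string"}[spark_type]


# Before: worker.py pre-computed the Arrow type and threaded both through.
def create_array_before(values, arrow_type, spark_type):
    return (arrow_type, spark_type, values)


# After: callers pass only the Spark type; the Arrow type is derived inside,
# so the Spark DataType is the single source of truth.
def create_array_after(values, spark_type):
    arrow_type = to_arrow_type(spark_type)
    return (arrow_type, spark_type, values)


print(create_array_after([1, 2], "LongType"))  # ('int64', 'LongType', [1, 2])
```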
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
No.