fix: HashJoin panic with dictionary-encoded columns in multi-key joins#20441
fix: HashJoin panic with dictionary-encoded columns in multi-key joins#20441alamb merged 4 commits intoapache:mainfrom
Conversation
b1328a2 to
ffc5b55
Compare
jonathanc-n
left a comment
There was a problem hiding this comment.
This looks good, just some small comments
datafusion/core/tests/sql/joins.rs
Outdated
| } | ||
|
|
||
| // Issue #20437: https://github.com/apache/datafusion/issues/20437 | ||
| #[tokio::test] |
There was a problem hiding this comment.
We want to keep unit tests at a minimum when possible. Use sqllogictests instead here
| fn flatten_dictionary_array(array: &ArrayRef) -> ArrayRef { | ||
| downcast_dictionary_array! { | ||
| array => { | ||
| fn flatten_dictionary_array(array: &ArrayRef) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Shall we rename the function?
There was a problem hiding this comment.
It seems to me like it still flattens dictionaries. why would we rename it?
| fn flatten_dictionary_array(array: &ArrayRef) -> ArrayRef { | ||
| downcast_dictionary_array! { | ||
| array => { | ||
| fn flatten_dictionary_array(array: &ArrayRef) -> Result<ArrayRef> { |
There was a problem hiding this comment.
It seems to me like it still flattens dictionaries. why would we rename it?
| } | ||
|
|
||
| #[test] | ||
| fn test_build_multi_column_inlist_with_dictionary() { |
There was a problem hiding this comment.
I am not sure this unit test adds much value -- it just basically reiterates how the current function works (it is testing some intermediate state
I double checked that the .slt test fails without the code in this PR
| fn flatten_dictionary_array(array: &ArrayRef) -> Result<ArrayRef> { | ||
| match array.data_type() { | ||
| DataType::Dictionary(_, value_type) => { | ||
| let casted = cast(array, value_type)?; |
There was a problem hiding this comment.
I messed around with this PR and I don't really understand why the code is flattening arrays at all
I removed the flattening code entirely and all the code seems to pass. I'll make a follow on PR
There was a problem hiding this comment.
Just deleting the code seems to have worked
There was a problem hiding this comment.
I should have followed my instincts a bit more. It turns out we found another issue with this code at InfluxData
Thankfully the following PR fixes it:
|
I also made a backport PR here |
apache#20441) - Closes apache#20437 `flatten_dictionary_array` returned only the unique values rather then the full expanded array when being called on a `DictionaryArray`. When building a `StructArray` this caused a length mismatch panic. Replaced `array.values()` with `arrow::compute::cast(array, value_type)` in `flatten_dictionary_array`, which properly expands the dictionary into a full length array matching the row count. Yes, both a new unit test aswell as a regression test were added. Nope --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
apache#20441) - Closes apache#20437 `flatten_dictionary_array` returned only the unique values rather then the full expanded array when being called on a `DictionaryArray`. When building a `StructArray` this caused a length mismatch panic. Replaced `array.values()` with `arrow::compute::cast(array, value_type)` in `flatten_dictionary_array`, which properly expands the dictionary into a full length array matching the row count. Yes, both a new unit test aswell as a regression test were added. Nope --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
#20505) ## Which issue does this PR close? - Fixes #20696 - Follow on to #20441 ## Rationale for this change #20441 (review) fixes the special case DictionaryArray handling in Joins. However, I don't think we need to special case DictionaryArrays at all ## What changes are included in this PR? 1. Remove the special case dictionary handling ## Are these changes tested? Yes by CI ## Are there any user-facing changes? No (though maybe some queries get faster)
apache#20505) ## Which issue does this PR close? - Fixes apache#20696 - Follow on to apache#20441 ## Rationale for this change apache#20441 (review) fixes the special case DictionaryArray handling in Joins. However, I don't think we need to special case DictionaryArrays at all ## What changes are included in this PR? 1. Remove the special case dictionary handling ## Are these changes tested? Yes by CI ## Are there any user-facing changes? No (though maybe some queries get faster)
apache#20505) - Fixes apache#20696 - Follow on to apache#20441 apache#20441 (review) fixes the special case DictionaryArray handling in Joins. However, I don't think we need to special case DictionaryArrays at all 1. Remove the special case dictionary handling Yes by CI No (though maybe some queries get faster)
apache#20505) ## Which issue does this PR close? - Fixes apache#20696 - Follow on to apache#20441 ## Rationale for this change apache#20441 (review) fixes the special case DictionaryArray handling in Joins. However, I don't think we need to special case DictionaryArrays at all ## What changes are included in this PR? 1. Remove the special case dictionary handling ## Are these changes tested? Yes by CI ## Are there any user-facing changes? No (though maybe some queries get faster)
Which issue does this PR close?
Rationale for this change
flatten_dictionary_arrayreturned only the unique values rather then the full expanded array when being called on aDictionaryArray. When building aStructArraythis caused a length mismatch panic.What changes are included in this PR?
Replaced
array.values()witharrow::compute::cast(array, value_type)inflatten_dictionary_array, which properly expands the dictionary into a full length array matching the row count.Are these changes tested?
Yes, both a new unit test aswell as a regression test were added.
Are there any user-facing changes?
Nope