
GH-49410: [C++] Add regression test for if_else with sliced BaseBinary chunks#49443

Open
Ebraam-Ashraf wants to merge 2 commits into apache:main from Ebraam-Ashraf:GH-49410-fix-if-else-sliced-string-chunks

@Ebraam-Ashraf

Rationale for this change

if_else with a null scalar and a sliced BaseBinary array (offset != 0) produces invalid output. The ASA and AAS shortcut paths in scalar_if_else.cc copy offsets without adjusting for the slice offset, and copy data from byte 0 instead of data + offsets[0].
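To make the slice layout concrete, here is a minimal sketch using plain standard-library types rather than Arrow's classes. `BinaryView`, `SliceFrom`, `RawOffset`, and `ValueAt` are hypothetical names introduced only for illustration. The sketch assumes a slice shares its parent's buffers and changes only its logical offset and length, so the first raw offset seen through a slice is generally not 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for a variable-length binary array (not Arrow's API).
// A slice shares the parent's buffers and only changes `offset` and `length`.
struct BinaryView {
  const std::vector<int32_t>* offsets;  // parent's offsets: parent length + 1 entries
  const std::string* data;              // parent's value bytes
  int64_t offset;                       // logical start of this view in the parent
  int64_t length;                       // number of values in this view
};

// Drop the first `start` values without copying any buffer.
BinaryView SliceFrom(const BinaryView& v, int64_t start) {
  assert(start >= 0 && start <= v.length);
  return BinaryView{v.offsets, v.data, v.offset + start, v.length - start};
}

// Raw offset of slot i (0 <= i <= length), exactly as stored in the shared buffer.
int32_t RawOffset(const BinaryView& v, int64_t i) {
  assert(i >= 0 && i <= v.length);
  return (*v.offsets)[static_cast<std::size_t>(v.offset + i)];
}

// Value i of the view, read through the shared buffers.
std::string ValueAt(const BinaryView& v, int64_t i) {
  assert(i >= 0 && i < v.length);
  const int32_t begin = RawOffset(v, i);
  const int32_t end = RawOffset(v, i + 1);
  return v.data->substr(static_cast<std::size_t>(begin),
                        static_cast<std::size_t>(end - begin));
}
```

Reading values through the view works because every access goes through the raw offsets. Trouble only starts when those raw offsets are copied into a new array that is expected to be self-contained.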

All existing BaseBinary tests use arrays built directly from ArrayFromJSON, where the offset is always 0, so the sliced case was never exercised.

A proposed fix (adjusting offsets by offsets[0] and copying data from data + offsets[0] in both the ASA and AAS paths) is ready and will be added in a follow-up commit once this test is reviewed and confirmed correct.
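The adjustment described above can be sketched as follows. This is a standard-library model, not the actual patch to scalar_if_else.cc; `CopiedBinary`, `CopySliceNaive`, and `CopySliceRebased` are hypothetical names. The naive version mirrors the described bug (offsets copied verbatim, bytes copied from byte 0), while the rebased version subtracts offsets[0] from every offset and copies bytes starting at data + offsets[0].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Standalone copy of a slice: its own offsets and its own value bytes.
struct CopiedBinary {
  std::vector<int32_t> offsets;
  std::string data;
};

// `offsets` points at the slice's first raw offset and has `length + 1` entries;
// `data` points at the start of the parent's value bytes.

// Mirrors the described bug: offsets kept as-is, bytes taken from byte 0.
CopiedBinary CopySliceNaive(const int32_t* offsets, int64_t length, const char* data) {
  assert(length >= 0);
  CopiedBinary out;
  out.offsets.assign(offsets, offsets + length + 1);
  const std::size_t num_bytes = static_cast<std::size_t>(offsets[length] - offsets[0]);
  out.data.assign(data, num_bytes);  // wrong starting byte when offsets[0] != 0
  return out;
}

// Mirrors the proposed adjustment: rebase offsets and start at data + offsets[0].
CopiedBinary CopySliceRebased(const int32_t* offsets, int64_t length, const char* data) {
  assert(length >= 0);
  CopiedBinary out;
  out.offsets.reserve(static_cast<std::size_t>(length) + 1);
  const int32_t base = offsets[0];
  for (int64_t i = 0; i <= length; ++i) {
    out.offsets.push_back(offsets[i] - base);  // first copied offset becomes 0
  }
  out.data.assign(data + base, static_cast<std::size_t>(offsets[length] - base));
  return out;
}
```

For a parent with offsets {0, 2, 3, 4, 5} over the bytes "aaxxx", slicing off the first value and copying naively keeps offsets {2, 3, 4, 5} next to only three copied bytes, so the last offset points past the end of the data. The rebased copy yields offsets {0, 1, 2, 3} over "xxx".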

What changes are included in this PR?

A regression test IfElseBaseBinarySlicedChunk that reproduces the bug across utf8, binary, large_utf8, and large_binary types, covering the ASA path, AAS path, and the full round-trip from the original issue report.

The test currently fails:

[ RUN      ] TestIfElseKernel.IfElseBaseBinarySlicedChunk
scalar_if_else_test.cc:3790: Failure
'result_asa.make_array()->ValidateFull()' failed with Invalid: Offset invariant failure: offset for slot 2 out of bounds: 3 > 2
[  FAILED  ] TestIfElseKernel.IfElseBaseBinarySlicedChunk (4 ms)
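The kind of invariant this failure refers to can be illustrated with a simplified check. This is a sketch in the spirit of offset validation, not Arrow's implementation; `CheckOffsets` is a hypothetical helper, and its out-of-bounds message imitates the wording of the log above purely for readability.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns an empty string if the offsets are non-negative, non-decreasing, and
// within the data buffer; otherwise a message describing the first violation.
std::string CheckOffsets(const std::vector<int32_t>& offsets, std::size_t data_size) {
  if (offsets.empty()) return "offsets buffer is empty";
  if (offsets[0] < 0) return "first offset is negative";
  int32_t prev = offsets[0];
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    const int32_t cur = offsets[i];
    if (cur < prev) {
      return "non-monotonic offset at slot " + std::to_string(i);
    }
    if (static_cast<std::size_t>(cur) > data_size) {
      return "offset for slot " + std::to_string(i) + " out of bounds: " +
             std::to_string(cur) + " > " + std::to_string(data_size);
    }
    prev = cur;
  }
  return "";
}
```

As a made-up input, offsets {0, 0, 3, 4} over a 2-byte data buffer trip the bounds check at slot 2, because offset 3 exceeds the 2 available bytes. Offsets {0, 0, 1, 2} over the same buffer pass.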

Are these changes tested?

Yes. The new test reproduces the bug. Fix will follow in a separate commit.

This PR contains a "Critical Fix". The bug causes incorrect/invalid data to be produced when if_else is called with a null scalar and a sliced BaseBinary array.

@github-actions

github-actions bot commented Mar 3, 2026

⚠️ GitHub issue #49410 has been automatically assigned in GitHub to PR creator.

@Ebraam-Ashraf
Author

Hi @kou, thanks for your guidance.
As requested, here's the test that reproduces the bug.
Looking forward to any feedback. I'll add the fix in a follow-up commit once you confirm the test is fine.

{ArrayFromJSON(int64(), "[-1]"), ArrayFromJSON(int32(), "[0]")}));
}

TEST_F(TestIfElseKernel, IfElseBaseBinarySlicedChunk) {
Member

Could you move this TestIfElseKernel test to the location where the other TestIfElseKernel tests are, instead of adding it after the TestChooseKernel test?

Can we use a better name than IfElseBaseBinarySlicedChunk, in line with the other existing test names?

}

TEST_F(TestIfElseKernel, IfElseBaseBinarySlicedChunk) {
for (auto type : {utf8(), binary(), large_utf8(), large_binary()}) {
Member

Do we need to test all these types?

If so, TYPED_TEST(TestIfElseBaseBinary, ...) may be better than TEST_F(TestIfElseKernel, ...).

Comment on lines +3787 to +3809
auto cond_asa = ArrayFromJSON(boolean(), "[true, false, false]");
ASSERT_OK_AND_ASSIGN(auto result_asa,
    CallFunction("if_else", {cond_asa, MakeNullScalar(type), chunk1}));
ASSERT_OK(result_asa.make_array()->ValidateFull());
AssertArraysEqual(*ArrayFromJSON(type, R"([null, "x", "x"])"),
    *result_asa.make_array(), true);

auto cond_aas = ArrayFromJSON(boolean(), "[false, true, true]");
ASSERT_OK_AND_ASSIGN(auto result_aas,
    CallFunction("if_else", {cond_aas, chunk1, MakeNullScalar(type)}));
ASSERT_OK(result_aas.make_array()->ValidateFull());
AssertArraysEqual(*ArrayFromJSON(type, R"([null, "x", "x"])"),
    *result_aas.make_array(), true);

auto arr1 = std::make_shared<ChunkedArray>(ArrayVector{chunk0, chunk1});
auto mask = *CallFunction("is_null", {arr1});
ASSERT_OK_AND_ASSIGN(auto arr2_datum,
    CallFunction("if_else", {Datum(true), *Concatenate(arr1->chunks()), arr1}));
ASSERT_OK(arr2_datum.chunked_array()->ValidateFull());
ASSERT_OK_AND_ASSIGN(auto arr3_datum,
    CallFunction("if_else", {mask, MakeNullScalar(type), arr2_datum}));
ASSERT_OK(arr3_datum.chunked_array()->ValidateFull());
AssertDatumsEqual(Datum(arr1), arr3_datum);
Member

We can reproduce this problem without chunked array, right?

If so, we don't need to test chunked array.

@Ebraam-Ashraf
Author

Thanks for the feedback, @kou.

I switched from TEST_F(TestIfElseKernel, ...) to TYPED_TEST(TestIfElseBaseBinary, IfElseBaseBinarySliced), moved the test next to the other tests of that suite, and removed the chunked array portion. The new test still fails for all four types, as expected until the fix is added:

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (125 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] TestIfElseBaseBinary/0.IfElseBaseBinarySliced, where TypeParam = arrow::BinaryType
[  FAILED  ] TestIfElseBaseBinary/1.IfElseBaseBinarySliced, where TypeParam = arrow::LargeBinaryType
[  FAILED  ] TestIfElseBaseBinary/2.IfElseBaseBinarySliced, where TypeParam = arrow::StringType
[  FAILED  ] TestIfElseBaseBinary/3.IfElseBaseBinarySliced, where TypeParam = arrow::LargeStringType

Offset invariant failure: offset for slot 2 out of bounds: 3 > 2
