GH-49420: [C++][Gandiva] Fix castVARCHAR memory allocation and len<=0 handling #49421
dmitry-chirkov-dremio wants to merge 4 commits into apache:main
Let me work through the first-timer hurdles (such as pre-commit clang-format failures).
Force-pushed from 69b39cf to 61e5ec2.
Pushed clang-format fixes.
@lriggs @akravchukdremio @xxlaykxx You may want to review this.
Force-pushed from 61e5ec2 to 37d8e6d.
@telemenar is reviewing offline. The second commit is based on their feedback.
Rename SUFFIX → FORMATTER_SUFFIX and FROM_TYPE → FROM_DATE64 to better reflect their actual usage after the integer optimization was added.
Force-pushed from 90c36c4 to c6d2c25.
Performance Benchmark Results
Benchmarks run on Apple M3, 5 repetitions each.
[The description of what the tests exercise, the test data, and the result tables were not recovered. Table titles: Int32 (Full-Range Random), Int32 (Small 2-Digit), Int64 (Full-Range Random), Int64 (Small 2-Digit).]
Memory savings: allocates 11 bytes (int32) or 20 bytes (int64) per integer instead of up to 65,536 bytes, which is significant for engines that tend to use VARCHAR(65536).
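To put the per-value numbers in batch terms, here is a rough back-of-the-envelope sketch. The batch size of 4096 rows is purely hypothetical, and the calculation assumes every row's buffer stays in the arena until the batch completes; neither is taken from the benchmark setup.

```cpp
#include <cstdint>

// Arena bytes reserved for one batch when each row gets `bytes_per_row`.
constexpr int64_t BatchBytes(int64_t rows, int64_t bytes_per_row) {
  return rows * bytes_per_row;
}

constexpr int64_t kRows = 4096;  // hypothetical batch size, not from the PR

// Before: one VARCHAR(65536)-sized buffer per row.
constexpr int64_t kBefore = BatchBytes(kRows, 65536);    // 268,435,456 bytes (256 MiB)
// After: fixed small buffers sized for the longest possible decimal text.
constexpr int64_t kAfterInt64 = BatchBytes(kRows, 20);   // 81,920 bytes (80 KiB)
constexpr int64_t kAfterInt32 = BatchBytes(kRows, 11);   // 45,056 bytes (44 KiB)

static_assert(kBefore == 268435456, "4096 * 65536");
static_assert(kAfterInt64 == 81920, "4096 * 20");
static_assert(kAfterInt32 == 45056, "4096 * 11");
```

Under those assumptions the reservation drops from hundreds of mebibytes to tens of kibibytes per batch, independent of the declared VARCHAR width.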
Rationale for this change
The castVARCHAR functions in Gandiva have memory allocation inefficiencies and missing edge case handling. See GH-49420 for details.
What changes are included in this PR?
Functional fixes:
- bool: Remove unused 5-byte arena allocation; return string literal directly
- int32/int64: Add handling for len=0 (return empty string) and len<0 (set error)
Memory allocation optimizations:
- int32/int64: Allocate fixed small buffer (11/20 bytes) directly in arena, use optimized digit-pair conversion writing right-to-left, then memmove to align. Returns min(len, actual_size) bytes.
- date64: Allocate only min(len, 10) bytes upfront (output is always "YYYY-MM-DD")
- float32/float64: Allocate only min(len, 24) bytes upfront (max output length)
Code cleanup:
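For readers unfamiliar with the digit-pair technique, the following is a minimal standalone sketch of the int64 path described above. It is not the code in this PR: SketchContext, SketchCastVarchar, and kDigitPairs are hypothetical names, the caller-supplied 20-byte buffer stands in for the arena allocation, and keeping the leading bytes on truncation is an assumption based on the "memmove to align" wording.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for the execution context; it only records an error.
struct SketchContext {
  std::string error;
};

// Two-character decimal text for 00..99, stored back to back (200 chars).
static const char kDigitPairs[] =
    "0001020304050607080910111213141516171819"
    "2021222324252627282930313233343536373839"
    "4041424344454647484950515253545556575859"
    "6061626364656667686970717273747576777879"
    "8081828384858687888990919293949596979899";

// Writes the decimal text of `value` into `buf` (capacity must be >= 20 bytes)
// and returns min(len, actual_size). Returns 0 for len == 0, and sets
// ctx->error and returns 0 for len < 0.
int32_t SketchCastVarchar(SketchContext* ctx, int64_t value, int64_t len, char* buf) {
  constexpr int32_t kMaxInt64Chars = 20;  // length of "-9223372036854775808"
  if (len < 0) {
    ctx->error = "Buffer length cannot be negative";
    return 0;
  }
  if (len == 0) return 0;  // empty string, nothing to convert

  // Take the magnitude in unsigned arithmetic so INT64_MIN does not overflow.
  uint64_t mag = value < 0 ? 0 - static_cast<uint64_t>(value)
                           : static_cast<uint64_t>(value);

  // Fill the buffer from its right edge, two digits per step.
  char* end = buf + kMaxInt64Chars;
  char* p = end;
  while (mag >= 100) {
    const uint64_t pair = mag % 100;
    mag /= 100;
    p -= 2;
    std::memcpy(p, kDigitPairs + 2 * pair, 2);
  }
  if (mag >= 10) {
    p -= 2;
    std::memcpy(p, kDigitPairs + 2 * mag, 2);
  } else {
    *--p = static_cast<char>('0' + mag);
  }
  if (value < 0) *--p = '-';

  // Shift the text to the start of the buffer, keeping at most `len` bytes.
  const int32_t actual = static_cast<int32_t>(end - p);
  const int32_t out = static_cast<int32_t>(std::min<int64_t>(len, actual));
  std::memmove(buf, p, out);  // memmove because source and destination may overlap
  return out;
}
```

Writing right-to-left avoids a separate digit-count pass or a reversal step, and the pair table halves the number of divisions compared with one digit at a time. The buffer size no longer depends on len, so a very large declared VARCHAR width costs nothing extra.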
Are these changes tested?
Yes. Added tests for len=0 and len<0 edge cases for int64, date64, float32, float64, and bool types. All existing Gandiva tests pass. Ad hoc performance benchmarking was performed both via direct expression evaluation and via query execution in Dremio.
Are there any user-facing changes?
No. Users will see reduced memory usage and proper error messages for invalid len parameter values.
Note: Error messages for negative len remain different between precompiled ("Output buffer length can't be negative") and interpreted ("Buffer length cannot be negative") code paths, preserving existing behavior.