GH-49420: [C++][Gandiva] Fix castVARCHAR memory allocation and len<=0 handling #49421

Open
dmitry-chirkov-dremio wants to merge 4 commits into apache:main from dmitry-chirkov-dremio:gandiva-castvarchar-optimization

dmitry-chirkov-dremio commented Mar 2, 2026

Rationale for this change

The castVARCHAR functions in Gandiva have memory allocation inefficiencies and missing edge case handling. See GH-49420 for details.

What changes are included in this PR?

Functional fixes:

  • bool: Remove unused 5-byte arena allocation; return string literal directly
  • int32/int64: Add handling for len=0 (return empty string) and len<0 (set error)
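
The len checks described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `FakeContext` and `CheckedOutputLength` are hypothetical names standing in for Gandiva's execution context and helper macros, and the error text is copied from the interpreted-path message quoted later in this description.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the execution context; it only records an error.
struct FakeContext {
  std::string error;
};

// Returns how many bytes the caller may write for a value whose full text
// needs `needed` bytes. Returns 0 for len == 0 (empty string) and for
// len < 0 (error recorded on the context).
inline int32_t CheckedOutputLength(FakeContext* ctx, int64_t len, int32_t needed) {
  assert(ctx != nullptr);
  if (len < 0) {
    ctx->error = "Buffer length cannot be negative";
    return 0;
  }
  if (len == 0) {
    return 0;  // empty result, nothing to allocate
  }
  return len < needed ? static_cast<int32_t>(len) : needed;
}
```

With this shape, a caller can return early whenever the result is 0 and check the context to tell the empty-string case apart from the error case.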

Memory allocation optimizations:

  • int32/int64: Allocate fixed small buffer (11/20 bytes) directly in arena, use optimized digit-pair conversion writing right-to-left, then memmove to align. Returns min(len, actual_size) bytes.
  • date64: Allocate only min(len, 10) bytes upfront (output is always "YYYY-MM-DD")
  • float32/float64: Allocate only min(len, 24) bytes upfront (max output length)
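
To illustrate the integer path, here is a hedged sketch of a digit-pair conversion that writes right-to-left into a fixed 20-byte buffer and then uses memmove to left-align the result. The names `kDigitPairs` and `Int64ToChars` are made up for this example, and the buffer is supplied by the caller rather than allocated from the arena, so it should be read as an approximation of the technique rather than the code in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Lookup table "00".."99": two digits are emitted per division by 100.
static const char kDigitPairs[] =
    "0001020304050607080910111213141516171819"
    "2021222324252627282930313233343536373839"
    "4041424344454647484950515253545556575859"
    "6061626364656667686970717273747576777879"
    "8081828384858687888990919293949596979899";

// `buf` must hold at least 20 bytes (longest int64 text, including the sign).
// Writes the decimal text left-aligned into `buf` and returns
// min(len, actual_size). Assumes len > 0 was already checked by the caller.
inline int32_t Int64ToChars(int64_t value, int64_t len, char* buf) {
  assert(buf != nullptr && len > 0);
  constexpr int kMax = 20;
  const bool negative = value < 0;
  // Negate in unsigned arithmetic so INT64_MIN does not overflow.
  uint64_t u = negative ? uint64_t{0} - static_cast<uint64_t>(value)
                        : static_cast<uint64_t>(value);
  char* p = buf + kMax;  // write position, moving right-to-left
  while (u >= 100) {
    const uint64_t r = u % 100;
    u /= 100;
    p -= 2;
    std::memcpy(p, kDigitPairs + 2 * r, 2);
  }
  if (u >= 10) {
    p -= 2;
    std::memcpy(p, kDigitPairs + 2 * u, 2);
  } else {
    *--p = static_cast<char>('0' + u);
  }
  if (negative) {
    *--p = '-';
  }
  const int32_t size = static_cast<int32_t>(buf + kMax - p);
  std::memmove(buf, p, size);  // shift the text to the start of the buffer
  return len < size ? static_cast<int32_t>(len) : size;
}
```

For example, `Int64ToChars(-12345, 20, buf)` leaves `-12345` at the start of `buf` and returns 6, while `Int64ToChars(987654, 3, buf)` returns 3, so only the leading `987` is exposed to the caller.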

Code cleanup:

  • Extract common code into helper macros to reduce duplication

Are these changes tested?

Yes. Added tests for len=0 and len<0 edge cases for int64, date64, float32, float64, and bool types. All existing Gandiva tests pass. Ad hoc performance benchmarking was performed both via direct expression evaluation and via query execution in Dremio.

Are there any user-facing changes?

No API changes. Users will see reduced memory usage and proper error messages for invalid len parameter values.
Note: Error messages for negative len remain different between precompiled ("Output buffer length can't be negative") and interpreted ("Buffer length cannot be negative") code paths, preserving existing behavior.

dmitry-chirkov-dremio (Author) commented Mar 2, 2026

Let me work through the first-timer hurdles (such as pre-commit clang-format failures).
See GH-49347 for the 'aws/core/utils/pagination/Paginator.h' file not found error.

dmitry-chirkov-dremio (Author) commented

Pushed clang-format fixes

kou (Member) commented Mar 4, 2026

@lriggs @akravchukdremio @xxlaykxx You may want to review this.

dmitry-chirkov-dremio force-pushed the gandiva-castvarchar-optimization branch from 61e5ec2 to 37d8e6d on March 4, 2026 03:00

dmitry-chirkov-dremio (Author) commented Mar 4, 2026

> @lriggs @akravchukdremio @xxlaykxx You may want to review this.

@telemenar is reviewing offline. The second commit is based on their feedback.

dmitry-chirkov-dremio marked this pull request as draft on March 4, 2026 03:23
Rename SUFFIX → FORMATTER_SUFFIX and FROM_TYPE → FROM_DATE64 to better
reflect their actual usage after the integer optimization was added.
dmitry-chirkov-dremio force-pushed the gandiva-castvarchar-optimization branch from 90c36c4 to c6d2c25 on March 4, 2026 14:52
dmitry-chirkov-dremio marked this pull request as ready for review on March 4, 2026 14:53

dmitry-chirkov-dremio (Author) commented Mar 4, 2026

Performance Benchmark Results

Benchmarks were run on an Apple M3 with 5 repetitions each. The tests exercise castVARCHAR for Int32/Int64 with various len parameters. Integer conversion shows small yet noticeable improvements; date conversion performance is unchanged.

Test data:

  • Int32 / Int64: Full-range random integers (1M rows)
  • Int32Small / Int64Small: 2-digit integers 10-99 (1M rows)

Int32 (Full-Range Random)

| len | Original (μs) | Optimized (μs) | Δ |
|---|---|---|---|
| 1 | 17,200 | 16,856 | -2.0% |
| 11 | 18,856 | 18,977 | +0.6% |
| 100 | 19,088 | 18,819 | -1.4% |
| 65536 | 18,792 | 18,951 | +0.8% |

Int32 (Small 2-Digit)

| len | Original (μs) | Optimized (μs) | Δ |
|---|---|---|---|
| 1 | 11,791 | 11,062 | -6.2% |
| 11 | 12,461 | 11,756 | -5.7% |
| 100 | 12,606 | 11,878 | -5.8% |
| 65536 | 12,353 | 11,912 | -3.6% |

Int64 (Full-Range Random)

| len | Original (μs) | Optimized (μs) | Δ |
|---|---|---|---|
| 1 | 17,801 | 17,246 | -3.1% |
| 20 | 19,785 | 19,010 | -3.9% |
| 100 | 19,268 | 19,439 | +0.9% |
| 65536 | 19,688 | 18,950 | -3.7% |

Int64 (Small 2-Digit)

| len | Original (μs) | Optimized (μs) | Δ |
|---|---|---|---|
| 1 | 11,747 | 11,255 | -4.2% |
| 20 | 12,581 | 11,435 | -9.1% |
| 100 | 12,357 | 11,728 | -5.1% |
| 65536 | 12,586 | 11,796 | -6.3% |
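
For readers checking the numbers, the Δ column appears consistent with the relative change (optimized - original) / original, rounded to one decimal place, so negative values mean the optimized build was faster. A small sketch of that arithmetic, with helper names invented for this example:

```cpp
#include <cassert>
#include <cmath>

// Relative change in percent; negative means the optimized run was faster.
inline double PercentDelta(double original_us, double optimized_us) {
  assert(original_us > 0.0);
  return (optimized_us - original_us) / original_us * 100.0;
}

// Round to one decimal place, matching the precision shown in the tables.
inline double RoundToTenth(double x) { return std::round(x * 10.0) / 10.0; }
```

Applied to the first Int32 row (17,200 μs original, 16,856 μs optimized), this gives -2.0, and the Int64 small-integer row at len 20 (12,581 μs to 11,435 μs) gives -9.1.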

Memory savings: Allocates 11-20 bytes per integer instead of up to 65,536 bytes, which is significant for engines that tend to use VARCHAR(65536).
