Skip to content

[FLINK-38697][runtime] Store task information blob key in JobVertex#27726

Open
DeamonDev wants to merge 1 commit intoapache:masterfrom
DeamonDev:FLINK-38697-store-blob-key-in-job-vertex
Open

[FLINK-38697][runtime] Store task information blob key in JobVertex#27726
DeamonDev wants to merge 1 commit intoapache:masterfrom
DeamonDev:FLINK-38697-store-blob-key-in-job-vertex

Conversation

@DeamonDev
Copy link

@DeamonDev DeamonDev commented Mar 2, 2026

The change ensures that TaskInformation blob keys are reused across ExecutionGraph rebuilds triggered by the adaptive scheduler, rather than creating a new permanent blob on every restart.
Previously, each time the adaptive scheduler rebuilt the ExecutionGraph (e.g. after a TaskManager failure), ExecutionJobVertex would upload a fresh TaskInformation blob to the BlobServer even if the content was identical. These blob keys are randomized, which was introduced to the blob key to prevent hash-collisions in #4359.

The fix caches the PermanentBlobKey on JobVertex, which unlike ExecutionJobVertex is never recreated during adaptive scheduler restarts. On subsequent ExecutionGraph rebuilds, ExecutionJobVertex checks if JobVertex already holds a cached key and reuses it instead of uploading a new blob. This mirrors the existing pattern used by job JAR blobs, which are correctly reused across restarts by storing their keys persistently in the JobGraph. Since permanent blobs are only cleaned up when a job reaches a globally terminated state, these orphaned blobs accumulate indefinitely on the JM's blob storage, eventually causing storage exhaustion and job stalling.

Brief change log

  • cache TaskInformation's PermanentBlobKey on JobVertex to reuse it across ExecutionGraph rebuilds

Verifying this change

This change added tests and can be verified as follows:

  • Added test that validates that PermanentBlobKey is cached and reused by ExecutionJobVertex'es related to given JobVertex

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Signed-off-by: Piotr Rudnicki <piotr.rudnicki94@protonmail.com>
@flinkbot
Copy link
Collaborator

flinkbot commented Mar 2, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@DeamonDev DeamonDev marked this pull request as ready for review March 3, 2026 07:57
@DeamonDev DeamonDev marked this pull request as draft March 3, 2026 07:58
@DeamonDev DeamonDev marked this pull request as ready for review March 3, 2026 08:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants