Skip to content

GH-41365: [Python] Fix S3 URI with whitespace silently falling back to LocalFileSystem#49372

Open
ernestprovo23 wants to merge 1 commit intoapache:mainfrom
ernestprovo23:fix-s3-uri-whitespace-resolution
Open

GH-41365: [Python] Fix S3 URI with whitespace silently falling back to LocalFileSystem#49372
ernestprovo23 wants to merge 1 commit intoapache:mainfrom
ernestprovo23:fix-s3-uri-whitespace-resolution

Conversation

@ernestprovo23
Copy link

@ernestprovo23 ernestprovo23 commented Feb 22, 2026

Rationale

Closes #41365

When _resolve_filesystem_and_path() receives an S3 URI containing spaces (e.g. s3://bucket/path with space/file), the C++ URI parser raises ValueError("Cannot parse URI"). The Python code previously caught all "Cannot parse URI" errors and silently fell back to LocalFileSystem, producing a confusing downstream error instead of the real parsing failure.

What changed

  1. Added _is_likely_uri() helper — a Python port of the C++ IsLikelyUri() heuristic from cpp/src/arrow/filesystem/path_util.cc. It checks whether a string has a valid URI scheme prefix (2-36 chars, RFC 3986 compliant).

  2. Modified error handling in _resolve_filesystem_and_path() — the except ValueError block now distinguishes:

    • "empty scheme" → still falls back to LocalFileSystem (no scheme at all)
    • "Cannot parse URI" + _is_likely_uri() returns False → still falls back (not a URI, just a local path)
    • "Cannot parse URI" + _is_likely_uri() returns Truere-raises the error (it IS a URI, just malformed)
  3. Changed raise e to raise — preserves the original traceback (minor style improvement).

Tests added

  • test_is_likely_uri() — unit tests for the heuristic covering valid schemes (s3://, gs://, hdfs://, abfss://, grpc+https://), local paths, Windows drives, and invalid prefixes.
  • test_resolve_filesystem_and_path_uri_with_spaces() — verifies S3/GCS/ABFSS URIs with spaces now raise ValueError.
  • test_resolve_filesystem_and_path_local_with_spaces() — verifies local paths with spaces still return LocalFileSystem.

Test plan

  • All new tests pass (14/14)
  • Manual verification: _resolve_filesystem_and_path("s3://bucket/path with space") raises ValueError
  • Manual verification: _resolve_filesystem_and_path("/local/path with space") returns LocalFileSystem
  • Existing test_fs.py tests pass (requires full C++ build environment)

…g back to LocalFileSystem

Port the C++ `IsLikelyUri()` heuristic to Python and use it in
`_resolve_filesystem_and_path()` to distinguish between local paths
and malformed URIs before suppressing `ValueError`.

Previously, URIs like `s3://bucket/path with space/file` would silently
fall back to `LocalFileSystem` because the "Cannot parse URI" error was
unconditionally swallowed. Now, paths that look like URIs (valid scheme
prefix) propagate the parsing error, while local paths continue to fall
back to `LocalFileSystem` as expected.
@github-actions
Copy link

⚠️ GitHub issue #41365 has been automatically assigned in GitHub to PR creator.

@rok
Copy link
Member

rok commented Feb 22, 2026

Hey @ernestprovo23, thanks for working on this and opening the PR. I'll kick off the CI run and hopefully I or someone else wil find time early next week to do a review.

)


def _is_likely_uri(path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mention on the issue:

Working on a fix that ports the C++ IsLikelyUri() heuristic to Python and uses it in _resolve_filesystem_and_path()

Could you check if it's possible to expose IsLikelyUri through Python so ultimately the logic is only implemented once and is always doing the same?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think calling the C++ function would be preferable. The change to expose the functionallity could look like:

diff --git i/cpp/src/arrow/filesystem/filesystem.h w/cpp/src/arrow/filesystem/filesystem.h
index 3a47eb62f5..85ee3ff114 100644
--- i/cpp/src/arrow/filesystem/filesystem.h
+++ w/cpp/src/arrow/filesystem/filesystem.h
@@ -23,6 +23,7 @@
 #include <iosfwd>
 #include <memory>
 #include <string>
+#include <string_view>
 #include <utility>
 #include <vector>

@@ -529,6 +530,12 @@ class ARROW_EXPORT SlowFileSystem : public FileSystem {
 /// The user is responsible for synchronization of calls to this function.
 void EnsureFinalized();

+/// \brief Return whether a path string is likely a URI.
+///
+/// This heuristic is conservative and may return false for malformed URIs.
+ARROW_EXPORT
+bool IsLikelyUri(std::string_view path);
+
 /// \defgroup filesystem-factories Functions for creating FileSystem instances
 ///
 /// @{
diff --git i/cpp/src/arrow/filesystem/path_util.cc w/cpp/src/arrow/filesystem/path_util.cc
index dc82afd07e..cdc1807781 100644
--- i/cpp/src/arrow/filesystem/path_util.cc
+++ w/cpp/src/arrow/filesystem/path_util.cc
@@ -28,8 +28,28 @@
 #include "arrow/util/uri.h"

 namespace arrow {
-
 namespace fs {
+
+bool IsLikelyUri(std::string_view v) {
+  if (v.empty() || v[0] == '/') {
+    return false;
+  }
+  const auto pos = v.find_first_of(':');
+  if (pos == v.npos) {
+    return false;
+  }
+  if (pos < 2) {
+    // One-letter URI schemes don't officially exist, perhaps a Windows drive letter?
+    return false;
+  }
+  if (pos > 36) {
+    // The largest IANA-registered URI scheme is "microsoft.windows.camera.multipicker"
+    // with 36 characters.
+    return false;
+  }
+  return ::arrow::util::IsValidUriScheme(v.substr(0, pos));
+}
+
 namespace internal {

 // XXX How does this encode Windows UNC paths?
@@ -337,26 +357,6 @@ bool IsEmptyPath(std::string_view v) {
   return true;
 }

diff --git i/cpp/src/arrow/filesystem/path_util.h w/cpp/src/arrow/filesystem/path_util.h
index d49d9d2efa..2f767fc7dc 100644
--- i/cpp/src/arrow/filesystem/path_util.h
+++ w/cpp/src/arrow/filesystem/path_util.h
@@ -27,6 +27,10 @@

 namespace arrow {
 namespace fs {
+
+ARROW_EXPORT
+bool IsLikelyUri(std::string_view s);
+
 namespace internal {

 constexpr char kSep = '/';
@@ -159,8 +163,7 @@ std::string ToSlashes(std::string_view s);
 ARROW_EXPORT
 bool IsEmptyPath(std::string_view s);

-ARROW_EXPORT
-bool IsLikelyUri(std::string_view s);
+inline bool IsLikelyUri(std::string_view s) { return ::arrow::fs::IsLikelyUri(s); }

 class ARROW_EXPORT Globber {
  public:
diff --git i/python/pyarrow/_fs.pyx w/python/pyarrow/_fs.pyx
index 0739b6acba..0b8ab7480f 100644
--- i/python/pyarrow/_fs.pyx
+++ w/python/pyarrow/_fs.pyx
@@ -84,6 +84,14 @@ def _file_type_to_string(ty):
     return f"{ty.__class__.__name__}.{ty._name_}"


+def _is_likely_uri(path):
+    cdef c_string c_path
+    if not isinstance(path, str):
+        raise TypeError("Path must be a string")
+    c_path = tobytes(path)
+    return CIsLikelyUri(cpp_string_view(c_path))
+
+
 cdef class FileInfo(_Weakrefable):
     """
     FileSystem entry info.
diff --git i/python/pyarrow/fs.py w/python/pyarrow/fs.py
index 41308cd60c..b0acb97d94 100644
--- i/python/pyarrow/fs.py
+++ w/python/pyarrow/fs.py
@@ -31,6 +31,7 @@ from pyarrow._fs import (  # noqa
     _MockFileSystem,
     FileSystemHandler,
     PyFileSystem,
+    _is_likely_uri,
     _copy_files,
     _copy_files_selector,
 )
@@ -82,32 +83,6 @@ def __getattr__(name):
     )

diff --git i/python/pyarrow/includes/libarrow_fs.pxd w/python/pyarrow/includes/libarrow_fs.pxd
index af01c47c8c..48604972d2 100644
--- i/python/pyarrow/includes/libarrow_fs.pxd
+++ w/python/pyarrow/includes/libarrow_fs.pxd
@@ -92,6 +92,7 @@ cdef extern from "arrow/filesystem/api.h" namespace "arrow::fs" nogil:
     CResult[shared_ptr[CFileSystem]] CFileSystemFromUriOrPath \
         "arrow::fs::FileSystemFromUriOrPath"(const c_string& uri,
                                              c_string* out_path)
+    c_bool CIsLikelyUri "arrow::fs::IsLikelyUri"(cpp_string_view path)

     cdef cppclass CFileSystemGlobalOptions \
             "arrow::fs::FileSystemGlobalOptions":
diff --git i/python/pyarrow/tests/test_fs.py w/python/pyarrow/tests/test_fs.py
index 537ceb9999..31f3f8a8ca 100644
--- i/python/pyarrow/tests/test_fs.py
+++ w/python/pyarrow/tests/test_fs.py
@@ -1715,6 +1715,8 @@ def test_is_likely_uri():
     assert not _is_likely_uri("C:/Users/foo")
     assert not _is_likely_uri("3bucket://key")       # scheme starts with digit
     assert not _is_likely_uri("-scheme://key")        # scheme starts with dash
+    assert not _is_likely_uri("schéme://bucket/key")  # non-ASCII in scheme
+    assert not _is_likely_uri("漢字://bucket/key")     # non-ASCII in scheme

Does that make sense?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Feb 23, 2026
@ernestprovo23
Copy link
Author

@rok friendly ping on this when you get a chance. the ARM64 macOS 14 failure looks like a CI infra issue (missing Paginator.h header from the AWS SDK on the runner), not related to this change.

@rok
Copy link
Member

rok commented Mar 3, 2026

Hey @ernestprovo23 ! The Paginator.h issue is unrelated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Python] Pyarrow fs incorrectly resolves S3 URIs with white space as a local path

2 participants