upload_filepath re-uploads files after transaction rollback — should check S3 before uploading #1397

@ttngu207

Description

DataJoint Version

datajoint 0.14.8 (datajoint/external.py, datajoint/s3.py)

Problem

ExternalTable.upload_filepath() only checks the external tracking table (DB) to determine whether a file already exists on S3 before uploading. When a transaction rolls back after a successful S3 upload, the DB tracking entry is lost but the file remains on S3. On retry, upload_filepath finds no DB entry and re-uploads the entire file, even though it's already there.

This is particularly painful for large files (multi-GB). In our case, a 10 GB recording.dat gets uploaded to S3 in ~5 minutes, but the DB connection times out during the enclosing transaction. Every retry re-uploads the same 10 GB file — wasting time and bandwidth — and then fails again at the same point.

Root Cause

S3 uploads are not transactional, but upload_filepath treats them as if they are by relying solely on the DB tracking table:

# external.py, lines 293-315 (simplified)
check_hash = (self & {"hash": uuid}).fetch("contents_hash")
if check_hash.size:
    # DB entry exists → skip the upload (correct)
    pass
else:
    # DB entry missing → always upload, even if the file is already on S3
    self._upload_file(local_filepath, external_path, metadata=...)
    self.connection.query("INSERT INTO ...")

After a transaction rollback:

  1. S3 file exists ✓ (upload succeeded before rollback)
  2. DB tracking entry gone ✗ (rolled back)
  3. Retry: DB check finds nothing → re-uploads entire file to S3
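
The orphaned object is easy to confirm with a metadata-only call against the bucket. A minimal sketch (endpoint, credentials, bucket, and object key below are placeholders), using the same minio client that datajoint/s3.py wraps:

from minio import Minio

client = Minio("s3.example.org", access_key="...", secret_key="...", secure=True)
# stat_object is a HEAD request; it raises S3Error if the key is absent
stat = client.stat_object("my-bucket", "external/path/to/recording.dat")
print(stat.size, stat.metadata.get("x-amz-meta-contents_hash"))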

Proposed Fix

Before uploading, check S3 directly using s3.exists() + s3.get_size() (both already implemented in s3.py). If a file with matching size already exists at the expected path, skip the upload:

check_hash = (self & {"hash": uuid}).fetch("contents_hash")
if check_hash.size:
    # DB tracking entry exists — skip
    if not skip_checksum and contents_hash != check_hash[0]:
        raise DataJointError(...)
else:
    external_path = self._make_external_filepath(relative_filepath)
    already_uploaded = False
    if self.spec["protocol"] == "s3":
        if self.s3.exists(str(external_path)):
            remote_size = self.s3.get_size(str(external_path))
            if remote_size == file_size:
                already_uploaded = True
                logger.info(
                    f"File already exists on S3 with matching size, "
                    f"skipping upload: '{relative_filepath}'"
                )
    if not already_uploaded:
        self._upload_file(
            local_filepath, external_path,
            metadata={"contents_hash": str(contents_hash) if contents_hash else ""},
        )
    # Always insert the DB tracking entry
    self.connection.query("INSERT INTO ...")

For even stronger verification, the contents_hash is already stored in S3 object metadata (set at upload time on line 306). It could be checked via stat_object without downloading the file:

# metadata-only HEAD request; no object data is transferred
stat = self.s3.client.stat_object(self.s3.bucket, str(external_path))
remote_contents_hash = stat.metadata.get("x-amz-meta-contents_hash")
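
A sketch of how that check could slot into the already_uploaded branch of the proposed fix above (the surrounding variables are those of upload_filepath; the exact metadata key depends on how minio returns the user metadata set by _upload_file):

if self.spec["protocol"] == "s3" and self.s3.exists(str(external_path)):
    # metadata-only call; nothing is downloaded
    stat = self.s3.client.stat_object(self.s3.bucket, str(external_path))
    remote_contents_hash = stat.metadata.get("x-amz-meta-contents_hash")
    # trust the existing object only if its stored hash matches the local file's hash
    if remote_contents_hash and contents_hash and remote_contents_hash == str(contents_hash):
        already_uploaded = True
        logger.info(
            f"File already on S3 with matching contents_hash, "
            f"skipping upload: '{relative_filepath}'"
        )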

Why This Is Safe

  • s3.exists() and s3.get_size() are cheap stat_object calls (milliseconds)
  • The external path is deterministic (derived from the relative filepath), so path + size match is a strong identity signal
  • If the S3 check fails or the file doesn't exist, it falls through to the normal upload path — zero risk to existing behavior
  • This only affects the else branch (no DB entry), so already-tracked files are unaffected

Reproduction

Any dj.Imported table that inserts filepath@store attributes pointing to large files inside make():

  1. make() runs a long computation, then calls self.insert1(...) with a filepath@store attribute referencing a large file
  2. S3 upload succeeds but takes long enough that the DB connection times out
  3. Transaction rolls back (LostConnectionError)
  4. Retry: upload_filepath finds no DB entry → re-uploads the same large file → same timeout → infinite retry loop

In our pipeline, this happens with spike sorting, which produces a 10 GB recording.dat artifact.
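
A minimal sketch of the pattern (the schema name, the Session parent table, the raw_store store name, and the run_spike_sorting helper are all hypothetical; raw_store would need to be configured as an S3 filepath store in dj.config["stores"]):

import datajoint as dj

schema = dj.schema("ephys_demo")


@schema
class Recording(dj.Imported):
    definition = """
    -> Session
    ---
    raw_data: filepath@raw_store   # S3-backed filepath store
    """

    def make(self, key):
        # long computation that produces a multi-GB file (hypothetical helper)
        local_path = run_spike_sorting(key)
        # insert1 calls upload_filepath inside the make() transaction; if the DB
        # connection times out after the S3 upload completes, the tracking row is
        # rolled back but the object stays on S3
        self.insert1(dict(key, raw_data=local_path))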
