Remove dependency on iopath.

Merged. Administrator requested to merge davides/file_io into main on Oct 21, 2022

Created by: davides

Patch Description

This change decouples metaseq from iopath as follows:

  • Copy the main classes into the local file_io package
  • Copy the associated unit tests
  • Drop functionality not needed in metaseq: TabularIO, telemetry
  • Update setup.py to list portalocker explicitly. It was previously a transitive dependency, and metaseq.file_io.common.file_lock still needs it
  • Pull forward the pending AzureBlobPathHandler from https://github.com/facebookresearch/iopath/pull/17
    • Read and write operations are bounded so that they use a known amount of memory when dealing with larger files:
      • _open("wb", buffering=<buffer-size>) will buffer up to the requested amount of data in memory before flushing it to the service with the PutBlock operation
      • _open("rb", buffering=<buffer-size>) will use the Blob client's chunk iterator to only download a fixed amount of data at a time
      • _close() in write mode will flush any remaining buffered data with one more PutBlock, then finalize the blob with PutBlockList
      • The block-based approach should work for both block blobs and append blobs (see the Azure docs).
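
The write-side buffering described above can be sketched roughly as follows. This is a minimal illustration, not the code from this merge request: InMemoryBlockStore and BufferedBlockWriter are hypothetical names, and the store is an in-memory stand-in for the blob service rather than the Azure SDK. The block ID scheme (a base64-encoded, zero-padded index) is an assumption inferred from the block_id values in the test log below.

```python
import base64


class InMemoryBlockStore:
    """Hypothetical stand-in for the blob service; NOT the Azure SDK."""

    def __init__(self):
        self.staged = {}      # block_id -> bytes, uploaded but not yet committed
        self.committed = b""  # blob content after the block list is committed

    def put_block(self, block_id, data):
        self.staged[block_id] = bytes(data)

    def put_block_list(self, block_ids):
        self.committed = b"".join(self.staged[b] for b in block_ids)


class BufferedBlockWriter:
    """Hypothetical writer that holds roughly buffer_size bytes before uploading."""

    def __init__(self, store, buffer_size):
        self._store = store
        self._buffer_size = buffer_size
        self._buffer = bytearray()
        self._block_ids = []
        self._closed = False

    def write(self, data):
        self._buffer.extend(data)
        # Upload full blocks as soon as enough data has accumulated.
        while len(self._buffer) >= self._buffer_size:
            self._upload(self._buffer[: self._buffer_size])
            del self._buffer[: self._buffer_size]
        return len(data)

    def _upload(self, chunk):
        # Assumed ID scheme: base64 of the zero-padded block index.
        index = len(self._block_ids)
        block_id = base64.b64encode(f"{index:05d}".encode()).decode()
        self._store.put_block(block_id, chunk)
        self._block_ids.append(block_id)

    def close(self):
        if self._closed:
            return
        if self._buffer:
            # Flush the final partial block.
            self._upload(self._buffer)
            self._buffer = bytearray()
        # Committing the ordered block list makes the blob content visible.
        self._store.put_block_list(self._block_ids)
        self._closed = True


store = InMemoryBlockStore()
writer = BufferedBlockWriter(store, buffer_size=4096)
writer.write(b"x" * 8192)  # two full buffers, so two blocks are uploaded
writer.close()             # nothing left to flush; commits the block list
```

In this sketch, nothing is readable from the store until close() commits the block list, which mirrors the PutBlock then PutBlockList sequence in the description.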

Testing steps

$ python -m unittest discover tests/file_io/
.10/21/2022 12:23:17 PM Caching az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin ...
10/21/2022 12:23:17 PM URL az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin cached in /tmp/tmpy3w003kj/blob_cache/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin
10/21/2022 12:23:17 PM Caching az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file2.bin ...
10/21/2022 12:23:17 PM URL az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file2.bin cached in /tmp/tmpy3w003kj/blob_cache/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file2.bin
10/21/2022 12:23:17 PM URL az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin was already cached in /tmp/tmpy3w003kj/blob_cache/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin
.10/21/2022 12:23:17 PM Opening blob: path=az://lrsstoragewest3/data/temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin, mode=rb
10/21/2022 12:23:17 PM Read next chunk: blob_name=temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin, length=4096
10/21/2022 12:23:17 PM Opening blob: path=az://lrsstoragewest3/data/temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin.streamed, mode=wb
10/21/2022 12:23:17 PM Uploading a new block: blob_name=temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin.streamed, block_id=MDAwMDA=, idx=0, length=4096
10/21/2022 12:23:18 PM Committing blocks: blob_name=temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin.streamed, count=1
10/21/2022 12:23:18 PM Uploading a new block: blob_name=temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin.streamed, block_id=MDAwMDE=, idx=1, length=4096
....10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
..............................sssssssssssssss
----------------------------------------------------------------------
Ran 51 tests in 2.752s

OK (skipped=15)

(The skipped tests are for S3PathHandler, which is unchanged; it was moved verbatim from metaseq/s3_utils.py to metaseq/file_io/s3.py.)

Source branch: davides/file_io