Remove dependency on iopath. (!441)

Merged Administrator requested to merge davides/file_io into main Oct 21, 2022

Created by: davides

Patch Description

This change decouples metaseq from iopath as follows:

- Copy the main classes into the local file_io package
- Copy the associated unit tests
- Drop functionality not needed in metaseq: TabularIO, telemetry
- Update setup.py. portalocker was previously a transitive dependency, and we still need it for metaseq.file_io.common.file_lock
- Pull forward the pending AzureBlobPathHandler from https://github.com/facebookresearch/iopath/pull/17
  - Controls have been put in place so that read/write operations use a bounded amount of memory when dealing with larger files:
    - _open("wb", buffering=<buffer-size>) buffers up to the requested amount of data in memory before flushing it to the service with the PutBlock operation
    - _open("rb", buffering=<buffer-size>) uses the Blob client's chunk iterator to download only a fixed amount of data at a time
    - _close() in write mode flushes any remaining buffered data with one more PutBlock, then finalizes the blob with PutBlockList
    - The block-based approach should work for both block blobs and append blobs (see the Azure docs).
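
For reviewers unfamiliar with the block-based write path, here is a minimal sketch of the buffering idea described above. It is not the code in this change: BufferedBlockWriter, BlockStore, and InMemoryBlockStore are hypothetical names used only for illustration, and the in-memory store stands in for the blob client (in azure-storage-blob the corresponding calls would be BlobClient.stage_block and BlobClient.commit_block_list). The block-id scheme, base64 of a zero-padded index, is inferred from the block_id values in the test log and is an assumption about the real handler.

```python
import base64
from typing import Dict, List, Protocol


class BlockStore(Protocol):
    """Hypothetical stand-in for the service's PutBlock / PutBlockList calls."""

    def stage_block(self, block_id: str, data: bytes) -> None: ...

    def commit_block_list(self, block_ids: List[str]) -> None: ...


class InMemoryBlockStore:
    """Test double: records staged blocks and assembles the committed blob."""

    def __init__(self) -> None:
        self.staged: Dict[str, bytes] = {}
        self.committed = b""

    def stage_block(self, block_id: str, data: bytes) -> None:
        self.staged[block_id] = bytes(data)

    def commit_block_list(self, block_ids: List[str]) -> None:
        self.committed = b"".join(self.staged[b] for b in block_ids)


class BufferedBlockWriter:
    """Holds at most `buffering` bytes in memory before staging a block."""

    def __init__(self, store: BlockStore, buffering: int = 4096) -> None:
        if buffering <= 0:
            raise ValueError("buffering must be positive")
        self._store = store
        self._buffering = buffering
        self._buffer = bytearray()
        self._block_ids: List[str] = []
        self._closed = False

    def write(self, data: bytes) -> int:
        view = memoryview(data)
        while len(view) > 0:
            # Copy only as much as fits, so the buffer never exceeds its limit.
            room = self._buffering - len(self._buffer)
            self._buffer.extend(view[:room])
            view = view[room:]
            if len(self._buffer) == self._buffering:
                self._flush_block()
        return len(data)

    def _flush_block(self) -> None:
        idx = len(self._block_ids)
        # Assumed id scheme: base64 of the zero-padded block index.
        block_id = base64.b64encode(f"{idx:05d}".encode()).decode()
        self._store.stage_block(block_id, bytes(self._buffer))
        self._block_ids.append(block_id)
        self._buffer.clear()

    def close(self) -> None:
        if self._closed:
            return
        if self._buffer:
            self._flush_block()  # one more block for the leftover bytes
        self._store.commit_block_list(self._block_ids)  # finalize the blob
        self._closed = True


store = InMemoryBlockStore()
writer = BufferedBlockWriter(store, buffering=4096)
writer.write(b"x" * 10000)
writer.close()
```

With a 4096-byte buffer, the 10000-byte write above stages two full blocks during write() and one 1808-byte block during close(), then commits all three in order. Every staged block needs an id of the same length, which is why the index is zero-padded before encoding.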

Testing steps

$ python -m unittest discover tests/file_io/
.10/21/2022 12:23:17 PM Caching az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin ...
10/21/2022 12:23:17 PM URL az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin cached in /tmp/tmpy3w003kj/blob_cache/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin
10/21/2022 12:23:17 PM Caching az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file2.bin ...
10/21/2022 12:23:17 PM URL az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file2.bin cached in /tmp/tmpy3w003kj/blob_cache/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file2.bin
10/21/2022 12:23:17 PM URL az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin was already cached in /tmp/tmpy3w003kj/blob_cache/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin
.10/21/2022 12:23:17 PM Opening blob: path=az://lrsstoragewest3/data/temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin, mode=rb
10/21/2022 12:23:17 PM Read next chunk: blob_name=temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin, length=4096
10/21/2022 12:23:17 PM Opening blob: path=az://lrsstoragewest3/data/temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin.streamed, mode=wb
10/21/2022 12:23:17 PM Uploading a new block: blob_name=temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin.streamed, block_id=MDAwMDA=, idx=0, length=4096
10/21/2022 12:23:18 PM Committing blocks: blob_name=temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin.streamed, count=1
10/21/2022 12:23:18 PM Uploading a new block: blob_name=temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin.streamed, block_id=MDAwMDE=, idx=1, length=4096
....10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
..............................sssssssssssssss
----------------------------------------------------------------------
Ran 51 tests in 2.752s

OK (skipped=15)

(The skipped tests are for S3PathHandler, which is unchanged; it was just moved verbatim from metaseq/s3_utils.py to metaseq/file_io/s3.py.)