Zachary DeVito authored
The fast-forwarding process added with DeferredTensor avoids having to re-tokenize documents that have already been read, but the code that simulates the shuffle buffer and sequence construction can still take ~7 minutes to execute on full-size shards. This patch optimizes the hot path of the skipping process, making it somewhere between 5--9x faster than the current state and around 500--900x faster than the original OPT method, which should bring the time to do the fast forward down below a minute.

* Merges the shuffling of documents and the creation of sequences of documents into one object (DocumentToSequenceDataset). This allows all the deferred-execution logic to live in one file rather than exposing a DeferredTensor object whose construction is slow.
* Uses coarse-grained per-worker locking rather than atomics so that we no longer need C extensions.
* Performs blocked generation of random numbers from numpy b...
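The blocked random-number generation mentioned above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual DocumentToSequenceDataset code: the function name, parameters, and skip logic are hypothetical, but it shows the core idea of replacing one numpy RNG call per skipped document with one vectorized call per block.

```python
import numpy as np

def simulate_shuffle_skip(num_items: int, buffer_size: int,
                          seed: int, block: int = 4096) -> np.ndarray:
    """Hypothetical sketch: reproduce the random buffer indices that a
    shuffle buffer would have drawn while reading num_items documents,
    generating them in blocks instead of one call per document."""
    rng = np.random.default_rng(seed)
    chunks = []
    remaining = num_items
    while remaining > 0:
        n = min(block, remaining)
        # One vectorized call replaces n per-item rng calls; the
        # per-call Python overhead is paid once per block instead.
        chunks.append(rng.integers(0, buffer_size, size=n))
        remaining -= n
    return np.concatenate(chunks)
```

Because the generator state advances identically whether the draws happen one at a time or in blocks of the same distribution, the simulated skip stays deterministic while the Python-level overhead drops by roughly the block size.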
6813700a