
Making data loader skipping 5-9x faster

Merged. Administrator requested to merge github/fork/zdevito/dataloader2 into main, Sep 27, 2022.

Created by: zdevito

The fast-forwarding process added with DeferredTensor avoids having to tokenize documents that have already been read, but the code to simulate the shuffle buffer and sequence construction process can still take ~7 minutes to execute on full-size shards.

This patch optimizes the hot path of the skipping process, making it roughly 5-9x faster than the current code and around 500-900x faster than the original OPT method, which should bring the fast-forward time down to under a minute.
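
As a rough illustration of why skipping can be made this cheap, here is a minimal sketch of replaying only the shuffle-buffer and sequence-packing bookkeeping over document lengths, so no document is ever tokenized. The function name, parameters, and buffer policy below are hypothetical stand-ins, not the actual metaseq implementation.

```python
import numpy as np

def fast_forward(doc_lengths, num_sequences, seq_len, buffer_size, seed):
    """Hypothetical sketch: count how many documents must be skipped to
    reproduce `num_sequences` already-consumed sequences, replaying the
    shuffle-buffer and sequence-packing bookkeeping over document lengths
    only."""
    rng = np.random.default_rng(seed)
    buffer = []              # shuffle buffer holding document lengths
    doc_iter = iter(doc_lengths)
    consumed_docs = 0
    tokens_in_current_seq = 0
    sequences_emitted = 0

    while sequences_emitted < num_sequences:
        # Keep the shuffle buffer full, using lengths as stand-ins for docs.
        while len(buffer) < buffer_size:
            try:
                buffer.append(next(doc_iter))
            except StopIteration:
                break
        if not buffer:
            break  # ran out of documents before reaching the target
        # Pop a random slot, mirroring the shuffle the real loader performs.
        idx = int(rng.integers(len(buffer)))
        length = buffer.pop(idx)
        consumed_docs += 1
        # Pack this document's tokens into fixed-length training sequences.
        tokens_in_current_seq += length
        while tokens_in_current_seq >= seq_len and sequences_emitted < num_sequences:
            tokens_in_current_seq -= seq_len
            sequences_emitted += 1
    return consumed_docs
```

Because only integer arithmetic and buffer bookkeeping run in the loop, the cost per skipped sequence is tiny compared to tokenizing the underlying documents.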

  • Merges document shuffling and the construction of sequences from documents into a single object (DocumentToSequenceDataset). This keeps all the deferred-execution logic in one file rather than exposing a DeferredTensor object whose construction is slow.
  • Uses coarse-grained per-worker locking rather than atomics, so C extensions are no longer needed (see the locking sketch after this list).
  • Generates random numbers from numpy in blocks, because asking for a single number at a time has high per-call overhead (see the blocked-RNG sketch after this list).
  • The dataset merge and the change to random-number generation are done so that the random behavior of the merged object matches that of the previous code, so a checkpoint created with an older version of the code can safely be loaded by this code.
  • I also manually checked a requeue after a checkpoint to verify that docsperex and loss match before and after the checkpoint load.
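
A minimal sketch of the coarse-grained per-worker locking mentioned above; the class and counter names are hypothetical, not the actual metaseq data structures. The point is that one ordinary `threading.Lock` per worker replaces per-field atomics, so no C extension is required.

```python
import threading

class WorkerState:
    """Hypothetical per-worker state guarded by one coarse-grained lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.sequences_skipped = 0
        self.documents_consumed = 0

    def record_skip(self, num_sequences, num_documents):
        # A single lock acquisition covers all counter updates for this
        # worker, so no per-field atomic operations are needed.
        with self._lock:
            self.sequences_skipped += num_sequences
            self.documents_consumed += num_documents

    def snapshot(self):
        with self._lock:
            return (self.sequences_skipped, self.documents_consumed)
```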
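
And a sketch of blocked random-number generation; the helper name and block size are assumptions, not the real code. Drawing a block of uniform floats up front and scaling each one at the point of use amortizes numpy's per-call overhead while still allowing the upper bound (for example, the current shuffle-buffer size) to vary on every draw.

```python
import numpy as np

class BlockedRandom:
    """Hypothetical helper that draws random numbers from numpy in blocks
    to amortize the overhead of asking for one number at a time."""

    def __init__(self, seed, block_size=1024):
        self._rng = np.random.default_rng(seed)
        self._block_size = block_size
        self._block = np.empty(0, dtype=np.float64)
        self._pos = 0

    def randint(self, upper):
        # Refill the block of uniform floats in [0, 1) when it runs out.
        if self._pos >= len(self._block):
            self._block = self._rng.random(self._block_size)
            self._pos = 0
        value = self._block[self._pos]
        self._pos += 1
        # Scale the uniform float to an integer in [0, upper).
        return int(value * upper)
```

A caller can then use `rng.randint(len(buffer))` wherever it previously asked numpy for a single integer per draw.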