Administrator / metaseq / Merge requests / !337

[jsonl] Make skipping entries in the dataloader significantly faster

Merged Administrator requested to merge github/fork/zdevito/dataloader into main Sep 13, 2022

Created by: zdevito

Patch Description: Introduce machinery to skip ahead in the dataset without re-tokenizing or re-reading the files in the dataset.

Measurements from a separate benchmark script in the internal repo indicate this can reduce the 'fast-forward' stage of the data loader from ~20 minutes to 14 seconds (around 70x speedup).

This works by storing a cache that maps document idx -> number of tokens; the cache can be stored in a snapshot as an array of numbers. When the token count is known, the DeferredDataset creates DeferredTensor objects that know their size, how to compute their value if needed, and how to generate new tensors via slices and concatenations. Another object, SkipDeferredDataset, skips the first to_skip elements without ever computing the DeferredTensor values. This bypasses tokenization while keeping the state of the data loader (e.g. the shuffle buffer) exactly the same as if it had actually been running.
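
The idea above can be illustrated with a minimal, self-contained sketch. This is not the code in this merge request: the class body, `deferred_docs`, `pack_blocks`, `skip_first`, and the whitespace `tokenize` are hypothetical stand-ins, and plain Python lists are used in place of tensors. It only shows how known sizes let downstream steps (here, packing into fixed-size blocks) advance without tokenizing anything that is skipped.

```python
from typing import Callable, Iterable, Iterator, List


class DeferredTensor:
    """Hypothetical stand-in: size is known up front, values are computed lazily."""

    def __init__(self, size: int, compute: Callable[[], List[str]]):
        self.size = size
        self._compute = compute

    def __len__(self) -> int:
        return self.size

    def realize(self) -> List[str]:
        value = self._compute()
        if len(value) != self.size:
            raise ValueError("cached token count does not match the data")
        return value

    def slice(self, start: int, stop: int) -> "DeferredTensor":
        # The new size is known without computing any values.
        return DeferredTensor(stop - start, lambda: self.realize()[start:stop])

    @staticmethod
    def concat(parts: Iterable["DeferredTensor"]) -> "DeferredTensor":
        parts = list(parts)
        size = sum(len(p) for p in parts)
        return DeferredTensor(size, lambda: [t for p in parts for t in p.realize()])


def deferred_docs(docs, token_counts, tokenize) -> Iterator[DeferredTensor]:
    # token_counts plays the role of the cached "document idx -> number of tokens" array.
    for text, count in zip(docs, token_counts):
        yield DeferredTensor(count, lambda text=text: tokenize(text))


def pack_blocks(items, block_size: int) -> Iterator[DeferredTensor]:
    """Cut the document stream into blocks of block_size tokens using sizes only."""
    parts, filled = [], 0
    for item in items:
        offset = 0
        while offset < len(item):
            take = min(block_size - filled, len(item) - offset)
            parts.append(item.slice(offset, offset + take))
            filled += take
            offset += take
            if filled == block_size:
                yield DeferredTensor.concat(parts)
                parts, filled = [], 0
    # A trailing partial block is dropped in this sketch.


def skip_first(items, to_skip: int) -> Iterator[DeferredTensor]:
    """Consume the first to_skip items without ever calling realize() on them."""
    for i, item in enumerate(items):
        if i >= to_skip:
            yield item


tokenized = []  # records which documents were actually tokenized


def tokenize(text: str) -> List[str]:
    tokenized.append(text)
    return text.split()


docs = ["a b c", "d e", "f g h i", "j"]
token_counts = [3, 2, 4, 1]  # assumed to come from an earlier pass or a snapshot

blocks = skip_first(pack_blocks(deferred_docs(docs, token_counts, tokenize), 4), 1)
first_block = next(blocks).realize()
print(first_block)  # ['e', 'f', 'g', 'h']
print(tokenized)    # ['d e', 'f g h i']
```

In this sketch the skipped block covers "a b c" and the first token of "d e", yet "a b c" is never tokenized; only the documents that overlap the first realized block are. The packing state after the skip is the same as if the first block had been fully computed.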

Source branch: github/fork/zdevito/dataloader