metaseq / Merge requests / !398

Add support for source target style datasets to DocumentToSequence

Merged Administrator requested to merge github/fork/zdevito/dataloader4 into main Oct 07, 2022

Created by: zdevito

Training runs using StreamingSrcTgtDataset were failing because that dataset did not do the same token-length caching as DocumentToSequence.

StreamingSrcTgtDataset is really just another instance of StreamingTokenBlockDataset in which the blocks are split into a tuple (src, target). To avoid duplication, this PR adds support for that case directly to DocumentToSequence, along with a test to verify that it replicates the old behavior.
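
The relationship described above can be illustrated with a small sketch. This is not metaseq's implementation: `pack_blocks`, `shift_by_one`, and the shift-by-one split rule are hypothetical names and choices made only for illustration. It shows one block-packing routine that yields either plain token blocks or (src, target) tuples, depending on whether a split function is supplied.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Sequence, Tuple, Union

Block = List[int]
SrcTgt = Tuple[Block, Block]


def pack_blocks(
    docs: Iterable[Sequence[int]],
    block_size: int,
    split: Optional[Callable[[Block], SrcTgt]] = None,
) -> Iterator[Union[Block, SrcTgt]]:
    """Concatenate tokenized documents and yield blocks of `block_size` tokens.

    If `split` is given, each block is yielded as a (src, target) tuple;
    otherwise the plain block is yielded. Hypothetical helper, not metaseq API.
    """
    buffer: Block = []
    for doc in docs:
        buffer.extend(doc)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            block, buffer = buffer[:block_size], buffer[block_size:]
            yield split(block) if split else block
    if buffer:
        # Emit the final, possibly shorter, block.
        yield split(buffer) if split else buffer


def shift_by_one(block: Block) -> SrcTgt:
    # Assumed split rule, for illustration only: target is src shifted by one token.
    return block[:-1], block[1:]


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
plain_blocks = list(pack_blocks(docs, block_size=4))
src_tgt_blocks = list(pack_blocks(docs, block_size=4, split=shift_by_one))
```

Because both modes share the same packing loop, any bookkeeping added to that loop (such as caching token lengths) applies to the (src, target) case without a second code path, which is the kind of duplication the PR description aims to avoid.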

Source branch: github/fork/zdevito/dataloader4