metaseq / Merge requests / !312

[jsonl] Fix regression in index building.

Merged. Administrator requested to merge jsonlregression into main on Aug 20, 2022.

Created by: stephenroller

Patch Description

There was a performance regression in the JSONL dataset indexer: while building the index, it parsed the JSONL files when it only needed the position at which each line starts. The parsing was originally added so that lines that fail to parse could be skipped.

This PR returns us to the much older behavior of crashing hard, midway through training, when we hit a bad data point. On the positive side, we still report where in the file the bad data was.
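For readers unfamiliar with the approach, here is a minimal sketch of the idea described above: build the index from line start positions only, and defer JSON parsing to read time, where a bad line raises an error that names its location. This is not the code from this PR; `build_line_offsets` and `read_record` are hypothetical names chosen for illustration, and byte offsets are assumed as the index format.

```python
import json


def build_line_offsets(path):
    """Return the byte offset at which each line starts, without parsing any JSON."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def read_record(path, offsets, index):
    """Seek to one line and parse it; fail loudly, reporting where the bad data is."""
    with open(path, "rb") as f:
        f.seek(offsets[index])
        raw = f.readline()
    try:
        return json.loads(raw)
    except ValueError as e:  # json.JSONDecodeError is a subclass of ValueError
        raise ValueError(
            f"Bad JSON in {path} at line {index + 1} "
            f"(byte offset {offsets[index]})"
        ) from e
```

Indexing this way is a single pass over raw bytes, so its cost does not depend on how expensive each record is to parse. The trade-off is the one the description names: a malformed line is only discovered when it is first read.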

We may need to build a separate validator to deal with bad data parses, but at the end of the day, it doesn't feel like "skipping bad data" is the right thing to do.
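A separate validator of the kind mentioned above could look roughly like the following. This is only a sketch under assumptions: no such tool is part of this PR, and `find_bad_lines` is a hypothetical name. It would run offline, before training, and list every line that fails to parse instead of stopping at the first one.

```python
import json


def find_bad_lines(path):
    """Return (line_number, error_message) for every line that is not valid JSON."""
    bad = []
    with open(path, "rb") as f:
        for line_number, raw in enumerate(f, start=1):
            try:
                json.loads(raw)
            except ValueError as e:  # covers JSONDecodeError and UnicodeDecodeError
                bad.append((line_number, str(e)))
    return bad
```

Running a check like this ahead of time would keep the indexer fast while still surfacing bad data before it can crash a training run.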

Testing steps

In progress

Source branch: jsonlregression