metaseq merge request !426

E2E test for training resumption, mocks download and upload of azure checkpoints

Merged Administrator requested to merge github/fork/Xirider/test_checkpoint_saving into main Oct 17, 2022

Created by: Xirider

Adds an E2E test for training resumption and the Azure checkpoint storage logic. Related issues: https://github.com/facebookresearch/metaseq/issues/351 and https://github.com/facebookresearch/metaseq/issues/268

  1. Training runs for 20 steps. The test checks that the correct checkpoints are created, that they are uploaded to Azure blob storage, and that the train loss is correct.
  2. Training resumes for 15 steps. The test checks that the last checkpoints are "downloaded" from Azure blob storage and loaded, and that the final loss is correct.

Azure blob download and upload are mocked out.
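The two phases above can be illustrated with a toy sketch. This is not the test added in this merge request and it uses none of metaseq's APIs: `BlobClient`, `train`, the `checkpoint_<step>.pt` naming, the save interval of 10 and the `1 / step` loss are all hypothetical stand-ins. It also assumes that "resumes for 15 steps" means 15 further steps, ending at step 35.

```python
import os
import tempfile
from unittest import mock


class BlobClient:
    """Hypothetical storage client; the real methods would need network access."""

    def upload(self, path):
        raise RuntimeError("real upload attempted in a test")

    def download(self, name, dest_dir):
        raise RuntimeError("real download attempted in a test")


def train(client, work_dir, total_steps, resume_from=None, save_every=10):
    """Toy loop: the 'loss' is 1 / step and a checkpoint stores only the step."""
    step, loss = 0, None
    if resume_from is not None:
        path = client.download(resume_from, work_dir)
        with open(path) as f:
            step = int(f.read())
    while step < total_steps:
        step += 1
        loss = 1.0 / step
        if step % save_every == 0:
            path = os.path.join(work_dir, f"checkpoint_{step}.pt")
            with open(path, "w") as f:
                f.write(str(step))
            client.upload(path)
    return step, loss


def run_resumption_test():
    remote = {}  # in-memory stand-in for the blob container

    def fake_upload(path):
        with open(path) as f:
            remote[os.path.basename(path)] = f.read()

    def fake_download(name, dest_dir):
        path = os.path.join(dest_dir, name)
        with open(path, "w") as f:
            f.write(remote[name])
        return path

    with mock.patch.object(BlobClient, "upload", side_effect=fake_upload) as up, \
            mock.patch.object(BlobClient, "download", side_effect=fake_download) as down:
        client = BlobClient()
        # Phase 1: fresh run for 20 steps.
        with tempfile.TemporaryDirectory() as first_dir:
            step, loss = train(client, first_dir, total_steps=20)
        assert (step, loss) == (20, 1.0 / 20)
        assert sorted(remote) == ["checkpoint_10.pt", "checkpoint_20.pt"]
        assert up.call_count == 2
        # Phase 2: empty directory, resume from the last upload for 15 more steps.
        with tempfile.TemporaryDirectory() as second_dir:
            step, loss = train(client, second_dir, total_steps=35,
                               resume_from="checkpoint_20.pt")
        down.assert_called_once_with("checkpoint_20.pt", second_dir)
        assert (step, loss) == (35, 1.0 / 35)
    return sorted(remote)
```

Calling `run_resumption_test()` exercises both phases. The second phase starts in an empty directory, so it can only succeed if the mocked download really supplies the checkpoint.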

Note that newly spawned subprocesses do not keep mocked objects, so I instead pass functions that create the mocks inside the subprocesses.
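A minimal sketch of that pattern, under stated assumptions: `Storage`, `patch_upload` and `worker_main` are hypothetical names chosen for illustration and are not the helpers used in this merge request. The worker entry point receives plain functions and installs the mocks itself, so the patches exist in whichever process runs it.

```python
from unittest import mock


class Storage:
    """Hypothetical storage wrapper; the real method would call a blob service."""

    @staticmethod
    def upload(path):
        raise RuntimeError("real upload attempted")


def patch_upload():
    """Mock factory: creates and starts the patch in the calling process."""
    patcher = mock.patch.object(Storage, "upload", return_value="mocked")
    patcher.start()
    return patcher


def worker_main(mock_factories, path):
    """Entry point intended to run inside a worker process.

    It takes factory functions rather than mock objects and applies them
    before doing any work, then undoes the patches when it finishes.
    """
    patchers = [factory() for factory in mock_factories]
    try:
        return Storage.upload(path)
    finally:
        for patcher in patchers:
            patcher.stop()
```

Here `worker_main([patch_upload], "checkpoint_20.pt")` returns `"mocked"` without touching the real upload. In a multi-process setup, `worker_main` would be the target handed to the process launcher. Module-level functions such as `patch_upload` are pickled by reference, so they can cross a spawn boundary, whereas a patch applied only in the parent does not carry over to a freshly started interpreter.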

Source branch: github/fork/Xirider/test_checkpoint_saving