OptimalBits / bull / Issues / #356
Closed
Open
Issue created Oct 12, 2016 by Administrator (@root)

Jobs get double processed (if removed immediately upon completion)

Created by: bradvogel

There exists a situation where a job can get processed a second time by the processStalledJobs periodic cleanup job. This only happens when jobs are removed upon completion by the worker that processed them (a common pattern - see #354 (closed)).

We only discovered this because "Error: Missing Job xx when trying to move from active to completed" showed up in the logs of a server that runs a high-volume (100 events/sec) Bull queue.

This is how it happens:

| Time | Process A | Process B |
|------|-----------|-----------|
| 1 | In the regular Bull run loop of Process A, getNextJob moves a job from wait to active | |
| 2 | | In Process B, processStalledJob happens to run, pulls all active jobs (this.client.lrangeAsync(this.toKey('active'))), and begins to iterate over them |
| 3 | processes the job normally | pulls job data via Job.fromId (the job data still exists at this point) |
| 4 | completes processing the job and the application code calls job.remove(), which removes the job data and the lock | |
| 5 | | calls scripts.getStalledJob, which only checks if the job is in completed (which it isn't anymore, so it continues to grab the lock) |
| 6 | | job is processed again |
| 7 | | upon job completion, Error: Missing Job 59694333 when trying to move from active to completed is thrown because the job data doesn't exist anymore |
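
The interleaving above can be replayed without Redis. The following is only a minimal in-memory sketch: the state object, `createQueueState`, `getStalledJobNaive`, and `simulateRace` are hypothetical names made up for illustration, not Bull's internals, and it assumes Process A holds the job lock while it works.

```javascript
// In-memory stand-in for the Redis keys; purely illustrative.
function createQueueState() {
  return {
    wait: ['59694333'],
    active: [],
    completed: new Set(),
    jobs: new Map([['59694333', { data: 'payload' }]]),
    locks: new Map(),
  };
}

// Models the check described in the issue: the job is skipped only if it
// is in completed (or already locked); otherwise the caller takes the lock.
function getStalledJobNaive(state, jobId, worker) {
  if (state.completed.has(jobId)) return false;
  if (state.locks.has(jobId)) return false;
  state.locks.set(jobId, worker);
  return true;
}

function simulateRace() {
  const state = createQueueState();
  const log = [];

  // Time 1: Process A moves the job from wait to active (assumed to lock it).
  const jobId = state.wait.shift();
  state.active.push(jobId);
  state.locks.set(jobId, 'A');

  // Time 2: Process B snapshots the active list.
  const snapshot = [...state.active];

  // Time 3: A processes the job; B loads the job data, which still exists.
  log.push('A processed ' + jobId);
  const jobSeenByB = state.jobs.get(snapshot[0]);

  // Time 4: A finishes and removes the job right away (data and lock gone).
  state.active = state.active.filter((id) => id !== jobId);
  state.jobs.delete(jobId);
  state.locks.delete(jobId);

  // Time 5: B runs the stalled check against its stale snapshot.
  if (jobSeenByB && getStalledJobNaive(state, snapshot[0], 'B')) {
    // Time 6: B processes the job a second time.
    log.push('B processed ' + jobId);
    // Time 7: completion fails because the job data no longer exists.
    if (!state.jobs.has(jobId)) {
      log.push('Error: Missing Job ' + jobId +
        ' when trying to move from active to completed');
    }
  }
  return log;
}

console.log(simulateRace());
```

The key point the sketch tries to capture is that Process B acts on a snapshot of the active list taken at time 2, so nothing it checks at time 5 tells it the job has already been finished and removed.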

Perhaps a solution here is to have scripts.getStalledJob ensure the job is in the active state, not merely checking if it's not in completed - since now we know that it could have been removed prior to this check. So if a job is in active AND doesn't have an existing lock, then processStalledJobs knows that another worker isn't processing it. However, since the active queue is a LIST, checking the existence of the element in the list is expensive (requires list iteration in the lua script).
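
To make the proposed check concrete, here is a rough in-memory sketch of it. `getStalledJobStrict` and the shape of `state` are assumptions for illustration only; a real fix would live in the lua script and operate on the Redis keys.

```javascript
// Hypothetical stricter check: a job counts as stalled only if it is
// still in the active list AND nobody holds its lock.
function getStalledJobStrict(state, jobId, worker) {
  // Linear scan of the list, which is the cost the issue worries about.
  const stillActive = state.active.includes(jobId);
  if (!stillActive) return false;           // already removed or moved on
  if (state.locks.has(jobId)) return false; // another worker is on it
  state.locks.set(jobId, worker);
  return true;
}

// Illustrative state: job '1' is stalled, job '2' is being worked on,
// job '3' was removed right after completion.
const demoState = {
  active: ['1', '2'],
  locks: new Map([['2', 'A']]),
};

console.log(getStalledJobStrict(demoState, '1', 'B')); // true
console.log(getStalledJobStrict(demoState, '2', 'B')); // false
console.log(getStalledJobStrict(demoState, '3', 'B')); // false
```

The `includes` call stands in for the list iteration mentioned above; its cost grows with the number of active jobs.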

Our temporary workaround is to delay calling job.remove() at the completion of the job to leave around a 'tombstone' so processStalledJobs will see it in the completed queue and skip over it.
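
A minimal sketch of that workaround, under stated assumptions: `removeLater` is a hypothetical helper, the 60000 ms delay is an arbitrary placeholder, and the job here is a fake object rather than a real Bull job. The scheduler is injectable so the demo runs synchronously.

```javascript
// Hypothetical helper: postpone job.remove() so the completed entry
// stays around as a tombstone for a while.
function removeLater(job, delayMs, schedule = setTimeout) {
  return schedule(() => job.remove(), delayMs);
}

// Demo with a fake job and a manual scheduler instead of real timers.
const pendingTimers = [];
const fakeJob = {
  removed: false,
  remove() { this.removed = true; },
};

removeLater(fakeJob, 60000, (fn, ms) => pendingTimers.push({ fn, ms }));
const removedBeforeDelay = fakeJob.removed; // still false: tombstone remains

pendingTimers.forEach((timer) => timer.fn()); // simulate the delay elapsing
const removedAfterDelay = fakeJob.removed;    // now true

console.log(removedBeforeDelay, removedAfterDelay);
```

In an application the default `setTimeout` scheduler would be used, and the delay would presumably need to be longer than the interval at which processStalledJobs runs.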
