Merge request !275: replace newpage by custom crawl

Merged. yujiosaka requested to merge replace-newpage-by-custom-crawl into master on Jun 10, 2018.

There is a growing need to access Puppeteer's raw page object both before and after requests. My implementation of the newpage event was a mistake for the following three reasons (a sketch of the old pattern follows this list):

  1. You cannot pass values retrieved from the page object to the crawling results.
  2. You cannot access the page object after requests in order to get cookie values, console logs, etc.
  3. You cannot return a Promise, so you have to deal with race conditions.
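For context, here is a minimal sketch of the old pattern. The exact newpage wiring (an event on the crawler instance) is an assumption for illustration, but it shows why all three limitations arise:

const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    onSuccess: result => {
      // 1. Anything computed in the newpage handler cannot reach this result
      console.log(`Requested ${result.options.url}.`);
    },
  });
  // Assumed wiring: newpage fires with the raw page before navigation
  crawler.on('newpage', page => {
    // 2. The page is exposed only before the request, so cookies and
    //    console logs produced after navigation are out of reach here
    // 3. The return value is ignored, so returning a Promise does not
    //    delay the crawl; async setup here races against the request
    page.setRequestInterception(true);
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();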

Thus, I'd like to introduce a new customCrawl option and replace the newpage event with it. It goes like this:

const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    customCrawl: async (page, crawl) => {
      // You can access the page object before requests
      await page.setRequestInterception(true);
      page.on('request', request => {
        if (request.url().endsWith('/')) {
          request.continue();
        } else {
          request.abort();
        }
      });
      // The result contains options, links, cookies, etc.
      const result = await crawl();
      // You can access the page object after requests
      result.content = await page.content();
      // You need to extend and return the crawled result
      return result;
    },
    onSuccess: result => {
      console.log(`Got ${result.content} for ${result.options.url}.`);
    },
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
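The same hook also addresses the second limitation above: because the page stays accessible after crawl() resolves, state produced during the request can be gathered. Here is a minimal sketch using only the customCrawl signature shown above, collecting console messages:

const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    customCrawl: async (page, crawl) => {
      const logs = [];
      // Attach the listener before the request so no messages are missed
      page.on('console', msg => logs.push(msg.text()));
      // crawl() performs the request and returns the usual result
      const result = await crawl();
      // The page is still live here, so post-request state can be read
      result.logs = logs;
      return result;
    },
    onSuccess: result => {
      console.log(`Captured ${result.logs.length} console messages for ${result.options.url}.`);
    },
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();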

Fixes: https://github.com/yujiosaka/headless-chrome-crawler/issues/254 https://github.com/yujiosaka/headless-chrome-crawler/issues/256 https://github.com/yujiosaka/headless-chrome-crawler/pull/233
