Vinta Chen / awesome-python · Merge request !977

Adding weboob in the Web Crawling section

Closed · Administrator requested to merge github/fork/Mistress-Anna/patch-1 into master on Nov 15, 2017

Created by: Mistress-Anna

What is this Python project?

WebOOB is a framework for scraping websites and aggregating data from multiple websites.

What's the difference between this Python project and similar ones?

  • A routing model that maps URL patterns to multiple Page classes, with all the parsing for each of those Pages kept alongside it, for cleaner code
  • Scraping is made easy thanks to "declarative parsing": each Page can declare a few XPaths and configure a few "filters" to apply to them (such as parsing an int, applying a regex, etc.), and you're set!
  • Like every high-level feature in WebOOB, this declarative parsing can be disabled locally when it doesn't fit a particular site, and it's always possible to fall back to plain-old procedural parsing code
  • Pagination handling, with support for infinite iterators
  • Typed data models to ensure clean scraped data
  • Can handle HTML/XML, JSON, and even XLS or PDF
  • (Optional) Can aggregate data from multiple websites by grouping them in categories (for example "video sites", "banking sites", "public transport sites", "event sites", etc.)
  • Ships with ~250 pre-existing website crawling backends
  • Has a few graphical and command-line apps to explore and search the scraped data
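
To make the routing and "declarative parsing" bullets concrete, here is a rough, stdlib-only sketch of the idea: URL patterns are mapped to Page classes, and each Page declares its fields as XPath-plus-filter pairs. All class, field, and filter names below are illustrative inventions for this sketch, not WebOOB's actual API.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical "filters": small callables applied to the raw text that
# a given XPath extracts (loosely modeled on the idea described above).
def clean_text(raw):
    """Collapse runs of whitespace and strip the ends."""
    return " ".join(raw.split())

def to_int(raw):
    """Parse the first integer found in the string."""
    match = re.search(r"\d+", raw)
    if match is None:
        raise ValueError("no integer in %r" % raw)
    return int(match.group())

class Page:
    """A page declares FIELDS as name -> (xpath, filter); parse() applies
    each filter to the text of the first node matching its XPath."""
    FIELDS = {}

    def __init__(self, tree):
        self.tree = tree

    def parse(self):
        result = {}
        for name, (xpath, filt) in self.FIELDS.items():
            node = self.tree.find(xpath)
            result[name] = filt(node.text or "")
        return result

class ItemPage(Page):
    # Each field is just an XPath plus a filter to clean its value.
    FIELDS = {
        "title": (".//h1", clean_text),
        "price": (".//span", to_int),
    }

class Browser:
    """Toy 'routing model': URL patterns mapped to Page classes."""
    PAGES = [(re.compile(r"/item/\d+$"), ItemPage)]

    def handle(self, url, html):
        for pattern, page_cls in self.PAGES:
            if pattern.search(url):
                return page_cls(ET.fromstring(html)).parse()
        raise LookupError("no page matches %r" % url)

html = "<div><h1>  A   Title </h1><span>price: 42 EUR</span></div>"
print(Browser().handle("https://example.com/item/7", html))
# → {'title': 'A Title', 'price': 42}
```

Dropping a Page's FIELDS and overriding parse() directly corresponds to the "fall back to plain-old procedural parsing" escape hatch mentioned above.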