DocScraper

The docscraper package is a Scrapy spider that crawls a given set of websites and downloads all available documents with a given set of file extensions.

Getting Started

You can get started by installing the package with pip:

$ pip install docscraper

Once the package is installed, you can use it together with Scrapy directly from a Python script to download files from websites:

>>> import docscraper
>>> allowed_domains = ["books.toscrape.com"]
>>> start_urls = ["https://books.toscrape.com"]
>>> extensions = [".html", ".pdf", ".docx", ".doc", ".svg"]
>>> docscraper.crawl(allowed_domains, start_urls, extensions=extensions)
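
To make the role of the extensions list more concrete, here is a minimal sketch of how a URL could be matched against a set of file extensions. The helper has_allowed_extension is hypothetical and is not part of docscraper's API; the package's actual matching logic may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def has_allowed_extension(url, extensions):
    """Return True if the URL path ends in one of the given extensions.

    Hypothetical helper for illustration only; not taken from docscraper.
    """
    # Inspect only the path so query strings and fragments are ignored.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    # Compare case-insensitively, so ".PDF" matches ".pdf".
    return suffix in {ext.lower() for ext in extensions}


extensions = [".html", ".pdf", ".docx", ".doc", ".svg"]
matches = has_allowed_extension("https://books.toscrape.com/index.html", extensions)
```

Under these assumptions, a link such as https://books.toscrape.com/index.html would match, while a directory-style URL with no file suffix would not.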
