How to set Scrapy configuration values

In a nutshell

For the Public.Law Open-gov spiders, the production secrets are set via Zyte’s Spider Settings UI. Then this code at the end of settings.py enables local development mode:

#
# In development mode only, set the sensitive and environment-
# dependent configuration values via env variables. On development
# machines, set `SCRAPY_DEVELOPMENT_MODE` to make this work. This
# isn't necessary, however, to develop and run the spiders.
#
if "SCRAPY_DEVELOPMENT_MODE" in os.environ:
    SPIDERMON_TELEGRAM_SENDER_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
    SPIDERMON_TELEGRAM_RECIPIENTS = os.environ["TELEGRAM_BOT_GROUP_ID"]
    CRAWLERA_ENABLED = True
    CRAWLERA_APIKEY = os.environ["CRAWLERA_APIKEY"]
    CRAWLERA_USER = os.environ["CRAWLERA_APIKEY"]
    LOG_LEVEL = os.environ["SCRAPY_LOG_LEVEL"]
    USER_AGENT = os.environ["SCRAPY_USER_AGENT"]

The details: env vars in dev, Scrapy configs in production.

Python Scrapy is a batteries-included, complete ecosystem for scraping data: libraries, commercial hosting, open source hosting, and active communities.

One funny, tricky thing for me has been configuring the code in the modern & secure style, which comes down to three goals. First, no API keys or secrets are committed with the code. Second, each environment (development and production) picks up the proper settings. Finally, for the Public.Law scrapers, external open-source devs can work with the code quickly without setting up a bunch of APIs.

It turns out that Zyte’s Spider Settings UI does not set OS environment variables. Instead, it directly sets Scrapy spider settings.

So, in order to run spiders locally as well as on a host like Zyte, two methods of setting the configs must be supported:

  • In production, do not use os.environ. Instead, simply set the configs in the web UI and the Scrapy library will pick them up.
  • In development, we do want to use os.environ. In this case, we read the env vars, set constants in settings.py, and the library will pick them up from there.
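
The two cases above can be sketched as one small function. This is only an illustration, not the code from the Public.Law settings.py: the helper name `development_overrides` and the `ENV_TO_SETTING` table are hypothetical, and only two of the settings are shown. It takes the environment as a plain mapping so it can be tried without touching the real one.

```python
# Hypothetical table for illustration: env var name -> Scrapy setting name.
ENV_TO_SETTING = {
    "SCRAPY_LOG_LEVEL": "LOG_LEVEL",
    "SCRAPY_USER_AGENT": "USER_AGENT",
}


def development_overrides(environ):
    """Return settings read from env vars, or nothing when not in dev mode."""
    if "SCRAPY_DEVELOPMENT_MODE" not in environ:
        # Production: the host's settings UI supplies the values instead.
        return {}
    return {setting: environ[var] for var, setting in ENV_TO_SETTING.items()}


# Production-like environment: no flag, so no overrides.
print(development_overrides({}))

# Development-like environment: the flag plus the two variables.
dev_env = {
    "SCRAPY_DEVELOPMENT_MODE": "yes",
    "SCRAPY_LOG_LEVEL": "DEBUG",
    "SCRAPY_USER_AGENT": "example-bot/0.1",
}
print(development_overrides(dev_env))
```

In a real settings.py, something like `globals().update(development_overrides(os.environ))` should have the same effect as the explicit assignments, since Scrapy picks up the upper-case module-level names.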

The final trick is figuring out whether we’re in production or dev. I couldn’t find anything in the docs or debug output. So on my dev laptop, I simply set a flag variable (I use the fish shell):

set -gx SCRAPY_DEVELOPMENT_MODE yes

I check for that env var in a safe way, which skips the dev-only code when in production:

if "SCRAPY_DEVELOPMENT_MODE" in os.environ:
    ...
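
Here is a minimal sketch of why the membership test is the safe choice: indexing a missing variable raises `KeyError`, while `in` just returns `False`. A plain dict stands in for `os.environ` here, and the helper name `is_development_mode` is hypothetical.

```python
def is_development_mode(environ):
    """True when the flag variable exists, whatever its value."""
    return "SCRAPY_DEVELOPMENT_MODE" in environ


production_env = {}  # stand-in for os.environ on the production host

# The membership test never raises, so production skips the dev-only block.
print(is_development_mode(production_env))

# Direct indexing of a missing variable raises KeyError instead.
try:
    production_env["TELEGRAM_BOT_TOKEN"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)
```

Note that only the presence of the variable matters: setting `SCRAPY_DEVELOPMENT_MODE` to `no`, or even to an empty string, still turns development mode on.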