In a nutshell
```python
import os

# In development mode only, set the sensitive and environment-
# dependent configuration values via env variables. On development
# machines, set `SCRAPY_DEVELOPMENT_MODE` to make this work. This
# isn't necessary, however, to develop and run the spiders.
#
if "SCRAPY_DEVELOPMENT_MODE" in os.environ:
    SPIDERMON_TELEGRAM_SENDER_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
    SPIDERMON_TELEGRAM_RECIPIENTS = os.environ["TELEGRAM_BOT_GROUP_ID"]
    CRAWLERA_ENABLED = True
    CRAWLERA_APIKEY = os.environ["CRAWLERA_APIKEY"]
    CRAWLERA_USER = os.environ["CRAWLERA_APIKEY"]
    LOG_LEVEL = os.environ["SCRAPY_LOG_LEVEL"]
    USER_AGENT = os.environ["SCRAPY_USER_AGENT"]
```
The details: env vars in dev, Scrapy configs in production.
One funny, tricky thing for me has been configuring the code in the modern & secure style: no API keys or secrets committed with the code; each environment (development and production) picks up the proper settings; and, for the Public.Law scrapers, external open-source devs can work with the code quickly without setting up a bunch of APIs.
It turns out that Zyte’s Spider Settings UI does not set OS environment variables. Instead, it directly sets Scrapy spider settings.
So, in order to run spiders locally as well as on a host like Zyte, two methods of setting the configs must be supported:
- In production, do not use `os.environ`. Instead, simply set the configs in the web UI and the Scrapy library will pick them up.
- In development, we do want to use `os.environ`. In this case, we read the env vars, set constants in `settings.py`, and the library will pick them up from there.
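The two cases above can be sketched in a single `settings.py` fragment. This is a minimal sketch, not the full file; setting the env vars inline at the top only simulates a development shell, and `public-law-bot/dev` is a made-up user-agent value for illustration:

```python
import os

# Simulate a development shell where the flag and configs are exported.
# (On a real dev machine, these come from the shell environment.)
os.environ["SCRAPY_DEVELOPMENT_MODE"] = "yes"
os.environ["SCRAPY_LOG_LEVEL"] = "DEBUG"
os.environ["SCRAPY_USER_AGENT"] = "public-law-bot/dev"  # hypothetical value

# settings.py pattern: in production this whole block is skipped, and
# the values set in Zyte's Spider Settings UI reach Scrapy directly.
if "SCRAPY_DEVELOPMENT_MODE" in os.environ:
    LOG_LEVEL = os.environ["SCRAPY_LOG_LEVEL"]
    USER_AGENT = os.environ["SCRAPY_USER_AGENT"]
```

The design choice here is that `settings.py` stays committable: it names no secrets, and each environment supplies its own values through its own channel.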
The final trick is figuring out whether we’re in production or development. I couldn’t find anything in the docs or debug output that indicates this, so on my dev laptop I simply set a flag variable (I use the fish shell):
```shell
set -gx SCRAPY_DEVELOPMENT_MODE yes
```
I check for that env var in a safe way which skips the remaining code when in production:

```python
if "SCRAPY_DEVELOPMENT_MODE" in os.environ:
    ...
```
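The membership test is what makes this safe: indexing `os.environ` for an unset variable raises `KeyError`, while the `in` check just evaluates to `False` and the block is skipped. A short sketch of the difference:

```python
import os

# Simulate production by ensuring the flag is absent.
os.environ.pop("SCRAPY_DEVELOPMENT_MODE", None)

# The guard is simply False, so the env-var reads inside the dev-only
# block never execute and can't raise for unset variables.
dev_mode = "SCRAPY_DEVELOPMENT_MODE" in os.environ
print(dev_mode)  # False when the flag isn't set

# By contrast, a bare lookup of an unset variable raises KeyError:
try:
    os.environ["SCRAPY_DEVELOPMENT_MODE"]
except KeyError:
    print("KeyError: flag not set")
```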