This was an interesting puzzle: creating one single, well-formed JSON document from a hierarchy of web pages. For example, the sporting goods hierarchy of an e-commerce site could be Categories, Brands, Products. And so you’d like to output JSON like this:
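A minimal sketch of the target shape, built as a Python structure and printed as JSON. The key names and every category, brand, and product value are placeholders made up for illustration, not data from a real site:

```python
import json

# Placeholder data: one nested object mirroring Categories > Brands > Products.
sports_inventory = {
    "categories": [
        {
            "name": "Category A",
            "brands": [
                {
                    "name": "Brand A1",
                    "products": [
                        {"name": "Product 1"},
                        {"name": "Product 2"},
                    ],
                },
            ],
        },
        {"name": "Category B", "brands": []},
    ]
}

# The whole hierarchy serializes as a single JSON object.
print(json.dumps(sports_inventory, indent=2))
```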
As an aside, I like to architect my systems so that the first-level scrapers generate this kind of output: they simply mirror the source’s structure. Because the result is clean, well-formed JSON, the importing code that follows can be simple.
First, I give my Spider subclass a couple of instance variables:
Next, my top-level parse method returns its data by creating an Item and adding it directly into the structure. Finally, it yields a Request for the next page, passing the new Item along to that page’s parser:
In the parse method for the “next level down”, I do the same thing, except that now I save the newly created Item in the passed-in Category:
That finishes up the scraping code. At this point, the spider would run, but produce no output. That’s because we’re not transmitting the root object, self.SportsInventory, back to Scrapy. So we need to hand it over, but only after all the spiders have finished. Here’s that code, also in the Spider subclass:
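As you can see, this code uses a hack: it schedules a scrape just so that it can return data, and it ignores the actual scrape results. Is there a more direct way to schedule a data return?
Finally, to get a proper JSON instance (with a hash at the top level), use the JSON Lines Feed Exporter. The regular JSON exporter wraps its output in an array, whereas JSON Lines writes each returned item as its own object on its own line, so a single returned item produces a single top-level hash.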
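A settings fragment along these lines should select that exporter, assuming Scrapy 2.1 or later (where the FEEDS setting is available); the output filename is a placeholder:

```python
# settings.py, or the spider's custom_settings dict.
# "jsonlines" selects Scrapy's JSON Lines feed exporter.
FEEDS = {
    "sports_inventory.jl": {"format": "jsonlines"},
}
```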
For a complete working example, see my Oregon Administrative Rule spider.