How to produce a JSON tree with nested data from Scrapy

This was an interesting puzzle: creating a single, well-formed JSON document from a hierarchy of web pages. For example, the sporting goods hierarchy of an e-commerce site could be Categories, then Brands, then Products, and you’d like to output JSON like this:

[Etc.]
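To make the target shape concrete, here is a small Python sketch that builds a nested structure of that kind and prints it as JSON. The key names (`categories`, `brands`, `products`) and all sample values are placeholders assumed for illustration; they are not taken from any real site or from the original listing.

```python
import json

# Hypothetical sample data: one category containing one brand containing one product.
sports_inventory = {
    "categories": [
        {
            "name": "Cycling",
            "brands": [
                {
                    "name": "Acme",
                    "products": [{"name": "Road bike"}],
                },
            ],
        },
    ],
}

# A single top-level object with the whole hierarchy nested inside it.
print(json.dumps(sports_inventory, indent=2))
```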

As an aside, I like architecting my systems to generate this kind of output from my first-level scrapers: they simply mirror the source’s structure. Because the data is now in clean, well-formed JSON, the importing code that follows can be simple.

My recipe

First, I give my Spider subclass a couple of instance variables:
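The following is only a sketch of what such instance variables might look like. The class name `SportsSpider`, the spider name, and the `root_sent` flag are hypothetical; `self.SportsInventory` is the one name that appears in this post. A plain dict stands in for an Item, and the import guard is there only so the snippet also runs where Scrapy is not installed.

```python
try:
    import scrapy
    Spider = scrapy.Spider
except ImportError:  # fallback so the sketch still runs without Scrapy installed
    Spider = object


class SportsSpider(Spider):
    name = "sports"  # hypothetical spider name

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Root of the tree; every parser attaches its data somewhere below it.
        self.SportsInventory = {"categories": []}
        # Assumed flag: records whether the root has already been handed to Scrapy.
        self.root_sent = False
```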

Next, my top-level parse method records its data by creating an Item and adding it directly into the structure. Finally, it yields a Request for the next page, passing the new Item along to that page’s parser:
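A minimal sketch of that top-level step, under several assumptions: the start URL and the `a.category` selector are invented, plain dicts stand in for Items, and the parent object travels to the next callback through `cb_kwargs` (available in Scrapy 1.7 and later; older code would typically use `meta`). The import guard only keeps the snippet runnable without Scrapy.

```python
try:
    import scrapy
    Spider = scrapy.Spider
except ImportError:  # fallback so the sketch still runs without Scrapy installed
    Spider = object


class SportsSpider(Spider):
    name = "sports"  # hypothetical
    start_urls = ["https://example.com/sporting-goods"]  # placeholder URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.SportsInventory = {"categories": []}

    def parse(self, response):
        for link in response.css("a.category"):  # hypothetical selector
            # Create the node and attach it to the tree right away.
            category = {"name": link.css("::text").get(), "brands": []}
            self.SportsInventory["categories"].append(category)
            # Hand the same object to the next level's parser so it can fill it in.
            yield response.follow(
                link.attrib["href"],
                callback=self.parse_category,
                cb_kwargs={"category": category},
            )

    def parse_category(self, response, category):
        # The next level down would attach its own data to `category` here.
        pass
```

Note that nothing is yielded as an item at this point; the parser only yields requests, and the data lives in the tree.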

In the parse method for the “next level down”, I do the same thing, except that now I save the newly created Item in the passed-in Category:
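Sketched under the same kind of assumptions (invented selectors `a.brand` and `li.product`, dicts in place of Items, the parent passed in through `cb_kwargs`), the lower levels might look like this. The method names `parse_category` and `parse_brand` are hypothetical, and the import guard only keeps the snippet runnable without Scrapy.

```python
try:
    import scrapy
    Spider = scrapy.Spider
except ImportError:  # fallback so the sketch still runs without Scrapy installed
    Spider = object


class SportsSpider(Spider):
    name = "sports"  # hypothetical

    def parse_category(self, response, category):
        for link in response.css("a.brand"):  # hypothetical selector
            # Same pattern as the top level, but attach to the passed-in parent.
            brand = {"name": link.css("::text").get(), "products": []}
            category["brands"].append(brand)
            yield response.follow(
                link.attrib["href"],
                callback=self.parse_brand,
                cb_kwargs={"brand": brand},
            )

    def parse_brand(self, response, brand):
        # Leaf level: nothing further to follow, so just fill in the parent.
        for row in response.css("li.product"):  # hypothetical selector
            brand["products"].append({"name": row.css("::text").get()})
```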

That finishes up the scraping code. At this point the spider would run but produce no output, because nothing transmits the root object, self.SportsInventory, back to Scrapy. So we need to hand it over, but only after all the scraping has finished. Here’s that code, also in the Spider subclass:
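One way to do this, offered as a sketch rather than as the original listing, is to listen for Scrapy's `spider_idle` signal and, the first time it fires, schedule one more request whose callback simply yields the root. The handler name `on_idle`, the callback name `emit_root`, the `root_sent` flag, and the placeholder URL are assumptions. The single-argument `engine.crawl(request)` call assumes Scrapy 2.6 or later; earlier releases also expected the spider as a second argument.

```python
try:
    import scrapy
    from scrapy import signals
    Spider = scrapy.Spider
except ImportError:  # fallback so the sketch still runs without Scrapy installed
    Spider = object


class SportsSpider(Spider):
    name = "sports"  # hypothetical
    start_urls = ["https://example.com/sporting-goods"]  # placeholder URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.SportsInventory = {"categories": []}
        self.root_sent = False

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Call on_idle once the request queue has drained.
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def on_idle(self, spider):
        if self.root_sent:
            return  # second idle: let the spider close normally
        self.root_sent = True
        # Throwaway request: the page itself is irrelevant, it only gives us
        # a callback through which an item can be returned.
        request = scrapy.Request(
            self.start_urls[0], callback=self.emit_root, dont_filter=True
        )
        self.crawler.engine.crawl(request)

    def emit_root(self, response):
        # Ignore the response and emit the finished tree as the only item.
        yield self.SportsInventory
```

`dont_filter=True` matters here because the URL has normally been visited already, and the duplicate filter would otherwise drop the request.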

Reliable, but depends on a hack.

As you can see, this code uses a hack: it schedules a scrape just so that it can return data, and it ignores the actual scrape results. Is there a more direct way to schedule a data return?

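Finally, to get a proper JSON instance (with a hash at the top level), use the JSON Lines Feed Exporter. Because the spider emits only one item, its output is a single JSON object, whereas the regular JSON exporter would wrap that item in an array.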
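The difference between the two exporters can be illustrated without running a crawl. The sketch below imitates the shape of each output with the standard `json` module; the `FEEDS` dict shows how the format might be selected in a project's settings (the `FEEDS` setting assumes Scrapy 2.1 or later, and the output file name is hypothetical).

```python
import json

# Hypothetical settings fragment selecting the JSON Lines exporter.
FEEDS = {
    "sports_inventory.json": {"format": "jsonlines", "encoding": "utf8"},
}

# The spider emits exactly one item: the root of the tree.
root = {"categories": [{"name": "Cycling", "brands": []}]}
items = [root]

# Shape of the "json" format: all items wrapped in one top-level array.
json_style = json.dumps(items)

# Shape of the "jsonlines" format: one object per line, so a single item
# yields a file whose entire content is one JSON object.
jsonlines_style = "\n".join(json.dumps(item) for item in items)

print(json_style)
print(jsonlines_style)
```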

For a complete working example, see my Oregon Administrative Rule spider.
