Web Scraping of Registry of Open Data on AWS

SUMMARY: The purpose of this project is to collect the name, detail URL, tags, and description of each dataset listed in the Registry of Open Data on AWS. The web scraping code is written in Python 3 and uses the Scrapy framework, maintained by Scrapinghub (now Zyte).

INTRODUCTION: The Registry of Open Data exists to help people discover and share datasets that are available via AWS resources. The registry's index page lists the datasets, and the spider below reads one record per dataset from that page.

Starting URL: https://registry.opendata.aws/

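import scrapy


class ListdatasetsSpider(scrapy.Spider):
    """Collect the name, detail URL, tags, and description of each dataset."""
    name = 'listdatasets'
    start_urls = ['https://registry.opendata.aws/']

    def parse(self, response):
        # The selector below treats every div.dataset block as one dataset entry.
        for dataset in response.css('div.dataset'):
            href = dataset.css('h3 > a::attr(href)').get()
            paragraphs = dataset.css('p')
            yield {
                'dataset_name': dataset.css('h3 > a::text').get(),
                # Resolve relative links against the page URL; None if no link was found.
                'detail_url': response.urljoin(href) if href else None,
                'tags': dataset.css('p > span::text').getall(),
                # The second <p> is taken as the description, kept as raw HTML.
                # Guard against entries with fewer than two paragraphs.
                'description': paragraphs[1].get() if len(paragraphs) > 1 else None,
            }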
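
The spider stores the description as a raw HTML fragment and takes the tag strings exactly as they appear on the page. If plain text is preferred in the JSON output, one option is a small item pipeline. The sketch below is only an illustration under stated assumptions: the names CleanDatasetPipeline and html_to_text are hypothetical and not part of the original project, and the sample item is made up. A Scrapy item pipeline is a plain Python class with a process_item method, so the sketch runs without importing Scrapy.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    if fragment is None:
        return None
    extractor = _TextExtractor()
    extractor.feed(fragment)
    extractor.close()
    return ' '.join(''.join(extractor.parts).split())


class CleanDatasetPipeline:
    """Hypothetical pipeline; Scrapy calls process_item for each yielded item."""

    def process_item(self, item, spider):
        item['description'] = html_to_text(item.get('description'))
        # Trim each tag and drop empty or whitespace-only strings.
        item['tags'] = [t.strip() for t in item.get('tags', []) if t.strip()]
        return item


# Made-up item for illustration only; real values come from the spider.
sample = {
    'dataset_name': 'Example Dataset',
    'detail_url': 'https://registry.opendata.aws/example/',
    'tags': [' climate ', 'weather', ''],
    'description': '<p>Hourly   observations &amp; forecasts.</p>',
}
cleaned = CleanDatasetPipeline().process_item(sample, spider=None)
print(cleaned['description'])
print(cleaned['tags'])
```

To use a pipeline like this in a Scrapy project, it would be registered in the ITEM_PIPELINES setting under the project's own module path, which is not shown here because the project layout is not given.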

The source code and JSON output can be found here on GitHub.