Web Scraping of Daines Analytics Blog Entries using Python Take 1

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping code was written in Python 3 and leveraged the Scrapy framework maintained by Scrapinghub.

INTRODUCTION: Daines Analytics hosts its blog at dainesanalytics.blog. The purpose of this exercise is to practice web scraping using Scrapy by gathering the blog entries from Daines Analytics’ RSS feed. This iteration of the script captures only the ten most recent blog entries. A future iteration of the script would automatically traverse the RSS feed to capture all blog entries, not just the first ten; a rough sketch of one possible approach appears at the end of this post.

Starting URLs: https://dainesanalytics.blog/feed/

import scrapy

class DainesBlogRSSSpider(scrapy.Spider):
    name = 'dainesblogrss'
    # allowed_domains should list domains only, without URL paths
    allowed_domains = ['dainesanalytics.blog']
    start_urls = ['https://dainesanalytics.blog/feed/']

    # Setting up for the JSON output file (FEED_FORMAT ensures a JSON array
    # rather than Scrapy's default JSON lines output)
    custom_settings = {
        'FEED_URI': 'dainesblogrss.json',
        'FEED_FORMAT': 'json'
    }

    def parse(self, response):
        self.log('I just visited: ' + response.url)

        # Remove the XML namespaces
        response.selector.remove_namespaces()

        # Extract article information
        titles = response.xpath('//item/title/text()').extract()
        authors = response.xpath('//item/creator/text()').extract()
        dates = response.xpath('//item/pubDate/text()').extract()
        links = response.xpath('//item/link/text()').extract()
        descriptions = response.xpath('//item/description/text()').extract()

        for item in zip(titles, authors, dates, links, descriptions):
            scraped_info = {
                'Title' : item[0],
                'Author' : item[1],
                'Publish_Date' : item[2],
                'Link' : item[3],
                'Description' : item[4]
            }
            yield scraped_info
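
To try the spider outside of a full Scrapy project, one option is to drive it with Scrapy's CrawlerProcess from the same script that defines the class. This is a minimal sketch; the spider's custom_settings still control the JSON feed output.

from scrapy.crawler import CrawlerProcess

# Run the spider in-process; output goes to dainesblogrss.json per custom_settings
process = CrawlerProcess()
process.crawl(DainesBlogRSSSpider)
process.start()  # blocks until the crawl finishes

Alternatively, if the spider lives in a standalone file, Scrapy's runspider command can execute it directly (for example, scrapy runspider dainesblogrss.py, where the filename is whatever the script was saved as).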

The source code and JSON output can be found here on GitHub.
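
Looking ahead to the future iteration mentioned in the introduction, one way to capture every entry would be to keep requesting successive pages of the feed until a page comes back empty. The sketch below is only illustrative: the spider name is hypothetical, and it assumes the feed follows the common WordPress convention of a ?paged=N query parameter, which this exercise has not verified against the site.

import scrapy

class DainesBlogFullRSSSpider(scrapy.Spider):
    # Hypothetical follow-up spider for traversing the entire feed
    name = 'dainesblogrss_all'
    allowed_domains = ['dainesanalytics.blog']
    start_urls = ['https://dainesanalytics.blog/feed/']

    custom_settings = {
        'FEED_URI': 'dainesblogrss_all.json',
        'FEED_FORMAT': 'json'
    }

    def parse(self, response):
        # Remove the XML namespaces before querying the feed
        response.selector.remove_namespaces()

        items = response.xpath('//item')
        if not items:
            return  # an empty page means the feed has been exhausted

        for item in items:
            yield {
                'Title': item.xpath('title/text()').extract_first(),
                'Author': item.xpath('creator/text()').extract_first(),
                'Publish_Date': item.xpath('pubDate/text()').extract_first(),
                'Link': item.xpath('link/text()').extract_first(),
                'Description': item.xpath('description/text()').extract_first()
            }

        # Assumption: older entries live at /feed/?paged=2, /feed/?paged=3, ...
        next_page = response.meta.get('page', 1) + 1
        yield scrapy.Request(
            'https://dainesanalytics.blog/feed/?paged={}'.format(next_page),
            callback=self.parse,
            meta={'page': next_page}
        )

If an out-of-range page returns a 404 instead of an empty feed, Scrapy's default error handling drops that response and the crawl simply ends there.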