Web-Crawler (Python) for getting financial data

**Note:** I have moved this blog to www.tanmaydutta.com

1. Pick a website: I am going to scrape http://finance.yahoo.com/q?s=DAIMX. This is a random choice of an Indian equity, and I am going to study a couple more equities later.
2. Create a folder to start your project:

 scrapy startproject yahooFinance 

This should create the following kind of structure
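At the time of writing, `scrapy startproject` generates a layout along these lines (exact files may vary with your Scrapy version):

```
yahooFinance/
    scrapy.cfg
    yahooFinance/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```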
We will leave this structure as it is for now.
Next, we need to define some items to scrape from Yahoo Finance in items.py (which is located at yahooFinance –> yahooFinance –> items.py); we will fill it in below.
First, though, here is a basic spider skeleton (this defines the spider's name and allowed domains):

#spider to crawl yahoofinance DMS India

from scrapy.spider import BaseSpider

class DMSSpider(BaseSpider):
    name = "DMSE"
    allowed_domains = ["finance.yahoo.com"]
    start_urls = ["http://finance.yahoo.com/q?s=DAIMX"]

    def parse(self, response):
        filename = response.url.split("/")[-2]
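As an aside, the filename expression in the skeleton splits the URL on "/"; for our start URL it picks out the host name:

```python
# What response.url.split("/")[-2] evaluates to for our start URL
url = "http://finance.yahoo.com/q?s=DAIMX"
parts = url.split("/")   # ['http:', '', 'finance.yahoo.com', 'q?s=DAIMX']
print(parts[-2])         # finance.yahoo.com
```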

Next, go to the spiders directory and create a new spider which will crawl the website:

#spider to crawl yahoofinance DMS India
#written by Tanmay
#Does not do anything fancy.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from yahooFinance.items import TickItem

class DMSSpider(BaseSpider):
    name = "DMS"
    allowed_domains = ["finance.yahoo.com"]
    start_urls = ["http://finance.yahoo.com/q?s=DAIMX"]

    def parse(self, response):
        print 'in parse'
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//span[contains(@id,"l10")]')
        items = []
        for site in sites:
            item = TickItem()
            item['name'] = [u'DMSTick']
            item['value'] = site.select('text()').extract()
            items.append(item)
        return items
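To see what that XPath is doing, here is a rough, self-contained sketch (standard library only, no Scrapy needed; the sample HTML below is made up) of what `//span[contains(@id,"l10")]` matches — every span whose id attribute contains "l10":

```python
from html.parser import HTMLParser

class QuoteSpanParser(HTMLParser):
    """Collects the text of every <span> whose id contains 'l10'."""
    def __init__(self):
        super().__init__()
        self.in_match = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "l10" in attrs.get("id", ""):
            self.in_match = True

    def handle_data(self, data):
        if self.in_match:
            self.values.append(data)

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_match = False

parser = QuoteSpanParser()
# Hypothetical sample HTML resembling the quote span on the page
parser.feed('<span id="yfs_l10_daimx">8.72</span><span id="other">x</span>')
print(parser.values)  # ['8.72']
```

This is only an illustration of the selection logic; in the spider itself, HtmlXPathSelector handles the parsing.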

Go to main project directory and start crawler by typing

scrapy crawl DMS

Two things (that I can think of right now) can cause errors at this point:

#1. "Unknown command: crawl. Use 'scrapy' to see available commands. More commands are available in project mode." This means you are not in the correct directory.

You have to be in the directory where your scrapy.cfg file is.

#2. No spider named "xyz" found.

Make sure that the spider file you just created has its "name" attribute set to exactly xyz (or whatever name you actually wanted).


OK, so if everything has run perfectly up to now, the basic functionality is complete.


Next, we will start beefing up items.py to extract real data:

# Define here the models for your scraped items
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class TickItem(Item):
    # define the fields for your item here like:
    name = Field()
    value = Field()

Run it and save the results as JSON (DMSTick.json) by:

scrapy crawl DMS -o DMSTick.json -t json

The final JSON that this code produces is:

[{"name": ["demo"], "value": ["8.72"]},
{"name": ["demo"], "value": ["8.72"]}]
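Once saved, the JSON can be loaded back in Python for further analysis. A minimal sketch, assuming the file contents have the shape shown above (the string here stands in for reading DMSTick.json from disk):

```python
import json

# Hypothetical file contents matching the exporter's output shape
raw = '[{"name": ["demo"], "value": ["8.72"]}, {"name": ["demo"], "value": ["8.72"]}]'
items = json.loads(raw)

# Each item's "value" is a one-element list of strings; convert to floats
prices = [float(it["value"][0]) for it in items]
print(prices)  # [8.72, 8.72]
```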