Home > finance, python > Extract Yahoo stock list

Extract Yahoo stock list

Use Case

In this post I explain how I managed to download the stock list of Yahoo.

Problem: I wanted to have a list of Yahoo stocks. More precisely, I only needed the stock IDs and the corresponding company names, e.g. “MSFT => Microsoft Corporation”. I was looking for such a list, but I didn’t find anything useful. So after a few hours I said to myself: “I will have to solve this problem by myself :(“.

Solution

Fortunately, Yahoo has a page that lists the industries. So the problem is actually extracting data from a bunch of HTML pages. Let’s see the steps that lead to the solution.

Step 1: Check out the page http://biz.yahoo.com/p/. It lists the industries by categories, which is not bad, but there is a better page: http://biz.yahoo.com/p/sum_conameu.html. Here, on one page we can find all the industries. On the left side, in the first column, we have every industry. We need to follow each industry, and extract data from the direct child pages.

If we visit an industry page, say Accident & Health Insurance, then we can see a “Download Spreadsheet” link on the right side. However, the generated CSV file doesn’t contain the stock symbols, just the company names. If we need the stock symbols too, we will have to dive in the HTML source…

Step 2: Go to http://biz.yahoo.com/p/sum_conameu.html and extract the URL addresses of the industry pages. They are in the first column of the table. By checking the source, these links have a common form: NNNconameu.html, where N is a digit. For this job, we’ll use a Python script with the BeautifulSoup library.

First, save the HTML file:

wget http://biz.yahoo.com/p/sum_conameu.html

Then, use this script:

#!/usr/bin/env python

# A-get-list.py

import re
import sys
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

html = 'sum_conameu.html'
base = 'http://biz.yahoo.com/p'

f1 = open(html,  'r')
text = f1.read()
f1.close()

soup = BeautifulSoup(text)

for a in soup.findAll('a'):
    if a.has_key('href'):
        if re.search(r'\d{3}conameu\.html', a['href']):
            print base + '/' + a['href']

Redirect the output of the script to a file called A-list.txt:

./A-get-list.py >A-list.txt

Step 3: Using the file A-list.txt, process each HTML file. Extract the necessary data (company names and stock symbols) from them.

#!/usr/bin/env python

# B-process.py

import re
import sys
from urllib import urlopen
import unicodedata
from BeautifulSoup import BeautifulSoup

pages = []
infile = 'A-list.txt'
cnt = 0
exceptions = 0

f1 = open(infile,  'r')
pages = map(str.strip, f1.readlines())
f1.close()

def process(page):
    text = urlopen(page).read()
    soup = BeautifulSoup(text)

    global cnt, exceptions
    cnt += 1

    sys.stderr.write( "%d of %d\n" % (cnt, len(pages)) )

    for tr in soup.findAll('tr'):
        if tr.td:
            if tr.td.font:
                font = tr.td.font
                inner = font.findAll('a')
                if len(inner) == 2:
                    name = inner[0].contents[0].replace("\n", " ")
                    symbol = inner[1].contents[0]
                    try:
                        sys.stdout.write( "%s; %s\n" % (symbol, name) )
                    except UnicodeEncodeError:
                        sys.stderr.write( "Exception! Page %s, symbol %s.\n" % (page, symbol) )
                        exceptions += 1

for p in pages:
    process(p)

sys.stderr.write("# of exceptions: %d\n" % exceptions)

Some company names contain some bizarre Unicode characters, I just ignore them. Here we’ll get almost 20,000 results and I had 72 exceptions only. I tried to convert Unicode to ASCII, but the result was not so good, so I decided to simply drop these lines.

Redirect the result to a temporary file:

./B-process.py >a

Step 4: Cleaning. The output file contains some duplicates that must be removed.

sort a | uniq >yahoo.csv

Step 5: Pushing the data into MySQL. I created the following simple table to store these data:

CREATE TABLE IF NOT EXISTS `stock_symbol_name` (
  `symbol` varchar(15) NOT NULL,
  `name` varchar(80) NOT NULL,
  PRIMARY KEY (`symbol`),
  KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Use this script to populate the database:

#!/usr/bin/env python

# sudo apt-get install python-mysqldb
import MySQLdb
import sys
import os

# reopen stdout file descriptor with write mode
# and 0 as the buffer size (unbuffered)
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)

f1 = open('yahoo.csv',  'r')

conn = MySQLdb.connect(host = "localhost",
                       user = "user",
                       passwd = "passwd",
                       db = "db")
cursor = conn.cursor()

query = "DELETE FROM stock_symbol_name"
cursor.execute(query)
conn.commit()

for line in f1:
    line = line.rstrip("\n")
    pieces = map(str.strip, line.split(';'))

    #print "'%s' => '%s'" % (pieces[0], pieces[1])

    query = "INSERT INTO stock_symbol_name (symbol, name) VALUES (\"%s\", \"%s\")" % (pieces[0], pieces[1])
    #print query
    cursor.execute(query)
    #row = cursor.fetchone ()
    #print "row: ", row
    sys.stdout.write( '.' )

print

f1.close()

conn.commit()
cursor.close ()
conn.close ()

It’s very likely that you’ll have to execute this script in an iterative way. In the input file yahoo.csv there are still some duplicates: there are lines where the symbol is the same but the company names are different… Use the http://finance.yahoo.com/ site to verify these stock symbols and to figure out which company name is the real one. As the symbol is the primary key of our database table, we will get an exception in these cases. So correct the input file and re-execute the script. I had about 10 to 15 problematic cases.

At the end, you will have a table with Yahoo stock IDs and their corresponding company names. As of December 17, 2010, I could extract 18,933 elements.

Download

For the lazy pigs, here is a download link: yahoo.csv.zip.

If it was useful, leave a comment!!!

Categories: finance, python Tags: , , ,
  1. Wei
    May 9, 2011 at 20:02

    Thanks for the examples. It helped me a lot. However i have one problem with yahoo pages. Those “NNNconameu.html” pages are not always populated. Some times, they are populated with Sector/Industry and all the companies in there. Other times, I only got Sector/Industry and no Company portion at all. Initially i thought it was due to the weekend and server updating their pages. However it happens to me during the working days also. Any idea?

    • May 9, 2011 at 20:07

      Sorry, I don’t know… I just grabbed all that data once, pushed them in a database and then worked with the downloaded data. I only consulted these Yahoo pages once.

  2. Pravin
    December 27, 2012 at 19:04

    Complete list of yahoo symbols/tickers/stocks is available for download(excel format) at below website. http://www.myinvestorshub.com/yahoo_stock_list.php

  3. Joe Passman
    August 1, 2013 at 00:29

    Nice bit of info. I implemented mine a little differently. I am creating an “industry probe” to help with a job search. Basically, I input the industry(ies) I am interested in working for. Then I use the information you provided to pull down company symbols and names. I then wrote a script to analyze historical trends for each company in the industry(ies) I am interested in. Thank you for the help!

  4. December 9, 2013 at 03:12

    There’s a complete list of Yahoo stock ticker symbols at http://investexcel.net/all-yahoo-finance-stock-tickers/ in a single Excel spreadsheet (across all international exchanges)

  5. May 21, 2014 at 19:04

    Free historic intraday data can be found in http://www.stockchartfree.com , but i think they provide only for 100 days. They provide 30 technical indicators as well for intraday data. They also provide live and EOD data with technical analysis

  1. No trackbacks yet.

Leave a comment