Extract Yahoo stock list
Use Case
In this post I explain how I managed to download the stock list of Yahoo.
Problem: I wanted to have a list of Yahoo stocks. More precisely, I only needed the stock IDs and the corresponding company names, e.g. “MSFT => Microsoft Corporation”. I was looking for such a list, but I didn’t find anything useful. So after a few hours I said to myself: “I will have to solve this problem by myself :(“.
Solution
Fortunately, Yahoo has a page that lists the industries. So the problem is actually extracting data from a bunch of HTML pages. Let’s see the steps that lead to the solution.
Step 1: Check out the page http://biz.yahoo.com/p/. It lists the industries by categories, which is not bad, but there is a better page: http://biz.yahoo.com/p/sum_conameu.html. Here, on one page we can find all the industries. On the left side, in the first column, we have every industry. We need to follow each industry, and extract data from the direct child pages.
If we visit an industry page, say Accident & Health Insurance, then we can see a “Download Spreadsheet” link on the right side. However, the generated CSV file doesn’t contain the stock symbols, just the company names. If we need the stock symbols too, we will have to dive in the HTML source…
Step 2: Go to http://biz.yahoo.com/p/sum_conameu.html and extract the URL addresses of the industry pages. They are in the first column of the table. By checking the source, these links have a common form: NNNconameu.html
, where N
is a digit. For this job, we’ll use a Python script with the BeautifulSoup library.
First, save the HTML file:
wget http://biz.yahoo.com/p/sum_conameu.html
Then, use this script:
#!/usr/bin/env python # A-get-list.py import re import sys from urllib import urlopen from BeautifulSoup import BeautifulSoup html = 'sum_conameu.html' base = 'http://biz.yahoo.com/p' f1 = open(html, 'r') text = f1.read() f1.close() soup = BeautifulSoup(text) for a in soup.findAll('a'): if a.has_key('href'): if re.search(r'\d{3}conameu\.html', a['href']): print base + '/' + a['href']
Redirect the output of the script to a file called A-list.txt
:
./A-get-list.py >A-list.txt
Step 3: Using the file A-list.txt
, process each HTML file. Extract the necessary data (company names and stock symbols) from them.
#!/usr/bin/env python # B-process.py import re import sys from urllib import urlopen import unicodedata from BeautifulSoup import BeautifulSoup pages = [] infile = 'A-list.txt' cnt = 0 exceptions = 0 f1 = open(infile, 'r') pages = map(str.strip, f1.readlines()) f1.close() def process(page): text = urlopen(page).read() soup = BeautifulSoup(text) global cnt, exceptions cnt += 1 sys.stderr.write( "%d of %d\n" % (cnt, len(pages)) ) for tr in soup.findAll('tr'): if tr.td: if tr.td.font: font = tr.td.font inner = font.findAll('a') if len(inner) == 2: name = inner[0].contents[0].replace("\n", " ") symbol = inner[1].contents[0] try: sys.stdout.write( "%s; %s\n" % (symbol, name) ) except UnicodeEncodeError: sys.stderr.write( "Exception! Page %s, symbol %s.\n" % (page, symbol) ) exceptions += 1 for p in pages: process(p) sys.stderr.write("# of exceptions: %d\n" % exceptions)
Some company names contain some bizarre Unicode characters, I just ignore them. Here we’ll get almost 20,000
results and I had 72
exceptions only. I tried to convert Unicode to ASCII, but the result was not so good, so I decided to simply drop these lines.
Redirect the result to a temporary file:
./B-process.py >a
Step 4: Cleaning. The output file contains some duplicates that must be removed.
sort a | uniq >yahoo.csv
Step 5: Pushing the data into MySQL. I created the following simple table to store these data:
CREATE TABLE IF NOT EXISTS `stock_symbol_name` ( `symbol` varchar(15) NOT NULL, `name` varchar(80) NOT NULL, PRIMARY KEY (`symbol`), KEY `name` (`name`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Use this script to populate the database:
#!/usr/bin/env python # sudo apt-get install python-mysqldb import MySQLdb import sys import os # reopen stdout file descriptor with write mode # and 0 as the buffer size (unbuffered) sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) f1 = open('yahoo.csv', 'r') conn = MySQLdb.connect(host = "localhost", user = "user", passwd = "passwd", db = "db") cursor = conn.cursor() query = "DELETE FROM stock_symbol_name" cursor.execute(query) conn.commit() for line in f1: line = line.rstrip("\n") pieces = map(str.strip, line.split(';')) #print "'%s' => '%s'" % (pieces[0], pieces[1]) query = "INSERT INTO stock_symbol_name (symbol, name) VALUES (\"%s\", \"%s\")" % (pieces[0], pieces[1]) #print query cursor.execute(query) #row = cursor.fetchone () #print "row: ", row sys.stdout.write( '.' ) print f1.close() conn.commit() cursor.close () conn.close ()
It’s very likely that you’ll have to execute this script in an iterative way. In the input file yahoo.csv
there are still some duplicates: there are lines where the symbol is the same but the company names are different… Use the http://finance.yahoo.com/ site to verify these stock symbols and to figure out which company name is the real one. As the symbol is the primary key of our database table, we will get an exception in these cases. So correct the input file and re-execute the script. I had about 10 to 15 problematic cases.
At the end, you will have a table with Yahoo stock IDs and their corresponding company names. As of December 17, 2010, I could extract 18,933
elements.
Download
For the lazy pigs, here is a download link: yahoo.csv.zip.
If it was useful, leave a comment!!!
Thanks for the examples. It helped me a lot. However i have one problem with yahoo pages. Those “NNNconameu.html” pages are not always populated. Some times, they are populated with Sector/Industry and all the companies in there. Other times, I only got Sector/Industry and no Company portion at all. Initially i thought it was due to the weekend and server updating their pages. However it happens to me during the working days also. Any idea?
Sorry, I don’t know… I just grabbed all that data once, pushed them in a database and then worked with the downloaded data. I only consulted these Yahoo pages once.
Complete list of yahoo symbols/tickers/stocks is available for download(excel format) at below website. http://www.myinvestorshub.com/yahoo_stock_list.php
Nice bit of info. I implemented mine a little differently. I am creating an “industry probe” to help with a job search. Basically, I input the industry(ies) I am interested in working for. Then I use the information you provided to pull down company symbols and names. I then wrote a script to analyze historical trends for each company in the industry(ies) I am interested in. Thank you for the help!
There’s a complete list of Yahoo stock ticker symbols at http://investexcel.net/all-yahoo-finance-stock-tickers/ in a single Excel spreadsheet (across all international exchanges)
Free historic intraday data can be found in http://www.stockchartfree.com , but i think they provide only for 100 days. They provide 30 technical indicators as well for intraday data. They also provide live and EOD data with technical analysis