Archive for November, 2011

Raphaël — JavaScript Library

November 9, 2011

“Raphaël is a small JavaScript library that should simplify your work with vector graphics on the web. If you want to create your own specific chart or image crop and rotate widget, for example, you can achieve it simply and easily with this library.” (source)

See for instance the color picker demo.

Scraping AJAX web pages (Part 3)

November 8, 2011 3 comments

Don’t forget to check out the rest of the series too!

In Part 2 we saw how to download an Ajax-powered webpage. However, there was a problem with that approach: sometimes it terminated too early and fetched only part of the page. The problem with Ajax is that we cannot tell for sure when a page is completely downloaded.

So the solution is to build a waiting mechanism into the script. That is, we need the following: “open a given page, wait X seconds, then get the HTML source”. Hopefully all Ajax calls will have finished within X seconds. You decide how many seconds to wait. Alternatively, you can analyze the partially downloaded HTML and, if something is missing, wait some more.
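The “analyze and wait some more” idea can be sketched as a small polling helper. This is only a minimal sketch in Python 3, not tied to any particular browser library: the names wait_for_marker, get_html, timeout and interval are my own assumptions, and get_html stands for any function that returns the current HTML source. Unlike a bare loop, it gives up after a timeout, so the script cannot hang forever if the expected text never arrives.

```python
import time


def wait_for_marker(get_html, marker, timeout=60.0, interval=1.0):
    """Poll get_html() until `marker` shows up in the returned source.

    Returns the HTML that contains the marker, or None if `timeout`
    seconds pass without the marker appearing.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


if __name__ == "__main__":
    # Fake page that "grows" on every call, imitating Ajax calls finishing.
    snapshots = [
        "<html><body></body></html>",
        "<html><body>LOCUS</body></html>",
        "<html><body>LOCUS ORIGIN</body></html>",
    ]
    state = {"calls": 0}

    def fake_get_html():
        index = min(state["calls"], len(snapshots) - 1)
        state["calls"] += 1
        return snapshots[index]

    result = wait_for_marker(fake_get_html, "ORIGIN", timeout=5, interval=0.01)
    print(result is not None, state["calls"])
```

With a real browser object, get_html could be something like a lambda that returns the browser's current HTML source.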

Here I will use Splinter for this task. It opens a browser window that you can control from Python. Thanks to the browser, it can interpret JavaScript. The only disadvantage is that the browser window is visible.

Example
Let’s see how to fetch the page CP002059.1. If you open it in a browser, you’ll see a status bar at the bottom that indicates the download progress. For me it takes about 20 seconds to fully get this page. By analyzing the content of the page, we can notice that the string “ORIGIN” appears just once, at the end of the page. So we’ll check its presence in a loop and wait until it arrives.

#!/usr/bin/env python

from time import sleep
from splinter.browser import Browser

url = 'http://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'

def main():
    browser = Browser()
    browser.visit(url)

    # variation A:
    while 'ORIGIN' not in browser.html:
        sleep(5)

    # variation B:
    # sleep(30)   # if you think everything arrives in 30 seconds

    f = open("/tmp/source.html", "w")   # save the source in a file
    print >>f, browser.html
    f.close()

    browser.quit()
    print '__END__'

#############################################################################

if __name__ == "__main__":
    main()

You might be tempted to check for the presence of ‘</html>’. However, don’t forget that the browser first downloads the plain source, which runs from ‘<html><body>…’ to ‘</body></html>’. Only then does it start to interpret the source; if it finds Ajax calls, it executes them, and they expand something in the body of the HTML. So ‘</html>’ is there right from the beginning.
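To make the difference concrete, here is a tiny Python 3 illustration with two made-up snapshots of a page (the strings and function names are invented for this example, they are not the real NCBI source). The closing tag is present in both snapshots, while a marker that only arrives via Ajax is present only in the second one.

```python
# Source as first delivered by the server: the closing tag is already there.
initial = "<html><body><div id='record'></div></body></html>"

# Source after the Ajax calls have filled in the body (made-up content).
expanded = "<html><body><div id='record'>LOCUS ... ORIGIN ...</div></body></html>"


def looks_done_by_closing_tag(html):
    """Naive check: true even before any Ajax content has arrived."""
    return "</html>" in html


def looks_done_by_marker(html, marker="ORIGIN"):
    """Check for text that is only present once the Ajax content is in."""
    return marker in html


print(looks_done_by_closing_tag(initial))   # True, although the page is incomplete
print(looks_done_by_marker(initial))        # False
print(looks_done_by_marker(expanded))       # True
```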

Future work
This is not bad but I’m still not fully satisfied. I’d like something like this but without any browser window. If you have a headless solution, let me know. I think it’s possible with PhantomJS and/or Zombie.js but I haven’t had time to investigate them yet.

Download login-protected pages with Python using Mechanize and Splinter (Part 3)

November 8, 2011 2 comments

In Part 2 of this series we saw how to fetch a cookie-protected page. The site that we used in the example (Project Euler) required you to log in to access the page we were interested in (statistics). First we extracted the cookies from Firefox’s cookies.sqlite and thus we could access the protected page. In the Python script no username/password had to be specified. However, it was necessary to log in via Firefox first to get the cookies.

The example site, Project Euler (PE), changed completely a few weeks ago and the method presented in Part 2 doesn’t work any more with this particular site :( So I had to find an alternative solution and this is how I got to Mechanize and Splinter.

Mechanize is a headless programmable browser, which means it doesn’t open any browser window but you can navigate it as if it were a real browser. It’s quite fast but it doesn’t handle JavaScript.

Splinter is similar to Mechanize but it opens a browser window. (In theory it also supports a headless browser, zope.testbrowser, but I couldn’t make it work.) Since the navigation is done in a browser window, it’s slower than Mechanize, but it handles JavaScript! (Note that zope.testbrowser doesn’t handle JavaScript.)

Let’s see a concrete example: fetch Project Euler’s countries page. This page requires a login.

Example #1: Mechanize
The test site doesn’t use Ajax calls, so Mechanize is the better choice here.

#!/usr/bin/env python

import mechanize
from jabbapylib.filesystem import fs   # you can ignore it

PE_LOGIN = 'http://projecteuler.net/login'
PE_COUNTRIES = 'http://projecteuler.net/countries'

USERNAME = fs.read_first_line('/home/jabba/secret/project_euler/username.txt')
PASSWORD = fs.read_first_line('/home/jabba/secret/project_euler/password.txt')

def main():
    browser = mechanize.Browser()
    browser.open(PE_LOGIN)

    browser.select_form(name="login_form")
    browser['username'] = USERNAME
    browser['password'] = PASSWORD

    res = browser.submit()
    #print res.get_data()

    res = browser.open(PE_COUNTRIES)
    print res.get_data()   # HTML source of the page

#############################################################################

if __name__ == "__main__":
    main()

This script requires your username and password on the site PE. For security reasons, I don’t like embedding such data in scripts, so I store them in files on a Truecrypt volume. To try this script, you can simply ignore the “jabbapylib” import and specify your username and password directly.
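If you want to keep the “credentials in a file” approach without jabbapylib, a stand-in for the helper could look like the sketch below. It is written for Python 3 and it is only my guess at what such a helper needs to do (return the first line of a file without the newline); it is not the actual jabbapylib code. The demo uses a throwaway temporary file instead of a real secret.

```python
import os
import tempfile


def read_first_line(path):
    """Return the first line of a text file without the trailing newline."""
    with open(path) as f:
        return f.readline().rstrip("\r\n")


if __name__ == "__main__":
    # Demo with a temporary file; in practice the path would point
    # to wherever you keep your username or password file.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("my_username\nthis second line is ignored\n")
    try:
        print(read_first_line(tmp.name))
    finally:
        os.remove(tmp.name)
```

In the scripts of this post you would then call read_first_line(...) wherever fs.read_first_line(...) appears.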

Example #2: Splinter
This example is here for the sake of completeness. Since no Ajax is used on the site PE, there is no real need to navigate a real browser window. That is, Mechanize would be a better (faster) choice here.

#!/usr/bin/env python

from splinter.browser import Browser
from jabbapylib.filesystem import fs

PE_LOGIN = 'http://projecteuler.net/login'
PE_COUNTRIES = 'http://projecteuler.net/countries'

USERNAME = fs.read_first_line('/home/jabba/secret/project_euler/username.txt')
PASSWORD = fs.read_first_line('/home/jabba/secret/project_euler/password.txt')

def main():
#    browser = Browser('chrome')
    browser = Browser()    # opens a Firefox instance
    browser.visit(PE_LOGIN)
    
    browser.fill('username', USERNAME)
    browser.fill('password', PASSWORD)
    button = browser.find_by_name('login')
    button.click()

    browser.visit(PE_COUNTRIES)
       
    f = open("/tmp/stat.html", "w")
    print >>f, browser.html    # HTML source of the page
    f.close()
    
    browser.quit()    # close the browser window

    print '__END__'

#############################################################################

if __name__ == "__main__":
    main()

Basically, it works just like Mechanize. You tell the browser which page to open, what fields to fill, where to click, etc. At the end we save the HTML source in a file.

Both Mechanize and Splinter handle the cookies after login, so we don’t have to bother with cookielib.

The future of video gaming

November 7, 2011

I hope I can try it one day :)

More info here (in Hungarian).

Resize .tif file and convert to .jpg

November 7, 2011

Use case
In the lab we have a photocopier that can scan too. Quite cool: you specify your email address and it sends you the scanned page in .tif format.

However, pages must be scanned one by one and each of them is sent as a separate .tif file. Each .tif file is around 2.8 MB with a resolution of 4900 x 7000 pixels. How can we resize them and convert them to .jpg files? Gimp is one way, but could we do it on the command line?

Solution
Put the .tif files in a folder and create a subfolder called “out”. This way the output won’t be mixed with the input.

for i in *.tif; do echo "$i"; convert "$i" -resize 24% "out/$(basename "$i" .tif).jpg"; done

Each .tif is made smaller (width around 1200 pixels) and converted to .jpg.

As a final touch, convert the JPGs to a PDF file.

cd out
convert *.jpg doc.pdf

Question
Does anyone know how to resize an image the following way: let the width be 1200 pixels and keep the aspect ratio? The 24% above was the result of a manual computation…

Answer: just use “convert -resize 1200 in.tif out.jpg”. The output will have width=1200 pixels with the same ratio as the input image. (Thanks Yves for the tip.)
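For the record, the manual computation behind the 24% can be reproduced in a few lines of Python 3. This is just a sketch of the arithmetic; the function names are made up, and the exact rounding ImageMagick applies to the height may differ by a pixel.

```python
def scale_percent(src_width, target_width):
    """Percentage to pass to -resize so that the width becomes target_width."""
    return 100.0 * target_width / src_width


def scaled_size(src_width, src_height, target_width):
    """Size after scaling to target_width while keeping the aspect ratio."""
    factor = target_width / src_width
    return target_width, round(src_height * factor)


# The scanner produces 4900 x 7000 pixel images.
print(round(scale_percent(4900, 1200), 1))    # 24.5
print(scaled_size(4900, 7000, 1200))          # (1200, 1714)

# What the rounded-down 24% used above actually gives:
print(4900 * 24 // 100, 7000 * 24 // 100)     # 1176 1680
```

So 24% yields a width of 1176 pixels, which is why I wrote “around 1200” above.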

D3: A JavaScript visualization library for HTML and SVG

November 7, 2011

“D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then apply data-driven transformations to the document. As a trivial example, you can use D3 to generate a basic HTML table from an array of numbers. Or, use the same data to create an interactive SVG bar chart with smooth transitions and interaction.” (source)

See also D3 on GitHub.

I haven’t used it yet; this is just a “good-to-know-about-it” note.

Categories: javascript

Linus doesn’t like C++

November 3, 2011

See his post here (Sept. 2007) :)

Reactions on reddit here.

Somewhere else I read: ‘ C++ means “increment C by one and use the original value” ‘ :)

Categories: Uncategorized

Hide “about the new look | send feedback” in Gmail

November 3, 2011 10 comments

Update #2: Check out this post too for an easier solution.

Update #1: This post is deprecated. That damned widget is not shown anymore.

Problem
I upgraded to the new look of Gmail, but since then I always get a notification in the bottom right corner with “about the new look | send feedback”. Closing it doesn’t help: after the next login it’s there again.

Solution
Install Adblock Plus and add the following filter:

mail.google.com##div[class="GcwpPb-MEmzyf GcwpPb-bEO5kc"]

Something else
Does anybody know how to list the current filters in the new look? I can’t find it anywhere.

Categories: google