Archive
Codecademy – learn HTML, CSS, Javascript
“Codecademy is the easiest way to learn how to code. It’s interactive, fun, and you can do it with your friends.“
YesScript: disable Javascript on a given site
Problem
Today I visited a site and it blew the following message in my face: “if you want to use this site, disable AdBlock”. What da… ?
Solution
Install YesScript and put the site on its blacklist. From then on, Javascript is disabled on that site.
Raphaël — JavaScript Library
“Raphaël is a small JavaScript library that should simplify your work with vector graphics on the web. If you want to create your own specific chart or image crop and rotate widget, for example, you can achieve it simply and easily with this library.” (source)
See for instance the color picker demo.
Scraping AJAX web pages (Part 3)
Don’t forget to check out the rest of the series too!
In Part 2 we saw how to download an Ajax-powered webpage. However, there was a problem with that approach: sometimes it terminated too early and fetched only part of the page. The problem with Ajax is that we cannot tell for sure when a page is completely downloaded.
So, the solution is to integrate some waiting mechanism in the script. That is, we need the following: “open a given page, wait X seconds, then get the HTML source”. Hopefully all Ajax calls will be finished in X seconds. You decide how many seconds to wait. Or, you can analyze the partially downloaded HTML and, if something is missing, wait some more.
Here I will use Splinter for this task. It opens a browser window that you can control from Python. Thanks to the browser, it can interpret Javascript. The only disadvantage is that the browser window is visible.
Example
Let’s see how to fetch the page CP002059.1. If you open it in a browser, you’ll see a status bar at the bottom that indicates the download progress. For me it takes about 20 seconds to fully get this page. Looking at the content of the page, we notice that the string “ORIGIN” appears just once, at the end of the page. So we’ll check for its presence in a loop and wait until it arrives.
#!/usr/bin/env python

from time import sleep
from splinter.browser import Browser

url = 'http://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'

def main():
    browser = Browser()
    browser.visit(url)

    # variation A:
    while 'ORIGIN' not in browser.html:
        sleep(5)

    # variation B:
    # sleep(30)   # if you think everything arrives in 30 seconds

    f = open("/tmp/source.html", "w")   # save the source in a file
    print >>f, browser.html
    f.close()

    browser.quit()
    print '__END__'

#############################################################################

if __name__ == "__main__":
    main()
You might be tempted to check for the presence of ‘</html>’. However, don’t forget that the browser first downloads the plain source, starting with ‘<html><body>…’ and ending with ‘</body></html>’. Only then does it interpret the source; any Ajax calls it finds are executed, and they insert more content into the body of the HTML. So ‘</html>’ is there right from the beginning.
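One weakness of variation A is that it loops forever if the marker never arrives. Here is a minimal sketch of a polling helper with a timeout. The name wait_for_marker and its parameters are made up for illustration, and it is written for Python 3. It takes any function that returns the current HTML, so with Splinter you would pass something like lambda: browser.html.

```python
import time


def wait_for_marker(get_html, marker, timeout=60.0, interval=5.0):
    """Poll get_html() until marker shows up in the returned HTML.

    Returns the HTML that contains the marker, or raises TimeoutError
    if the marker is still missing once the timeout has passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "%r did not appear within %s seconds" % (marker, timeout))
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for a browser: each call returns a bit more of the page.
    snapshots = iter(["<html>", "<html>LOCUS", "<html>LOCUS ORIGIN //"])
    print(wait_for_marker(lambda: next(snapshots), "ORIGIN", interval=0.1))
```

Variation B (a fixed sleep) stays simpler, but the timeout version fails loudly instead of hanging when a page changes its layout.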
Future work
This is not bad but I’m still not fully satisfied. I’d like something like this but without any browser window. If you have a headless solution, let me know. I think it’s possible with PhantomJS and/or Zombie.js but I haven’t had time to investigate them yet.
D3: A JavaScript visualization library for HTML and SVG
“D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then apply data-driven transformations to the document. As a trivial example, you can use D3 to generate a basic HTML table from an array of numbers. Or, use the same data to create an interactive SVG bar chart with smooth transitions and interaction.” (source)
See also D3 on GitHub.
I haven’t used it yet; this is just a “good-to-know-about-it” note.
Zombie.js, PhantomJS
This is not a real post, just a reminder for me. I should look at these projects in detail in the future.
“Zombie.js is a fast, headless full-stack testing using Node.js. Zombie.js is a lightweight framework for testing client-side JavaScript code in a simulated environment. No browser required. Here is a Python driver for it called python-zombie.“
“PhantomJS is a headless WebKit with JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG. PhantomJS is an optimal solution for headless testing of web-based applications, site scraping, pages capture, SVG renderer, PDF converter and many other use cases.“
Node.js
I’ve heard a lot about node.js recently, so I looked into it a bit. On the project’s homepage you can find a one-hour video presentation of the project.
Node.js is a set of libraries on top of Google’s V8 Javascript engine, which is used in Chrome. V8 is a very high performance virtual machine. Node.js uses the greatness of V8 to do networking things. The focus is on doing networking correctly.
Examples
Hello World:
setTimeout(function() {
    console.log('world');
}, 2000)
console.log('hello');
It registers the function to be executed after 2 seconds and then goes on. It prints ‘hello’, then ‘world’ 2 seconds later. One very important thing: in node.js there is no blocking (like sleep()). The program exits when there is nothing else left to do.
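For readers coming from Python, the same "schedule it and move on" idea can be sketched with asyncio (Python 3.7+). This is only a rough analogy, not how node.js works internally. The function name hello_world and the delay parameter are made up for illustration. One difference is visible in the code: asyncio does not stop by itself when nothing is left to do, so the sketch waits on a future that the delayed callback resolves.

```python
import asyncio


def hello_world(delay=2.0):
    """Print 'hello' now and 'world' after delay seconds; return the order."""
    printed = []

    def say(word):
        print(word)
        printed.append(word)

    async def main():
        loop = asyncio.get_running_loop()
        done = loop.create_future()

        def later():
            say("world")
            done.set_result(None)

        loop.call_later(delay, later)  # roughly setTimeout(..., 2000)
        say("hello")                   # runs right away, nothing blocks
        await done                     # keep the loop alive until 'world'

    asyncio.run(main())
    return printed


if __name__ == "__main__":
    hello_world()
```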
Simple web server:
var http = require('http');

var s = http.createServer(function(req, res) {
    res.writeHead(200, {'content-type' : 'text/plain'});
    res.write("hello\n");
    setTimeout(function() {
        res.write("world\n");
        res.end();
    }, 2000)
});

s.listen(8000);
All these examples are from the introductory video. The video also shows a TCP server and a simple chat server built on TCP.
Installation
Under Ubuntu I already had a “node” command. According to “man node” it was something different so I removed it (sudo apt-get remove node).
Download the source from the HQ, extract it, then configure, make, sudo make install.
Start it with the command “node” and you get a prompt. Execute a script with “node file.js”.
Further reading
- What is node.js? @ SO
- npm, a package manager for node
- What is Node.js?
- Node.js is genuinely exciting
- Understanding node.js
- Learning Server-Side JavaScript with Node.js
- Node.js Step by Step: Introduction
- Node.js is Cancer (Ted Dziuba doesn’t like Node.js)
Scraping AJAX web pages (Part 2)
Don’t forget to check out the rest of the series too!
In this post we’ll see how to get the generated source of an HTML page. That is, we want to get the source with embedded Javascript calls evaluated.
Here is my solution:
#!/usr/bin/env python

"""
Simple webkit.
"""

import sys
from PyQt4 import QtGui, QtCore, QtWebKit

class SimpleWebkit():
    def __init__(self, url):
        self.url = url
        self.webView = QtWebKit.QWebView()

    def save(self):
        print self.webView.page().mainFrame().toHtml()
        sys.exit(0)

    def process(self):
        self.webView.load(QtCore.QUrl(self.url))
        QtCore.QObject.connect(self.webView, QtCore.SIGNAL("loadFinished(bool)"), self.save)

def process(url):
    app = QtGui.QApplication(sys.argv)
    s = SimpleWebkit(url)
    s.process()
    sys.exit(app.exec_())

#############################################################################

if __name__ == "__main__":
    #url = 'http://simile.mit.edu/crowbar/test.html'
    if len(sys.argv) > 1:
        process(sys.argv[1])
    else:
        print >>sys.stderr, "{0}: error: specify a URL.".format(sys.argv[0])
        sys.exit(1)
You can also find this script in my jabbapylib library.
Usage:
./simple_webkit.py 'http://dl.dropbox.com/u/144888/hello_js.html'
That is, just specify the URL of the page to be fetched. The generated HTML is printed to the standard output but you can easily redirect that to a file.
Pros
As you can see, it’s hyper simple. It uses a webkit instance to get and evaluate the page, which means that Javascript (and AJAX) calls will be executed. Also, the webkit instance is not visible in a window (headless browsing).
Cons
This solution is not perfect yet. The biggest problem is that AJAX calls can take some time and this script doesn’t wait for them. In fact, there is no way to know when all AJAX calls have finished, so we cannot know for sure when the page is completely loaded :( The best approach might be to integrate a waiting mechanism in the script, say “wait 5 seconds before printing the source”. Unfortunately I didn’t manage to add this feature. It should be doable with QTimer somehow. If someone can add this functionality to the script, please let me know.
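For what it's worth, here is one way the QTimer idea might look. Treat it as an untested sketch, not a verified fix: the function name fetch_after_delay and the wait_seconds parameter are made up for illustration, and it assumes PyQt4 with new-style signals is installed. It starts a single-shot timer once loadFinished fires and grabs the HTML when the timer expires. A fixed delay is still only a guess about when the AJAX calls are done.

```python
import sys


def fetch_after_delay(url, wait_seconds=5):
    """Load url in a QWebView, wait wait_seconds, return the generated HTML."""
    # Imported here so the module can be imported without PyQt4 installed.
    from PyQt4 import QtCore, QtGui, QtWebKit

    app = QtGui.QApplication.instance() or QtGui.QApplication(sys.argv)
    view = QtWebKit.QWebView()
    result = {}

    def grab():
        # Runs when the timer fires: take the current state of the page.
        result["html"] = view.page().mainFrame().toHtml()
        app.quit()  # leave the event loop instead of calling sys.exit()

    def on_load_finished(ok):
        # The plain source has arrived; give the AJAX calls some extra time.
        QtCore.QTimer.singleShot(int(wait_seconds * 1000), grab)

    view.loadFinished.connect(on_load_finished)
    view.load(QtCore.QUrl(url))
    app.exec_()  # blocks until grab() calls app.quit()
    return result.get("html")


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(fetch_after_delay(sys.argv[1]))
```

Because it quits the event loop rather than exiting the process, the function returns to its caller with the HTML.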
Challenge:
Try to download this page: CP002059.1. If you open it in Firefox for instance, at the bottom you’ll see a progress bar. For me the complete download takes about 10 sec. The script above will only fetch the beginning of the page :( Some help: the end of the downloaded sequence is this:
ORIGIN //
If you can modify the script above to work correctly with this particular page, let me know.
Another difficulty is how to integrate this downloader in a larger project. At the end, “app.exec_()” must be called, otherwise no output is produced. But if you call it, it terminates the script. My current workaround is to call this script as an external command and catch its output on stdout. If you have a better idea, let me know.
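The workaround just described, running the downloader as an external command and capturing its standard output, could look roughly like this. The helper name run_and_capture is made up for illustration, and the sketch is written for Python 3.5+ even though the downloader itself is a Python 2 script.

```python
import subprocess
import sys


def run_and_capture(command, timeout=60):
    """Run command (a list of arguments) and return its stdout as text.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the command runs longer than timeout.
    """
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,
    )
    return completed.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Stand-in command; for the script above it would be something like
    # run_and_capture(["./simple_webkit.py", url])
    print(run_and_capture([sys.executable, "-c", "print('<html></html>')"]))
```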
Resources used
- Downloading a page’s content with python and WebKit :: Downloading a page’s content after the javascript executed
- PyQt save DOM to file @ SO
Update (20110921)
I just found an even simpler solution here. And this one doesn’t exit(), so it can be integrated in another project easily (without the need for calling it as an external command). However, the “waiting problem” is still there.
What’s next
In the next part of this series, Part 3, we will see another way to download an AJAX page and address the problem of waiting X seconds for AJAX calls. Stay tuned.
Troubleshooting
If you get the following error message:
Gtk-WARNING **: Unable to locate theme engine in module_path: "pixmap",
then install this package:
sudo apt-get install gtk2-engines-pixbuf
This tip is from here.
Scraping AJAX web pages (Part 1.5)
Don’t forget to check out the rest of the series too!
Before attacking Part 2, I think it would be useful to investigate what the generated source of a page looks like.
Consider the following source:
<html>
<body>
<script>document.write("Hello World!");</script>
</body>
</html>
If you open it, you’ll see the text “Hello World!”. It’s not a big surprise :) But what is the generated source? How is the original html above interpreted by the browser?
Option A:
<html>
<body>
Hello World!
</body>
</html>
Option B:
<html>
<head></head>
<body>
<script>document.write("Hello World!");</script>Hello World!
</body>
</html>
Well, the correct answer is B. If you install the Web Developer add-on to Firefox, you’ll be able to see both sources: the original one (that is downloaded from the web server), and the generated one (which is produced by the browser after interpreting the original source).
If you don’t want to install Web Developer, there is another option. In Firefox, you can save a page in two different ways. If you save it as “Web Page, complete”, you’ll get the generated source. If you choose “Web Page, HTML only”, you’ll get the original source. However, if you save the “Hello World!” example as “Web Page, complete” and you open it from your local machine, you’ll see the text “Hello World!” twice! When you open the generated source, the embedded Javascript code will be executed again.
So, if you scrape AJAX pages, don’t be surprised if the resulting HTML source is still full of Javascript code. But if you use an intelligent method that understands Javascript, then the interpreted result will be in the source too. In the next part we will see how to download webpages with Python and webkit.
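To make the difference concrete, here is a small sketch that extracts the text outside script tags from both sources of the “Hello World!” example, using only the standard library (Python 3). The class and function names are made up for illustration. A tool that does not run Javascript finds no text at all in the original source, while the generated source (Option B) contains the result.

```python
from html.parser import HTMLParser

ORIGINAL_SOURCE = """<html>
<body>
<script>document.write("Hello World!");</script>
</body>
</html>"""

GENERATED_SOURCE = """<html>
<head></head>
<body>
<script>document.write("Hello World!");</script>Hello World!
</body>
</html>"""


class VisibleText(HTMLParser):
    """Collect text that is not inside a <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(repr(visible_text(ORIGINAL_SOURCE)))   # nothing outside the script
    print(repr(visible_text(GENERATED_SOURCE)))  # the text written by the script
```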
