Archive

Archive for the ‘html’ Category

Image Gallery from a list of URLs

December 18, 2012 Leave a comment

Problem
I have several scrapers that extract images. How to visualize them? One way is to open each one in a new browser tab but it’s slow and who wants to have several hundreds of tabs? Is there a way to browse these images in one single tab?

Solution
A primitive solution would be to create an HTML page that lists all the images one below the other. But again, what if you have lots of images?

A better way is to organize the images in a gallery. There are tons of image gallery generators out there but most of them work with local images. I want to browse remote images when only their URLs are available. So I made my own image gallery generator that works with URLs. Available on github.

There is also a live demo, check it out.

The software is written in Python. See the README file for usage examples.

Categories: html, python Tags: , , ,

Put a text on the clipboard from your webpage

December 18, 2012 Leave a comment

Problem
From an HTML page you want to copy some text on the clipboard by pressing a button.

Example: on a page you present a list of URLs. Next to each URL there is a button. If the user clicks on the button, the corresponding URL is copied to his/her clipboard.

Solution
You can use clippy for this task. “Clippy is a very simple Flash widget that makes it possible to place arbitrary text onto the client’s clipboard.

Here is an HTML template that you must paste in your HTML: clippy.html. Simply replace “{{ clippy_text }}” and “{{ bgcolor }}” with the values you want.

Update (20130103)
GitHub also used clippy but recently they switched to ZeroClipboard. Here is their announcement.

Categories: html Tags: , ,

Youtube audio player

November 20, 2012 Leave a comment

At http://www.labnol.org/internet/youtube-audio-player/26740/ I found a nice trick to embed just a part of the Youtube Flash player, thus the player looks like an audio player. All you need is a little CSS trick:

Categories: firefox, google, html Tags: , , ,

Codecademy – learn HTML, CSS, Javascript

April 19, 2012 Leave a comment

Codecademy is the easiest way to learn how to code. It’s interactive, fun, and you can do it with your friends.

http://www.codecademy.com

Categories: html, javascript Tags: ,

Scraping AJAX web pages (Part 3)

November 8, 2011 3 comments

Don’t forget to check out the rest of the series too!

In Part 2 we saw how to download an Ajax-powered webpage. However, there was a problem with that approach: sometimes it terminated too quickly, thus it fetched just part of a page. The problem with Ajax is that we cannot tell for sure when a page is completely downloaded.

So, the solution is to integrate some waiting mechanism in the script. That is, we need the following: “open a given page, wait X seconds, then get the HTML source”. Hopefully all Ajax calls will be finished in X seconds. It is you who decides how many seconds to wait. Or, you can analyze the partially downloaded HTML and if something is missing, wait some more.

Here I will use Splinter for this task. It opens a browser window that you can control from Python. Thanks to the browser, it can interpret Javascript. The only disadvantage is that the browser window is visible.

Example
Let’s see how to fetch the page CP002059.1. If you open it in a browser, you’ll see a status bar at the bottom that indicates the download progress. For me it takes about 20 seconds to fully get this page. By analyzing the content of the page, we can notice that the string “ORIGIN” appears just once, at the end of the page. So we’ll check its presence in a loop and wait until it arrives.

#!/usr/bin/env python

from time import sleep
from splinter.browser import Browser

url = 'http://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'

def main():
    browser = Browser()
    browser.visit(url)

    # variation A:
    while 'ORIGIN' not in browser.html:
        sleep(5)

    # variation B:
    # sleep(30)   # if you think everything arrives in 30 seconds

    f = open("/tmp/source.html", "w")   # save the source in a file
    print >>f, browser.html
    f.close()

    browser.quit()
    print '__END__'

#############################################################################

if __name__ == "__main__":
    main()

You might be tempted to check the presence of ‘</html>’. However, don’t forget that the browser downloads a plain source first starting with ‘<html><body>…’ until ‘</body></html>’. Then it starts to interpret the source and if it finds some Ajax calls, they will be called, and these calls will expand something in the body of the HTML. So you’ll have ‘</html>’ right at the beginning.

Future work
This is not bad but I’m still not fully satisfied. I’d like something like this but without any browser window. If you have a headless solution, let me know. I think it’s possible with PhantomJS and/or Zombie.js but I had no time yet to investigate them.

Powerpoint is dead (HTML5 presentations with landslide)

September 23, 2011 Leave a comment

Powerpoint is dead. Well, not yet, but for simple presentations you can use the following tool perfectly. This entry is based on Francisco Souza’s excellent post entitled “Creating HTML 5 slide presentations using landslide“. Here I make a short summary.


Landslide is a Python tool for converting marked-up texts to HTML5 slide presentations. The input text can be written in Markdown, reStructuredText, or Textile. A sample slideshow presenting landslide itself is here.

Sample input: 2.md (taken from the landslide project)
Sample output: presentation.html

Installation

sudo pip install landslide

Usage

landslide text.md

If you want to share it on the Internet: “landslide -cr text.md“.

Help: “landslide --help“.

To learn about the customization of the theme, refer to Francisco’s post.

Convert to PDF

landslide file.md -d out.pdf

For this you need Prince XML, which is free for non-commercial use. Unfortunately the output is black and white with additional blank pages for notes. If you know how to have colored PDFs without the extra pages, let me know.

It’d be interesting to replace Prince XML with wkhtmltopdf. I made some tests but the output was not nice. I think it could be tweaked though.

Related stuff

Pandoc is a universal document converter.

If you need to convert files from one markup format into another, pandoc is your swiss-army knife. Need to generate a man page from a markdown file? No problem. LaTeX to Docbook? Sure. HTML to MediaWiki? Yes, that too. Pandoc can read markdown and (subsets of) reStructuredText, textile, HTML, and LaTeX, and it can write plain text, markdown, reStructuredText, HTML, LaTeX, ConTeXt, PDF, RTF, DocBook XML, OpenDocument XML, ODT, GNU Texinfo, MediaWiki markup, textile, groff man pages, Emacs org-mode, EPUB ebooks, and S5 and Slidy HTML slide shows. PDF output (via LaTeX) is also supported with the included markdown2pdf wrapper script.

Scraping AJAX web pages (Part 2)

September 20, 2011 5 comments

Don’t forget to check out the rest of the series too!

In this post we’ll see how to get the generated source of an HTML page. That is, we want to get the source with embedded Javascript calls evaluated.

Here is my solution:

#!/usr/bin/env python

"""
Simple webkit.
"""

import sys
from PyQt4 import QtGui, QtCore, QtWebKit

class SimpleWebkit():
    def __init__(self, url):
        self.url = url
        self.webView = QtWebKit.QWebView()

    def save(self):
        print self.webView.page().mainFrame().toHtml()
        sys.exit(0)

    def process(self):
        self.webView.load(QtCore.QUrl(self.url))
        QtCore.QObject.connect(self.webView, QtCore.SIGNAL("loadFinished(bool)"), self.save)

def process(url):
    app = QtGui.QApplication(sys.argv)
    s = SimpleWebkit(url)
    s.process()
    sys.exit(app.exec_())

#############################################################################

if __name__ == "__main__":
    #url = 'http://simile.mit.edu/crowbar/test.html'
    if len(sys.argv) > 1:
        process(sys.argv[1])
    else:
        print >>sys.stderr, "{0}: error: specify a URL.".format(sys.argv[0])
        sys.exit(1)

You can also find this script in my jabbapylib library.

Usage:

./simple_webkit.py 'http://dl.dropbox.com/u/144888/hello_js.html'

That is, just specify the URL of the page to be fetched. The generated HTML is printed to the standard output but you can easily redirect that to a file.

Pros
As you can see, it’s hyper simple. It uses a webkit instance to get and evaluate the page, which means that Javascript (and AJAX) calls will be executed. Also, the webkit instance is not visible in a window (headless browsing).

Cons
This solution is not yet perfect. The biggest problem is that AJAX calls can take some time and this script doesn’t wait for them. Actually, it cannot be known when all AJAX calls are terminated, so we cannot know for sure when the page is completely loaded :( The best way could be to integrate a waiting mechanism in the script, say “wait 5 seconds before printing the source”. Unfortunately I didn’t manage to add this feature. It should be done with QTimer somehow. If someone could add this functionality to this script, please let me know.

Challenge:
Try to download this page: CP002059.1. If you open it in Firefox for instance, at the bottom you’ll see a progress bar. For me the complete download takes about 10 sec. The script above will only fetch the beginning of the page :( Some help: the end of the downloaded sequence is this:

ORIGIN
//

If you can modify the script above to work correctly with this particular page, let me know.

Another difficulty is how to integrate this downloader in a larger project. At the end, “app.exec_()” must be called, otherwise no output is produced. But if you call it, it terminates the script. My current workaround is to call this script as an external command and catch its output on stdout. If you have a better idea, let me know.

Resources used

Update (20110921)
I just found an even simpler solution here. And this one doesn’t exit(), so it can be integrated in another project easily (without the need for calling it as an external command). However, the “waiting problem” is still there.

What’s next
In the next part of this series we will see another way to download an AJAX page. In Part 3 we will address the problem of waiting X seconds for AJAX calls. Stay tuned.

Troubleshooting
If you get the following error message:

Gtk-WARNING **: Unable to locate theme engine in module_path: "pixmap",

Then install this package:

sudo apt-get install gtk2-engines-pixbuf

This tip is from here.

Scraping AJAX web pages (Part 1.5)

September 19, 2011 1 comment

Don’t forget to check out the rest of the series too!

Before attacking Part 2, I think it would be useful to investigate what the generated source of a page looks like.

Consider the following source:

<html>
<body>
<script>document.write("Hello World!");</script>
</body>
</html>

If you open it, you’ll see the text “Hello World!”. It’s not a big surprise :) But what is the generated source? How is the original html above interpreted by the browser?

Option A:

<html>
<body>
Hello World!
</body>
</html>

Option B:

<html>
<head></head>
<body>
<script>document.write("Hello World!");</script>Hello World!
</body>
</html>

Well, the correct answer is B. If you install the Web Developer add-on to Firefox, you’ll be able to see both sources: the original one (that is downloaded from the web server), and the generated one (which is produced by the browser after interpreting the original source).

If you don’t want to install Web Developer, there is another option. In Firefox, you can save a page in two different ways. If you save it as “Web Page, complete”, you’ll get the generated source. If you choose “Web Page, HTML only”, you’ll get the original source. However, if you save the “Hello World!” example as “Web Page, complete” and you open it from your local machine, you’ll see the text “Hello World!” twice! When you open the generated source, the embedded Javascript code will be executed again.

So, if you scrape AJAX pages, don’t be surprised if the resulting HTML source is still full of Javascript codes. But if you use an intelligent method that understands Javascript, then the interpreted result will be in the source too. In the next part we will see how to download webpages with Python and webkit.

Embed images in HTML pages

April 17, 2011 Leave a comment

If you have a web page with some images, usually the images are stored in separate files. Would it be possible to include all the images in the HTML file, making a self-contained file?

Well, yes, with the help of the data URI scheme. “The data URI scheme is a URI scheme that provides a way to include data in-line in web pages as if they were external resources.

Example: (borrowed from here)

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />

As you can see, the data can contain newline characters. If you put it an HTML file and open it with your browser, you’ll see a small red dot.

General format:

 data:<MIME-type>;base64,<data>
 

Script to produce it

I made a little script that can do this BASE64 encoding of image files. You can download it from here (img_to_base64.py).

Usage:
------

./img_to_base64.py <image_file>
    By default, the data is nested in an HTML tag and the output
    is wrapped. These settings can be customized.
    The output is printed to the standard output.
    
Sample output:
--------------

<img class='inline-image' src='data:image/gif;base64,R0lGODlhIgAbAPMPAGxsbNbW1v
/rhf/ge//3kf/Ub9/f3/b29oeHh/7LZv/0juazTktLS8WSLf//mf///yH5BAAAAAAALAAAAAAiABsAA
ASA8MlJq7046827/2AojiTVnI1xlFZjBisruU7tPCiqjg2h/L9KA2HgCQS5pE7UGLgwAhyCWWjYrrWE
owFgJqyEsDi82HZDja/jyGaXuV7rYE6fv8+gtLXA7/OtcCEGSoQMUyEHAQgAjI2OAAgBIwcGAZaXmAE
7Mpydnp+goaKjFBEAOw==' />

Project idea

It could be interesting to write a script that transforms an HTML file into a self-contained file, i.e. replace all occurrences of ‘<img src="some_image" />‘ with an embedded base64 encoded data.

Ref.: I met this image inlining technique in justinvh’s Markdown-LaTeX project.

Scraping AJAX web pages (Part 1)

April 15, 2011 6 comments

Don’t forget to check out the rest of the series too!

Problem

You want to download a web page whose source if full of AJAX calls. You want the result that is shown in your browser, i.e. you want the generated (post-AJAX) source.

Example
Consider this simple page: test.html. If you open it in your browser, you’ll see the text “Hi Crowbar!”. However, if you download this page with wget for instance, in the source code you’ll see the text “Hi lame crawler”. Explanation: your browser downloads and then interprets the page. The executed JavaScript code updates the DOM of the page. A simple downloader like wget doesn’t interpret the source of a page just grabs it.

Solution #1

One way for getting the post-AJAX source of a web page is to use Crowbar. “Crowbar is a web scraping environment based on the use of a server-side headless mozilla-based browser. Its purpose is to allow running javascript scrapers against a DOM to automate web sites scraping…

When you launch Crowbar, it offers a RESTful web service listening by default on port 10000. Just open the page http://127.0.0.1:10000/. The trick behind Crowbar is that it turns a web browser into a web server.

Now we can download AJAX pages with wget the following way. Let’s get the previous test page:

wget "http://127.0.0.1:10000/?url=http://simile.mit.edu/crowbar/test.html" -O tricky.html

If you check the source of the saved file, you will see the post-AJAX source that you would normally see in a web browser. You can also pass some other parameters to the Crowbar web service, they are detailed here. The most important parameter is “delay” that tells Crowbar how much it should wait after the page has terminated loading before attempting to serialize its DOM. By default its value is 3000 msec, i.e. 3 sec. If the page you want to download contains lots of AJAX calls then consider increasing the delay, otherwise you will get an HTML source that is not fully expanded yet.

Use case:
I wanted to download the following page from the NCBI database: CP002059.1. The page is quite big (about 5 MB), thus I had to wait about 10 sec. to get it in my browser. From the command-line I could fetch it this way (I gave it some extra time to be sure):

wget "http://127.0.0.1:10000/?url=http://www.ncbi.nlm.nih.gov/nuccore/CP002059.1&delay=15000" -O CP002059.1.html

Notes: If you want to download data from NCBI, there is a better way.

Did you know?
In Firefox, if you look at the source of a page (View -> Page Source), you will see the downloaded (pre-AJAX) source. If you want to see the generated (post-AJAX) source, you can use the Web Developer add-on (View Source -> View Generated Source).

Also, still in Firefox, if you save a web page with File -> Save Page As… and you choose “Web Page, HTML only”, Firefox will save the original (pre-AJAX) source. If you want the fully expanded (generated) source, choose the option “Web Page, complete”.

Solution #2
Another solution is to write a program/script that uses the webkit open source browser engine. In an upcoming post I will show you how to do it with Python.

Appendix
Crowbar launch script for Linux:

#!/usr/bin/bash

# my crowbar is installed here: /opt/crowbar
# location of this file: /opt/crowbar/start.sh
xulrunner --install-app xulapp
xulrunner xulapp/application.ini

Crowbar launch script for Windows (update of 20110601):

rem My crowbar is installed here: c:\Program Files\crowbar
rem Location of this file: c:\Program Files\crowbar\start.bat

"%XULRUNNER_HOME%\xulrunner.exe" --install-app xulapp
"%XULRUNNER_HOME%\xulrunner.exe" xulapp\application.ini

XULRunner for Windows can be downloaded from here.

/ discussion /

Follow

Get every new post delivered to your Inbox.

Join 42 other followers