“Codecademy is the easiest way to learn how to code. It’s interactive, fun, and you can do it with your friends.”
Today I visited a site and it blew the following message in my face: “if you want to use this site, disable AdBlock”. What da… ?
See for instance the color picker demo.
In Part 2 we saw how to download an Ajax-powered webpage. However, there was a problem with that approach: sometimes it terminated too quickly, so it fetched only part of the page. The problem with Ajax is that we cannot tell for sure when a page has been completely downloaded.
So the solution is to build a waiting mechanism into the script. That is, we need the following: “open a given page, wait X seconds, then get the HTML source”. Hopefully all Ajax calls will have finished within X seconds. You decide how many seconds to wait. Alternatively, you can analyze the partially downloaded HTML and, if something is missing, wait some more.
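The "check, and if something is missing, wait some more" idea can be sketched as a small polling helper. This is only an illustration: the name `wait_until` and its parameters are made up for this note and are not part of splinter. Unlike a bare loop, it gives up after a timeout instead of waiting forever.

```python
import time


def wait_until(predicate, timeout=60.0, interval=5.0):
    """Call predicate() every `interval` seconds until it returns True.

    Returns True as soon as the predicate succeeds, or False if
    `timeout` seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage with a splinter browser object:
#   ok = wait_until(lambda: 'ORIGIN' in browser.html, timeout=60, interval=5)
#   if not ok: the marker never showed up, so the page is probably incomplete
```

The predicate is checked once before any sleeping, so a page that is already complete costs no waiting time.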
Let’s see how to fetch the page CP002059.1. If you open it in a browser, you’ll see a status bar at the bottom that indicates the download progress. For me it takes about 20 seconds to load this page fully. Analyzing the content of the page, we notice that the string “ORIGIN” appears just once, at the end of the page. So we’ll check for its presence in a loop and wait until it arrives.
    #!/usr/bin/env python3

    from time import sleep

    from splinter.browser import Browser

    url = 'http://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'


    def main():
        browser = Browser()
        browser.visit(url)

        # variation A: poll until the marker at the end of the page shows up
        while 'ORIGIN' not in browser.html:
            sleep(5)

        # variation B:
        # sleep(30)   # if you think everything arrives in 30 seconds

        # save the source in a file
        with open("/tmp/source.html", "w", encoding="utf-8") as f:
            print(browser.html, file=f)

        browser.quit()
        print('__END__')

    #############################################################################

    if __name__ == "__main__":
        main()
You might be tempted to check for the presence of ‘</html>’. However, don’t forget that the browser first downloads the plain source, starting with ‘<html><body>…’ and ending with ‘</body></html>’. Only then does it start to interpret the source, and if it finds some Ajax calls, it executes them, and these calls expand something in the body of the HTML. So ‘</html>’ is there right from the beginning, before the Ajax content arrives.
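A tiny illustration of this pitfall. The two strings below are made-up stand-ins for a page snapshot before and after the Ajax calls finish; they are not the real NCBI markup, and `looks_complete` is a hypothetical helper.

```python
# Hypothetical snapshots of the same page (not the real NCBI source).
before_ajax = "<html><body><div id='viewer'>Loading...</div></body></html>"
after_ajax = "<html><body><div id='viewer'>LOCUS ... ORIGIN ...</div></body></html>"


def looks_complete(html, marker):
    """Naive completeness check: is the marker string in the source?"""
    return marker in html


# '</html>' is useless as a marker: it is present in both snapshots.
print(looks_complete(before_ajax, "</html>"))  # True
print(looks_complete(after_ajax, "</html>"))   # True

# A string that only the Ajax response contains tells the two apart.
print(looks_complete(before_ajax, "ORIGIN"))   # False
print(looks_complete(after_ajax, "ORIGIN"))    # True
```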
This is not bad, but I’m still not fully satisfied. I’d like something like this but without any browser window. If you have a headless solution, let me know. I think it’s possible with PhantomJS and/or Zombie.js, but I haven’t had time to investigate them yet.
“D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then apply data-driven transformations to the document. As a trivial example, you can use D3 to generate a basic HTML table from an array of numbers. Or, use the same data to create an interactive SVG bar chart with smooth transitions and interaction.” (source)
See also D3 on GitHub.
I haven’t used it yet; this is just a “good-to-know-about-it” note.
PowerPoint is dead. Well, not yet, but for simple presentations the following tool works perfectly well. This entry is based on Francisco Souza’s excellent post entitled “Creating HTML 5 slide presentations using landslide”. Here is a short summary.
Landslide is a Python tool for converting marked-up texts to HTML5 slide presentations. The input text can be written in Markdown, reStructuredText, or Textile. A sample slideshow presenting landslide itself is here.
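To give an idea of what a Markdown input looks like, here is a rough sketch that generates one. It assumes landslide's documented Markdown convention, where slides are separated by a horizontal rule (`---`) and each slide starts with a heading. The helper `build_slides` and the sample content are invented for this note.

```python
def build_slides(title, slides):
    """Return Markdown text with one title slide plus one slide per entry.

    `slides` is a list of (heading, bullet_points) pairs. Slides are
    separated by '---' rules, with no rule after the last slide.
    """
    parts = ["# " + title]
    for heading, bullets in slides:
        lines = ["# " + heading, ""]
        lines += ["- " + bullet for bullet in bullets]
        parts.append("\n".join(lines))
    return "\n\n---\n\n".join(parts) + "\n"


text = build_slides("Demo", [
    ("Intro", ["one", "two"]),
    ("End", ["bye"]),
])
print(text)
```

Save the printed text as `text.md` and pass that file to landslide.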
sudo pip install landslide
If you want to share it on the Internet: “landslide -cr text.md”.
To learn about the customization of the theme, refer to Francisco’s post.
Convert to PDF
landslide file.md -d out.pdf
For this you need Prince XML, which is free for non-commercial use. Unfortunately, the output is black and white, with additional blank pages for notes. If you know how to get color PDFs without the extra pages, let me know.
It’d be interesting to replace Prince XML with wkhtmltopdf. I made some tests but the output was not nice. I think it could be tweaked though.
Pandoc is a universal document converter.
“If you need to convert files from one markup format into another, pandoc is your swiss-army knife. Need to generate a man page from a markdown file? No problem. LaTeX to Docbook? Sure. HTML to MediaWiki? Yes, that too. Pandoc can read markdown and (subsets of) reStructuredText, textile, HTML, and LaTeX, and it can write plain text, markdown, reStructuredText, HTML, LaTeX, ConTeXt, PDF, RTF, DocBook XML, OpenDocument XML, ODT, GNU Texinfo, MediaWiki markup, textile, groff man pages, Emacs org-mode, EPUB ebooks, and S5 and Slidy HTML slide shows. PDF output (via LaTeX) is also supported with the included markdown2pdf wrapper script.”
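A minimal sketch of how pandoc could be driven from a Python script. The helper names `pandoc_command` and `convert` and the file names are made up for this note. Only the standard pandoc options `-f` (input format), `-t` (output format) and `-o` (output file) are used. When the formats are omitted, pandoc guesses them from the file extensions.

```python
import shutil
import subprocess


def pandoc_command(source, target, from_format=None, to_format=None):
    """Build the pandoc command line as a list of arguments."""
    cmd = ["pandoc", source, "-o", target]
    if from_format:
        cmd += ["-f", from_format]
    if to_format:
        cmd += ["-t", to_format]
    return cmd


def convert(source, target, from_format=None, to_format=None):
    """Run pandoc; raise if it is not installed or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    subprocess.run(
        pandoc_command(source, target, from_format, to_format),
        check=True,
    )


# Hypothetical usage:
#   convert("page.html", "page.wiki", from_format="html", to_format="mediawiki")
```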
This is not a real post, just a reminder for me. I should look at these projects in detail in the future.