Archive
Analyze a User-Agent string
You can analyze a user-agent string with http://user-agent-string.info/.
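If you only need a rough look at a user-agent string from a script, a few lines of Python can pull out the "Product/Version" tokens. This is a minimal sketch and not how the site above works: the function name `product_tokens`, the regular expression, and the sample string are assumptions made for illustration, and a real parser handles many more cases.

```python
import re

# Matches "Product/Version" tokens such as "Firefox/21.0".
PRODUCT_RE = re.compile(r"([A-Za-z][\w.-]*)/([\w.]+)")


def product_tokens(user_agent):
    """Return (product, version) pairs found in a User-Agent string."""
    return PRODUCT_RE.findall(user_agent)


# Sample string chosen for illustration only.
ua = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0"
for product, version in product_tokens(ua):
    print(product, version)
```

The parenthesized comment part (operating system, platform) is ignored here on purpose; interpreting it reliably is what the dedicated services are for.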
fartscroll.js: funniest JavaScript code ever
“Everyone farts. And now your web pages can too.”
Check out fartscroll.js in action.
SHODAN: Computer Search Engine
marker.to: annotate a webpage and share it
Problem
There is a webpage with lots of useful information and you want to highlight/annotate some parts of it for future reference.
In my current case, I want to watch the videos of PyCon US 2013, but I don’t have much time, so it will take some weeks until I finish them. In the meantime, I don’t want to forget which ones I have already seen. It would be nice if I could add some custom notes to the webpage that lists the videos. In addition, I want to see/edit my annotations both at home and at my workplace, i.e. it should be web-based.
Solution
Marker.to is a browser extension that does exactly this.
“Have you ever used a marker pen for highlighting your paper documents?
Marker.to will do the same for webpages! After you install a browser extension or bookmarklet and run Marker simply by clicking on the icon, you can highlight text on a website with your mouse.
Annotating helps to point out important information in long articles. Our annotator creates a special link for the highlighted version of the page (like http://marker.to/XYZ123). You can share it through Twitter, Facebook, e-mail, IM etc. Or just bookmark it with your favorite bookmarking tool, i.e. Delicious.
You can view highlighted page with any browser! No bookmarklet or browser extension is necessary for viewing.” (source)
Related Work
- Top Web Annotation and Markup Tools (I didn’t try all of them)
Nikola: a static site and blog generator
“Nikola is a static website and blog generator. The very short explanation is that it takes some texts you wrote, and uses them to create a folder full of HTML files. If you upload that folder to a server, you will have a rather full-featured website, done with little effort.”
similsite: find similar sites
“SimilSite is a search engine to find similar websites, and a directory to find interesting information on every website.”
W3Schools is not recommended
Visit http://w3fools.com/ to see why.
repl.it
“The repl.it project is an attempt to create an online environment for interactively exploring programming languages. It provides a fully-featured terminal emulator and code editor, powered by interpreter engines for more than 15 languages.
All our interpreters are written in (or compiled to) JavaScript, and run completely on the user’s device, regardless of whether it’s a desktop, laptop or phone.” (source)
FAQ here. And yes, they have Python.
Nettuts+
“Nettuts+ is a site aimed at web developers and designers offering tutorials and articles on technologies, skills and techniques to improve how you design and build websites. We cover HTML, CSS, Javascript, CMS’s, PHP and Ruby on Rails.”
They also have Python tutorials.
Scraping AJAX web pages (Part 3)
Don’t forget to check out the rest of the series too!
In Part 2 we saw how to download an Ajax-powered webpage. However, there was a problem with that approach: sometimes it terminated too quickly, thus it fetched just part of a page. The problem with Ajax is that we cannot tell for sure when a page is completely downloaded.
So, the solution is to integrate some waiting mechanism in the script. That is, we need the following: “open a given page, wait X seconds, then get the HTML source”. Hopefully all Ajax calls will be finished in X seconds. It is you who decides how many seconds to wait. Or, you can analyze the partially downloaded HTML and if something is missing, wait some more.
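The “check, wait, check again” idea can be sketched independently of any browser. The helper below is only an illustration, not part of the original script: `wait_for`, `timeout`, `interval` and `get_html` are hypothetical names, and `get_html` stands for any callable that returns the current page source. Unlike a bare loop, it gives up after `timeout` seconds so the script cannot hang forever if the marker never arrives.

```python
import time


def wait_for(get_html, marker, timeout=60.0, interval=5.0, sleep=time.sleep):
    """Poll get_html() until marker appears or timeout seconds have passed.

    Returns a tuple (html, found): the last HTML seen and a flag that
    tells whether the marker was present in it.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if marker in html:
            return html, True
        if time.monotonic() >= deadline:
            return html, False
        sleep(interval)


if __name__ == "__main__":
    # Simulated page that "fills in" a little more on every check.
    snapshots = iter(["<html></html>", "<html>LOCUS</html>", "<html>LOCUS ORIGIN</html>"])
    html, found = wait_for(lambda: next(snapshots), "ORIGIN", interval=0.1)
    print(found, len(html))
```

With a real browser object, the first argument could be something like `lambda: browser.html`, assuming the browser exposes the current source that way.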
Here I will use Splinter for this task. It opens a browser window that you can control from Python. Because a real browser does the work, the JavaScript on the page is executed. The only disadvantage is that the browser window is visible.
Example
Let’s see how to fetch the page CP002059.1. If you open it in a browser, you’ll see a status bar at the bottom that indicates the download progress. For me it takes about 20 seconds to fully get this page. By analyzing the content of the page, we can notice that the string “ORIGIN” appears just once, at the end of the page. So we’ll check its presence in a loop and wait until it arrives.
#!/usr/bin/env python3
from time import sleep
from splinter.browser import Browser

url = 'http://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'

def main():
    browser = Browser()
    browser.visit(url)

    # variation A: poll until the marker string shows up in the page
    while 'ORIGIN' not in browser.html:
        sleep(5)

    # variation B:
    # sleep(30)   # if you think everything arrives in 30 seconds

    # save the source in a file
    with open("/tmp/source.html", "w") as f:
        print(browser.html, file=f)

    browser.quit()
    print('__END__')

#############################################################################

if __name__ == "__main__":
    main()
You might be tempted to check for the presence of ‘</html>’. However, don’t forget that the browser first downloads the plain source, from ‘<html><body>…’ to ‘</body></html>’. Only then does it start to interpret the source; if it finds Ajax calls, it executes them, and their results expand something in the body of the HTML. So ‘</html>’ is there right from the beginning.
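A tiny experiment makes the difference visible. The two strings below are made-up snapshots of the same page, one right after the initial download and one after the Ajax calls have finished; the function names are hypothetical and the content is heavily simplified.

```python
# Hypothetical snapshots of one page at two moments in time.
initial = "<html><body><div id='content'></div></body></html>"
loaded = "<html><body><div id='content'>LOCUS ... ORIGIN ...</div></body></html>"


def is_complete_naive(html):
    """Naive check: the closing tag is already in the initial skeleton."""
    return "</html>" in html


def is_complete_by_marker(html, marker="ORIGIN"):
    """Check for a string that only the Ajax-loaded content contains."""
    return marker in html


for name, html in (("initial", initial), ("loaded", loaded)):
    print(name, is_complete_naive(html), is_complete_by_marker(html))
```

The naive check reports both snapshots as complete, while the marker check only accepts the second one, which is why the script looks for “ORIGIN”.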
Future work
This is not bad, but I’m still not fully satisfied. I’d like something like this but without any browser window. If you have a headless solution, let me know. I think it’s possible with PhantomJS and/or Zombie.js, but I haven’t had time to investigate them yet.