view the generated HTML source in Firefox

July 20, 2016

To view the generated HTML source in Firefox, use this bookmarklet:

javascript: var win = window.open(); win.document.write('<html><head><title>Generated HTML of  ' + location.href + '</title></head><pre>' + document.documentElement.innerHTML.replace(/&/g, '&amp;').replace(/</g, '&lt;') + '</pre></html>'); win.document.close(); void 0;

This tip is from here.
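For readers who prefer to see the escaping step in isolation, here is a minimal Python sketch of what the bookmarklet does to the markup before writing it into the new window. The function names escape_for_pre and wrap_generated_html are made up for this illustration; only the two replacements and the surrounding tags come from the bookmarklet itself.

```python
def escape_for_pre(markup):
    """Escape markup so a browser shows it as text instead of rendering it."""
    # '&' has to be handled first; otherwise the '&' in the freshly
    # inserted '&lt;' would itself be turned into '&amp;'.
    return markup.replace("&", "&amp;").replace("<", "&lt;")


def wrap_generated_html(markup, url):
    """Build the same page the bookmarklet writes into the new window."""
    return (
        "<html><head><title>Generated HTML of " + url + "</title></head><pre>"
        + escape_for_pre(markup)
        + "</pre></html>"
    )
```

The order of the two replacements matters, which is why the bookmarklet also replaces & before <.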

Scraping AJAX web pages (Part 5.5)

July 13, 2016

Don’t forget to check out the rest of the series too!

This post is very similar to the previous one (Part 5), which scraped a webpage using PhantomJS from the command line and sent the output to stdout.

This time we use PhantomJS again, but we do it from a Python script and wrap Selenium around PhantomJS. The generated HTML source will be available in a variable. Here is the source:

#!/usr/bin/env python3
# encoding: utf-8

"""
required packages:
* selenium
optional packages:
* bs4
* lxml
"""

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
# from bs4 import BeautifulSoup

url = "http://simile.mit.edu/crowbar/test.html"

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)
html = driver.page_source
print(html)
# soup = BeautifulSoup(driver.page_source, "lxml") #page_source fetches page after rendering is complete
# driver.save_screenshot('screen.png') # save a screenshot to disk

driver.quit()

The script sets the user agent (optional but recommended) and captures the generated source in the variable html. The two commented-out lines near the end would also work: you could feed the source to BeautifulSoup (uncomment its import too) and then extract part of the HTML source. If you uncomment the save_screenshot line, you can create a screenshot of the webpage.
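As a rough illustration of extracting part of the generated source, here is a small sketch that pulls the text out of one tag. Note the assumptions: it uses html.parser from the standard library instead of BeautifulSoup (so it runs without extra packages), and the class TagTextExtractor and the function extract_text are names invented for this example.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while we are inside the wanted tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html, tag):
    """Return the stripped text inside the given tag (hypothetical helper)."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

In the script above you would call something like extract_text(html, "h1") after driver.page_source has been read; which tag to look for depends on the page.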

Scraping AJAX web pages (Part 5)

July 13, 2016

Don’t forget to check out the rest of the series too!

I’ve already written about PhantomJS, for instance here. Recall: PhantomJS is a headless WebKit scriptable with a JavaScript API.

The problem is still the same: we have a webpage that contains lots of JavaScript code and we want to get the final HTML that is produced after the JavaScript code has been executed.

Example: http://simile.mit.edu/crowbar/test.html. If you download it with “wget” for instance, you get the text “Hi lame crawler” in the source. However, JavaScript code changes this text to “Hi Crowbar!” in the browser and we want to get this generated source. How?

This time we’ll use PhantomJS. We also need a JavaScript script that will instruct PhantomJS what to do. Let’s call it printSource.js:

var system = require('system');
var page   = require('webpage').create();
// system.args[0] is the filename, so system.args[1] is the first real argument
var url    = system.args[1];
// render the page, and run the callback function
page.open(url, function () {
  // page.content is the source
  console.log(page.content);
  // need to call phantom.exit() to prevent from hanging
  phantom.exit();
});

Note that this code comes from here.

If you want to set the user agent, use this modified script:

var system = require('system');
var page   = require('webpage').create();
page.settings.userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)';
// system.args[0] is the filename, so system.args[1] is the first real argument
var url    = system.args[1];
// render the page, and run the callback function
page.open(url, function () {
  // page.content is the source
  console.log(page.content);
  // need to call phantom.exit() to prevent from hanging
  phantom.exit();
});

Then launch the following command:

$ phantomjs printSource.js http://simile.mit.edu/crowbar/test.html

The output is printed to the standard output.

If you want to do the same thing from a Python script, check out Part 5.5 of the series.
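Besides the Selenium route, the command above could also be launched directly from Python. The following is only a sketch under some assumptions: phantomjs is on the PATH, printSource.js is in the current directory, and the helper names build_command and fetch_generated_html are invented here.

```python
import subprocess


def build_command(url, script="printSource.js", phantomjs="phantomjs"):
    """Return the argument list for the phantomjs call shown above."""
    return [phantomjs, script, url]


def fetch_generated_html(url, script="printSource.js", timeout=60):
    """Run PhantomJS and return what it printed to standard output."""
    result = subprocess.run(
        build_command(url, script),
        capture_output=True,   # collect stdout and stderr
        text=True,             # decode the output to str
        timeout=timeout,       # do not hang forever if the page never loads
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.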

Categories: bash

indent HTML

July 13, 2016

Problem
You have an ugly HTML file and you want to indent it nicely. For instance, you want to scrape something from it, but first it would be a good idea to indent the source.

Solution
The program “tidy” can do that. Create the following config file (tidy_config.txt):

indent: auto
indent-spaces: 2
quiet: yes
tidy-mark: no

Then call tidy the following way:

$ tidy -config tidy_config.txt ugly.html > nice.html

Tip from here.
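If you would rather drive tidy from a script, the two steps above (write the config file, then call tidy with it) could be sketched in Python like this. The names TIDY_OPTIONS, render_tidy_config and indent_html are made up for the example, and it assumes that tidy is installed and on the PATH.

```python
import subprocess

# The same four options as in tidy_config.txt above.
TIDY_OPTIONS = {
    "indent": "auto",
    "indent-spaces": "2",
    "quiet": "yes",
    "tidy-mark": "no",
}


def render_tidy_config(options):
    """Render the options in tidy's 'name: value' config file format."""
    return "".join(f"{name}: {value}\n" for name, value in options.items())


def indent_html(src, dst, config_path="tidy_config.txt"):
    """Write the config file and run tidy on src, saving the result to dst."""
    with open(config_path, "w") as config_file:
        config_file.write(render_tidy_config(TIDY_OPTIONS))
    with open(dst, "w") as out:
        result = subprocess.run(["tidy", "-config", config_path, src], stdout=out)
    # tidy exits with 0 on success, 1 if there were warnings, 2 on errors
    return result.returncode < 2
```

A call such as indent_html("ugly.html", "nice.html") corresponds to the shell command above.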

Categories: html

installing EasyPHP

June 27, 2016

Problem
I wanted to contribute to a PHP project but under Manjaro I couldn’t test it locally. I got a “Fatal error: Call to undefined function curl_init()” message that I didn’t manage to resolve, though the line “extension=curl.so” was present in my php.ini :(

Solution
After a few hours of trial and error, I decided to develop this project under Windows. I have Windows in VirtualBox, I put the PHP project in a shared folder, so my idea was to edit the source under Linux and visualize the result under Windows.

I chose EasyPHP and installed the latest EasyPHP Devserver that provides a complete development environment.

When I opened the project I got the same error since curl was not enabled by default. On the dashboard I edited the php.ini file and uncommented the line “extension=php_curl.dll”. However, after restarting the webserver I got another error: libssh2 is missing. I found the solution here (ken’s comment):

"I had to also also copy libssh2.dll into my Apache24 folder 
for this to work with my PHP 5.6.2 installation. So altogether 
I had to do the following:

Move to Windows\system32 folder:
libssh2.dll, php_curl.dll, ssleay32.dll, libeay32.dll

Move to Apache24\bin folder
libssh2.dll

Uncomment extension=php_curl.dll"

I found all these files in the install folder of EasyPHP. There are two versions of each, one in a “…vc11…” folder and the other in a “…vc14…” folder. I worked under PHP 5.6, so I copied the vc11 version of each file mentioned above.

After this Apache restarted without any error.

Categories: php, windows

limit the CPU usage of Firefox / Dropbox / etc.

June 24, 2016

Problem
I have an older dual core laptop where Firefox sometimes uses 120%-130% CPU and slows down the machine completely. Restarting Firefox solves the problem for a few minutes but then again, it eats up the CPU. What to do?

Solution
I don’t have many tabs open but I still have this problem. I also uninstalled the Flash plugin but it didn’t solve the problem.

However, I found a nice tool called cpulimit:

“Cpulimit is a tool which limits the CPU usage of a process (expressed in percentage, not in CPU time). It is useful to control batch jobs, when you don’t want them to eat too many CPU cycles. The goal is prevent a process from running for more than a specified time ratio. It does not change the nice value or other scheduling priority settings, but the real CPU usage. Also, it is able to adapt itself to the overall system load, dynamically and quickly. The control of the used CPU amount is done sending SIGSTOP and SIGCONT POSIX signals to processes. All the children processes and threads of the specified process will share the same percentage of CPU.” (from the README of the project)

The following setting worked for me:

$ cpulimit -l 80 firefox

Firefox uses several threads but, as mentioned in the documentation, they will share the same percentage of CPU.

The CPU usage may jump higher than the specified value, but cpulimit will push it back in a few seconds.

My old laptop has become useable again :)
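To get a feeling for the SIGSTOP / SIGCONT mechanism mentioned in the README, here is a deliberately simplified Python sketch. It is not how cpulimit itself is implemented: the real tool measures the actual CPU usage and adapts to it, while this sketch just lets the process run for a fixed share of every time slice. The names duty_cycle and throttle are invented for the example, and it only works on POSIX systems.

```python
import os
import signal
import time


def duty_cycle(limit_percent, period):
    """Split one period into (seconds running, seconds stopped)."""
    fraction = min(max(limit_percent / 100.0, 0.0), 1.0)
    run_time = period * fraction
    return run_time, period - run_time


def throttle(pid, limit_percent, period=0.1):
    """Alternately resume and suspend a process until it exits (or Ctrl+C)."""
    run_time, stop_time = duty_cycle(limit_percent, period)
    try:
        while True:
            os.kill(pid, signal.SIGCONT)   # let the process run ...
            time.sleep(run_time)
            os.kill(pid, signal.SIGSTOP)   # ... then freeze it for the rest
            time.sleep(stop_time)
    except ProcessLookupError:
        pass  # the target process has exited
    except KeyboardInterrupt:
        try:
            os.kill(pid, signal.SIGCONT)   # never leave the process frozen
        except ProcessLookupError:
            pass
```

With a limit of 80 and a period of 0.1 seconds, the process would be allowed to run for about 0.08 seconds of every slice and be stopped for the remaining 0.02.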

Update (with Dropbox)
I noticed that Dropbox also loves my CPU. Here is how I could limit this greedy beast. Originally, I started “$HOME/.dropbox-dist/dropboxd” automatically at each startup. Create the file “$HOME/bin/cpulimit_dropboxd.sh” with the following content:

#!/usr/bin/env bash

cpulimit -l 50 "$HOME/.dropbox-dist/dropboxd"

Make it executable (chmod u+x cpulimit_dropboxd.sh) and call this script (cpulimit_dropboxd.sh) when your system comes up. Here I give Dropbox 50% CPU, but you can play with that value.

Categories: bash, firefox

Linux: install Windows fonts

June 24, 2016

Problem
I wanted to create a meme image manually in Gimp but I didn’t have the required font (Impact). What now?

Solution
The remedy is here: https://wiki.archlinux.org/index.php/Microsoft_fonts. I have a dual-boot machine with Windows and Linux, so I decided to create a symbolic link to the Windows Fonts folder (I didn’t want to copy 550+ MB of fonts).

# ln -s /mnt/Windows/Fonts /usr/share/fonts/WindowsFonts
# fc-cache

The prompt “#” means a root shell. Restart Gimp and the Windows fonts will be available.

Categories: gimp, windows