Translate a PDF file
Problem
Today, at Ubuntu Life, I found a post (in English) with a 60-page PDF manual that contains Ubuntu tips. I downloaded it naively, but it turned out to be in Spanish! :)) Great! I don’t speak Spanish… What to do? How do you translate a PDF file?
Solution
Well, my solution is not an elegant one. I would say that it falls into the category “better than nothing”.
Steps:
- Get the PDF.
- Extract the text from it with “pdftotext file.pdf file.txt”.
- Make an HTML file out of file.txt: rename it to file.html, then add <pre> as the first line and </pre> as the last line.
- Upload the file to a public place. Easiest way: put it in your Dropbox/Public folder.
- Translate the public HTML file with Google Translate.
- If Google Translate doesn’t translate the whole text, then it’s too long. Cut it into several pieces.
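The wrapping and splitting steps above can also be scripted. The following is only a sketch: the function names and the chunk size are made up for illustration, and the size at which Google Translate stops translating is not known here, so you would have to experiment. It also escapes the characters &, < and >, which would otherwise be interpreted as HTML inside the <pre> block.

```javascript
// Hypothetical helpers; names and the chunk size are illustrative assumptions.

// Escape characters that would otherwise be parsed as HTML inside <pre>.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Split the text into pieces of roughly maxChars characters, breaking only
// at line boundaries. A single line longer than maxChars is kept whole.
function splitIntoChunks(text, maxChars) {
  const chunks = [];
  let current = '';
  for (const line of text.split('\n')) {
    // Start a new chunk when adding this line would exceed the limit.
    if (current !== '' && current.length + line.length + 1 > maxChars) {
      chunks.push(current);
      current = '';
    }
    current += line + '\n';
  }
  if (current !== '') chunks.push(current);
  return chunks;
}

// Wrap every chunk in <pre>...</pre> so that line breaks survive in the browser.
function toHtmlPages(text, maxChars) {
  return splitIntoChunks(text, maxChars).map(
    (chunk) => '<pre>\n' + escapeHtml(chunk) + '</pre>\n'
  );
}
```

In Node.js you could read file.txt with fs.readFileSync, pass its contents to toHtmlPages, and write each returned string to its own HTML file before uploading.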
The end result is an ugly text file, but at least you can have an idea what it’s about.
If you have a more sophisticated solution, don’t hold it back.
Patch to Google Translate https pages
Problem
When you use Google Translate on a page that is served over HTTPS, you get an error. Translation only works with HTTP addresses. Here is how to reproduce the problem:
- We will use https://ubuntulife.wordpress.com/ as our test subject (Spanish site).
- Visit Google Translate.
- Paste “https://ubuntulife.wordpress.com” (without quotes) in the text area.
- In the dropdown list next to “From:”, select “Detect language”.
- Click on the Translate button.
You will get a beautiful “Sorry, this URL is invalid” error.
Explanation:
This is a known bug on Google’s side, but they are in no rush to fix it. In the address bar you have the string “http://translate.google.com/translate?js=n&prev=_t&hl=en&ie=UTF-8&layout=2&eotf=1&sl=auto&tl=en&u=https%3A%2F%2Fubuntulife.wordpress.com%2F”. Notice that the address to be translated (the value of the u parameter) begins with https. In the upper frame, Google Translate transforms this address to “http://ubuntulife.wordpress.com:443/”, i.e. the scheme becomes http and port 443 is added automatically.
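To see what is actually packed into that long address, you can decode it with the standard URL API. This is just a small illustration. Reading sl and tl as the source and target languages is my interpretation of the parameter names, not something Google documents in the text quoted here.

```javascript
// The address from the explanation above, as it appears in the address bar.
const translateUrl =
  'http://translate.google.com/translate?js=n&prev=_t&hl=en&ie=UTF-8' +
  '&layout=2&eotf=1&sl=auto&tl=en&u=https%3A%2F%2Fubuntulife.wordpress.com%2F';

// searchParams decodes the percent-encoding (%3A is ":", %2F is "/").
const params = new URL(translateUrl).searchParams;

const target = params.get('u');  // the page to be translated
const from = params.get('sl');   // presumably the source language ("auto" = detect)
const to = params.get('tl');     // presumably the target language

console.log(target, from, to);
```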
Update (20110211): Google has a clarification. In this thread, Google guy Josh says the following: “Currently our webpage translation service will not translate secure https pages. This is because such pages often have secure content, that you wouldn’t want to send over the wire plain text to Google Translate.” That is, they know about this problem but they don’t have a good solution.
Manual workaround #1:
Remove the port “:443” from the Google Translate textfield and press the button Translate.
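If you prefer to see this edit as code, here is a minimal sketch. The helper name stripHttpsPort is made up for this example. It removes an explicit “:443” only when it directly follows the host name of an http:// address.

```javascript
// Hypothetical helper: drop an explicit ":443" port from an http:// address.
function stripHttpsPort(address) {
  // Match "http://host" followed by ":443" and then either "/" or the end
  // of the string, so that ports such as ":4430" are left alone.
  return address.replace(/^(http:\/\/[^\/:]+):443(?=\/|$)/, '$1');
}

console.log(stripHttpsPort('http://ubuntulife.wordpress.com:443/'));
```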
Manual workaround #2:
In the address bar, locate the string https and change it to http. Then press Enter to reload the page. The port “443” will also disappear.
Workaround for the lazy ones
Let’s use a simple bookmarklet that implements the manual method #2:
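javascript:(function () {
  /* Block comments only: a bookmarklet is stored on a single line, */
  /* where a line comment would swallow the rest of the code. */
  var l = document.location;
  var https = encodeURIComponent('https://'); /* https%3A%2F%2F */
  var http = encodeURIComponent('http://');   /* http%3A%2F%2F */
  /* Replace the first encoded https:// (the page to translate) and reload. */
  l.href = l.href.replace(https, http);
})();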
Installation:
Open this page in a new tab. There, you will find a link “ReTr” (re-translate). Drag and drop that link to the bookmark bar.
Usage:
Once the ReTr (re-translate) bookmarklet is installed, try to translate an HTTPS page. When you get the error message, just click on the ReTr bookmarklet and the page should be translated correctly.
Translate with Flagfox
Flagfox is a Firefox add-on that can “display a country flag depicting the location of the current website’s server and provides a multitude of tools such as site safety checks, whois, translation, similar sites, validation, URL shortening, and more…”
How to translate a page with Flagfox: right-click on the country flag in the address bar -> select Google Translate. With HTTPS pages you will get an error message, but you can fix it with the “ReTr” bookmarklet described above.