Edit your PDF files

March 6, 2011

Problem #1

You have a PDF file and you want to remove some pages from it. You want to do this with a GUI tool.

Solution #1

You can use PDF Mod for this task.

“PDF Mod is a simple application for modifying PDF documents. You can reorder, rotate, and remove pages, export images from a document, edit the title, subject, author, and keywords, and combine documents via drag and drop.” (source)

It is available in the repositories, but if you want the latest version, install it from the PPA:

sudo add-apt-repository ppa:pdfmod-team/ppa
sudo apt-get update
sudo apt-get install pdfmod

The PPA tip is from Ubuntu Geek.

Problem #2

You want to modify your PDF files from the command line.

Solution #2

Check out my post on pdftk.

Convert your WordPress blog to a PDF book

March 5, 2011

Problem

You have a WordPress blog (.com or .org), and you would like to convert the whole blog to a PDF book. That is, convert every post to PDF and then join the pieces. The final result should be a single PDF, like a book.

Related work

An easy, simple, and free solution is offered by LJBook. Just upload your exported blog, and it generates a single PDF from it.
However, I had a problem with it. My blog contains lots of source code, and unfortunately those blocks are not handled correctly by LJBook. So I had to find another solution.

My solution

Here is a sample PDF and my whole blog (up to March 6, 2011). With my method, you can generate such an output.

The current version of the script (written in Python) is available here.

Steps to follow:

  • Download the script above and put it in a directory. In this directory, create a subdirectory called “pieces”. The script downloads the HTML files into it and stores the PDF output there as well.
  • Customize the beginning of the script: blog name, username, password, etc.
  • The HTML-to-PDF conversion is done with wkhtmltopdf. Here you will find more info about this tool and how to get it. Download it and store the binary here: /opt/wkhtmltopdf/wkhtmltopdf-i386.
  • Optional: disable the sidebar on your WordPress blog. You probably don’t want to see the sidebar on each page of the PDF book :) Refer to this post to figure out how to hide the sidebar.
  • Now that everything is set, you can launch the script. If all goes well, the script will download each public post on your blog and convert it to PDF. Warning! When you launch the script, it deletes all *.html and *.pdf files in the directory “pieces”!
  • Once you have all the PDFs, enter the directory “pieces” and join the PDFs: “pdftk *.pdf cat output book.pdf”. If you don’t have pdftk, install it (sudo apt-get install pdftk).
  • When ready, don’t forget to re-enable the sidebar on your blog.
  • You might want to edit the final PDF. It will almost certainly contain some empty pages; you can remove them with a PDF editor.
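
The joining step in the list above can also be scripted. What follows is only a minimal sketch, not part of the original script: the function names build_join_command and join_pieces are made up for illustration, and it assumes that pdftk is on your PATH and that the alphabetical order of the file names in “pieces” is the order you want in the book. It also skips an existing book.pdf, so that a second run does not merge the old book into the new one.

```python
import glob
import os
import subprocess


def build_join_command(pieces_dir, output_name="book.pdf"):
    """Return the pdftk argument list that joins every PDF in pieces_dir.

    Hypothetical helper: the name and the default output file are assumptions.
    """
    output_path = os.path.join(pieces_dir, output_name)
    pdfs = sorted(glob.glob(os.path.join(pieces_dir, "*.pdf")))
    # Leave out a previous output file so it is not merged into itself.
    pdfs = [p for p in pdfs if os.path.abspath(p) != os.path.abspath(output_path)]
    if not pdfs:
        raise ValueError("no PDF files found in %r" % pieces_dir)
    return ["pdftk"] + pdfs + ["cat", "output", output_path]


def join_pieces(pieces_dir, output_name="book.pdf"):
    """Run pdftk; raises CalledProcessError if pdftk reports a failure."""
    subprocess.check_call(build_join_command(pieces_dir, output_name))
```

Calling join_pieces("pieces") from the script's directory would then do the same job as the manual pdftk command.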

Python tutorials of Full Circle Magazine in a single PDF

February 21, 2011

Please first read the update information at the end of the post.

Description

Full Circle Magazine (FCM) started a Python tutorial series in issue #27. At the time of writing, the current issue is #45, and the tutorial is still there :)

Problem: it would be nice to extract these tutorials from the issues and put them together in a single PDF. Thus, we would have all the tutorials together in one document.

Download

For the lazy pigs, here is the PDF (6 MB). Get it while it’s hot :)

How to produce the single PDF

For those who are interested, here I explain how to produce the single PDF above.

First, download the issues of FCM. I assume that the required files are named issue27_en.pdf, issue28_en.pdf, …, issue45_en.pdf. Put them in a directory called full-circle.

Here is a CSV file that contains data about which pages to extract from the issues:

# issue; start page; end page
27;7;10
28;7;11
29;7;11
30;7;9
31;8;11
32;8;12
33;8;12
34;8;15
35;10;13
36;7;11
37;7;11
38;7;11
39;7;11
40;8;14
41;8;12
42;8;11
43;7;9
44;7;9
45;7;8

Save this file (download link) as python.csv in the same directory as the PDF files. In that directory, create a subdirectory called pieces. The extracted PDFs will be stored there.

We will use the following Python script to produce the commands that will do the extraction:

#!/usr/bin/env python3

# extract.py
# Print one pdftk command per data line of python.csv (issue;start page;end page).

with open('python.csv', 'r') as f1:
    for line in f1:
        line = line.strip()
        # skip blank lines and the comment header
        if not line or line.startswith('#'):
            continue
        (issue, start_page, end_page) = line.split(';')
        command = "pdftk issue%s_en.pdf cat %s-%s output pieces/%s-python.pdf" % (issue, start_page, end_page, issue)
        print(command)

By executing the script (download link), you will get the following output:

pdftk issue27_en.pdf cat 7-10 output pieces/27-python.pdf
pdftk issue28_en.pdf cat 7-11 output pieces/28-python.pdf
pdftk issue29_en.pdf cat 7-11 output pieces/29-python.pdf
pdftk issue30_en.pdf cat 7-9 output pieces/30-python.pdf
pdftk issue31_en.pdf cat 8-11 output pieces/31-python.pdf
pdftk issue32_en.pdf cat 8-12 output pieces/32-python.pdf
pdftk issue33_en.pdf cat 8-12 output pieces/33-python.pdf
pdftk issue34_en.pdf cat 8-15 output pieces/34-python.pdf
pdftk issue35_en.pdf cat 10-13 output pieces/35-python.pdf
pdftk issue36_en.pdf cat 7-11 output pieces/36-python.pdf
pdftk issue37_en.pdf cat 7-11 output pieces/37-python.pdf
pdftk issue38_en.pdf cat 7-11 output pieces/38-python.pdf
pdftk issue39_en.pdf cat 7-11 output pieces/39-python.pdf
pdftk issue40_en.pdf cat 8-14 output pieces/40-python.pdf
pdftk issue41_en.pdf cat 8-12 output pieces/41-python.pdf
pdftk issue42_en.pdf cat 8-11 output pieces/42-python.pdf
pdftk issue43_en.pdf cat 7-9 output pieces/43-python.pdf
pdftk issue44_en.pdf cat 7-9 output pieces/44-python.pdf
pdftk issue45_en.pdf cat 7-8 output pieces/45-python.pdf

As can be seen, the extraction is done with pdftk (more info here). So far, these commands are only printed to standard output. Here is how to execute them as well:

./extract.py | sh

That is, pass the commands to the shell “sh”, which will execute them line by line.
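
If you would rather not pipe generated text into a shell, the same extraction can be driven from Python directly. The following is just a sketch under a few assumptions: the helper names parse_ranges, build_extract_command and extract_all are invented here, pdftk is on your PATH, and the script is started in the directory that contains python.csv, the issues and the pieces subdirectory. Passing an argument list to subprocess avoids any shell quoting issues.

```python
import subprocess


def parse_ranges(lines):
    """Yield (issue, start_page, end_page) string tuples from CSV lines."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#' comment header.
        if not line or line.startswith("#"):
            continue
        issue, start_page, end_page = line.split(";")
        yield issue, start_page, end_page


def build_extract_command(issue, start_page, end_page):
    """Return the pdftk argument list for one issue (same command as above)."""
    return [
        "pdftk", "issue%s_en.pdf" % issue,
        "cat", "%s-%s" % (start_page, end_page),
        "output", "pieces/%s-python.pdf" % issue,
    ]


def extract_all(csv_path="python.csv"):
    """Run pdftk once per data line; stops at the first failing command."""
    with open(csv_path) as f:
        for row in parse_ranges(f):
            subprocess.check_call(build_extract_command(*row))
```

One difference from the “| sh” approach: check_call raises an exception as soon as one pdftk call fails, whereas the shell would simply carry on with the next line.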

Okay, now we have the pieces in the directory “pieces”. Enter the directory “pieces” and join the PDFs:

pdftk *.pdf cat output all.pdf
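
A note on ordering: the shell expands *.pdf in alphabetical order. That works here because all issue numbers have two digits, but a name such as 100-python.pdf would sort before 27-python.pdf. Here is a small sketch of sorting the pieces by issue number instead; the function names issue_number and sorted_pieces are hypothetical, and the file name pattern is the one produced by extract.py.

```python
import re


def issue_number(filename):
    """Return the leading issue number of a name like '27-python.pdf'."""
    match = re.match(r"(\d+)-python\.pdf$", filename)
    if match is None:
        raise ValueError("unexpected file name: %r" % filename)
    return int(match.group(1))


def sorted_pieces(filenames):
    """Sort piece file names numerically by issue, not alphabetically."""
    return sorted(filenames, key=issue_number)
```

The sorted names can then be passed to pdftk explicitly, in front of “cat output all.pdf”, instead of relying on the wildcard.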

Known issue

To tell the truth, this method produces a huge single PDF. The extracted pieces are already very big (5 to 10 MB each), and the final PDF is about 130 MB! So I actually used Adobe Acrobat 8 Professional to merge the pieces with the conversion setting “Smaller File Size”. Acrobat Pro optimized the files and produced a 6 MB file. If you know how to get a similar result with open source tools, let me know.
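
One open source candidate that may be worth trying is Ghostscript, whose pdfwrite device can re-encode and downsample the images in a PDF. This is an untested suggestion and not the method used for the file above: no resulting size is claimed, and the function name build_shrink_command is made up for illustration. The sketch assumes the gs binary is on your PATH; /ebook is one of Ghostscript's quality presets (others include /screen and /printer).

```python
import subprocess


def build_shrink_command(input_pdf, output_pdf, preset="/ebook"):
    """Return a Ghostscript argument list that rewrites input_pdf.

    Hypothetical helper; how much smaller the output gets depends on the input.
    """
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=%s" % preset,
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        "-sOutputFile=%s" % output_pdf,
        input_pdf,
    ]


def shrink(input_pdf, output_pdf, preset="/ebook"):
    """Run Ghostscript; raises CalledProcessError if gs fails."""
    subprocess.check_call(build_shrink_command(input_pdf, output_pdf, preset))
```

For example, shrink("all.pdf", "all-small.pdf") would write a re-encoded copy next to the original, so you can compare size and image quality before deciding which one to keep.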

Update (20110305):

It seems FCM came up with a similar idea: http://fullcirclemagazine.org/python-special-edition-1/. They collected the first 8 parts of the already published Python tutorials in a special edition.

Update (20110329):

I pushed this project to GitHub, see https://github.com/jabbalaci/Full-Circle-Magazine-Series. I added some changes but I won’t rewrite this post each time. For the latest version, please refer to GitHub.

Manipulate your PDFs with pdftk

February 21, 2011

“If PDF is electronic paper, then pdftk is an electronic staple-remover, hole-punch, binder, secret-decoder-ring, and X-Ray-glasses. Pdftk is a simple tool for doing everyday things with PDF documents.” (source)

Pdftk is a great command-line tool to manipulate your PDF files.

Installation:

sudo apt-get install pdftk

Examples:
The following examples are taken from here.

Merge Two or More PDFs into a New Document

pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf

or (Using Handles):

pdftk A=1.pdf B=2.pdf cat A B output 12.pdf

or (Using Wildcards):

pdftk *.pdf cat output combined.pdf

Split Select Pages from Multiple PDFs into a New Document

pdftk A=one.pdf B=two.pdf cat A1-7 B1-5 A8 output combined.pdf

Encrypt a PDF using 128-Bit Strength (the Default) and Withhold All Permissions (the Default)

pdftk mydoc.pdf output mydoc.128.pdf owner_pw foopass

Same as Above, Except a Password is Required to Open the PDF

pdftk mydoc.pdf output mydoc.128.pdf owner_pw foo user_pw baz

Same as Above, Except Printing is Allowed (after the PDF is Open)

pdftk mydoc.pdf output mydoc.128.pdf owner_pw foo user_pw baz allow printing

Decrypt a PDF

pdftk secured.pdf input_pw foopass output unsecured.pdf

Join Two Files, One of Which is Encrypted (the Output is Not Encrypted)

pdftk A=secured.pdf mydoc.pdf input_pw A=foopass cat output combined.pdf

Uncompress PDF Page Streams for Editing the PDF Code in a Text Editor

pdftk mydoc.pdf output mydoc.clear.pdf uncompress

Repair a PDF’s Corrupted XREF Table and Stream Lengths (If Possible)

pdftk broken.pdf output fixed.pdf

Burst a Single PDF Document into Single Pages and Report its Data to doc_data.txt

pdftk mydoc.pdf burst

Report on PDF Document Metadata, Bookmarks and Page Labels

pdftk mydoc.pdf dump_data output report.txt
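
The report written by dump_data is plain text with one “Key: value” entry per line, so it is easy to post-process. As a small sketch, here is how the page count could be read from it. The function name parse_page_count is hypothetical, and the code assumes the report contains a line starting with “NumberOfPages:”; check the report produced by your pdftk version.

```python
def parse_page_count(report_text):
    """Return the page count found in a dump_data report, or None."""
    for line in report_text.splitlines():
        if line.startswith("NumberOfPages:"):
            # Everything after the first colon is the number.
            return int(line.split(":", 1)[1])
    return None
```

With the command above, you would read report.txt into a string and pass it to parse_page_count.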

Notes:
“pdftk uses the iText Java library (http://itextpdf.sourceforge.net/) to read and write PDF. The author compiled this Java library using GCJ (http://gcc.gnu.org) so it could be linked with a front end written in C++.” (from the man page)

Update (20110611)
You can also “explode” a PDF, i.e. split it into a set of individual pages:

pdftk file.pdf burst
