Edit your PDF files
Problem #1
You have a PDF file and want to remove some pages from it, using a graphical tool rather than the command line.
Solution #1
You can use PDF Mod for this task.
“PDF Mod is a simple application for modifying PDF documents. You can reorder, rotate, and remove pages, export images from a document, edit the title, subject, author, and keywords, and combine documents via drag and drop.” (source)
It is available in the repositories, but if you want the latest version, install it from the PPA:
sudo add-apt-repository ppa:pdfmod-team/ppa
sudo apt-get update
sudo apt-get install pdfmod
The PPA tip is from Ubuntu Geek.
Problem #2
You want to modify your PDF files from the command line.
Solution #2
Check out my post on pdftk.
Related links
- PyPdf module (Python)
Python tutorials of Full Circle Magazine in a single PDF
Please first read the update information at the end of the post.
Description
Full Circle Magazine (FCM) started a Python tutorial series in issue #27. At the time of writing, the current issue is #45, and the tutorial is still there :)
Problem: it would be nice to extract these tutorials from the issues and put them together in a single PDF. Thus, we would have all the tutorials together in one document.
Download
For the lazy pigs, here is the PDF (6 MB). Get it while it’s hot :)
How to produce the single PDF
For those who are interested, here I explain how to produce the single PDF above.
First, download the issues of FCM. I assume that the required files are named issue27_en.pdf, issue28_en.pdf, …, issue45_en.pdf. Put them in a directory called full-circle.
Here is a CSV file that contains data about which pages to extract from the issues:
# issue; start page; end page
27;7;10
28;7;11
29;7;11
30;7;9
31;8;11
32;8;12
33;8;12
34;8;15
35;10;13
36;7;11
37;7;11
38;7;11
39;7;11
40;8;14
41;8;12
42;8;11
43;7;9
44;7;9
45;7;8
Put this file (download link) in the same directory as the PDF files. There, create a subdirectory called pieces. The extracted PDFs will be stored in it.
We will use the following Python script to produce the commands that will do the extraction:
#!/usr/bin/env python
# extract.py
# Print one pdftk command for each data row of python.csv.

with open('python.csv', 'r') as f1:
    for line in f1:
        # skip the comment header and any blank lines
        if line.startswith('#') or not line.strip():
            continue
        line = line.rstrip('\n')
        (issue, start_page, end_page) = line.split(';')
        command = "pdftk issue%s_en.pdf cat %s-%s output pieces/%s-python.pdf" % (issue, start_page, end_page, issue)
        print(command)
By executing the script (download link), you will get the following output:
pdftk issue27_en.pdf cat 7-10 output pieces/27-python.pdf
pdftk issue28_en.pdf cat 7-11 output pieces/28-python.pdf
pdftk issue29_en.pdf cat 7-11 output pieces/29-python.pdf
pdftk issue30_en.pdf cat 7-9 output pieces/30-python.pdf
pdftk issue31_en.pdf cat 8-11 output pieces/31-python.pdf
pdftk issue32_en.pdf cat 8-12 output pieces/32-python.pdf
pdftk issue33_en.pdf cat 8-12 output pieces/33-python.pdf
pdftk issue34_en.pdf cat 8-15 output pieces/34-python.pdf
pdftk issue35_en.pdf cat 10-13 output pieces/35-python.pdf
pdftk issue36_en.pdf cat 7-11 output pieces/36-python.pdf
pdftk issue37_en.pdf cat 7-11 output pieces/37-python.pdf
pdftk issue38_en.pdf cat 7-11 output pieces/38-python.pdf
pdftk issue39_en.pdf cat 7-11 output pieces/39-python.pdf
pdftk issue40_en.pdf cat 8-14 output pieces/40-python.pdf
pdftk issue41_en.pdf cat 8-12 output pieces/41-python.pdf
pdftk issue42_en.pdf cat 8-11 output pieces/42-python.pdf
pdftk issue43_en.pdf cat 7-9 output pieces/43-python.pdf
pdftk issue44_en.pdf cat 7-9 output pieces/44-python.pdf
pdftk issue45_en.pdf cat 7-8 output pieces/45-python.pdf
As can be seen, the extraction will be done with pdftk (more info here). So far, these commands are only printed to the standard output. Here is how to execute them too:
./extract.py | sh
That is, pass the commands to the shell “sh”, which will execute them line by line.
Okay, now we have the pieces in the directory “pieces”. Enter the directory “pieces” and join the PDFs:
pdftk *.pdf cat output all.pdf
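As an alternative to piping printed commands into sh, the same two steps can be sketched in Python so that each command is built as an argument list. This is only a sketch, not the script used for this post: the function names build_extract_commands and build_join_command are made up for illustration, and the file naming follows the assumptions stated above (issueNN_en.pdf as input, pieces/NN-python.pdf as output).

```python
import csv
import io


def build_extract_commands(csv_text):
    """Return one pdftk argument list per data row of the page-range CSV."""
    commands = []
    rows = csv.reader(io.StringIO(csv_text), delimiter=';')
    for row in rows:
        # Skip blank lines and the '#' comment header.
        if not row or row[0].lstrip().startswith('#'):
            continue
        issue, start_page, end_page = (field.strip() for field in row)
        commands.append([
            'pdftk', 'issue%s_en.pdf' % issue,
            'cat', '%s-%s' % (start_page, end_page),
            'output', 'pieces/%s-python.pdf' % issue,
        ])
    return commands


def build_join_command(piece_names, output_name='all.pdf'):
    """Return the pdftk argument list that joins the pieces in sorted order."""
    return ['pdftk'] + sorted(piece_names) + ['cat', 'output', output_name]


if __name__ == '__main__':
    sample = "# issue; start page; end page\n27;7;10\n28;7;11\n"
    for command in build_extract_commands(sample):
        print(' '.join(command))
        # To really run a command (assumes pdftk is installed and the
        # input PDFs exist), import subprocess and call:
        # subprocess.check_call(command)
```

Sorting the piece names gives the same order as the *.pdf wildcard here, because all issue numbers have two digits.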
Known issue
Well, to tell the truth, this method will produce a huge single PDF. The extracted pieces are also very big (5 to 10 MB each), and the final PDF is about 130 MB! So I actually used Adobe Acrobat 8 Professional to merge the pieces with the conversion setting “Smaller File Size”. Acrobat Pro optimized the files and produced a 6 MB file. If you know how to get a similar result with open source tools, let me know.
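One open source candidate worth trying is Ghostscript, whose pdfwrite device can re-encode a PDF with downsampled images. The sketch below only builds the command line. It is not part of the original workflow, it has not been tried on the FCM files, and the resulting size may well differ from what Acrobat produced. The function name build_compress_command is made up for illustration, and the sketch assumes that the gs binary is installed.

```python
def build_compress_command(input_pdf, output_pdf, preset='/ebook'):
    """Return a Ghostscript argument list that rewrites a PDF at lower image quality.

    preset is passed to -dPDFSETTINGS; '/screen' is smaller and lower
    quality, '/ebook' is a middle ground, '/printer' keeps more detail.
    """
    return [
        'gs',
        '-sDEVICE=pdfwrite',
        '-dCompatibilityLevel=1.4',
        '-dPDFSETTINGS=' + preset,
        '-dNOPAUSE', '-dQUIET', '-dBATCH',
        '-sOutputFile=' + output_pdf,
        input_pdf,
    ]


if __name__ == '__main__':
    print(' '.join(build_compress_command('all.pdf', 'all-small.pdf')))
```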
Update (20110305):
It seems FCM came up with a similar idea: http://fullcirclemagazine.org/python-special-edition-1/. They collected the first 8 parts of the already published Python tutorials in a special edition.
Update (20110329):
I pushed this project to GitHub, see https://github.com/jabbalaci/Full-Circle-Magazine-Series. I added some changes but I won’t rewrite this post each time. For the latest version, please refer to GitHub.
Manipulate your PDFs with pdftk
“If PDF is electronic paper, then pdftk is an electronic staple-remover, hole-punch, binder, secret-decoder-ring, and X-Ray-glasses. Pdftk is a simple tool for doing everyday things with PDF documents.” (source)
Pdftk is a great command-line tool to manipulate your PDF files.
Installation:
sudo apt-get install pdftk
Examples:
The following examples are taken from here.
Merge Two or More PDFs into a New Document
pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf
or (Using Handles):
pdftk A=1.pdf B=2.pdf cat A B output 12.pdf
or (Using Wildcards):
pdftk *.pdf cat output combined.pdf
Split Select Pages from Multiple PDFs into a New Document
pdftk A=one.pdf B=two.pdf cat A1-7 B1-5 A8 output combined.pdf
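When the page selection comes from a script, the handle syntax can be generated instead of typed by hand. Here is a small sketch that assigns the handles A, B, … to the input files in order of first appearance. The helper name build_cat_command is hypothetical and not part of pdftk.

```python
from string import ascii_uppercase


def build_cat_command(selections, output_pdf):
    """Build a handle-based 'pdftk ... cat ...' argument list.

    selections is a list of (file name, page range) pairs in output order.
    """
    handles = {}
    for file_name, _ in selections:
        if file_name not in handles:
            if len(handles) >= len(ascii_uppercase):
                raise ValueError('this sketch supports at most 26 input files')
            # The next unused letter becomes this file's handle.
            handles[file_name] = ascii_uppercase[len(handles)]
    inputs = ['%s=%s' % (handle, name) for name, handle in handles.items()]
    pages = ['%s%s' % (handles[name], page_range) for name, page_range in selections]
    return ['pdftk'] + inputs + ['cat'] + pages + ['output', output_pdf]


if __name__ == '__main__':
    selections = [('one.pdf', '1-7'), ('two.pdf', '1-5'), ('one.pdf', '8')]
    print(' '.join(build_cat_command(selections, 'combined.pdf')))
```

With the selections shown, the sketch rebuilds the command from the example above.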
Encrypt a PDF using 128-Bit Strength (the Default) and Withhold All Permissions (the Default)
pdftk mydoc.pdf output mydoc.128.pdf owner_pw foopass
Same as Above, Except a Password is Required to Open the PDF
pdftk mydoc.pdf output mydoc.128.pdf owner_pw foo user_pw baz
Same as Above, Except Printing is Allowed (after the PDF is Open)
pdftk mydoc.pdf output mydoc.128.pdf owner_pw foo user_pw baz allow printing
Decrypt a PDF
pdftk secured.pdf input_pw foopass output unsecured.pdf
Join Two Files, One of Which is Encrypted (the Output is Not Encrypted)
pdftk A=secured.pdf mydoc.pdf input_pw A=foopass cat output combined.pdf
Uncompress PDF Page Streams for Editing the PDF Code in a Text Editor
pdftk mydoc.pdf output mydoc.clear.pdf uncompress
Repair a PDF’s Corrupted XREF Table and Stream Lengths (If Possible)
pdftk broken.pdf output fixed.pdf
Burst a Single PDF Document into Single Pages and Report its Data to doc_data.txt
pdftk mydoc.pdf burst
Report on PDF Document Metadata, Bookmarks and Page Labels
pdftk mydoc.pdf dump_data output report.txt
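The report written by dump_data is plain text, so it is easy to read back in a script. The sketch below assumes the usual layout of InfoKey / InfoValue line pairs plus a NumberOfPages line; the exact set of fields can differ between pdftk versions, and the sample text is invented for illustration. The function name parse_dump_data is hypothetical.

```python
def parse_dump_data(report_text):
    """Return (info dict, page count) from the text of a dump_data report."""
    info = {}
    page_count = None
    pending_key = None
    for line in report_text.splitlines():
        key, separator, value = line.partition(': ')
        if not separator:
            # Lines without "key: value" form are ignored in this sketch.
            continue
        if key == 'InfoKey':
            pending_key = value
        elif key == 'InfoValue' and pending_key is not None:
            info[pending_key] = value
            pending_key = None
        elif key == 'NumberOfPages':
            page_count = int(value)
    return info, page_count


if __name__ == '__main__':
    # Invented sample, not the output of a real document.
    sample = (
        "InfoKey: Title\n"
        "InfoValue: My Document\n"
        "InfoKey: Author\n"
        "InfoValue: Jane Doe\n"
        "NumberOfPages: 12\n"
    )
    print(parse_dump_data(sample))
```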
Notes:
“pdftk uses the iText Java library (http://itextpdf.sourceforge.net/) to read and write PDF. The author compiled this Java library using GCJ (http://gcc.gnu.org) so it could be linked with a front end written in C++.” (from the man page)
Update (20110611)
You can also “explode” a PDF, i.e. split it into a set of individual pages:
pdftk file.pdf burst
Related links
- PyPdf module (Python)
- Couturier (A small graphical utility for merging multiple PDF documents and images into one document.)
- Manipulating PDFs with the PDF Toolkit (some other tricks: add attachments, fill out forms, etc.)
