Synapseal Arclights

Random neuronal meanderings

Convert html page to a PDF document (with images)
9th June 2011
 

Today we'll be going over how to convert an HTML page to a PDF document. Pictures linked in the HTML page will be included [1].
Yes, you can "print to PDF" from a lot of programs, but that method can include artifacts you didn't want. The following procedure will produce a PDF that is virtually identical in content to the HTML page it was spawned from.
This exercise requires GhostScript and html2ps [2]. Windows users will also need Cygwin for the bash shell, plus Perl if they want the "rename" command to work.
GhostScript is pretty standard on Linux, and is available for OS X and Windows. html2ps will probably need to be installed via your package manager if you run Debian or Ubuntu (or use Debian/Ubuntu repos), or compiled from the source code on OS X and Windows (Windows users will need Cygwin installed to do this).
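
Before going further, it can help to check that the tools are actually on your PATH. Here's a rough sketch of one way to do that. The function name missing_tools is made up for this example, and the command names are assumptions: gs is the usual GhostScript binary on Linux and OS X, but it may be called something else on Windows, and the Perl rename may be installed under a different name on your system.

```shell
#!/bin/bash
# Report which of the named commands are not on the PATH.
# Prints one "missing: NAME" line per absent command and
# returns non-zero if at least one is absent.
# (missing_tools is a hypothetical helper, not part of any package.)
missing_tools() {
    local status=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Tool names assumed from this post; adjust them for your system.
missing_tools html2ps ps2pdf gs rename || echo "Install the tools listed above first."
```

If nothing is printed, everything the script needs was found.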

This script is really simple. It takes a single HTML document [3] as input, turns it into a PostScript document, then turns the PostScript document into a PDF. If you want to turn a bunch of HTML pages into one PDF, you can modify this script from t'other day.

Here's the really quick-n-dirty script. Copy and paste it into a text document (but don't use .txt on the end):
#!/bin/bash
html2ps "$1" > "$1".ps &&
ps2pdf "$1".ps &&
rename -v 's/\.html\.pdf$/.pdf/' *.pdf &&
rename -v 's/\.htm\.pdf$/.pdf/' *.pdf

You'll need to tell the system (Linux and OS X) that the script is executable. To do so, in a terminal run
chmod +x scriptname
Then run it with the HTML file as its only argument, for example
./scriptname 01-blahblah.html


Line 1 tells the system it's a bash shell script.
Line 2 tells bash to run the html2ps command on whatever HTML file you pointed it at and save the output as whatever-its-name-is.ps.
Line 3 tells bash to call ps2pdf and convert the PostScript file to a PDF.
Lines 4 and 5 tell bash to call the rename command and change the converted PDF filename from 01-blahblah.html.pdf to 01-blahblah.pdf (this step is aesthetic, and won't affect how the PDF actually works).
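
If you'd rather not depend on Perl's rename at all, here's a variant sketch that works out the PDF name up front with bash parameter expansion and hands it to ps2pdf as the output file. It's a different take on the same steps, not the script above: the function names pdf_name and html_to_pdf are invented for this example, and it assumes mktemp is available for the intermediate PostScript file.

```shell
#!/bin/bash
# Work out the PDF name: strip a trailing .html or .htm, then add .pdf.
# (pdf_name and html_to_pdf are hypothetical helper names.)
pdf_name() {
    local base="${1%.html}"
    base="${base%.htm}"
    printf '%s.pdf\n' "$base"
}

# Convert one HTML file: HTML -> PostScript -> PDF.
html_to_pdf() {
    local html="$1" ps pdf
    ps="$(mktemp)" || return 1        # temporary PostScript file
    pdf="$(pdf_name "$html")"
    html2ps "$html" > "$ps" && ps2pdf "$ps" "$pdf"
    local status=$?                   # remember whether the conversion worked
    rm -f "$ps"                       # clean up the intermediate file
    return "$status"
}

# Only convert when a file name was given on the command line.
if [ "$#" -gt 0 ]; then
    html_to_pdf "$1"
fi
```

Because the output name is passed to ps2pdf directly, there is never a 01-blahblah.html.pdf to clean up, and other .ps or .pdf files in the directory are left alone.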


I wrote this script because I think the resulting PDF is much more professional in appearance (no "page numbers" or "file:///blah/blah/blah.html" appearing mysteriously).

[1] If you want images in your PDF, they'll need to be linked properly. That is a howto for another day.
[2] I'm not sure where the home page is, but you can also download the source from Debian.
[3] If your original HTML document is not within specifications, either html2ps or ps2pdf can take a dump on you. Please validate your HTML for best results.
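
For the validation mentioned in [3], one option is HTML Tidy. That's my assumption, not something this post depends on, and any validator will do. This sketch wraps it in a made-up function called check_html, which quietly skips the check if tidy isn't installed.

```shell
#!/bin/bash
# Check an HTML file with HTML Tidy before converting it.
# (check_html is a hypothetical helper; tidy is an assumed extra tool.)
check_html() {
    if ! command -v tidy >/dev/null 2>&1; then
        echo "tidy not found; skipping validation" >&2
        return 0
    fi
    # -q keeps tidy quiet, -e prints only errors and warnings.
    tidy -q -e "$1"
    # tidy exits 0 when clean, 1 on warnings, 2 on errors;
    # treat warnings as acceptable and errors as failure.
    [ "$?" -lt 2 ]
}
```

You could then run something like check_html 01-blahblah.html before the conversion and fix whatever it reports.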






Tags: technology.
