A quick study of PDF optimization

August 17th, 2023. Tagged: performance, tools

A couple of days ago I found out about a tool called pdfcpu, a PDF processor. Among its features I saw "optimize", so I had to take it out for a spin and see how much of an optimization we're talking about. Here's a quick study of optimizing a random-ish sample of PDF files.

Source data

I figured the easiest collection of samples would be to just use all the PDFs I happen to have on my computer. Not the most random sample, but certainly the most immediate.

I ran (on a Mac laptop):

find / -type f -name "*.pdf" > pdf_paths.txt

... and voila, a nice collection of PDF file paths (sample).

Next step: deduplication, because it's likely that I have the same file(s) in multiple locations. So I added MD5 hashes of the file contents to pdf_paths.txt (sample), then sorted and deduped (sample). The scripts I used are on GitHub for your own experimentation.
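
In case you want to roll your own, here's a minimal sketch of that dedup step (not the exact scripts from the repo; md5 -q is the Mac flavor of the hashing command and the file names are made up):

# prefix each path with an MD5 hash of the file's contents
while IFS= read -r f; do
  echo "$(md5 -q "$f") $f"
done < pdf_paths.txt > pdf_hashes.txt

# keep only one path per unique hash
sort -k1,1 -u pdf_hashes.txt > pdf_unique.txt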

After dedup, it turns out I have 3853 unique PDFs, close to 5GB worth, lying about on my hard drive. That's quite a bit more than I would've guessed, but a decent sample size for experimentation.

Optimize

Again, the actual bash script I used is on GitHub, but in general the command is:

./pdfcpu optimize input.pdf output.pdf

As part of the optimization script, I also record the file size before and after the optimization. The stats are here as CSV and as a Mac Numbers file.
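
The gist of that script looks something like the following sketch, assuming the deduped "hash path" list from the previous step (the output directory and file names here are made up; the real thing is in the repo):

mkdir -p out
while read -r hash f; do
  before=$(stat -f%z "$f")                  # file size in bytes, Mac's stat syntax
  ./pdfcpu optimize "$f" "out/$hash.pdf"    # write the optimized copy under a hash-based name
  after=$(stat -f%z "out/$hash.pdf" 2>/dev/null || echo 0)
  echo "\"$f\",$before,$after" >> stats.csv
done < pdf_unique.txt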

Results

Of the total 3853 input files, 140 of the "after" files were 0 bytes, so I assume pdfcpu choked on these. We ignore them.

82 files came out bigger, so pdfcpu actually de-optimized those. Ignore these too.

Of the rest, the median savings were 4.7% (9.4% average savings).

Among the outliers, the file with the largest file size reduction, 98.64% (from 167K down to 2K), was some sort of resource in Mac OS's Numbers app. Here it is as a 187-byte PNG. Cute, innit?

(Image: diamond bullet icon)

Parting words

So... pdfcpu optimize, yea or nay? Well, the average/median savings were not bad. How that affects web performance depends on how many PDFs you serve. The average site certainly serves way more images than PDFs, so time is better spent optimizing those. For comparison, if you simply run MozJPEG's jpegtran with -copy none, you can expect over 11% median savings (2022 study).

But there are all kinds of use cases out in the wild. If a non-trivial quantity of PDF serving happens on your site, I'd say look into pdfcpu.

And in any optimization script you write, do check for 0-byte results, because they happen (3.6% of the time in my test), and for cases where pdfcpu writes a larger file than the original (2.1% in my case), and keep the original in both situations, as in the sketch below.
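
Something like this (a sketch, not battle-tested; $in and $out are whatever your script uses for the input and output paths) covers both cases:

./pdfcpu optimize "$in" "$out"
before=$(stat -f%z "$in")
after=$(stat -f%z "$out" 2>/dev/null || echo 0)
if [ "$after" -eq 0 ] || [ "$after" -ge "$before" ]; then
  cp "$in" "$out"    # optimization failed or made things worse, keep the original
fi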

Are there any other PDF optimizers worth checking out? Please let me know.
