The Geomblog: Anonymizing PDF

Saturday, March 17, 2007

Anonymizing PDF

File this under 'late-night LaTeX griping':

Is there any way of stripping metadata from a PDF file? I'm writing a referee report for a journal, and used PDFLaTeX to create the report. When I inspect it in acroread, there's all kinds of metadata that could identify me.

Now pdftk is a useful package that can strip out some of the simple metadata like 'creator'. However, pdftex adds "Advanced" fields, and one of them is the full pathname of the original LaTeX file. If your filesystem (UNIX) is anything like mine, then a part of that pathname is the /<username>/ section, which in many instances is an almost unique identifier. This also happens with dvipdfm, which uses the gs backend to create the PDF file, and with ps2pdf. pdftk cannot strip out these fields, because it doesn't appear to see them.

I suspect that if I owned a copy of the very-not-free Acrobat, I could meddle around with this metadata. Obviously I could submit the review as a PostScript file, but in general I prefer to stick with PDF. Further, this problem also occurs if I want to do due diligence when submitting to conferences with double-blind review, and sometimes I don't have the option to use PS.

9 comments:

Anonymous, 3/17/2007 06:02:00 AM
Can't you use some online file conversion service to convert the PS to a PDF?
Anonymous, 3/17/2007 06:05:00 AM
On my system, pdflatex does not seem to include the full path name of the source file in the PDF file.

In Adobe Reader, the information in "File / Document properties / Description / Advanced / Location" seems to show the current location of the PDF file, not the location of the original LaTeX file.

However, I have noticed that if I use \includegraphics{foo.pdf} or similar, pdflatex sometimes adds the full path name of foo.pdf in the resulting PDF file. Adobe Reader does not seem to show this information but you can show it by using the "strings" command in Unix.
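
For readers without `strings` at hand, here is a rough Python sketch of the same check. It is an illustration only, not part of the original comment: the function name `find_path_strings` and the `/home/` and `/Users/` prefixes are assumptions about what a username-bearing path looks like, and it only sees text that is stored uncompressed in the file.

```python
import re

# Runs of at least 4 printable ASCII bytes, roughly what `strings` reports.
_PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

# Assumed shape of a username-bearing path: /home/<user>/ or /Users/<user>/.
_HOME_PATH = re.compile(rb"/(?:home|Users)/[^/\s()]+/")


def find_path_strings(data: bytes) -> list[str]:
    """Return printable runs in `data` that contain a home-directory path."""
    runs = _PRINTABLE_RUN.findall(data)
    return [run.decode("ascii") for run in runs if _HOME_PATH.search(run)]


if __name__ == "__main__":
    import sys

    # Usage: python find_paths.py report.pdf
    with open(sys.argv[1], "rb") as handle:
        for hit in find_path_strings(handle.read()):
            print(hit)
```

An empty result does not prove the file is clean, since paths inside compressed streams will not show up this way.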
Anonymous, 3/17/2007 06:18:00 AM
I also couldn't find any reference to the original source file in the generated PDF.

If you use relative paths for included PDF images, it seems the relative path is stored (e.g., "./foo.pdf" would appear, rather than the full path).

The metadata from included PDF images is also carried into the final PDF, so make sure these images are anonymized as well.
Anonymous, 3/17/2007 02:08:00 PM
"I suspect that if I owned a copy of the very-not-free Acrobat"

Huh?

Suresh, the University of Utah store web site has it listed for faculty as:

Acrobat Professional 8.0 - Single User $43.00

That seems reasonable.
Stephen C North, 3/17/2007 08:13:00 PM
What about

pdf2ps orig.pdf | ps2pdf - scrubbed.pdf
Suresh Venkatasubramanian, 3/17/2007 11:54:00 PM
So I forgot that I can get university discounts for software :). In any case, the Adobe store doesn't appear to work right on Firefox/Linux, and I don't know if Acrobat even comes in a Linux version.

The problem with converters like pdf2ps is that they are unreliable when it comes to dealing with images, although they handle the text just fine (as long as fonts are set up correctly).
Anonymous, 3/19/2007 02:23:00 AM
In cases like this I usually do the last compilation in /tmp ...
Anonymous, 3/19/2007 07:28:00 PM
In your situation, I usually use pdftk to first uncompress the pdf, then edit the uncompressed pdf as I see fit, and then compress it back. (Look for "/PTEX.FileName" in the uncompressed version.)

maverick
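
The manual edit described in the comment above could be scripted along these lines. This is a minimal sketch, not the commenter's actual method: the function name `scrub_ptex_filenames` is made up for illustration, and it assumes the entry appears as a plain literal string, `/PTEX.FileName (...)`, in a PDF that has already been uncompressed (for example with `pdftk in.pdf output plain.pdf uncompress`). The path is overwritten with spaces of the same length so that byte offsets elsewhere in the file do not shift.

```python
import re

# Assumed form of the entry: /PTEX.FileName (<literal string>), where the
# string may contain backslash escapes but no unescaped parentheses.
_PTEX_ENTRY = re.compile(rb"(/PTEX\.FileName\s*\()((?:[^()\\]|\\.)*)(\))")


def scrub_ptex_filenames(data: bytes) -> bytes:
    """Blank out every /PTEX.FileName value, keeping the byte length unchanged."""

    def blank(match: "re.Match[bytes]") -> bytes:
        # Keep the key and the parentheses, replace the path with spaces.
        return match.group(1) + b" " * len(match.group(2)) + match.group(3)

    return _PTEX_ENTRY.sub(blank, data)


if __name__ == "__main__":
    import sys

    # Usage: python scrub.py plain.pdf scrubbed.pdf
    with open(sys.argv[1], "rb") as src:
        cleaned = scrub_ptex_filenames(src.read())
    with open(sys.argv[2], "wb") as dst:
        dst.write(cleaned)
```

The result can then be recompressed with `pdftk scrubbed.pdf output final.pdf compress`. It is worth checking the output again afterwards, since other fields may carry identifying information too.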
Anonymous, 3/20/2007 01:18:00 PM
Very interesting blog, I've learned something new. Greetings, Bruno (Geld)
