Saturday, March 17, 2007

Anonymizing PDF

File this under 'late-night LaTeX griping':

Is there any way of stripping metadata from a PDF file? I'm writing a referee report for a journal, and used PDFLaTeX to create the report. When I inspect it in acroread, there's all kinds of metadata that could identify me.

Now pdftk is a useful package that can strip out some of the simple metadata like 'creator'. However, pdftex adds "Advanced" fields, and one of them is the full pathname of the original LaTeX file. If your filesystem (UNIX) is anything like mine, then a part of that pathname is the /<username>/ section, which in many instances is an almost unique identifier. This also happens with dvipdfm, which uses the gs backend to create the PDF file, and with ps2pdf. pdftk cannot strip out these fields, because it doesn't appear to see them.

I suspect that if I owned a copy of the very-not-free Acrobat, I could meddle around with this metadata. Obviously I could submit the review as a Postscript file, but in general I prefer to stick with PDF. Further, this problem also occurs if I want to do due diligence when submitting to conferences with double-blind review, and sometimes I don't have the option to use PS.
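One rough way to check what a PDF gives away is to search its raw bytes for your username and for the /PTEX.FileName key that pdftex writes. The sketch below assumes a POSIX shell and a grep that supports -a and -o (GNU and BSD grep both do); find_pdf_leaks is a hypothetical helper name, not part of pdftk or any TeX distribution. Text inside compressed streams will not show up this way, so an empty result is not proof that the file is clean.

```shell
# Print every printable run in a PDF that mentions a given needle
# (for example a username), plus any /PTEX.FileName entries.
# find_pdf_leaks is a hypothetical helper name used for illustration.
# Note: the needle is interpreted as an extended regular expression.
find_pdf_leaks() {
    pdf="$1"
    needle="$2"
    # -a treats the binary PDF as text; -o prints only the matching part.
    LC_ALL=C grep -a -o -E \
        "(/PTEX\.FileName[[:space:]]*\([^)]*\)|[[:print:]]*${needle}[[:print:]]*)" \
        "$pdf"
}
```

Something like `find_pdf_leaks report.pdf "$USER"` would list the suspicious strings, and the function returns a non-zero status when nothing matches.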

9 comments:

  1. Can't you use some online file conversion service to convert the ps to a pdf?

  2. On my system, pdflatex does not seem to include the full path name of the source file in the PDF file.

    In Adobe Reader, the information in "File / Document properties / Description / Advanced / Location" seems to show the current location of the PDF file, not the location of the original LaTeX file.

    However, I have noticed that if I use \includegraphics{foo.pdf} or similar, pdflatex sometimes adds the full path name of foo.pdf in the resulting PDF file. Adobe Reader does not seem to show this information but you can show it by using the "strings" command in Unix.

  3. I also couldn't find any reference to the original source file in the generated PDF.

    If you use relative paths for included PDF images, it seems the relative path is stored (e.g., "./foo.pdf" would appear, rather than the full path).

    The metadata from included PDF images is also carried over into the final PDF, so make sure these images are anonymized as well.

  4. "I suspect that if I owned a copy of the very-not-free Acrobat"

    Huh?

    Suresh, the University of Utah store web site has it listed for faculty as:

    Acrobat Professional 8.0 - Single User $43.00

    That seems reasonable.

  5. What about

    pdf2ps orig.pdf | ps2pdf - scrubbed.pdf

  6. So I forgot that I can get university discounts for software :). In any case, the Adobe store doesn't appear to work right on Firefox/Linux, and I don't know if Acrobat even comes in a Linux version.

    The problem with converters like pdf2ps is that they are unreliable when it comes to dealing with images, although they handle the text just fine (as long as fonts are set up correctly).

  7. In cases like this I usually do the last compilation in /tmp ...

  8. In your situation, I usually use pdftk to first uncompress the pdf, then edit the uncompressed pdf as I see fit, and then compress it back. (Look for "/PTEX.FileName" in the uncompressed version.)

    maverick

  9. Very interesting blog, I've learned something new. Greeting, Bruno (Geld)

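The uncompress, edit, recompress workflow described in comment 8 can be sketched as a small shell script. This is only an illustration under some assumptions: pdftk is installed, sed supports -E, and the /PTEX.FileName value is a simple parenthesised string with no nested or escaped parentheses. The function names and the "(scrubbed)" placeholder are hypothetical. Editing changes byte offsets in the file, so it is worth confirming that the recompressed PDF still opens cleanly.

```shell
# Replace the value of every /PTEX.FileName entry with a neutral placeholder.
# Reads the (uncompressed) PDF named in $1 and writes the result to stdout.
scrub_ptex_filename() {
    LC_ALL=C sed -E 's#(/PTEX\.FileName[[:space:]]*)\([^)]*\)#\1(scrubbed)#g' "$1"
}

# Hypothetical wrapper: uncompress with pdftk, scrub, then recompress.
anonymize_pdf() {
    in="$1"
    out="$2"
    work=$(mktemp -d) || return 1
    pdftk "$in" output "$work/plain.pdf" uncompress || return 1
    scrub_ptex_filename "$work/plain.pdf" > "$work/edited.pdf" || return 1
    pdftk "$work/edited.pdf" output "$out" compress || return 1
    rm -rf "$work"
}
```

A call such as `anonymize_pdf report.pdf report-anon.pdf` would then produce the scrubbed copy. Searching the output again for your username is a sensible final check.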
