UK Web Focus (Brian Kelly)

Innovation and best practices for the Web

Are Your PDFs Conformant?

Posted by Brian Kelly on 23 Jan 2009

I’ve never been much of a fan of the PDF format. Back in the early days of the Web I had hoped that the proprietary PDF format would be replaced by HTML and CSS. Back then there was an expectation that CSS would be developed to provide the fine control over page layout that is available using word processing and DTP applications. The development of the Document Object Model (DOM) for HTML/XML also promised to deliver an environment in which such resources could be interrogated and manipulated in ways which would not be possible with more monolithic resources such as PDFs. And finally HTML and CSS provided accessibility benefits not available in PDF.

However over the years it became apparent that HTML/CSS wouldn’t provide such fine layout control. And we found that HTML as used in the real world tended to be a structural mess, sometimes referred to as ‘tag soup’.

We also discovered that in many cases users preferred PDFs, especially for resources which were designed as printed documents.

And last year PDF became an ISO standard, following on from the standardisation of PDF/A as an archival format.

So PDF is now an open standard, is suitable for archival purposes, has widespread support and can be made accessible – and there is also an Adobe SDK which supports the development of applications to create and process PDF files.

Sounds good, doesn’t it? But in practice, do PDF files actually conform to the PDF standard? And although PDF files can be accessible, in practice do the PDF files which are produced in normal workflow processes actually comply with accessible PDF guidelines?

I recently searched for PDF validation tools. I found that a number of tools were available, many of which were expensive to purchase. I made use of one free email-based tool (Validatepdfa) and used it to report on the conformance of a couple of PDF files for recent peer-reviewed papers which I had submitted to journal / conference organisers. Although these files may have conformed with the publisher’s layout and house style requirements, the tool reported quite a number of errors. As you can see, the error messages aren’t particularly helpful and it is difficult to see how such errors can be remedied:

Issues addressed (1) File structure Incorrect delimiter used for indirect object 340 0
Issues addressed (2) File structure Incorrect delimiter used for indirect object 370 0
Issues addressed (3) File structure Missing ID in trailer dictionary

Issues addressed (118) Fonts Font ‘TrebuchetMS-Bold’ was successfully substituted and embedded
Issues addressed (119) Fonts CID font subset without CIDSet
Issues addressed (120) Fonts CIDToGIDMap has been successfully embedded in Type2 font LHCKAJ+SymbolMT.
Issues addressed (121) Fonts CID font subset without CIDSet
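
To give a feel for what one of these messages refers to: “Missing ID in trailer dictionary” means that the trailer at the end of the file has no /ID entry, which (as I understand it) PDF/A expects to be present. The following is only a rough sketch of how one might look for it, not how Validatepdfa works. It assumes the file has a classic, uncompressed trailer; the function name trailer_has_id is my own invention, and files which use cross-reference streams instead will simply yield None.

```python
import re
from typing import Optional


def trailer_has_id(pdf_bytes: bytes) -> Optional[bool]:
    """Heuristic check for an /ID entry in the last classic trailer.

    Returns True or False when a 'trailer' keyword is found, and None
    when it is not (for example, when the file uses a cross-reference
    stream), because this sketch cannot decide in that case.
    """
    start = pdf_bytes.rfind(b"trailer")
    if start == -1:
        return None
    # The trailer dictionary sits between 'trailer' and 'startxref'.
    end = pdf_bytes.find(b"startxref", start)
    if end == -1:
        end = len(pdf_bytes)
    trailer = pdf_bytes[start:end]
    # /ID is followed by an array of two strings: /ID [<...> <...>]
    return re.search(rb"/ID\s*\[", trailer) is not None
```

To try it on a real document, read the file in binary mode, e.g. trailer_has_id(open("paper.pdf", "rb").read()), where paper.pdf is a placeholder name. A proper validator parses the file structure rather than pattern-matching raw bytes, so treat any result as a hint only.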

I then used the Adobe Acrobat software to report on any accessibility problems with the papers. I used this tool to analyse all of my peer-reviewed papers which I have written in the past 10 years – and found that none of the papers actually conformed with Adobe’s accessibility guidelines.

The error messages provided in Adobe Acrobat were mostly helpful and it seemed that one consistent problem was the lack of a language to describe the contents of the document. Fortunately Adobe Acrobat does allow some of the accessibility problems to be fixed with the software – so I assigned the language English to all of the documents. Some of my papers now do conform with PDF accessibility guidelines (at least as far as automated checking tools can detect) – but the documents which had been uploaded to the University of Bath’s institutional repository a few months ago will be the non-accessible versions. There are issues about the workflow processes for uploading papers to institutional repositories: who should have a responsibility for ensuring compliance with guidelines; at what stage should appropriate metadata be added; who should ensure that the metadata is correct; what tools can be used to create and maintain such metadata; what level of detail should be provided; how do we ensure that the metadata isn’t corrupted during workflow processes; etc. Did you really think that using PDF was easy?
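
The missing-language problem, at least, is one that a repository could screen for at upload time. The sketch below is an assumption-laden illustration rather than a description of what Acrobat does: it searches the raw bytes for a /Lang entry written as a literal string, such as /Lang (en-GB). It will miss entries stored inside compressed object streams or written as hex strings, and it does not check that the entry belongs to the document catalog. The function name declared_language is hypothetical.

```python
import re
from typing import Optional


def declared_language(pdf_bytes: bytes) -> Optional[str]:
    """Return the first literal-string /Lang value found, or None.

    Only the uncompressed, literal-string form is recognised, e.g.
    '/Lang (en-GB)'. A None result therefore means 'not found by this
    heuristic', not 'definitely absent'.
    """
    match = re.search(rb"/Lang\s*\(([^)]*)\)", pdf_bytes)
    if match is None:
        return None
    return match.group(1).decode("latin-1")
```

A deposit workflow could run something like this over each uploaded file and flag those returning None for a closer look in a full accessibility checker.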

I suspect that most people aren’t particularly interested in conformance of such resources with PDF standards and accessibility guidelines – although it was reassuring to see the post on “Survey on malformed PDFs?” on the DCC blog.

But if we are serious about the importance of standards, particularly in the context of digital preservation, and if we are serious about the accessibility of digital resources, we will need to ensure that our workflow practices result in resources on our Web sites and institutional repositories which are conformant.

Or perhaps strict conformance with standards and accessibility guidelines is over-rated. Should we simply acknowledge that the ease of creation of PDF resources is key to the creation of such resources and adding additional steps into the workflow processes will add unnecessary complexities and barriers?

7 Responses to “Are Your PDFs Conformant?”

  1. Roland the Headless Thompson Gunner said

    You’re making a good point here, and not just on the finer points concerning metadata. I’m sure we have all seen PDFs which are just images of scanned documents!

    Adherence to standards should be considered necessary, but never sufficient.

    On the other hand, I’d be happy if people would just stop distributing f***ing Word documents!

  2. Roland the Headless Thompson Gunner said

    And another thing;-)

    Thank heaven HTML and CSS don’t allow people the precision to format documents like they would for printing.

    One of the major blights on our efforts to create usable, accessible web sites is braindead web designers who strive for pixel-perfect rendering.

    This approach creates major problems on slightly unusual platforms, e.g. text that overflows a fixed-size container and becomes unreadable when viewed on a display with a resolution (and therefore font size) higher than the designer allowed for.

    Print and browser environments are fundamentally different, and require different design and tools.

  3. Chris Rusbridge said

    David Rosenthal, who is in serious iconoclastic mode these days, came up with several reasons why you (or at least preservationistas) should not care. See http://blog.dshr.org/2009/01/postels-law.html

  4. Designers do that in print as well. Nothing like trying to look at an old batch of vector files (Freehand, Illustrator) to see what noncompliant work practices of yore look like in the migrated file.

  5. […] – it will result in documents being provided in formats such as PDFs. And who bothers checking that PDFs conform with PDF standards? […]

  6. Our PDF/A Validator has recently been updated to use an XML reporting format that includes more detail about the compliance issues.

    Our remedy is obviously to sell people our desktop product, Solid PDF Tools, to convert the PDF to PDF/A (there is a free trial). We figure offering validation for free is a good start. Repair we charge money for.

    All errors mentioned in the report except “fatalError”s can be repaired by Solid PDF Tools.

  7. […] Are Your PDFs Conformant? (source: UK Web Focus, […]
