UK Web Focus (Brian Kelly)

Innovation and best practices for the Web

Automated Accessibility Analysis of PDFs in Repositories

Posted by Brian Kelly on 30 Jul 2010

Back in December 2006 I wrote a post on Accessibility and Institutional Repositories in which I suggested that it might be “unreasonable to expect hundreds if not thousands of legacy [PDF] resources to have accessibility metadata and document structures applied to them, if this could be demonstrated to be an expensive exercise of only very limited potential benefit”. I went on to suggest that there is a need to “explore what may be regarded as ‘unreasonable’ we then need to define ‘reasonable’ actions which institutions providing institutional repositories would be expected to take”.

The discussion on the costs and complexities of implementing various best practices for depositing resources in repositories continued, as I described in a post on Institutional Repositories and the Costs Of Doing It Right in September 2008, with Les Carr suggesting that “If accessibility is currently out of reach for journal articles, then it is another potential hindrance for OA”. Les was arguing that the cost of providing accessible resources in institutional repositories is too great and can act as a barrier to maximising open access to institutional research activities.

I agreed with this view, but also felt there was a need to gain evidence on possible accessibility barriers. Such evidence should help to inform practice, user education and policies. These ideas were developed in a paper published last year on “From Web Accessibility to Web Adaptability” (available in PDF and HTML formats) in which I suggested that institutions should “run automated audits on the content of [PDF resources in] the repositories. Such audits can produce valuable metadata with respect to resources and resource components and, for example, evaluate the level of use of best practices, such as the provision of structured headings, tagged images, tagged languages, conformance with the PDF standard, etc. Such evidence could be valuable in identifying problems which may need to be addressed in training or in fixing broken workflow processes.”
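To give a flavour of what such an automated audit might record for a single file, here is a minimal sketch in Python. It is not the FixRep software: the function name `audit_pdf_bytes` and the result keys are my own hypothetical choices, and the checks are crude byte-level heuristics that look for the `/StructTreeRoot`, `/MarkInfo` and `/Lang` entries which tagged PDFs normally carry in their document catalog. In PDFs which store objects in compressed object streams these entries may not be visible as plain bytes, so a negative result here is a prompt for closer inspection rather than proof of a problem.

```python
import re


def audit_pdf_bytes(data: bytes) -> dict:
    """Return crude accessibility indicators for one PDF (heuristic only)."""
    # The %PDF-x.y header should appear within the first 1024 bytes.
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    # A document-level language is normally declared as /Lang (xx-YY).
    lang = re.search(rb"/Lang\s*\(([^)]*)\)", data)
    return {
        "pdf_version": header.group(1).decode("ascii") if header else None,
        # A structure tree is what carries headings, alt text and so on.
        "has_structure_tree": b"/StructTreeRoot" in data,
        # /MarkInfo << /Marked true >> flags the file as a tagged PDF.
        "marked_as_tagged": re.search(rb"/Marked\s+true", data) is not None,
        "language": lang.group(1).decode("latin-1") if lang else None,
    }
```

Run across every PDF in a repository, records of this kind could be aggregated to show, for example, what proportion of deposits declare a language or carry any structure at all.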

I discussed these ideas with my colleagues Emma Tonkin and Andy Hewson, who are working on the JISC-funded FixRep project which “aims to examine existing techniques and implementations for automated formal metadata extraction, within the framework of existing toolsets and services provided by the JISC Information Environment and elsewhere”. Since this project is analysing the metadata for repository items including “title, author and resource creation date, temporal and geographical metadata, file format, extension and compatibility information, image captions and so forth” it occurred to me that this work could also include automated analyses of the accessibility aspects of PDF resources in repositories.

Emma and Andy have developed such software, which they have used to analyse records in the University of Bath Opus repository. Their initial findings were published in a paper on “Supporting PDF accessibility evaluation: Early results from the FixRep project”. This paper was accepted by the “2nd Qualitative and Quantitative Methods in Libraries International Conference (QQML2010)” which was held in Greece on 25-28 May 2010. Due to the volcanic ash Emma and Andy were unable to attend the conference. Emma did, however, produce a Slidecast of the presentation, which she used in place of attending in person and which has the advantage that it can be embedded in this blog.

The prototype software they developed was used to analyse PDF resources by extracting information about the document in a number of ways, including header and formatting analysis, information from the body of the document and information from the originating filesystem. The initial pilot analysed PDFs held in the University of Bath repository and was successful in analysing 80% of the PDFs, with the remaining 20% unable to be analysed either because of a lack of metadata available for extraction or because the file format was not supported by the analysis tools.
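The general shape of such a pilot, in which each file either yields a record or a reason for failure and the overall success rate is then reported, can be sketched as follows. This is an illustration only and not the FixRep code: the `Record` class, the `analyse` and `success_rate` functions and the failure message are all hypothetical, and the sketch draws on just two of the sources mentioned above (the file header and the filesystem).

```python
import os
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Record:
    path: str
    size_bytes: int              # from the originating filesystem
    pdf_version: Optional[str]   # from the file header, if one was found
    failure: Optional[str]       # reason the file could not be analysed


def analyse(path: str) -> Record:
    """Gather filesystem and header information for one file."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        head = fh.read(1024)
    pos = head.find(b"%PDF-")
    if pos == -1:
        # No PDF header in the first 1024 bytes: treat as unsupported.
        return Record(path, size, None, "unsupported file format")
    version = head[pos + 5 : pos + 8].decode("ascii", errors="replace")
    return Record(path, size, version, None)


def success_rate(records: Iterable[Record]) -> float:
    """Fraction of records analysed without a failure (0.0 if none)."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for r in records if r.failure is None) / len(records)
```

Keeping the failure reason alongside each record means the unanalysed files can later be broken down by cause instead of being reported as a single figure.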

In my conversations with Emma and Andy we discussed how knowledge of the tools used to create the PDF would be useful in understanding the origins of possible accessibility limitations, with such knowledge being used to inform both user education and the workflow processes used to create PDFs which are deposited in repositories. However, rather than the diversity of PDF tools which we expected to find, there appeared to be only two main tools in use. It appears that this reflects the software used to create the PDF cover page (which I have written about recently) rather than the tools used to create the main PDF resource. If you are unfamiliar with such cover pages, they aim to provide key information about the paper and also provide institutional branding.
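A tally of creation tools of the kind described above could be produced along the following lines. Again this is a sketch under stated assumptions and not the FixRep implementation: the names `producer_of` and `tally_producers` are hypothetical, and the regular expression only handles a `/Producer` entry written as a plain literal string in an unencrypted, uncompressed Info dictionary (with at most one level of nested parentheses). Hex-encoded or UTF-16 strings would need a proper PDF parser.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Literal string after /Producer: escaped characters, ordinary characters,
# or one level of balanced parentheses, as in "Tool 1.0 (Windows)".
PRODUCER = re.compile(
    rb"/Producer\s*\(((?:\\.|[^\\()]|\((?:\\.|[^\\()])*\))*)\)"
)


def producer_of(data: bytes) -> Optional[str]:
    """Return the /Producer string found in raw PDF bytes, if any."""
    match = PRODUCER.search(data)
    return match.group(1).decode("latin-1") if match else None


def tally_producers(pdfs: Iterable[bytes]) -> Counter:
    """Count how often each producing tool appears across a set of PDFs."""
    return Counter(producer_of(data) or "unknown" for data in pdfs)
```

If a cover page has been prepended after the fact, a tally like this is likely to report the cover page software and not the tool which produced the original paper, which is exactly the effect observed in the pilot.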

As Emma concluded in the presentation, “We may be ‘shooting ourselves in the foot’ with additions like after-the-fact cover sheets. This may remove original metadata that could have been utilised for machine learning.”

Absolutely! As well as acting as a barrier to Search Engine Optimisation (which is discussed in the paper), the current approaches taken to the production of such cover pages act as a barrier to research, such as the analysis of the accessibility of such resources.

It does strike me that this is nothing new. When the Web first came to the attention of University marketing departments there was a tendency to put large logos on the home page, images of the vice-chancellor and even splash screens to provide even more marketing, despite Web professionals pointing out the dangers associated with such approaches.

So whilst I understand that there may be a need for cover pages, can they be produced in a more sophisticated fashion so that they are friendly to those who are developing new and better ways of accessing resources in institutional repositories? Please!

8 Responses to “Automated Accessibility Analysis of PDFs in Repositories”

  1. Kara Jones said

    Glad you understand the need for a coverpage. We’d be happy to use a better solution – is there one?!

    • I wonder if a better solution could be developed by getting a group of techies together and seeing if they can come up with a more appropriate solution?

      My initial thoughts are that the software generating the cover page should not change/obscure the PDF characteristics of the main paper. Can the software generating the PDF cover page analyse the metadata of the paper’s PDF and replicate this in its metadata, I wonder? And would it be possible for the cover page to be a back cover page?

      I wonder if anyone in the repositories community has already addressed such issues?

  2. […] This post was mentioned on Twitter by DigitalKoans, Brian Kelly. Brian Kelly said: Automated Accessibility Analysis of PDFs in Repositories: Back in December 2006 I wrote a post on Accessibility an… […]

  3. […] Automated Accessibility Analysis of PDFs in Repositories :… […]

  4. […] This should only be included if it is intended to implement such a service. Note UKOLN have developed a trial application which could implement such a service which was described in a paper on Automated Accessibility Analysis of PDFs in Repositories. […]

  5. […] research carried out by my colleagues Emma Tonkin and Andy Hewson described in a post on “Automated Accessibility Analysis of PDFs in Repositories“, might the cover pages automatically generated by repository systems created additional […]

  6. […] PDF accessibility evaluation: Early results from the FixRep project” which I described in a blog post last year. Might there be an opportunity for developers to build on this initial work, I […]

  7. […] in July 2010 in a post on “Automated Accessibility Analysis of PDFs in Repositories” I mentioned a paper on “From Web Accessibility to Web Adaptability” (available […]
