Automated Accessibility Analysis of PDFs in Repositories
Posted by Brian Kelly (UK Web Focus) on 30 July 2010
Back in December 2006 I wrote a post on Accessibility and Institutional Repositories in which I suggested that it might be “unreasonable to expect hundreds if not thousands of legacy [PDF] resources to have accessibility metadata and document structures applied to them, if this could be demonstrated to be an expensive exercise of only very limited potential benefit“. I went on to suggest that there is a need to “explore what may be regarded as ‘unreasonable’ we then need to define ‘reasonable’ actions which institutions providing institutional repositories would be expected to take“.
The discussion on the costs and complexities of implementing various best practices for depositing resources in repositories continued, as I described in a post on Institutional Repositories and the Costs Of Doing It Right in September 2008, with Les Carr suggesting that “If accessibility is currently out of reach for journal articles, then it is another potential hindrance for OA“. Les was arguing that the costs of providing accessible resources in institutional repositories are too great and can act as a barrier to maximising open access to institutional research activities.
I agreed with this view, but also felt there was a need to gain evidence on possible accessibility barriers. Such evidence should help to inform practice, user education and policies. These ideas were developed in a paper published last year on “From Web Accessibility to Web Adaptability” (available in PDF and HTML formats) in which I suggested that institutions should “run automated audits on the content of [PDF resources in] the repositories. Such audits can produce valuable metadata with respect to resources and resource components and, for example, evaluate the level of use of best practices, such as the provision of structured headings, tagged images, tagged languages, conformance with the PDF standard, etc. Such evidence could be valuable in identifying problems which may need to be addressed in training or in fixing broken workflow processes.”
I discussed these ideas with my colleagues Emma Tonkin and Andy Hewson who are working on the JISC-funded FixRep project which “aims to examine existing techniques and implementations for automated formal metadata extraction, within the framework of existing toolsets and services provided by the JISC Information Environment and elsewhere“. Since this project is analysing the metadata for repository items including “title, author and resource creation date, temporal and geographical metadata, file format, extension and compatibility information, image captions and so forth” it occurred to me that this work could also include automated analyses of the accessibility aspects of PDF resources in repositories.
Emma and Andy have developed such software, which they have used to analyse records in the University of Bath Opus repository. Their initial findings were published in a paper on “Supporting PDF accessibility evaluation: Early results from the FixRep project“. This paper was accepted by the “2nd Qualitative and Quantitative Methods in Libraries International Conference (QQML2010)”, which was held in Greece on 25-28 May 2010. Due to the volcanic ash Emma and Andy were unable to attend the conference in person. Emma did, however, produce a Slidecast of the presentation, which was used in their absence. This has the advantage that it can also be embedded in this blog post.
The prototype software they developed analyses PDF resources by extracting information about the document in a number of ways, including header and formatting analysis, information from the body of the document, and information from the originating filesystem. The initial pilot analysed PDFs held in the University of Bath repository and was successful for 80% of them; the remaining 20% could not be analysed, either because no metadata was available for extraction or because the file format was not supported by the analysis tools.
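To give a flavour of what header-level analysis can involve, here is a minimal sketch in Python. It is not the FixRep software: the function names `sniff_pdf` and `analysable_share` are invented for illustration, and the approach (a plain byte scan for the `%PDF-x.y` header and for `/Producer` and `/Creator` entries written as literal strings) is an assumption that will miss encrypted files, hex-encoded strings and metadata held in compressed object streams.

```python
import re

# The PDF header line, e.g. "%PDF-1.4", normally appears at the very start of the file.
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")

# Literal-string form only, e.g. "/Producer (Some Tool)".
# Hex strings and escaped or nested parentheses are deliberately not handled.
FIELD_RES = {
    "producer": re.compile(rb"/Producer\s*\(([^()]*)\)"),
    "creator": re.compile(rb"/Creator\s*\(([^()]*)\)"),
}


def sniff_pdf(data: bytes):
    """Return header-level facts about a PDF, or None if it does not look like one."""
    header = HEADER_RE.search(data[:1024])
    if header is None:
        return None  # unsupported or unrecognised file format
    facts = {"version": header.group(1).decode("ascii")}
    for name, pattern in FIELD_RES.items():
        match = pattern.search(data)
        facts[name] = match.group(1).decode("latin-1") if match else None
    return facts


def analysable_share(documents):
    """Fraction of documents that are PDFs and expose a producer or creator value."""
    results = [sniff_pdf(doc) for doc in documents]
    if not results:
        return 0.0
    usable = sum(
        1 for r in results
        if r is not None and (r["producer"] or r["creator"])
    )
    return usable / len(results)
```

Run over the raw bytes of each deposited file, `analysable_share` gives a figure of the same kind as the success rate reported for the pilot, although a production tool would use a proper PDF parser rather than pattern matching.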
In my discussions with Emma and Andy we considered how knowledge of the tools used to create the PDF would be useful in understanding the origins of possible accessibility limitations, with such knowledge being used to inform both user education and the workflow processes used to create PDFs which are deposited in repositories. However, rather than the expected diversity of PDF tools, there appeared to be only two main tools in use. It appears that this reflects the software used to create the PDF cover page (which I have written about recently) rather than the tools used to create the main PDF resource. If you are unfamiliar with such cover pages, one is illustrated: the page aims to provide key information about the paper and also provides institutional branding, as can be seen.
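The kind of tally that would reveal this pattern can be sketched in a few lines of Python. This is an illustration only: the function names, the `top_n` and `threshold` parameters and their default values are all assumptions, and the input is taken to be a list of producer strings already extracted from each PDF (with `None` or an empty string where nothing was found).

```python
from collections import Counter


def tool_distribution(producers):
    """Return (tool, count, share) tuples, most common first, ignoring missing values."""
    counts = Counter(p.strip() for p in producers if p and p.strip())
    total = sum(counts.values())
    if total == 0:
        return []
    return [(tool, n, n / total) for tool, n in counts.most_common()]


def dominated_by_few_tools(producers, top_n=2, threshold=0.9):
    """True if the top_n tools account for at least `threshold` of known producers.

    A very concentrated distribution may be a hint that a post-processing step,
    such as adding a cover page, has overwritten the original metadata.
    """
    dist = tool_distribution(producers)
    if not dist:
        return False
    total = sum(n for _, n, _ in dist)
    top = sum(n for _, n, _ in dist[:top_n])
    return top / total >= threshold
```

A result of `True` does not prove that cover pages are responsible, but it does suggest that the producer field is telling us about the repository workflow rather than about the authors' own tools.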
As Emma concluded in the presentation “We may be ‘shooting ourselves in the foot’ with additions like after-the-fact cover sheets. This may remove original metadata that could have been utilised for machine learning.“
Absolutely! As well as acting as a barrier to Search Engine Optimisation (which is discussed in the paper), the current approaches taken to the production of such cover pages act as a barrier to research, such as the analysis of the accessibility of such resources.
It does strike me that this is nothing new. When the Web first came to the attention of University marketing departments there was a tendency to put large logos on the home page, images of the vice-chancellor and even splash screens to provide even more marketing, despite Web professionals pointing out the dangers associated with such approaches.
So whilst I understand that there may be a need for cover pages, can they be produced in a more sophisticated fashion so that they are friendly to those who are developing new and better ways of accessing resources in institutional repositories? Please!