Reflections on the Discussion on the Quality of Embedded Metadata in PDFs
Posted by Brian Kelly on 11 January 2013
The Quality of Metadata Embedded in PDFs
The recent post on Embedded Metadata in PDFs Hosted in Institutional Repositories: An Inside-Out & Outside-In View generated a fair amount of discussion: about 17 comments on the post itself and, perhaps more significantly, a more interactive discussion on Twitter, with relevant contributions from @mrnick, @neilstewart, @rmounce, @carusb, @pj_webster, @emmatonkin, @MikeTaylor and @wrap_ed. Other Twitter users shared links to the posts with their communities.
Whilst some people may still feel that discussions should take place on one centralised system (e.g. a mailing list), in reality this is an unrealistic expectation. In the real world, discussions based on ideas which may have originated online will be dispersed across offices and common rooms in institutions around the world, to say nothing of other discussions which may take place in pubs and coffee rooms as well as whilst travelling. Conversations about interesting ideas will be distributed; we have to accept that. However, it can be helpful to aggregate valuable comments which may be fragmented across a variety of communication channels. Since I felt that the Twitter discussions about the post were particularly interesting, I have created a Storify summary entitled The Quality of Embedded Metadata in PDFs (Jan 2013). Note that this complements the Topsy summary, which gives the tweets which contain links to the blog post.
I wonder if some of these issues might be relevant within the context of the UK RepNet project, which is holding a meeting in London on 21st Jan: http://www.rsp.ac.uk/events/supporting-and-enhancing-your-repository/
I will therefore provide a summary of the main issues which were discussed on the blog and on Twitter.
The initial post was written in response to a post by Ross Mounce in which he asked PDF metadata – why so poor? and a follow-up post a week later on PDF metadata: different tool, same story. Ross’s post was based on an analysis of the metadata embedded in PDFs hosted by scholarly publishers. Ross’s second post succinctly summarised his work:
So a week ago, I investigated publisher-produced Version of Record PDFs with pdfinfo and the results were very disappointing. Lots of missing metadata was found and one could not reliably identify most of these PDFs from metadata alone, let alone extract particular fields of interest.
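As a rough illustration of the kind of check Ross describes, the sketch below parses the "Key: value" text that pdfinfo prints and reports which fields are missing or empty. The list of fields treated as required, the function names and the sample text are all assumptions made for this example; they are not taken from Ross's analysis.

```python
# Fields assumed (for illustration only) to be needed to identify a paper.
REQUIRED_FIELDS = ("Title", "Author", "Subject", "Keywords")


def parse_pdfinfo(text):
    """Turn pdfinfo-style 'Key:   value' lines into a dictionary."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as dates keep theirs.
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            info[key.strip()] = value.strip()
    return info


def missing_fields(info, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty."""
    return [name for name in required if not info.get(name)]


# Made-up text in the style of pdfinfo output: the title is present but
# empty, and the author, subject and keywords are absent altogether.
sample = """Title:
Creator:        Adobe InDesign
Producer:       Adobe PDF Library
Pages:          12
"""

print(missing_fields(parse_pdfinfo(sample)))
```

Run over a folder of downloaded papers, a check along these lines would give a simple count of how many PDFs could not be identified from their embedded metadata alone.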
I wondered whether PDFs hosted in institutional repositories also suffered from poor quality or missing embedded metadata. I examined some papers I had deposited in the University of Bath repository and found that metadata which was contained in the original PDF file I uploaded to the repository was missing from the PDF which users can download. I surmised that the metadata had been lost in the workflow when a cover sheet was added to the paper.
My post referenced a post by Lorcan Dempsey entitled Discovery vs discoverability … in which he explored the idea of the “inside-out and outside-in library”. This seemed very relevant to this scenario as both Ross and I were concerned primarily with the implications of missing metadata for systems which may be used outside of the repository context: in Ross’s case this related to text mining of large collections of PDFs, whereas my interest focussed on reuse of PDFs in other repositories.
The initial comment on the blog post by Ingmar Koch illustrated how embedded PDF metadata can be (mis-)used by external systems. Ingmar described how “the company that designed the document templates for most of the government agencies added a title and author in the template-file. The result is that thousands of online government documents (.pdf and .doc) are titled “at opinio facillime sumitur” and are written bij M. Hes.” This example provides a vivid illustration of how metadata embedded in PDFs is being used by Google. However, the same example might also be used to demonstrate the poor quality of embedded metadata.
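Template-derived metadata of the kind Ingmar describes tends to show up as the same title repeated across many unrelated files. The following minimal sketch flags such titles in a collection; the function name, the threshold and the sample records are assumptions chosen purely for illustration.

```python
from collections import Counter


def shared_titles(records, threshold=3):
    """Return titles used by at least `threshold` documents.

    `records` is an iterable of (filename, title) pairs. Titles are
    compared after trimming whitespace and lower-casing; the threshold
    is an arbitrary value chosen for this example.
    """
    counts = Counter(
        title.strip().lower() for _, title in records if title.strip()
    )
    return {title: n for title, n in counts.items() if n >= threshold}


# Made-up records; the repeated title is the one quoted by Ingmar.
records = [
    ("report-a.pdf", "at opinio facillime sumitur"),
    ("report-b.pdf", "At opinio facillime sumitur"),
    ("report-c.pdf", "at opinio facillime sumitur "),
    ("minutes.pdf", "Council minutes, March"),
]

print(shared_titles(records))
```

A title shared by several otherwise unrelated documents is a hint, not proof, that it came from a template rather than from the author.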
In light of such examples Neil Stewart therefore asked “does it matter if the rare and patchy instances of author-created metadata gets over-written or otherwise distorted?” since “the structured metadata provided at Eprint/DSpace/other repository software record level does the job here (as opposed to metadata embedded within the PDF itself).”
But surely we cannot argue that since some resources may contain poor quality metadata we should delete all metadata! I would argue that there is a need to educate authors on the importance of appropriate metadata, which includes showing how such metadata can be used by services outside of the host institution. Neil recognised the validity of this point when he acknowledged that “not every service will use OAI-PMH or web crawling, some might parse the objects themselves”.
The discussion then moved on to Twitter and initially addressed the relevance of cover sheets, since these appear to cause problems in workflows which take place outside of the institutional repository.
Ross Mounce asked:
Neil Stewart provided one use case for cover sheets:
However Ross re-iterated his criticisms of cover sheets:
Others, such as Chris Rusbridge, agreed with this view:
The discussion then moved on to problems which may occur if a paper is to be downloaded, with Nick Sheppard providing a good example of how PDFs may end up containing multiple cover sheets if they are taken from one repository and deposited (by, for example, a co-author) in another repository:
I then highlighted a paper by my colleague Emma Tonkin which showed that problems with poor quality metadata went beyond the individual examples provided on Twitter:
The paper (PDF format) described how:
Many repositories … have developed or identified a means of adding a cover sheet to each document within the repository. This has potential for positive impact, for example, as a means of clearly indicating the provenance of an item (Puplett, 2008). As can be seen in Fig. 7, Google Scholar does not necessarily recognise the cover sheet for what it is, and this has negative implications for effective indexing and retrieval.
and went on to conclude:
However, the addition of a cover sheet has caused a number of issues beyond those that are usually encountered with the PDF format (ie. font problems, file corruption, etc). This limits the ability for automated processes to make use of this information, and could therefore be said on the level of automated indexing and other software access (such as conversion) to be a retrograde step. If this becomes common practice it may be necessary to review both the assumptions under which automated systems are developed, and perhaps the rationale that lead us to make use of cover sheets in this context.
The paper on Supporting PDF accessibility evaluation: early results from the FixRep project was written in 2010 by my colleagues Emma Tonkin and Andy Hewitt and presented at the 2nd Qualitative and Quantitative Methods in Libraries International Conference (QQML2010). The concluding sentence in the paper highlighted work which still needs to be addressed:
it may be necessary to review both the assumptions under which automated systems are developed, and perhaps the rationale that lead us to make use of cover sheets in this context
The paper identified the benefits of cover sheets but also the problems they can cause for automated activities which may take place outside of the institutional repository environment.
But should repository managers and developers necessarily devote resources to addressing potential problems which may arise downstream of the repository environment? In a comment on Ross Mounce’s blog the point was made that publishers will need a sound business case:
“Why would publishers add metadata? Because their customers – libraries, governments, research funders (in the case of Open Access PDFs ) should demand it.” I’m not seeing a compelling business case here. High-quality metadata would be nice, but can anybody argue that their research is being hampered by a lack of such metadata? Could someone working in publishing make a case to their boss that adding such metadata would generate more revenue, web traffic, manuscript submissions (insert whatever metric matters)?
In the context of institutional repositories, perhaps the approach to be taken would be to ensure that embedded metadata is preserved and that the training and advice provided by repository support staff makes authors aware of the ways in which embedded metadata can be used, even if such reuse takes place outside of the institutional repository.
The discussion also highlighted the need for enhanced workflow practices for merging cover pages with the original content, and also for enabling users (and automated tools) to access the original source paper in addition to the version containing provenance information designed for consumption by users.
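One way to picture the metadata side of such a workflow is the sketch below. It deliberately leaves out the PDF handling itself and models only the document information fields as plain dictionaries: values embedded by the author are kept, and the repository record is used only to fill fields the author left empty. The function name, the field names and the sample values are assumptions made for this example, not a description of how any repository platform actually behaves.

```python
def merged_info(embedded, record):
    """Build the metadata for a cover-sheet-plus-paper PDF.

    `embedded` holds the fields found in the author's original file and
    `record` holds fields from the repository record. Non-empty embedded
    values always win; the record only fills the gaps.
    """
    result = {key: value for key, value in record.items() if value}
    result.update({key: value for key, value in embedded.items() if value})
    return result


# Made-up values for illustration.
embedded = {
    "Title": "Title as set by the author",
    "Author": "",
    "Keywords": "PDF, metadata",
}
record = {
    "Title": "Title as catalogued",
    "Author": "A. N. Author",
}

print(merged_info(embedded, record))
```

Whatever tool adds the cover sheet would then need to write the merged fields into the output file rather than starting from an empty set, which appears to be where the metadata was lost in the example described earlier.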
Do any institutional repositories currently provide solutions to these requirements? In addition, I am interested in how many institutional repositories provide cover pages and whether those that do use repository plugin technology, some other automated technology or manual processes to add them. Two polls on these questions are embedded in this post, but if the situation is more complex than can be summarised in the polls, feel free to add a comment.
Footnote (added 12 January 2013): A tweet from @community alerted me to the blog post on SEO Action for PDF files on the Adobe blog. This describes an “Action” for use in Acrobat X Pro that will automate setting the properties of the PDF file in accordance with guidelines which can enhance the discoverability of PDF files by Google.