UK Web Focus

Innovation and best practices for the Web

Archive for January 28th, 2013

Why I’m Now Embedding ORCID Metadata in PDFs

Posted by Brian Kelly on 28 January 2013

“Every PDF needs a title”

The day after announcing a post on Reflections on the Discussion on the Quality of Embedded Metadata in PDFs I received a tweet from @community which alerted me to a blog post on SEO Action for PDF files on the Adobe blog. The post describes an extension for Acrobat X Pro which automates setting the properties of a PDF file in accordance with guidelines intended to make PDF files more discoverable by Google. The guidelines, which had been published back in August 2009, were based on experiments which demonstrated improvements in Google’s indexing of PDF files. The article’s main conclusion was that “Every PDF needs a title”:

In terms of PDF files, the blue underlined text in Google’s search results comes from one of two places. First, Google looks in the “Title” document information field. If it finds nothing, Google’s indexer tries to guess the document’s title by scanning the text on the first few pages. This usually doesn’t work, producing incorrect and improperly formatted results.

In addition to this advice, the article also suggested use of other metadata fields including author, subjects and keywords.

Metadata For Peer-Reviewed Papers

Although I ensure that I provide the correct title for my peer-reviewed papers when I create them in MS Word, I was unsure whether I had included the names of the co-authors or made use of other metadata fields.

[Image: Metadata fields in MS Word]

On Friday 25 January 2013 I decided to update the metadata for one of my papers, “Developing A Holistic Approach For E-Learning Accessibility”, which was the first paper that Lawrie Phipps, Elaine Swift and I wrote, back in 2004.

I added a number of tags to the paper and used the Comments field to provide the abstract. In addition the publication details were added to the Status field.
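For readers who would rather script this step than use the Word dialog, here is a minimal Python sketch of how these fields might map onto the core-properties part of a .docx file (docProps/core.xml). It assumes that Tags correspond to cp:keywords, Comments to dc:description and Status to cp:contentStatus, and that multiple authors are joined with semicolons. The function name and the sample values are hypothetical, and the sketch only builds the XML; it does not write it into a document.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the OOXML core-properties part (docProps/core.xml).
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_core_properties(title, authors, tags, abstract, status):
    """Return a core.xml-style string holding the fields described above."""
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)
    root = ET.Element(f"{{{NS['cp']}}}coreProperties")
    fields = [
        ("dc", "title", title),
        ("dc", "creator", "; ".join(authors)),  # assumed separator
        ("cp", "keywords", ", ".join(tags)),    # the "Tags" field (assumed)
        ("dc", "description", abstract),        # the "Comments" field (assumed)
        ("cp", "contentStatus", status),        # the "Status" field (assumed)
    ]
    for prefix, name, value in fields:
        ET.SubElement(root, f"{{{NS[prefix]}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values: the tags, abstract and status text are illustrative.
    print(build_core_properties(
        "Developing A Holistic Approach For E-Learning Accessibility",
        ["Brian Kelly", "Lawrie Phipps", "Elaine Swift"],
        ["accessibility", "e-learning"],
        "Abstract text goes here.",
        "Publication details go here.",
    ))
```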

Whilst updating the metadata it occurred to me that it would be useful to include the ORCID ID for the authors as this will be less volatile than the author’s email address (one of the co-authors was based at the University of Bath when the paper was published but subsequently moved to Nottingham Trent University).

[Image: alt text for images in MS Word]

In addition to the resource discovery metadata for the paper I also remembered that I should ensure that images in the paper contained appropriate alt text, so that image descriptions are available to those who may make use of a screen reader. Fortunately we had done this for the paper, but I have to admit that this isn’t necessarily done for all of my research papers.

Having updated the metadata for the paper and embedded images I then created the PDF from MS Word. I noticed that the Save As PDF option in MS Word enabled a number of options to be specified, including Save As ISO-19005 (PDF/A).

As described in Wikipedia, PDF/A is “an ISO-standardized version of the Portable Document Format (PDF) specialized for the digital preservation of electronic documents”. The article goes on to explain that “PDF/A differs from PDF by omitting features ill-suited to long-term archiving, such as font linking (as opposed to font embedding)”.

[Image: Save as PDF option in MS Word]

Since the digital preservation of peer-reviewed publications is important, I ensured that I saved the paper in PDF/A format, using the Save As option illustrated.

Approaches to Embedding Metadata in PDFs

What practices should be used when providing metadata in the original authoring tool (MS Word, in my case) so that it is then available in the PDF version of the paper? Here’s a summary of the approaches I have used:

Title: The title of the paper

Tags: My preferred tags about the content and my organisation.

Comments: The abstract of the paper, normally taken from the abstract provided in the paper.

Author: First Name Surname (ORCID: ORCID ID) e.g. Brian Kelly (ORCID: 0000-0001-5875-8744)

The title field will be obvious. The tags will reflect keywords which I feel will enhance access to the document (and I choose fewer than five). I am using the comments field to host the abstract for the paper. Finally the author field contains the full name followed by ORCID: ORCID ID (in brackets). I feel that this is a pragmatic approach to ensuring that the significant information which will be indexed by Google is found in the metadata fields which are available through my authoring tool (MS Word).
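The author convention above can be sketched in a few lines of Python. This is only an illustration, assuming the exact "Name (ORCID: ID)" layout shown in the summary; the function names are hypothetical. The check-digit routine follows the ISO 7064 MOD 11-2 scheme which ORCID uses for the final character of an identifier, so a mistyped ID can be caught before it is embedded. The parser shows how an indexing tool could, in principle, separate the name from the identifier.

```python
import re

# Matches the convention "First Name Surname (ORCID: 0000-0000-0000-0000)".
AUTHOR_PATTERN = re.compile(
    r"^(?P<name>.+?) \(ORCID: "
    r"(?P<orcid>[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X])\)$"
)


def orcid_check_digit(base_digits):
    """ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """True if the hyphenated ID has 16 characters and a correct check digit."""
    digits = orcid.replace("-", "")
    if not re.fullmatch(r"[0-9]{15}[0-9X]", digits):
        return False
    return orcid_check_digit(digits[:15]) == digits[15]


def format_author(name, orcid):
    """Build the author-field string, refusing IDs that fail the checksum."""
    if not is_valid_orcid(orcid):
        raise ValueError(f"not a valid ORCID ID: {orcid}")
    return f"{name} (ORCID: {orcid})"


def parse_author(field):
    """Split an author field back into (name, orcid); orcid is None if absent."""
    match = AUTHOR_PATTERN.match(field)
    if match is None:
        return field, None
    return match.group("name"), match.group("orcid")


if __name__ == "__main__":
    field = format_author("Brian Kelly", "0000-0001-5875-8744")
    print(field)
    print(parse_author(field))
```

A tool which does not know the convention would still see the whole string as the author name, which is the risk discussed next.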

But could this cause problems? Might Google think my name is Mr Orcid or Mr 0000-0001-5875-8744? Might other indexing and aggregation tools have problems, as I am misusing the semantics of these metadata fields? My feeling is that Google will be capable of understanding the content and that it is better to have such quality metadata (which I have chosen) rather than no metadata. But are other researchers embedding ORCID IDs in PDFs? In order to answer this question I used Google’s advanced search capability to search for “ORCID” in PDF resources across a number of domains, as summarised below.

[Image: search for "ORCID" in PDFs in ac.uk domain]

Domain        Results   Date
All             3,840   28 Jan 2013
.ac.uk            109   28 Jan 2013
bath.ac.uk          0   28 Jan 2013

These numbers are low – and when you realise that the results include PDFs which contain the string “ORCID” in the text of the pages (as illustrated) it seems clear that there is little evidence that ORCID IDs are being embedded in PDFs yet.
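For anyone wishing to repeat the searches, the following Python sketch builds equivalent query URLs. It assumes that the advanced search form can be expressed with Google's filetype: and site: operators; the function name is hypothetical, and the counts returned today will differ from those recorded above.

```python
from urllib.parse import urlencode


def orcid_pdf_search_url(domain=None, term="ORCID"):
    """Return a Google search URL for PDFs containing the term.

    domain is optional, e.g. "ac.uk" or "bath.ac.uk"; None searches everywhere.
    """
    query = f'"{term}" filetype:pdf'
    if domain:
        query += f" site:{domain}"
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    # The three scopes used in the table: all domains, .ac.uk and bath.ac.uk.
    for domain in (None, "ac.uk", "bath.ac.uk"):
        print(orcid_pdf_search_url(domain))
```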

So before I embed ORCID IDs in my other papers I would welcome feedback on this proposal. Is it desirable to include the ORCID IDs of authors in the PDF versions of papers? If so, is the approach I have taken to be recommended to others? Or might it be desirable to provide richer structured metadata in PDF files, using the XMP (Extensible Metadata Platform) standard? But if this is felt to be desirable, how would it fit into the workflow, given that it appears difficult to persuade authors to provide metadata for their papers in any case?
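To give a feel for what the XMP route might involve, here is a rough Python sketch which builds a minimal XMP packet body using the Dublin Core properties dc:title, dc:creator and dc:subject. Carrying the ORCID ID inside the dc:creator string simply reuses the convention described above; it is an assumption for illustration, not something the XMP standard prescribes. The function name is hypothetical, and the sketch does not embed the packet in a PDF.

```python
import xml.etree.ElementTree as ET

X_NS = "adobe:ns:meta/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_xmp(title, creators, keywords):
    """Return an x:xmpmeta element (as text) with title, creators and keywords."""
    for prefix, uri in (("x", X_NS), ("rdf", RDF_NS), ("dc", DC_NS)):
        ET.register_namespace(prefix, uri)
    xmpmeta = ET.Element(f"{{{X_NS}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""}
    )

    def add_list(prop, container, values, lang=None):
        # Each Dublin Core property wraps an RDF container of rdf:li items.
        prop_el = ET.SubElement(desc, f"{{{DC_NS}}}{prop}")
        holder = ET.SubElement(prop_el, f"{{{RDF_NS}}}{container}")
        for value in values:
            li = ET.SubElement(holder, f"{{{RDF_NS}}}li")
            if lang:
                li.set(XML_LANG, lang)
            li.text = value

    add_list("title", "Alt", [title], lang="x-default")  # language alternatives
    add_list("creator", "Seq", creators)                 # ordered author list
    add_list("subject", "Bag", keywords)                 # unordered keywords
    return ET.tostring(xmpmeta, encoding="unicode")


if __name__ == "__main__":
    # The keywords here are placeholders, not the tags actually used.
    print(build_xmp(
        "Developing A Holistic Approach For E-Learning Accessibility",
        ["Brian Kelly (ORCID: 0000-0001-5875-8744)"],
        ["accessibility", "e-learning"],
    ))
```

One attraction of this form is that each author is a separate list item, so a richer scheme could later attach an identifier to each one without overloading the name string.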

