UK Web Focus (Brian Kelly)

Innovation and best practices for the Web

Embedded Metadata in PDFs Hosted in Institutional Repositories: An Inside-Out & Outside-In View

Posted by Brian Kelly on 4 Jan 2013

PDF Metadata – Why Is it So Poor?

PDF metadata – why so poor? asked Ross Mounce in a blog post published on New Year’s Eve.

In the post Ross expressed surprise that although “with published MP3 files of audio you get rather good metadata … the results from a little preliminary survey of academic publisher PDF metadata” were poor: “Out of the 70 PDFs I’ve published (meta)data on over at Figshare, only 8 of them had Keywords metadata embedded in them“.

This made me wonder about the quality of the metadata for papers I have uploaded to Opus, the University of Bath repository.

I looked at a paper on A Challenge to Web Accessibility Metrics and Guidelines: Putting People and Processes First which is available in Opus in PDF and MS Word formats.

I first used Adobe Acrobat in order to display the metadata for the original source PDF file, prior to uploading to the repository. As can be seen from the accompanying screen shot the metadata included the title, the author details (with the email address for one of the authors) and two keywords.

However, looking at the display for the PDF downloaded from the repository, we find that no metadata is available!

This PDF differs from the original source in that a cover page is added dynamically by the repository in order to provide appropriate institutional branding. It would appear that in the creation of the new PDF resource, the original metadata is lost.

Looking at the metadata created in the original source document – an MS Word file – we can see the authors’ names, which were subsequently concatenated into a single field. We can also see that although the title of the paper was given correctly, poor keywords had been included, which did not reflect the keywords given in the paper itself (Web accessibility, disabled people, policy, user experience, social inclusion, guidelines, development lifecycle, procurement).
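The same check can be made without opening Word. The following is a minimal sketch, assuming the master is saved in the newer .docx format (a zip package whose embedded properties live in docProps/core.xml); it would not work on an older binary .doc file. The function name `read_docx_core_properties` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Property name we report -> element name inside core.xml.
FIELDS = {"title": "dc:title", "creator": "dc:creator", "keywords": "cp:keywords"}


def read_docx_core_properties(source):
    """Return the embedded title, creator and keywords of a .docx file.

    `source` may be a file path or a binary file object. A missing
    property is reported as an empty string.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    result = {}
    for name, tag in FIELDS.items():
        element = root.find(tag, NS)
        result[name] = (element.text or "").strip() if element is not None else ""
    return result
```

Calling `read_docx_core_properties("paper.docx")` (a placeholder file name) returns a dictionary with the three fields, which makes it easy to spot an empty or poor keywords field before the file is deposited. A package without a docProps/core.xml part raises a `KeyError` in this sketch.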

I suspect that I am not alone in not spending much time in ensuring that appropriate metadata is embedded in the master source of a peer-reviewed paper. I have also previously not considered how such metadata might be lost in the workflow processes when uploading to an institutional repository: after all, surely the important metadata is added when the paper is deposited into the repository?

Ross’s blog post made me check the embedded metadata – and I discovered that the correct metadata is still included in the MS Word file which was uploaded to the repository along with the PDF copy.

Does the loss of the metadata embedded in the PDF matter? After all, surely people will use the search facilities provided in the repository in order to find papers of interest?

But people will not necessarily visit a repository to find papers of interest. A post which described A Survey of Use of Researcher Profiling Services Across the 24 Russell Group Universities showed that on 1 August 2012 there were over 18,000 users of ResearchGate in the 24 Russell Group universities and judging by the messages along the lines of “28 of your colleagues from University of Bath have joined ResearchGate in the last month. Why not follow them today?” which I am currently receiving, use of this service is growing.

As can be seen from the screenshot of my ResearchGate profile, the service provides access to PDF copies of my papers. I normally simply provide a link to the PDF hosted in the repository but the example illustrated contains a copy of the original PDF which was uploaded to the service by one of the co-authors.

In the case of most of my papers it is clear from the thumbnail of the PDF that the paper contains the coversheet provided by the repository.

Researchgate Paper (hosted in Opus)

Discussion

We can see that the PDF copy of a paper hosted in a repository should not be regarded as a final destination; rather the PDF may be surfaced in other environments.

It will therefore be important to ensure that workflow processes do not degrade the quality of the PDF. It will also be important to ensure that authors are made aware of how embedded metadata may be used by services beyond the institutional repository. But to what extent do repository managers feel they have a responsibility to advise on practices which will enhance the discoverability of content on services hosted outside the institution?

In a paper which asked “Can LinkedIn and Academia.edu Enhance Access to Open Repositories?” Jenny Delasalle and I commented on how “commercial publishers are encouraging authors to use social media to drive traffic to papers hosted on publishers’ web sites” and provided examples of such approaches from Taylor and Francis, Springer, Sage and Oxford Journals. As an example, Taylor and Francis describe how they are “committed to promoting and increasing the visibility of your article and would like to work with you to promote your paper to potential readers” and go on to document services which can help achieve this goal.

In a blog post which discussed the ideas described in the paper I described how we had failed to find significant evidence of similar approaches being employed by repository managers:

It was interesting that in Jenny’s research she found that a number of commercial publishers encourage their authors to use services such as LinkedIn and Academia.edu to link to their papers hosted behind the publishers paywalls – and yet we are not seeing institutional views of the benefits of coordinated use of such services by their researchers. Institutional repository managers, research support staff and librarians could be prompting their institutions to make the most of these externally provided services, to enhance the visibility of their researchers’ work in institutional repositories.

But that paper was limited to use of third-party services to provide access routes to research papers. What of the bigger picture in which institutional work flow processes should be designed to enhance discoverability?

The ‘inside-out and outside-in library’

On Wednesday in a post entitled Discovery vs discoverability … Lorcan Dempsey explored the idea of the “inside-out and outside-in library“. In the post Lorcan described how:

Throughout much of their existence, libraries have managed an outside-in range of resources: they have acquired books, journals, databases, and other materials from external sources and provided discovery systems for their local constituency over what they own or license.

However in a digital and network world, there have been two major changes, which shift the focus towards inside-out:

First access and discovery have now scaled to the level of the network: they are web scale. If I want to know if a particular book exists I may look in Google Book Search or in Amazon, or in a social reading site, in a library aggregation like Worldcat, and so on. … Secondly the institution is also a producer of a range of information resources: digitized images or special collections, learning and research materials, research data, administrative records (website, prospectuses, etc.), faculty expertise and profile data, and so on.

Lorcan goes on to describe the challenge facing libraries:

How effectively to disclose this material is of growing interest across libraries or across the institutions of which the library is a part. This presents an inside-out challenge, as here the library wants the material to be discovered by their own constituency but usually also by a general web population.

I would suggest that institutional repositories could usefully adopt the approach taken by Taylor and Francis:

 “[The institution is] committed to promoting and increasing the visibility of your article and would like to work with you to promote your paper to potential readers.”

But rather than simply encouraging researchers to add links to papers deposited in the repository from popular services such as LinkedIn and ResearchGate, might the institutional goal be enhanced by encouraging researchers to make the content of their papers available in such third party services (subject to copyright considerations)? The institutional repository would then provide both a destination and a component in a workflow, with papers being surfaced in services such as ResearchGate, as I have illustrated above.

If such an approach were to be embraced there would be a need to ensure that embedded metadata was not corrupted through repository workflow processes. If, however, the repository is regarded as the sole access point, there would be little motivation to address such limitations in the work flow.
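One way such a safeguard might look is a simple before-and-after comparison around each workflow step. This is only a sketch: it assumes the embedded fields of the deposited file and of the processed file have already been extracted into plain dictionaries by whatever tool is to hand, and the function name `lost_fields` and the sample values are invented for illustration.

```python
def lost_fields(before, after):
    """Return the sorted names of fields that had a value before a
    workflow step but are empty or missing afterwards."""

    def has_value(record, name):
        # Treat None, "" and whitespace-only strings as "no value".
        return bool(str(record.get(name) or "").strip())

    return sorted(
        name
        for name in before
        if has_value(before, name) and not has_value(after, name)
    )


# Placeholder values standing in for a source PDF and the copy
# produced after a cover page has been added.
source_pdf = {"Title": "Example paper", "Author": "A. Author", "Keywords": "one, two"}
repository_pdf = {}

print(lost_fields(source_pdf, repository_pdf))
```

A check of this kind could run whenever a cover sheet is generated, flagging any item where the list is not empty.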

Or to put it another way, repository managers will have a need to manage content hosted within the institution, including management to support the use of the content by services they have no control over.

To a certain extent, this has already been accepted: repositories were designed to have “cool URIs” which can help resources to be discovered by Google. I am suggesting that there is a need to observe usage patterns which indicate emerging ways in which users are finding content. The growing numbers of email alerts from ResearchGate suggest that it may be a service to monitor – with Ross Mounce’s recent post on the quality of metadata embedded in PDFs suggesting one area in which there will be a need to revisit existing workflow processes.

PS. Ross Mounce described “a little preliminary survey of academic publisher PDF metadata” and has published the data on Figshare. Has anyone harvested the metadata embedded in PDFs hosted on repositories and published the findings?
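For anyone tempted to try, the summary step of such a survey is small. The sketch below assumes the harvesting has already been done and that each PDF's embedded fields are held in one dictionary; the function name `field_coverage`, the field names and the sample records are placeholders, with the sample sized only to echo the 8-in-70 keywords figure quoted from Ross's post.

```python
from collections import Counter


def field_coverage(records, fields=("Title", "Author", "Keywords")):
    """Count how many records carry a non-empty value for each field."""
    counts = Counter()
    for record in records:
        for name in fields:
            if str(record.get(name) or "").strip():
                counts[name] += 1
    return {name: counts[name] for name in fields}


# Synthetic sample: 70 records, 8 of which have keywords.
sample = [{"Title": "Paper %d" % i, "Keywords": "a, b" if i < 8 else ""} for i in range(70)]

coverage = field_coverage(sample)
for name, count in coverage.items():
    print("%s: %d of %d" % (name, count, len(sample)))
```

Publishing the raw per-file dictionaries alongside the totals, as Ross did on Figshare, would let others recheck the counts.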



21 Responses to “Embedded Metadata in PDFs Hosted in Institutional Repositories: An Inside-Out & Outside-In View”

  1. A related problem occurred in the Netherlands a couple of months ago. Apparently the company that designed the document templates for most of the government agencies added a title and author in the template file. The result is that thousands of online government documents (.pdf and .doc) are titled “at opinio facillime sumitur” and are written by M. Hes.
    I wrote a blog post about it; it’s in Dutch, but Google Translate does a reasonable job: http://ingmarbladertenschrijft.blogspot.nl/2012/10/dat-is-maar-een-mening-metadata-in.html

  2. Nick said

    Hi Brian

    Need more than 140!

    Historically yes, I have added cover pages to PDFs. I’m not sure how typical my workflow is though as, mainly due to software idiosyncrasies (I don’t have EPrints and its nice workflow), I have always offered a fully-mediated service which typically involves me soliciting a suitable (author produced) version of a paper, usually as a word doc(x) and converting it myself with Acrobat and manually adding a cover page. I believe there is a plug in for EPrints though that automatically adds a cover page? In any case, my manual workflow will just result in a PDF with no metadata unless I add it manually. If these are subsequently picked up by Google I just get “Leeds Metropolitan University Repository” in the search result i.e. indexed from the text of the PDF itself.

    All this has been on my mind recently as we are in the process of implementing Symplectic and integrating with the repository such that I should finally be able to implement a self-deposit workflow, but I anticipate folk are likely to upload word files rather than convert them to PDF…so not sure how my workflow will develop; I’ll obviously offer guidance, but I’m keen to procure content in any format, whether I intervene in the workflow, convert to PDF, add a nice coversheet and metadata depends on how onerous that becomes.

    Nick

  3. […] PDF Metadata – Why Is it So Poor? PDF metadata – why so poor? asked Ross Mounce in a blog post published on New Year’s eve. In the post Ross expressed surprise that although ”with publi…  […]

  4. Hi Brian et al.

    As noted on Twitter, we often convert Word docs into PDFs on behalf of our academics here at City to put them into City Research Online– it’s rare to get “author final” versions in PDF format. I suspect that even when we do get PDFs, it’s rare to have “good” embedded metadata in that PDF.

    On the question of discoverability, I had assumed that the structured metadata provided at Eprint/DSpace/other repository software record level did the job here (as opposed to metadata embedded within the PDF itself). Certainly records in City Research Online are highly ranked in Google and are harvested by Google Scholar, BASE, OAIster etc. If this is the case, does it matter if the rare and patchy instances of author-created metadata gets over-written or otherwise distorted?

    Neil.

    • Hi Neil

      Thanks for the comment.

      Your question “does it matter if the rare and patchy instances of author-created metadata gets over-written or otherwise distorted?” is a good one.

      The example provided by Ingmar Koch (of template metadata which is not updated) illustrates that such embedded metadata is used. In this case, the authors may well have preferred it if the metadata had been lost as part of the creation of a cover sheet! However this does show how embedded metadata is being processed by Google. In addition, as I suggested in my post, we could find that other third party services processed the metadata associated with the object, rather than metadata which is decoupled from the resource.

      However if there are concerns that the metadata will be poor (as in Ingmar Koch’s example) perhaps there is a need to be honest about this, and explicitly state that embedded metadata will be removed prior to the resource being deposited in the repository.

      • Thanks Brian, I see the distinction- not every service will use OAI-PMH or web crawling, some might parse the objects themselves. It looks to me like Word docs we turn into PDFs here at City have garbage contained in the original doc’s metadata, we might have to look at this.

  5. Hi Brian,

    We have had discussions about coversheets in the past, and as you note, have worked to solve some issues. Can I reiterate please, that the coversheet is a policy decision and that actually due to the lack of identifying content on many pdfs we receive, it’s not a bad decision. In light of this, we are not going to drop the coversheet but perhaps there is potential for the Eprints plugin to draw more/better metadata onto the page to aid SEO?

    Am I right in thinking you need to subscribe to ResearchGate in order to view full text?

    Also, what’s so bad about a coversheet from a users POV? Granted the machine-readable issues but I often find it easier to find identifying metadata about a paper from the coversheet than from the document itself, particularly on post-prints. Am I alone here?!

    Kara

    • Hi Kara
      As an example of the problems which can be caused by cover sheets see this example!
      Note I didn’t say that coversheets were bad per se. My post was about the need to keep the metadata, as this may be processed by other tools.
      I would agree with you, however, that it would be useful to investigate how the Eprints plugin could be used to enhance the discoverability of repository items.

      Brian

      • Also note that if you go to the information on ResearchGate on the paper on Approaches To Archiving Professional Blogs Hosted In The Cloud you will see a thumbnail (including the cover page) of the copy hosted on the University of Bath repository and you can download the PDF.

        If you go to the page about the paper on A Challenge to Web Accessibility Metrics and Guidelines: Putting People and Processes First you will be able to view a thumbnail of the PDF (with no cover page) which was uploaded by one of my co-authors. You can also download the PDF.

        In both cases, you do not need to sign in to the service to access the PDFs.

      • Hi Brian,
        This was my experience with ResearchGate – not to question the service or the point you are making – maximising visibility is great – but for information. The first time you click on the link, as a non-registered user this message appears as a pop-up: “You are trying to access the full-text version of A challenge to web accessibility metrics and guidelines: putting people and processes first. Sign up to ResearchGate and request a full-text of this article! ”
        Attempting a second time gave me the full text. So whilst registering is not apparently compulsory, this is not the impression given to first time users.
        BW,
        Kara

      • Nick said

        Whether a cover sheet is less than ideal from a user POV was an issue that arose as part of a subsequent discussion on Twitter – https://twitter.com/mrnick/status/287191698319241217

        Not sure I necessarily see that argument myself and is fairly standard practice from several major publishers as well as repositories I think and strikes me that it’s rather like arguing a book-cover is inconvenient as it’s such a faff to open it and get past that page with the isbn and copyright on it! The effect on pdf metadata and how that is indexed by search engines is a separate issue.

        The example linked above which is originally from our repository at Leeds Met and now has multiple cover sheets is more of a practical issue. Nothing wrong with sourcing papers from other repositories to increase exposure etc; I’ve just uploaded our paper from or2012 in Opus – http://opus.bath.ac.uk/30226/ – to our repository at Leeds Met and replaced the cover sheet with our own – http://repository.leedsmet.ac.uk/main/view_record.php?identifier=7827&SearchGroup=Research

        I’m not sure whether the Portsmouth repo manager felt they should retain the Leeds Met cover sheet to preserve provenance or if it was merely an oversight (I should probably have also removed the existing cover sheet(s) when I originally uploaded!) in any case I agree the final result is somewhat unfortunate with 6 pages before the actual content of the research itself :-/

      • @Kara Yes, when I logged out of ResearchGate I initially thought I had to login to access the full text. I subsequently realised that this isn’t mandatory, but they imply it is in order to maximise subscriptions. Slightly spammy but, as you pointed out, the post wasn’t about ResearchGate per se.

        PS I’ve realised that nesting comments doesn’t really work, which is why this comment is out of sequence.

  6. […] PDF Metadata – Why Is it So Poor? PDF metadata – why so poor? asked Ross Mounce in a blog post published on New Year’s eve. In the post Ross expressed surprise that although ”with publi…  […]

  7. Peter Cliff said

    This just sounds like a workflow problem – if adding the coversheet removes embedded metadata then, assuming you want to keep the embedded metadata, fix the thing that adds the coversheet.

    On the wider issue – resources that lose their way home when out in the wild – I think it makes sense that on ingest any content gets the URL to its catalogue record (the IR page in this instance) embedded. While a coverpage could contain that information, it’d be nicer if it were machine readable.

    I’m aware of two schools of thought in the preservation community about this – one says embed as much metadata as possible in the file such that it is self-describing. The other says a link to a catalogue record is enough. The latter relies on the persistence of the catalogue – which not everyone can guarantee and as such is perhaps more of a risk.

  8. @Nick: The copy of the PDF you have deposited in your repository contains value embedded metadata. Did you create that yourself? If so, then this does not seem to be a scalable solution. As Pete Cliff has pointed out, we are talking about addressing workflow issues. This post was initially about whether adding coversheets would lose embedded metadata and how significant a problem this might be across the sector. However the subsequent discussions (here and on Twitter) have broadened the discussion to include considerations of PDFs which may be taken from one repository and added to another: will this (should this) result in cascading coversheets? Should embedded metadata be preserved during the process?

    • Nick said

      Yes I added it myself and no it clearly isn’t scalable! As I said in my original comment I’m not sure how our workflow will evolve as we implement Symplectic and historically my workflows have tended to be somewhat labour intensive which is a combined result of my less than optimal (research repository) software and my somewhat pedantic nature!

      I wonder if some of these issues might be relevant within the context of the UK RepNet project which is holding a meeting in London on 21st Jan – http://www.rsp.ac.uk/events/supporting-and-enhancing-your-repository/

  9. […] Brian Kelly, has taken this a slightly different direction and looked at the metadata of PDFs in institutional repositories. I hadn’t realise this but apparently some institutional repositories (IRs) universally add […]

  10. […] Institutional repository managers, research support staff and librarians could be prompting their institutions to make the most of these externally provided services, to enhance the visibility of their researchers' work in …  […]

  11. […] Institutional repository managers, research support staff and librarians could be prompting their institutions to make the most of these externally provided services, to enhance the visibility of their researchers' work in …  […]

  12. […] Embedded Metadata in PDFs Hosted in Institutional Repositories: An Inside-Out & Outside-In … […]

  13. […] As described previously workflow processes used in the creation of cover sheets for items hosted in our repository means that metadata embedded in PDFs is lost. Although we’re having discussions with repository staff about this, it occurred to me that I now have an ideal opportunity to make use of a third-party repository service. […]
