UK Web Focus (Brian Kelly)

Innovation and best practices for the Web

File Formats For Papers In Your Institutional Repository

Posted by Brian Kelly on 7 Jul 2010

File Formats I Have Used to Deposit Items in the Bath Institutional Repository

What file formats should you use to deposit papers in your institutional repository? Although I recently suggested that RSS could have a role to play in allowing the contents of a repository to be syndicated in other environments, that post didn’t address the question of the preferred file format(s) for mainstream resources such as peer-reviewed papers.

For my papers in the University of Bath Opus repository I initially deposited both the original MS Word file and the PDF version which is normally submitted to the journal or conference: the MS Word file is the original source material which is needed for preservation purposes and the PDF file is the open standard version which should be more resilient to software changes than the MS Word format.

What I hadn’t done, though, was to deposit a HTML version of my papers, despite the fact that I normally create such files. I suspected that uploading HTML files into a repository might be somewhat complicated, so when I uploaded my papers I omitted the HTML versions.

Problems With PDFs

Image: PDF cover page for a paper in the Opus repository

However, when I recently viewed the repository copy of the PDF version of my paper on “Library 2.0: Balancing the Risks and Benefits to Maximise the Dividends” I discovered that such papers have a cover page appended as shown.

Having recently been a co-facilitator on a series of workshops on “Maximising the Effectiveness of Your Online Resources” I am well aware of best practices to help ensure that valuable resources can be easily discovered by search engines. And although papers in the repository do have a ‘cool URI’, prefixing the content of all papers in the repository with the same words (“University of Bath Open Online Publications Store” followed by “” and “This version is made available in accordance with publisher policies. Please cite only the published version using the citation below.”) goes against best practices for Search Engine Optimisation.

The cover page isn’t the only concern I have with the use of PDFs in institutional repositories. Despite PDF being an ISO standard, not all PDF creation programs will necessarily create PDFs which conform with the standard, with papers containing mathematical formulae or scientific notation being particularly prone to failing to embed the fonts needed to provide a resource suitable for long-term preservation. I also suspect that, although it is possible to create accessible PDFs, many PDF files stored in repositories will fail to conform with PDF accessibility guidelines.

Providing HTML Versions of Papers

In light of these reservations I have decided to provide a HTML version of my recent papers in the University of Bath institutional repository. So my paper on “From Web Accessibility to Web Adaptability” (for which the publisher’s embargo has recently expired) is available in HTML as well as PDF formats.

As I suspected, however, depositing the HTML version of the paper was slightly tricky. I uploaded the paper using the Upload from URL option and this initial attempt resulted in the page’s navigational elements and search interface being embedded in the page. And since the upload mechanism only uploads files which are ‘beneath’ the paper in the underlying directory structure, the page’s style sheet was not included. In short, the page looked a mess.

Since the HTML files I have created contain the contents of the paper separately from the page’s navigational elements it was not too difficult to create a very simple HTML file which I included (with the citation details appended at the end of the paper) in the resource which is available in the repository. As can be seen the contents are available even if the page is not visually appealing.
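The approach described above, pulling the paper's content out of a page and wrapping it in a very simple HTML file with the citation at the end, might be sketched in Python as follows. This is only an illustration and not the process actually used for the Opus deposit: the names `ContentExtractor` and `build_standalone_page` are hypothetical, and the sketch assumes that the paper's content sits inside a single element with a known `id` and that the markup uses explicit closing tags.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collect the markup found inside the element with a given id.

    Hypothetical helper: assumes explicit closing tags throughout.
    """

    def __init__(self, content_id):
        # Keep character references as written so they pass through untouched.
        super().__init__(convert_charrefs=False)
        self.content_id = content_id
        self.depth = 0    # 0 means we are outside the content element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.content_id:
            self.depth = 1    # entered the content element itself

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied without changing depth.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth:    # skip the content element's own closing tag
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def build_standalone_page(source_html, content_id, title, citation):
    """Return a minimal, self-contained HTML page for deposit."""
    extractor = ContentExtractor(content_id)
    extractor.feed(source_html)
    extractor.close()
    body = "".join(extractor.parts).strip()
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n<head>\n<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n</head>\n<body>\n"
        f"{body}\n"
        f"<p>{escape(citation)}</p>\n"
        "</body>\n</html>\n"
    )
```

Calling `build_standalone_page(page_source, "paper", paper_title, citation_text)` would return a page containing only the paper and its citation, with no navigation, search interface or external style sheet. Passing the title explicitly also helps to avoid a `<title>` inherited from whatever template the original page was based on.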

There are, of course, resource implications in creating HTML versions of papers. However, it will be interesting to see if providing content which is more easily found in Google provides benefits in enhancing access to papers which are provided in HTML format – and since resource discovery is one of the main aims of a repository, it might be argued that resources should be provided to ensure that HTML versions of papers are made accessible.

But What About Richer XML Formats?

The purist might argue that whilst HTML is an open and Web-native resource it may not be rich enough for use with peer-reviewed papers. I have some sympathy with such views. Anthony Leonard has described how we should go about “Fixing academic literature with HTML5 and the semantic web”. I would agree that there’s a need to explore how HTML5 can be used in the context of institutional repositories.

But mightn’t there be another XML format we should consider? How about an open format which is widely supported and deployed and which, for many authors, will not require any changes to their authoring environment? The format is OOXML – an ECMA standard which has also been standardised as an International Standard (ISO/IEC 29500). However, not all open standards are equally open, and this standard is based on Microsoft’s format for their office applications; as Wikipedia describes, “the ISO standardization of Office Open XML was controversial and embittered”.

In light of this discussion, what format(s) would you recommend for use with institutional repositories?

12 Responses to “File Formats For Papers In Your Institutional Repository”

  1. Ben Toth said

    The HTML version is fine. It would be nice to have content in a standard elegant format. But the format is less important than the fact that it is there, at least for now, since the principal requirement of a repository is that it holds something more than metadata about a paper.

  2. […] This post was mentioned on Twitter by Brian Kelly, Anthony Leonard. Anthony Leonard said: @briankelly Thanks for the reference! #blowingmyownvuvuzela […]

    • I should have known that tweet would get picked up here! Genuine thanks for quoting my article.

      I am convinced that the standards that last are those that are future-proof and (more importantly) those that are trusted. It’s a “brand” thing. End-users like to see HTML versions of content in repositories because when they click that link they know what to expect, and trust that it will probably be good enough for them. HTML lacks trust amongst publishers because poorly done it leads to wonky rendering in different browsers/platforms. Worse, HTML includes javascript, which has a huge potential for error by either author or browser. By way of illustration, no one ever tried to code Pac Man in a PDF document, for good reason. (Having said that I bet someone’s tried it in a Word document!) Ultimately a dodgy end-user experience hurts the publisher’s brand.

      Of course HTML is designed for dynamic hyperlinked content capable of incorporating external media, HTML, data, even Javascript code, on-the-fly, all of which may change (or not be found) over time. This is a world away from traditional academic papers which are self-contained artifacts frozen in time with carefully stated references where needed – essentially making use of the same literary technique as first pioneered by the Royal Society. Any process with that sort of pedigree is trusted, and meddled with at your peril. Nevertheless my article was originally commenting on a well-argued Biochemical Journal review article that suggested that overturning this literary technique by moving towards more linked research outputs was precisely what is urgently needed to address the mess that is literature review in this day and age.

      Perhaps there is an analogy here between Google and Apple’s respective strategies. Google would say “don’t fight the internet” and argue that HTML is the format that will run and run in a networked world. Apple would argue that papers are just another kind of media which needs high production values to protect end user experience and the publisher’s brand (oh yes, and Apple’s brand too) and favour “quality assurable” formats. I don’t know whether OOXML or even ePub meet this criterion, but an acid test for any of these alternatives is how well mathematics is typeset. Now there’s a challenge that HTML consistently fails to rise to, though with HTML5 who knows? Which would you prefer – academic papers that download and render beautifully without fail on your iPad or the printed page, or papers that have all the power of dynamic interlinked web pages to enhance your research potential?

  3. Marcus Tucker said

    Though it’s not ideal, you could embed the CSS within the HTML document, making the document completely self-contained.

    • Yes, I agree. I did wonder, though, whether the repository should have a standard style sheet itself which made the paper somewhat more attractive. If this was deployed I’d want to avoid my style sheet conflicting with the official one. The official style sheet might, for example, include institutional branding.

      • Marcus Tucker said

        Yes, that would make particularly good sense if your Word document was based upon an official template, and if the official style sheet then mirrored that official template, so that all versions (Word/PDF/HTML) were visually consistent with each other.

  4. Chris Rusbridge said

    I’ve been worrying for a while that HTML was a mess when saving for future use, so I really liked to read this. But you were asking about the “appropriate” formats. I know nothing about HTML5 so won’t comment on that option. But I think it’s safe to assume for the moment that NO FORMAT is completely safe for all content for very long term preservation, whereas most common formats are pretty safe for most content for medium term preservation.

    By “most common formats” I mean Word, ODF, PDF and HTML. Pretty safe means you have an extremely good chance of being able to read almost everything in the file, even if the look and feel has changed slightly. As of now, medium term qualifies as at least 15 years based on the files available to me on this laptop (only PowerPoint 4 files are not readable, as far as I can find out).

    If some content (you quote maths etc) is known to be fragile in certain formats, it seems reasonable to include more than one format. So the future reader has an alternate should the first choice fail.

    I’ve heard that DOCX is risky (in that there are parts not fully defined in the standard), but I had a case just this week where a .docx file (assuming that is in OOXML) would not load into my Word 2004 via the converter tool, but did load into OpenOffice.

    It’s usually better than you think!

    • Hi Chris
      Thinking about the issue of ‘safe’ formats for long-term preservation makes me realise that while LOCKSS (Lots of Copies Keep Stuff Safe) has always meant lots of *copies* of the resource, I am now thinking that it can mean lots of *versions*. PDF might be fine if you want to preserve the look-and-feel, but you may want something richer if you want to preserve the structure, and the source material if you want to avoid any losses in conversion.

  5. Marcus Tucker said

    I’ve just read your paper “From Web Accessibility to Web Adaptability” and have noticed that the document’s TITLE is seemingly unrelated: “Empowering Users and Institutions: A Risks and Opportunities Framework for Exploiting the Social Web”.

    Presumably this is an authoring error caused by using an existing HTML document as a template?

    I suggest that you fix this to ensure that your paper is indexed correctly by search engines, etc.

  6. […] used. It appears that this reflects the software used to create the PDF cover page (which I have written about recently) rather than the tools used to create the main PDF resource. If you are unfamiliar with such cover […]

  7. […] a post entitled “File Formats For Papers In Your Institutional Repository” I suggested that depositing a HTML version of a paper might have various advantages over the […]
