UK Web Focus (Brian Kelly)

Innovation and best practices for the Web

EPub Format For Papers in Repositories

Posted by Brian Kelly on 4 Aug 2010

EPub as a Format for Use in Institutional Repositories?

In a post entitled “File Formats For Papers In Your Institutional Repository” I suggested that depositing a HTML version of a paper might have various advantages over the PDF format which is the norm. But in light of the growing importance of mobile devices wouldn’t it seem appropriate to make such papers available in the EPub format?

EPub is described in Wikipedia as “a free and open e-book standard by the International Digital Publishing Forum (IDPF)”. The article goes on to add that “EPUB is designed for reflowable content, meaning that the text display can be optimized for the particular display device used by the reader of the EPUB-formatted book. The format is meant to function as a single format that publishers and conversion houses can use in-house, as well as for distribution and sale.”

In terms of the open standards used EPub consists of three specifications:

  • Open Publication Structure (OPS) 2.0, contains the formatting of its content.
  • Open Packaging Format (OPF) 2.0, describes the structure of the .epub file in XML.
  • OEBPS Container Format (OCF) 1.0, collects all files as a ZIP archive.

The article states that “EPUB internally uses XHTML or DTBook (an XML standard provided by the DAISY Consortium) to represent the text and structure of the content document and a subset of CSS to provide layout and formatting. XML is used to create the document manifest, table of contents, and EPUB metadata. Finally, the files are bundled in a zip file as a packaging format.”
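To make the three layers concrete, here is a minimal Python sketch of how such a package might be assembled using only the standard library. The function names, the `OEBPS/content.opf` and `paper.xhtml` paths and the sample identifier are illustrative assumptions, and the sketch leaves out the NCX table of contents that the OPF 2.0 specification expects, so the result should be treated as an illustration of the layout rather than a validated EPub file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# OCF layer: container.xml points the reading system at the OPF package file.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# OPF layer: metadata, manifest (list of files) and spine (reading order).
# The file name paper.xhtml is an assumption made for this sketch.
OPF_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">{identifier}</dc:identifier>
  </metadata>
  <manifest>
    <item id="paper" href="paper.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="paper"/>
  </spine>
</package>
"""


def build_minimal_epub(title, identifier, xhtml):
    """Return the bytes of a single-document EPub-style ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry comes first and is stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf",
                    OPF_TEMPLATE.format(title=escape(title),
                                        identifier=escape(identifier)))
        # OPS layer: the XHTML content document itself.
        zf.writestr("OEBPS/paper.xhtml", xhtml)
    return buf.getvalue()


def find_opf_path(epub_bytes):
    """Read META-INF/container.xml and return the path of the OPF file."""
    ns = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        root = ET.fromstring(zf.read("META-INF/container.xml"))
    return root.find("c:rootfiles/c:rootfile", ns).get("full-path")
```

Calling `build_minimal_epub("My paper", "urn:uuid:example", xhtml_string)` returns bytes that could be written to a `.epub` file, and `find_opf_path` shows how a reader would locate the package description inside it.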

Using the EPub Format

[Images: paper in EPub format showing an embedded image; paper in EPub format showing page-turning]

This sounds interesting, so I converted the HTML version of my recent paper on “Empowering users and their institutions: A risks and opportunities framework for exploiting the potential of the social web” into EPub format and added it to my library of ebooks on my iPod Touch using the Stanza application.

The accompanying images show how the paper is displayed. The first image illustrates the page turning style of navigation provided using EPub and the second image illustrates an embedded image.

The paper is also available from Opus, the University of Bath’s institutional repository service. I should mention that the URL for the EPub file is http://opus.bath.ac.uk/17484/5/i4.epub. I discovered that entering the URL into a browser on my iPod Touch allowed me to view the document in the Stanza application. On a normal PC users will probably not have a viewer set up to render this format, which may cause some confusion.

As might be expected for a format which uses XHTML, the conversion from the XHTML original was a simple operation. I should add that I also experimented with converting a PDF version of the paper to EPub, but this resulted in various problems due, I think, to the way in which the two columns used in the paper were linearised.

Revisiting the Issue of Formats for Use in Repositories

This initial experiment seemed to show that creating an EPub version of a paper in a repository can be done quite easily. However, the ease of doing this may have been due to the availability of a HTML version of the paper; doing this on a large scale may be time-consuming if HTML versions of papers are not available.

Let’s revisit the question: what formats for papers should we be seeking to deposit in institutional repositories?

From a preservation perspective the advice from archivists tends to be that you should preserve the original master copy. In many cases this is likely to be MS Word, although other popular formats will probably include Open Office and LaTeX.

From an interoperability perspective an open standard is preferable. I would suggest that rather than making use of a specific DTD designed for scholarly publishing we should use a well-established and popular existing open format – HTML (in whatever version).

If we wish to maximise the take-up of our repositories whilst minimising the effort in processing the files, it seems to me that we should explore ways of creating derivative versions from the master source. So rather than uploading a PDF, shouldn’t we be uploading the master file and creating a PDF automatically from this resource? And rather than creating an EPub file, as I have done, shouldn’t the repository software create the EPub file from a HTML version of the file? And whilst I acknowledge that authors may not wish to make their original document (in, say, MS Word or Open Office format) available to others and would regard the interoperability aspects of PDF as a feature rather than a flaw, there should be nothing to stop the master file being stored in the repository but not openly accessible.
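As a rough illustration of this idea, the following Python sketch plans which files a repository might hold for a single deposit: a master file that is stored but not openly accessible, plus derivatives generated from it. The `DERIVATIONS` table, the `RepositoryFile` class and the `plan_deposit` function are all hypothetical names invented for this sketch; no existing repository software is assumed to work this way, and the actual format conversions are left out.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

# Hypothetical derivation rules: source format -> formats generated from it.
# A word-processor or LaTeX master yields HTML and PDF; HTML yields EPub.
DERIVATIONS = {
    "docx": ["html", "pdf"],
    "odt": ["html", "pdf"],
    "tex": ["html", "pdf"],
    "html": ["epub"],
}


@dataclass(frozen=True)
class RepositoryFile:
    fmt: str                      # file format, e.g. "docx" or "epub"
    derived_from: Optional[str]   # format it was generated from, None for the master
    public: bool                  # whether the file is openly accessible


def plan_deposit(master_fmt, master_public=False):
    """List the master file and every derivative reachable from it.

    The master is kept private by default; all derivatives are public.
    """
    files = [RepositoryFile(master_fmt, None, master_public)]
    seen = {master_fmt}
    queue = deque([master_fmt])
    while queue:
        source = queue.popleft()
        for target in DERIVATIONS.get(source, []):
            if target not in seen:
                seen.add(target)
                files.append(RepositoryFile(target, source, True))
                queue.append(target)
    return files
```

For a Word master, `plan_deposit("docx")` lists the private master followed by public HTML, PDF and EPub derivatives, with the EPub recorded as derived from the HTML version.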

Is anyone thinking along these lines?



22 Responses to “EPub Format For Papers in Repositories”

  1. ePub is rapidly becoming the standard for published digital books (eg in Apple’s Book Store and places like Feedbooks), so I think it’s probably a pretty safe bet.

    There’s a difference between distribution and production formats though – ePub (and PDF) are more suitable for distribution, whereas Word / Pages / Open Office file formats are really for in-production work (and should be privately saved in case further editing is required).

    That said, there are probably better tools for natively producing ePub books now…

  2. […] EPUB Format For Papers in Repositories by: Brian Kelly […]

  3. I’m a technologist and not a repository manager, but I would definitely support moves to use ePub as a standard format for electronic copies of academic journals, and move away from PDFs. This is about who controls the standard. PDFs are openly specified but they have become bloated under the stewardship of Adobe. The International Digital Publishing Forum consists of: “academic, trade and professional publishers, hardware and software companies, digital content retailers, libraries, educational institutions, accessibility advocates and related organizations”, which sounds like a better bet.

    I downloaded your ePub file easily onto my iPad and it’s beautifully readable, although sadly the iPad version of Stanza doesn’t do page turning for some reason. I also notice that linked web pages in your paper can be rendered nicely within the Stanza app with the option of popping out to Safari if you want. In my comment on your previous post I pointed out discussions afoot concerning whether academic outputs need to be more dynamic and interlinked (like AJAX enriched HTML pages) and less self contained. For the less radically minded however I think the ePub format will cover the current practice of producing “frozen in time” academic papers admirably.

    Again in my previous comment I wondered how well ePub might handle maths typesetting. In fact plastex can apparently do this. I couldn’t find an ePub example, but the HTML example of plastex output shows a maths equation rendered nicely as a generated image. That’s not quite as accessible perhaps as using embedded fonts in a future HTML5 world, but it’s still a very nice solution for the majority of today’s users.

  4. One thing I’d point out quickly: if you have an iPhone, iPod Touch or iPad, you can import ePub files directly into iTunes, and then view them in the iBooks application on any of those devices.

    There’s not a huge difference between iBooks and Stanza, but I find the added functionality of being able to manage books (and PDFs) in iTunes to be helpful.

  5. Mike Cook said

    EPUB is a very viable solution, and certainly a format that is going to have a long shelf life. However, it must be remembered that EPUB will not be suitable for all documents. EPUB is a reflowable format and documents will often be viewed on smaller screens. Any source document that has a complicated layout structure, or a lot of programming code, is not going to render well, so PDF may be more suitable in these situations. This is one reason why publishers like O’Reilly don’t offer all their titles in EPUB.

    >> From a preservation perspective the advice from archivists tends to be that
    >> you should preserve the original master copy…this is likely to be MS Word

    A “Master Format” is a very solid approach to take, though I’d be cautious about using something like MS Word. My approach is to use TEI (Text Encoding Initiative), an XML format. From the same source you can also produce both EPUB and PDF, to suit your audience’s needs.

    If the content of the source document changes, or you wish to change the presentation style of your EPUBs, then you can make the change in just one place, update the XSLT, and immediately all the output formats will reflect these changes. Although I don’t believe it will, if EPUB were ever to disappear, you can easily write a new stylesheet to produce whatever new format you like.
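The single-source idea described here can be sketched in a few lines of Python. This is only a stand-in for the XSLT stylesheets mentioned in the comment: it uses the standard library’s ElementTree rather than XSLT, and the tiny TEI sample, the `HTML_TAGS` mapping and the function names are assumptions made for illustration. Changing the mapping in one place changes the XHTML produced for every document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# A deliberately tiny TEI sample, invented for this sketch.
TEI_SOURCE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <head>Chapter One</head>
    <p>It was a <hi>dark</hi> night.</p>
  </body></text>
</TEI>"""

# Presentation rules live in one place: TEI element -> XHTML element.
HTML_TAGS = {"head": "h1", "p": "p", "hi": "em"}


def render_xhtml(elem):
    """Recursively render one TEI element as an XHTML fragment."""
    name = elem.tag.replace(TEI_NS, "")
    inner = escape(elem.text or "")
    for child in elem:
        inner += render_xhtml(child) + escape(child.tail or "")
    tag = HTML_TAGS.get(name)
    # Elements without a mapping contribute only their content.
    return f"<{tag}>{inner}</{tag}>" if tag else inner


def tei_to_xhtml(source):
    """First output format: XHTML, suitable as EPub content."""
    body = ET.fromstring(source).find(f".//{TEI_NS}body")
    return "\n".join(render_xhtml(block) for block in body)


def tei_to_text(source):
    """Second output format from the same source: plain text."""
    body = ET.fromstring(source).find(f".//{TEI_NS}body")
    return "\n\n".join("".join(block.itertext()) for block in body)
```

Both `tei_to_xhtml(TEI_SOURCE)` and `tei_to_text(TEI_SOURCE)` read the same master; a third output format would be one more function rather than a new copy of the content.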

    I haven’t seen a lot of UK eBook activity to date, and as a Brit myself, it’s great to see more people thinking about EPUB.

    • Thanks for the reply – it’s useful to read about the type of resources for which EPub is not suitable.

      I agree with you about the value of TEI. However in my environment most papers tend to be created in MS Word, and sometimes LaTeX or Open Office.

      I agree with you that there doesn’t seem to have been much activity in the ebook area in the UK to date – this is a reason why I’ve started to explore various issues surrounding EPub.

  6. I think an interesting approach would be to look at electronic theses and dissertations. These are publications that each institution controls, and where it can basically demand anything (reasonable) it wants from its students, who wish to graduate… At my institution we have a deposit mandate, which is great. However, instead of specifying the exact margins and fonts to be used, why not give us a DTD? Or some other easy way of authoring a structured document? This would make it much more future-proof, and also enable creation of different versions – letter-sized double spaced manuscripts might be great for proofing, but not good for final readers. I’d much rather have a nicely formatted PDF, a clean HTML of each chapter, a nice epub for my ebook reader, or even ship it straight to Lulu to get a nice clean bound book back… Why do we have to choose?

    (I’m still not sure which format and production process would be best. The NIH DTDs for academic publishing seem very robust and future-proof, but there would have to be an easy way to generate the content, with stylesheets or macros for Word/OOffice etc.)

    • Thanks for the comment. Back in December 2006 I wrote a post entitled Accessibility and Institutional Repositories in which I argued the need for institutions to educate researchers on best practices for enhancing the accessibility of research papers. These ideas were further developed in a paper on “Accessibility 2.0: People, Policies and Processes” which describes the need to address issues including:

      • Education: Development of an educational strategy to ensure that depositors of resources are made aware of accessibility issues and techniques for addressing such issues.
      • Monitoring: Monitoring tools used to create papers and formats used for depositing and prioritising training and technical developments based on popular tools.
      • Work flow evaluation: Evaluating work flow processes to ensure that accessibility features used are not discarded.
      • Technical innovation: Monitoring technical innovation which may help in making resources more accessible.
      • End user support: Development of policies for supporting users who may not be able to access resources.

      Your suggestions fit in with these ideas.

    • The ETD movement is littered with attempts to use DTDs and coerce people into using structured authoring tools like XML editors. As far as I know none of these have been successful, and what happens is they end up falling back on word processor input – but typically the source files don’t have the required level of conformance, so you end up with lots of human effort. See my recent post explaining why I think word processor -> HTML / ePub is a much better process.

  7. […] Brian Kelly has been exploring EPub delivery, and there’s a really good discussion going on in the comments of his blog where people are […]

  8. I’d like to see scholarly publishers providing EPUB in addition to PDFs. With an XML-based workflow, that should be easy. Reformatting in repositories is a little harder because authors tend to pay attention to document layout rather than structure. Document specifications (DTDs, layouts, …) can mitigate this in some cases (theses, formal publications).

    @Mike – If you’ve got examples, that would be great; my impression was that O’Reilly published everything electronically, in HTML and PDF and EPUB. Definitely, it’s easier to design for a single size than for reflowability. But figuring out how to metaphorically “downsample” as screen sizes shrink (and grow) would be useful. I’m really sick of trying to read portrait PDFs on landscape screens, so even moving these fixed-layout forms to landscape mode (like the ASIST Bulletin does) would be really useful to me.

    • Mike Cook said

      O’Reilly are continually trying to provide their titles in several formats, but due to some technical limitations this is not always possible. An example title is their Mastering Regular Expressions (http://oreilly.com/catalog/9780596528126/), which is only available in PDF. I remember a post somewhere explaining the reasons why, but I’m struggling to find the link–I’ll keep looking.

      All the books on my site (epubBooks.com) have XML masters, which are then converted to EPUB, but these are only novels so there’s not a lot of complicated layout to worry about.

      Earlier this year I was working with a company on some technical documents for the Swedish Standards Institute and how best to store (XML) and deliver (EPUB) these documents. I can’t provide examples of these but from the small sample base we ran, nothing major really showed up and everything rendered fine on Sony Reader sized screens, if not smaller, without issue.

      My background is not such that I know the kind of content that you are expecting to be published, but I’ll bet it can all be stored in linear XML.

      Peter mentions that there’s been no successful XML editor. I know O’Reilly store all their content in DocBook, but then pretty much all their authors are technical writers–we can’t expect everyone to have those skills. There have been the same sort of discussions in many publishing/eBook blogs and the same conclusions come from them: we need an easy-to-use editor/word processor that controls the user’s input (keeping things well formed) yet still allows freedom to produce visually pleasing documents. I don’t see anything arriving soon.

      • >> I know O’Reilly store all their content in DocBook

        Almost all of their content is stored in DocBook, but not 100%. Note also that they actually have a more diverse range of authors than most would expect, so your argument about skills isn’t actually as strong as you might believe, although I agree that there’s little downside to a range of tools for every level of ability.

      • Mike Cook said

        Thanks for setting me straight Keith :)

        Does O’Reilly recommend any particular editor for their less technical authors to use, or do they just work with whatever is sent to them?

  9. I’d also add, for keeping up with EPUB, I highly recommend ThreePress’s blog.

  10. Many thanks for the perspective on ePub. Much agreed that there’s a lot there, and that ways to produce it easily will be important. I’m one of the devs on a project called Anthologize, which lets you take your content in WordPress and publish as ePub. Still alpha and lots of bugs, but I’m hoping it will be promising in this way. That’s especially important to me in education, as a possible way to archive the scholarly work and student work taking place in blogging platforms.

    Thanks!

  11. You’re making two important points for this context, Mike:

    1. Reflowable formats like EPUB will continue to improve and be a good choice for an expanding range of titles, but there will always be holdouts whose design or other elements are not preserved in reflowable formats (a big deal if you are striving for a 100%-of-everything preservation format [an unrealistic goal, actually])
    2. In many situations, storing a high-quality semantic XML representation of a published work is attractive (even if you never distribute it in this format to the outside world), but creating these formats has the same limitations as 1 (usually) and can be expensive. The most pragmatic solution to this problem is to support and encourage authors who are able to write in these formats with various carrots and then to simply convert (typically outsourced) the finished manuscript when it’s completed.

    >> Does O’Reilly recommend any particular editor for their less technical authors to use, or do they just work with whatever is sent to them?

    I’m not sure if this has evolved since I left, but O’Reilly still likes XMLMind and oXygen. (I recommend oXygen in quite a few contexts.)

  12. […] EPub Format For Papers in Repositories […]

  13. […] for Augus… on Should Event Web Sites Be The …Newsletter for Augus… on EPub Format For Papers in…Newsletter for Augus… on Delivering Blog Posts By Email…Newsletter for Augus… […]

  14. […] indicated by my post on EPub Format For Papers in Repositories I’ve an interest in the potential EPub format so this blog provides an opportunity for […]

  15. […] networking environments of tweeting links to the post.  As illustrated below the blog post on EPub Format For Papers in Repositories has received 4 […]
