UK Web Focus (Brian Kelly)

Innovation and best practices for the Web

What Formats For Peer-Reviewed Papers?

Posted by Brian Kelly on 17 May 2010

Formats for my Papers

The papers I’ve written which have been published in peer-reviewed journals, conference proceedings or have been included in other types of publications have been listed on my papers page on the UKOLN Web site since my first papers were published in 1999. More recently I have made use of the University of Bath’s institutional repository –  OPUS.

Wherever possible I have tried to provide access to the paper itself. But what formats should I provide? The papers are initially written using MS Word and a PDF version is submitted to the publishers. I normally try to provide access to both formats, and also create a HTML version of the paper. The MS Word version is the master source, and so is the richest format; the PDF version provides the ‘electronic paper’, which preserves the page fidelity; and the HTML format is the most open and reusable format. So all three formats have their uses.

But none of these formats are particularly ‘embeddable’. And even the HTML format is normally trapped within the host Web site. The HTML file also contains navigational elements in addition to the contents of the paper.

Shouldn’t the full contents of papers be provided in an RSS format, allowing the content to be easily embedded elsewhere?  And wouldn’t use of RSS enable the content to be reused in interesting ways?

Creating an RSS Format for a Paper

As an experiment I have created an RSS file for my paper on “Deployment Of Quality Assurance Procedures For Digital Library Programmes” which I wrote with Alan Dawson and Andrew Williamson for the IADIS 2003 conference.

As well as the MS Word and PDF formats of the paper I had also created a HTML version. The process for creating the RSS file was to copy and paste contents of the HTML file (omitting navigation elements of the page) into a WordPress blog. I then viewed the RSS file using the WordPress RSS view of a page and copied this RSS file to the UKOLN Web site.
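
The copy-and-paste route through WordPress could also be approximated with a short script. The following is only a minimal sketch, not the process described above: it assumes the HTML body of the paper (without navigation elements) is already available as a string, and it follows the WordPress convention of carrying the full content in a `content:encoded` element wrapped in CDATA. The function name `paper_to_rss` and the example URL are hypothetical.

```python
from xml.dom.minidom import Document

# Namespace of the RSS "content" module, which WordPress feeds use for full text.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"


def paper_to_rss(title, link, html_body):
    """Wrap the HTML body of a paper in a single-item RSS 2.0 feed (hypothetical helper)."""
    doc = Document()

    rss = doc.createElement("rss")
    rss.setAttribute("version", "2.0")
    rss.setAttribute("xmlns:content", CONTENT_NS)
    doc.appendChild(rss)

    channel = doc.createElement("channel")
    rss.appendChild(channel)

    def add_text(parent, name, text):
        element = doc.createElement(name)
        element.appendChild(doc.createTextNode(text))
        parent.appendChild(element)

    # Required channel-level elements in RSS 2.0.
    add_text(channel, "title", title)
    add_text(channel, "link", link)
    add_text(channel, "description", "Full text of the paper")

    # The paper itself is the only item in the feed.
    item = doc.createElement("item")
    channel.appendChild(item)
    add_text(item, "title", title)
    add_text(item, "link", link)

    # The HTML is carried as an opaque CDATA section rather than as XML structure.
    encoded = doc.createElement("content:encoded")
    encoded.appendChild(doc.createCDATASection(html_body))
    item.appendChild(encoded)

    return doc.toxml(encoding="utf-8").decode("utf-8")


if __name__ == "__main__":
    print(paper_to_rss(
        "Deployment Of Quality Assurance Procedures For Digital Library Programmes",
        "http://example.org/papers/qa-paper.html",  # placeholder URL
        "<h2>Introduction</h2><p>Body of the paper.</p>",
    ))
```

The resulting file can then be placed on a Web server and added to an RSS reader in the same way as the WordPress-generated version.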

Using the RSS Format

[Image: Display of RSS view of paper in Netvibes]

My first test was to add the RSS version of the paper to Netvibes. As you can see, the Netvibes RSS viewer successfully rendered the page.

It should be noted, however, that internal anchors (i.e. links to the references) did not link to the references within the RSS view, but back to the original paper.
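
One way to see which links are affected is to list the internal anchors in the HTML before it is placed in the feed. This is a minimal sketch under stated assumptions: the page URL is a placeholder, `internal_anchors` is a hypothetical helper, and a link is treated as internal if it is fragment-only or points at a fragment of the paper's own URL.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class AnchorAudit(HTMLParser):
    """Collect the fragment identifiers of links that point within the paper itself."""

    def __init__(self, page_url):
        super().__init__()
        # Compare against the page URL with any fragment removed.
        self.page_url = urldefrag(page_url)[0]
        self.internal = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        base, fragment = urldefrag(href)
        # "#ref1" has an empty base; "paper.html#ref1" has the page itself as base.
        if fragment and base in ("", self.page_url):
            self.internal.append(fragment)


def internal_anchors(html_body, page_url):
    audit = AnchorAudit(page_url)
    audit.feed(html_body)
    audit.close()
    return audit.internal


if __name__ == "__main__":
    sample = '<p>See <a href="#ref1">[1]</a> and <a href="http://example.org/other">this</a>.</p>'
    print(internal_anchors(sample, "http://example.org/paper.html"))
```

Links found this way are the ones a reader is likely to resolve against the original page rather than within the RSS view.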

I also tried FeedBucket, another Web-based RSS reader. In this case, as can be seen, the tool only displayed the first 500 characters or so of the paper. This seems to be a feature of a number of RSS tools which only provide a summary of the initial content of an RSS feed, with a link being provided for the full content.

[Image: Wordle View of Paper]

Since the content of the paper is available without the navigational elements and other possibly distracting content which may be provided on a HTML page, it is possible to analyse the contents of the paper. For this I used Wordle – if you wish you can view the Wordle cloud for the paper.
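
The same kind of analysis can be sketched without Wordle. The following is an illustrative example only, not how Wordle works internally: it strips the markup from the HTML content of the paper and counts the most frequent words. The stop-word list and the function name `word_frequencies` are assumptions made for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# A deliberately tiny stop-word list, for illustration only.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "are", "was", "has", "have"}


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def word_frequencies(html_body, top=10):
    extractor = TextExtractor()
    extractor.feed(html_body)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    words = re.findall(r"[a-z]+", text)
    # Drop stop words and very short words, as word-cloud tools typically do.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top)


if __name__ == "__main__":
    sample = "<h1>Quality assurance</h1><p>Quality procedures for the digital library.</p>"
    for word, count in word_frequencies(sample, top=5):
        print(word, count)
```

Because the input is just the body of the paper, the counts are not skewed by menu text or other page furniture.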

Should We Be Doing This?

Should we be providing access to papers in a mature and widely used format which allows the content to be reused in other environments using a wide range of readily available technologies?  And which also allows the content to be processed and analysed using simple-to-use tools such as Yahoo Pipes?

I think we should. But perhaps publishers will think differently, as they are more likely to seek to maintain tight control over papers if the copyright has been assigned to them. But is this necessarily the case?  My most recent paper, “Developing Countries; Developing Experiences: Approaches to Accessibility for the Real World” will be presented at the W4A 2010 on 26-27 April 2010.  We have recently completed the copyright form and I’ve noticed the following information on the author rights:

The right to post author-prepared versions of the Work covered by the ACM copyright in a personal collection on their own home page, on a publicly accessible server of their employer and in a repository legally mandated by the agency funding the research on which the Work is based. Such posting is limited to noncommercial access and personal use by others, and must include the following notice both embedded within the full text file and in the accompanying citation display as well:
“© ACM, (YEAR). This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution …

Hmm. So can I make the paper available in an RSS format as long as I include the ACM copyright statement?

23 Responses to “What Formats For Peer-Reviewed Papers?”

  1. Rod said

    You can always change the text on the publisher’s copyright form before you send it back. If they have put the work into reviewing & formatting the paper, I’ve found they rarely argue with the CC type copyright statement I add to their form. It has also in a couple of cases contributed to the publishers reviewing their own rules & procedures.

    Rod

  2. Ben Toth said

    Works fine in Google Reader, with same proviso over internal anchors. Thanks for doing this – interesting idea and a logical next step to ‘teaser’ RSS contents page feeds.

  3. Les Carr said

    Is it me, or is this a really perverse thing to describe? What you have actually created is an HTML version of the paper, and then embedded it in an XML format that is intended for describing lists of things. Admittedly, you have a list, it’s just not a very comprehensive one. Also, the paper itself has been turned into a CDATA blob and therefore (I guess) can’t be easily manipulated by XML tools without ripping out the CDATA and creating a new XML entity from it.

    • Hi Les – good question! I would question whether RSS is intended just for describing lists of things. It was developed for syndicating news and has subsequently started to be used as a more general syndication technology.

      I do take your point that the paper has become a CDATA blob, and that this has its limitations. But there are also benefits to doing this – and, I would argue, the content is more interoperable than the PDF blob we normally get.

      So I’m still inclined to feel that the advantages of this approach outweigh the disadvantages as a step towards providing richer ways in which content can be reused.
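
The two-step processing discussed in this exchange (pull the CDATA payload out of the feed, then treat it as a document in its own right) can be sketched as follows. This is a minimal illustration, assuming the paper is carried in a `content:encoded` element as in WordPress feeds; `headings_from_feed` is a hypothetical helper, and the HTML is handled with a tolerant HTML parser rather than re-parsed as XML.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

CONTENT_NS = {"content": "http://purl.org/rss/1.0/modules/content/"}
HEADING_TAGS = ("h1", "h2", "h3")


class HeadingCollector(HTMLParser):
    """Collect the text of h1-h3 headings from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings[-1] += data


def headings_from_feed(rss_text):
    # Step 1: the XML parser exposes the CDATA section as plain text.
    root = ET.fromstring(rss_text)
    collector = HeadingCollector()
    for node in root.iterfind("channel/item/content:encoded", CONTENT_NS):
        # Step 2: that text is parsed again, this time as HTML.
        collector.feed(node.text or "")
    collector.close()
    return [heading.strip() for heading in collector.headings]
```

The structure of the paper is recoverable, but only after this second parse, which is the limitation being acknowledged here.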

  4. As you know, I publish everything I write in one or another RSS feed. Not so much for embedding, but for ease of access and reuse of content.

    For me, if it comes down to a decision to make the paper freely available, or to publish in (say) ACM (or any other academic journal), I always opt in favour of the free publication, under a CC By-NC-SA license.

  5. I actually prefer having the live links as endnotes, or maybe footnotes. It’s a bit old school, but it creates one less font/style impediment to smooth reading. (On the other hand, that approach leaves out the richness of embedded media: a paper becomes, well, just a paper.) I wonder about the goal of embedding the paper itself in other structures or environments. I mean, I wonder how that would present visually. RSS is handy, but not necessarily ideal for reading longer texts (?)

    Regards, :)

  6. Chris Rusbridge said

    I’m with Les that RSS just doesn’t seem right for capturing an article. I’m with you that neither PDF nor Word nor even HTML seem to work well in all the circumstances one would want. For HTML in particular, it is still difficult to save a HTML document from an arbitrary browser, and view the document later from a different browser, and possibly offline. (The Safari .webarchive format does the saving and viewing offline well but isn’t standardised nor used by other browsers; many browsers do some ugly stuff with the HTML source and an associated directory that just won’t survive much manipulation).

    I suspect the answer is to save in XML to a DTD/schema like NLM, which I believe is becoming more widely adopted, and I also had heard is the format of choice for Portico (which should say something about longevity!). However, I don’t know any tools to do this…

    • Ta Chris. Thinking about use of RSS some more, wouldn’t you agree that RSS is valuable in the syndication of the full contents of a blog post? And there’s nothing which requires blog posts to be short and snappy. As we’ve never worried about use of a CDATA blob to syndicate blog posts, why should we be concerned if we apply this approach to papers?

  7. Chris Rusbridge said

    Brian, I suppose it’s at what level the encoding is happening. RSS seems fine for syndicating materials at the title and abstract level, and maybe even for passing content as blobs. But I want articles to have meaningful structure; I’m less concerned with how they are passed around than with what they _say_. That means styles, headings, figures, tables, citations, and increasingly marked-up content. I refer to THAT Brian Kelly, from UKOLN, not the Jazz pianist/composer (unless you do that as well as rapper sword dancing in your copious spare time ;-)! I just don’t think RSS is the right approach for this semantic level of content.

    • Thanks for the reply. I would agree that we *should* be providing access to richer semantic structure. However in reality in many cases papers in repositories are likely to be PDF. So providing HTML is doing better – and then providing access to that content in a container (RSS) which is widely supported adds further value.

      I guess the difference between our approaches is that I’m suggesting an incrementally richer approach based on existing technologies and you, I think, feel that we should be going for something which is better designed.

  8. > I just don’t think RSS is the right approach for this semantic level of content.

    Well, it’s more in the direction of the right approach than the typically favoured document formats like PDF and .doc

    And it is very straightforward to embed semantic information into RSS documents. You can simply extend the specification, as I do here http://www.downes.ca/schema/event/

    Or you can embed semantic data into post descriptions, either with escapes, as Google Calendar does it here http://www.google.com/calendar/feeds/hrgittpa7v391egcukqlr9kkik%40group.calendar.google.com/public/basic or better, using CData, as mentioned above.

    RSS is a versatile widely used lightweight transport protocol for semantic data, and that includes styles, headings, figures, tables, citations and increasingly marked-up content.
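
As a rough illustration of extending an RSS item with namespaced elements, here is a minimal sketch. It is not the event schema linked above; it uses the Dublin Core `dc:creator` element, which WordPress feeds already include, to attach author names to an item. The function name `add_creators` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Dublin Core elements namespace, commonly declared in RSS feeds as "dc".
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def add_creators(item, creators):
    """Append one dc:creator element per author to an RSS <item> element."""
    for name in creators:
        creator = ET.SubElement(item, f"{{{DC_NS}}}creator")
        creator.text = name
    return item


if __name__ == "__main__":
    item = ET.Element("item")
    ET.SubElement(item, "title").text = (
        "Deployment Of Quality Assurance Procedures For Digital Library Programmes"
    )
    add_creators(item, ["Brian Kelly", "Alan Dawson", "Andrew Williamson"])
    print(ET.tostring(item, encoding="unicode"))
```

Readers that do not understand the extra namespace simply ignore those elements, which is what makes this kind of extension low-risk.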

  9. Chris Rusbridge said

    @Stephen: yes, we could extend RSS to do all the things HTML does, but why bother? We have HTML, and in terms of academic papers it seems to have only one major flaw: no robust “Save as…”.

    @Brian: I think I’m the one suggesting an incremental approach based on existing technologies, and you’re the one suggesting something better designed (;-).

    I will admit that I went for XML plus NLM DTD above, but really XHTML or even perhaps HTML5 would do the trick, if the quality of the HTML is high enough (see the rants of Peter Sefton from USQ on this topic), and there was one simple, standardised extension, referred to above: a proper Save As that would wrap up all the necessary stuff into one simple portable bundle. And these days the default approach for this seems to be a ZIP container…

  10. > wrap up all the necessary stuff into one simple portable bundle

    What’s the use case for “one simple portable bundle?”

    I’m asking that in all seriousness, especially in an age where we are looking at digital media – even academic papers – as being supported by multimedia objects such as videos and animations, and perhaps even interactive features?

    Why isn’t it good enough where it is on the web, simply transferred as a document and linking to relevant resources?

    If the use case is something like caching, why an approach that would create “one simple portable bundle” instead of what professionals (like Akamai) do, maintaining a cache in the native format of the resource?

    It feels to me like the concept itself of “one simple portable bundle” is pernicious. That’s not the nature of web content, and even if it described peer-reviewed papers in the past, the lifespan of that model can now be measured in months. Certainly, it’s a model I’m well past, as evidenced here http://www.downes.ca/presentation/251

    Indeed – is this a discussion that should be dominated by librarians? Is it even a library issue any more?

    Just asking..

  11. Interesting discussion. My thoughts are for use of RSS as a container format for the HTML representation of peer-reviewed papers in order to build on the ways RSS can be used for blog posts – in particular the rich ways in which WordPress blog posts can be syndicated.

    The RSS representation can be used to process the contents of the paper (without the navigational elements and other content which are likely to be found on a Web page containing a paper) using RSS tools which are readily available (e.g. Yahoo Pipes). The contents could also be bookmarked and shared in RSS readers, such as Google Reader. The contents could easily be embedded in other HTML pages using RSS embedding tools – perhaps allowing papers to be reused in a teaching context.

    These were my initial thoughts. Ben Toth has pointed out limitations in lack of support for internal anchors but the other arguments against strike me as being for a more elegant approach – but how would that be delivered? And note that I’d have thought that this approach should be usable for syndication of richer structures provided in HTML5.

  12. @Chris @Stephen

    The case for “wrap up all the necessary stuff into one simple portable bundle” is that it’s a paper, not a presentation. Even if this is a model with only months to live (seriously? people are going to be surprised) it is nevertheless the model that’s raising Brian’s question…. Um, sorry. …that Brian’s question raised in me.

    Can it just be a stand-alone web page? Does there need to be a “save as”? Is that because Chris and me are all old skool and tactile and want to hold things in our hands (or in our ipads) on days the cloud is down?

    Anyway, personally, yes. I want librarians talking about this. Librarians will save us all when the iVandals come sweeping through.

  13. > it’s a paper, not a presentation.

    The difference between ‘paper’ and ‘presentation’ is not even semantic at this point. The idea that an academic representation must have a form that is printable on paper is surely outmoded, isn’t it?

  14. That it *must* have a printable form, yes. That’s outmoded. But that it *can* have a printable form (or another save-as form) is a choice.

    Wanting to make that choice, I look for suitable tools, tools suitable to my purpose. They aren’t the only tools. They may not be the best tools – whatever ‘best’ might mean. But it’s my purpose, my looking.

    I’m aided in this by others looking for similar tools because they have similar purposes – or, at least, I imagine they do.

    I’m not aided by people who tell me not to bother.

  15. I wasn’t trying to aid you – I was trying to prevent needless constraints making this much more difficult for very many people than it has to be.

    “The idea that an academic representation must have a form that is printable on paper is surely outmoded, isn’t it?” – not in my area (Web, standards, accessibility) where the format for papers is unchanged from when I first started writing peer-reviewed papers ten years ago. The only noticeable change is that the proceedings are no longer provided in hard copy format – instead a memory stick is used. But the papers are still PDFs – HTML hasn’t yet had a significant impact!

  17. Chris Rusbridge said

    What’s the use case for “one simple portable bundle?”

    The use case is that for any paper (or presentation) I want to be able to see it as it was. That’s required for citation at least. I want to be able to store the paper on my home computer system, engaged with my reference management system. Sometimes I want to print it out (even an old paper) and read it on the train. I may need to transfer it to a preservation system, etc, etc. The use case is at least partly why PDF is so popular for this form of content: I know I can store a PDF and read it later at my leisure. For an article or paper, that’s valuable. I want to be able to do that, PLUS have richer content, marked up with RDFa or microformats or whatever.

  18. Late to the party again, but I have blogged on this subject recently here: “Fixing academic literature with HTML5 and the semantic web”. I think that HTML5 will emerge as the document form of choice over PDF, and with it will come benefits such as those that many, particularly in biosciences, feel are sorely needed for academic literature, where the volume of research output is piling on the pressure to make research papers more interlinked, searchable and accessible in many forms.

  19. […] file formats should you use to deposit papers in your institutional repository?  Although I recently suggested that RSS could have a role to play in allowing the contents of a repository to be syndicated in […]
