UK Web Focus (Brian Kelly)

Innovation and best practices for the Web

A Pilot Survey of File Formats in Institutional Repositories

Posted by Brian Kelly on 14 Jun 2011

Background

A recent post provided A Pilot Survey of the Numbers of Full-Text Items in Institutional Repositories. The survey made use of the advanced search functionality of the ePrints repository software in order to gather data on the numbers of full-text items. Unfortunately it was found that most repositories had not configured the software to provide such information. Whilst exploring the advanced search features it was noticed that searches could also be carried out on file formats. This would appear to offer a way of answering the question of which formats are used when depositing items in repositories and how, for example, this relates to preservation policies.

Survey Across Russell Group University Repositories

Testing Approach

In order to test the approach, the advanced search facility for the ECS repository at the University of Southampton was used. The figure for the total number of items was obtained using the same search option as described in the previous post. Counts of HTML items, PDF or Postscript items, MS Word items and other formats, together with the total across all formats, were then obtained, and links to the findings are included so that the current status can be checked (which also has the advantage of documenting the search parameters used). The findings are given in the following table; a scripted version of this counting approach is sketched after the table.

| Ref. No. | Institutional Repository Details | Total in IR | Total Full-text | HTML | PDF/Postscript | MS Word | Other formats | All formats | Policy |
|---|---|---|---|---|---|---|---|---|---|
| A | Institution: ECS, University of Southampton. Repository used: eprint Repository. Summary: Uses ePrints. | 15,545 | 8,452 | 385 | 7,738 | 311 | 7,778 | 8,453 | Policy details |
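
The figures above were taken directly from the advanced search results pages. For anyone wishing to repeat the exercise programmatically, a minimal sketch is given below. Note that this was not the method used for this survey: the repository URL is hypothetical, and the parameter names (“format”, “full_text_status”) and the results-page phrasing are assumptions which vary between ePrints installations and versions, so they would need checking against the repository being surveyed.

```python
# Hedged sketch: count items matching an ePrints advanced search.
# The base URL is hypothetical; the parameter names and the
# "results 1 to 20 of N" phrasing are assumptions to verify against
# the target repository before relying on the numbers.
import re
import requests

SEARCH_URL = "http://eprints.example.ac.uk/cgi/search/archive/advanced"

def count_matches(criteria):
    """Run an advanced search and scrape the total hit count from the results page."""
    params = dict(criteria, _action_search="Search")
    response = requests.get(SEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    match = re.search(r"results\s+\d+\s+to\s+\d+\s+of\s+([\d,]+)", response.text)
    return int(match.group(1).replace(",", "")) if match else None

counts = {
    "PDF":       count_matches({"format": "application/pdf"}),
    "HTML":      count_matches({"format": "text/html"}),
    "MS Word":   count_matches({"format": "application/msword"}),
    "Full text": count_matches({"full_text_status": "public"}),
}
print(counts)
```

Combining options within one column (for example PDF and Postscript) may require repeating the parameter, depending on how the installation’s search form encodes multiple selections.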

It should be noted that there are differences between the total number of full-text items and the total across all formats. I am assuming that the number of full-text items will be equal to or less than the total number of items in a repository, but that the totals from the format searches could be larger than the number of full-text items if a single item has been deposited in more than one format.

It should be noted that in this survey a link is provided to the policy statement for the repository, taken from the ROARMap summary of IR policies. In this example the implementation of the following policy statement might be demonstrated by the evidence presented:

It is our policy to maximise the visibility, usage and impact of our research output by maximising online access to it for all would-be users and researchers worldwide. 

Survey

Once again a survey of the institutional repositories for the Russell Group universities was carried out. The results are given in the following table, which this time includes a link to the IR policies. Note that the results were gathered using the public advanced search interface where this was available. If information on the numbers of full-text items becomes available I will update this post and annotate the changes accordingly.

| Ref. No. (1) | Institutional Repository Details (2) | Total in IR (3) | Total Full-text (4) | HTML (5) | PDF/Postscript (6) | MS Word (7) | Other formats (8) | All formats (9) | Policy (10) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Institution: University of Birmingham. Repository used: eprint Repository. Summary: Three entries. Uses ePrints. | 415 | | | | | | | Policy details |
| 2 | Institution: University of Bristol. Summary: One entry. Uses DSpace. | | | | | | | | Not available |
| 3 | Institution: University of Cambridge. Summary: Four entries. Uses DSpace. | | | | | | | | Not available |
| 4 | Institution: Cardiff University. Repository used: ORCA. Summary: One entry. Uses ePrints. | 4,562 | | 1 | 67 | 2 | 32 | 72 | Not available |
| 5 | Institution: University of Edinburgh. Summary: Three entries. Uses DSpace. | | | | | | | | Policy details |
| 6 | Institution: University of Glasgow. Repository used: Enlighten. Summary: Three entries. Uses ePrints. | 40,803 | | 494 | 2,914 | 93 | 11 | 3,508 | Policy details |
| 7 | Institution: Imperial College. Repository used: Spiral. Summary: Type not known. | Not known | | | | | | | Not available |
| 8 | Institution: King’s College London. Repository used: Department of Computer Science E-Repository. Summary: One entry. Uses ePrints. | 999 | | | | | | | Not available |
| 9 | Institution: University of Leeds. Repository used: White Rose Research Online. Summary: Uses ePrints. Shared by Leeds, Sheffield and York. | 8,013 | | | | | | | Not available |
| 10 | Institution: University of Liverpool. Repository used: Research Archive. Summary: One entry. | 698 | 641 | 1 | 615 | 138 | 0 | 642 | Not available |
| 11 | Institution: LSE. Repository used: LSE Research Online. Summary: Two entries. | 26,044 | 4,534 | | | | | | Not available |
| 12 | Institution: University of Manchester. Repository used: eScholar [see note below]. Summary: One entry. | 138,708 | 94,561 | 2 | 4,502 | 77 | 128 | 7,166 | Policy details |
| 13 | Institution: Newcastle University. Repository used: Newcastle Eprints. Summary: One entry. | Not known | | | | | | | Not available |
| 14 | Institution: University of Nottingham. Repository used: Nottingham Eprints. Summary: One entry. | 781 | | | | | | | Policy details |
| 15 | Institution: University of Oxford. Repository used: ORA. Summary: Five entries. | Not known | | | | | | | Not available |
| 16 | Institution: Queen’s University Belfast. Repository used: Queen’s Papers on Europeanisation & ConWEB. Summary: One entry. | Not determined | | | | | | | Not available |
| 17 | Institution: University of Sheffield. Repository used: White Rose Research Online. Summary: See entry for Leeds. | 8,013 | | | | | | | Not available |
| 18 | Institution: University of Southampton. Repository used: eprints.soton. Summary: 11 entries. | 60,438 | | 86 | 10,962 | 652 | 9,550 | 11,872 | Policy details |
| 19 | Institution: University College London. Repository used: UCL Discovery. Summary: One entry. | 30,904 | | | | | | | Policy details |
| 20 | Institution: University of Warwick. Repository used: WRAP. Summary: Three entries. | 1,633 | | | | | | | Not available |
| TOTAL | | 322,011 | | | | | | | |

NOTE: The entry for the University of Manchester was updated on 15 June 2011, the day after the post was published, using information provided in a comment to the post. Since this information was gathered in a different way to the other findings (which used the ePrints advanced search function) the findings may not be directly comparable.

It should also be noted that:

  • The total number of full-text items listed in column 4 is taken from the findings of the ‘Full Text Status’ search. If this option is not available the entry is left blank.
  • The totals listed in columns 5-7 are taken from the findings of the “Format” search option, with column 6 including both the PDF and Postscript options.
  • The ‘Other formats’ totals listed in column 8 are taken from the findings of the “Format” search option using the options not included in columns 5-7. Note that this includes MS PowerPoint and Excel as well as various image, video, audio, XML and archive formats.
  • The totals listed in column 9 are taken from the findings of the “Format” search option with all format options selected.
  • The link to the policy page in column 10 is taken from entries provided in the ROARMap summary of IR policies. If no information is provided the entry is listed as “Not available”.

Comments

It seems that information on the file formats of items stored in institutional repositories is not easily obtained using the advanced search option in ePrints software. From the previous survey we found that, of the 20 Russell Group universities, only the IRs hosted at Liverpool and LSE provided information on the numbers of full-text items. From this survey we find that Liverpool also provides information on the file formats used, but LSE does not. However Cardiff, Glasgow and Southampton Universities do provide information on the file formats used, so a better picture across this sector can be obtained.

It is clear that PDF/PostScript is the most popular format used for depositing items, with very little evidence that HTML items are deposited. This will be disappointing for those who feel that the structure and ease of reuse provided by HTML should outweigh the convenience provided by PDF. Similarly the low usage of MS Word will be of concern to those who feel that the master format of a resource should be deposited rather than a lossy format such as PDF, although since PDF has been standardised by ISO it could be argued that depositing items in an open standard format reflects best practice.
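
As a rough illustration of this point, the share of PDF/Postscript and HTML items can be calculated for the repositories which reported format data. This is a minimal sketch: the figures are copied from the tables above, and Manchester is omitted because its figures were gathered differently.

```python
# PDF/Postscript and HTML as a share of the 'All formats' count,
# using figures copied from the survey tables above.
format_data = {
    # repository: (html, pdf_postscript, all_formats)
    "ECS, Southampton":    (385, 7738, 8453),
    "Cardiff (ORCA)":      (1, 67, 72),
    "Glasgow (Enlighten)": (494, 2914, 3508),
    "Liverpool":           (1, 615, 642),
    "eprints.soton":       (86, 10962, 11872),
}

for name, (html, pdf_ps, all_formats) in format_data.items():
    print(f"{name}: PDF/Postscript {pdf_ps / all_formats:.0%}, HTML {html / all_formats:.0%}")
```

Note that, because the per-format searches can count an item under more than one format, these shares are indicative rather than exact proportions of deposits.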

In some respects the details of which other formats are widely used are lost in the aggregated ‘Other formats’ column. This includes various multimedia formats, but also archive formats such as ZIP, TGZ and BZ2. Perhaps we may find that there are large numbers of ZIP archives containing MS Word, PDF and HTML versions of resources.

I feel there is a need for further data mining in order to understand the patterns of usage which are emerging across institutional repositories, how such patterns relate to policies (including access and preservation policies) and the implementation difficulties which depositors may experience in uploading items to their institutional repository. The danger is that we may develop an improved format (Scholarly HTML, perhaps) but fail to understand the current barriers to depositing full-text items.

Conclusions

A comment on my post Numbers Matter: Let’s Provide Open Access to Usage Data and Not Just Research Papers suggested that the term ‘paradata’ would be appropriate to use. As described on Wikipedia, “the paradata of a survey are data about the process by which the survey data were collected” or, alternatively, “administrative data about the survey”. The term has been used in a CETIS Wiki page on “Generating Paradata from MediaWiki”, which refers to a NSDL Community Network page which proposes how the term can be applied in the context of educational resources and suggests that paradata “explicates usage patterns and inferred utility of resources”.

Repository managers have a clear need to understand usage patterns and how their resources can be reused. Since repositories are also closely linked with the open access agenda it would seem to be self-evident that repository ‘paradata’ should be published openly; after all, if repository managers are promoting the benefits of open access to research publications to their researchers, their arguments will be undermined if they fail to publish data under their control, where there should be no complexities of copyright ownership claimed by publishers.

It seems that repositories which use DSpace do not provide advanced search capabilities similar to those available in ePrints. Perhaps this might be a reason for the lack of data from such repositories. But for those who are lucky enough to be using ePrints, what reasons can there be for not providing a full range of statistics?



9 Responses to “A Pilot Survey of File Formats in Institutional Repositories”

  1. I’m deeply concerned about the power lying in the webometrics league table. http://repositories.webometrics.info/toprep_inst.asp

    They give a ranking bonus for your number of “Rich Files”, which basically means “Number of PDFs”. This means that if we were to push for using “scholarly HTML” rather than PDF then our rank would drop.

    Currently eprints.ecs.soton.ac.uk is at 22 and eprints.soton.ac.uk is at 60. — I couldn’t tell you why, but stats isn’t my strong suit.

    My real concern is that this league table will stifle innovation by only measuring common quality factors, rather than promoting new ones. Also, I think the ‘delta’ is more important than the size, and always have. The success criterion for the TARDIS project, which launched eprints.soton, was that it should have a certain number (2,000, I think) of records by a given date. I opposed that at the time, and still think it was wrong. A better criterion would have been a sustained deposit rate and (in the first two years) a continuously increasing number of contributors.

    http://roar.eprints.org/ is run by one of my colleagues, but I’m very happy to see that they show graphs of ‘deposit activity’ rather than size. This shows that eprints.soton is in very robust health: http://roar.eprints.org/1423/ shows a sustained level of daily deposits over the past few years.

    What’s unhealthy is that a drop in the ranking for eprints.soton caused the board which oversees the site to discuss how to improve our rankings, and there was no really obvious way I could see to do it without generating unnecessary additional PDF files. Of course this was rejected as a silly idea, but my fear is that other sites may feel pressured to improve their ranking and make bad decisions. The community should be calling the shots on what metrics make a good repository. I’m not sure what those metrics should be, but they should be as careful as they can to avoid a situation where I can inflate my score by making my repository worse, e.g. by encouraging bad formats like PDF.

    If you’ve not heard the PDF rant, then in short it’s that people write and read papers primarily on computers. In most cases they write in a format with some markup (LaTeX or Word) and then convert it to simulated sheets of A4 paper (PDF). Computers rarely have displays where an A4 page is useful. I don’t see how it’s acceptable to produce papers (gah, even the name is inappropriate) which can’t be comfortably viewed on my landscape laptop screen, on my phone, and on the iPad I might justify buying one day. Reading papers is one of the key things an academic does for a living and it’s still easier to read them by printing them out first.

    There are some people moving in the right direction, at least: http://scholarlyhtml.org/, but the repository and research-publication community needs to be goaded in this direction, out of its PDF comfort zone.

  2. Chris Rusbridge said

    Chris and Brian, you should take no notice of the webometrics site, which is fundamentally flawed in so many ways (IMHO). For a start, it only ranks those sites that follow its pattern! From its methodology page:

    “- Only repositories with an autonomous web domain or subdomain are included:

    repository.xxx.zz (YES)

    http://www.xxx.zz/repository (NO)”

    PS I don’t mind ordinary PDFs too much (always read on-screen, never print out), but I SO HATE 2-column PDFs, which are a roaring pain to read on screen :-(.

  3. Chris Rusbridge said

    Brian, why not use the ROAR stats for totals to fill out your table? Last I looked, the OpenDOAR stats were way out of date, but ROAR stats are gathered every day. You can also download a tsv file for the full set of UK repositories and analyse each for things like monthly deposit rate (average, median, inter-quartile range etc). It may not exactly agree with the totals you get from querying the repository itself (perhaps affected by “dark” items), but it’s a pretty good estimate.
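
    A minimal sketch of the kind of analysis Chris describes, assuming a ROAR export saved locally as a tab-separated file; the file name and column headings used here are hypothetical and would need to be matched to the real export.

    ```python
    # Summarise monthly deposit rates per repository from a ROAR-style TSV.
    # The file name and the column names ("repository", "monthly_deposits")
    # are assumptions; check them against the actual export.
    import csv
    from statistics import mean, median, quantiles

    deposits = {}
    with open("roar_uk.tsv", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            deposits.setdefault(row["repository"], []).append(float(row["monthly_deposits"]))

    for repo, monthly in sorted(deposits.items()):
        q1, _, q3 = quantiles(monthly, n=4)  # quartiles, for the inter-quartile range
        print(f"{repo}: mean {mean(monthly):.1f}, median {median(monthly):.1f}, IQR {q3 - q1:.1f}")
    ```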

  4. Actually, I think my above rant is worth reposting on our team blog :) http://blogs.ecs.soton.ac.uk/webteam/2011/06/14/concerns-about-competative-metrics-for-repositories/

  5. Hi Brian

    Just to help fill in a gap in your analysis, I include some stats for Manchester University’s institutional repository below. We are one of those that don’t use ePrints (the ePrints instance you found is a departmental repository set up a few years ago and running an early version of ePrints). At the institutional level we use the Fedora framework, having concluded that this better fits with our technology stack and long-term needs.

    Stats are as of 14th June 2011. Manchester’s IR was launched in Sept 2009.

    Total deposits : 138,708
    Total deposits (open access) : 94,561
    Total deposits with (one or more) file(s) attached : 7,166 (this includes examination and final versions of eTheses which are closed access indefinitely or embargoed)
    Total deposits (open access) with (one or more) file(s) attached : 4,663
    Total files deposited (open access) : 4,709
    Total HTML files (open access) : 2
    Total PDF (open access) : 4,502
    Total MS Word (doc/docx) (open access): 77
    Other formats (image, ppt, opendoc) (open access) : 128

    Policies URL : http://www.manchester.ac.uk/escholar/about/policies

    I suspect other UK repository managers could supply such stats if asked – UKCORR (http://www.ukcorr.org/) is possibly a good place to ask.

    In general, I suggest that comparing repositories using such stats is of little value (although I wasn’t sure that was your intention). Institutional repositories operate within the context of the institution in which they exist, with all the complexities and caveats which accompany this, making any meaningful comparisons very difficult. Furthermore, surely whether such stats are easily available or not relates little to the open access aims of an institution’s repository? Potentially they are just a form of IR vanity. As an IR Manager, one question I constantly ask myself is: do I work towards promoting my institution’s research or my institution’s repository?

    Regards, Phil

    ***********************************
    Dr PR Butler
    eScholarship Manager
    Tel: +44 (0)161 275 1514 (internal x51514)
    Email: p.butler@manchester.ac.uk
    Web: http://www.manchester.ac.uk/escholar
    ***********************************

    • Hi Phil

      Many thanks for your comments and the information you have provided. I have updated the post with your data, and have annotated the update in order both to make it clear that there are changes to the original post and to flag that this information was gathered differently from the other figures.

      You are correct that the publication of the statistics isn’t intended to provide a league table based on the ‘value’ of the IRs (we all know that the institutions represented are very different, with differing goals, sizes, etc.). Rather, the point of the two recent surveys of IRs, and of other surveys of institutional use of various Social Web services, is to help understand the patterns of use of such services, which can in turn help identify emerging best practices in the various areas and also, possibly, help shape future directions for development. In addition, the information has been gathered (and documented) openly, so that limitations in the survey methodology can be easily identified (in your case it was quickly noticed that the OpenDOAR registry which I used did not include details of your main repository; I’m sure that omission will now be rectified). The methodology I used was based on the advanced search capabilities of ePrints, and the survey provides direct links to the findings, which will help to spot errors as well as documenting the search parameters used. In particular, note that the findings for the numbers of items for which full text was available included full-text items for which access may be restricted. My interest was in the formats and preservation issues rather than the Open Access agenda (the formats issue is less of a political issue and more amenable to reasoned discussion, I think!). This survey methodology is therefore more objective and reproducible than requesting statistics from IR managers.

      What we now have is, for example, an emerging picture which shows the ubiquity of PDF and the very limited use of HTML. This could indicate that the ease of creating and depositing PDF resources makes it an ideal format for depositing papers. However this will be of concern to those who feel that a more richly structured format (perhaps ScholarlyHTML) is desirable. We now have a (partial) benchmark from which we can see how the pattern may change (or not) in the future.

      Note that although those of us who work in the sector are very much aware of the flaws in attempting to publish league tables, we also need to note that others may not be so reticent, as we saw in a Daily Telegraph headline which informed its readers that “Universities spending millions on websites which students rate as inadequate”. Although I subsequently published a post which agreed that “University Web sites cost money” and described how such spending provided a positive ROI, the damage was done. The data used in the article was gathered by FOI requests, and the institutions had no opportunity to demonstrate the value gained from their investment. Gathering evidence of how IRs are being used, and the complexities of such uses, in an open fashion can enable the sector to take a lead in promoting the benefits of our activities and not be left on the defensive if we attempt to hide behind arguments such as “Our data is complex and you wouldn’t understand the subtleties”. MPs weren’t allowed to get away with such arguments over their expenses and, in the current political climate, we shouldn’t expect higher education to get off lightly!

      I should add that in this comment I’m explaining the rationale for this work and not necessarily responding to comments you may have made.

      Thanks

      Brian

  6. Brian, An advanced search in EPrints repositories will find specified file formats. A more comprehensive profile of file formats in repositories can be obtained with the EPrints preservation app. This will be easiest to install from the Bazaar app store in the next version of EPrints 3.3, but is available now http://files.eprints.org/581/

    To see how the app can profile different types of repositories, and to find out how many ‘other formats’ might be found in institutional repositories, see our recent paper in Ariadne, Characterising and Preserving Digital Repositories: File Format Profiles http://www.ariadne.ac.uk/issue66/hitchcock-tarrant/

    As that paper says: “The starting point for preservation is to know what content you have, not just in terms of bibliographic metadata such as title, author, etc., but also in terms of technical metadata, including file formats.”

    That starting point can lead to a range of tools available for any repository to build a full preservation programme http://blogs.ecs.soton.ac.uk/keepit/tag/keepit-course/

    As a result, there is no need to restrict deposit to repositories for preservation purposes based on file format.

    Where you lament the dominance of PDF over other formats in the items found in repositories, one longstanding reason for this has been a simple view of preservation. There is now at least one less reason for this to continue.

    • Hi Steve

      Many thanks for your comments and the links you’ve provided. I think we are starting to understand the various tools which are needed as part of an IR toolkit in order to support the auditing of the ways in which repositories are being used.

      Note that when you say “The starting point for preservation is to know what content you have, not just in terms of bibliographic metadata such as title, author, etc., but also in terms of technical metadata, including file formats” I would suggest that it is not just repository managers who will need such information but, at a higher, aggregated level, also policy makers and even politicians who have a responsibility to ensure that public sector investment is being used effectively.
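
      As a crude illustration of gathering such technical metadata, a simple format profile can be approximated from a directory of deposited files. This is a generic sketch, not the EPrints preservation app you mention: the path is hypothetical, and extension-based guessing is far cruder than signature-based identification tools such as DROID, which work from PRONOM format signatures.

      ```python
      # Generic sketch: count files by guessed MIME type to approximate a
      # format profile. Extension-based guessing only; real profiling tools
      # identify formats from file signatures.
      import mimetypes
      from collections import Counter
      from pathlib import Path

      def format_profile(root):
          """Count files under 'root' by MIME type guessed from the file extension."""
          profile = Counter()
          for path in Path(root).rglob("*"):
              if path.is_file():
                  mime, _ = mimetypes.guess_type(path.name)
                  profile[mime or "unknown"] += 1
          return profile

      for mime, count in format_profile("/path/to/deposited/files").most_common():
          print(f"{mime}: {count}")
      ```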

      I should also say that my main interest is in ensuring that open data about repositories is made available; my interest in policy issues related to open access and file formats is secondary to this (and in retrospect I should probably have remained neutral in this particular post on the issues related to file formats).

