UK Web Focus (Brian Kelly)

Innovation and best practices for the Web


A Pilot Survey of File Formats in Institutional Repositories

Posted by Brian Kelly on 14 June 2011


A recent post provided A Pilot Survey of the Numbers of Full-Text Items in Institutional Repositories. The survey used the advanced search functionality of the ePrints repository software to gather data on the numbers of full-text items. Unfortunately most repositories had not configured the software to provide this information. Whilst exploring the advanced search features I noticed that searches could also be restricted by file format. This would appear to provide an answer to the question of which formats are used for depositing items in repositories and how, for example, this relates to preservation policies.

Survey Across Russell Group University Repositories

Testing Approach

In order to test the approach the advanced search facility for the ECS repository at the University of Southampton was used. The figure for the total number of items used the same search option as described in the previous post. Details of the number of HTML items, PDF or Postscript items, other formats and the total number of formats were obtained and links to the findings included so that the current status can be obtained (which also had the advantage of documenting the search parameters used). The findings are given in the following table.

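| Ref. No. | Institutional Repository Details | Total in IR | Full-text | HTML | PDF/PostScript | MS Word | Other | All formats | Policy |
|---|---|---|---|---|---|---|---|---|---|
| A | Institution: ECS, University of Southampton. Repository used: eprint Repository. Summary: Uses ePrints. | 15,545 | 8,452 | 385 | 7,738 | 311 | 7,778 | 8,453 | Policy details |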

It should be noted that there are differences between the total number of full-text items and the total of all formats. I am assuming that the number of full-text items will be equal to or less than the total number of items in a repository, but that the total across all formats could be larger if a single item is held in multiple formats.
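The distinction can be made concrete with a small sketch. The item identifiers and formats below are invented for illustration and are not taken from any repository; the point is only that summing per-format hits can exceed the number of items that have any full text at all.

```python
# Hypothetical deposits: item id -> set of file formats attached to it.
# These records are invented for illustration only.
deposits = {
    "item-1": {"pdf"},
    "item-2": {"pdf", "msword"},  # one item deposited in two formats
    "item-3": set(),              # metadata-only record, no full text
    "item-4": {"html", "pdf"},
}

total_items = len(deposits)
# An item with at least one attached file counts once as "full-text".
full_text_items = sum(1 for formats in deposits.values() if formats)
# Summing per-format hits counts a multi-format item once per format.
format_hits = sum(len(formats) for formats in deposits.values())

print(total_items, full_text_items, format_hits)
```

Here the full-text count stays below the item count, while the per-format hits exceed it because two items carry more than one format.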

It should be noted that in this survey a link is provided to the policy statement for the repository which has been taken from the ROARMap summary of IR policies. In this example the implementation of the following policy statement might be demonstrated by the evidence presented:

It is our policy to maximise the visibility, usage and impact of our research output by maximising online access to it for all would-be users and researchers worldwide. 


Once again a survey of the institutional repositories for Russell Group Universities was carried out. The results are given in the following table, which this time includes a link to the IR policies. Note that the results were gathered using the public advanced search interface where this was available. If information on the numbers of full-text items becomes available I will update this post and annotate accordingly.

| Ref. No. (1) | Institutional Repository Details (2) | Total in IR (3) | Full-text (4) | HTML (5) | PDF/PostScript (6) | MS Word (7) | Other (8) | All formats (9) | Policy (10) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Institution: University of Birmingham. Repository used: eprint Repository. Summary: Three entries. Uses ePrints. | 415 | | | | | | | Policy details |
| 2 | Institution: University of Bristol. Summary: One entry. Uses DSpace. | | | | | | | | Not available |
| 3 | Institution: University of Cambridge. Summary: Four entries. Uses DSpace. | | | | | | | | Not available |
| 4 | Institution: Cardiff University. Summary: 1 entry. Uses ePrints. Repository used: ORCA. | 4,562 | | 1 | 67 | 2 | 32 | 72 | Not available |
| 5 | Institution: University of Edinburgh. Summary: Three entries. Uses DSpace. | | | | | | | | Policy details |
| 6 | Institution: University of Glasgow. Summary: Three entries. Uses ePrints. Repository used: Enlighten. | 40,803 | | 494 | 2,914 | 93 | 11 | 3,508 | Policy details |
| 7 | Institution: Imperial College. Repository used: Spiral. Summary: Type not known. | | | | | | | | Not available |
| 8 | Institution: King’s College London. Repository used: Department of Computer Science E-Repository. Summary: One entry. Uses ePrints. | 999 | | | | | | | Not available |
| 9 | Institution: University of Leeds. Repository used: White Rose Research Online. Summary: Uses ePrints. Shared by Leeds, Sheffield and York. | 8,013 | | | | | | | Not available |
| 10 | Institution: University of Liverpool. Summary: One entry. Repository used: Research Archive. | 698 | 641 | 1 | 615 | 138 | 0 | 642 | Not available |
| 11 | Institution: LSE. Summary: 2 entries. Repository used: LSE Research Online. | 26,044 | 4,534 | | | | | | Not available |
| 12 | Institution: University of Manchester. Summary: One entry. Repository used: eScholar [See Note below]. | 138,708 | 94,561 | 2 | 4,502 | 77 | 128 | 7,166 | Policy details |
| 13 | Institution: Newcastle University. Summary: One entry. Repository used: Newcastle Eprints. | | | | | | | | Not available |
| 14 | Institution: University of Nottingham. Summary: One entry. Repository used: Nottingham Eprints. | 781 | | | | | | | Policy details |
| 15 | Institution: University of Oxford. Summary: Five entries. | | | | | | | | Not available |
| 16 | Institution: Queen’s University Belfast. Summary: One entry. Repository used: Queen’s Papers on Europeanisation & ConWEB. | | | | | | | | Not available |
| 17 | Institution: University of Sheffield. Repository used: White Rose Research Online. Summary: See entry for Leeds. | 8,013 | | | | | | | Not available |
| 18 | Institution: University of Southampton. Summary: 11 entries. Repository used: eprints.soton. | 60,438 | | 86 | 10,962 | 652 | 9,550 | 11,872 | Policy details |
| 19 | Institution: University College London. Summary: 1 entry. Repository used: UCL Discovery. | 30,904 | | | | | | | Policy details |
| 20 | Institution: University of Warwick. Summary: 3 entries. Repository used: WRAP. | 1,633 | | | | | | | Not available |
| TOTAL | | 322,011 | | | | | | | |

NOTE: The entry for the University of Manchester was updated on 15 June 2011, the day after the post was published, using information provided in a comment to the post. Since this information was gathered in a different way to the other findings (which used the ePrints advanced search function) the findings may not be directly comparable.

It should also be noted that:

  • The total number of full-text items listed in column 4 is taken from the findings for the ‘Full Text Status’ search. If this option is not available the entry is left blank.
  • The totals listed in columns 5-7 are taken from the findings for the “Format” search option, with column 6 including both the PDF and PostScript options.
  • The ‘Other formats’ totals listed in column 8 are taken from the findings for the “Format” search option using the options not included in columns 5-7. Note that this includes MS PowerPoint and Excel and various image, video, audio, XML and archive formats.
  • The totals listed in column 9 are taken from the findings for the “Format” search option with all format options selected.
  • The link to the policy page in column 10 is taken from entries provided in the ROARMap summary of IR policies. If no information is provided the entry is listed as “Not available”.
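The TOTAL row of the table can be cross-checked with a few lines of Python. This is a sketch which assumes that the first figure in each row is the ‘Total in IR’ value; the short institution names used as dictionary keys are my own labels.

```python
# "Total in IR" figures copied from the table, in row order.
# Assumption: the first figure in each row is the "Total in IR" value.
totals_in_ir = {
    "Birmingham": 415,
    "Cardiff": 4562,
    "Glasgow": 40803,
    "King's College London": 999,
    "Leeds": 8013,
    "Liverpool": 698,
    "LSE": 26044,
    "Manchester": 138708,
    "Nottingham": 781,
    "Sheffield": 8013,
    "Southampton": 60438,
    "UCL": 30904,
    "Warwick": 1633,
}

grand_total = sum(totals_in_ir.values())
# Leeds and Sheffield both report White Rose Research Online, so
# dropping one of them avoids counting the shared repository twice.
deduplicated_total = grand_total - totals_in_ir["Sheffield"]

print(grand_total, deduplicated_total)  # 322011 313998
```

The sum reproduces the published total of 322,011, which therefore counts the shared White Rose Research Online repository under both Leeds and Sheffield.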


It seems that information on the file formats of items stored in institutional repositories is not easily obtained using the advanced search option in ePrints software. From the previous survey we found that, of the 20 Russell Group Universities, only the IRs hosted at Liverpool and LSE provided information on the numbers of full-text items. From this survey we find that Liverpool also provides information on the file formats used, but LSE does not. However Cardiff, Glasgow and Southampton Universities do provide information on the file formats used, so a better picture across this sector can be obtained.

It is clear that PDF/PostScript is the most popular format used for depositing items with very little evidence that HTML items are deposited. This will be disappointing for those who feel that the structure and ease of reuse provided by HTML should outweigh the convenience provided by PDF. Similarly the low usage of MS Word will be of concern to those who feel that the master format of a resource should be deposited rather than a lossy format such as PDF – although since PDF has been standardised by ISO it could be argued that depositing items in an open standard format reflects best practices.
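The relative popularity of the formats can be estimated from the rows which report format counts. The sketch below assumes that the figures in those rows appear in the order HTML, PDF/PostScript, MS Word, Other, All formats; that mapping is my reading of the table rather than something the search interface states.

```python
# Figures copied from the table rows that report format counts.
# Assumed column order: HTML, PDF/PostScript, MS Word, Other, All formats.
format_counts = {
    "Cardiff": (1, 67, 2, 32, 72),
    "Glasgow": (494, 2914, 93, 11, 3508),
    "Liverpool": (1, 615, 138, 0, 642),
    "Southampton": (86, 10962, 652, 9550, 11872),
}

pdf_share = {}
for name, (html, pdf_ps, ms_word, other, all_formats) in format_counts.items():
    # Share of items matching any format that also match PDF or PostScript.
    pdf_share[name] = pdf_ps / all_formats
    print(
        f"{name}: PDF/PS {pdf_share[name]:.1%}, "
        f"HTML {html / all_formats:.1%}, "
        f"MS Word {ms_word / all_formats:.1%}"
    )
```

The shares for one repository can add up to more than 100% because an item held in several formats is matched by each of the corresponding searches.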

In some respects the details of the widely-used formats are lost in the ‘Other formats’ column. This includes various multimedia formats, but also archive formats: ZIP, TGZ and BZ2. Perhaps we may find that there are large numbers of ZIP archives containing MS Word, PDF and HTML versions of resources.
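One way of testing that suspicion would be to look inside deposited archives and tally the file types they contain. The following is a minimal sketch using Python's standard zipfile module; the function name and the member names in the sample archive are invented for illustration, and a real survey would first need access to the deposited files.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def tally_extensions(zip_bytes: bytes) -> Counter:
    """Count the file extensions of the members of a ZIP archive."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            suffix = PurePosixPath(info.filename).suffix.lower()
            counts[suffix or "(none)"] += 1
    return counts


# Build a small in-memory archive with invented member names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ("paper.pdf", "paper.doc", "paper.html", "figures/fig1.png"):
        archive.writestr(name, b"placeholder")

extension_counts = tally_extensions(buffer.getvalue())
print(extension_counts)
```

Similar tallies could be produced for TGZ and BZ2 archives with the tarfile module.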

I feel there is a need for further data mining in order to understand the patterns of usage which are emerging across institutional repositories, how such patterns relate to policies (including access and preservation policies) and the implementation difficulties which depositors may experience in uploading items to their institutional repository. The danger is that we may develop an improved format (Scholarly HTML, perhaps) but fail to understand the current barriers to depositing full-text items.


A comment to my post which argued that Numbers Matter: Let’s Provide Open Access to Usage Data and Not Just Research Papers suggested that the term ‘Paradata’ would be appropriate to use. As described on Wikipedia, “the paradata of a survey are data about the process by which the survey data were collected” or alternatively “administrative data about the survey”. The term has been used in a CETIS Wiki page on “Generating Paradata from MediaWiki” which refers to a NSDL Community Network page which proposes how the term can be applied in the context of educational resources and suggests that paradata can provide opportunities to “explicates usage patterns and inferred utility of resources”.

Repository managers have a clear need to understand usage patterns and how their resources can be reused. Since repositories are also closely linked with the open access agenda it would seem to be self-evident that repository ‘paradata’ should be published openly – after all, if repository managers are promoting the benefits of open access to research publications to their researchers, their arguments will be undermined if they fail to publish data under their control, where there should be no complexities of copyright ownership claimed by publishers.

It seems that repositories which use DSpace do not provide advanced search capabilities similar to those available in ePrints. Perhaps this might be a reason for the lack of data from such repositories. But for those who are lucky enough to be using ePrints, what reasons can there be for not providing a full range of statistics?

