Background
A recent post provided A Pilot Survey of the Numbers of Full-Text Items in Institutional Repositories. The survey made use of the advanced search functionality of ePrints repository software in order to gather data on the numbers of full-text items. Unfortunately it was found that most repositories had not configured the software to provide such information. Whilst exploring the advanced search features, however, it was noticed that it was possible to search by file format. This would appear to provide an answer to the question of which formats are used for depositing items in repositories and how, for example, this relates to preservation policies.
Survey Across Russell Group University Repositories
Testing Approach
To test the approach, the advanced search facility for the ECS repository at the University of Southampton was used. The figure for the total number of items was obtained using the same search option as described in the previous post. The numbers of HTML items, PDF or Postscript items and items in other formats, together with the total across all formats, were obtained, and links to the findings were included so that the current status can be checked (which also has the advantage of documenting the search parameters used). The findings are given in the following table.
| Ref. No. | Institutional Repository Details | Total in IR | Total Full-text | HTML | PDF/Postscript | MS Word | Other formats | All formats | Policy |
| A | Institution: ECS, University of Southampton Repository used: eprint Repository Summary: Uses ePrints. | 15,545 | 8,452 | 7,778 | 8,453 | Policy details |
Note that the total number of full-text items differs from the total of all formats. I am assuming that the number of full-text items will be equal to or less than the total number of items in a repository, but that the total across all formats could be larger than the number of full-text items if a single item is deposited in multiple formats.
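The assumption above can be written down as a simple check. The sketch below is illustrative only: the `RepositoryCounts` record and the `check_counts` function are hypothetical names, and the sample figures are those reported for Liverpool in the survey table later in this post.

```python
from dataclasses import dataclass


@dataclass
class RepositoryCounts:
    total_items: int   # "Total in IR"
    full_text: int     # "Total Full-text"
    all_formats: int   # "All formats"


def check_counts(c):
    """Return a list of observations about one repository's counts."""
    notes = []
    # Assumed rule: full-text items cannot outnumber the items in the repository.
    if c.full_text > c.total_items:
        notes.append("full-text count exceeds total items (unexpected)")
    # The all-formats figure may exceed the full-text figure.
    if c.all_formats > c.full_text:
        notes.append(
            f"'all formats' exceeds full-text by {c.all_formats - c.full_text}"
        )
    return notes


# Figures reported for Liverpool in the survey table.
liverpool = RepositoryCounts(total_items=698, full_text=641, all_formats=642)
print(check_counts(liverpool))
```

An empty list would mean that the counts are consistent with the assumption and that the two totals do not differ.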
It should be noted that in this survey a link is provided to the policy statement for the repository which has been taken from the ROARMap summary of IR policies. In this example the implementation of the following policy statement might be demonstrated by the evidence presented:
It is our policy to maximise the visibility, usage and impact of our research output by maximising online access to it for all would-be users and researchers worldwide.
Survey
Once again a survey of the institutional repositories for Russell Group Universities was carried out. The results are given in the following table, which this time includes a link to the IR policies. Note that the results were gathered using the public advanced search interface where this was available. If information on the numbers of full-text items becomes available I will update this post and annotate accordingly.
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| Ref. No. | Institutional Repository Details | Total in IR | Total Full-text | HTML | PDF/Postscript | MS Word | Other formats | All formats | Policy |
| 1 | Institution: University of Birmingham Repository used: eprint Repository Summary: Three entries. Uses ePrints. | 415 | | | | | | | Policy details |
| 2 | Institution: University of Bristol Summary: One entry. Uses DSpace | | | | | | | | Not available |
| 3 | Institution: University of Cambridge Summary: Four entries. Uses DSpace. | | | | | | | | Not available |
| 4 | Institution: Cardiff University Summary: 1 entry. Uses ePrints. Repository used: ORCA | 4,562 | | 1 | 67 | 2 | 32 | 72 | Not available |
| 5 | Institution: University of Edinburgh Summary: Three entries. Uses DSpace. | | | | | | | | Policy details |
| 6 | Institution: University of Glasgow Summary: Three entries. Uses ePrints. Repository used: Enlighten | 40,803 | | 494 | 2,914 | 93 | 11 | 3,508 | Policy details |
| 7 | Institution: Imperial College Repository used: Spiral Summary: Type not known. | Not known | | | | | | | Not available |
| 8 | Institution: King’s College London Repository used: Department of Computer Science E-Repository Summary: One entry. Uses ePrints. | 999 | | | | | | | Not available |
| 9 | Institution: University of Leeds Repository used: White Rose Research Online Summary: Uses ePrints. Shared by Leeds, Sheffield and York. | 8,013 | | | | | | | Not available |
| 10 | Institution: University of Liverpool Summary: One entry. Repository used: Research Archive | 698 | 641 | 1 | 615 | 138 | 0 | 642 | Not available |
| 11 | Institution: LSE Summary: 2 entries. Repository used: LSE Research Online | 26,044 | 4,534 | | | | | | Not available |
| 12 | Institution: University of Manchester Summary: One entry. Repository used: eScholar [See Note below] | 138,708 | 94,561 | 2 | 4,502 | 77 | 128 | 7,166 | Policy details |
| 13 | Newcastle University Summary: One entry. Repository used: Newcastle Eprints | Not known | | | | | | | Not available |
| 14 | Institution: University of Nottingham Summary: One entry. Repository used: Nottingham Eprints | 781 | | | | | | | Policy details |
| 15 | Institution: University of Oxford Summary: Five entries Repository used: ORA | Not known | | | | | | | Not available |
| 16 | Institution: Queen’s University Belfast Summary: One entry. Repository used: Queen’s Papers on Europeanisation & ConWEB | Not determined | | | | | | | Not available |
| 17 | Institution: University of Sheffield Repository used: White Rose Research Online Summary: See entry for Leeds. | 8,013 | | | | | | | Not available |
| 18 | Institution: University of Southampton Summary: 11 entries. Repository used: eprints.soton | 60,438 | | 86 | 10,962 | 652 | 9,550 | 11,872 | Policy details |
| 19 | Institution: University College London Summary: 1 entry Repository used: UCL Discovery | 30,904 | | | | | | | Policy details |
| 20 | Institution: University of Warwick Summary: 3 entries Repository used: WRAP | 1,633 | | | | | | | Not available |
| TOTAL | | 322,011 | | | | | | | |
NOTE: The entry for the University of Manchester was updated on 15 June 2011, the day after the post was published, using information provided in a comment to the post. Since this information was gathered in a different way to the other findings (which used the ePrints advanced search function) the findings may not be directly comparable.
It should also be noted that:
- The total number of full-text items listed in column 4 is taken from the findings for the ‘Full Text Status’ search. If this option is not available the entry is left blank.
- The totals listed in columns 5-7 are taken from the findings for the “Format” search option, with column 6 including both the PDF and Postscript options.
- The ‘Other formats’ totals listed in column 8 are taken from the findings for the “Format” search option using the options not included in columns 5-7. Note that this includes MS PowerPoint and Excel and various image, video, audio, XML and archive formats.
- The totals listed in column 9 are taken from the findings for the “Format” search option with all format options selected.
- The link to the policy page in column 10 is taken from entries provided in the ROARMap summary of IR policies. If no information is provided the entry is listed as “Not available”.
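As a check on the TOTAL row, the ‘Total in IR’ figures in the table can be summed directly. The following is a minimal sketch: the dictionary and function names are my own, and the figures are simply transcribed from the table. Since White Rose Research Online is shared by Leeds and Sheffield, its figure appears twice in the sum; the second function shows the effect of counting it once.

```python
# "Total in IR" figures transcribed from the survey table.
# Repositories with no figure ("Not known", blank) are omitted.
total_in_ir = {
    "Birmingham": 415,
    "Cardiff": 4_562,
    "Glasgow": 40_803,
    "King's College London": 999,
    "Leeds (White Rose)": 8_013,
    "Liverpool": 698,
    "LSE": 26_044,
    "Manchester": 138_708,
    "Nottingham": 781,
    "Sheffield (White Rose)": 8_013,
    "Southampton": 60_438,
    "UCL": 30_904,
    "Warwick": 1_633,
}


def grand_total(counts):
    """Sum every listed figure, as the TOTAL row of the table does."""
    return sum(counts.values())


def total_counting_once(counts, duplicate_key):
    """Sum the figures with one duplicated entry removed."""
    return grand_total(counts) - counts[duplicate_key]


print(grand_total(total_in_ir))  # 322011, matching the TOTAL row
print(total_counting_once(total_in_ir, "Sheffield (White Rose)"))
```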
Comments
It seems that information on the file formats of items stored in institutional repositories is not easily obtained from using the advanced search option in ePrints software. From the previous survey we found that only IRs hosted at Liverpool and LSE of the 20 Russell Group Universities provided information on the numbers of full-text items. From this survey we find that Liverpool also provides information on the file formats used, but LSE does not. However Cardiff, Glasgow and Southampton Universities do provide information on the file formats used, so a better picture across this sector can be obtained.
It is clear that PDF/PostScript is the most popular format used for depositing items with very little evidence that HTML items are deposited. This will be disappointing for those who feel that the structure and ease of reuse provided by HTML should outweigh the convenience provided by PDF. Similarly the low usage of MS Word will be of concern to those who feel that the master format of a resource should be deposited rather than a lossy format such as PDF – although since PDF has been standardised by ISO it could be argued that depositing items in an open standard format reflects best practices.
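To put rough numbers on this, the per-format figures can be converted into percentage shares. The sketch below is an illustration rather than part of the survey method: the names are hypothetical, and the figures are read from the table on the assumption that Cardiff, Glasgow and Southampton have no full-text figure, so that their first format figure is the HTML count. Manchester is left out because its data was gathered in a different way. The shares are of format matches, not of items, since a single item may be deposited in more than one format.

```python
# Per-format figures for the ePrints repositories which expose the "Format" search.
format_counts = {
    "Cardiff":     {"HTML": 1,   "PDF/Postscript": 67,     "MS Word": 2,   "Other": 32},
    "Glasgow":     {"HTML": 494, "PDF/Postscript": 2_914,  "MS Word": 93,  "Other": 11},
    "Liverpool":   {"HTML": 1,   "PDF/Postscript": 615,    "MS Word": 138, "Other": 0},
    "Southampton": {"HTML": 86,  "PDF/Postscript": 10_962, "MS Word": 652, "Other": 9_550},
}


def format_shares(counts):
    """Return each format's percentage share of all format matches."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


for repo, counts in format_counts.items():
    shares = format_shares(counts)
    most_used = max(shares, key=shares.get)
    summary = ", ".join(f"{name} {share:.1f}%" for name, share in shares.items())
    print(f"{repo}: {summary} (most used: {most_used})")
```

On these figures PDF/Postscript is the most used format in all four repositories, although at Southampton the ‘Other’ category is not far behind.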
In some respects the details of the widely-used formats are lost in the ‘Other formats’ column. This includes various multimedia formats, but also archive formats: ZIP, TGZ and BZ2. Perhaps we may find that there are large numbers of ZIP archives containing MS Word, PDF and HTML versions of resources.
I feel there is a need for further data mining in order to understand the patterns of usage which are emerging across institutional repositories, how such patterns relate to policies (including access and preservation policies) and the implementation difficulties which depositors may experience in uploading items to their institutional repository. The danger is that we may develop an improved format (Scholarly HTML, perhaps) but fail to understand the current barriers to depositing full-text items.
Conclusions
A comment to my post which argued that Numbers Matter: Let’s Provide Open Access to Usage Data and Not Just Research Papers suggested that the term ‘Paradata’ would be appropriate to use. As described on Wikipedia, “the paradata of a survey are data about the process by which the survey data were collected” or alternatively “administrative data about the survey”. The term has been used in a CETIS Wiki page on “Generating Paradata from MediaWiki” which refers to an NSDL Community Network page which proposes how the term can be applied in the context of educational resources and suggests that paradata can provide opportunities to “explicates usage patterns and inferred utility of resources”.
Repository managers have a clear need to understand usage patterns and how their resources can be reused. Since repositories are also closely linked with the open access agenda it would seem to be self-evident that repository ‘paradata’ should be published openly. After all, if repository managers are promoting the benefits of open access to research publications to their researchers, their arguments will be undermined if they fail to publish data under their control, where there should be no complexities of copyright ownership claimed by publishers.
It seems that repositories which use DSpace do not provide advanced search capabilities similar to those available in ePrints. Perhaps this might be a reason for the lack of data from such repositories. But for those who are lucky enough to be using ePrints, what reasons can there be for not providing a full range of statistics?


