A Pilot Survey of File Formats in Institutional Repositories
Posted by Brian Kelly on 14 Jun 2011
Background
A recent post provided A Pilot Survey of the Numbers of Full-Text Items in Institutional Repositories. The survey made use of the advanced search functionality of ePrints repository software in order to gather data on the numbers of full-text items. Unfortunately it was found that most repositories had not configured the software to provide such information. Whilst exploring the advanced search features it was noticed that it was possible to provide searches based on file formats. This would appear to provide an answer to the question of formats used for depositing items in repositories and how, for example, this relates to preservation policies.
Survey Across Russell Group University Repositories
Testing Approach
In order to test the approach the advanced search facility for the ECS repository at the University of Southampton was used. The figure for the total number of items used the same search option as described in the previous post. Details of the number of HTML items, PDF or Postscript items, other formats and the total number of formats were obtained and links to the findings included so that the current status can be obtained (which also had the advantage of documenting the search parameters used). The findings are given in the following table.
Ref. No. | Institutional Repository Details | Total in IR | Total Full-text | HTML | PDF/Postscript | MS Word | Other formats | All formats | Policy |
A | Institution: ECS, University of Southampton Repository used: eprint Repository Summary: Uses ePrints. |
15,545 | 8,452 | 7,778 | 8,453 | Policy details |
It should be noted that there are differences between the total number of full-text items and the total of all formats. I am assuming that the number of full-text items will be equal to or less than the total number of items in a repository, but that the total across all formats could be larger than the number of full-text items if there are multiple formats for a single item.
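The assumption above can be illustrated with a minimal sketch. The records below are hypothetical and are not taken from any of the surveyed repositories; they only show how counting items per format can give a larger figure than counting items which have at least one file attached.

```python
from collections import Counter

# Hypothetical records: item id -> formats of the files attached to it.
items = {
    "item-1": ["pdf"],
    "item-2": ["pdf", "msword"],   # same paper deposited in two formats
    "item-3": ["html", "pdf"],
    "item-4": [],                  # metadata-only record, no full text
}

def summarise(records):
    """Return (total items, full-text items, items per format)."""
    total_items = len(records)
    full_text_items = sum(1 for formats in records.values() if formats)
    # Count each item once per distinct format it offers.
    per_format = Counter(
        fmt for formats in records.values() for fmt in set(formats)
    )
    return total_items, full_text_items, per_format

total_items, full_text_items, per_format = summarise(items)
format_total = sum(per_format.values())
print(total_items, full_text_items, format_total)
```

Here three of the four items have full text, but the per-format counts add up to five because two items are available in more than one format.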
It should be noted that in this survey a link is provided to the policy statement for the repository which has been taken from the ROARMap summary of IR policies. In this example the implementation of the following policy statement might be demonstrated by the evidence presented:
It is our policy to maximise the visibility, usage and impact of our research output by maximising online access to it for all would-be users and researchers worldwide.
Survey
Once again a survey of the institutional repositories for Russell Group Universities was carried out. The results are given in the following table, which this time includes a link to the IR policies. Note that the results were gathered using the public advanced search interface where this was available. If information on the numbers of full-text items becomes available I will update this post and annotate accordingly.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Ref. No. | Institutional Repository Details | Total in IR | Total Full-text | HTML | PDF/Postscript | MS Word | Other formats | All formats | Policy |
1 | Institution: University of Birmingham Repository used: eprint Repository Summary: Three entries. Uses ePrints. |
415 | Policy details | ||||||
2 | Institution: University of Bristol Summary: One entry. Uses DSpace |
Not available | |||||||
3 | Institution: University of Cambridge Summary: Four entries. Uses DSpace. |
Not available | |||||||
4 | Institution: Cardiff University Summary: 1 entry. Uses ePrints. Repository used: ORCA |
4,562 | 1 | 67 | 2 | 32 | 72 | Not available | |
5 | Institution: University of Edinburgh Summary: Three entries. Uses DSpace. |
Policy details | |||||||
6 | Institution: University of Glasgow Summary: Three entries. Uses ePrints. Repository used: Enlighten |
40,803 | 494 | 2,914 | 93 | 11 | 3,508 | Policy details | |
7 | Institution: Imperial College Repository used: Spiral Summary: Type not known. |
Not known |
Not available | ||||||
8 | Institution: King’s College London Repository used: Department of Computer Science E-Repository Summary: One entry. Uses ePrints. |
999 | Not available | ||||||
9 | Institution: University of Leeds Repository used: White Rose Research Online Summary: Uses ePrints. Shared by Leeds, Sheffield and York. |
8,013 | Not available | ||||||
10 | Institution: University of Liverpool Summary: One entry. Repository used: Research Archive |
698 | 641 | 1 | 615 | 138 | 0 | 642 | Not available |
11 | Institution: LSE Summary: 2 entries. Repository used: LSE Research Online |
26,044 | 4,534 | Not available | |||||
12 | Institution: University of Manchester Summary: One entry. Repository used: eScholar [See Note below] |
138,708 | 94,561 | 2 | 4,502 | 77 | 128 | 7,166 | Policy details |
13 | Newcastle University Summary: One entry. Repository used: Newcastle Eprints |
Not known |
Not available | ||||||
14 | Institution: University of Nottingham Summary: One entry. Repository used: Nottingham Eprints |
781 | Policy details | ||||||
15 | Institution: University of Oxford Summary: Five entries Repository used: ORA |
Not known |
Not available | ||||||
16 | Institution: Queen’s University Belfast Summary: One entry. Repository used: Queen’s Papers on Europeanisation & ConWEB |
Not determined |
Not available | ||||||
17 | Institution: University of Sheffield Repository used: White Rose Research Online Summary: See entry for Leeds. |
8,013 | Not available | ||||||
18 | Institution: University of Southampton Summary: 11 entries. Repository used: eprints.soton |
60,438 | 86 | 10,962 | 652 | 9,550 | 11,872 | Policy details | |
19 | Institution: University College London Summary: 1 entry Repository used: UCL Discovery |
30,904 | Policy details | ||||||
20 | Institution: University of Warwick Summary: 3 entries Repository used: WRAP |
1,633 | Not available | ||||||
TOTAL | 322,011 |
NOTE: The entry for the University of Manchester was updated on 15 June 2011, the day after the post was published, using information provided in a comment to the post. Since this information was gathered in a different way to the other findings (which used the ePrints advanced search function) the findings may not be directly comparable.
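As a check on the TOTAL row, the 'Total in IR' figures can be summed directly. This is a sketch based on one reading of the flattened table, in which the first number in each row is taken to be the 'Total in IR' value; rows with no figure are omitted. Note that the shared White Rose Research Online figure appears in both the Leeds and Sheffield rows, so it is counted twice in this sum.

```python
# 'Total in IR' values as read from the table (first number in each row).
totals_in_ir = {
    "Birmingham": 415,
    "Cardiff": 4562,
    "Glasgow": 40803,
    "King's College London": 999,
    "Leeds (White Rose)": 8013,
    "Liverpool": 698,
    "LSE": 26044,
    "Manchester": 138708,
    "Nottingham": 781,
    "Sheffield (White Rose)": 8013,
    "Southampton": 60438,
    "UCL": 30904,
    "Warwick": 1633,
}

grand_total = sum(totals_in_ir.values())
print(f"{grand_total:,}")  # 322,011
```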
It should also be noted that:
- The total number of full-text items listed in column 4 is taken from the findings for the ‘Full Text Status’ search. If this option is not available the entry is left blank.
- The totals listed in columns 5-7 are taken from the findings for the “Format” search option with column 6 including both PDF and Postscript options.
- The ‘Other formats’ totals listed in column 8 are taken from the findings for the “Format” search option using the options not included in columns 5-7. Note that this includes MS PowerPoint and Excel and various image, video, audio, XML and archive formats.
- The totals listed in column 9 are taken from the findings for the “Format” search option with all format options selected.
- The link to the policy page in column 10 is taken from entries provided in the ROARMap summary of IR policies. If no information is provided the entry is listed as “Not available”.
Comments
It seems that information on the file formats of items stored in institutional repositories is not easily obtained from using the advanced search option in ePrints software. From the previous survey we found that only IRs hosted at Liverpool and LSE of the 20 Russell Group Universities provided information on the numbers of full-text items. From this survey we find that Liverpool also provides information on the file formats used, but LSE does not. However Cardiff, Glasgow and Southampton Universities do provide information on the file formats used, so a better picture across this sector can be obtained.
It is clear that PDF/PostScript is the most popular format used for depositing items with very little evidence that HTML items are deposited. This will be disappointing for those who feel that the structure and ease of reuse provided by HTML should outweigh the convenience provided by PDF. Similarly the low usage of MS Word will be of concern to those who feel that the master format of a resource should be deposited rather than a lossy format such as PDF – although since PDF has been standardised by ISO it could be argued that depositing items in an open standard format reflects best practices.
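The relative popularity of each format can be worked out from the per-format figures in the table. The sketch below assumes one reading of the flattened rows (HTML, PDF/Postscript, MS Word, Other formats, in that order) and uses the sum of those four counts as the denominator, which is a choice made here for illustration and not something the survey itself defines.

```python
# Per-format item counts as read from the survey table.
format_counts = {
    "Cardiff":     {"html": 1,   "pdf": 67,    "msword": 2,   "other": 32},
    "Glasgow":     {"html": 494, "pdf": 2914,  "msword": 93,  "other": 11},
    "Liverpool":   {"html": 1,   "pdf": 615,   "msword": 138, "other": 0},
    "Manchester":  {"html": 2,   "pdf": 4502,  "msword": 77,  "other": 128},
    "Southampton": {"html": 86,  "pdf": 10962, "msword": 652, "other": 9550},
}

def format_shares(counts):
    """Return each format's share of the summed per-format counts."""
    total = sum(counts.values())
    return {fmt: n / total for fmt, n in counts.items()}

shares = {repo: format_shares(c) for repo, c in format_counts.items()}

for repo, share in shares.items():
    print(f"{repo}: PDF {share['pdf']:.1%}, HTML {share['html']:.1%}")
```

On these figures PDF/Postscript is the largest single category in every repository listed, although its share varies considerably between them.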
In some respects the details of the widely-used formats are lost in the ‘Other formats’ column. This includes various multimedia formats, but also archive formats: ZIP, TGZ and BZ2. Perhaps we may find that there are large numbers of ZIP archives containing MS Word, PDF and HTML versions of resources.
I feel there is a need for further data mining in order to understand the patterns of usage which are emerging across institutional repositories, how such patterns relate to policies (including access and preservation policies) and the implementation difficulties which depositors may experience in uploading items to their institutional repository. The danger is that we may develop an improved format (Scholarly HTML, perhaps) but fail to understand the current barriers to depositing full-text items.
Conclusions
A comment to my post which argued that Numbers Matter: Let’s Provide Open Access to Usage Data and Not Just Research Papers suggested that the term ‘Paradata’ would be appropriate to use. As described on Wikipedia “the paradata of a survey are data about the process by which the survey data were collected” or alternatively “administrative data about the survey”. The term has been used in a CETIS Wiki page on “Generating Paradata from MediaWiki” which refers to an NSDL Community Network page which proposes how the term can be applied in the context of educational resources and suggests that paradata can provide opportunities to “explicates usage patterns and inferred utility of resources”.
Repository managers have a clear need to understand usage patterns and how their resources can be reused. Since repositories are also closely linked with the open access agenda it would seem to be self evident that repository ‘paradata’ should be published openly – after all, if repository managers are promoting the benefits of open access to research publications to their researchers their arguments will be undermined if they fail to publish data under their control, where there should be no complexities of copyright ownership claimed by publishers.
It seems that repositories which use DSpace do not provide advanced search capabilities similar to those available in ePrints. Perhaps this might be a reason for the lack of data from such repositories. But for those who are lucky enough to be using ePrints, what reasons can there be for not providing a full range of statistics?
Christopher Gutteridge said
I’m deeply concerned about the power lying in the webometrics league table. http://repositories.webometrics.info/toprep_inst.asp
They give a ranking bonus for your number of “Rich Files”, which basically means “Number of PDFs”. This means that if we were to push for using “scholarly HTML” rather than PDF then our rank would drop.
Currently eprints.ecs.soton.ac.uk is at 22 and eprints.soton.ac.uk is at 60. — I couldn’t tell you why, but stats isn’t my strong suit.
My real concern is that this league table will stifle innovation by only measuring common quality factors, rather than promoting new ones. Also, I think the ‘delta’ is more important than the size, and always have. The success criteria for the TARDIS project, which launched eprints.soton was that it should have a number (2000, I think) of records by a date. I opposed that at the time, and still think it was wrong. A better criterion would have been a sustained deposit rate and (in the first 2 years) a continuously increasing number of contributors.
http://roar.eprints.org/ is run by one of my colleagues, but I’m very happy to see that they show graphs of ‘deposit activity’ rather than size. This shows that eprints.soton is in very robust health; http://roar.eprints.org/1423/ with a sustained level of daily deposits over the past few years.
What’s unhealthy is that a drop in the ranking for eprints.soton caused the board which oversees the site to discuss how to improve our rankings, and there was no really obvious way I could see to do it without generating unnecessary additional PDF files. Of course this was rejected as a silly idea, but my fear is that other sites may feel pressured to improve their ranking and make bad decisions. The community should be calling the shots of what metrics make a good repository. I’m not sure what those metrics should be, but they should be as careful as they can to avoid a situation where I can inflate my score by making my repository worse, eg. by encouraging bad formats like PDF.
If you’ve not heard the PDF rant, then in short it’s that people write and read papers primarily on computers. In most cases they write in a format with some markup (latex or Word) and then convert it to simulated sheets of A4 paper (PDF). Computers rarely have displays where an A4 page is useful. I don’t see how it’s acceptable to produce papers (gah, even the name is inappropriate) which can’t be comfortably viewed on my landscape laptop screen, on my phone, and on the iPad I might justify buying one day. Reading papers is one of the key things an academic does for a living and it’s still easier to read them by printing them out first.
There are some people moving in the right direction, at least: http://scholarlyhtml.org/ but the repository and research-publication community needs to be goaded into this direction out of its PDF comfort zone.
Chris Rusbridge said
Chris and Brian, you should take no notice of the webometrics site, which is fundamentally flawed in so many ways (IMHO). For a start, it only ranks those sites that follow its pattern! From its methodology page:
“- Only repositories with an autonomous web domain or subdomain are included:
repository.xxx.zz (YES)
http://www.xxx.zz/repository (NO)”
PS I don’t mind ordinary PDFs too much (always read on-screen, never print out), but I SO HATE 2-column PDFs, which are a roaring pain to read on screen :-(.
Chris Rusbridge said
Brian, why not use the ROAR stats for totals to fill out your table? Last I looked, the OpenDOAR stats were way out of date, but ROAR stats are gathered every day. You can also download a tsv file for the full set of UK repositories, and analyse each for things like monthly deposit rate (average, median, inter-quartile range etc). It may not exactly agree with the totals you get from querying the repository itself (perhaps affected by “dark” items), but it’s a pretty good estimate.
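The kind of analysis suggested above could be sketched as follows. The layout of the ROAR tsv file is not described here, so this example assumes the deposit dates have already been extracted into a list; the function name and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date
from statistics import mean, median, quantiles

def monthly_deposit_stats(deposit_dates):
    """Summarise deposits per calendar month (mean, median, IQR)."""
    counts = Counter((d.year, d.month) for d in deposit_dates)
    (year, month), last = min(counts), max(counts)
    # Walk every month in the range so months with no deposits count as 0.
    per_month = []
    while (year, month) <= last:
        per_month.append(counts.get((year, month), 0))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    # quantiles() needs at least two months of data.
    q1, _, q3 = quantiles(per_month, n=4)
    return {
        "months": len(per_month),
        "mean": mean(per_month),
        "median": median(per_month),
        "iqr": q3 - q1,
    }

# Hypothetical deposit dates: 2 in January, 0 in February, 1 in March, 3 in April.
sample = [
    date(2011, 1, 5), date(2011, 1, 20),
    date(2011, 3, 14),
    date(2011, 4, 1), date(2011, 4, 2), date(2011, 4, 30),
]
stats = monthly_deposit_stats(sample)
print(stats)
```

Filling in empty months matters here: without it, a repository with long gaps between deposits would appear to have a higher monthly rate than it really does.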
Christopher Gutteridge said
Actually, I think my above rant is worth reposting on our team blog :) http://blogs.ecs.soton.ac.uk/webteam/2011/06/14/concerns-about-competative-metrics-for-repositories/
Phil Butler said
Hi Brian
Just to help fill in a gap in your analysis, I include some stats for Manchester University’s institutional repository below. We are one of those that don’t use ePrints (the ePrints instance you found is a departmental repository set up a few years ago and running an early version of ePrints). At the institutional level we use the Fedora framework having concluded this better fits with our technology stack and long-term needs.
Stats are as of 14th June 2011. Manchester’s IR was launched in Sept 2009.
Total deposits : 138,708
Total deposits (open access) : 94,561
Total deposits with (one or more) file(s) attached : 7,166 (this includes examination and final versions of eTheses which are closed access indefinitely or embargoed)
Total deposits (open access) with (one or more) file(s) attached : 4,663
Total files deposited (open access) : 4,709
Total HTML files (open access) : 2
Total PDF (open access) : 4,502
Total MS Word (doc/docx) (open access): 77
Other formats (image, ppt, opendoc) (open access) : 128
Policies URL : http://www.manchester.ac.uk/escholar/about/policies
I suspect other UK repository managers could supply such stats if asked – UKCORR (http://www.ukcorr.org/) is possibly a good place to ask.
In general, I suggest comparing repositories using such stats is of little value (although I wasn’t sure that was your intention). Institutional repositories operate within the context of the institution they exist, with all the complexities and caveats which accompany this, making any meaningful comparisons very difficult. Furthermore, surely whether such stats are easily available or not, relates little to the open access aims of an institution’s repository? Potentially they are just a form of IR vanity. As an IR Manager, one question I constantly ask myself is, do I work towards promoting my institution’s research or my institution’s repository?
Regards, Phil
***********************************
Dr PR Butler
eScholarship Manager
Tel: +44 (0)161 275 1514 (internal x51514)
Email: p.butler@manchester.ac.uk
Web: http://www.manchester.ac.uk/escholar
***********************************
Brian Kelly (UK Web Focus) said
Hi Phil
Many thanks for your comments and the information you have provided. I have updated the post with your data, and have annotated the update in order to both make it clear that there are changes to the original post and to flag that this information was gathered differently from the other figures.
You are correct that the publication of the statistics isn’t intended to provide a league table based on the ‘value’ of the IRs (we all know that the institutions represented are very different, with differing goals, sizes, etc.). Rather the point of the two recent surveys on IRs and other surveys on institutional use of various Social Web services is to help to understand the patterns of use of such services which can be used in order to help identify emerging best practices in the various areas and also, possibly, help shape future directions for developments. In addition the information has been gathered (and documented) openly so that limitations in the survey methodology can be easily identified (in your case it was quickly noticed that the OpenDOAR registry which I used did not include details of your main repository – I’m sure that that omission will now be rectified). The methodology I used was based on the advanced search capabilities of ePrints and the survey provides direct links to the findings which will help to spot errors as well as providing documentation on the search parameters used. In particular note that the findings for the numbers of items for which full text was available included full-text items for which access may be restricted. My interest was in the formats and preservation issues rather than the Open Access agenda (the formats issue is less of a political issue and more amenable to reasoned discussion, I think!). This survey methodology is therefore more objective and reproducible than requesting statistics from IR managers.
What we now have is, for example, an emerging picture which shows the ubiquity of PDF and the very limited use of HTML. This could indicate that the ease of creating and depositing PDF resources makes it an ideal format for depositing papers. However this will be of concern to those who feel a more richly structured format (perhaps ScholarlyHTML) is desirable. We now have a (partial) benchmark from which we can see how the pattern may change (or not) in the future.
Note that although those of us who work in the sector are very much aware of the flaws in attempting to publish league tables, we also need to note that others may not be so reticent, as we saw in a Daily Telegraph headline which informed its readers that “Universities spending millions on websites which students rate as inadequate”. Although I subsequently published a post which agreed that “University Web sites cost money” and described how such spending provided a positive ROI, the damage was done. The data used in the article was gathered by FOI requests, and the institutions had no opportunity to be able to demonstrate the value gained from their investment. Gathering evidence of how IRs are being used, and the complexities of such uses, in an open fashion can enable the sector to take a lead in promoting the benefits of our activities and not be left on the defensive if we attempt to hide behind arguments such as “Our data is complex and you wouldn’t understand the subtleties” – MPs weren’t allowed to get away with such arguments over their expenses and, in the current political climate, we shouldn’t expect higher education to get off lightly!
I should add that in this comment I’m explaining the rationale for this work and not necessarily responding to comments you may have made.
Thanks
Brian
Steve Hitchcock said
Brian, An advanced search in EPrints repositories will find specified file formats. A more comprehensive profile of file formats in repositories can be obtained with the EPrints preservation app. This will be easiest to install from the Bazaar app store in the next version of EPrints 3.3, but is available now http://files.eprints.org/581/
To see how the app can profile different types of repositories, and to find out how many ‘other formats’ might be found in institutional repositories, see our recent paper in Ariadne, Characterising and Preserving Digital Repositories: File Format Profiles http://www.ariadne.ac.uk/issue66/hitchcock-tarrant/
As that paper says: “The starting point for preservation is to know what content you have, not just in terms of bibliographic metadata such as title, author, etc., but also in terms of technical metadata, including file formats.”
That starting point can lead to a range of tools available for any repository to build a full preservation programme http://blogs.ecs.soton.ac.uk/keepit/tag/keepit-course/
As a result, there is no need to restrict deposit to repositories for preservation purposes based on file format.
Where you lament the dominance of PDF over other formats in the items found in repositories, one longstanding reason for this has been a simple view of preservation. There is now at least one less reason for this to continue.
Brian Kelly (UK Web Focus) said
Hi Steve
Many thanks for your comments and the links you’ve provided. I think we are starting to understand the various tools which are needed as part of an IR toolkit in order to support the auditing of the ways in which repositories are being used.
Note that when you say “The starting point for preservation is to know what content you have, not just in terms of bibliographic metadata such as title, author, etc., but also in terms of technical metadata, including file formats” I would suggest that it is not just repository managers who will need such information but, at a higher, aggregated layer, policy makers and even politicians who have a responsibility to ensure that public sector investment is being used effectively.
I should also say that my main interest is in ensuring that open data about repositories is made available – my interests in policy issues related to open access and file formats are secondary to this (and in retrospect I should probably have been neutral in this particular post on the issues related to file formats).