UK Web Focus

Innovation and best practices for the Web

Archive for the ‘Repositories’ Category

A Pilot Survey of File Formats in Institutional Repositories

Posted by Brian Kelly (UK Web Focus) on 14 June 2011

Background

A recent post provided A Pilot Survey of the Numbers of Full-Text Items in Institutional Repositories. The survey made use of the advanced search functionality of ePrints repository software in order to gather data on the numbers of full-text items. Unfortunately it was found that most repositories had not configured the software to provide such information. Whilst exploring the advanced search features it was noticed that it was possible to provide searches based on file formats. This would appear to provide an answer to the question of formats used for depositing items in repositories and how, for example, this relates to preservation policies.

Survey Across Russell Group University Repositories

Testing Approach

In order to test the approach the advanced search facility for the ECS repository at the University of Southampton was used. The figure for the total number of items used the same search option as described in the previous post. Details of the number of HTML items, PDF or Postscript items, other formats and the total number of formats were obtained and links to the findings included so that the current status can be obtained (which also had the advantage of documenting the search parameters used). The findings are given in the following table.

| Ref. No. | Institutional Repository Details | Total in IR | Total Full-text | HTML | PDF/Postscript | MS Word | Other formats | All formats | Policy |
|---|---|---|---|---|---|---|---|---|---|
| A | Institution: ECS, University of Southampton; Repository used: eprint Repository; Summary: Uses ePrints. | 15,545 | 8,452 | 385 | 7,738 | 311 | 7,778 | 8,453 | Policy details |

It should be noted that there are differences between the total number of full-text items and the total of all formats. I am assuming that the number of full-text items will be equal to or less than the total number of items in a repository, but the total across all formats could be larger than the number of full-text items if there are multiple formats for a single item.

It should be noted that in this survey a link is provided to the policy statement for the repository which has been taken from the ROARMap summary of IR policies. In this example the implementation of the following policy statement might be demonstrated by the evidence presented:

It is our policy to maximise the visibility, usage and impact of our research output by maximising online access to it for all would-be users and researchers worldwide. 

Survey

Once again a survey of the institutional repositories for Russell Group Universities was carried out. The results are given in the following table, which this time includes a link to the IR policies. Note that the results were gathered using the public advanced search interface where this was available. If information on the numbers of full-text items becomes available I will update this post and annotate accordingly.

| (1) Ref. No. | (2) Institutional Repository Details | (3) Total in IR | (4) Total Full-text | (5) HTML | (6) PDF/Postscript | (7) MS Word | (8) Other formats | (9) All formats | (10) Policy |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Institution: University of Birmingham; Repository used: eprint Repository; Summary: Three entries. Uses ePrints. | 415 | | | | | | | Policy details |
| 2 | Institution: University of Bristol; Summary: One entry. Uses DSpace. | | | | | | | | Not available |
| 3 | Institution: University of Cambridge; Summary: Four entries. Uses DSpace. | | | | | | | | Not available |
| 4 | Institution: Cardiff University; Summary: 1 entry. Uses ePrints; Repository used: ORCA | 4,562 | | 1 | 67 | 2 | 32 | 72 | Not available |
| 5 | Institution: University of Edinburgh; Summary: Three entries. Uses DSpace. | | | | | | | | Policy details |
| 6 | Institution: University of Glasgow; Summary: Three entries. Uses ePrints; Repository used: Enlighten | 40,803 | | 494 | 2,914 | 93 | 11 | 3,508 | Policy details |
| 7 | Institution: Imperial College; Repository used: Spiral; Summary: Type not known. | Not known | | | | | | | Not available |
| 8 | Institution: King’s College London; Repository used: Department of Computer Science E-Repository; Summary: One entry. Uses ePrints. | 999 | | | | | | | Not available |
| 9 | Institution: University of Leeds; Repository used: White Rose Research Online; Summary: Uses ePrints. Shared by Leeds, Sheffield and York. | 8,013 | | | | | | | Not available |
| 10 | Institution: University of Liverpool; Summary: One entry; Repository used: Research Archive | 698 | 641 | 1 | 615 | 138 | 0 | 642 | Not available |
| 11 | Institution: LSE; Summary: 2 entries; Repository used: LSE Research Online | 26,044 | 4,534 | | | | | | Not available |
| 12 | Institution: University of Manchester; Summary: One entry; Repository used: eScholar [See Note below] | 138,708 | 94,561 | 2 | 4,502 | 77 | 128 | 7,166 | Policy details |
| 13 | Institution: Newcastle University; Summary: One entry; Repository used: Newcastle Eprints | Not known | | | | | | | Not available |
| 14 | Institution: University of Nottingham; Summary: One entry; Repository used: Nottingham Eprints | 781 | | | | | | | Policy details |
| 15 | Institution: University of Oxford; Summary: Five entries; Repository used: ORA | Not known | | | | | | | Not available |
| 16 | Institution: Queen’s University Belfast; Summary: One entry; Repository used: Queen’s Papers on Europeanisation & ConWEB | Not determined | | | | | | | Not available |
| 17 | Institution: University of Sheffield; Repository used: White Rose Research Online; Summary: See entry for Leeds. | 8,013 | | | | | | | Not available |
| 18 | Institution: University of Southampton; Summary: 11 entries; Repository used: eprints.soton | 60,438 | | 86 | 10,962 | 652 | 9,550 | 11,872 | Policy details |
| 19 | Institution: University College London; Summary: 1 entry; Repository used: UCL Discovery | 30,904 | | | | | | | Policy details |
| 20 | Institution: University of Warwick; Summary: 3 entries; Repository used: WRAP | 1,633 | | | | | | | Not available |
| TOTAL | | 322,011 | | | | | | | |

NOTE: The entry for the University of Manchester was updated on 15 June 2011, the day after the post was published, using information provided in a comment to the post. Since this information was gathered in a different way to the other findings (which used the ePrints advanced search function) the findings may not be directly comparable.

It should also be noted that:

  • The total number of full-text items listed in column 4 is taken from the findings for the ‘Full Text Status’ search. If this option is not available the entry is left blank.
  • The totals listed in columns 5-7 are taken from the findings for the “Format” search option, with column 6 including both PDF and Postscript options.
  • The ‘Other formats’ totals listed in column 8 are taken from the findings for the “Format” search option using the options not included in columns 5-7. Note that this includes MS PowerPoint and Excel and various image, video, audio, XML and archive formats.
  • The totals listed in column 9 are taken from the findings for the “Format” search option with all format options selected.
  • The link to the policy page in column 10 is taken from entries provided in the ROARMap summary of IR policies. If no information is provided the entry is listed as “Not available”.

Comments

It seems that information on the file formats of items stored in institutional repositories is not easily obtained by using the advanced search option in ePrints software. From the previous survey we found that, of the 20 Russell Group Universities, only the IRs hosted at Liverpool and LSE provided information on the numbers of full-text items. From this survey we find that Liverpool also provides information on the file formats used, but LSE does not. However Cardiff, Glasgow and Southampton Universities do provide information on the file formats used, so a better picture across this sector can be obtained.

It is clear that PDF/PostScript is the most popular format used for depositing items with very little evidence that HTML items are deposited. This will be disappointing for those who feel that the structure and ease of reuse provided by HTML should outweigh the convenience provided by PDF. Similarly the low usage of MS Word will be of concern to those who feel that the master format of a resource should be deposited rather than a lossy format such as PDF – although since PDF has been standardised by ISO it could be argued that depositing items in an open standard format reflects best practices.
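The dominance of PDF/Postscript can be put into numbers with a small calculation. The following is a minimal Python sketch, not part of the original survey: the function name `format_shares` is illustrative, and the counts are copied from the two table rows (Liverpool and Manchester) which report a value in every format column. Shares are taken against the sum of the per-format counts rather than the number of items, since a single item may be held in several formats, and the Manchester figures were gathered in a different way, so the two rows may not be directly comparable.

```python
# Format counts copied from the survey table. Only the rows that report a
# value in every format column are used here.
FORMAT_COUNTS = {
    "University of Liverpool": {
        "HTML": 1, "PDF/Postscript": 615, "MS Word": 138, "Other formats": 0,
    },
    "University of Manchester": {
        "HTML": 2, "PDF/Postscript": 4502, "MS Word": 77, "Other formats": 128,
    },
}


def format_shares(counts):
    """Return each format's share (in percent) of the summed per-format counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no format counts to compare")
    return {fmt: 100.0 * n / total for fmt, n in counts.items()}


for repository, counts in FORMAT_COUNTS.items():
    shares = format_shares(counts)
    summary = ", ".join(f"{fmt}: {pct:.1f}%" for fmt, pct in shares.items())
    print(f"{repository}: {summary}")
```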

In some respects the details of the widely-used formats are lost in the ‘Other formats’ column. This includes various multimedia formats, but also archive formats: ZIP, TGZ and BZ2. Perhaps we may find that there are large numbers of ZIP archives containing MS Word, PDF and HTML versions of resources.
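One way of testing the suggestion about ZIP archives would be to tally the file extensions found inside each archive. The sketch below uses only the Python standard library; the function name `count_formats_in_zip` and the demonstration archive are hypothetical, and it assumes the archives have already been downloaded from the repository by some other means.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_formats_in_zip(zip_source):
    """Tally the (lower-cased) file extensions of the members of a ZIP archive.

    zip_source may be a path or a binary file-like object.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            suffix = PurePosixPath(info.filename).suffix.lower()
            counts[suffix or "(none)"] += 1
    return counts


# Demonstration with an in-memory archive (hypothetical contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("paper.pdf", b"pdf bytes")
    archive.writestr("paper.doc", b"word bytes")
    archive.writestr("html/paper.html", b"<html></html>")

print(count_formats_in_zip(buffer))
```

Extension counts are only a rough proxy for format: a file's suffix says nothing about whether, for example, a PDF contains text or a scanned image.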

I feel there is a need for further data mining in order to understand the patterns of usage which are emerging across institutional repositories, how such patterns relate to policies (including access and preservation policies) and the implementation difficulties which depositors may experience in uploading items to their institutional repository. The danger is that we may develop an improved format (Scholarly HTML, perhaps) but fail to understand the current barriers to depositing full-text items.

Conclusions

A comment to my post which argued that Numbers Matter: Let’s Provide Open Access to Usage Data and Not Just Research Papers suggested that the term ‘Paradata’ would be appropriate to use. As described on Wikipedia, “the paradata of a survey are data about the process by which the survey data were collected” or alternatively “administrative data about the survey”. The term has been used in a CETIS Wiki page on “Generating Paradata from MediaWiki” which refers to an NSDL Community Network page which proposes how the term can be applied in the context of educational resources and suggests that paradata can provide opportunities to “explicates usage patterns and inferred utility of resources”.

Repository managers have a clear need to understand usage patterns and how their resources can be reused. Since repositories are also closely linked with the open access agenda it would seem to be self-evident that repository ‘paradata’ should be published openly. After all, if repository managers are promoting the benefits of open access to research publications to their researchers, their arguments will be undermined if they fail to publish data under their control, where there should be no complexities of copyright ownership claimed by publishers.

It seems that repositories which use DSpace do not provide advanced search capabilities similar to those available in ePrints. Perhaps this might be a reason for the lack of data from such repositories. But for those who are lucky enough to be using ePrints, what reasons can there be for not providing a full range of statistics?



Posted in Repositories | 9 Comments »

A Pilot Survey of the Numbers of Full-Text Items in Institutional Repositories

Posted by Brian Kelly (UK Web Focus) on 6 June 2011

Background

A recent post on How Do We Measure the Effectiveness of Institutional Repositories? sought to address the question of “What makes a good repository?” which was raised on the JISC-Repositories JISCMail list. The post outlined possible metrics which could be used for identifying the effectiveness of institutional repositories based on the intended purposes of a repository. In the post I suggested that if the purpose of a repository was to ensure the long-term preservation of resources, then there was a need to measure the number of full-text items in the repository – after all if the full text of a paper is not available the repository won’t be doing a very good job in the preservation of such resources!

The interest in this topic was revisited yesterday in a Twitter discussion which began with the suggestion from @PaulWalk that “I’ve thought we should use RepUK to measure actual persistence in repositories”. But in order to measure the persistence of the actual resource we need to be able to differentiate between the persistence of the full-text item and the resource itself and not just the persistence of the URI of the item. How might one do this?

Initial Experimentation

Following a discussion with Les Carr at the JISC 2011 conference I discovered that the ePrints advanced search interface can be used to retrieve information on both the numbers of items containing the full text and those that do not. In order to see if this approach could be used I looked at UKOLN’s items in Opus, the University of Bath’s institutional repository. From this I found that there were a total of 344 items, of which 146 were full-text items (including published and confidential items) and 198 were metadata-only items. We can see that 42% of the items contain the full text.

In order to see if this use of the ePrints advanced search could be applied in a similar fashion to another repository I looked at the ECS ePrint Repository at the University of Southampton. This time I found that out of a total of 15,532 items the departmental repository contained 8,439 items with the full text and 7,093 metadata-only items; this time 54.3% of items contain the full text.
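The percentages quoted for these two test cases are simply the number of full-text items divided by the total number of items. As a minimal Python sketch (the function name `full_text_percentage` is illustrative and not part of ePrints), using the UKOLN figures from the text and the ECS figures as given in the corrected table later in this post:

```python
def full_text_percentage(full_text_items, total_items):
    """Return the share of items which have the full text, as a percentage."""
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    return 100.0 * full_text_items / total_items


# UKOLN items in Opus: 146 full-text items out of 344.
print(f"UKOLN items in Opus: {full_text_percentage(146, 344):.1f}%")

# ECS repository: 8,439 full-text items out of 15,532 (corrected table figures).
print(f"ECS, Southampton: {full_text_percentage(8439, 15532):.1f}%")
```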

But are these initial findings typical across the sector?

Survey Across Russell Group University Repositories

We might expect the 20 research-intensive Russell Group Universities to be playing a leading role in use of institutional repositories, with either institutional mandates (in the case of Southampton University) or institutional research culture helping to ensure that significant numbers of full-text items are deposited. But is this really the case? In order to investigate whether the approach described could be applied more widely the survey was carried out across Russell Group Universities.

Using the list of repositories taken from the OpenDOAR directory I found that 3 of the Russell Group Universities seem to use the DSpace repository software and the advanced search function in DSpace does not appear to allow searching to be restricted to full-text and metadata-only records.

Subsequent investigation of the advanced search capabilities of the remaining 17 institutions showed that only two seemed to provide the advanced search function which I used on the University of Bath and ECS, University of Southampton repositories. However there is a RESTful interface to the search and so the search parameters used to search the University of Bath repository were used across the other ePrints repositories. The following searches were carried out:

Query 1: Total Number of Items

http://eprint.domain/cgi/search/quicksearch?screen=Public%3A%3AEPrintSearch&basic_merge=ALL&basic=web&full_text_status=public&full_text_status=restricted&full_text_status=none&groups_merge=ALL&satisfyall=ALL&order=-date%2Fcreators_name%2Ftitle&_action_search=Search

Query 2: Full text deposited (but access may be restricted)

http://eprint.domain/cgi/search/quicksearch?screen=Public%3A%3AEPrintSearch&basic_merge=ALL&basic=web&full_text_status=public&full_text_status=restricted&groups_merge=ALL&satisfyall=ALL&order=-date%2Fcreators_name%2Ftitle&_action_search=Search

Query 3: No full text available:

http://eprint.domain/cgi/search/quicksearch?screen=Public%3A%3AEPrintSearch&basic_merge=ALL&basic=web&full_text_status=none&groups_merge=ALL&satisfyall=ALL&order=-date%2Fcreators_name%2Ftitle&_action_search=Search
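The three queries differ only in which `full_text_status` values are repeated in the query string, so the URLs can be generated rather than typed by hand. The following is a minimal Python sketch using the standard library; the function name `build_query_url` and the `QUERIES` mapping are illustrative, `http://eprint.domain` is the placeholder host used above, and the parameter names and values are copied from the URLs in this post rather than from any ePrints documentation. The sketch only builds the URLs and does not send any requests.

```python
from urllib.parse import urlencode

SEARCH_PATH = "/cgi/search/quicksearch"

# Which full_text_status values each query includes (taken from the URLs above).
QUERIES = {
    "Query 1: total number of items": ["public", "restricted", "none"],
    "Query 2: full text deposited": ["public", "restricted"],
    "Query 3: no full text available": ["none"],
}


def build_query_url(base_url, statuses):
    """Build a search URL for the given list of full_text_status values."""
    params = [
        ("screen", "Public::EPrintSearch"),
        ("basic_merge", "ALL"),
        ("basic", "web"),
    ]
    # A repeated parameter is expressed as one (name, value) pair per value.
    params += [("full_text_status", status) for status in statuses]
    params += [
        ("groups_merge", "ALL"),
        ("satisfyall", "ALL"),
        ("order", "-date/creators_name/title"),
        ("_action_search", "Search"),
    ]
    return base_url.rstrip("/") + SEARCH_PATH + "?" + urlencode(params)


for label, statuses in QUERIES.items():
    print(label)
    print(build_query_url("http://eprint.domain", statuses))
```

Passing a list of pairs to `urlencode` keeps the parameters in the same order as the original URLs, which makes the generated strings easy to compare with the ones given here.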

It was intended to use the survey methodology across the Russell Group universities which host an institutional repository based on the ePrints software. However it was not possible to get valid results for most of the repositories and it was subsequently discovered that this is an optional feature for ePrints repositories.

Rather than abandon this work I have decided to publish this post in order to encourage institutions which host an ePrints repository to implement this feature since I feel it would be beneficial to the repository community if we had a better picture of how institutions are using repositories to host full-text items.

The table below gives the results of the two test cases (from Bath and Southampton) together with details of the total number of items in the other repositories. If information on the numbers of full-text items becomes available I will update this post and annotate accordingly. [Note there was an error in the figures for the ECS repository. This has now been corrected in the table below.]

| Ref. No. | Institutional Repository Details | Query 1: Total Nos. of Items | Query 2: Total Nos. of Full text Items | Query 3: Total Nos. of Metadata-Only items | Percentage of Full-Text Items |
|---|---|---|---|---|---|
| A | Institution: University of Bath; Repository used: Opus Repository; Summary: Uses ePrints. | 20,210 | 1,387 | 18,823 | 6.86% |
| B | Institution: ECS, University of Southampton; Repository used: eprint Repository; Summary: Uses ePrints. | 15,532 (was 974) | 8,439 (was 861) | 7,093 (was 113) | 54.3% (was 11.6%) |
| TOTAL | | 35,742 (was 21,184) | 9,826 (was 2,248) | 25,916 (was 18,936) | 27.4% (was 10.6%) |

The table below gives the results of the findings for what seems to be the main repository from Russell Group Universities. Note that the results were gathered using the public advanced search interface where this was available. If information on the numbers of full-text items becomes available I will update this post and annotate accordingly.

| Ref. No. | Institutional Repository Details | Query 1: Total Nos. of Items | Query 2: Total Nos. of Full text Items | Query 3: Total Nos. of Metadata-Only items | Percentage of Full-Text Items |
|---|---|---|---|---|---|
| 1 | Institution: University of Birmingham; Repository used: eprint Repository; Summary: Three entries. Uses ePrints. | 411 | | | |
| 2 | Institution: University of Bristol; Summary: One entry. Uses DSpace. | | | | |
| 3 | Institution: University of Cambridge; Summary: Four entries. Uses DSpace. | | | | |
| 4 | Institution: Cardiff University; Summary: 1 entry. Uses ePrints; Repository used: ORCA | 4,562 | | | |
| 5 | Institution: University of Edinburgh; Summary: Three entries. Uses DSpace. | | | | |
| 6 | Institution: University of Glasgow; Summary: Three entries. Uses ePrints; Repository used: Enlighten | 40,803 | | | |
| 7 | Institution: Imperial College; Repository used: Spiral; Summary: Type not known. | Not determined | | | |
| 8 | Institution: King’s College London; Repository used: Department of Computer Science E-Repository; Summary: One entry. Uses ePrints. | 999 | | | |
| 9 | Institution: University of Leeds; Repository used: White Rose Research Online; Summary: Uses ePrints. Shared by Leeds, Sheffield and York. | 8,013 | | | |
| 10 | Institution: University of Liverpool; Summary: One entry; Repository used: Research Archive | 698 | 641 | 57 | 93% |
| 11 | Institution: LSE; Summary: 2 entries; Repository used: LSE Research Online | 26,044 | 4,534 | 21,510 | 17.4% |
| 12 | Institution: University of Manchester; Summary: One entry; Repository used: MMS | Not determined | | | |
| 13 | Institution: Newcastle University; Summary: One entry; Repository used: Newcastle Eprints | Not determined | | | |
| 14 | Institution: University of Nottingham; Summary: One entry; Repository used: Nottingham Eprints | 781 | | | |
| 15 | Institution: University of Oxford; Summary: Five entries; Repository used: ORA | Not determined | | | |
| 16 | Institution: Queen’s University Belfast; Summary: One entry; Repository used: Queen’s Papers on Europeanisation & ConWEB | Not determined | | | |
| 17 | Institution: University of Sheffield; Repository used: White Rose Research Online; Summary: See entry for Leeds. | 8,013 | | | |
| 18 | Institution: University of Southampton; Summary: 11 entries; Repository used: eprints.soton | 60,438 | | | |
| 19 | Institution: University College London; Summary: 1 entry; Repository used: UCL Discovery | 30,904 | | | |
| 20 | Institution: University of Warwick; Summary: 3 entries; Repository used: WRAP | 1,633 | | | |
| TOTAL | | 183,299 | 5,175 | 21,567 | |

At the time of writing we have to say that we do not know how many of the 183,299 items contain the full-text. All we can say is that there are at least 5,175 full-text items (or only 2.8%) – and this is based on the assumption that a full-text item represents the content of the metadata item, rather than for example, a PowerPoint slide used in the presentation of a paper.
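The “at least 5,175” figure is a lower bound obtained by summing the full-text counts only for those repositories which report them. A minimal Python sketch of that calculation follows; the structure and names (`SURVEY`, `lower_bound`) are illustrative, and the figures are copied from the table above, with `None` standing for a count which could not be obtained.

```python
# (repository, total items, full-text items or None when not reported).
# White Rose Research Online is counted under both Leeds and Sheffield,
# as it is in the total given in the post.
SURVEY = [
    ("Birmingham", 411, None),
    ("Cardiff", 4562, None),
    ("Glasgow", 40803, None),
    ("King's College London", 999, None),
    ("Leeds", 8013, None),
    ("Liverpool", 698, 641),
    ("LSE", 26044, 4534),
    ("Nottingham", 781, None),
    ("Sheffield", 8013, None),
    ("Southampton", 60438, None),
    ("UCL", 30904, None),
    ("Warwick", 1633, None),
]


def lower_bound(rows):
    """Return (total items, known full-text items, lower-bound percentage)."""
    total_items = sum(items for _, items, _ in rows)
    known_full_text = sum(ft for _, _, ft in rows if ft is not None)
    return total_items, known_full_text, 100.0 * known_full_text / total_items


total_items, known_full_text, percentage = lower_bound(SURVEY)
print(f"{known_full_text:,} of {total_items:,} items ({percentage:.1f}%) "
      "are known to have the full text")
```

Treating a missing count as `None` rather than zero keeps the distinction between “no full-text items” and “not reported” visible in the data.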

An Opportunity for Developers

I should also like to point out that, as described on the DevCSI blog, the deadline for the Developer Challenge at Open Repositories 2011 (Austin, Texas) is Thursday 9 June. A CrowdVine page for the developer challenge describes how the Challenge is to “Show us the future of repositories”. Since “Remote presentations would be considered in exceptional circumstances” it strikes me that there might be an opportunity to submit an entry based on an analysis of the percentage of full-text items in repositories, but this would probably have to be done using an alternative approach. A suggestion for anyone who would like to submit an entry based on this idea could be:

The future of repositories is to preserve the full text of research papers for future generations. We can see how well we are doing in implementing this vision which shows that xx% of repositories across the y sector already contain full-text items :-)

Or, if the results are disappointing:

The future of repositories is a gloomy one as only y% of repositories across the z sector contain full text items :-(

Alternatively we might conclude that new development is not required for those running ePrints repositories:

The future of repositories is reliant on the provision of evidence which can be used to inform policies and so ePrints repository managers should configure their services to provide the evidence described in this post!

Is that an unreasonable suggestion?



Posted in Evidence, Repositories | 14 Comments »

What I Like and Don’t Like About IamResearcher.com

Posted by Brian Kelly (UK Web Focus) on 27 April 2011

IamResearcher.com

I was recently told about the Iamresearcher.com service, a repository of information about researchers and their research activities. “Not another one!” was one reaction I heard. But is there anything that can be learnt from this service, which has been developed by Mr Yang Yang, an MSc student at the University of Southampton? Les Carr, over on his Repository Man blog, has been “Experimenting With Repository UI Design” and describes how he is “always on the lookout for engaging UI paradigms to inspire repository design”. Might this service provide any new UI design paradigms?

Things I Like

I have to admit that I was pleased with how easy it was to get started with the service. I signed up and asked the system to find papers associated with my email address. It found many of my papers, with much of the metadata being obtained from the University of Bath Opus repository. I then searched for other papers which weren’t included in the initial set and was able to claim them as belonging to me – including one short paper which had been published in the Russian Digital Libraries Journal in 2000 which I had forgotten about.

I can now view my 49 entries and sort them in various ways: in addition to the default date order I can also sort by item type; lead author; co-authors and keywords. The view of my co-authors (illustrated) was of particular interest: I hadn’t realised that I had written papers with 55 others.

In comparison the interface provided on my institutional repository service does now seem quite dated. However this is perhaps not unexpected as according to the Wikipedia entry the ePrints software (which is widely used across the UK) was created way back in 2000.

Revisiting the question as to whether we need another service which provides access to research information I would say ‘yes’. Such developments can help drive innovation. In this case ePrints developers are in a position to see more modern approaches to the user interface. In addition the service describes itself as a “Web 3.0 ready application” by which they seem to mean that the service “connects researcher and research students anywhere in the world using an intelligent network”.

I haven’t seen much evidence of Web 3.0 capabilities in the service, apart from being able to download details of my papers in FOAF format, but perhaps the “ready” word is providing a signal that such functionality is not yet available.

Things I Don’t Like

There are some typos on the data entry forms and some usability niggles, but nothing too significant – indeed after attending a recent Bathcamp Startup Night and hearing the suggestion that “If you’re not embarrassed about the launch version of your software then you released it too late” (a quote from the founder of LinkedIn) I welcome seeing this service before everything has been thoroughly checked.

The language used in the terms of service is somewhat worrying, however:

No Injunctive Relief.
In no event shall you seek or be entitled to rescission, injunctive or other equitable relief, or to enjoin or restrain the operation of the Service, exploitation of any advertising or other materials issued in connection therewith, or exploitation of the Services or any content or other material used or displayed through the Services.

It also seems that as a user of the service I undertake not to:

Duplicate, license, sublicense, publish, broadcast, transmit, distribute, perform, display, sell, rebrand, or otherwise transfer information found on iamResearcher (excluding content posted by you) except as permitted in this Agreement, iamResearcher’s developer terms and policies, or as expressly authorized by iamResearcher

Hmm. The service harvested its metadata from other repository services, such as the University of Bath’s Opus repository, but does not allow others to reuse its content. This seems to undermine the benefits provided by permitting (indeed encouraging) others to make use of open data. In addition the service appears to be hypocritical, as the University of Bath’s repository policy (which was created using the OpenDOAR Policy tool) states that “The metadata must not be re-used in any medium for commercial purposes without formal permission”. Now the Iamresearcher.com service does not appear to be a commercial service – but its privacy policy states that “To support the Services we provide at no cost to our Users, as well as provide a more relevant and useful experience for our Users, we serve our own ads and also allow third party advertisements on the site”. If advertising does appear on the service, won’t it then be breaching the terms and conditions of the service from which it harvested its data?

Personally I have no problem with advertising being used to fund services where, as in this case, there are multiple providers of services. Indeed those who argue for openness of data should be willing to accept that data may be used for commercial purposes. However services which accept the opportunities provided by open data should accept that they should be providing similar conditions of usage.

The final concern that I have about the service is that currently it can only be accessed if you sign in. I feel this is counter-productive – indeed one person I mentioned this service to asked why he should bother. That’s a fair comment, I think. And seeing that the terms and conditions also state that users of the service are not allowed to:

Deep-link to the Site for any purpose, (i.e. including a link to a iamResearcher web page other than iamResearcher’s home page) unless expressly authorized in writing by iamResearcher or for the purpose of promoting your profile or a Group on iamResearcher as set forth in the Brand Guidelines;

I now wonder what benefits this service can provide to the research community. Developers of other repository services, however, should be able to learn from the technological enhancements the service provides, even if the business model is questionable.



Posted in openness, Repositories | 8 Comments »

How Do We Measure the Effectiveness of Institutional Repositories?

Posted by Brian Kelly (UK Web Focus) on 24 February 2011

 

The Need for Metrics

How might one measure the effectiveness of an institutional repository? An approach which is arising from various activities I am involved in related to evidence, value and impact is based on the need to identify the underlying purpose(s) of services and to gather evidence related to how such purposes are being addressed.

Therefore there is a need to initially identify the purposes of an institutional repository. Institutions may have a variety of different purposes (which is why, although gathering evidence can be important, drawing up league tables is often inappropriate). But let’s suggest that two key purposes may be: (1) maximising access to research publications and (2) ensuring long-term preservation of research publications. What measures may be appropriate for ensuring such purposes are being achieved?

For maximising access to research publications two important measures will be the numbers of items in the repository and the numbers of accesses to the items. Since the numbers themselves will have little meaning in isolation there will be a need to measure trends over time, with an expectation of growth in the numbers of items deposited (which should slow down once legacy items have been uploaded and only new items are being deposited) and a continual increase in the overall traffic to the repository as the number of items grows and access to the items via various resource discovery services provides easier ways of finding such resources.

Access Statistics for Institutional Repositories

The relevance of such statistics is well-understood with, here at the University of Bath, the IRStats module for the ePrints repository service providing access to information such as details of all downloads, the overall number of downloaded items (100,003 at the time of writing), the trends over time and various other summaries, as illustrated.

However it is important to recognise that such measures only indirectly provide an indication of how well a repository may be doing in maximising access to research publications. In part traffic may be generated by users following links to content of no interest to them through use of search engines such as Google (which is responsible for providing 38% of traffic to the University of Bath repository, with another 10.2% arriving via Google Scholar). In addition even if a relevant paper is found and read, the ideas it contains may not be felt to be of direct interest and may not be used to inform subsequent research activities.

A citation to a resource will provide more tangible evidence of direct benefits of a repository to supporting research activities and work such as the MESUR metrics activity is looking to “investigate an array of possible impact metrics that includes not only frequency-based metrics (citation and hit counts), but also network-based metrics such as those employed in social network analysis and web search engines”. However in this post I will focus on evidence which can be easily gleaned from repositories themselves.

Whilst it is possible to point out various limitations in using such metrics the danger is that we lose sight of the fact that they can still have a role to play in providing a proxy indicator of value. So although repository items which are found and downloaded may not be of interest or may not be used, other items will be relevant and inform, either directly or indirectly, other research work. We might therefore assert that an increase in traffic may also have a positive correlation with an increase in use.

The Numbers of Items in Repositories

Measuring the numbers and growth in numbers of items in a repository would seem to be less problematic than access statistics. This measurement can reflect the effectiveness of a repository’s aims in supporting the preservation of research publications, as publications are migrated from departmental Web sites or individuals’ personal home pages to a centrally managed environment. The growth in the numbers of items should also, of course, help in enhancing access to the papers too.

Repositories may, however, only provide access to the metadata about a paper and not access to the paper itself. This may be due to a number of factors including copyright restrictions, (perceived) difficulties in uploading documents or the unavailability of the documents.

There may also be a need to be able to differentiate between the total number of distinct items in a repository and the numbers of formats which may be made available. Storage of the original master format is often recommended for preservation purposes and where ease of reuse of the content may be required (e.g. merging together various papers and producing a table of contents can be much easier if the original files are available, rather than a series of PDFs which can be more difficult to manipulate).
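The distinction between items and formats can be illustrated with a minimal sketch. The records below are entirely hypothetical (they are not taken from any repository) and assume that a deposit can be represented as an item identifier paired with a file format, with a metadata-only record having no format at all.

```python
from collections import Counter

# Hypothetical deposit records: (item identifier, file format).
# One item may have several files in different formats.
records = [
    ("item-1", "pdf"),
    ("item-1", "docx"),
    ("item-2", "pdf"),
    ("item-3", "html"),
    ("item-3", "pdf"),
    ("item-4", None),  # metadata-only record with no full text
]

def profile(records):
    """Return (total items, full-text items, number of files per format)."""
    all_items = {item for item, _ in records}
    full_text_items = {item for item, fmt in records if fmt is not None}
    formats = Counter(fmt for _, fmt in records if fmt is not None)
    return len(all_items), len(full_text_items), formats

total, full_text, formats = profile(records)
print(total, full_text, dict(formats))
```

In this made-up data there are four items, three of which have full text, but five files in total, which is why a count of formats can exceed a count of full-text items.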

Alternative formats for items may also help to enhance access for users of mobile devices or users with disabilities who may require assistive technologies to process repository items. This then leads to the question of not only the formats provided but how those formats are being used: is a PDF easily processed by assistive technology or is it simply a scanned image which cannot be read by voice browsers? In addition, as suggested by preliminary research carried out by my colleagues Emma Tonkin and Andy Hewson described in a post on “Automated Accessibility Analysis of PDFs in Repositories“, might the cover pages automatically generated by repository systems create additional barriers to access of such resources?

Trends Across the Community

This post has outlined areas in which evidence should be gathered and used in order to be able to help demonstrate the value of an institutional repository service and help to ensure that a number of best practices are being addressed (and, if not, to be able to develop plans for implementing such best practices).

Although such work should be done within the context of an individual repository service there are also benefits to be gained from observing trends across the community. My colleague Paul Walk recently mentioned on the JISC-Repositories JISCMail list UKOLN’s development of a prototype harvesting and aggregation system for metadata from UK institutional repositories called ‘RepUK’. One aspect of this work is aggregation of metadata records from institutional repositories and visualisation of various aspects of the data quality. Mark Dewey, lead developer for this work, has released an initial prototype tool. As can be seen this can provide a visualisation of the growth in the number of records across the 133 repositories which have been harvested.

Discussion

This post has suggested that metrics are needed in order to help to provide answers, perhaps indirectly, to questions regarding the effectiveness of institutional repositories as well as to support and inform the development of the repositories and the adoption of best practices. Of course measuring the effectiveness of institutional repositories will also require user surveys, but this post only considers quantitative approaches which are summarised in the table below.

Metric | Purpose | Comments
Total usage | Provides an indication of repository’s effectiveness in enhancing access to research papers. | Data may need to be carefully interpreted.
Number of items | Provides an indication of repository’s effectiveness in both enhancing access to research papers and in ensuring their preservation. | It might be expected that growth will decrease after a backlog of papers has been uploaded.
Profiling Alternative Formats | May provide an indication that papers can be accessed by users with disabilities or by users using mobile devices. | Provision of multiple formats may enhance access and reuse.
Profiling Format Quality | Provides an indication that the formats provided are fit for purpose (e.g. PDFs are not just scanned images). | This may indicate problems with repository workflow, need for education, etc.

But what additional tools may be needed (I would welcome a mobile app for my iPod Touch along the lines of the stats app for WordPress blogs)? What advice is needed in interpreting the findings (and avoiding misinterpretations)? Your thoughts are welcomed.



Posted in Evidence, Repositories | 9 Comments »

Web Accessibility, Institutional Repositories and BS 8878

Posted by Brian Kelly (UK Web Focus) on 24 January 2011

Review of Work on Accessibility and Institutional Repositories

Back in December 2006 I wrote a post on Accessibility and Institutional Repositories in which I suggested that it might be “unreasonable to expect hundreds if not thousands of legacy [PDF] resources to have accessibility metadata and document structures applied to them, if this could be demonstrated to be an expensive exercise of only very limited potential benefit“. I went on to suggest that there is a need to “explore what may be regarded as ‘unreasonable’ we then need to define ‘reasonable’ actions which institutions providing institutional repositories would be expected to take“.

A discussion on the costs and complexities of implementing various best practices for depositing resources in repositories continued in September 2008 as I described in a post on Institutional Repositories and the Costs Of Doing It Right, with Les Carr suggesting that “If accessibility is currently out of reach for journal articles, then it is another potential hindrance for OA“. Les was arguing that the costs of providing accessible resources in institutional repositories are too great and can act as a barrier to maximising open access to institutional research activities.

I agree – but that doesn’t mean that we should abandon any thoughts of exploring ways of enhancing accessibility. A paper on “From Web Accessibility to Web Adaptability” (available in PDF and HTML formats) described an approach called “Web Adaptability” which has the flexibility to account for a variety of contextual factors, something which is not possible with an approach based purely on conformance with WCAG guidelines. An accompanying blog post which summarised the paper described how the adaptability approach could be applied to institutional repositories:

Adaptability and institutional repositories: Increasing numbers of universities are providing institutional repositories in order to enhance access to research publications and to preserve such resources for future generations. However many of the publications will be deposited as a PDF resource, which will often fail to conform with accessibility guidelines (e.g. images not being tagged for use with screen readers; text not necessarily being ‘linearised’ correctly for use with such devices, etc.). Rather than rejecting research publications which fail to conform with accessibility guidelines the Web adaptability approach would support the continued use and growth of institutional repositories, alongside an approach based on advocacy and education on ways of enhancing the accessibility of research publications, together with research into innovative ways of enhancing the accessibility of the resources.

The stakeholder approach to Web accessibility, originally developed by Jane Seale for use in an elearning context and described in a joint paper on Accessibility 2.0: People, Policies and Processes (available in PDF, MS Word and HTML formats) has been extended for use in a repository context. The approaches to engagement with some of the key stakeholders are given below:

Education: Training provided (a) for researchers to ensure they are made aware of importance of accessibility practices (including SEO benefits) and of techniques for implementing best practices and (b) for repository managers and policy makers to ensure that accessibility enhancements can be procured in new systems.

Feedback to developers: Ensure that suppliers and developers are aware of the importance of accessibility issues and that enhancements are featured in development plans.

Feedback to publishers: Ensure that publishers who provide templates are aware of importance of provision of accessible templates.

Auditing: Systematic auditing of papers in repositories to monitor extent of accessibility concerns and trends.

But is this approach valid?  Surely SENDA accessibility legislation requires conformance with WCAG guidelines? And if it is difficult to conform with such guidelines, surely the best approach is to keep a low profile?

BS 8878 Web Accessibility Code of Practice

The BS 8878 Web accessibility Code of Practice was launched in December 2010. A summary of an accompanying Webinar about the Code of Practice was given in a post on BS 8878: “Accessibility has been stuck in a rut of technical guidelines” – and it was interesting to hear how the code of practice has been written in the context of the Equality Act which has replaced the DDA. I was also very pleased to hear of the user-focus which is at the heart of the code of practice, and how mainstream approaches on best practices have moved away from what was described as a “rut of technical guidelines“.

Although the Code of Practice is not available online and costs £100 to purchase, an accompanying set of guidelines was produced by AbilityNet which I have used in the following summary. Note I had to request a copy of these guidelines and I can no longer find the link to contact details to request copies. However AbilityNet’s complete set of guidelines can be purchased for £4,740!

It seems that there is a clear financial barrier to the implementation of new accessibility guidelines. In order to minimise the costs to higher education (which would approach a quarter of a million pounds if all UK Universities were to purchase a copy at the list price!)  I’ll give my interpretation of how the code of practice could be applied in the context of institutional repositories. But please note that this is very much an initial set of suggestions and should not be considered to be legal advice!

The heart of the BS 8878 document is a 16 step plan:

  1. Define the purpose.
  2. Define the target audience.
  3. Analyse the needs of the target audience.
  4. Note any platform or technology preferences.
  5. Define the relationship the product will have with its target audience.
  6. Define the user goals and tasks.
  7. Consider the degree of user experience the web product will aim to provide.
  8. Consider inclusive design & user-personalised approaches to accessibility.
  9. Choose the delivery platform to support.
  10. Choose the target browsers, operating systems & assistive technologies to support.
  11. Choose whether to create or procure the Web product.
  12. Define the Web technologies to be used in the Web product.
  13. Use Web guidelines to direct accessibility Web production.
  14. Assure the Web product’s accessibility through production (i.e. at all stages).
  15. Communicate the Web product’s accessibility decisions at launch.
  16. Plan to assure accessibility in all post-launch updates to the product.

Note that Step 13, which covers use of WCAG guidelines, may previously have been regarded as the only or the most significant policy item. BS 8878 places these guidelines in a more appropriate context.

Using BS 8878 for Institutional Repositories

A summary of how I feel each of these steps might be applied to institutional repositories is given below.

  1. Define the purpose:
    The purposes of the repository service will be to enhance access to research papers and to support the long term preservation of the papers.
  2. Define the target audience:
    The main target audience will be a global research community.
  3. Analyse the needs of the target audience:
    Researchers may need to use assistive technologies to read PDFs.
  4. Note any platform or technology preferences:
    PDFs may not include accessibility support.
  5. Define the relationship the product will have with its target audience:
    The paper will be provided at a stable URI.
  6. Define the user goals and tasks:
    Users will use various search tools to find resources. Papers will then be read on screen or printed.
  7. Consider the degree of user experience the web product will aim to provide:
    Usability of the PDF document will be constrained by publisher’s template. Technical accessibility will be constrained by workflow processes.
  8. Consider inclusive design & user-personalised approaches to accessibility:
    Usability of the PDF document will be constrained by publisher’s template. Technical accessibility will be constrained by workflow processes.
  9. Choose the delivery platform to support:
    Aims to be available on devices with PDF support, including mobile devices.
  10. Choose the target browsers, operating systems & assistive technologies to support:
    All?
  11. Choose whether to create or procure the Web product:
    The service is provided by repository team.
  12. Define the Web technologies to be used in the Web product:
    HTML interface to PDF resources.
  13. Use Web guidelines to direct accessibility Web production:
    HTML pages will seek to conform with WCAG 2.0 AA. PDF resources may not conform with PDF accessibility guidelines.
  14. Assure the Web product’s accessibility through production (i.e. at all stages):
    Periodic audits of PDF accessibility planned.
  15. Communicate the Web product’s accessibility decisions at launch:
    Accessibility statement to be published.
  16. Plan to assure accessibility in all post-launch updates to the product:
    Periodic reviews of technical developments.

Step 15 requires the publication of an accessibility statement, which “states in an easy to understand and non-technical way the accessibility features of the site and any known limitations“. This will be the aspect of the accessibility work which will be visible to users of the service. But what might such an accessibility statement cover?

Current Approaches to Accessibility Statements for Repositories

The first step to answering this question was to see what accessibility statements are currently provided for institutional repositories. An analysis of the first page of results for a Google search for “repository accessibility statement” provided only a single example of an accessibility statement for an institutional repository. This was provided by UBIR, the University of Bolton Institutional Repository, and appears to be a description of WCAG conformance for the repository Web pages rather than the contents of the Web site:

Standards Compliance

  1. All static pages follow U.S. Federal Government Section 508 Guidelines.
  2. All static pages follow priorities 1 & 2 guidelines of the W3C Web Content Accessibility Guidelines.
  3. All static pages validate as HTML 4.01 Transitional.
  4. All static pages on this site use structured semantic markup. H2 tags are used for main titles, H3 and H4 tags for subtitles.

The Google results for other institutional repositories, including UEA and the University of Salford Informatics Research Institute Repository (USIR), were based on links to standard accessibility statements for the institutional Web site, with the statement for the University of Salford, for example, stating that:

The University of Salford strives to ensure that this website is accessible to everyone. If you have any questions or suggestions regarding the accessibility of this site, or if you come across a page or resource that does not meet your access needs, please contact the webmaster@salford.ac.uk, as we are continually striving to improve the experience for all of our visitors.

It seems that the contents of an institutional repository (the core purpose, after all, of a repository) do not have statements regarding their accessibility. I will admit that I have only had a cursory exploration for such statements and would love to be proved wrong. But for now let’s assume that the accessibility statement required for step 15 of BS 8878 will have to be produced from scratch.

A Possible Accessibility Statement For An Institutional Repository

Might the following be an appropriate statement for inclusion on an institutional repository?  Please note that I am not a repository manager so I don’t know if such a statement is realistic.  However I should also add that I have deposited 46 of my papers and related articles in the University of Bath repository and am aware of some of the difficulties in ensuring such items will conform with accessibility guidelines for PDFs, MS Word and HTML, the main formats used for depositing items.   Since it is likely to be difficult for the motivated individual author to address accessibility concerns for their own items, we cannot expect best practices to be applied for the 1,568 items deposited in 2010, never mind items deposited before then.

It is therefore not realistic to suggest that authors or repository managers should simply implement the advice on producing accessible PDFs provided by organisations such as JISC TechDis.  Rather the accessibility statement needs to be honest about the limitations of the service and difficulties which people with disabilities may have in accessing items hosted in institutional repositories.

The following draft accessibility statement is therefore suggested as providing a realistic summary regarding the accessibility of a typical repository service.

Statement: The University’s repository service is an open-access information storage & retrieval system containing the university’s research findings and papers, openly and freely accessible to the research community and public.

A full description of each item is provided, and where copyright regulations permit, the full-text of the research output is stored in the repository and fully accessible.

Items are deposited in the repository via a number of resources, including author self-deposit, deposit by authorised staff in departments and deposits by repository staff.
Comment: Note that this definition of the purpose of the service has been taken from the UEA Digital Repository.

Statement: Items are normally provided in PDF format although other formats such as MS Word or HTML may also be used.
Comment: An audit of file formats may inform this statement.

Statement: Items are normally deposited in the format required by the publisher. Popular formats should be accessible using standard viewing tools. However some formats may require specialist browsers to be installed.
Comment: An audit of file formats may inform this statement and provide information on how to install any specialist viewers.

Statement: Items may not conform to appropriate accessibility guidelines due to the devolved responsibilities for depositing items and the complexities of implementing the guidelines across the large number of items housed in the repository.
Comment: If this is the case, it should be stated.

Statement: Future developments to the service will include an “Accessibility problem” button which will enable repository staff to be alerted to the scale of accessibility problems.
Comment: This should only be included if it is intended to implement such a service.

Statement: Repository staff will work with the University Staff Development Unit to ensure that training is provided on ways of creating accessible documents which will be open to all staff and research students.
Comment: This should only be included if it is intended to implement such training.

Statement: Repository staff will carry out periodic audits on the accessibility of repository items, monitor trends and act accordingly.
Comment: This should only be included if it is intended to implement such a service. Note UKOLN have developed a trial application which could implement such a service which was described in a paper on Automated Accessibility Analysis of PDFs in Repositories.

Statement: The Web interface to repository content will conform with University Web site accessibility guidelines.
Comment: This statement should be taken from the accessibility policy for the main University Web site.

I hope this has provided something to initiate a discussion on ways in which institutional repositories can address accessibility issues which can provide barriers to researchers with disabilities and build on the successes repositories are having in addressing access barriers provided by copyright issues, complex business models and fragmented resources which may be difficult to find and retrieve.

Posted in Accessibility, Repositories | Tagged: | 8 Comments »

Scribd Seems to be Successful in Enhancing Access to Papers

Posted by Brian Kelly (UK Web Focus) on 10 January 2011

I first wrote about the Scribd document repository service back in March 2007 in a post entitled “Scribd – Doing For Documents What Slideshare Does For Presentations“. Since then I have uploaded a number of papers to the service.  But almost three years on, how has the service developed?

My original post summarised some of the benefits of the service but highlighted a number of concerns:

Has Scribd raised the bar in users’ expectations for digital repositories? In some respects, I feel it has. However there are concerns which need to be recognised:

  • Poor quality resources which are hosted: there is no guarantee of the quality of the resources which are hosted on Scribd. And there are copyrighted publications (including those from O’Reilly) which have already been uploaded.
  • Sustainability of the service: As with all of these types of services, there is the question as to whether such services are sustainable. Techcrunch reported on 6 March 2007 that the service “is coming out of private beta this morning with a fresh Angel investment of $300K on top of their original Y Combinator nest egg of $12,000.“ This may keep the service running for a short time, but will it be around in the medium to long term? And what will happen if copyright holders, such as O’Reilly, take the service to court for their misuse of their copyrighted resources (as Viacom have recently done to YouTube)?
  • Lack of an interoperable resource discovery architecture: The approach taken by Scribd is not interoperable with the approach being taken by the JISC development community, which is looking to support the development of distributed interoperable digital repository services which make use of OAI-PMH.

Three years later the service is still available.  And looking at the statistics for access to documents I uploaded to the service, it also seems very popular:  during 2010 there were no fewer than 11,729 views of the 15 papers I uploaded to the service, an average of 32 per day.  As you can see from the graph below there were two significant peaks in the year, when there were over 800 in a day.  If I remove these outliers by viewing the statistics for the last six months of the year I find 4,215 views in the six month period, giving an average of  24 per day.

In comparison looking at the usage statistics for my 26 papers hosted in the University of Bath Opus repository I find that there have been 2,505 views during 2010.

Hmm, the repository has almost twice as many papers, and resources in the repository are linked to from the UKOLN Web site and from posts on this blog. The repository also benefits from being part of a larger repository ecology, with access available from services such as OpenDOAR and MIMAS’s Institutional Repository Search. And yet the Scribd service seems to get significantly more visits.
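To put the two sets of figures on a comparable footing, the arithmetic can be sketched as follows. This uses only the 2010 totals quoted above; the per-paper comparison is a rough normalisation I have added for illustration, and it assumes the view counts from the two services measure broadly the same thing, which may well not be the case.

```python
# Figures quoted in this post for 2010.
scribd_views, scribd_papers = 11_729, 15
opus_views, opus_papers = 2_505, 26

def per_paper(views, papers):
    """Average number of views per hosted paper."""
    return views / papers

def per_day(views, days):
    """Average number of views per day over the period."""
    return views / days

scribd_per_paper = per_paper(scribd_views, scribd_papers)
opus_per_paper = per_paper(opus_views, opus_papers)

print(round(per_day(scribd_views, 365)))             # 32 views per day on Scribd
print(round(scribd_per_paper), round(opus_per_paper))  # 782 and 96 views per paper
print(round(scribd_per_paper / opus_per_paper, 1))   # ratio of about 8.1
```

On these figures a paper on Scribd received roughly eight times as many views as a paper in the repository, although the two sets of papers are not identical and neither count says anything about whether the paper was actually read.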

Looking at a specific instance, my most recent paper, “Moving From Personal to Organisational Use of the Social Web“, was presented at the Online Information 2010 conference at the end of November. This paper was uploaded to the University of Bath repository and was mentioned in a blog post on “Availability of Paper on “Moving From Personal to Organisational Use of the Social Web”” which linked to the copy in the repository. The paper was also uploaded to Scribd – and this was also mentioned in the blog post (and was, indeed, embedded in the post). The usage statistics to date (10 January 2011) are 53 views in the University of Bath repository and 447 views on Scribd.

Scribd also provides an easy-to-use interface for viewing usage statistics for individual papers. As can be seen from the image, there was a peak (of 181 views) on the day the blog post was published with a smaller peak (102 views) three days previously. The total number of views from embedded reads (i.e. people who read the blog post and may – or may not – have actually read the embedded paper) is 349. This leaves 160 views of the paper within the Scribd environment – over three times as many views as received for the copy in the institutional repository.

Whilst I can’t help but think that the usage statistics are flawed, I don’t have any evidence of this. I would appreciate suggestions why the views seem so large. But I also suspect that there will be views from people who were searching for information provided in the papers – and if only 10% of the views came from satisfied users that would be on par with those viewing the larger number of papers in the institutional repository (which is also likely, of course, to be inflated by readers using  Google to view papers which aren’t of interest).

Now Scribd does seem to host, how shall I put it, a wide variety of types of documents, not all of which are of relevance to researchers. But the service does have a variety of features which can help to enhance access to documents such as links to Social Web services such as Twitter and Facebook for promoting documents of interest to one’s professional network and the ability for documents to be embedded in other Web sites.

So if one wishes to maximise the impact of one’s ideas will the institutional repository or a commercial service such as Scribd provide the best solution? Or perhaps one should use both approaches? And if you feel that researchers will prefer to use a more research-friendly environment than is provided by Scribd, remember that researchers, like everyone else, use Google, which will also find resources of dubious scholarly relevance for searches.

Posted in Repositories, Web2.0 | Tagged: | 4 Comments »

Is It Too Late To Exploit RSS In Repositories?

Posted by Brian Kelly (UK Web Focus) on 22 December 2010

A few years ago we had discussions about ways in which information about UKOLN peer-reviewed papers could be more effectively presented. We asked “Could we provide a timeline view? Or how about a Wordle display which illustrates the variety of subject areas researchers at UKOLN are engaged in?” The answer was yes we could, but it wouldn’t be sensible to carry out development work ourselves. Rather we should ensure that our publications were made available in Opus, the University of Bath’s institutional repository.  And since repositories are based on open standards we would be able to reuse the metadata about our publications in various ways.

We now have a UKOLN entry in Opus and there’s also an RSS feed for the items. And similarly we can see entries for individuals, such as myself, and have an RSS feed for individual authors.

Unfortunately the RSS feed is limited to the last ten deposited items rather than returning the 223 items for UKOLN or the 45 items belonging to me. The RSS feed is failing to live up to its expectations and isn’t much use :-(

The Leicester Research Archive (LRA), in contrast, does seem to provide a comprehensive set of data available as RSS. So, for example, if I go to the Department of Computer Science’s page in the repository there is, at the bottom right of the page (though, sadly, not available as an auto-discoverable link) an RSS feed – and this includes all 50 items.

Sadly when I tried to process this feed, in Wordle, Dipity and Yahoo! Pipes, I had no joy, with the feed being rejected by all three applications. I did wonder if the feed might be invalid, but the W3C RSS validator and the RSS Advisory Board’s RSS Validator only gave warnings. These warnings might indicate the problem, as the RSS feed did contain XML elements which might not be expected in an RSS feed.
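A quick way of checking what a feed actually contains is to count its items and list any elements which come from outside the core RSS vocabulary. The following is a minimal sketch using Python’s standard library. The sample feed and its URLs are made up, and the Dublin Core `dc:creator` element is simply an example of a namespaced extension: I am not suggesting it is the element which caused the warnings for the LRA feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, standing in for a repository's RSS output.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example repository feed</title>
    <link>http://repository.example.org/</link>
    <description>Recent deposits</description>
    <item>
      <title>First paper</title>
      <link>http://repository.example.org/1</link>
      <dc:creator>A. Author</dc:creator>
    </item>
    <item>
      <title>Second paper</title>
      <link>http://repository.example.org/2</link>
    </item>
  </channel>
</rss>"""

def summarise_feed(xml_text):
    """Return (number of items, sorted list of namespaced element names)."""
    root = ET.fromstring(xml_text)
    items = root.findall("./channel/item")
    # ElementTree reports namespaced tags as "{namespace-uri}local-name",
    # so anything starting with "{" is an extension to core RSS 2.0.
    extensions = {el.tag for el in root.iter() if el.tag.startswith("{")}
    return len(items), sorted(extensions)

count, extensions = summarise_feed(SAMPLE_FEED)
print(count, extensions)
```

Comparing the item count with the number of items the repository reports would show whether a feed is complete, and the list of extension elements gives a starting point for working out why a consuming application might reject it.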

But whilst my experiment to demonstrate how widely available applications which process RSS feeds could possibly be used to enrich the outputs from an institutional repository  has been unsuccessful to date, I still feel that we should be encouraging developers of institutional repository software to allow full RSS feeds to be processed by popular services which consume RSS.

I have heard arguments that providing full RSS feeds might cause performance problems – but is that necessarily the case? I’ve also heard it suggested that we should be using ‘proper’ repository standards, meaning OAI-PMH – but as Nick Sheppard has recently pointed out on the UKCORR blog:

I have for some time been a little nonplussed by our collective, continued obsession with the woefully under-used OAI-PMH. Other than OAIster (an international service), the only services I’m currently aware of in the UK are the former Intute demo now maintained by Mimas.

In his post Nick goes on to suggest “Perhaps OAI-PMH has had its day“. It’s unfortunate, I feel, that RSS does not seem to have been given the opportunity to see how it can be used to provide value-added services to institutional repositories. Is it too late?
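For comparison with a ten-item RSS feed, the sketch below shows how a harvester pages through a full set of records using OAI-PMH. The `ListRecords` verb, the `oai_dc` metadata prefix and the `resumptionToken` mechanism are part of the protocol; the base URL, the token value and the helper function names are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def list_records_url(base_url, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # A resumption token is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)

def next_token(response_xml):
    """Return the resumption token in a response, or None if there is none."""
    el = ET.fromstring(response_xml).find(".//oai:resumptionToken", OAI_NS)
    return el.text if el is not None and el.text else None

# Hypothetical, heavily trimmed response containing only a resumption token.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("http://repository.example.org/oai")
token = next_token(SAMPLE_RESPONSE)
second_url = list_records_url("http://repository.example.org/oai", token)
print(first_url)
print(second_url)
```

A harvester would repeat the request with each new token until the response no longer contains one, which is how the complete set of records can be retrieved in batches.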

Posted in Repositories, rss | 8 Comments »

Availability of Paper on “Moving From Personal to Organisational Use of the Social Web”

Posted by Brian Kelly (UK Web Focus) on 29 November 2010

I will present a paper on “Moving From Personal to Organisational Use of the Social Web” at the Online Information 2010 conference tomorrow as well as, as described previously, via a pre-recorded video at the Scholarly Communication Landscape: Opportunities and Challenges symposium.

The eight page paper will be included in the conference proceedings and can also be purchased for a sum of £135! However my paper is available (for free!) from the University of Bath Opus Repository. In addition, in order to both enhance access routes to the paper (and the ideas it contains) and to explore the potential of a Web 2.0 repository service, the document has also been uploaded to the Scribd service.

From the University of Bath repository users can access various formats of the paper and a static and persistent URI is provided for the resource.   But what does Scribd provide?

Some answers to this question can be seen from the screen shot shown below. Two facilities which I’d like to mention are the ability to:

  • Let others know about papers being read in Scribd using the Readcast option which will send a notification to services such as Twitter and Facebook.
  • Embed the content in third party Web pages.

In addition the Scribd URI seems likely to be persistent: http://www.scribd.com/doc/43280157/Moving-From-Personal-to-Organisational-Use-of-the-Social-Web

I had not expected the WordPress.com service to allow Scribd documents to be embedded but, as can be seen below, this is possible.

There are problems with Scribd, however. Its list of categories for uploaded resources is somewhat idiosyncratic (e.g. Comics, Letters to our leaders, Brochures/Catalogs). There is also a lot of content from UKOLN, my host organisation, which has been uploaded without our approval. But in terms of the functionality and ways in which the content can be reused in other environments it has some appeal. If only these benefits could be integrated with the more managed environment for content and metadata provided by institutional repositories. But should that be provided by institutional repositories embedding Web 2.0 style functionality or, alternatively, by Web 2.0 repository services adding on additional management capabilities?

Posted in Repositories, Web2.0 | Tagged: | 4 Comments »

EPub Format For Papers in Repositories

Posted by Brian Kelly (UK Web Focus) on 4 August 2010

EPub as a Format for Use in Institutional Repositories?

In a post entitled “File Formats For Papers In Your Institutional Repository” I suggested that depositing a HTML version of a paper might have various advantages over the PDF format which is the norm. But in light of the growing importance of mobile devices wouldn’t it seem appropriate to make such papers available in the EPub format?

EPub is described in Wikipedia as “a free and open e-book standard by the International Digital Publishing Forum (IDPF)“. The article goes on to add that “EPUB is designed for reflowable content, meaning that the text display can be optimized for the particular display device used by the reader of the EPUB-formatted book. The format is meant to function as a single format that publishers and conversion houses can use in-house, as well as for distribution and sale.“

In terms of the open standards used EPub consists of three specifications:

  • Open Publication Structure (OPS) 2.0, contains the formatting of its content.
  • Open Packaging Format (OPF) 2.0, describes the structure of the .epub file in XML.
  • OEBPS Container Format (OCF) 1.0, collects all files as a ZIP archive.

The article states that “EPUB internally uses XHTML or DTBook (an XML standard provided by the DAISY Consortium) to represent the text and structure of the content document and a subset of CSS to provide layout and formatting. XML is used to create the document manifest, table of contents, and EPUB metadata. Finally, the files are bundled in a zip file as a packaging format.”
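To make the packaging described above a little more concrete, here is a minimal Python sketch of how an XHTML document might be bundled into an EPub-style ZIP archive. It is an illustration only and not the tool I used for the conversion: the function name `build_epub`, the file names `content.opf` and `paper.xhtml` and the default identifier are all assumptions, and a real EPub file would also need a table of contents (NCX) and should be checked with a validator.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# OCF: META-INF/container.xml points the reading system at the OPF package file.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(title, xhtml_body, identifier="urn:uuid:example-paper"):
    """Return the bytes of a minimal EPub-style package holding one XHTML file.

    The identifier default is a placeholder (an assumption for this sketch).
    """
    safe_title = escape(title)
    # OPF: metadata, a manifest of the files and the reading order (spine).
    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{safe_title}</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">{escape(identifier)}</dc:identifier>
  </metadata>
  <manifest>
    <item id="paper" href="paper.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="paper"/>
  </spine>
</package>
"""
    # OPS: the content itself is ordinary XHTML.
    xhtml = f"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>{safe_title}</title></head>
  <body>{xhtml_body}</body>
</html>
"""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry must come first and be stored uncompressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        for name, data in (("META-INF/container.xml", CONTAINER_XML),
                           ("content.opf", opf),
                           ("paper.xhtml", xhtml)):
            archive.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()
```

Writing the returned bytes to a file with an `.epub` extension gives a package whose three layers (content, package description and container) correspond to the three specifications listed above.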

Using the EPub Format

This sounds interesting so I converted the HTML version of my recent paper on “Empowering users and their institutions: A risks and opportunities framework for exploiting the potential of the social web” into EPub format and added it to my library of ebooks on my iPod Touch using the Stanza application.

The accompanying images show how the paper is displayed. The first image illustrates the page turning style of navigation provided using EPub and the second image illustrates an embedded image.

The paper is also available from Opus, the University of Bath’s institutional repository service. I should mention that the URL for the EPub file is http://opus.bath.ac.uk/17484/5/i4.epub. I discovered that entering the URL into a browser on my iPod Touch allowed me to view the document in the Stanza application. On a normal PC users will probably not have a viewer set up to render this format, which may cause some confusion.

As might be expected for a format which uses XHTML the conversion from the XHTML original was a simple operation. I should add that I also experimented with converting a PDF version of the paper to EPub but this resulted in various problems due, I think, to the way in which the two columns used in the paper were linearised.

Revisiting the Issue of Formats for Use in Repositories

This initial experiment seemed to show that creating an EPub version of a paper in a repository can be done quite easily. However the ease of doing this may have been due to the availability of a HTML version of the paper; doing this on a large scale may be time-consuming if HTML versions of papers are not available.

Let’s revisit the question of what formats for papers we should be seeking to deposit in institutional repositories.

From a preservation perspective the advice from archivists tends to be that you should preserve the original master copy. In many cases this is likely to be MS Word, although other popular formats will probably include Open Office and LaTeX.

From an interoperability perspective an open standard is preferable. I would suggest that rather than making use of a specific DTD designed for scholarly publishing we should use a well-established and popular existing open format – HTML (in whatever version).

If we wish to maximise the take-up of our repositories whilst minimising the effort in processing the files it seems to me that we should explore ways of creating derivative versions from the master source. So rather than uploading a PDF shouldn’t we be uploading the master file and creating a PDF automatically from this resource? And rather than creating an EPub file, as I have done, shouldn’t the repository software create the EPub file from a HTML version of the file? And whilst I acknowledge that authors may not wish to make their original document (in, say, MS Word or Open Office format) available to others and would regard the interoperability aspects of PDF as a feature rather than a flaw there should be nothing to stop the master file being stored in the repository but not openly accessible.
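As a rough illustration of what such an automated workflow might look like, the Python sketch below builds the commands a repository could run to derive a PDF from the master file and an EPub file from the HTML version. The choice of tools (LibreOffice’s `soffice` in headless mode and `pandoc`) is my assumption for the purposes of the example; it is not something ePrints or Opus is known to do, and the function name `derivative_commands` is hypothetical. The commands are only constructed here, not executed.

```python
from pathlib import Path


def derivative_commands(master, html, out_dir):
    """Build (but do not run) commands that derive PDF and EPub files.

    Assumes LibreOffice ("soffice") and pandoc are installed; both tools
    are illustrative choices rather than part of any repository software.
    """
    master, html, out_dir = Path(master), Path(html), Path(out_dir)
    # Master document (e.g. MS Word or Open Office) -> PDF.
    pdf_cmd = ["soffice", "--headless", "--convert-to", "pdf",
               "--outdir", str(out_dir), str(master)]
    # HTML version -> EPub, named after the HTML file.
    epub_cmd = ["pandoc", str(html), "-o",
                str(out_dir / (html.stem + ".epub"))]
    return {"pdf": pdf_cmd, "epub": epub_cmd}
```

Each list could be passed to `subprocess.run` by a deposit workflow, leaving the master file stored but not necessarily openly accessible.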

Is anyone thinking along these lines?


Posted in Repositories | 22 Comments »

Automated Accessibility Analysis of PDFs in Repositories

Posted by Brian Kelly (UK Web Focus) on 30 July 2010

Back in December 2006 I wrote a post on Accessibility and Institutional Repositories in which I suggested that it might be “unreasonable to expect hundreds if not thousands of legacy [PDF] resources to have accessibility metadata and document structures applied to them, if this could be demonstrated to be an expensive exercise of only very limited potential benefit”. I went on to suggest that there is a need to “explore what may be regarded as ‘unreasonable’ we then need to define ‘reasonable’ actions which institutions providing institutional repositories would be expected to take”.

A discussion on the costs and complexities of implementing various best practices for depositing resources in repositories continued as I described in a post on Institutional Repositories and the Costs Of Doing It Right in September 2008, with Les Carr suggesting that “If accessibility is currently out of reach for journal articles, then it is another potential hindrance for OA”. Les was arguing that the cost of providing accessible resources in institutional repositories is too great and can act as a barrier to maximising open access to institutional research activities.

I agreed with this view, but also felt there was a need to gain evidence on possible accessibility barriers. Such evidence should help to inform practice, user education and policies. These ideas were developed in a paper published last year on “From Web Accessibility to Web Adaptability” (available in PDF and HTML formats) in which I suggested that institutions should “run automated audits on the content of [PDF resources in] the repositories. Such audits can produce valuable metadata with respect to resources and resource components and, for example, evaluate the level of use of best practices, such as the provision of structured headings, tagged images, tagged languages, conformance with the PDF standard, etc. Such evidence could be valuable in identifying problems which may need to be addressed in training or in fixing broken workflow processes.”

I discussed these ideas with my colleagues Emma Tonkin and Andy Hewson who are working on the JISC-funded FixRep project which “aims to examine existing techniques and implementations for automated formal metadata extraction, within the framework of existing toolsets and services provided by the JISC Information Environment and elsewhere“. Since this project is analysing the metadata for repository items including “title, author and resource creation date, temporal and geographical metadata, file format, extension and compatibility information, image captions and so forth” it occurred to me that this work could also include automated analyses of the accessibility aspects of PDF resources in repositories.

Emma and Andy have developed such software which they have used to analyse records in the University of Bath Opus repository.  Their initial findings were published in a paper on “Supporting PDF accessibility evaluation: Early results from the FixRep project”. This paper was accepted by the “2nd Qualitative and Quantitative Methods in Libraries International Conference (QQML2010)” which was held in Greece on 25-28 May 2010. Due to the volcanic ash Emma and Andy were unable to attend the conference. Emma did, however, produce a Slidecast of the presentation, which has the advantage that it can be embedded in this blog:

The prototype software they developed was used to analyse PDF resources by extracting information about the document in a number of ways including header and formatting analysis; information from the body of the document and information from the originating filesystem.  The initial pilot analysed PDFs held in the University of Bath repository and was successful in analysing 80% of the PDFs, with the remaining 20% unable to be analysed either due to a lack of metadata available for extraction or because the file format was not supported by the analysis tools.

In my discussions with Emma and Andy we discussed how knowledge of the tools used to create the PDF would be useful in understanding the origins of possible accessibility limitations, with such knowledge being used to inform both user education and the workflow processes used to create PDFs which are deposited in repositories. However rather than the diversity of PDF tools which were expected to be found, there appeared to be only two main tools used. It appears that this reflects the software used to create the PDF cover page (which I have written about recently) rather than the tools used to create the main PDF resource. If you are unfamiliar with such cover pages one is illustrated – the page aims to provide key information about the paper and also provides institutional branding, as can be seen.
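To give a flavour of the kind of tally discussed above, here is a crude Python sketch which looks for the Producer entry in the raw bytes of PDF files and counts how often each tool appears. This is not the FixRep software. It assumes the document information is stored uncompressed as a simple literal string, which will not hold for every PDF, and the function names are hypothetical. A proper PDF parsing library would be needed for serious analysis.

```python
import re
from collections import Counter

# Matches "/Producer (some text)", allowing backslash-escaped characters.
PRODUCER_PATTERN = re.compile(rb"/Producer\s*\(((?:\\.|[^\\)])*)\)")


def pdf_producer(pdf_bytes):
    """Return the /Producer string found in raw PDF bytes, or None."""
    match = PRODUCER_PATTERN.search(pdf_bytes)
    if match is None:
        return None
    return match.group(1).decode("latin-1")


def tally_producers(pdf_files):
    """Count how often each producing tool appears.

    pdf_files is an iterable of raw PDF byte strings; files with no
    recognisable Producer entry are counted as "unknown".
    """
    counts = Counter()
    for pdf_bytes in pdf_files:
        counts[pdf_producer(pdf_bytes) or "unknown"] += 1
    return counts
```

If a cover page is added after the fact, the Producer value recorded is likely to be that of the cover page software rather than the author’s original tool, which is consistent with only two main tools being found.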

As Emma concluded in the presentation “We may be ‘shooting ourselves in the foot’ with additions like after-the-fact cover sheets. This may remove original metadata that could have been utilised for machine learning.”

Absolutely! As well as acting as a barrier to Search Engine Optimisation (which is discussed in the paper)  the current approaches taken to the production of such cover pages act as a barrier to research, such as the analysis of the accessibility of such resources.

It does strike me that this is nothing new. When the Web first came to the attention of University marketing departments there was a tendency to put large logos on the home page, images of the vice-chancellor and even splash screens to provide even more marketing, despite Web professionals pointing out the dangers associated with such approaches.

So whilst I understand that there may be a need for cover pages, can they be produced in a more sophisticated fashion so that they are friendly to those who are developing new and better ways of accessing resources in institutional repositories? Please!

Posted in Accessibility, Repositories | 8 Comments »

File Formats For Papers In Your Institutional Repository

Posted by Brian Kelly (UK Web Focus) on 7 July 2010

File Formats I Have Used to Deposit Items in the Bath Institutional Repository

What file formats should you use to deposit papers in your institutional repository?  Although I recently suggested that RSS could have a role to play in allowing the contents of a repository to be syndicated in other environments, that post didn’t address the question of the preferred file format(s) for mainstream resources such as peer-reviewed papers.

For my papers in the University of Bath Opus repository I initially deposited the original MS Word file and the PDF version which is normally submitted to the journal or conference: the MS Word file is the original source material which is needed for preservation purposes and the PDF file is the open standard version which should be more resilient to software changes than the MS Word format.

What I hadn’t done, though, was to deposit a HTML version of my papers, despite the fact that I normally create such files.  I think I suspected that uploading HTML files into a repository might be somewhat complicated so when I uploaded my papers I omitted the HTML versions of the papers.

Problems With PDFs

However when I recently viewed the repository copy of the PDF version of my paper on “Library 2.0: Balancing the Risks and Benefits to Maximise the Dividends” I discovered that such papers have a cover page appended as shown.

Having recently been a co-facilitator on a series of workshops on “Maximising the Effectiveness of Your Online Resources” I am well aware of best practices to help ensure that valuable resources can be easily discovered by search engines. And although papers in the repository do have a ‘cool URI’, prefixing the content of all papers in the repository with the same words (“University of Bath Open Online Publications Store” followed by “http://opus.bath.ac.uk/” and “This version is made available in accordance with publisher policies. Please cite only the published version using the citation below.”) goes against best practices for Search Engine Optimisation.
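The concern about identical leading text can be checked mechanically. The Python sketch below, which assumes the text of each PDF has already been extracted by some other means, reports the words which every document begins with. The function name is hypothetical and this is only an illustration of the idea, not a tool used in the workshops.

```python
def shared_leading_words(texts):
    """Return, in order, the words that every text starts with."""
    word_lists = [text.split() for text in texts]
    if not word_lists:
        return []
    prefix = []
    # zip stops at the shortest text, so no index can run out of range.
    for words in zip(*word_lists):
        if all(word == words[0] for word in words):
            prefix.append(words[0])
        else:
            break
    return prefix
```

A long shared prefix across many papers would suggest that a search engine sees the same opening words for every item, rather than each paper’s own title and abstract.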

The cover page isn’t the only concern I have with use of PDFs in institutional repositories.  Despite PDF being an ISO standard not all PDF creation programs will necessarily create PDFs which conform with the standard, with papers containing mathematical formulae or scientific notation being particularly prone to failing to embed the fonts needed to provide a resource suitable for long-term preservation.  I also suspect that, although it is possible to create accessible PDFs, many PDF files stored in repositories will fail to conform with PDF accessibility guidelines.

Providing HTML Versions of Papers

In light of these reservations I have decided to provide a HTML version of my recent papers in the University of Bath institutional repository. So my paper on “From Web Accessibility to Web Adaptability” (for which the publisher’s embargo has recently expired) is available in HTML as well as PDF formats.

As I suspected, however, depositing the HTML version of the paper was slightly tricky.  I uploaded the paper using the Upload for URL option and this initial attempt resulted in the page’s navigational elements and search interface being embedded in the page.  And since the upload mechanism only uploads files which are ‘beneath’ the paper in the underlying directory structure the page’s style sheet was not included.  In short, the page looked a mess.

Since the HTML files I have created contain the contents of the paper separately from the page’s navigational elements it was not too difficult to create a very simple HTML file which I included (with the citation details appended at the end of the paper) in the resource which is available in the repository. As can be seen the contents are available even if the page is not visually appealing.

There are, of course, resource implications in creating HTML versions of papers. However it will be interesting to see if providing content which is more easily found in Google provides benefits in enhancing access to papers which are provided in HTML format  - and since resource discovery is one of the main aims of a repository it might be argued that resources should be provided to ensure that HTML versions of papers are made accessible.

But What About Richer XML Formats?

The purist might argue that whilst HTML is an open and Web-native resource it may not be rich enough for use with peer-reviewed papers. I have some sympathy with such views. Anthony Leonard has described how we should go about “Fixing academic literature with HTML5 and the semantic web”. I would agree that there’s a need to explore how HTML5 can be used in the context of institutional repositories.

But mightn’t there be another XML format we should consider? How about an open format which is widely supported and deployed and which, for many authors, will not require any changes to their authoring environment? The format is OOXML – an ECMA standard which has also been standardised as an International Standard (ISO/IEC 29500). However not all open standards are equally open and as this standard is based on Microsoft’s format for their office applications, as Wikipedia describes “the ISO standardization of Office Open XML was controversial and embittered“.

In light of this discussion, what format(s) would you recommend for use with institutional repositories?

Posted in Repositories | 12 Comments »

Getting Into The Top Ten For Your Institutional Repository

Posted by Brian Kelly (UK Web Focus) on 10 June 2010

Statistics on Downloads for the University of Bath Institutional Repository

The University of Bath is currently testing the IR Stats package in Opus, the University’s institutional repository. Using the Web interface to the package I ran a search for the top ten downloads over the past year.   The results are shown below -and, as you can see, a paper on “Library 2.0: balancing the risks and benefits to maximise the dividends” by myself, Paul Bevan, Richard Akerman, Jo Alcock and Josie Fraser is in second place!  You’ll have to scroll on beneath the image to discover the secrets of how to ensure that your research paper gets into the top ten for your institutional repository :-)

Top ten downloads from Opus repository in past year

Seeking An Explanation

On 11 August 2009 I wrote a blog post in which I described how my Paper on “Library 2.0: Balancing the Risks and Benefits to Maximise the Dividends” [had been] Published in Program.

Now looking at the blog statistics for visits to the post I discover that there have been a total of 735 views (with 162 on the day of publication).

Since the blog post linked directly to the details of the paper provided in the institutional repository I believe that many of the visits to the blog post resulted in downloads of the paper in the repository – and so it was a direct result of having a blog and writing a timely post about the paper which resulted in the paper being the second most downloaded paper last year.

Do I have any further evidence to back up this assertion? It would have been interesting to see if a tweet about the post had generated traffic to the article but, having looked at the archive of my tweets in BackUpMyTweets, it seems I didn’t use Twitter on the day the post was published. It also seems that a bit.ly URL for the post hadn’t been minted previously, so unfortunately there are no bit.ly statistics to examine.

However looking at the download statistics over the past year for my other items in the repository this particular item stands out for its popularity – and so I will assert that the timely blog post linking to the repository item generated over thirty times the normal annual traffic to one of my papers.

Looking at the search engine statistics for all of my items over the period I discover that 80% of the traffic is not delivered by a search engine (the red quadrant in the pie chart).

The display of referring traffic to my items confirms that search engines aren’t significant in providing traffic (20%) and that the repository search itself delivers only 10% of the traffic. Rather it is external Web sites (i.e. my blog, I believe) which deliver 39% of the traffic, with 31% of the traffic having no referrer information (I have found this is often traffic from Twitter clients but in this case it may be traffic coming from RSS readers used to view the post).
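For readers wondering how such a breakdown might be produced, the Python sketch below sorts referrer values from a download log into the four categories mentioned above and reports the share of each. It is not how the IR Stats package works internally (I don’t know the details of that); the list of search engine host names and the function names are assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed, deliberately incomplete list of search engine host fragments.
SEARCH_ENGINE_HOSTS = ("google.", "bing.", "yahoo.")
REPOSITORY_HOST = "opus.bath.ac.uk"


def classify_referrer(referrer):
    """Place one referrer value into one of four broad categories."""
    if not referrer:
        return "no referrer"
    host = urlparse(referrer).netloc.lower()
    if host == REPOSITORY_HOST:
        return "repository"
    if any(marker in host for marker in SEARCH_ENGINE_HOSTS):
        return "search engine"
    return "external site"


def referrer_shares(referrers):
    """Return the percentage of downloads per category, rounded."""
    counts = Counter(classify_referrer(r) for r in referrers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: round(100 * n / total)
            for category, n in counts.items()}
```

The figures reported for my items (20% search engines, 10% repository search, 39% external sites and 31% with no referrer) account for all of the traffic, which fits a classification along these lines.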

Discussion

Of course the large number of downloads is no indication of the quality of the paper.  And it might be that the paper was downloaded by an automated agent (perhaps someone was retrieving papers on Library 2.0 and the harvester repeatedly downloaded this paper).  Or, alternatively, maybe the statistics package is producing incorrect results.

But, unless I come across alternative evidence, I will regard the popularity of this item as an indication that blog posts can have a significant impact on the traffic to items in an institutional repository.  Note that I am not saying that blogs are the only significant factor – my UKOLN colleague Alex Ball and Andy Ramsden, head of the e-learning team (both of whom work on the same corridor as me) also figure in the top ten downloads. In their case I think embedding links to their Opus items in external Web sites helps to drive traffic.

However, especially for those working in areas in which there are significant numbers of blog readers, having a blog and using it effectively may provide the researcher with an advantage in raising awareness of their research.

Would you agree?

Posted in Blog, Repositories | 16 Comments »

Video of Dorothea Salo’s Seminar at UKOLN

Posted by Brian Kelly (UK Web Focus) on 22 April 2010

I recently mentioned that Dorothea Salo (better known in some circles as The Repository Rat – which is also her Twitter ID) was visiting UKOLN to give a seminar entitled “Grab a bucket – it’s raining data!“. Dorothea gave a fascinating talk on the importance of the management of scientific data, but tempered with a description of the complexities of this work and the challenges to be faced by whoever (librarians?) should take responsibility for such work.

Staff at UKOLN and visitors from elsewhere at the University of Bath and beyond very much enjoyed Dorothea’s talk and the subsequent discussions.  For those who weren’t there we have, with Dorothea’s kind permission, recorded a video of her talk which is available on the Vimeo service (in two parts: part 1 and part 2).

Posted in Repositories | Leave a Comment »

UKOLN Seminar: “Grab a Bucket – It’s Raining Data!”

Posted by Brian Kelly (UK Web Focus) on 15 April 2010

Across the international repository community Dorothea Salo established a reputation for her Caveat Lector blog which ran from 2002–2009.  On her current  The Book of Trogool blog Dorothea now describes herself as “an academic librarian exploring the practices, processes, and praxis of e-research“.

As mentioned in a recent post on her blog entitled “Hello from Scotland!” Dorothea, who works at the University of Wisconsin, is currently in the UK. At the start of the week Dorothea spoke at the UKSG conference in Edinburgh where she gave a plenary talk on “Who Owns Our Data?“.

On Monday morning (19 April 2010) Dorothea will be speaking at a UKOLN seminar which will be held at the University of Bath.  The title of the seminar is “Grab a bucket – it’s raining data!” and the abstract is given below:

From a distance, the coming-together of libraries and research data looks like a match made in heaven. Libraries need the attention and support of scientists, and libraries offer digital services and portals that should accommodate the preservation and dissemination needs of data.

When we look a little closer, however, we find a lot of impedance mismatches between what data need and what libraries have on offer. This talk will explore those mismatches and suggest ways forward.

The seminar will take place from 09.30-12.00 in the Library seminar room 3E 3.8 on the University of Bath campus.  If you would like to attend please sign up on the Eventbrite booking form.

Posted in Events, Repositories | 1 Comment »

Talk at Edspace Event, University of Southampton

Posted by Brian Kelly (UK Web Focus) on 3 November 2009

I have been invited by the JISC-funded Edspace project, based at the University of Southampton, to give a talk at an event on “Traditional educational repositories v. Web 2.0 resource sharing” to be held on Wednesday 4 November 2009. I have been asked to speak on “the future for educational resources and services on the Web” – a rather grandiose topic, I think! I’ve entitled the talk “The Future for Educational Resource Repositories and Services in a Web 2.0 World” as it’s the Web 2.0 aspect I feel is important (and reflects my area of expertise – I don’t claim to have anything particularly significant to say on the repository side of things).

I’ll be saying that many of the technical aspects of Web 2.0 are now mainstream – and indeed Edspace’s Edshare service provides RSS feeds, tag clouds, embed functionality and ‘cool URIs’.

But the term Web 2.0 also covers the network as the platform and a culture of openness. The issue of openness of educational resources is being addressed in, for example, the JISC OER programme and although I personally seek to ensure that my content (such as blog posts, slides and papers) is available under a Creative Commons licence I know that there are added complexities in the area of educational resources – so I’ll not focus on the openness issue.

Instead I’ll raise the question of the network as the platform in the context of the futures for educational resource repositories.  I’ll suggest that as experts predict further cuts in the public sector, including higher education, wouldn’t it be appropriate for our repository services to be hosted in the cloud?  And the concerns which tend to be raised (sustainability, reliability, legal issues, etc.) are implementation details which do need to be addressed – but these aren’t the important policy issues.

The slides I’ll be using are available on Slideshare (in the Cloud(!) although a master copy is also held locally) and are embedded below.

Posted in Events, Repositories | Leave a Comment »

Depositing My Paper Into the University of Bath Institutional Repository

Posted by Brian Kelly (UK Web Focus) on 21 July 2009

I recently mentioned that my paper on “From Web accessibility to Web adaptability” had been published in a special issue of the Disability and Rehabilitation: Assistive Technology journal. Shortly after receiving the notification that the paper had been published I deposited the author’s version of the paper in Opus, the University of Bath Institutional Repository. As I had attended a short training course on use of Opus (which uses the ePrints repository software) a few hours before uploading the paper to the repository I decided to time how long it took to complete the process.

I discovered it took me 16 minutes to do this. As someone responded to my tweet about this, this seemed too long.  I subsequently discovered that I had mistakenly chosen the New Item option – as a DOI for the paper was available I should have selected the Import Items option (not an intuitive name, I feel). In addition I also copied the list of 46 references and tried to apply some simple formatting (line breaks between items) to the list and also to the abstract. This was a mistake, as any line breaks appear to be ignored.

In order to understand what I should have done, I went through the deposit process a second time and this time recorded my actions, with an accompanying commentary as a screencast which is available on YouTube and embedded below.

The video lasts for 10 minutes and the deposit process took 7 minutes (although this includes the time taken in giving the commentary and showing what I did the first time).

It does occur to me that it might be useful to make greater use of screencasting not only as a training aid for institutional repository staff to demonstrate the correct processes for depositing items but also to allow authors themselves to show and describe the approaches they take. I’m sure that some of the mistakes I made are due to limitations of the user interface and I won’t be alone in making such mistakes. Indeed having shown this video to the University of Bath’s institutional repository manager she commented:

I’ve also noticed, from your video a few issues that should be fixed, so it was helpful to see.

Why aren’t we making more screencasts available of user interactions with the services we develop, I wonder? And why aren’t we sharing them?


Note: Just to clarify, this post was intended to encourage users to describe (openly) their experiences in using services such as repositories, and to share these experiences. The video clip is not intended as a training resource on how to deposit an item in a repository! [24 July 2009]

Posted in Repositories | 13 Comments »

The Launch of OPuS

Posted by Brian Kelly (UK Web Focus) on 4 February 2009

The University of Bath’s OPuS service, the online archive for University of Bath research publications, was launched yesterday (3rd February 2009) by Professor Jane Millar, the University’s Pro-Vice Chancellor (Research).

OPuS (which, incidentally, stands for ‘Online Publications Store’) currently holds over 12,000 references including journal articles, books and book sections, conference items, patents, reports and working papers, and research degree theses. Some of these items, including the theses are available in full-text. The aim of the service is to help strengthen the promotion and preservation of research outputs.

I recorded (with permission) Professor Jane Millar’s official launch of the service and this clip (which is also available on YouTube) is embedded below:

I should also add that the introduction to the launch was given by University Librarian, Howard Nicholson (YouTube video clip available) and Kara Jones, the university’s Research Publications Librarian, concluded the event by providing some facts and figures about the service and the role that she can play in supporting departmental use of the service (YouTube video clip available).

Many thanks to Kara Jones for organising this launch event and ensuring that a large number of the University’s research publications were uploaded to the service prior to the launch. Readers with particular interests in repositories may wish to add Kara’s My:self Archive blog to their RSS reader.

Posted in Repositories | Leave a Comment »

Institutional Repositories and the Costs Of Doing It Right

Posted by Brian Kelly (UK Web Focus) on 29 September 2008

There’s an interesting discussion taking place on the JISC-Repositories JISCMail list, following a post from Jenny Delasalle who asked:

Do any of you know how long it takes you to process a single item, before it is available as a live record in your repository? Please can you share that information with the list? 

Jenny provided details of her experiences:

Here at Warwick it takes at least 2 hours to process a single item. We are adding to our repository at a rate of about 15 items per week. I’m desperate to try to speed this up as we are receiving items faster than we can process them.
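The figures Jenny gives can be turned into a simple estimate of staff effort. The short Python sketch below works through the arithmetic; the function names are hypothetical, and the arrival rate used in the usage note is purely an assumed figure since the post to the list does not give one.

```python
def weekly_processing_hours(items_per_week, hours_per_item):
    """Staff hours needed each week to process the given number of items."""
    return items_per_week * hours_per_item


def weekly_backlog_growth(arrivals_per_week, processed_per_week):
    """Number of items added to the backlog each week (never negative)."""
    return max(0, arrivals_per_week - processed_per_week)
```

With the Warwick figures, `weekly_processing_hours(15, 2)` gives 30 hours a week, and since 2 hours is described as a minimum this is a lower bound. If, say, 20 items were arriving each week (an assumed figure), `weekly_backlog_growth(20, 15)` shows the backlog growing by 5 items a week.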

My colleague Pete Cliff somewhat tentatively suggested “why not put the items in the repository with minimal metadata”.

Pete and others seemed to feel that such compromises may be needed “in the current climate where quantity seems to have more impact than quality“. But this is where I would disagree.  This argument seems to be simply a cry for more resources in an area of interest to those making such a plea. But people will always be asking for more resources for their areas of interest – and, as there will always be limited resources, others will argue that their areas are more worthy of being allocated more resources.  And it strikes me as being somewhat disingenuous to have developed an approach which is known to be resource-intensive and then to make a plea for additional resources in order for the particular approach to be effective. A more honest approach would have been to develop a solution which was better suited for the available resources.

This was an argument I made last week in my talk on “Web Accessibility 3.0: Learning From The Past, Planning For The Future” (and note a 30 minute video of the talk is available). I pointed out that evidence suggests that Web accessibility policies based on conformance with WCAG AA have clearly failed, except in a small number of cases. And rather than calling for additional resources to be allocated to changing this we need to acknowledge that this won’t happen, and to explore alternative approaches.

And it is interesting to note the apparent lack of interest on the JISC-Repositories list in discussing the accessibility of resources in the repositories rather than the metadata requirements for aiding resource discovery. Indeed when this topic was discussed a couple of years ago Les Carr, with an openness which I appreciated, argued that:

If accessibility is currently out of reach for journal articles, then it is another potential hindrance for OA. I think that if you go for OA first (get the literature online, change researchers’ working practices and expectations so that maximum dissemination is the normal state of affairs) THEN people will find they have a good reason to start to adapt their information dissemination behaviours towards better accessibility.

Here Les is arguing that the costs of providing accessibility resources in Institutional Repositories is too great, and can act as a barrier to maximising open access to institutional research activities. I would very much agree with Les that we need to argue priorities – as opposed to simply asking that someone (our institutions, the government – it’s never clear who) should give us more money to do the many good things we would like to do in our institutions.  

In the case of Institutional Repositories we then have competing pressures for resources for metadata creation and management and for enhancing the accessibility of the resources. In this context it should be noted that the WCAG 2.0 guidelines have reached the status of Candidate Recommendation, and that the WAI Web site states quite clearly: “We encourage you to start using WCAG 2.0 now”. And note that, unlike the WCAG 1.0 guidelines, WCAG 2.0 is format neutral. So you can provide resources on your Web site in a variety of formats, but such resources need to conform with the guidelines if it is your institutional policy to do so.

So shouldn’t institutions which have made a public commitment to comply with WCAG guidelines ensure that this applies to content in their institutional repositories, even if this will require a redeployment of effort from other activities, such as metadata creation?

Or, alternatively, you may feel that complying with a set of rules, such as WCAG, without doing the cost-benefit analysis or exploring other approaches to achieving the intended goals is misguided. In which case perhaps Pete’s suggestion that you might wish to consider “put[ting] the items in the repository with minimal metadata” might actually be a sensible approach rather than an unfortunate compromise? And in response to Philip Hunter’s comment that “achieving interoperability through dumbing-down the metadata has a strange attractiveness in a world not overly crazy for quality” perhaps we should be arguing that “achieving interoperability and accessibility through labour-intensive manual efforts is a perverse solution in a public sector environment in which we should be demonstrating that we can provide cost effective solutions”?

Posted in Accessibility, Repositories | 3 Comments »

GCSEs Revisited

Posted by Brian Kelly (UK Web Focus) on 21 February 2008

It is always pleasing when a blog post achieves its aim, and even more so when this happens so quickly. So it was good to read AJ Cann’s post in which he describes how he spent 3 minutes using the Google Custom Search Engine (GCSE) to provide an alternative to his institutional search engine. As he titled his post, “It was all Brian Kelly’s fault”!

Revisiting my original post it would seem that there are a number of ways in which GCSE is being used:

In this latter case, AJ is clearly unhappy with the local search engine service (ht://Dig): “I can’t stand the inadequate institutional search tools I’ve been forced to use for a decade” – and decided it was worth spending “less than 30 seconds” to set up an alternative! And this approach reflects AJ’s interests in Personal Learning Environments (PLEs). He now has a Personal Search Engine.

Now if setting up GCSE across a range of Web sites is so easy, and can be done by individuals without the need for institutional commitment, in what other ways could the software be used?

As we’ve recently discussed institutional repositories and various people have aired their concerns on the approaches being taken, it seems to me that the GCSE could have a role to play in providing an alternative way of searching repositories.

And this approach has already been taken on the OpenDOAR Search Repository Contents service and the Search ROAR Content With Google service.
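The principle behind such a repository-wide search can be sketched in a few lines. The sketch below does not use GCSE itself (which is configured through Google’s own Web interface); it simply assumes that the same effect can be approximated by restricting an ordinary Google query with `site:` operators. The host names and the function names are invented for illustration.

```python
from urllib.parse import urlencode

# Hypothetical repository hosts, used purely for illustration.
REPOSITORY_HOSTS = ["eprints.example.ac.uk", "repository.example.org"]


def restricted_query(terms, hosts):
    """Combine the search terms with an OR-ed list of site: restrictions."""
    sites = " OR ".join(f"site:{host}" for host in hosts)
    return f"{terms} ({sites})"


def search_url(terms, hosts):
    """Build a Google search URL limited to the given repository hosts."""
    return "https://www.google.com/search?" + urlencode(
        {"q": restricted_query(terms, hosts)}
    )


if __name__ == "__main__":
    print(search_url("web accessibility", REPOSITORY_HOSTS))
```

A custom search engine manages the list of sites for you and provides a search box which can be embedded in a Web page, but the underlying idea is the same: the content stays in the repositories and only the search scope is defined elsewhere.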

This approach fits in nicely with Rachel Heery’s comment that “I don’t really see that there is conflict between encouraging more content going into institutional repositories and ambitions to provide more Web 2.0 type services on top of aggregated IR content. Surely these things go together?“. We have the managed content in the repository and are providing users with a choice in the selection of a search interface.

It’s good to see that happening. But can’t we do even more? We could, for example, use the two ways of searching to gather evidence of the preferences users may have for searching. And perhaps rather than exposing new users of repositories to the rich functionality of the repository’s search interface, shouldn’t we acknowledge that many users will prefer the simplicity of a Google search, and provide the GCSE interface as a better-focussed alternative to the global Google search tool, with the option of pointing the users in the direction of the richer service if they find that this search interface is not good enough?

This approach would have the added advantage of not requiring the expense associated with in-house software development. Indeed could it not be argued that public-sector organisations have a responsibility to make use of relevant freely-available services, at least for prototyping or for providing a service for making comparisons, even if it isn’t envisaged that the service will be used in a final production role?

Of course the danger may be that the users decide that they are happy with Google. And we wouldn’t want that to happen, would we?

Posted in Repositories | 5 Comments »

Distributed Discussions On Repositories

Posted by Brian Kelly (UK Web Focus) on 19 February 2008

The Repositories Debate

Andy Powell recently wrote a post on the eFoundations blog about his opening plenary talk at the VALA 2008 conference.

His post generated interesting discussions and debate amongst those involved in repository activities in the UK and the wider community. Paul Miller was in agreement with Andy’s comments in his post on the Panlibus blog entitled “Andy Powell is Spot On” with Paul feeling that “Our current approach, fundamentally, is totally, completely, utterly wrong, isn’t it?”.

Over on his blog my colleague Paul Walk has given his thoughts on Andy’s post expressing agreement in several areas but disagreeing with Andy’s view that “we need to focus on building and/or using global scholarly social networks based on global repository services“. Paul (W) responds by asking “Why can’t we “focus on building and/or using global scholarly social networks” (which I support) based on institutional repository services? We don’t have a problem with institutional web sites do we? Or institutional library OPACs?”. My former colleague Rachel Heery has responded in a similar vein to Paul in a response to Andy’s post: “I don’t really see that there is conflict between encouraging more content going into institutional repositories and ambitions to provide more Web 2.0 type services on top of aggregated IR content. Surely these things go together?“.

Meanwhile over on his Overdue Ideas blog Owen Stephens gives his thoughts from the perspective of a practitioner involved in setting up the Spir@l institutional repository at Imperial College with a wittily-titled post “R.I.Positories“. Owen concludes “we need is a system that helps us administer the workflow around the delivery of digital objects in a corporate environment, but that is invisible to those not involved in the administration – and that’s what I want out of a ‘repository’ – so, for me, the Repository is dead, long live the repository“.

And a few minutes ago I noticed a pop-up alert informing me of a blog post entitled “RESTful Repositories?“. An intriguing title, I thought, so I viewed the post and came across Stu Weibel’s contribution which suggested that “One way to think about repositories is as the bookshelves of the digital library“. Stu went on to point out that “We don’t ask scholars, having just published an article or book, to ‘go to the library to find the most appropriate place for it… and don’t come back until you do!’”   This sounds reasonable to me – there’s a need for the physical library and the infrastructure that is associated with it, but the researchers don’t need to know how it works. This might be an approach to be taken with institutional repositories – so let’s not scare them off with the ins and outs of the metadata schemas.

Engaging With A Distributed Debate

There’s clearly an interesting debate taking place around the approaches which should be taken to maximising access to the UK’s research papers. But if you have an interest in institutional repositories how do you find out where the debate is taking place and how do you participate?

I have had discussions with colleagues who feel that such debates should be centralised and should use a ubiquitous communications channel – namely email. From this perspective the debate about institutional repositories within the UK higher education community should take place on the JISC-Repositories JISCMail list. However I feel that this will result in the debate being marginalised to those with a particularly strong interest in repositories, will tend to focus on the nitty-gritty details which email tends to encourage and, in the case of JISCMail, the debate will be trapped within the JISCMail Web site, not only because the JISCMail archives are not exposed to search engines such as Google, but also because of the ‘uncool’ URIs for messages in the archive.

And, of course, email discussions fragment, in any case, and I suspect the Australian participants at the VALA 2008 conference will be having their own discussions about repositories on their own mailing lists.

An alternative view is that the debate will take place via scholarly articles published in peer-reviewed journals. This may be the case in many areas of research, but many in the digital library community would be frustrated by the lengthy timescales that process would entail.

Like it or not, the debate is taking place using a variety of communications tools, including the blogosphere.

So, if you wish to engage with such discussions, how do you find out what is happening? In my case my RSS reader (Feedreader) will automatically inform me of new posts for the blogs I’ve subscribed to. This includes the eFoundations blog, although in the case of Andy’s post I was alerted to its publication a couple of hours after it had been published via a tweet on Twitter.

The distributed nature of such debates has benefits, such as allowing the discussions to be brought to the attention of different communities. When doing this, there is an expectation that bloggers will link to the original post. And if blogs allow trackbacks, it will be possible to follow links from an original post to blogs which have commented on it.

Returning to Andy’s original post, Paul Walk noticed that the eFoundations blog hadn’t included a trackback to Paul’s post. This is probably a technical glitch – but this incident made me think about the importance of trackbacks in the integration of distributed discussions. Owen Stephens’ R.I.Positories post included a link to a post on The importance of being open on the eFoundations blog dating back to October 2006. But comments on such old posts are disabled – I assume to minimise the effort in deleting spam comments. But this is breaking the linkages to related discussions. How, then, should we balance the benefits of allowing such trackbacks against the maintenance costs of managing misuse? Or do you disagree with blogs being used for this type of discussion and debate?

Posted in Blog, Repositories | 7 Comments »

CRIG Teleconference Chats On ‘Repositories And Other Services’

Posted by Brian Kelly (UK Web Focus) on 6 December 2007

I recently took part in one of a series of teleconference chats organised by the JISC-funded CRIG (Common Repository Interfaces Working Group) project.

The project organised a day of teleconferences on 8th November 2007. The aim of the day was to facilitate a “discussion between members on how repositories might be improved (bluesky thinking)”. A recording of the discussions is available from the DigRep wiki. In addition, the project team created a series of mindmaps which helped to visualise the topics discussed in each of the seven areas covered during the day.

I took part in the final discussion of the day which looked at other services which may interface with repositories, with a particular focus on the role of externally-hosted Web 2.0 services. The mindmap for this session is shown below.

Mindmap of discussions

The discussions revolved around in-house development versus the use of Web 2.0 services, which is a recurring topic of discussion. I did, however, find that the visualisation of the discussions provided me with the opportunity to revisit these issues from a different perspective. I’ll have to have another look at mindmapping tools, I think. And reading Mike Ellis’s post on Good web apps: Back of postage stamp… it would seem that MindMeister should be the first tool for me to look at.

Posted in Repositories

Scribd – Doing For Documents What Slideshare Does For Presentations

Posted by Brian Kelly (UK Web Focus) on 29 March 2007

As I’ve recently described, a couple of months ago I uploaded PDFs of a few of my papers to Slideshare, and wondered whether there was a business opportunity for Slideshare in extending its remit from providing a repository of slideshows to include documents in general.

Well, last week I came across Scribd – a Web 2.0 service which provides this functionality, describing itself as “YouTube for documents”. I registered for the service (although, strangely, you don’t need to be registered to upload documents) and uploaded several of my papers. And I have to admit that I’m very impressed with the service. I could upload my papers in several formats (including MS Word, PDF, MS PowerPoint and MS Excel) and, when I uploaded an MS Word document, alternative formats were created, including PDF, HTML, plain text and even an MP3 file which provided a computer-generated sound file for the paper! As well as the accessibility benefits which this may provide, being able to download various formats means that the service cannot be accused of ‘fake sharing’ – a term coined on the lessig blog and discussed on the O’Reilly Radar and eFoundations blogs.

Scribd Interface

The interface seemed very usable; as well as allowing the paper to be viewed in a variety of formats, Scribd, as seems to be the norm for these types of services, allows resources to be bookmarked (‘favourited’ seems to be the word used to describe this), provides usage statistics and, as with Slideshare, allows the resource to be embedded in Web pages.

Has Scribd raised the bar in users’ expectations for digital repositories? In some respects, I feel it has. However there are concerns which need to be recognised:

  • Poor quality resources which are hosted: there is no guarantee of the quality of the resources which are hosted on Scribd. And there are copyrighted publications (including those from O’Reilly) which have already been uploaded.
  • Sustainability of the service: As with all of these types of services, there is the question as to whether such services are sustainable. Techcrunch reported on 6 March 2007 that the service “is coming out of private beta this morning with a fresh Angel investment of $300K on top of their original Y Combinator nest egg of $12,000.” This may keep the service running for a short time, but will it be around in the medium to long term? And what will happen if copyright holders, such as O’Reilly, take the service to court for misuse of their copyrighted resources (as Viacom have recently done to YouTube)?
  • Lack of an interoperable resource discovery architecture: The approach taken by Scribd is not interoperable with the approach being taken by the JISC development community, which is looking to support the development of distributed interoperable digital repository services which make use of OAI-PMH.
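For readers unfamiliar with the OAI-PMH approach mentioned in the last point, the following is a minimal sketch of what harvesting involves: a harvester sends a simple HTTP request containing a `verb` (here `ListRecords`) and a `metadataPrefix` (here `oai_dc`, i.e. unqualified Dublin Core), and receives XML in return. The repository address and the sample response below are invented for illustration, and the sketch parses a canned response rather than contacting a real service.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of Dublin Core elements as used in oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"

# A made-up, heavily trimmed response, for illustration only.
SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><record><metadata>"
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>An example paper</dc:title>"
    "</oai_dc:dc>"
    "</metadata></record></ListRecords>"
    "</OAI-PMH>"
)


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request for a repository's OAI-PMH base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def titles(xml_text):
    """Extract every Dublin Core title from an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(DC + "title")]


if __name__ == "__main__":
    # The base URL is hypothetical.
    print(list_records_url("https://repository.example.org/oai"))
    print(titles(SAMPLE_RESPONSE))
```

A service which does not expose such an interface cannot be harvested by this kind of tool, which is the interoperability gap described above.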

So perhaps Scribd might be felt to have no relevance to those involved in digital repository development work. I, however, feel that it would be a mistake to dismiss Scribd. We can’t guarantee that the service will have a role to play in the long term, but the approaches it has taken are worth exploring. Indeed (as I commented some time ago in a posting about the accessibility of PDF resources in digital repositories) I feel that we should be exploring ways of improving the accessibility of repository services, and it is interesting that this commercial service, rather than one developed with the academic community, is taking a leading role in providing MP3 versions of papers in the repository.

And rather than just trying out Scribd to see what features might be worth implementing in our own repository services, is there an argument for making a deal with Scribd to host our scholarly resources in a managed fashion?

Posted in Repositories, Web2.0 | 2 Comments »

Slideshare Repository and PDFs

Posted by Brian Kelly (UK Web Focus) on 28 March 2007

I recently discovered that the Slideshare service (a repository service for slides in PowerPoint or Open Office formats) also allows PDF files to be uploaded. This makes sense as PDFs can be used as a presentation format for slide shows. I then wondered whether Slideshare could be used as a repository for papers in PDF format. So I uploaded a PDF version of a paper on Contextual Web Accessibility – Maximizing the Benefit of Accessibility Guidelines (a paper presented at the W4A workshop in Edinburgh in May 2006). As can be seen, the PDF file has been successfully uploaded to the service (with over 200 views since the document was uploaded).

Slideshare service with an uploaded PDF file

Why am I doing this? If you access the resource you will discover that the text is too small to read unless you zoom in, and if you do this, you will have only a small screen area to read the paper. The file may be inaccessible (a Flash interface to a PDF file), an issue discussed recently, and the PDF file is not easily printed, downloaded or reused (as Andy Powell commented a while ago, Slideshare is an example of ‘fake sharing’).

However such reservations are based on Slideshare in its current form. If the company felt there was a business case for hosting papers in PDF format, it would surely not be too difficult to provide a more appropriate user interface, and perhaps also to provide access to printing and downloading services.

And even if Slideshare felt this was an inappropriate use of their service (and they could, of course, ban papers in PDF format from being hosted by the service) there are still a number of interesting issues which evaluating the service in this way can help address:

  • ease of uploading
  • rapid prototyping
  • architecture (URIs, APIs, …)
  • additional functionality
  • the pros and cons of allowing only quality publications to be uploaded

But since I first drafted this post, there have been further developments in this area – which I’ll address shortly.

Posted in Repositories, Web2.0 | 8 Comments »

Slideshare – It’s Working For Me

Posted by Brian Kelly (UK Web Focus) on 14 February 2007

One of the first posts to this blog, back in November 2006, described my initial experiments with the Slideshare repository for presentations.

Slideshare Repository

I described how I had uploaded several of my presentations, suggesting that this would provide greater exposure to the slides (and hence the ideas) than if they were only available on UKOLN’s Web site.

A few days ago I received an email alert which informed me that a number of the presentations had been added as a Favourite by a Slideshare user.

From his profile I discover that srains has a blog, Rolling Rains, which explores ‘the adoption of Universal Design (Design-for-All; Human-Centered Design) by the tourism industry’.

From the other slideshows he has added to his list of favourites, I have found presentations which are of interest to me (including one on Two Trainers Trade Twenty Technology Training Tips and one on standards used on Oxfam Australia’s Web site).

Revisiting my uploaded slides I discover that the most popular of my presentations is Web 2.0: What Is It, How Can I Use It, How Can I Deploy It? with 666 views in two months, with 6 users including it in their list of favourite slideshows (jensjeppe, cezinha.com, noticiasmias2002, gerarddummer, erywin and MCL).

I can then follow their list of other favourites and the slides which they may have uploaded. And guess what: people who are interested in my slides on Web 2.0 are also interested in other slides on the same subject. So this ‘social network’ provides a form of resource discovery for me :-)

Three months after my initial posting about Slideshare, what can I conclude?

  • It allows my slides (and therefore my ideas) to be accessed by people who would probably not find the resources otherwise.
  • It provides some form of measuring the impact/quality of the slides by observing the numbers of users who have added it to their list of favourites.
  • It helps me (and others) to find related resources.

Is there a downside? I need to remember that:

  • I don’t know how sustainable the service is – it could, for example, go out of business or change its licensing conditions (perhaps charging for access to the slides)
  • It is an example of ‘fake sharing’ – I can view the resources but not (easily) reuse the materials. In my case, however, I provide access to the original source files by including the URL of the master copy on the title slide and in the metadata.

I feel that these experiences provide some useful indications of features which could be adopted by the digital library development community: the importance of ease of use and a lightweight approach to IPR issues for content providers; the advantages of getting content out ‘where the users are’; and the benefits of social networks for resource discovery.

Posted in Repositories, Web2.0 | 14 Comments »

Accessibility and Institutional Repositories

Posted by Brian Kelly (UK Web Focus) on 12 December 2006

There has been some discussion on the JISC-Repositories JISCMail list (under the confusing subject line of “PLoS business models, global village”) on the issue of file formats for depositing scholarly papers. Some people (including myself) feel that open formats such as XHTML should be the preferred format; others feel that the effort required in creating XHTML can be a barrier to populating digital repositories, and that use of PDF can provide a simple low-effort solution, especially if authors are expected to take responsibility for uploading their papers to an institutional repository.

An issue I raised was the accessibility of resources in digital repositories. There are well established guidelines developed by WAI which can help to ensure that HTML content can be accessible to people with disabilities. Others and I have argued that the guidelines and the WAI model are flawed, but many of the guidelines are helpful and institutions should seek to implement them (indeed there are legal requirements to ensure that services do not discriminate against people with disabilities).

WCAG 1 has the following requirements:
3.2 Create documents that validate to published formal grammars. [Priority 2]
11.1 Use W3C technologies when they are available and appropriate for a task and use the latest versions when supported. [Priority 2]
11.4 If, after best efforts, you cannot create an accessible page, provide a link to an alternative page that uses W3C technologies, is accessible, has equivalent information (or functionality), and is updated as often as the inaccessible (original) page. [Priority 1].

This seems to be pretty unfriendly towards PDFs, I would argue. WCAG 2.0 (which is in draft form) is, however, neutral regarding file formats – a development I welcome (although the guidelines still have their limitations). However the guidelines still require that content is accessible; and as well as the requirement in the guidelines, there are also legal and ethical requirements to address such issues.

Proprietary formats such as PDF can be made accessible. However I am uncertain as to how providing alternative text for images and structure for PDF documents will happen in a distributed workflow environment.

Rather than dwelling on this (technical) issue, I would like to focus on the policy issues, which should be independent of particular file formats. UK legislation requires organisations to take reasonable measures to ensure that people with disabilities are not discriminated against unfairly. One could argue that it would be unreasonable to expect hundreds if not thousands of legacy resources to have accessibility metadata and document structures applied to them, if this could be demonstrated to be an expensive exercise of only very limited potential benefit. However if we seek to explore what may be regarded as ‘unreasonable’ we then need to define ‘reasonable’ actions which institutions providing institutional repositories would be expected to take.

One approach would be for the institution to ensure that it provides appropriate training and staff development for authors who are expected to upload documents to repositories. Linked to this may be tools which can flag problem areas to the authors, as documents are being prepared for uploading. There may then be auditing tools which can alert institutions to potential problems.
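As an indication of how lightweight such a flagging tool could be, here is a minimal sketch which checks an HTML document for images lacking an `alt` attribute. It is an illustration only, not a description of any existing repository tool: it covers just one of the many WCAG checks, it assumes the deposit is in HTML rather than PDF, and the class and function names are invented.

```python
from html.parser import HTMLParser


class MissingAltChecker(HTMLParser):
    """Collects the src of every <img> element that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # An empty alt="" is legitimate for decorative images,
        # so only a missing attribute is flagged.
        if "alt" not in attributes:
            self.problems.append(attributes.get("src", "(no src)"))


def images_missing_alt(html_text):
    """Return the src values of images which have no alt attribute."""
    checker = MissingAltChecker()
    checker.feed(html_text)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    sample = '<p><img src="chart.png"><img src="logo.png" alt="UKOLN logo"></p>'
    for src in images_missing_alt(sample):
        print(f"Image without alternative text: {src}")
```

A check of this kind could run when a document is uploaded, so that the author sees the warning at the point at which it is cheapest to fix.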

Related to policies to support the authors are policies which address specific problems which users with disabilities may have. Clearly many scientific papers (containing formulae, for example) may be difficult for traditional assistive technologies to process. Perhaps this is where there is a need for just-in-time accessibility (as opposed to the traditional just-in-case approach) or blended accessibility (real world alternatives to digital accessibility barriers).

Posted in Accessibility, Repositories | 9 Comments »