UK Web Focus (Brian Kelly)

Innovation and best practices for the Web

Archive for the ‘preservation’ Category

“Pondering the Online Legacy of my Work”

Posted by Brian Kelly on 16 Jul 2015

Neglected Areas for Web Managers?

Yesterday I came across two posts in my Facebook stream which addressed areas which appear to be neglected by those with responsibilities for providing institutional web services. In the first of two posts I comment on responsibilities for maintaining the online legacy of staff after they have left their host institution.

“Pondering the online legacy of my work”

Yesterday Virginia Knight shared a link on Facebook to a blog post with the words “Pondering the online legacy of my work at Bristol, or: why is there not much of it visible now?”. In the blog post, entitled “Where did my work go?”, Virginia described how she has been “working out how much of what I did in my sixteen years at ILRT at Bristol University has survived in a recognisable form”. Virginia pointed out that “Obviously there are publications, such as an article in Ariadne [such as ‘The SPP Alerting Portlet: Delivering Personalised Updates’ – Editor] and more recently a prizewinning essay” but concluded “my online legacy is harder to trace”.

This is an area of particular interest to me. Almost two years ago I finished work at UKOLN. During my final week at UKOLN I published a series of blog posts on “Reflections on 16 years at UKOLN”. The five blog posts covered my early involvement with the Web (which dated back to December 1992), my outreach activities, my research work, my work for UKOLN’s core funders and my interests in evidence-based policies and openness.

Digital Preservation – Whose Responsibility?

During my final few months at UKOLN I had responsibilities for managing the preservation of UKOLN’s web resources. In brief this covered updating web sites so that the home page for self-contained activities described the background to the work and made it clear that the web site was no longer being maintained (e.g. see the Cultural Heritage Web site and the web site for the JISC-funded QA Focus project). After updating the content the web sites were archived by the UK Web Archive; this included the main UKOLN Web site, sub-sites (such as the QA Focus project) and sites with their own domain (such as the Cultivate Interactive ejournal).

In addition to the management of traditional web assets, typically hosted on an institutional web site, I also emphasized the importance of being able to continue to manage and maintain one’s professional profile, running a workshop session at the IWMW 2013 event on “Managing Your Professional Online Reputation”. During this period I became aware of the possible tensions between the provision of institutional web sites and the use of third-party services from the perspective of a professional who wishes to continue professional activities after leaving the host institution. As Virginia has pointed out, one’s online legacy can easily vanish.

But whose responsibility is it to ensure that an institution does not lose its scholarly digital resources and individuals do not lose their online legacy? In a poster presented at the LILAC 2014 conference on “Preparing our users for digital life beyond the institution” I summarized a survey carried out by myself and Jenny Evans in which we found that librarians do not feel they are responsible for supporting academics who wish to continue making use of their digital assets after they have left the institution.

I therefore wondered whether web managers felt they had responsibilities for the preservation of web resources, not just as institutional assets but also as assets of value to members of staff after they leave the institution. A workshop session on “‘Page Not Found’: Practical Web Preservation Advice” was intended to explore some of these issues, with the abstract for the session suggesting that “in web site development projects … a full impact analysis encompassing all stakeholders is essential”. Unfortunately the session has been cancelled due to lack of numbers.

In the poster presented at the LILAC 2014 conference I asked, in light of the survey findings “Are librarians enablers of life-long access to digital technologies or custodians of institutional services?” In light of the apparent lack of interest in web preservation at the IWMW 2015 event there seems to be a gap: who should be responsible for managing long-term access to web resources? Perhaps the answer will be self-motivated individuals, just as it was for long-lost copies of episodes of Doctor Who?



Posted in preservation | Tagged: | 1 Comment »

Rediscovering Missing Conference Web Sites

Posted by Brian Kelly on 5 Aug 2013

Revisiting Lanyrd

I’m a big fan of the Lanyrd service. As described in Wikipedia, Lanyrd is “a conference directory website created by Simon Willison and Natalie Downe and launched in 2010”. In November 2010, shortly after Lanyrd’s launch, I described Developments to the Lanyrd Service and gave some Further Thoughts on Lanyrd. In May 2012 I asked Why Would You Not Use #Lanyrd For Your Event?, and then in August 2012 I described how Lanyrd Gets Even Better – But Can It Provide The Main Event Web Site?

Last week a post on the Lanyrd blog entitled Find speakers for your events with Lanyrd’s new speaker directory described further developments to the service:

At Lanyrd, we’re building the definitive database of professional events, conferences, talks and speakers. We want to help organisers run better events, speakers get more exposure and attendees find the events that are right for them.

Our brand new speaker directory provides a powerful new way to explore the 70,000+ speaker profiles already on Lanyrd, and helps organisers connect with new talent to help make their events even better.

Since I am an experienced speaker I have a professional interest in making use of Lanyrd’s speaker directory in order to provide an online record of my previous speaking activities which may be useful in finding new opportunities in my post-UKOLN career.

Lanyrd Entries For Past Events

In order to ensure that my Lanyrd speaker profile contained a suitable record of my main speaking appearances I wanted to ensure that details of significant international conferences were included.

Back in October 2008 I presented a paper on “Library 2.0: balancing the risks and benefits to maximise the dividends” at the Bridging Worlds 2008 conference which was organised by the National Library of Singapore. This was a particularly memorable conference for me, not only due to the location but also because I had a couple of weeks holiday afterwards, visiting Malaysia and Thailand. In addition the paper, which was subsequently published in a special edition of the Program journal which featured papers from the conference, is also the most downloaded paper by UKOLN staff hosted in the University of Bath repository. I was therefore keen on ensuring that this event was included in my Lanyrd speaker profile.

Since there wasn’t a Lanyrd entry for the Bridging Worlds 2008 conference I had to create one. As I was a speaker but not an organiser of the event, there is a question as to who should take responsibility for the creation of an entry. However this is addressed in the Lanyrd FAQ:

I’ve noticed anyone can edit an event and add and remove speakers — is that really a good idea?
Lanyrd works a bit like Wikipedia — we keep track of all changes made to an event (we don’t yet expose that information in the UI) and any vandalism can be quickly reverted.

I therefore decided to create a Lanyrd entry for the Bridging Worlds 2008 conference. However although I had details of my session on the UKOLN Web site I found that the conference Web site, which was at http://www.bridgingworlds.sg/, no longer existed. It was therefore not clear how I would recreate details of all of the talks given at the conference. Such information was needed if the Lanyrd entry for the conference was to have a role to play in providing information on the talks, the speakers and links to information about the conference.

Digital Archeology Using the Internet Archive and Slideshare

My first port of call in looking for the conference programme was the Internet Archive. Fortunately there had been nine captures of the Bridging Worlds 2008 conference homepage, two captures of the programme for the first day and three for the second day. As illustrated there was sufficient information to find the title, times and speaker information for the talks. This information was used to recreate the conference timetable on Lanyrd.
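For readers who want to repeat this kind of lookup programmatically, the following is a minimal sketch which uses the Internet Archive's public Wayback availability endpoint to find the capture closest to a given date. The function names are my own for illustration, and the network call is kept separate from the parsing so that the parsing can be checked offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url, timestamp=None):
    """Build the query URL for the capture closest to `timestamp` (YYYYMMDD)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return (timestamp, archived_url) from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def lookup(page_url, timestamp=None):
    """Query the live service; requires network access."""
    with urlopen(availability_query(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

Calling lookup("http://www.bridgingworlds.sg/", "20081001") would ask for the capture nearest to October 2008; the result depends on what the archive holds at the time of the request.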

In addition to the Internet Archive I also discovered that there was a Bridgingworlds2008 Slideshare account which contained the slides used for 18 of the talks together with copies of the papers in three cases. Since Slideshare resources can be embedded within Lanyrd I was therefore able to provide access to the slides used for many of the talks.

However the Internet Archive’s copy of the conference Web site only included a couple of the abstracts so I was not able to reproduce this information for all of the talks.

Since several of the speakers were known to me or could easily be found I was able to find their Twitter ID and use this as an identifier in the Bridging Worlds 2008 speaker directory, as illustrated. It should be noted that in a couple of cases the information for speakers whose Twitter ID I do not know is replicated.

Discussion

Although this work began in order to provide an entry in my Lanyrd speaker profile, the demise of the conference Web site led to an interesting exercise in ‘excavating’ Web resources in order to reconstruct the past and republish the information which was discovered, providing a resource which may be of use to others.

It does seem that conference Web sites are regarded as disposable and can be deleted after the conference is over. This is the case for CILIP’s recent Umbrella 2013 conference, held at the University of Manchester on 2-3 July 2013.

If you visit the CILIP Web site you will find that most of the information about the conference, including the dates and location, has vanished. All that remains are links to the presentations (in PDF format). As shown the links provide speaker information but nothing about the timings, the strand they were in, the room locations, etc. More importantly this information is not interoperable with the Social Web: there is no way of providing associations with the talks and commentary about the talks (such as tweets and blog posts) or for the speakers (e.g. their talks at other events; their connections with other speakers and participants at the event).

It does seem that Internet archeology will be needed already for this recent conference. There is a Lanyrd entry for the Umbrella 2013 conference. However this currently has very little information, beyond the conference dates and location. Perhaps a motivated individual or individuals from the CILIP community might be willing to recreate the conference timetable (which was previously published in a large PDF file) within the Lanyrd environment, enabling additional information, such as the slides, reports on the talks, links to Twitter archives, etc, to be included as part of the conference record.

But shouldn’t conference organisers take a more pro-active approach in ensuring that (a) conference information is replicated beyond the institutional environment to minimise the potential that such information is lost due to in-house decisions and (b) conference information can be integrated with other information sources hosted outside the institution? This has been the approach taken for the IWMW series of events. Wouldn’t it be sensible for other organisations, such as CILIP, Jisc and UCISA, to provide information for many years of high-profile events in this fashion? Or is there still a reluctance to make use of third-party services?

 

Posted in preservation | Tagged: | Leave a Comment »

Preservation of UKOLN’s Web Resources and Papers

Posted by Brian Kelly on 29 Jul 2013

My main work activity this year has been managing the preservation of UKOLN’s Web resources, prior to the cessation of Jisc’s core funding on 31 July and the departure of most of UKOLN staff.

This work involved:

  • Identifying UKOLN’s Web assets and their owners.
  • Preparing the content so that it was suitable for preservation.
  • Submitting details of the Web resources to the UK Web Archive team.
  • Liaison with the UK Web Archive team to ensure that the resources had been successfully archived.

The preparation work, which was quite time-consuming to complete, involved switching off functions and technologies which were not suitable for archival (e.g. backend CGI scripts and features which were dependent on specific CMS technologies).

In addition the content of the main entry points for Web sites (and micro sites)  was updated in order to ensure that the page provided information on the purpose of the site; the funders; the UKOLN staff involved; the start and end dates for the work and, where possible, links to the key outputs of the work.

The UK Web Archive team have confirmed that they have successfully harvested the resources  we submitted some time ago. In addition Web sites which were still being updated, such as the UKOLN Web site itself and the IWMW sites, were submitted for archiving more recently.

In addition to this work significant papers and reports have been deposited in Opus, the University of Bath institutional repository. During the initial preparatory work we found that entry points for individuals were not available after they had left the institution, although their items would continue to be hosted in the repository. Since the individual’s name could be an important way of finding such content the repository team agreed that people’s entry point would continue to be available after they left the institution (although this would not be applied retrospectively to UKOLN staff who have left the institution prior to the change in the policy).

In order to make it easier to find items written by UKOLN staff the following table provides links to their list of items and accompanying usage statistics.

                                Usage statistics
Name              No. of items  Top 20 items  Total no. of downloads  URL
Alex Ball         69            [View]        [View]                  http://opus.bath.ac.uk/view/person_id/1760.html
Talat Chaudhri    2             [View]        [View]                  http://opus.bath.ac.uk/view/person_id/2693.html
Michael Day       68            [View]        [View]                  http://opus.bath.ac.uk/view/person_id/571.html
Monica Duke       11            [View]        [View]                  http://opus.bath.ac.uk/view/person_id/2186.html
Kora Golub        5             [View]        [View]                  http://opus.bath.ac.uk/view/person_id/2421.html
Marieke Guy       34            [View]        [View]                  http://opus.bath.ac.uk/view/person_id/985.html
Brian Kelly       81            [View]        [View]                  http://opus.bath.ac.uk/view/person_id/588.html
Liz Lyon          19            [View]        [View]                  http://opus.bath.ac.uk/view/person_id/1039.html
Mahendra Mahey    2             [View]        [View]                  http://opus.bath.ac.uk/view/person_id/1707.html
Manjula Patel     92            [View]        [View]                  http://opus.bath.ac.uk/view/person_id/356.html
Catherine Pink    20            [View]        [View]                  http://opus.bath.ac.uk/view/person_id/3237.html
Rosemary Russell  15            [View]        [View]                  http://opus.bath.ac.uk/view/person_id/500.html
Stephanie Taylor  4             [View]        [View]                  http://opus.bath.ac.uk/view/person_id/2057.html
Emma Tonkin       70            [View]        [View]                  http://opus.bath.ac.uk/view/person_id/1560.html

Table 1: Information on UKOLN Items Deposited in Opus Repository

Posted in preservation | 2 Comments »

Jisc Report on Sustaining Our Digital Future: Institutional Strategies for Digital Content

Posted by Brian Kelly on 30 Jan 2013

Earlier today Jisc announced the launch of a report on Sustaining Our Digital Future: Institutional Strategies for Digital Content.

This report, which provides a close look at three institutions in the United Kingdom (UCL, Imperial War Museums and the National Library of Wales), confirms:

  • How fragmented the digital landscape is at universities and within other organisations.
  • How there are examples of good practice within and outside higher education that all can learn from but that greater co-ordination is required to deliver this at a UK level.
  • How little the topic of post-build sustainability comes up at the higher levels of administration.
  • How risk is present within the current system, concerning the sustainability of digital content.

The report (which is available in PDF format) is substantial, containing 88 pages. In addition to this main report a second document (also available in PDF format) provides a “Sustainability Health Check Tool for Digital Content Projects“.

This report is very timely, arriving at a time when we are seeing reductions in the levels of funding available across public sector organisations in the UK, which will lead to questions regarding the sustainability of existing online services and digital resources.

The report is based on a study conducted by Ithaka S+R, with funding from the Jisc-led Strategic Content Alliance, which reported on findings of earlier studies showing that both funders and project leaders rely heavily on their host institutions to support and sustain digital content, beyond the end of the grant. But what will happen when the host institutions have significantly reduced levels of funding to continue to maintain and develop such content?

The report describes the need for an “early and honest appraisal of which projects are likely to require .. support post-launch“:

  • Digital content, requiring just “maintenance”: These may not require ongoing growth, but certainly do require a clear exit plan to ensure that the content will be smoothly deposited and integrated into some other site, database, or repository. The issue of ongoing investment does not disappear; it just becomes the concern of the larger platform on which this piece of content now lives.
  • Digital resources, requiring ongoing growth and investment: These require early sustainability planning, including identifying institutional or other partners and careful consideration of the full range of costs and activities needed to keep the resource vibrant.

The Sustainability Health Check Tool provides a paper-based checklist for those with responsibilities for managing digital content. The tool covers a number of areas including ongoing support; audience, usage and impact assessment together with preservation issues.

A series of video clips have been produced to accompany the launch of this report. It was particularly interesting to hear the comment from Prof David Price, Vice-Provost (Research) at UCL:

“We’re not just worried about things disappearing but about things never appearing! They are hosted all over the place, and not all the projects have a sustainable plan.”

This video clip is available on YouTube.



Posted in preservation | Tagged: | 1 Comment »

Disappearing Conference Web Sites: Learning From the EUNIS Experience

Posted by Brian Kelly on 27 Nov 2012

EUNIS Conference Resources

Back in June 2005 I presented peer-reviewed papers on Let’s Free IT Support Materials!, IT Services – Help Or Hindrance To National IT Development Programmes? and Using Networked Technologies to Support Conferences. I also, I’ve just noticed, facilitated a half-day workshop session on Supporting Technology-Facilitated Learning In The Conference Environment – this was, I think, the first time I gave a workshop on what subsequently became better known as ‘amplified events’.

But what of the context of this work? The papers were presented at the EUNIS 2005 conference, with the workshop being one of several pre-conference sessions. The conference was held at the University of Manchester on 20-25 June 2005. But recently I noticed that the conference Web site, which was hosted at http://www.mc.manchester.ac.uk/eunis2005/, was no longer available.

Does this matter? The conference, which is organised annually by the European University Information Systems Organization, took place over 7 years ago. Might it not be argued that the sharing of best practices and innovation across IT support services departments across Europe does not need a record of best practices dating back to the mid 1990s?

EUNIS does provide information about its previous conferences, as illustrated. This shows that conferences were held in Düsseldorf in 1995 and Manchester in 1996. However the EUNIS 1997 conference, held in Grenoble, is the oldest EUNIS event for which Web resources are still available.

From the list of papers presented at EUNIS 1997 (which is hosted on the main EUNIS Web site) I discovered a paper on Information Services – the Convergence Agenda by M Clark, IT Services Director at the University of Salford, about mergers at Salford University.

The other papers with authors from UK institutions were Preservation of the Electronic Assets of a University by Alex Reid, University of Oxford; “Applying Risk Analysis Methods to University Systems” by W R Chisnall, University of Manchester; Managing Information for Management by John Townsend, Edge Hill University College and Information Strategy – a Tool for Institutional Change by Andrew Rothery, Worcester College of Higher Education and Ann Hughes, University of Nottingham.

Ironically all of these papers have some relevance to the disappearance of the EUNIS 2005 Web site. The conference took place shortly after a merger of the University of Manchester and UMIST, which led to the integration of the IT Service departments from both of these institutions, with subsequent changes in staffing, departmental names and responsibilities. It seems that Manchester Computing no longer exists, with the http://www.mc.manchester.ac.uk/ URL now being redirected to Research Computing at http://www.rcs.manchester.ac.uk/

It would appear that there is still a need for the sector to be able to develop strategic responses and use risk analysis methods to help ensure the preservation of digital resources arising from mergers. It would seem that all of the papers from the EUNIS 1997 conference still have some relevance!

Preserving Conference Resources

If, as has been suggested, old conference Web sites have value, how should one respond to the disappearance of sites such as the EUNIS 2005 Web site?

For me the first port of call is the Internet Archive’s Wayback Machine. It seems that the EUNIS 2005 Web site has been crawled 52 times, going all the way back to November 19, 2004.

The earliest archive contains a record of the call for papers and therefore does not contain any of the papers. It is therefore the latest archive, which was carried out on 5 May 2008, which should be of the most relevance. However in order to ensure that this archive contained relevant information I checked that it contained a copy of the final programme. As illustrated, the final programme is available in the archive, but I noticed that this page had been archived on 8 October 2007; there had been 14 captures of this page between 3 March 2006 and 8 October 2007.
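The capture counts and date ranges described above can also be gathered from the Wayback Machine's CDX server, which lists every capture of a URL. The following is a minimal sketch: the endpoint and the `output=json` and `fl` parameters are part of the public CDX interface, while the function names and the shape of the summary are my own assumptions. Fetching the query URL is left to the reader so that the summarising step can be run offline.

```python
import json
from urllib.parse import urlencode

# Wayback Machine CDX server, which lists individual captures of a URL.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query(page_url):
    """Build a CDX query returning the timestamp and HTTP status of each capture."""
    params = {"url": page_url, "output": "json", "fl": "timestamp,statuscode"}
    return CDX_ENDPOINT + "?" + urlencode(params)


def _as_date(stamp):
    """Convert a 14-digit capture timestamp to YYYY-MM-DD."""
    return f"{stamp[:4]}-{stamp[4:6]}-{stamp[6:8]}"


def summarise_captures(cdx_json_text):
    """Summarise a JSON CDX response: the first row holds the field names."""
    rows = json.loads(cdx_json_text) if cdx_json_text.strip() else []
    if len(rows) < 2:
        return {"count": 0, "first": None, "last": None, "ok": 0}
    header, captures = rows[0], rows[1:]
    ts = header.index("timestamp")
    status = header.index("statuscode")
    stamps = sorted(row[ts] for row in captures)
    return {
        "count": len(captures),
        "first": _as_date(stamps[0]),
        "last": _as_date(stamps[-1]),
        # Captures where the crawler recorded a successful response.
        "ok": sum(1 for row in captures if row[status] == "200"),
    }
```

The status recorded for each capture is the one the crawler received, so an error page which the original server delivered with a successful status would still be counted under "ok"; such pages need to be checked by eye.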

I also found that my papers on Using Networked Technologies To Support Conferences, Let’s Free IT Support Materials! and IT Services – Help Or Hindrance To National IT Development Programmes? were available in the archive. It was interesting to note that the archive included the PDF versions of the papers as well as the HTML resources for the conference Web site.

The Internet Archive appears to have been successful in keeping a copy of the key resources on the conference Web site. However when I followed a link to “Photographs from the Conference: (registration staff, sessions, conference dinner)” I found that the archive appeared to simply contain a copy of an error message.

This may have been a failing by the Internet Archive’s software but, looking at the path name, I suspect the crawler simply captured an error message generated by the EUNIS 2005 Web server software.

Next Steps

When I noticed that the EUNIS 2005 Web site had vanished I informed the EUNIS organisers and suggested that they may wish to provide a link to the Internet Archive’s copy. This has now been done. I have also updated the links to the conference Web site from my list of papers and presentations.

There are clearly operational decisions which need to be taken in order to minimise the risk of loss of content (and context) when intellectual content is deposited on conference Web sites. But what are the implications as we look to the future? For my content, I had previously ensured that the papers were deposited in the University of Bath repository so, for me, it was the loss of context which had the greatest significance. But what is likely to be the more sustainable resource in the future: the conference Web site hosted on an established, viable and trusted University Web site or the Internet Archive? I can’t help but feel that I should be looking to ensure that the Internet Archive contains a working copy of content currently hosted on areas of institutional Web sites which may not be sustained in light of policy or organisational changes. And what of EUNIS? Might they find it useful to provide links to the copies of previous EUNIS conferences held on Internet Archive, in addition to the existing conference Web sites?



Posted in Events, preservation | 3 Comments »

Case Study: Managing the Closure of a Project Web Site

Posted by Brian Kelly on 3 Sep 2012

The Importance of Managing the Closure of Project Web Sites

The project has been successfully completed. It is therefore time to move on to new areas of work. However just as there is likely to be a need for the project paperwork to be completed after the project deliverables have been submitted, there will also be a need to manage the closure of the project web site. This can be tiresome, especially when there are interesting new projects which will need to be started. However a failure to manage the closure of a project web site can be counter-productive – remember that the project’s web site may be looked at when evaluators are marking bids for new projects and if the project web site fails to work or contains inaccurate or misleading content, the evaluators could potentially be influenced by such negative experiences.

Best Practices

Many of the best practices for closing a project blog will appear obvious – but that does not necessarily mean they are implemented! This post therefore summarises best practices and provides an example of a project web site which was closed 8 years ago in order to investigate some of the practices which may only become apparent over a period of time.

Best practices relevant to visitors

The following examples of best practices are intended to ensure that visitors accessing the project web site are aware of the status of the project and do not encounter misleading information. Implementation of these best practices should be regarded as essential.

Provide a summary of the status of the project and links to key resources: The visitor who arrives at the project web site via Google may well be unaware of the status of the project. You should provide easily found information about the status of the project. Initially this may state when the project was completed although at a later date you may wish to add links to related follow-up work. You should also provide easily found links to the key resources related to the project which may include the final report and articles and papers related to the project’s activities.

Manage the content of the web site: As well as providing information about the work of the project, you should ensure that dated information or information which may become dated or misleading is edited or removed. This might include statements such as “Our workshop will be held next April” and “We will do xxx” – although such statements may legitimately continue to be provided in initial project planning documents. Dates should include years: for example the statement “Our workshop was held last April” may cause visitors to try to find workshop outputs if they are felt to be recent, but not if they are 10 years old!
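Finding statements which have become dated can be partly automated. The following is a minimal sketch which flags relative dates and future-tense wording in the text of a page so that a person can review them; the phrase lists and the function name are illustrative assumptions rather than a complete set, and pages such as the original project plan may legitimately contain matches.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Relative dates such as "next April" or "last year", which lose meaning over time.
RELATIVE_DATE = re.compile(
    r"\b(?:next|last|this)\s+(?:week|month|year|" + MONTHS + r")\b",
    re.IGNORECASE,
)

# Future-tense wording which is misleading once a project has finished.
FUTURE_TENSE = re.compile(
    r"\b(?:will be held|we will|is planned for|coming soon)\b",
    re.IGNORECASE,
)


def find_dated_phrases(text):
    """Return (line_number, phrase) pairs which a person should review."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for pattern in (RELATIVE_DATE, FUTURE_TENSE):
            for match in pattern.finditer(line):
                hits.append((line_no, match.group(0)))
    return hits
```

A statement which includes the year, such as "The workshop was held in April 2004", is not flagged, which reflects the advice that dates should include years.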

Best practices relevant to technical staff

The following practices are intended to simplify the technical infrastructure of the project web site and minimise ongoing technical support requirements. Implementation of these practices should be regarded as useful, especially if the project web site makes extensive use of server-side technologies or the site is likely to be reused elsewhere.

Audit the technologies used: Provide information which summarises the technologies used to deliver the project web site. For a simple web site, this might simply state that the web site is based on an Apache server which hosts HTML and CSS files, together with MS Word, MS PowerPoint and PDF files. A more complicated web site might be based on a CMS, such as Drupal, a blog platform such as WordPress, a database server such as MySQL, a search engine or a scripting environment, such as PHP.
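A first pass at such an audit can be made from the files themselves. The following is a minimal sketch, assuming you have a copy of the web site's files on disk; the function names and the set of extensions treated as server-side are my own assumptions and will need adjusting for a particular site. It will not detect technologies which leave no trace in file names, such as a database server.

```python
from collections import Counter
from pathlib import Path

# File extensions which usually indicate server-side processing (an assumed list).
SERVER_SIDE = {".php", ".cgi", ".pl", ".asp", ".aspx", ".jsp", ".py"}


def audit_file_types(site_root):
    """Count the files under `site_root` by lower-cased extension."""
    counts = Counter()
    for path in Path(site_root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


def server_side_types(counts):
    """Return the extensions in the audit which suggest server-side technologies."""
    return sorted(ext for ext in counts if ext in SERVER_SIDE)
```

The counts can be pasted into the audit statement, while the list from server_side_types() indicates where further investigation is needed before the site is frozen.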

Manage the technologies used on the web site: Once you have audited the technologies used on the project web site you may wish to switch off technologies which may require ongoing maintenance and support. You may wish to create a static web site of content managed by server-side technologies in order to simplify the technical infrastructure and minimise ongoing technical support requirements.

Audit third-party services used: Provide information which summarises third-party services used to deliver content or services. This might include Flickr badges, Twitter feeds, etc.

Manage the external services used to support the project web site: You may wish to switch off third-party services which are no longer needed, such as a dynamic Twitter feed. You should also ensure that the account details (username, password and email address used to authenticate changes) are also managed and aren’t owned solely by an individual (who may leave the organisation).
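Embedded third-party services can be identified by looking for resources which a page loads from other hosts. The following is a minimal sketch using Python's standard HTML parser; the class and function names are my own, and the choice of tags to inspect is an assumption which covers scripts, images, frames and stylesheets but not ordinary hyperlinks.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalResourceFinder(HTMLParser):
    """Collect the hosts of embedded resources which are not served by the site itself."""

    # Tag name mapped to the attribute which holds the resource URL.
    EMBED_ATTRS = {
        "script": "src",
        "img": "src",
        "iframe": "src",
        "embed": "src",
        "link": "href",
    }

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host.lower()
        self.external_hosts = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.EMBED_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                host = urlparse(value).netloc.lower()
                # Relative URLs have no host and are served by the site itself.
                if host and host != self.own_host:
                    self.external_hosts.add(host)


def third_party_hosts(html, own_host):
    """Return a sorted list of external hosts used by embedded resources in a page."""
    finder = ExternalResourceFinder(own_host)
    finder.feed(html)
    return sorted(finder.external_hosts)
```

Running third_party_hosts() over each page of the site gives a list of hosts to record in the audit, each of which can then be matched to an account whose ownership needs to be managed.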

Verify the long-term persistency of the project Web site’s domain: If the project Web site is not hosted on the institution’s main Web site there will be a need to ensure that the project’s domain name persists and the service continues to be operational. If a domain name has been purchased it may be sensible to ensure that the domain is registered for an appropriate period of time. As described in a post entitled Link Checking For Old Web Sites it may be useful to set up an automated alert so that you receive notification if the Web site becomes unavailable.
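An automated alert of this kind can be very simple. The following is a minimal sketch, not the approach described in the Link Checking For Old Web Sites post: it requests each site's entry point and reports those which fail. The function names are my own, and how the report is delivered (for example by email from a scheduled job) is left as an assumption.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_status(url, timeout=30):
    """Return the HTTP status for `url`, or None if the site cannot be reached."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (URLError, OSError):
        # DNS failure, refused connection or timeout, e.g. a lapsed domain.
        return None


def check_sites(urls, fetch=fetch_status):
    """Return (url, problem) pairs for sites which are unreachable or return errors."""
    problems = []
    for url in urls:
        status = fetch(url)
        if status is None:
            problems.append((url, "unreachable"))
        elif status >= 400:
            problems.append((url, f"HTTP {status}"))
    return problems
```

Note that redirects are followed, so a site whose address now redirects to an unrelated page will appear to be working; a check of this kind complements, but does not replace, an occasional visit to the site.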

Case Study: The QA Focus Web site

The JISC-funded QA Focus project ran from January 2002 to July 2004. The aim of the project was stated on the project home page:

QA Focus’s remit was to help ensure that project deliverables were interoperable and widely accessible. We sought to ensure that projects deployed appropriate standards and best practices. We did this by providing support materials which explained the recommended standards and best practices and carrying out a number of surveys which helped to share evolving practices across the projects.

We stopped developing the QA Focus project web site in 2004 and implemented some of the best practices given above. The project home page contained links to the project’s Final Report and the Impact Analysis statement, descriptions of the QA methodology developed by the project and accompanying briefing papers, case studies, papers and articles as well as names of the staff involved in the project delivery.

In some respects the freezing of the project web site was simple, as the project took place prior to the availability of many Web 2.0 services. However the project web site did make use of a database hosted on a Windows NT server which had to be discontinued. In addition to myself and my colleague Marieke Guy, the project was supported initially by staff at TASI (now JISC Digital Media) but for most of the life of the project by staff at the AHDS, an organisation which no longer exists, so it will not be easy to recall the technical decisions which may have been made.

In revisiting the QA Focus web site it was noticed that two third-party services had been used: the AvantGo service had been used to provide access to the web site on a PDA and SiteMeter had been used to provide usage statistics. According to Wikipedia the AvantGo service was discontinued in 2009. However the AvantGo page still provided a link to the service, which could potentially be embarrassing. This link was removed. The embedded SiteMeter service had been removed previously.

The project web site contains a number of RSS feeds which provide access to the main project deliverables. Since these are simple static files and the content will not change, these files have been kept. However it was noticed that a link to an email service which informed subscribers of newly published documents was still available. Since no new documents will be published, and the status of the RSS to email service is unknown, this link was removed.

Although the links to the database server had been removed it was realised that the web site search engine appeared to provide the one remaining example of a server-side technology used to support the project web site. The search link on the web site’s navigation bar provides access to two search facilities: the UKOLN web site’s SWISH-E service and a Google search of the project web site. It is felt that this will provide a useful service and the duplication will minimise problems if either of the search facilities stops working.

Audits of the technical robustness of the project web site, covering link checking and HTML conformance, had been carried out in 2004, while the project was still live. These audits were repeated recently in order to ensure that no errors had been introduced over the previous 8 years. Following this work a page (which is illustrated) providing information on the technical architecture and links to the automated surveys, together with a summary of the main content areas and file types was provided in order that a public record is available. Links to the page have been added to the project home page. In addition an “Archived site” watermark was added to key pages on the project web site.

The Xenu link checker also provided a report on the file formats found on the web site. This provided a useful way of complementing the knowledge I have about the web site. It should be noted, however, that the video/unknown files listed in the table actually refer to WMF (Windows Metafile format) images which were produced in conversion of Microsoft PowerPoint files to HTML format. The table, which has been included on the QA Focus web site, is reproduced below.

MIME type | count | % count | Σ size | Σ size (KB) | % size | min size | max size | Ø size | Ø size (KB) | Ø time
text/html | 662 URLs | 48.71% | 7277902 Bytes | 7107 KB | 11.38% | 543 Bytes | 684983 Bytes | 10993 Bytes | 10 KB | 0.026
text/xml | 6 URLs | 0.44% | 7600 Bytes | 7 KB | 0.01% | 503 Bytes | 2731 Bytes | 1266 Bytes | 1 KB |
application/xml | 94 URLs | 6.92% | 535671 Bytes | 523 KB | 0.84% | 1682 Bytes | 53907 Bytes | 5698 Bytes | 5 KB |
text/css | 13 URLs | 0.96% | 32101 Bytes | 31 KB | 0.05% | 180 Bytes | 12144 Bytes | 2469 Bytes | 2 KB |
image/gif | 81 URLs | 5.96% | 1811489 Bytes | 1769 KB | 2.83% | 50 Bytes | 120552 Bytes | 22364 Bytes | 21 KB |
image/png | 148 URLs | 10.89% | 12472014 Bytes | 12179 KB | 19.50% | 2998 Bytes | 379755 Bytes | 84270 Bytes | 82 KB |
application/msword | 263 URLs | 19.35% | 29181952 Bytes | 28498 KB | 45.64% | 34304 Bytes | 1861120 Bytes | 110957 Bytes | 108 KB |
application/pdf | 4 URLs | 0.29% | 931936 Bytes | 910 KB | 1.46% | 131571 Bytes | 318749 Bytes | 232984 Bytes | 227 KB |
image/jpeg | 6 URLs | 0.44% | 103860 Bytes | 101 KB | 0.16% | 3284 Bytes | 34689 Bytes | 17310 Bytes | 16 KB |
application/vnd.ms-powerpoint | 38 URLs | 2.80% | 10736640 Bytes | 10485 KB | 16.79% | 61952 Bytes | 1052160 Bytes | 282543 Bytes | 275 KB |
text/plain | 8 URLs | 0.59% | 98528 Bytes | 96 KB | 0.15% | 308 Bytes | 57382 Bytes | 12316 Bytes | 12 KB |
video/unknown | 35 URLs | 2.58% | 733306 Bytes | 716 KB | 1.15% | 9486 Bytes | 21400 Bytes | 20951 Bytes | 20 KB |
application/vnd.ms-excel | 1 URLs | 0.07% | 22016 Bytes | 21 KB | 0.03% | 22016 Bytes | 22016 Bytes | 22016 Bytes | 21 KB |
Total | 1359 URLs | 100.00% | 63945015 Bytes | 62446 KB | 100.00% | | | | |
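For those who do not have access to Xenu, a summary of this kind can be produced from any crawl which records a MIME type and a size for each file. The following Python sketch shows one way of doing this. It is not how Xenu itself works; the function name, the row keys and the sample figures are my own assumptions.

```python
from collections import defaultdict


def summarise_by_mime_type(files):
    """Summarise (mime_type, size_in_bytes) pairs, largest total size first."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for mime_type, size in files:
        counts[mime_type] += 1
        sizes[mime_type] += size

    total_count = sum(counts.values())
    total_size = sum(sizes.values())

    rows = []
    for mime_type in sorted(counts, key=lambda m: sizes[m], reverse=True):
        rows.append({
            "mime_type": mime_type,
            "count": counts[mime_type],
            "pct_count": round(100 * counts[mime_type] / total_count, 2),
            "total_bytes": sizes[mime_type],
            "pct_size": round(100 * sizes[mime_type] / total_size, 2) if total_size else 0.0,
            "mean_bytes": sizes[mime_type] // counts[mime_type],
        })
    return rows


# Hypothetical sample data: three files from an imaginary crawl.
sample = [("text/html", 100), ("text/html", 300), ("image/png", 600)]
for row in summarise_by_mime_type(sample):
    print(row)
```

Sorting by total size rather than by count highlights the formats which dominate storage: in the QA Focus table the MS Word files make up under 20% of the URLs but over 45% of the bytes.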

Following the audit and appropriate updates to the content of the project Web site the Web site was submitted to the British Library’s UK Web Archive service.

Discussion

This case study intentionally aims to provide a simple example of a project web site hosted on the main institutional web site with limited use of server-side or client-side technologies and little use of third-party services. However even in this case study it was felt useful to document the technical architecture of the site and summarise the main content areas, especially since this work only took a couple of hours (which included writing this post).

Would it be reasonable to expect projects to provide a similar summary to the one provided for the QA Focus web site as part of the official closing of a project? I’d welcome your thoughts.



Posted in preservation | Leave a Comment »

“Conferences don’t end at the end anymore”: What IWMW 2012 Still Offers

Posted by Brian Kelly on 25 Jun 2012

IWMW 2012 Is Over: Long Live IWMW 2012!

“Conferences don’t end at the end anymore” tweeted @markpower two days after IWMW 2012 delegates had left Edinburgh and returned home. This has always been the case: conference organisers will have evaluation forms to analyse and invoices to chase. But the point Mark was making related to the continuing discussions about the ideas explored at an event and the accompanying resources. Increasingly these resources are created during the event itself and, together with support for the participants, they can help to ensure that an event is not just a collection of individuals who are co-located for a few days but, as I described in a recent post, a sustainable and thriving community of practice. A related point was made recently in a post on “#mLearnCon 2012 Backchannel – Curated Resources” in which David Kelly described how “The backchannel is an excellent resource for learning from a conference or event that you are unable to attend in-person” and went on to add that he finds “collecting and reviewing backchannel resources to be a valuable learning experience …, even when [he is] attending a conference in person. Sharing these collections on this blog has shown that others find value in the collections as well.” But what are the resources from IWMW 2012 which may be of interest to others, where can they be found and what value may they provide?

Key Resources

Slideshare

The slides used by the plenary speakers were uploaded to Slideshare in advance of the talks in order to allow the slides to be embedded in relevant Web pages and enable a remote audience to view the slides. This also allowed participants at the event to view the slides if they were not able to see the main display. The slides have been tagged with the “iwmw12” tag on Slideshare. This enables the collection of slides to be accessed by a search for this string or by browsing slideshows which use this tag. Note that in previous years an event tag had been used, but this service was discontinued recently, after Slideshare had been bought by LinkedIn.

Creating a collection of slides used at the event enables a Slideshare presentation pack to be created, as illustrated, thus making it easy to access all slides used at the event which have been made available. As can be seen from the IWMW 2012 web site, the presentation pack can be embedded in Web pages. This service is being used since participants at IWMW have frequently asked to be able to access slides, including slides used in parallel sessions which they were not able to attend. Using Slideshare makes it easy to respond to this user need. In addition it helps to raise the profile  and visibility of speakers at the event.

Lanyrd

The IWMW 2012 Lanyrd page was set up in advance to provide a social directory for participants at the event so they could see who else was attending. The value of this grows as Lanyrd is used across a number of events: from my Lanyrd profile, for example, I can see that I have appeared at events on 12 occasions with my colleague Marieke Guy and on 5 occasions with Paul Boag, Tony Hirst, Andy Powell, Keith Doyle and Mike Nolan. In addition to the social dimension, Lanyrd also provides calendar entries for sessions at events. The date and time of sessions at IWMW 2012 have been provided, together with links to the main page on the IWMW 2012 web site, slideshows and links to reports on the sessions which we are aware of. It should be noted that, as illustrated, Lanyrd has a wiki-style environment for uploading resources which avoids the single-curator bottleneck. As the person who set up the IWMW 2012 Lanyrd entry, together with the IWMW guide for all IWMW events, I receive an email alert when new entries are added to the coverage, such as:

<http://lanyrd.com/2012/iwmw12/?t=c955d8172reV> (In guide IWMW) [22nd Jun 2012 07:52] *
@sheilmcn added coverage “Developing Digital Literacies and the role  of institutional support services” (http://www.slideshare.net/sheilamac/developing-digital-literacies-and-the-role-of-institutional-support-services  type:slides)
to session  “B2: Developing Digital Literacies and the Role of Institutional  Support Services” http://lanyrd.com/sqwtp

This can help to spot if inappropriate entries are being added.

Vimeo

As described in a post on Streaming of IWMW 2012 Plenary Talks – But Who Pays? we used the ustream.tv service for the live video stream. The videos are currently being processed and will be made available via UKOLN’s Vimeo account shortly. This service will be used to widen access to the plenary talks so that they are available for those who were not present at the event – although, of course, they can also be viewed by people who were at the event and wish to watch the talks again. In addition to the video recordings of the talks we have also recorded a number of short interviews with participants at the event which will enable their thoughts on the event to be shared with a wider audience.

Flickr

With so many delegates now having digital cameras and smartphones there are a large number of photographs which have been uploaded to Flickr with the IWMW12 tag which can help to provide a collective memory of the event.

Having a large number of photographs, rather than a small set of selected ones taken by an official photographer, provides a much broader perspective on the event. It also means that image-browsing services, such as Tag Galaxy, are more useful because they have a more diverse range of content to display.

The two images show a display of a Tag Galaxy search for photographs on Flickr with the “iwmw12” tag and one of the many photographs taken by Sharon Steeples of the final conclusions session during which I showed an image of the video stream, captured earlier that morning when Dawn Ellis gave a summary of Web developments at the University of Edinburgh, subverting normal conference-style approaches to case studies by telling this as a fairy tale. The video recording of this talk will be particularly worth watching.

Twitter

As can be seen from the image shown above, the lecture theatre also has a large blackboard. The opportunity to use a blackboard during the final session was too tempting to ignore – so in the summing up a tweet posted on the blackboard was displayed, as a reminder that not everyone necessarily has a mobile device they could use for tweeting. However many people did use Twitter during the event. As is widely known, content posted on the Twitter stream becomes unavailable after a short period. There is therefore a need to analyse event tweets shortly after an event – or archive the tweets to allow them to be analysed subsequently.

Topsy

As can be seen from the image of the Topsy search for #IWMW12 tweets posted over the past 7 days there were 666 mentions on 18 June and 574 on 19 June. The most highly tweeted link was to the IWMW 2012 video page, which was mentioned in 43 tweets during the week of 17-24 June 2012. In total Topsy reported that there were 748 tweets during the week of 17-24 June 2012, 808 in the month from 24 May-24 June and an overall total of 846 tweets to date.

Other Commercial Twitter Analytics Tools

It should be noted that a large number of Twitter analytics tools are available which can be used to analyse how Twitter has been used. The Tweetreach service, for example, reports that tweets containing the #iwmw12 hashtag have reached 7,553 Twitter accounts. However, as is often the case with usage statistics, such figures need to be treated with a pinch of salt.

Beyond Commercial Twitter Analysis Tools

Topsy, Tweetreach and other Twitter analytics tools can provide a useful summary of use of Twitter hashtags. However in the UK higher education development community we are fortunate to have the expertise of developers such as Martin Hawksey and Tony Hirst who have a well-established track record in the development of valuable Twitter analysis tools and who can continually develop their tools based on particular needs and interests of the community.

As Martin described in a post entitled IWMW12 Data Hacks, for the IWMW 2012 event he was “collecting an archive of tweets which already gives you the TAGSExplorer view”.

Looking at Martin’s Twitter archive of #iwmw12 tweets, provided by the TAGS v.40 service, we can see that the top five Twitterers were @iwmwlive (281 tweets), @PlanetClaire (149 tweets), @sharonsteeples (103 tweets), @mariekeguy (100 tweets) and @jessica_hobbs (81 tweets). Since the @iwmwlive Twitter account was managed by Kirsty Pitkin it seems that the top Twitterers at the event were all female: this seems particularly interesting in light of the fact that only about a quarter of the participants were female.

This tool also provides a display of the tweets over time. It can be seen (right) that tweeting peaked at 2pm on Tuesday, 19 June 2012 with 229 tweets.
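For readers who have their own archive of tweets, a peak of this kind can be found by grouping the tweet timestamps by hour. The Python sketch below illustrates the idea. It is not taken from Martin’s TAGS code: the function name and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def busiest_hour(timestamps):
    """Return (hour, number_of_tweets) for the busiest hour, or None if there are no tweets."""
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    if not per_hour:
        return None
    return per_hour.most_common(1)[0]


# Hypothetical sample: three tweets between 2pm and 3pm, one just after 3pm.
sample = [
    datetime(2012, 6, 19, 14, 5),
    datetime(2012, 6, 19, 14, 40),
    datetime(2012, 6, 19, 14, 59),
    datetime(2012, 6, 19, 15, 1),
]
print(busiest_hour(sample))
```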

Finally I should mention Martin’s most recent development:  a filterable/searchable archive of IWMW12 tweets. As illustrated below, this provides a clickable word cloud of the content of the tweets, together with a search box and browse interface for the tweets.  It was while browsing the tweets that I came across a comment from @JohnGreenway who, during the conclusions, tweeted:

As someone from a commercial background, #iwmw12 has been excellent – hope everyone in HE realises how rare this is in other industries!

Such live tweeting helped in providing useful real time feedback not only to the event organisers but also the plenary speakers.  Other comments received during the event included:

  • Excellent talk by Stephen Emmott – always a reliable IWMW speaker! #iwmw12 from @adriant
  • First time at #iwmw12 and had a brilliant time. Great ideas, great people, great weather, who could ask for more. from @millaraj
  • First time at IWMW: great speakers, interesting topics, fantastic Ceilidh. Many thanks to organisers and presenters. #IWMW12 #new #social from @seajays
  • Great summary by @sloands on how to build accessibility into project management processes using BS8878 #iwmw12 from @chistabel6

Further examples of tools which Martin Hawksey developed at the IWMW 2012 event can be accessed from his Delicious IWMW12 Hacks set of bookmarks.

The paper.li Daily newspaper

Finally I should mention the IWMW12 paper.li daily newspaper, which had been set up in advance of the event. This automated newspaper consisted of articles based on links which had been tweeted with the event hashtag.

Reflections

Conferences have never ended immediately after the final talk has been given – there is always the paperwork to be processed, the evaluation forms to be analysed and feedback to be given to the speakers and local event organisers. What is different nowadays is that event resources and discussions are no longer ‘trapped in space and time’. If an event has value, it should surely have value for those who may not have been able to attend.

It was therefore appropriate that during my opening talk I was able to announce the launch of the JISC-funded Greening Events II; Event Amplification report. As Mark Power put it: “Conferences don’t end at the end anymore” – you need to make plans for managing the resources after the conference is over. We hope the report will be useful for those planning amplified events.


NOTE: Shortly after this post was published a post entitled “But who is going to read 12,000 tweets?!” How researchers can collect and share relevant social media content at conferences was posted on the LSE Impact of Social Sciences blog which echoed the approaches described in this post.

Posted in Events, Evidence, preservation, Twitter, Web2.0 | 3 Comments »

Thoughts on “The Future of the Past of the Web” Event (#fpw11)

Posted by Brian Kelly on 10 Oct 2011

On Friday 7th October 2011 I attended a one-day event on “The future of the past of the web”. The event, which was organised by the British Library, the Digital Preservation Coalition (DPC) and the JISC, was the third joint Web archiving workshop, the previous two workshops having been held in 2006 and 2009.

I have had an interest in this area for some time, having given a talk way back in 2002 on “Archiving The UK Domain and UK Web Sites: What Are The Issues?” at a DPC seminar on “Web-archiving: managing and archiving online documents and records”. It seems that the Web archiving world has changed significantly since I gave my talk and, indeed, since the first two workshops. As a number of people commented, many of those involved in Web archiving initiatives are no longer primarily focussed on archiving conventional Web ‘pages’ – rather the sector is facing the challenges of archiving a much more dynamic environment, with the Social Web now providing significant content which social historians of the future will wish to analyse in order to make sense of today’s online (and offline) environment.

The changes in emphasis can also be seen from the developments of end user services which can help to make the importance of Web archiving more obvious to the wider community. In the opening plenary talk Herbert Van de Sompel described Memento, an initiative which is looking to “add time to the Web” by developments which build on existing web protocols including HTTP and content negotiation.

A Memento plugin for Firefox is available which enables end users to gain an understanding of the benefits which such developments can provide. I was also pleased to hear that a Memento Browser is available for Android mobile devices. For those who may not be able to install such applications, use of Memento’s capabilities can also be seen by using the Internet Archive’s Wayback Machine. As can be seen from the accompanying image you can view the BBC News Web site for October 2008, and perhaps reminisce about the early days of the financial crisis.
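To give a flavour of what “adding time to the Web” looks like at the protocol level, the Python sketch below builds the two things a client needs: the value of the Accept-Datetime request header which Memento uses for content negotiation in the time dimension, and a Wayback Machine address for a page at a given moment. The sketch makes no network requests. The function names and the chosen date are my own assumptions, and the Wayback address pattern is simply the one visible in the browser address bar, not a documented guarantee.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(moment):
    """Format a timezone-aware datetime as an Accept-Datetime header value."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def wayback_url(url, moment):
    """Build a Wayback Machine address for url as captured near the given moment."""
    stamp = moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{url}"


# Hypothetical example: the BBC News home page in mid-October 2008.
moment = datetime(2008, 10, 15, 12, 0, 0, tzinfo=timezone.utc)
print("Accept-Datetime:", accept_datetime_header(moment))
print(wayback_url("http://news.bbc.co.uk/", moment))
```

A Memento-aware client would send the header to a TimeGate for the original address and be redirected to the archived copy closest to the requested time.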

Further examples of rich interactive interfaces to Web archives have been developed to enhance the  UK Web Archive service and, as described by Maureen Pennock and Lewis Crawford, this includes N-Gram visualisations of searches across the archive, tag clouds generated from the General Election 2005 Collection and a 3D wall visualisation across archived collections.

Services provided by the British Library have, of course, always been valued by researchers. But in a talk on “Web Archiving: the State of the Art and the Future” Eric Meyer, Research Fellow Director at the Oxford Internet Institute, asked us to consider how effective we have been in making social science researchers aware of the potential of Web archives in supporting their research. There is, I feel, a need for further advocacy to ensure that researchers are aware of the ways in which not only archived digital resources, but also data associated with such archives, can support research interests.

The increasing importance of Web archiving has led to archiving tools and services being developed within the commercial sector in addition to activities led by national libraries and archives, higher education and EU-funded consortia. Mark Williamson was invited to give a presentation at the last minute and described various archiving activities of his company, Hanzo. It was interesting to hear how a well-known multi-national company such as Coca Cola, which, as might be expected, has well-established processes for archiving physical objects, was slow to recognise the importance of digital archiving, initially of its public Web site and then of its public presence on social web sites including the Coca Cola Facebook page. Mark also described how APIs are being developed for the Hanzo Web archiving system and how the APIs would be valuable in analysing the data associated with large collections of Web archives. As Mark put it: “The individual pages in a web archive are pretty boring – it’s the Big Data that’s exciting”. It will be interesting to see whether the Hanzo software could provide a solution for Universities which may be interested in archiving their digital presence, especially uses of social web services for which the content cannot be managed through the content management system used to manage the institutional Web presence.

As well as finding the talks at the workshop of interest it was also interesting to observe the gaps. In the final session Neil Grindley, JISC Programme Manager for digital preservation, asked the panel for their thoughts on standards for web archiving – and found that no one on the panel responded. However in response to my tweet that:

Interesting that nobody wanted to respond to the question about standards for web archiving at #fpw11

Helen Hockx commented that:

@briankelly I agree. Both ISO and BSI have initiated and are going to initiate work on standards related to web archiving.

If the next Web archiving event is held in another two years’ time, it will be very interesting to see what the focus of development work will be. Ten years ago the drive for Web archiving came from national and international bodies. However, as suggested in a tweet posted a few hours ago by Les Carr, who provided a link to a blog post on using EPrints repositories to collect data from Twitter, perhaps we shall see institutions appreciating the value of digital content created by members of the institution, including content hosted outside of the institution. Or perhaps, as suggested by the EU-funded Arcomem project, it may be large EU-funded projects which help to preserve today’s cultural memories which are held on online services, including social web services. And although motivated individuals may wish to make use of tools such as Memolane, a “Social Web application that captures all of your memories from different Social Networks like Flickr, Facebook, Twitter, Youtube” highlighted on the Arcomem Website as a “Personal Timemachine for the Social Web”, in reality I don’t think we can leave it to individuals to take responsibility for preserving their own public content. Of course, this raises the question of ‘walled gardens’, which apparently mean that content cannot be accessed by third parties, and issues such as privacy and copyright. I wonder if the next Web archiving workshop will have got bogged down by the difficulties which such issues raise, or if ways of circumventing such difficulties may have been found?

Posted in preservation | 4 Comments »

“Battling legal, logistical and technical obstacles to archiving the Web”

Posted by Brian Kelly on 12 Sep 2011

Recent Features on Web Archiving

The recent guest blog post entitled Web archives: more useful than just a ‘historical snapshot’ was quite timely, having been published a few days after a related article in the Times Higher Education (Memory Failure Detected) which described how:

A coalition of the willing is battling legal, logistical and technical obstacles to archive the riches of the mercurial World Wide Web for the benefit of future scholars

The article went on to illustrate a use case from the preservation of Web resources:

It is 2031 and a researcher wants to study what London’s bloggers were saying about the riots taking place in their city in 2011. Many of the relevant websites have long since disappeared, so she turns to the archives to find out what has been preserved. But she comes up against a brick wall: much of the material was never stored or has been only partially archived. It will be impossible to get the full picture.

But, as I describe below, we don’t need to wait until 2031 to have a reason to analyse Web content which may have been thought to be ephemeral.

Analysis of Twitter Usage at Recent ALT-C Conferences

The article in the Times Higher Education referred to an archiving initiative led by the Library of Congress which is archiving Twitter posts and which will allow, at some time in the future, researchers to analyse public tweets. The article could also have mentioned the TwapperKeeper archiving service which benefitted from JISC funding to enhance its archiving capabilities to address requirements of the UK HE sector. The TwapperKeeper service was used to keep an archive of tweets posted about last week’s ALT-C 2011 conference. The JISC-funded developments to the service included the provision of enhanced API access which led to development of the Summarizr analysis service by Andy Powell at Eduserv.

In order to make valid comparisons across annual events I have previously suggested that the Twitter traffic for a week is analysed, so that discussions in advance of an event and shortly afterwards can be analysed. The Summarizr statistics for tweets at the ALT-C conferences for the past three years are given in the following table.

Note: Following the publication of this post Martin Hawksey pointed out in a comment on the post that the TwapperKeeper archive was not available at the start of the ALT-C 2011 conference, until he created the archive on the opening morning of the conference. An updated column has been published, but note that this does not include tweets from the opening morning of the conference.

 | ALT-C 2009 | ALT-C 2010 | ALT-C 2011 | ALT-C 2011 (updated)
Date of event | 8-10 Sept 2009 | 7-9 Sept 2010 | 6-8 Sept 2011 | 6-8 Sept 2011
Dates for analysis | 6-12 Sept 2009 | 5-11 Sept 2010 | 4-10 Sept 2011 (partial archive) | 6-11 Sept 2011
Nos. of tweets | 4,442 | 6,138 | 6,296 | 6,342
Nos. of users | 726 | 658 | 802 | 809
Nos. of URLs tweeted | 701 | 664 | 1,083 | 1,102
Top five twitterers | jamesclay (168); sputuk (113); haydnblackey (112); emmadw (110); JackieCarter (97) | dajbconf (330); timbuckteeth (279); AJCann (174); jamesclay (153); jak82 (111) | digitalfprint (327); timbuckteeth (212); sarahhorrigan (187); FieryRed1 (165); kevupnorth (140) | digitalfprint (327); timbuckteeth (217); sarahhorrigan (187); FieryRed1 (165); amcunningham (141)
Top five tweeted hashtags | altc2009 (4,333); jisccdd (108); dubaimetro (84); wheniwaslittle (72); dupedb (64) | altc2010 (6,089); digilit (173); awesome (25); altc2011 (24); fail (23) | altc2011 (6,194); ds106radio (54); altc2012 (42); oer (39); opencountry (35) | altc2011 (6,240); ds106radio (54); altc2012 (42); oer (39); opencountry (35)
Nos. of geo-located tweets | 0 (0%) | 35 (0%) | 83 (1%) | 83 (1%)

Archiving of the tweets allows us to provide such analyses in order to see the importance of Twitter at such events and identify the people who are particularly active Twitter users at the events. The figures also suggest that the amount of Twitter traffic seems to have stabilised over the past two years and that geo-location of tweets, although growing in numbers, is not yet being used to any significant extent.
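Once an archive of tweets exists, statistics of the kind shown in the table are straightforward to compute. The Python sketch below shows one way of doing so. It is not the Summarizr code: the function name, the “user” and “text” field names and the sample tweets are assumptions of mine, and I have assumed that the URL count refers to distinct URLs.

```python
import re
from collections import Counter

URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#(\w+)")


def summarise_tweets(tweets, top_n=5):
    """Summarise a list of tweets, each a dict with 'user' and 'text' keys."""
    users = Counter(tweet["user"] for tweet in tweets)
    urls = set()
    hashtags = Counter()
    for tweet in tweets:
        urls.update(URL.findall(tweet["text"]))
        # Remove URLs first so that '#fragment' parts are not counted as hashtags.
        text_without_urls = URL.sub("", tweet["text"])
        hashtags.update(tag.lower() for tag in HASHTAG.findall(text_without_urls))
    return {
        "tweets": len(tweets),
        "users": len(users),
        "urls": len(urls),
        "top_users": users.most_common(top_n),
        "top_hashtags": hashtags.most_common(top_n),
    }


# Hypothetical sample archive.
sample = [
    {"user": "alice", "text": "Great keynote #altc2011 http://example.org/a"},
    {"user": "bob", "text": "Agreed #ALTC2011 #oer"},
    {"user": "alice", "text": "Slides: http://example.org/a #altc2011"},
]
print(summarise_tweets(sample))
```

Hashtags are lower-cased before counting so that #ALTC2011 and #altc2011 are treated as the same tag.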

The Coalition of the Willing – Should Include You

The article published in the Times Higher Education highlighted a number of examples of initiatives designed for archiving the broad range of resources available on the Web, including work being undertaken at the British Library, the Library of Congress and the Internet Archive as well as a number of national libraries in Europe.

The emphasis on national and international organisations may lead to the impression that archiving of Web resources is being addressed by others and so there is no need for individual universities to consider web preservation issues. This is, I feel, a mistaken view. Indeed, those who have a responsibility for the management of institutional digital resources need to address preservation issues, and so too do those who manage project resources as well as, as we have seen above, those who may wish to preserve content associated with events.

JISC has recognised the importance of Web archiving and will be hosting an event on “The Future of the Past of the Web” which will be held at the British Library Conference Centre on 7 October 2011. This free event is the third joint Web archiving workshop which has been organised by the JISC in conjunction with the British Library and the DPC. The event is aimed at:

  • Curators, librarians, archivists interested in the preservation of web resources
  • Organisations that are engaged in web archiving and digital preservation
  • Researchers who depend on access to stable web resources for their research
  • Web developers and content creators who value their content
  • Information managers with responsibility for legal compliance

If this event is of interest to you note that bookings should be made before 12:00 on Friday 30th September 2011.

Posted in Events, preservation | Tagged: | 4 Comments »

Guest Post: Web archives: more useful than just a ‘historical snapshot’

Posted by Brian Kelly on 7 Sep 2011

In this guest blog post Maureen Pennock, the Web Archive Engagement & Liaison Manager at the British Library, explores some possible approaches to exploiting the scholarly value of web archives.

Web archives: more useful than just a ‘historical snapshot’

The importance of the internet for research is well-known. As a constantly growing and evolving information source, the web contains vast amounts of information not available or published elsewhere. It is also a unique record of life and society in this technological age. Rarely these days do scholars carry out their research without going online, and the research value of the web is undeniable.

Web archives seek to capture this value and uniqueness by harvesting websites so that they may be re-used in the future even when they are no longer available on the live web. Over the past decade, numerous web archives have been established and grown, including the UK Web Archive. At almost 10 terabytes, over 9,300 web sites and 38,000 instances of archived sites, the UK Web Archive is a unique selective web archive that reflects the collection policies of the participating institutions.

Use of the web archive is steady. However, as recent reports have identified, there remains a gap between the potential community of researchers who could exploit the content, and those who actually do so. To address this, we are collaborating with researchers to explore different ways in which they may use the web archive and exploit the data contained within. We have developed and released a number of visualisation tools as an early first step:

  • the 3D Visualisation Wall (shown below), which provides a high-level, more dynamic presentation of search results and special collections;
  • the N-Gram search, which encourages users to consider the web archives as data as well as websites, enabling visualisation and comparisons of term frequency;
  • the General Election 2005 Tag Cloud, which visualises the most frequently used (single and pairs of) words in the websites related to key political parties during the 2005 election campaign.

Analysis shows that our single most popular site is the One & Other site, otherwise known as the Fourth Plinth, the website of a 2009 public arts project by artist Antony Gormley. The site is no longer available on the live web. This type of usage, where users browse websites in order to access content that was available at a given point of time but is no longer accessible, is a widely accepted, original user scenario. It is based largely on original user experiences and early interactions with the live web. But there are other ways in which a web archive may be used, aside from visiting sites as they were captured at a given date and time. For example:

  1. Resource citation. Researchers typically use the live web for research and cite live web resources with the date last visited. Why? Because content changes over time and they want to indicate when the content was available on the website. But if the content changes – and web pages are frequently updated or refreshed without archiving old versions – then there is no proof that the content cited actually existed. The web archive provides a more reliable and persistent citation than the live web.
  2. Data exploitation. Web archives enable automatic identification of social trends over time (automated temporal trend research). The tools available will impact on the type of research that can be undertaken. This is a chicken & egg scenario: we rely to an extent on users to tell us what tools they want, but users need some direction on what might be possible with the data available. We need to work together to further develop the archive and support the emerging research needs of our users.
  3. Intelligent querying, of the Q&A sort. Given the amount of data available in the web archive, it’s not inconceivable that future users will expect a more intelligent query mechanism than simple search and result presentation. More complex questions, for example, ‘tell me about the competing interests of oil companies in the late twentieth century’ are the stuff of sci-fi but rely upon an extensive historical database – such as a web archive.
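The N-Gram search and the data exploitation scenario both rest on counting how often a term appears in captures taken at different dates. A minimal Python sketch of that idea follows. The `captures` sample data and the `term_frequency_by_year` function are hypothetical illustrations and say nothing about how the UK Web Archive tools are actually implemented.

```python
import re
from collections import Counter

def term_frequency_by_year(captures, term):
    """Return {year: relative frequency of `term`} across archived captures.

    `captures` is an iterable of (year, text) pairs; this structure is an
    assumption made purely for illustration.
    """
    hits = Counter()
    totals = Counter()
    for year, text in captures:
        # Crude tokenisation: lower-case alphabetic words only.
        words = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(words)
        hits[year] += words.count(term.lower())
    # Skip years with no words to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Invented sample text standing in for content extracted from archived pages.
captures = [
    (2005, "election manifesto pledges on tax and the election campaign"),
    (2009, "public arts project on the fourth plinth"),
]
trend = term_frequency_by_year(captures, "election")
```

Normalising by the number of words captured in each year matters here, since an archive that grows over time would otherwise show every term becoming more frequent.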

Of course the characteristics of a web archive inevitably impact on how viable these different scenarios may be. For example, a selective web archive with limited scope but rich resource description will support research differently to a broad domain or international archive, with minimal accompanying metadata. The age of the web archive may be another factor. These factors must be recognised when developing tools and functionality.

Increasing usage and responding to researcher needs is an important element of our growth strategy for the UK Web Archive over the next five years. If you use the web archive for research and/or have ideas about tools or functionality to support specific types of research, we’d really like to hear from you. You can get in touch with us either by email, on Twitter, or by leaving a comment below.


Contact Details

Maureen Pennock
Web Archive Engagement & Liaison Manager
The British Library (Yorkshire)

Email: maureen.pennock@bl.uk
Twitter: @mopennock

Posted in Guest-post, preservation | 2 Comments »

Case Study: Opening Access to a Closed and Unused Mailing List

Posted by Brian Kelly on 31 Aug 2011

The Value of List Archives

A recent post on Policies on Unused JISCMail Lists highlighted the potential value of JISCMail lists which are no longer active but which host content which may provide historical insights into digital library developments. As is suggested by the recent JISC ITT for an Analysis of the Value and Benefits of Text Mining and Text Analytics in UK FE and HE, data mining tools have developed in sophistication since the JISCMail service was launched in 2000. It may well now be timely to perform data mining work on our email archives – particularly as recent email messages have been sent to owners of unused lists inviting them to delete archives without mentioning the implications of such actions. The current importance of JISCMail as an archive rather than as a communications tool is also suggested by the JISCMail statistics, which show that the majority of lists (5,840) have no recent posts (and this number is steadily increasing), 1,583 have between 1 and 10 posts, 707 have between 11 and 100 and only 114 have over 100 posts.

However, as was discussed in the comments on the recent post, it is unclear whether closed lists which are no longer in use can be made open. Does the 30 year rule which, according to Wikipedia, states that “Public records ….other than those to which members of the public have had access before their transfer …., shall not be available for public inspection until they have been in existence for [thirty] years or such other period….as the Lord Chancellor may,…. for the time being prescribe as respects any particular class of public records” apply to JISCMail lists? But as Chris Rusbridge has pointed out Section 7 of the JISCMail Acceptable Use Policy states that:

Messages sent to a JISCMail list will normally be archived, and these archives can then be retrieved by any member of that same list. These archives may also (at the discretion of the listowner) be made publicly available on the web, and thus be available to anyone. … 

Archives or collections of the messages sent to a JISCMail list may not be made publicly available at another site unless the listowner has granted explicit permission, and the list members have been informed.

It would therefore appear that as listowner I can make the LIS-ELIB-MANAGERS archive available (and also make the archive available elsewhere) provided I inform the list members. However, although the FAQ suggests that the decision on opening access to a closed list resides with the list owner, the list owner will need to decide whether it is appropriate for a list to be made open. Clearly there may be lists which contain confidential, sensitive, embarrassing or even potentially illegal content which should not be made available. In addition, as described in a JISCMail page on Copyright:

 When you send a message to a JISCMail list, you retain your copyright in that message. You also retain your moral right to be identified as the author of the work, and your moral right against derogatory treatment.

The extent to which your message is made available across the internet will depend on the level of access that has been decided by the listowner.

What processes should be taken to decide whether or not to open up a closed list archive?  This post describes the processes which are being taken for the LIS-ELIB-MANAGERS archive.

Processes For Informing List Members

Auditing The List

The LIS-ELIB-MANAGERS list currently has 26 members. A message was sent to the list in order to see how many of the email addresses were still valid. There were 11 bounced messages but only four people replied to a request to respond to a message sent to the list. It does not seem to be possible to find out how many people in total have subscribed to a list. For data protection reasons when users leave JISCMail, their name and email address are removed from the JISCMail database. However the ownership of email messages relates to list members who have posted to a list and not to those who have only lurked on a list.   It therefore would seem feasible to explore information about the numbers of people who have posted to the list and the number of messages they have posted.

Unfortunately there doesn’t seem to be an easy way of getting reports on the numbers of people who have posted to a JISCMail list or the number of messages they have posted. I therefore used the advanced search function to search for the numbers of messages posted for each year.

Year  Nos. of posts
1996  147
1997  113
1998  209
1999  46
2000  29
2001  1
2002  0
2003  0
2004  0
2005  0
2006  0
2007  0
2008  0
2009  0
2010  0
2011  1
TOTAL 546

It would be useful if information on the numbers of messages posted by list members could be obtained. However this does not seem to be provided. I therefore looked at the list archives in order to find the email addresses of people who had posted to the list and searched for each address in order to see the total number of messages posted from it. I did this for 30 users, including those whose names were familiar to me and who I felt were likely to have posted significant numbers of messages to the list. The details are given below.

In addition I skimmed through some of the messages in order to gain a feel for the issues being discussed and to see if there appeared to be any sensitive topics being discussed or flame wars breaking out. As can be seen from the list of subjects which are illustrated, there don’t appear to be any sensitive issues being routinely discussed. However subsequently I came across one post which contained personal information about a member of the community which I feel should be deleted if the list archives are to be made open.

Name Nos. of posts
1 Chris Rusbridge 119 [118+1]
2 Elizabeth Graham 50
3 John Kirriemuir 29 [16 (UKOLN), 6 (ILRT) + 7 (OMNI)]
4 Kelly Russell 24 [13 + 8 + 3]
5 Rosemary Russell 14
6 Janine Packard 13
7 Catherine Edwards 13
8 John Paschoud 12
9 Verity Brack 11
10 Brian Kelly 11 [10+1]
11 Astrid Wissenburg 8
12 Tony Gill 8
13 Stephen Smith 8
14 Jill Foster 7
15 Philip Hunter 6
16 Dee Wood 6
17 Lorcan Dempsey 5 [4+1]
18 Hugh Brailsford 5
19 Kay Flatten 5
20 Bruce Royan 5
21 Roddy Macleod 4
22 Ioanna Dandolos 4
23 Stephen Pinfield 4
24 John Kelleher 4
25 Liora Rolfe Stubbs 3
26 Tom Wilson 3
27 Nicky Ferguson 3
28 Isabel Stark 3
29 Hazel Gott 2
30 Anne Ramsden 2
TOTAL 391

In the above table it should be noted that one person (John Kirriemuir) posted from three different organisational addresses. In addition four others posted from different variants of the same email address (e.g. foo@ukoln.ac.uk and foo@ukoln.bath.ac.uk).

This table does not include everyone who has posted, and may not include everyone who posted significant numbers of messages, since 155 messages (the 546 posted in total less the 391 listed above) are not attributed to a sender. However we seem to have listed the most active participants, including those who worked for the eLib programme team and those who worked at UKOLN, which hosted the eLib programme Web site and was actively involved in the design of the eLib programme. Having skimmed through the list archives, especially for the most active period in 1996-1998, it seems that many of the remaining posts will have been from a long tail of people posting informational messages about their projects, events, publications, etc.
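The tallying described above was done by hand through the JISCMail search interface. If the sender address of each message could be exported, the same audit might be sketched in Python as follows. The `ALIASES` map, the `posts_per_person` function and the sample addresses are hypothetical (only the `foo@ukoln.ac.uk` style of variant comes from the audit itself), and JISCMail is not known to offer such an export.

```python
from collections import Counter

# Hypothetical alias map: variants of one address, or several organisational
# addresses used by the same person, are folded into one canonical name.
ALIASES = {
    "foo@ukoln.ac.uk": "Foo",
    "foo@ukoln.bath.ac.uk": "Foo",
}

def posts_per_person(senders, aliases):
    """Count messages per poster, merging known address variants."""
    counts = Counter()
    for address in senders:
        normalised = address.lower()
        counts[aliases.get(normalised, normalised)] += 1
    return counts

# Figures reported in the audit: posts per year for 1996 to 2011.
posts_by_year = [147, 113, 209, 46, 29, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
attributed = 391                           # total for the 30 posters listed
total_posts = sum(posts_by_year)
unattributed = total_posts - attributed    # messages with no identified sender

# Invented sample of sender addresses, for illustration only.
sample = ["foo@ukoln.ac.uk", "Foo@ukoln.bath.ac.uk", "bar@example.org"]
merged = posts_per_person(sample, ALIASES)
```

Keeping the alias map explicit, rather than guessing that similar addresses belong to one person, leaves a record of which merges were made during the audit.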

Policy and Processes for Changing Access to the List Archives

Following this audit I have been in touch with Chris Rusbridge and Lorcan Dempsey in order to solicit feedback on the following proposed policy and implementation processes:

Information on the audit of the lis-elib-managers JISCMail list will be published and promoted to those who were active in the eLib community in order to solicit their views on opening access to the lis-elib-managers-archives.

Current and previous list members will be informed that the list owner and others involved in managing the list when it was being actively used feel that the list was made closed in order to address a particular audience and to minimise distractions.

Posts which are discovered which contain personal information which we feel may be inappropriate to be published openly will be deleted.

Individuals who have posted to the list who may have concerns regarding issues related to confidentiality, legality and related issues for their posts can request further information about their posts.

If there are no specific concerns raised after a period of a month the list archives will be made open. This policy on openness will allow the archives to be published elsewhere, such as on the Markmail.org service. If concerns are raised these will be discussed by Brian Kelly, Chris Rusbridge and Lorcan Dempsey.

Rachel Bruce and Neil Grindley from the JISC, who will have interests in preservation policies, will be informed of the proposed change in status of the list and the processes which have been used prior to this change.

In brief the process for opening up access to the mailing list archive which may be applicable for other lists consists of:

  • Auditing the archive in order to identify the numbers of people who have posted messages and the numbers of messages that have been posted.
  • Identifying the reasons why the list was set up as a closed list.
  • Gaining an understanding of possible risks in opening up access to list archives.
  • Formulating a policy decision with key stakeholders.
  • Communicating the policy and gathering feedback.
  • Analysing the feedback and reviewing any changes to the proposed policy.
  • Implementing the policy.

Conclusions

This post began by describing the value which text mining tools can potentially provide by exploring the contents of email archives. It is important to note that such text mining need not be carried out by the organisation hosting the archives; indeed there may be advantages in allowing an email distribution service to focus on the challenges in delivering large volumes of email for the higher education sector and allowing other organisations with expertise in data mining to provide this service.

The proposed changes to the policy will also allow the content to be reused elsewhere, such as the Markmail.org service.  As can be seen, this contains a large amount of content about JISC (1,235 messages from the 8,371 lists it currently indexes).  However this does not include lists which are hosted by JISCMail, due to the JISCMail policy which prohibits archives from being hosted elsewhere, without the permission of the list owner.  I hope that this post has outlined one way in which a closed list can be made open and such openness exploited by enabling a service which can demonstrably add value to be allowed to make use of the valuable archives provided by the JISCMail service.

Is this an appropriate approach?  I’d welcome your feedback.


Posted in preservation | 4 Comments »

Policies on Unused JISCMail Lists

Posted by Brian Kelly on 17 Aug 2011

Last week I received an email from JISCMail which invited me to state whether an unused mailing list should be retained or deleted:

Lists: LIS-ELIB-MANAGERS

Your JISCMail list(s) have not been used for over 3 years. Please email to helpline@jiscmail.ac.uk to confirm whether the list should now be deleted or retained. If you choose deletion, let us know if you would like a zipped copy of the archives for your records.

Back in January 2010 I wrote a post on Decommissioning / Mothballing Mailing Lists in which I discussed policies and processes for decommissioning and mothballing lists:

How should a list owner go about deleting unused lists? And aren’t there dangers that deleting the contents of lists which may have been used to influence the research process or provide possibly valuable historical insights on the content area covered by the list would be regarded as a mistake by future generations?

Following the subsequent discussions I decided on the policy for unused lists which I owned: I disabled postings to the lists and updated the list description accordingly. For example the DNER-TECH list now states:

List to discuss technical issues relating to the establishment of the Distributed National Electronic Resource. These issues should particularly relate to inter-operability matters. Other topics may be introduced later. THIS LIST IS NOW CLOSED.

I have decided not to delete the unused lists as the lists I own tend to have been used to discuss various aspects of early developments of digital library initiatives in the sector and I feel that the issues which were discussed could provide information which may have some value from an historical perspective.  For example ten years ago on the DNER-TECH list there were discussions of “issues related to deploying the Bath Profile, the emerging proposals for ‘Z39.50 Next Generation’ (ZNG), and presentations by a number of UK-based projects with significant experience of deploying Z39.50 applications in a number of domains“. This message can therefore provide evidence of the interest in Z39.50 at that time.

You could, of course, manage the content by requesting a zipped copy of the archive (although note that the Web page on deleting a list, somewhat confusingly called Deleting a Group, does not provide any further information, including details of the contents held in a zipped archive – will, for example, this include details of the members of the list?). But this would mean that the resource is deleted from its original location, which will make it more difficult for other interested parties to find this information. To be honest I can’t see the point of requesting a zipped copy for most open lists, especially since the existing JISCMail archive provides a rich archive which may be of value and provides an interface (using JISCMail commands) which potentially could support data mining of these resources. However the position is less clear for closed lists, such as the LIS-ELIB-MANAGERS list which I own, since it would probably be inappropriate to retrospectively provide open access to such archives (will there be a 30 year limit, I wonder, before the general public can see what Chris Rusbridge, Lorcan Dempsey and eLib project managers were discussing on this list?!).

On further reflection it does seem to me that JISC-funded projects should probably have a policy on the management of legacy lists related to the project work. There is, for example, a requirement for Web sites to be maintained for at least three years after the funding has ceased. What should the policy be on mailing lists? And what practices should be implemented once a list archive is felt to be no longer of interest? I would welcome comments from other list owners on how they are managing any unused lists they own.

Posted in preservation | 9 Comments »

Blog Preservation and Plugins

Posted by Brian Kelly on 18 Jul 2011

Best Practices for Blog Preservation

A paper entitled “Moving From Personal to Organisational Use of the Social Web” described best practices for exploiting services such as blogs which were hosted in the Cloud. The paper further developed guidelines initially outlined in  a paper on “Approaches To Archiving Professional Blogs Hosted In The Cloud” including advice on managing the closure of a blog:

Monitoring of technologies used: Information on the technologies used to provide the blog including blog plugins, configuration options, themes, etc. can be useful if a blog environment has to be recreated.

It should be noted that advice on managing blogs hosted in the Cloud might also need to be applied to blogs hosted within the institution. As an example we implemented the above recommendation for the IWMW 2010 blog. The final post on the blog was entitled Closing the 2010 blog. In this post we documented how the blog was used (numbers of posts; numbers of contributors; etc.) and the technologies used (the theme used and details of WordPress plugins which had been installed). A year later we discovered how useful it was to have provided documentation on the plugins used in the blog.

The blog environment we used to host the IWMW 2010 blog a year ago had to be upgraded. We used this as an opportunity to provide a more robust environment for additional blogs to support IWMW events, including the IWMW 2011 event.

However after the upgrade we discovered that the WordPress plugins had reverted to the defaults, with additional content which had been embedded in the blog, including the video interviews which had been published on the blog, missing from the posts. I now recall that this isn’t the first time this has happened – following a WordPress upgrade on the JISC Inform platform the plugins and the theme used on the JISC PoWR were lost and the environment had to be recreated from memory.

However the final post published last year provided the following record of the plugins which had been installed:

Details of plugins used: Akismet, Buddy Press, BP Disable Activation, Google Analyticator, Lifestream, Lux Vimeo

We subsequently re-installed the Lux Vimeo plugin – but found that the videos failed to re-appear. It seems that loss of the plugin also resulted in losing the embed code, which included the address of the videos.

Fortunately each of the posts also included a direct link to the resource on Vimeo (as illustrated in the screen shot which shows a blog post for which the video has been embedded and one for which it is still missing).  We were therefore able to re-establish the embedded video – although we decided to do this using the Embed Object plugin since this seems to provide richer functionality (and we updated the final post so that we have documented these changes).

The need to include links to remote content in addition to embedding such content was described in a post which advised Don’t Just Embed Objects; Add Links To Source Too! In this case the advice was provided in order to enhance access to content on mobile devices, in cases in which Flash-based embedding technologies were not supported. We have now discovered another reason for providing such links – embedding addresses into plugins may result in the address being lost if the plugin becomes unavailable.

Best Practices for Live Blogs

The advice we had developed for those who make use of blogs stated that when archiving a blog:

Monitoring of technologies used: Information on the technologies used to provide the blog, including blog plugins, configuration options, themes, etc. can be useful if a blog environment has to be recreated.

It seems that such advice should also be followed when blogs which will continue to be provided are hosted on a blog platform which may be upgraded. And since all blog platforms are liable to be upgraded the advice provided for blog preservation purposes would appear to be applicable more generally. We are therefore applying this advice for the IWMW 2011 blog and the About page for this blog has also been updated accordingly.
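A record of the blog environment need not be compiled by hand. As a minimal sketch, assuming a self-hosted WordPress installation whose plugins each sit in their own directory under `wp-content/plugins`, a small Python script could write a dated manifest which can be kept alongside the blog. The `record_blog_environment` function, the manifest layout and the directory names used in the example are hypothetical, and the sketch records names only, not versions or configuration options.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

def record_blog_environment(wp_root, theme, out_file):
    """Write a dated manifest of installed plugin directories and the theme.

    Assumes plugins live under <wp_root>/wp-content/plugins, one directory
    per plugin; single-file plugins are ignored in this sketch.
    """
    plugin_dir = Path(wp_root) / "wp-content" / "plugins"
    plugins = sorted(p.name for p in plugin_dir.iterdir() if p.is_dir())
    manifest = {
        "recorded": date.today().isoformat(),
        "theme": theme,          # supplied by the caller, not detected
        "plugins": plugins,
    }
    Path(out_file).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest

# Usage against a throwaway directory standing in for a real installation;
# the plugin directory names are illustrative.
with tempfile.TemporaryDirectory() as root:
    for name in ("akismet", "lifestream"):
        (Path(root) / "wp-content" / "plugins" / name).mkdir(parents=True)
    out_path = Path(root) / "blog-environment.json"
    manifest = record_blog_environment(root, "example-theme", out_path)
    saved = json.loads(out_path.read_text(encoding="utf-8"))
```

As the experience with the video embeds shows, a manifest of this kind records which plugins were present but not the content they held, so direct links to embedded resources are still needed in the posts themselves.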

Posted in preservation | Leave a Comment »

Archiving Blogs and Machine Readable Licence Conditions

Posted by Brian Kelly on 21 Apr 2011

Clarifying Licence Conditions When Archiving Blogs

UKOLN’s Cultural Heritage blog has recently been frozen following the cessation of funding from the MLA (a government body which is due to be shut down shortly).

As part of the closure process for our blog we have provided a Status of the Blog page which summarises the reasons for the closure, provides a  history of the blog, outlines various statistics about the blog and provides some reflections of the effectiveness of the blog.

Another important aspect of the closure of a blog should be the clarification of the rights of the blog posts. This could be important if the blog contents were to be reused by others – which could, for example, include archiving by other agencies.

As shown, a human readable summary was included in the sidebar of the blog which states that the content of the blog is provided under a Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England & Wales License.

The sidebar also defined the scope of this licence which covered the textual content of blog posts and comments which were submitted to the blog.  It was pointed out that other embedded objects, such as images, video clips, slideshows, etc, may have other licence conditions.

However automated tools will not be able to understand the licence conditions.  What is needed is a definition of the licence in a format suitable for automated reading. This has been implemented using a simple use of RDFa which is included in the sidebar description.  The HTML fragment used is shown below:

<img alt="Creative Commons License" src="http://i.creativecommons.org/l/by-nc-sa/2.0/uk/88x31.png" /> This blog is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/2.0/uk/" rel="license">Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England &amp; Wales License</a>.

How might software process such information? One example is the OpenAttribute plugin which is available for the FireFox, Chrome and Opera browsers. This is described as a “suite of tools that makes it ridiculously simple for anyone to copy and paste the correct attribution for any CC licensed work“. Use of the OpenAttribute plugin on the Cultural Heritage blog is illustrated below.
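As a rough indication of what such software has to do, the following Python sketch uses the standard library HTML parser to pull out any link marked with `rel="license"`. It is fed the sidebar fragment, written with plain straight quotes. The `LicenceLinkParser` class is a hypothetical illustration and is not how OpenAttribute itself is implemented.

```python
from html.parser import HTMLParser

class LicenceLinkParser(HTMLParser):
    """Collect href values of <a> or <link> elements marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licences = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel may hold several space-separated values, so split before testing.
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag in ("a", "link") and "license" in rel_values and attributes.get("href"):
            self.licences.append(attributes["href"])

fragment = (
    '<img alt="Creative Commons License" '
    'src="http://i.creativecommons.org/l/by-nc-sa/2.0/uk/88x31.png" /> '
    'This blog is licensed under a '
    '<a href="http://creativecommons.org/licenses/by-nc-sa/2.0/uk/" rel="license">'
    'Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: '
    'England &amp; Wales License</a>.'
)

parser = LicenceLinkParser()
parser.feed(fragment)
licences = parser.licences
```

A harvester working this way learns only that a licence applies somewhere on the page. It cannot tell from this markup alone which parts of the page the licence covers.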

Assigning Multiple Licences To Embedded Objects in Blogs

The image above shows the licence for the blog in its entirety. However the blog is a complex container of a variety of objects (blog posts from multiple authors; comments from readers and embedded images and other objects from multiple sources) and each of these embedded objects may have its own set of licence conditions.

How might one specify the licence conditions of such embedded objects? In the case of the Cultural Heritage blog there was a statement that any comments added to the blog would be published under a Creative Commons licence, so although anybody making a comment did not have to formally accept this licence condition, in practice we can demonstrate that we took reasonable measures to ensure that the licence conditions were made clear.

In order to specify the licence conditions for embedded images we initially looked at the Image Licenser WordPress plugin.   However this provides a mechanism for assigning licence conditions as images are embedded within a post, which are then made available as RDFa.  Since in our case we were looking at retrospectively assigning licence conditions to existing images (in total 151 items) it was not realistic to use this tool.

The Creative Commons Media Tagger provides the ability to “tag media in the media library as having a Creative Commons (CC) license“. But what licence should be assigned to images on the blog? These include screen images and photographs which may have been included by guest bloggers but which have not been explicitly assigned a Creative Commons licence. The question of Who owns the copyright to a screen grab of a website? was asked recently on ecademy.com with a lack of consensus and a patent and trade mark attorney providing the less than helpful suggestion that “It is better to include a link to the original work if it is on the Web rather than to copy it.” The uncertainties regarding ownership of screen shots are echoed in a Wikipedia article which states:

Some companies believe the use of screenshots is an infringement of copyright on their program, as it is a derivative work of the widgets and other art created for the software. Regardless of copyright, screenshots may still be legally used under the principle of fair use in the U.S. or fair dealing and similar laws in other countries.

In light of such confusions there is a question as to what licence, if any, should be assigned to images in the blog. As described in the Creative Commons Media Tagger FAQ it is possible to run the plugin in batch mode to “tag media that was already in your media library prior to installing and activating CC-Tagger“. It occurred to me that it would be best to assign a non-CC licence by default to all images and then to manually assign an appropriate CC licence to images such as those taken from Flickr Commons in a post entitled “Around the World in 80 Gigabytes“. However using the batch mode of the tool appeared not to change the content – and it is unclear to me whether there is a way of providing a machine-readable statement in RDFa stating that a resource is not available with a Creative Commons licence.

Using the Image Licenser tool on an individual image resulted in the following HTML fragment which illustrates how a machine readable statement of the licence conditions can be applied to an individual object:

<img class="size-medium wp-image-2206" title="Flickr Commons" src="http://blogs.ukoln.ac.uk/cultural-heritage/files/2011/02/flickr-commons-300x205.jpg" alt="image of flickr commons home page" width="300" height="205" />

Discussion

Whilst finalising this post I asked on Twitter “Is it possible to use RDFa to provide a machine-readable statement that an image *doesn’t* have a CC licence? …” and followed this by describing the context: “.. i.e. have a blog post with CC licence for content but want to clarify lience for embedded objects. #creativecommons“. Subsequent comments from @patlockley and @jottevanger helped to identify areas for further work which I hadn’t considered – I have kept an archive of the discussion to ensure that I don’t forget the points which were made. A summary of my thoughts is given below:

Purpose: Why should one be interested in ways in which the licence conditions of objects embedded in blog posts can be specified? My interest relates to archiving policies and processes for blogs. For example if an archiving service chooses to archive only blogs for which an explicit licence is available there will be a need to ensure that such licences are provided in a machine-readable format in order to allow for automated harvesting. There will also be a need to understand the scope of such licences. In addition to my interests, those involved in the provision of or reuse of OER resources will have similar interests for reusing blog posts if these are treated as OER resources. Finally, as @jottevanger pointed out, this discussion is also relevant more widely, with Jeremy’s interests focussing on complex Web resources containing digitised museum objects.

Granularity: What level of granularity should be applied – or perhaps this might be better phrased as: at what level of granularity is it feasible to apply machine readable licence conditions for complex objects? Should this be at the collection level (the blog), the item level (the blog post) or for each component of the object (each individual embedded image)?

Risks: Should one take a risk averse approach, avoiding use of a Creative Commons licence at the collection level (since it may be difficult to ensure that each individual item has an appropriate Creative Commons licence)? Or should one state that by default items in the collection are normally available under a Creative Commons licence, but there may be exceptions?

Viewing tools: What tools are available for processing machine understandable licence conditions? What are the requirements for such tools?

Creation tools: What tools are available for assigning machine understandable licence conditions? What level of granularity should they provide? What default values can be applied?
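One way of thinking about the granularity and risk questions is as a default licence set on the collection with explicit exceptions on individual objects. The following Python sketch illustrates that model only. The `LicensedObject` class, the `effective_licences` function and the licence labels are hypothetical, and the sketch does not address how such exceptions would be expressed in RDFa.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

@dataclass
class LicensedObject:
    """A blog, post, comment or embedded object with an optional licence."""
    name: str
    licence: Optional[str] = None      # None means: inherit from the container
    children: List["LicensedObject"] = field(default_factory=list)

def effective_licences(
    obj: LicensedObject, inherited: Optional[str] = None
) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (name, licence) pairs; an object's own licence overrides the
    licence inherited from its container."""
    licence = obj.licence if obj.licence is not None else inherited
    yield obj.name, licence
    for child in obj.children:
        yield from effective_licences(child, licence)

# Collection-level default with one exception at the component level.
blog = LicensedObject("blog", "CC BY-NC-SA 2.0 UK", [
    LicensedObject("post", children=[
        LicensedObject("screenshot", "licence not stated"),
        LicensedObject("comment"),
    ]),
])
resolved = dict(effective_licences(blog))
```

Under this model the risk lies in objects which silently inherit the collection licence when they should not, which is why the screenshot in the example carries an explicit statement rather than being left blank.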

I know that in the OER community there are interests in these issues.  I would be interested to hear how such issues are being addressed and details of tools which may already exist – especially tools which can be used with blogs.

Posted in openness, preservation | Leave a Comment »

A Few Days Left to Download a Structured Archive of Tweets

Posted by Brian Kelly on 17 Mar 2011

On 21 February 2011 John O’Brien, developer of the Twapper Keeper Twitter archiving service, announced the “Removal of Export and Download / API Capabilities“. In a subsequent video interview John explained the reasons for the removal of this service, which arose following Twitter’s announcement that it was enforcing its policy that third party services are not allowed to syndicate or redistribute tweets. Following Twitter’s ‘cease and desist’ email the removal of Twapper Keeper’s export capabilities and APIs will take place on 20 March – a few days’ time.

It is clear that the popularity of the Twapper Keeper service (which has a total of 2,410,061,623 tweets across 21,475 archives) has demonstrated a clear need for Twitter archiving – and it seems that Twitter wishes to be able to commercially exploit such popularity. I would guess that other services, such as Martin Hawksey’s iTitle Twitter captioning service is another example of an innovative approach which Twitter will be seeking to exploit commercially.

Last year’s JISC-funded developments to the Twapper Keeper service included making the software available under a Creative Commons licence. If you visit the Your.TwapperKeeper.com site you will be able to download the software which can be run on your own server. Clearly you would not be able to simply replicate a public Twapper Keeper service, but if Twitter’s terms and conditions are aimed at stopping public redistribution of tweets it would appear possible to install the software on an institutional Intranet – although I should admit that IANAL.

It should be pointed out that the Twapper Keeper service will continue to archive tweets which can be accessed via the HTML interface – what is being lost is API access and the ability to download a structured archive of tweets in, for example, MS Excel format with columns of the tweets, Twitter userid, date and time information, geo-location information, etc. Such structured information is, as Twitter is very aware, valuable for developers who wish to carry out richer data analysis or provide additional value-added services on top of the conventional Web-based display of tweets.
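
To illustrate why a structured export matters, here is a minimal Python sketch of the kind of analysis it enables. The column names (text, from_user, created_at, geo) and the sample tweets are assumptions made for illustration; they are not the actual Twapper Keeper export layout.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; the real export layout is not reproduced here.
FIELDS = ["text", "from_user", "created_at", "geo"]

# Invented sample data for the sketch.
SAMPLE_TWEETS = [
    {"text": "Looking forward to the seminar #udgamp10", "from_user": "alice",
     "created_at": "2010-09-01T09:00:00", "geo": ""},
    {"text": "Slides are now online #udgamp10", "from_user": "bob",
     "created_at": "2010-09-01T09:30:00", "geo": "41.98,2.82"},
    {"text": "Good questions from the audience #udgamp10", "from_user": "alice",
     "created_at": "2010-09-01T10:15:00", "geo": ""},
]


def write_archive(tweets, handle):
    """Write tweets as CSV with one column per field."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(tweets)


def tweets_per_user(handle):
    """Count tweets per Twitter user from a CSV archive."""
    reader = csv.DictReader(handle)
    return Counter(row["from_user"] for row in reader)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_archive(SAMPLE_TWEETS, buffer)
    buffer.seek(0)
    for user, count in tweets_per_user(buffer).most_common():
        print(user, count)
```

The same counts would be tedious to extract from an HTML-only display of the tweets, which is the practical difference between the two forms of access.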

It is still possible for a few days to download such structured archives from Twapper Keeper. I have recently looked at the details of my TwapperKeeper archives. I have decided to keep a local archive of tweets associated with a number of talks I have given. However I don’t intend to keep structured archives which are primarily of interest to event organisers (such as the ALT-C, JISC and CETIS conferences). I have also decided to keep a record in the list below of the decisions I have made. Note that an example of a local archive can be seen for the seminar I gave last year at the University of Girona.

Archive Type | Name | Description | Policy | # of Tweets | Create Date
#Hashtag | #a11y | Accessibility (a11y) | Archive not kept as this subject based archive is not directly related to my key areas of work. | 42427 | 04-25-10
#Hashtag | #accbc | CETIS/BSI Accessibility SIG meeting. | Local archive not kept as I was a speaker at this recent event. | 154 | 02-28-11
#Hashtag | #altc2009 | The ALTC 2009 conference | Archive not kept as this event-based archive will primarily be relevant to the event organisers. | 4737 | 08-28-09
#Hashtag | #altmetrics | New approaches for developing metrics for scholarly research | Archive not kept as this subject-based archive will primarily be relevant to others with an interest in the subject area. | 158 | 01-15-11
#Hashtag | #Ariadne | The Ariadne hashtag – which may be used for UKOLN’s Ariadne ejournal. | Archive not kept as this subject-based archive will primarily be about topics other than UKOLN’s Ariadne ejournal. | 11897 | 09-21-10
Keyword | Ariadne | Archive of tweets containing the string ‘Ariadne’ | Archive not kept as this subject-based archive will primarily be about topics other than UKOLN’s Ariadne ejournal. | 25598 | 09-21-10
@Person | ariadne_ukoln | Tweets about the Ariadne web magazine. | Local archive kept. | 882 | 05-28-10
@Person | briankelly | Tweets about Brian Kelly | Personal archive kept. | 6471 | 03-19-10
#Hashtag | #CETIS | The CETIS service, based at the University of Bolton. | Archive not kept as this organisational archive will primarily be of relevance to the host institution. | 2836 | 09-24-10
#Hashtag | #CILIP | CILIP, the Chartered Institute of Library and Information Professionals. | Archive not kept as this organisational archive will primarily be of relevance to the host institution. | 4494 | 09-24-10
#Hashtag | #CILIP1 | Campaign on future of CILIP organisation based on CILIP’s 1-minute messages. | Archive not kept as this campaign-based archive will primarily be of relevance to the host institution. | 357 | 06-13-10
#Hashtag | #CSR | Comprehensive Spending Review | Archive not kept as this subject archive will primarily be of relevance to others. | 79799 | 10-15-10
#Hashtag | #falt09 | ALTC Fringe | Archive not kept as this event-based archive will primarily be of relevance to others. | 219 | 08-28-09
#Hashtag | #heweb10 | Tag for the HigherEdWeb 2010 conference | Archive not kept as this event-based archive will primarily be of relevance to others. | 8723 | 09-28-10
#Hashtag | #ipres10 | Tweets for the iPres10 conference, Vienna, 19-24 Sept 2010. | Archive not kept as this event-based archive will primarily be of relevance to others. | 2 | 08-27-10
#Hashtag | #ipres2010 | Archive for the IPres 2010 conference to be held in Vienna on 19-25 Sept 2010. | Archive not kept as this event-based archive will primarily be of relevance to others. | 1397 | 08-27-10
@Person | iwmwlive | IWMW live blogging account | Local archive kept. | 1373 | 04-30-10
#Hashtag | #jisc10 | JISC 2010 conference | Archive not kept as this event-based archive will primarily be of relevance to others. | 2059 | 04-02-10
#Hashtag | #jiscpowr | Archive of tweets related to the JISC PoWR project provided by UKOLN and ULCC | Archive not kept due to low numbers of tweets. | 6 | 07-09-10
#Hashtag | #jiscpowrguide | Archive of tweets about the Guide to Web Preservation published by the JISC-funded PoWR project and launched on 12 July 2010. | Archive not kept due to low numbers of tweets. | 2 | 07-09-10
#Hashtag | #ldow2010 | Linked Data on the Web 2010 conference | Archive not kept as this event-based archive will primarily be of relevance to others. | 524 | 04-25-10
#Hashtag | #loveHE | Times Higher Education campaign to support Higher Education in UK. | Archive not kept as this campaign-based archive will primarily be of relevance to others. | 12066 | 06-12-10
#Hashtag | #mdforum | UKOLN’s Metadata Forum | Local archive planned. | 119 | 12-10-10
#Hashtag | #morris | Tweets about Morris dancing | Archive not kept as this social archive will primarily be of relevance to others. | 17813 | 10-16-10
#Hashtag | #oxsmc09 | socialmediaconference | Archive not kept as this event-based archive will primarily be of relevance to others. | 1063 | 09-18-09
#Hashtag | #PhD | Tweets for researchers using the #PhD tag | Archive not kept as this subject-based archive will primarily be of relevance to others. | 28527 | 09-24-10
#Hashtag | #s113 | Workshop session at ALTC 2009. | Local archive kept (will be edited to remove irrelevant tweets posted after event had taken place). | 227 | 09-03-09
#Hashtag | #scl2010 | Scholarly Communication Landscape (SCL): Opportunities and challenges symposium, held at Manchester Conference Centre on 30 November 2010. | Archive not kept as this event-based archive will primarily be of relevance to others. | 39 | 12-02-10
#Hashtag | #ucassm | Social Media Marketing Conference organised by UCAS. | Archive not kept as this event-based archive will primarily be of relevance to others. | 223 | 10-18-10
#Hashtag | #udgamp10 | What Can We Learn From Amplified Events seminar, given by Brian Kelly, UKOLN at the University of Girona. Local archive available | Local archive kept. | 395 | 09-01-10
#Hashtag | #ukmw09 | UKMuseumsandtheWeb | Archive not kept as this event-based archive will primarily be of relevance to others. | 750 | 12-05-09
Keyword | ukoln | Tweets about UKOLN | Local archive kept. | 1948 | 03-19-10
#Hashtag | #ukolneim | UKOLN’s Evidence, Impact, Metric work | Archive not kept due to low numbers of tweets. | 45 | 11-05-10
#Hashtag | #w3ctrack | W3C Track at WWW 2010 conference | Archive not kept as this event-based archive will primarily be of relevance to others. | 179 | 04-30-10
#Hashtag | #ww2010 | Misspelling of WWW2010 hashtag | Archive not kept as this event-based archive will primarily be of relevance to others. | 833 | 04-29-10

It should be noted that this list is based on Twapper Keeper archives which I created. There will be a number of other archives which will be of interest to myself and colleagues at UKOLN which may also be archived locally.

Also note that a number of event-based Twitter archives (such as the #s113 archive of a workshop session at the ALT-C 2009 conference) will contain irrelevant tweets due to the hashtag being used for other purposes. Such irrelevant tweets may be deleted from the archive.

Posted in preservation, Twitter | 2 Comments »

Time to Move to GMail?

Posted by Brian Kelly on 2 Mar 2011

The University of Bath email service is still down. The problems were first announced on Twitter at 06.02 on 24 February:

The University email is currently running at risk of failure we are working towards a fix – sorry for any disruption caused.

Later that day we heard:

University email will be unavailable for the rest of the day -for alternative use University Instant Messenger Jabber: http://bit.ly/fAshWi

The problems continued the following day and so BUCS (the Bath University Computing Service) announced an interim email service: I can now send and receive email but can’t access any email messages which I received prior to 25 February. I must admit that this provides a strange feeling of bliss (my email folder is almost empty!), but I know that the actions which I’m now running behind on will come back to haunt me when the full email service is restored.

Of course communications have continued, particularly on Twitter. I’m pleased, incidentally, that BUCS have been using Twitter as a communications channel to keep their users informed of developments.  It has also occurred to me how I am still able to continue working using Twitter to support my professional activities: how, I wonder, are others at the University of Bath who don’t use Twitter coping?

During this outage, whilst away in London, I suggested that use of Google’s GMail service might be appropriate.  In response I received the ironical reply:

Gmail never breaks. Oh. Wait. http://www.pocket-lint.com/news/38815/gmail-reset-deletes-correspondence-history :)

It seems that on the day Bath University email users were suffering as a consequence of hardware problems on its email servers Gmail was also having problems. As the PocketLint article rather dramatically announced:

Oh dear – looks like Google has dropped the bomb on hundreds of thousands of Gmail accounts, wiping out years of email and chat history.

You can’t trust GMail to provide a reliable email service seemed to be the sub-text of other Twitter followers who responded to my initial tweet. But is that really the case? I have described the continuing problems with the BUCS email service (which are summarised in a BUCS FAQ). But what is the current status of GMail?

Whilst Computer Weekly has highlighted the problems of use of Web-based email services, CBC News has pointed out that “Gmail messages [are] being restored after bug”. The article described how emails “are being restored to Gmail accounts temporarily emptied out two days ago”. This problem was either small-scale (“About 0.02 per cent of Gmail users had their accounts completely emptied”) or significant (“media outlets estimate there are roughly 190 million Gmail users, so about 38,000 were affected”). The problem, caused by a bug which has now been fixed, did not affect me whereas the BUCS email outage clearly has. Which, I wonder, is the more significant problem?

I have to admit that I have been affected by outages in externally-hosted communications services previously. In September 2009  I wrote a post entitled “Skype, Two Years After Its Nightmare Weekend” which described how “Skype’s popular internet telephone service went down on August 16 [2007] and was unavailable for between two and three days“. This was also due to a software bug (related to MS Windows automated updates) which has been fixed – and I have continued to be a happy Skype user and agree with last year’s Guardian article which described “Why Skype has conquered the world”.

So yes there will be problems with externally-hosted systems, just as there will be problems with in-house systems (and ironically the day before the BUCS email system went down and two days before GMail suffered its problems my desktop PC died and I had to spend half a day setting up a new PC!). It may therefore be desirable to develop plans for coping with such problems – and note that a number of resources which provide advice on backing up GMail have been provided recently, including a Techspot article on “How to Backup your Gmail Account” and a Techland article on “How to backup GMail“.

But in addition to such technical problems there are also policy challenges which need to be considered. At the University of Bath email accounts are deleted when staff and students leave the institution (and for a colleague who retired recently the email account was deleted a day or so before she left). One’s GMail account, on the other hand, won’t be affected by changes in one’s place of study or employment.  In light of likely redundancies due to Government cutbacks isn’t it sensible to consider migration from an institutional email service?  And shouldn’t those who are working or studying for a short period avoid making use of an institutional email account which will have a limited life span?

Posted in General, preservation | 22 Comments »

Link Checking For Old Web Sites

Posted by Brian Kelly on 4 Jan 2011

Web sites rot. Over time they’ll start to break. Not only will increasing numbers of links to external resources start to break but you may also find that the functionality provided within the Web site may start to break. This may be a problem if Web sites are still being used but are no longer maintained. But what should be done?

From 1999-2000 UKOLN was a member of the EU-funded EXPLOIT project and provided the Exploit Interactive Web magazine. This was followed, from 2000-2003, by the Cultivate Interactive Web magazine. Since the funding ceased a link check of the Web sites has been carried out annually with the findings published and summaries of any problems documented. Only internal links are checked and the surveys helped us to identify and fix a number of problems which occurred when the Web site was migrated from a Windows NT service to an Apache server running on a Unix box. We have also observed a small number of broken links to third party Web site usage services, as illustrated below.

Running the annual link check and documenting the findings takes about 10 minutes. The Exploit Interactive and Cultivate Interactive Web sites are technically quite simple, with little integration with third party services. However as Web sites increasingly make use of content and services provided by third parties there are dangers that such dependencies will cause problems. So perhaps auditing of such services, including project Web sites which are no longer being funded, will become increasingly important.
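
An internal-only link check of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the tool used for the annual surveys: the example.org addresses are placeholders, and the status lookup is passed in as a function so the logic can be tried without touching a live server.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen


class HrefParser(HTMLParser):
    """Collect the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(page_url, html_text):
    """Resolve links against the page URL and keep those on the same host."""
    parser = HrefParser()
    parser.feed(html_text)
    host = urlparse(page_url).netloc
    # Drop #fragments so the same page is not checked twice.
    resolved = (urldefrag(urljoin(page_url, href)).url for href in parser.hrefs)
    return sorted({url for url in resolved if urlparse(url).netloc == host})


def http_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or None if the request fails."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code
    except URLError:
        return None


def broken_links(page_url, html_text, fetch_status=http_status):
    """Map each internal link that did not return 200 to the status seen."""
    results = {url: fetch_status(url) for url in internal_links(page_url, html_text)}
    return {url: status for url, status in results.items() if status != 200}
```

Passing a stand-in for `fetch_status` makes it possible to rehearse the check against saved pages before running it against the live site; external links are ignored by design, matching the internal-only survey described above.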

Alternatively you could argue that after a period of time such Web sites should be deleted. We recommended to the EU that project Web sites should be expected to continue to be hosted for at least three years after the funding had expired. We also suggested that this should be a minimum and that organisations should try to continue to host such Web sites for ten years after the funding has finished. Since the final issue of the Exploit Interactive ejournal was published in October 2000 we have achieved that goal. Should we now delete the Web site? Doing so might save ten minutes a year in checking that the Web site is still functioning, but would mean that articles on a number of EU-funded projects would be lost, including the following which were published in the final issue:

  • ELVIL 2000: Ingrid Cartwell and Magnus Enzell introduce the prototype for the ELVIL 2000 Project, an Academic Portal for European Law and Politics.
  • EQUINOX: Following on from an earlier article in Exploit Interactive, Monica Brinkley provides an update on the EQUINOX project, a Library Performance Measurement and Quality Management System.
  • ILSES: Meinhard Moschner and Repke de Vries describe the development of a specialised networked digital library which integrates publication retrieval and survey data extraction.
  • LIBECON 2000: David Fuegi, John Sumsion and Phillip Ramsdale discuss the LIBECON2000 Project and its Millennium Report.
  • TECUP: Paul Greenwood and Martina Lange-Rein on TECUP, a meta project which analyses practical mechanisms for rights acquisition for the distribution, archiving and use of electronic products.
  • VERITY: Alexandra Papazoglou gives a final report on Project Verity: Virtual and Electronic Resources for Information skills Training for Young people.

I can’t help but feel that the Web site should continue to be hosted. But what should the general policy be for project Web sites? What are others doing for project Web sites whose funding may have ceased ten years ago or five years ago or even more recently?

Note: Coincidentally, after publishing this post I received an email containing details of the uptime for the Exploit Interactive and Cultivate Interactive Web sites. I receive an automated email if the Web sites are not available and also receive weekly reports on the server availability, as illustrated below. Another approach to consider for legacy Web sites?
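
A weekly availability summary of the sort described could be computed along these lines. The probe records and the five-minute interval are hypothetical; the monitoring service that sends the reports is not named in this post and its report format is not reproduced.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Probe:
    checked_at: datetime
    ok: bool  # True when the site answered the check


def weekly_report(probes):
    """Summarise probes as an availability percentage plus the failure times."""
    if not probes:
        raise ValueError("no probes recorded")
    failures = [p.checked_at for p in probes if not p.ok]
    availability = 100.0 * (len(probes) - len(failures)) / len(probes)
    return {"availability_percent": round(availability, 2), "failures": failures}


def sample_week(start, failed_indexes, interval_minutes=5):
    """Build one week of hypothetical probes, failing at the given positions."""
    count = 7 * 24 * 60 // interval_minutes
    return [
        Probe(start + timedelta(minutes=i * interval_minutes), i not in failed_indexes)
        for i in range(count)
    ]


if __name__ == "__main__":
    week = sample_week(datetime(2011, 1, 3), failed_indexes={10, 11, 12})
    report = weekly_report(week)
    print(report["availability_percent"], len(report["failures"]))
```

For a legacy site the list of failure times is arguably more useful than the percentage, since it shows whether outages are isolated or the start of a longer decline.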

Posted in preservation | 13 Comments »

“5 Days Left to Choose a New Ning Plan”

Posted by Brian Kelly on 23 Aug 2010

I received an email on 16 August announcing that I had “5 Days Left to Choose a New Ning Plan”. The email related to the announcement Ning made a few months ago that the company was withdrawing its provision of free social networks.

We had made use of Ning to provide the IWMW 2008 social network.   The email informed me that “the network has grown up a bit since you started the ball rolling. You have grown to 90 members who have collectively helped you add unique photos, some interesting videos, and 24 spirited discussions“.

What action, if any, was needed in response to this email? The simple answer would be to suggest that nothing needed to be done as the social network was established simply to support an event which took place 2 years ago – so there’s no point in paying the $19.95 annual subscription for the social network to continue to be hosted. But what if the social network (or indeed any other Cloud Service) hosted useful content which I would not like to lose? So I took the opportunity to evaluate copying the Web site prior to its demise – and I hope that documenting this process will be of interest to others.

The WinHTTrack software was used on Monday 16 August 2010 to create a copy of the IWMW 2008 social network. The mirror is currently hosted on the main IWMW 2008 Web site – although we are making no commitment to hosting the content on a long term basis.

The purpose of the provision of the Ning social network for the event was to provide a communications and collaboration environment for IWMW 2008 delegates and also to gain a better understanding of whether such a service was needed. We discovered that the usage was low, with only 90 registered members out of about 180+ registered delegates and, despite the “spirited discussions” rhetoric in the email from Ning, there was very little use made of the discussion fora on the service.

We kept a record of information provided by the WinHTTrack mirroring software.  Despite the low usage I was surprised to discover that the mirror took 1 hour 42 minutes to run. The mirror is 175 Mb and contains 9,065 files and 282 folders.
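
The file, folder and size figures for a mirror can also be recomputed independently of the mirroring tool. The sketch below is a minimal example, assuming the mirror is an ordinary local directory tree; the function name and the temporary demonstration tree are made up for illustration and are not part of WinHTTrack.

```python
import os
import tempfile


def mirror_summary(root):
    """Count files and folders below root and total the file sizes."""
    files = folders = total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        folders += len(dirnames)
        for name in filenames:
            files += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return {
        "files": files,
        "folders": folders,
        "megabytes": total_bytes / (1024 * 1024),
    }


if __name__ == "__main__":
    # Build a tiny throwaway tree so the sketch can be run anywhere.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "photos"))
        with open(os.path.join(root, "index.html"), "wb") as handle:
            handle.write(b"<html></html>")
        print(mirror_summary(root))
```

Keeping these figures alongside the record of the mirroring run gives a simple way to confirm later that a copy of the mirror is still complete.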

Once the mirror had been created the navigational bars were updated to link to the local resource rather than the Ning social network, and a record of the process was documented. In addition a news item was created on the IWMW 2008 event news feed.

Our intention will be to delete this mirror shortly, as we do not feel it provides any useful content. We will, however, be keeping a record that the Ning social network was used and provide a summary of its usage,  so that, for example, we will have a record of the technologies used to support the various IWMW events.

We’ve also decided to publish this summary so that anyone with an interest in the event’s social network, the tool used to mirror the content or the policy we intend to implement will have the opportunity to give their comments.

This is a summary of how we responded to the announcement of the closure. I wonder what will happen to the 33 Ning social networks I found using a search for ‘JISC’? One, I noticed, described as a “personal portfolio to record and reflect on my work experience”, contains spam for free drugs! There are others, however, which have been used to support the work of the JISC Regional Support Centres (this one, for example), JISC-funded projects (such as this one) and events (such as this example).

The use of such services to support events, in particular, raises some interesting issues. I have previously suggested that “The lesson I’ve learnt – there’s a need to change the settings for social networks set up to support events after the event is over. I still prefer to make it easy to subscribe to such services, however, in order to avoid any delays caused by the need to accept new subscriptions manually”. But as well as tightening up on access after an event is over in order to avoid spam are further measures needed? Should the content be replicated elsewhere? Should the social networking site be closed? Or should we be happy with the default option of simply doing nothing – after all, although the announcement stated that the free service would be withdrawn on 20 August, it is still available today.

Latest News: I have just received an email stating that “we’ve decided to extend the deadline until August 30, 2010.“.

Posted in preservation, Social Networking | Tagged: | 3 Comments »

Decommissioning / Mothballing Mailing Lists

Posted by Brian Kelly on 1 Feb 2010

The Context

In response to my recent post about usage of JISCMail lists Nicole Harris pointed out some evidence of its popularity. It is clear that although in some sectors there may have been a migration to a diversity of communication and collaboration tools, other sectors are still well-served by email lists.  This is particularly true of museums and public libraries, as I know from experience, being a member of the well-used MCG and lis-pub-libs JISCMail lists.

The Evidence

But what should be done for the lists which are no longer being used to any significant extent?  And following Nicole’s links to statistics on the use of JISCMail I was very interested to see the statistics on the numbers of messages on lists.

As can be seen from the accompanying image (taken from the JISC’s Monitoring Unit Web site), the majority of lists appear to have had zero messages posted in the given time period and the number of such lists has been growing. The number of very active lists, with over 100 messages, is, in comparison, tiny. Of course these lists must be very active as the overall amount of traffic on the lists is still growing.

Although these figures are very surprising they do reflect my findings when I looked at the various lists that I was still subscribed to. For example here are two lists which I had forgotten about:

ADVSERV-CANDM (Advisory Services and Comms and Marketing mailing list)
A list for discussion and dissemination between Advisory Services and Communication and Marketing.
Only a handful of posts between July 2004 and November 2005.

DNER-TECH
List to discuss technical issues relating to the establishment of the Distributed National Electronic Resource. These issues should particularly relate to inter-operability matters.
Posts between August 1999 and October 2004.

In addition to these lists which I am still subscribed to I discovered there are a number of lists which I own which I had forgotten about. Here are another two examples:

HELPS (Historic Environment List For Projects and Societies) – 180 Subscribers
This list is designed to promote liaison between those recording all aspects of the historic environment, whether as part of a national project, a specialist interest group or locally based society. The list is intended for members to share experiences for the benefit of others, exchange information and provide mutual support.
Discussions in 2004 and only occasional publicity postings since, with the last in June 2007 and July 2008.

INTEROP-CULTURE – 70 subscribers
The mailing list of the international group involved in shaping Interoperable Digital Cultural Content Creation Strategies.
One post in April 2006 but prior to that used from July 2001 to November 2004.

What is to be Done?

Does the existence of many moribund lists matter? This is a question which is very pertinent to UKOLN activities on behalf of the cultural heritage sector in providing advice on digital preservation issues. The need to make plans for decommissioning services was highlighted by Chris Sexton, UCISA chair, at a recent UCISA meeting in which, as she described in her blog, “We are all going to be faced with spending less, doing more with less, and deciding what we can stop doing”.

Deciding which lists no longer have a useful purpose can be helpful to a number of groups. Users who find the mailing lists archives a potentially valuable resource may find that the search interface becomes more usable if the number of lists is decreased (there is no global search of the mailing lists and as Google is blocked from the archives searching selected mailing lists is a very time-consuming process). Deleting such lists may also help new users who are seeking relevant lists to join – at present statistically they are likely to join a moribund list if they make their selection based on the list descriptions. The JISCMail team may well find the systems management easier if unwanted content is deleted, thus potentially freeing technical expertise which can be used to enhance other aspects of the service.

Policies and Processes For Decommissioning and Mothballing Lists

How should a list owner go about deleting unused lists? And aren’t there dangers that deleting the contents of lists which may have been used to influence the research process or provide possibly valuable historical insights on the content area covered by the list would be regarded as a mistake by future generations?

It would be a mistake, however, to regard digital preservation as simply meaning that digital resources should be kept forever. An important role for those involved in preservation activities is the selection of resources which are felt to be worthy of preservation and the deletion of the rest – and if such deletion is neglected there may be significant costs in ongoing maintenance.

I’m not aware of guidance for list owners on how they should go about developing policies for mailing lists and associated procedures for implementing such policies. The only relevant information I could find on the JISCMail Web site was a page on renaming or deleting JISCMail lists. This page allows a list owner to give the name of the list to be deleted and request a ZIP file containing the archives, files and list header.

No advice is provided, however, to assist list owners who may be considering deleting lists. It would clearly be inappropriate for a list owner to delete a still-popular list. But at what stage might it be felt that a list should be considered for deletion?  Do posters of messages to the list have any say in the matter (they own the copyright of their messages)? And who should take responsibility for consideration of the long-term importance of messages posted to the list?

In a bottom-up approach to attempting to answer such questions I will describe my thoughts on the DNER-TECH and INTEROP-CULTURE lists.

A summary of these lists is given below.

List: DNER-TECH
Date created: August 1999
List owner: Brian Kelly, UKOLN (although I was initially unaware of this as it used a non-standard variant of my email address)
Status: Open access to archives
Summary of purpose of list, ownership, etc: To discuss technical issues related to the DNER (Distributed National Electronic Resource).
No. of subscribers: 50 (including 5 variants of my email address!)
Period of popularity: Small number of posts (2-3/month?) from 1999-2002.
Period of few and ‘non-essential’ posts (non-essential may include announcements, posts sent to multiple lists, etc.): Last discussion took place in July 2003.
Stakeholder communities and individuals: Software developers from JISC eLib and subsequent DNER (later renamed IE) programme; Chris Rusbridge? (eLib programme director); Rachel Bruce (JISC); UKOLN.
Likelihood of messages being cited in research papers: Unlikely.
Other issues: –
Risks: Closure of this list would have no adverse effect. Deletion of the contents of the list would be unlikely to have an adverse effect, especially in light of the (now-dated) technical content of the list.

List: INTEROP-CULTURE
Date created: July 2001
List owners: Brian Kelly and Rosemary Russell, UKOLN
Status: Login required to view archives
Summary of purpose of list, ownership, etc: Set up by staff in UKOLN
No. of subscribers: 70
Period of popularity: Last posts in November 2004 and April 2006.
Period of few and ‘non-essential’ posts (non-essential may include announcements, posts sent to multiple lists, etc.): List appears to have been announcements only.
Stakeholder communities and individuals: Appears to have been set up for policy makers in cultural heritage organisations.
Likelihood of messages being cited in research papers or contain ‘significant’ content: Very low.
Other issues: Significant number of overseas subscribers.
Risks:  Closure of this list would have an adverse effect. Deletion of the contents of the list would be unlikely to have an adverse effect. However in light of the international aspect of the list it would be prudent to ensure stakeholders have the opportunity to give their views.

Next Steps

Carrying out this research proved interesting in observing how these mailing lists failed to live up to their initial expectations. But what to do next? Some may feel that as the costs of the disk storage are trivial there is no need to do anything. However my view is that managed curation of such digital resources is needed. So I feel that I should send an email to these two lists announcing my intention to delete these lists based on my review of the contents and my assessment of the risks of deleting the content. And since I no longer have an interest in the archives if anyone wishes to maintain the content they will be welcome to take on ownership of the lists.

But before taking this step I thought I would seek others’ views on these proposals. What do you think should be done?


[Note this post has been updated with an updated chart of JISCMail usage statistics. You can view the original statistics published in the post which covered the period 2003-2007.]

Posted in preservation | 8 Comments »

Are You Able?

Posted by Brian Kelly on 17 Feb 2009

There were two invited keynote speakers who travelled from Europe to speak at the OzeWAI 2009 conference. As well as my talk (which I described recently) Dr. Eva M. Méndez (an Associate Professor in the Library and Information Science Department at the Universidad Carlos III de Madrid and not the American actor!) gave a talk entitled “I say accessibility when I want to say availability: misunderstandings of the accessibility in the other part of the world (EU and Spain)”.

Eva’s research focuses on metadata and web standards, digital information systems and services, accessibility and Semantic Web. She has also served as an independent expert in the evaluation and review of European projects since 2006, both for the eContentPlus program and the ICT (Information and Communication Technologies) program and her talk was informed by her knowledge of the inner working of such development programmes funded by the EU.

Her talk explored the ways in which well-meaning policies may be agreed with the EU, although such policies may be misinterpreted or misunderstood and fail to be implemented, even by the EU itself.

I don’t have access to Eva’s slides, so I will give my own interpretation of Eva’s talk.

We might expect the EU to support the development of a networked environment across EU countries across a range of areas. These areas might include:

Available: Have resources been digitised? Are they available via the Web?

Reusable: Are the resources available for use by others? Or are they trapped within a Web environment which makes reuse by others difficult?

Findable: Can the resources be easily found? Have SEO techniques been applied to allow the resource to be indexed by search engines such as Google?

Exploitable: Are the resources available for others to reuse through, for example, use of Creative Commons licences?

Usable: Are the resources available in a usable environment?

Accessible: Are the resources accessible to people with disabilities?

Preservable: Can the resources be preserved for use by future generations?

Since the acronym ARFEUAP isn’t particularly memorable (and ARE-U-API would be too contrived) we might describe this as the Able approach to digitisation. But there is one additional concept which I feel also needs to be included:

Feasible: Are the policies which are proposed (or perhaps mandated) feasible (or achievable)? We might ask are they actually possible (can we make all resources universally accessible to all?)  and can they be achieved with available budgets and with the standards and technologies which are currently available?

There is, of course, a question which tends to be forgotten: is the proposed service of interest to people and will it be used?

The worrying aspect of Eva’s talk was that the EU don’t appear to be asking such questions – or even using the same vocabulary. We need to have the bigger picture in order to address tensions between these different areas and the question (and power struggles) of how we prioritise achieving best practices – for example, should we be digitising resources, even if we can’t make them accessible; should we regard access by people with disabilities as being of greater importance than ensuring the resources can be preserved? And let’s not fudge the issue by suggesting that each is equally important and all can be achieved by use of open standards. That simply isn’t the case – and if you doubt this, ask managers of institutional repositories. They will probably say that they are addressing the available, reusable, findable, preservable and, perhaps, exploitable issues, but I suspect that the repository managers would probably admit that many of the PDFs in the repositories will not be accessible.

Posted in Accessibility, preservation, standards | 3 Comments »

Disappearing Resources On Institutional Web Sites

Posted by Brian Kelly on 16 Dec 2008

I recently received the publisher’s proofs of an accessibility paper which will be published in the new year. The reviewers spotted a number of broken links in the references. Some of them were links to previous papers I had published, and the errors were introduced by the publisher (which I confirmed by checking the details of the paper which I submitted). But for a couple of other references the pages did seem to have disappeared. I contacted Stuart Smith, one of the co-authors, and asked him if he knew anything about the references he had supplied which seemed to have disappeared.

Stuart told me that a new e-learning team in his institution had rebuilt the e-learning Web site, resulting, it seems, in the loss of existing resources. Stuart wrote a blog post about this incident entitled “Mummy I lost my MP3!“. Stuart felt that “My MP3 problem shows to me that the argument that the ‘cloud’ is too unstable doesn’t hold water … because institutional systems are open to the same criticisms“. Stuart concluded that “My solution to my MP3 problem will probably lie in the ‘cloud’ I’ll find a suitable archiving host that I like and also keep a backup offline (like I should have done in the first place) and if that host disappears at least I will know about it“.

I’m sure Stuart isn’t alone. How many resources do you think will have disappeared following the establishment of new Web teams or the release of new software? Maybe institutional repositories will have a role to play, as they try to address the persistent identifier problem by at least decoupling the address of the resource from the technology used to access the resource. But repositories won’t be used to manage all resources on an institutional Web site, will they?
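The decoupling idea can be illustrated with a toy example. The sketch below is not how any particular repository works; the class name, the identifier and the URLs are all made up for illustration. It simply shows that if links cite a stable identifier, the location behind it can change when a site is rebuilt without breaking the citation.

```python
from typing import Dict


class IdentifierResolver:
    """Toy lookup table from a stable identifier to the current location."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, identifier: str, url: str) -> None:
        # Registering an existing identifier again simply moves the resource.
        self._locations[identifier] = url

    def resolve(self, identifier: str) -> str:
        # Raises KeyError if the identifier was never registered.
        return self._locations[identifier]


resolver = IdentifierResolver()

# Hypothetical identifier and URLs: the resource first lives on the old site...
resolver.register("elearning:podcast-7", "http://www.example.ac.uk/elearning/old/podcast7.mp3")

# ...and is moved when the site is rebuilt. The identifier cited in papers stays the same.
resolver.register("elearning:podcast-7", "http://media.example.ac.uk/podcasts/7.mp3")

print(resolver.resolve("elearning:podcast-7"))
```

The catch, of course, is that somebody still has to update the table when the resource moves, which is a policy question rather than a technical one.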

Since our institutions don’t seem to have yet cracked the problem of management of resources across changes in policies, staff and technologies, is Stuart right, I wonder, in regarding ‘the cloud’ (e.g. services such as the Internet Archive, perhaps) as the place (or one of the places) to deposit resources for safe-keeping? Or perhaps the question is whether such services may be more reliable than the institutional Web site. After all, if your own institution misplaces your resources, you can’t sue them, can you?

Posted in preservation, Web2.0 | 3 Comments »

The Final JISC PoWR Workshop

Posted by Brian Kelly on 29 Aug 2008

The final workshop organised by the JISC-funded Preservation of Web Resources (PoWR) project will take place at the University of Manchester on Friday 12th September 2008.

Now you may think that preservation is a pretty dull topic, compared with the exciting developments that are taking place in a Web 2.0 environment. And if that’s what you think, then you’re not alone. As Alison Wildish, head of Web Services at the University of Bath described on the Web Services team blog:

We were asked by our colleagues at UKOLN (who organised the event) to deliver a brief talk detailing our approach to preserving web resources at the University. Our initial reaction was that we had little to say. Lizzie’s remit lies with the paper records and I am responsible for managing our website – ensuring it meets the needs of our users. Neither of us felt web preservation was something we had expertise in nor the time (and for me the inclination) to fully explore this.

And you can even listen to Alison and Lizzie Richmond (University of Bath records manager, archivist and FOI coordinator) expand on this by viewing the Slidecast of the talk they gave at the first JISC PoWR workshop:

If you listen to the end of the Slidecast you’ll hear Alison and Lizzie describing how they discovered in the course of the discussions reasons why Web preservation is a topic which needs to be treated seriously.

But how should one go about Web preservation? What should you preserve? What should one discard? What are the implications of use of Web 2.0 on preservation policies? Whose responsibility is this? What are the costs associated with preservation? And what are the costs and associated risks of not developing and implementing a preservation policy for your Web resources? And how does one ensure that an institutional preservation policy is sustainable and embedded within the institution?

These are some of the topics which have been raised on the JISC PoWR blog and will be discussed at the workshop. But hurry up and book your place, as the deadline for bookings is Friday 5th September. And note that the workshop is free to attend for members of the higher and further education community.

And finally I should point out that the case study given by Alison Wildish and Lizzie Richmond has been saved from being trapped in the non-interoperable world of the past, accessible only to Doctor Who (and even then only on a good day) by recording the talk and synching the recording with the slides and hosting this on Slideshare. You see, preservation can be enhanced through use of Web 2.0 services. Digital preservation can be cool – even though, arguably, it may kill the odd polar bear :-)

Posted in preservation, Web2.0 | Leave a Comment »

Fahrenheit 451

Posted by Brian Kelly on 15 Aug 2008

I recently attended the JISC’s Innovation Forum. One of the most interesting of the plenary talks was given by HEFCE’s John Selby. In his talk John praised the work of the JISC and the JISC Services, but went on to warn of troubled financial times ahead for the educational sector. The glory days of the past 10 years are over, he predicted.

This was probably not unexpected. What did surprise me, however, were the figures John quoted, which put the carbon cost of IT to the environment on a par with the cost of flying – both at 2%.

This generated much debate at the forum, and, later on, at the conference meal and in the bar. Although people questioned the accuracy of these figures, and wanted to know how these figures were obtained, there was an awareness that the carbon cost of IT is an issue which the IT sector needs to address. I should add that I subsequently came across details of a forthcoming Government Goes Green conference in which Malcolm Wicks, Energy Minister, BERR was quoted as saying that

ICT is now responsible for around 2% of global CO2 emissions. The public sector, with annual IT spending of £14bn, has an important role to play in reducing this two percent. An increased focus on sustainable procurement and efficient use of IT products are two key areas that it needs to work on and I am very pleased to see a conference dedicated on this.

At the JISC Innovation Forum dinner I found myself sitting next to colleagues from the Digital Curation Centre (DCC). I suggested, partly in jest, that although there was a clear need for continued development of networked services which are popular with the users, we had to ask ourselves whether the costs of preserving digital resources could be justified. If, as we learnt from Alison Wildish’s recent presentation at the first JISC PoWR workshop, those involved in Web development activities tend to focus on the pressing needs of their user communities and find it difficult to justify diverting scarce resources to preserving resources which are no longer of significant interest to the institution, why don’t we stop pushing the notion of digital preservation? And not only will this allow the development community to focus their efforts on responding to pressing user needs – but removing archived files from hard disk drives could result in significant savings in energy.

This approach would then both help the users and help save the planet :-)

As I’ve said this was intended as a joke, over our conference meal. But we realised that there may be benefits for the digital preservation community in making such suggestions. After all, preservation is widely considered as worthy but dull. If digital preservation was regarded as something radical, might it have a greater appeal to developers? Could those involved in digital preservation work – harvesting old Web sites and even implementing OAIS models – find themselves repositioned as members of an underground radical movement, secretly preserving digital artefacts for a society which regards such activities as unacceptable? Fahrenheit 451 for the 21st century, perhaps.

[Image: Save a Polar Bear campaign poster]

The following day when I suggested this, I was told that there have been discussions about strategies for digital preservation which acknowledge that there are environmental factors which need to be addressed. It seems that there have been proposals that such preservation activities should be based in places such as Greenland and Alaska where the low temperatures may reduce the need for consuming energy to keep the disk drives running at acceptable temperatures.

Now scientists may point out that running large scale server farms in locations near glaciers and the ice cap may increase the rate at which they melt. But the ideas which were bounced around at the event did make me wonder whether centralisation of networked services (e.g. running applications hosted by Google or Yahoo or running our applications on Amazon’s S3 and EC2 servers) would be more beneficial to the environment than all of our institutions running our own local servers.

And perhaps such discussion might be useful in a teaching context. Does data curation, for example, conflict with environmental protection? If so, should we forget it? Or could this approach result in deletion of the very data that could save the planet?

What do you think?

And if you’d like to take part in a viral marketing campaign which seeks to make digital preservation interesting by suggesting that it might be responsible for global warming, feel free to make use of the poster which has been produced. And note that a Creative Commons zero licence (currently in beta) has been assigned to this resource, so you don’t need to cite the original source. Let’s be part of an underground movement :-)

Posted in Finances, preservation | 19 Comments »

Places Still Available on “Preservation of Web Resources” Workshop

Posted by Brian Kelly on 17 Jun 2008

I’ve previously mentioned the JISC Preservation of Web Resources (JISC-PoWR) project which is being provided by UKOLN and ULCC. The project has established a blog and will be running its first workshop, entitled Preservation of Web Resources: Making a Start, on Friday 27th June 2008 at Senate House, London.

The workshop is aimed at staff in the higher and further education sector with responsibilities for the preservation of institutional Web resources. The workshop will introduce the concept of Web preservation, and discuss the technological, institutional and legal challenges the preservation of Web resources presents. One aspect of Web site preservation might be keeping a history of changes to your institution’s home page. Do you have a digital record of the changes? And do you have a record of why significant changes were made and when? I have been working with colleagues in the University of Bath on ways in which we might address this particular issue. The following video clip, which is available on YouTube, illustrates some of the issues (although if the display is too small you might prefer to view the original resource):

There are still a number of places available on the workshop – which is free to attend for those in the higher and further education sector. But please sign up promptly if you are interested. The timetable is given below:

10:00 – 10:30 Registration and coffee

10:30 – 12:45 Morning Sessions:

  • Presentation: Preservation of Web Resources Part I
  • Breakout session: What are the Barriers to Web Resource Preservation?
  • Presentation: Challenges for Web Resource Preservation
  • Presentation: Legal issues

12:45 – 13:45 Lunch
13:45 – 16:00 Afternoon Sessions:

  • Presentation: Bath University Case Study
  • Breakout session: Preservation Scenarios
  • Presentation: Preservation of Web Resources Part II

16:00 End

Posted in preservation | Leave a Comment »

The SearchMe Visual Service

Posted by Brian Kelly on 13 Jun 2008

A recent Tweet from Tony Hirst alerted me to the Searchme Visual Search service. An example of use of this service searching for “UKWebFocus” is illustrated below.

The Searchmevisual.com Service

As the name suggests, this service provides a visually-oriented approach to searching and, rather than attempting to describe this service, I suggest you try it.

I suspect that an initial response from some information professionals would be to highlight the limitations of such an interface, pointing out the difficulties of more advanced searching. However I feel that this would be to overlook the potential of this type of interface to provide browsing functionality. And this, indeed, was the use case made by Tony Hirst:

@briankelly would like a wayback machine browser for home pages over time. http://beta.searchme.com would look neat? Any libraries for it?

I met Tony at the recent CRIG DRY (Don’t Repeat Yourself) Metadata Barcamp held at the University of Bath. Over lunch I mentioned UKOLN’s JISC-PoWR (Preservation of Web Resources) project and described my interest in ways of exploiting content held in the Internet Archive’s WayBack Machine. I suggested that a generic screen-scraping interface to the service would be useful – and when I returned to the Barcamp later that afternoon Tony demonstrated the first version of the software :-) And the following day Tony had started to explore ways of providing a richer user interface to such data. A browse interface such as that used by Search Me Visual could potentially provide a very engaging way of visualising the changes to an organisation’s home page, I would think. And wouldn’t it be great if this could be demonstrated at the JISC-PoWR’s opening workshop on 27 June 2008? Has anyone come across any tools which could do this?
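As a rough illustration of the "home pages over time" idea (and not a description of Tony's software, which I haven't seen the code for), the sketch below picks the earliest capture in each year from a list of capture timestamps and builds a Wayback Machine replay link for each. It assumes the captures have already been fetched as a header row followed by [timestamp, original URL] rows, which is the shape I understand the Wayback Machine's CDX service returns for JSON output; the function names and the sample rows are made up.

```python
from typing import Dict, List, Tuple


def first_capture_per_year(rows: List[List[str]]) -> Dict[str, str]:
    """Map each year to the earliest capture timestamp seen for that year.

    `rows` is assumed to be a header row followed by [timestamp, original]
    rows, with timestamps in the 14-digit YYYYMMDDhhmmss form.
    """
    earliest: Dict[str, str] = {}
    for timestamp, _original in rows[1:]:  # skip the header row
        year = timestamp[:4]
        if year not in earliest or timestamp < earliest[year]:
            earliest[year] = timestamp
    return earliest


def replay_url(timestamp: str, page_url: str) -> str:
    """Build a Wayback Machine replay link for one capture."""
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def yearly_timeline(rows: List[List[str]], page_url: str) -> List[Tuple[str, str]]:
    """Return (year, replay link) pairs in chronological order."""
    per_year = first_capture_per_year(rows)
    return [(year, replay_url(ts, page_url)) for year, ts in sorted(per_year.items())]


# Made-up sample rows for illustration only; not real capture data.
SAMPLE_ROWS = [
    ["timestamp", "original"],
    ["20030801120000", "http://www.example.ac.uk/"],
    ["20020926101500", "http://www.example.ac.uk/"],
    ["20030214090000", "http://www.example.ac.uk/"],
    ["20051203080000", "http://www.example.ac.uk/"],
]

for year, link in yearly_timeline(SAMPLE_ROWS, "http://www.example.ac.uk/"):
    print(year, link)
```

A visual browser would then only need to render a thumbnail for each link in the list.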

Posted in preservation, Web2.0 | 4 Comments »

Preservation of Web Resources: Making a Start

Posted by Brian Kelly on 4 Jun 2008

My colleague Marieke Guy together with the JISC-PoWR project partners at ULCC have announced details of a workshop on “Preservation of Web Resources: Making a Start” – this one-day workshop will take place on Friday 27th June 2008 at the Senate House Library, University of London.

The JISC-PoWR project runs until the end of September 2008 and will run three workshops which will aim to identify best practices for preserving Web sites. The key deliverable of the project will be a handbook which will document the challenges to be addressed in Web site preservation in a number of areas which will include key institutional Web services (e.g. the prospectus), project Web sites (which have clear termination dates) and, a particular challenge for the project, the preservation issues associated with use of Web 2.0 services.

The first workshop will be free to attend (although there will be a penalty for non-shows), with the second workshop being held as part of the IWMW 2008 event at the University of Aberdeen on 23rd July.

Please sign up now if you would like to attend. And if you can’t make it but have an interest in the preservation of Web resources, why not subscribe to the JISC-PoWR blog – and, rather than being a passive reader, join in the discussions? Topics we’d be interested in hearing about include (a) how institutions are currently addressing the preservation of key institutional Web-based services (such as the prospectus); (b) the approaches you may be taking to short-term project Web sites (whether JISC-funded or institutionally-funded); and (c) your views on the preservation of data and services provided by externally-hosted Web 2.0 services.

Posted in Events, preservation | Leave a Comment »

Preserving The Past Can Help The Future

Posted by Brian Kelly on 21 May 2008

Many of the posts featured in this blog describe innovative tools and applications which aim to provide a more effective work or study environment for users. However there can be a danger that an emphasis on new and innovative services can mean a failure to manage legacy services which can result in a loss of our experiences, history and culture.

This can be particularly true in the Web environment. I first became aware of the scale of the problem when I monitored the Web sites which had been set up for projects funded by the EU’s Telematics For Libraries programme. As I described in an article on WebWatching Telematics For Libraries Project Web Sites published in the Exploit Interactive e-journal in October 2000, of the 65 projects which had Web sites, a total of 23 of the Web sites had disappeared when I carried out the survey. And a recent check shows that at least 39 of the Web sites have gone. Our digital history, the associated learning and the investment (from EU taxpayers) is being lost!
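To put those figures in proportion, the small sketch below turns the counts quoted above (23 and then at least 39 sites gone, out of 65) into percentages. The function name is just for illustration.

```python
def percent_lost(lost: int, total: int) -> float:
    """Share of Web sites that have disappeared, as a percentage to one decimal place."""
    return round(100 * lost / total, 1)


TOTAL_SITES = 65  # projects which had Web sites

print(percent_lost(23, TOTAL_SITES))  # at the time of the original survey
print(percent_lost(39, TOTAL_SITES))  # at the recent check ("at least 39")
```

So roughly a third of the sites had gone by the time of the survey, and at least three in five by the time of the recent check.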

Or is it? Is this assertion just being alarmist? Might not the information have been migrated to a more manageable environment? And perhaps some of the projects are now available, possibly under new names, as sustainable services?

There’s a clear need for these issues to be addressed and for advice to be provided – both to organisations responsible for managing their own Web services and to funding bodies which commission development work which will involve the development of Web sites.

JISC have recognised the need to provide such advice. They issued a recent call for an ITT on “The Preservation of Web Resources Workshops and Handbook” and I’m pleased to report that a joint bid by UKOLN and ULCC was successful. The project, which had its launch meeting on 1 May 2008, will run three workshops which will aim to gain a better understanding of the challenges to be faced in Web site preservation, identify examples of best practices and provide a set of recommendations to policy makers, content providers and developers. This will be documented in a handbook which should be available after September 2008.

Although the project is only funded for 5 months it will seek to provide advice not only on conventional institutional Web sites, but also on use of third party Web 2.0 services – the potential benefits of such services are well-understood, but there needs to be a better understanding of the risks associated with their use and how institutions should assess such risks and use such assessments to inform policy.

[Image: JISC PoWR Blog]

The project team members themselves are using a variety of Web 2.0 tools to support their work. As well as communications technologies (beyond email) to support the work of the distributed team members a blog is also being used to disseminate information about the project and to solicit feedback and encourage discussion and debate. The JISC-PoWR (Preservation of Web Resources) blog (illustrated) is hosted on the JISC Involve blog service.

The team would like to welcome those with an interest in Web site preservation to join the blog and contribute to the discussions.

Posted in preservation | 1 Comment »

Disappearing Public Sector Web Sites

Posted by Brian Kelly on 31 Mar 2008

I recently used the Intute service to see what records it held about UKOLN’s activities. I found a record about the ‘Crossroads West Midlands’ service, for which UKOLN provided technical advice on the design of the collection description database:

This is the website of ‘Crossroads West Midlands’, a Resource funded project that is working to develop online access to the collections of libraries, museums and archives in the West Midlands (including universities and local authorities as well as private institutions). The Crossroads website is currently a prototype, testing a database built upon the RSLP collection level description database, covering the collections relating to the potteries industry of North Staffordshire.

The record provides additional information about the service which reminded me about the meetings I attended several years ago about this project. I was interested to see what the Crossroads West Midlands service now looks like, so I followed the link to the http://www.crossroads-wm.org.uk/ address – and, rather than a service providing access to a database of cultural heritage resources in the West Midlands, I found a page full of links to services such as golf, gambling, estate agents, motor insurance, etc.

[Image: Crossroads West Midlands Web Site]

Clearly at some point the domain name for the original service had lapsed and was purchased by a company which used it to host advertisements and links to companies which would be willing to advertise in this way (or possibly companies wishing to enhance their search engine ranking may have procured the services of a Search Engine Optimisation service and might not be aware of the approaches taken).

I was interested in the history of the Web site. Using the Internet Archive I discovered that the Web site was first archived on 26 September 2002. At this point the information in the archive contained details about the project. The service itself was first launched around February 2003. And the service disappeared to be replaced by an advertisement site at some point between December 2005 and April 2006.
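I did this by browsing the archive by hand, but the same sort of check could be scripted. The sketch below builds a query for the Internet Archive's Wayback availability service, which (as I understand it) returns the capture closest to a given date, and reads the closest capture out of the JSON reply. The function names are my own, the sample reply is made up, and the code deliberately stops short of making the network call.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url: str, yyyymmdd: str) -> str:
    """Build the query URL asking for the capture closest to a given date."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': page_url, 'timestamp': yyyymmdd})}"


def closest_snapshot(response: dict) -> Optional[Tuple[str, str]]:
    """Return (timestamp, archived URL) for the closest capture, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


# One query per date of interest, e.g. either side of the suspected change.
for date in ("20051201", "20060401"):
    print(availability_query("www.crossroads-wm.org.uk", date))

# A made-up reply in the shape the service is assumed to return; not real data.
SAMPLE_REPLY = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20060401000000",
            "url": "http://web.archive.org/web/20060401000000/http://www.example.org/",
        }
    }
}
print(closest_snapshot(SAMPLE_REPLY))
```

Comparing the captures returned for a series of dates would then show roughly when the content changed, although deciding that a page has turned into an advertising site still needs a human eye.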

What happened? Did project funding run out? Did key staff leave? Or was there a blunder, with nobody receiving the email requesting renewal of the domain name?

Whatever the reason, this West Midlands Crossroads service has disappeared from sight. Is this inevitable? Well back in 1999 I was the project manager for the Exploit Interactive e-journal – an EU-funded project which ran until 2000. Once the funding had finished we had to decide what would happen with the domain name. We agreed to continue paying for the domain for at least 3 years after the project funding had ceased and would try to keep the domain for a period of 10 years. This policy was informed by a survey I carried out of project Web sites funded by the EU’s Telematics for Libraries programme. As I described in an article published in Exploit Interactive in October 2000, 23 Web sites had disappeared of the 103 projects funded.

We are seeing a disappearance of cultural resources and EU-funded projects from the digital environment. And this may well get worse, if the UK Government’s policy of centralising its Web sites, which will result in 551 Web sites being closed down, is not managed properly. Will we, for example, find that the Drugdrive Web site at http://www.drugdrive.com/ suddenly becomes a site used for selling drugs?

What is to be done? The good news is that the Government does seem to be handling its redirects properly – the Drugdrive Web site, for example, is redirected to http://www.drugdrive.com/

Well done, the UK Government. But what about the rest of us? Are we managing the closure of Web sites? And are we assessing the risks of failing to do this? After all, if a government Web site on protection of children from dangers on the Internet became available and was bought by a pornography site, we could well see a government minister being forced to resign.
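For those of us who do want to check how a closed site is being handled, one simple approach is to look at the HTTP status code and Location header returned for the old address. The sketch below classifies such a response; it is only an illustration, the function name and example addresses are invented, and it works on values you have already obtained rather than fetching anything itself.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit

PERMANENT = {301, 308}       # "moved for good"
TEMPORARY = {302, 303, 307}  # "moved for now"


def classify_response(original_url: str, status: int, location: Optional[str] = None) -> str:
    """Describe how a request for a closed site's old address was handled."""
    if status in PERMANENT | TEMPORARY and location:
        # The Location header may be relative, so resolve it against the old address.
        target = urljoin(original_url, location)
        same_host = urlsplit(target).hostname == urlsplit(original_url).hostname
        kind = "permanent" if status in PERMANENT else "temporary"
        where = "same host" if same_host else "different host"
        return f"{kind} redirect to {target} ({where})"
    if 200 <= status < 300:
        return "served directly, no redirect"
    return f"not redirected (HTTP {status})"


# Invented examples: a closed campaign site pointing at a central site, and a dead one.
print(classify_response("http://www.example.gov.uk/", 301, "http://www.example.org/topic"))
print(classify_response("http://www.example.gov.uk/", 404))
```

A permanent redirect to the new home is the outcome you would hope to see. Note that a page served directly with no redirect is not necessarily good news: that is exactly what a lapsed domain bought by an advertiser would return.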

Posted in preservation | 3 Comments »