UK Web Focus

Reflections on the Web and Web 2.0

Archive for the ‘preservation’ Category

Thoughts on “The Future of the Past of the Web” Event (#fpw11)

Posted by Brian Kelly (UK Web Focus) on 10 October 2011

On Friday 7th October 2011 I attended a one-day event on “The future of the past of the web”. The event, which was organised by the British Library, the Digital Preservation Coalition (DPC) and the JISC, was the third joint Web archiving workshop, the previous two workshops having been held in 2006 and 2009.

I have had an interest in Web archiving for some time, having given a talk way back in 2002 on “Archiving The UK Domain and UK Web Sites: What Are The Issues?” at a DPC seminar on “Web-archiving: managing and archiving online documents and records”. It seems that the Web archiving world has changed significantly since I gave my talk and, indeed, since the first two workshops. As a number of people commented, many of those involved in Web archiving initiatives are no longer primarily focussed on archiving conventional Web ‘pages’; rather, the sector is facing the challenges of archiving a much more dynamic environment, with the Social Web now providing significant content which social historians of the future will wish to analyse in order to make sense of today’s online (and offline) environment.

The changes in emphasis can also be seen from the developments of end user services which can help to make the importance of Web archiving more obvious to the wider community. In the opening plenary talk Herbert Van de Sompel described Memento, an initiative which is looking to “add time to the Web” through developments which build on existing web protocols, including HTTP and content negotiation.

A Memento plugin for Firefox is available which enables end users to gain an understanding of the benefits which such developments can provide. I was also pleased to hear that a Memento Browser is available for Android mobile devices. For those who may not be able to install such applications, use of Memento’s capabilities can also be seen by using the Internet Archive’s Wayback Machine. As can be seen from the accompanying image, you can view the BBC News Web site for October 2008, and perhaps reminisce about the early days of the financial crisis.
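The idea of adding time to the Web through content negotiation can be illustrated with a minimal sketch. It assumes the `Accept-Datetime` request header used by Memento (the header name comes from the Memento specification rather than from the talk itself), and the function name is purely illustrative. The code only builds the header; it does not contact any archive.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Build the request header asking for a resource as it was at `when`."""
    if when.tzinfo is None:
        raise ValueError("an aware datetime is required")
    # HTTP dates are always expressed in GMT, so convert before formatting.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}


# Ask for a page as it looked at midday on 1 October 2008.
headers = accept_datetime_header(datetime(2008, 10, 1, 12, 0, tzinfo=timezone.utc))
print(headers)
```

The resulting dictionary could be passed to any HTTP client when requesting a Memento-aware service, which would then be expected to respond with (or redirect to) the archived copy closest to the requested date.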

Further examples of rich interactive interfaces to Web archives have been developed to enhance the UK Web Archive service and, as described by Maureen Pennock and Lewis Crawford, this includes N-Gram visualisations of searches across the archive, tag clouds generated from the General Election 2005 Collection and a 3D wall visualisation across archived collections.

Services provided by the British Library have, of course, always been valued by researchers. But in a talk on “Web Archiving: the State of the Art and the Future” Eric Meyer, Research Fellow Director at the Oxford Internet Institute, asked us to consider how effective we have been in making social science researchers aware of the potential of Web archives in supporting their research. There is, I feel, a need for further advocacy to ensure that researchers are aware of the ways in which not only archived digital resources, but also data associated with such archives, can support research interests.

The increasing importance of Web archiving has led to archiving tools and services being developed within the commercial sector in addition to activities led by national libraries and archives, higher education and EU-funded consortia. Mark Williamson was invited to give a presentation at the last minute and described various archiving activities of his company, Hanzo. It was interesting to hear how a well-known multi-national company such as Coca Cola, which, as might be expected, has well-established processes for archiving physical objects, was slow to recognise the importance of digital archiving, initially of its public Web site and then of its public presence on social web sites, including the Coca Cola Facebook page. Mark also described how APIs are being developed for the Hanzo Web archiving system and how the APIs would be valuable in analysing the data associated with large collections of Web archives. As Mark put it: “The individual pages in a web archive are pretty boring – it’s the Big Data that’s exciting”. It will be interesting to see whether the Hanzo software could provide a solution for Universities which may be interested in archiving their digital presence, especially uses of social web services for which the content cannot be managed through the content management system used to manage the institutional Web presence.

As well as finding the talks at the workshop of interest it was also interesting to observe the gaps. In the final session Neil Grindley, JISC Programme Manager for digital preservation, asked the panel for their thoughts on standards for web archiving – and found that no one on the panel wished to respond. However in response to my tweet that:

Interesting that nobody wanted to respond to the question about standards for web archiving at #fpw11

Helen Hockx commented that:

@briankelly I agree. Both ISO and BSI have initiated and are going to initiate work on standards related to web archiving.

If the next Web archiving event is held in another two years’ time, it will be very interesting to see what the focus of development work will be. Ten years ago the drive for Web archiving came from national and international bodies. However, as suggested in a tweet posted a few hours ago by Les Carr, who provided a link to a blog post on using EPrints repositories to collect data from Twitter, perhaps we shall see institutions appreciating the value of digital content created by members of the institution, including content hosted outside the institution. Or perhaps, as suggested by the EU-funded Arcomem project, it may be large EU-funded projects which help to preserve today’s cultural memories which are held on online services, including social web services. And although motivated individuals may wish to make use of tools such as Memolane, a “Social Web application that captures all of your memories from different Social Networks like Flickr, Facebook, Twitter, Youtube” highlighted on the Arcomem Website as a “Personal Timemachine for the Social Web”, in reality I don’t think we can leave it to individuals to take responsibility for preserving their own public content. Of course, this begs the question of ‘walled gardens’, which apparently mean that content cannot be accessed by third parties, and of issues such as privacy and copyright. I wonder if the next Web archiving workshop will have got bogged down by the difficulties which such issues raise, or if ways of circumventing such difficulties may have been found?

Posted in preservation | 4 Comments »

“Battling legal, logistical and technical obstacles to archiving the Web”

Posted by Brian Kelly (UK Web Focus) on 12 September 2011

Recent Features on Web Archiving

The recent guest blog post entitled Web archives: more useful than just a ‘historical snapshot’ was quite timely, having been published a few days after a related article in the Times Higher Education (Memory Failure Detected) which described how:

A coalition of the willing is battling legal, logistical and technical obstacles to archive the riches of the mercurial World Wide Web for the benefit of future scholars

The article went on to illustrate a use case from the preservation of Web resources:

It is 2031 and a researcher wants to study what London’s bloggers were saying about the riots taking place in their city in 2011. Many of the relevant websites have long since disappeared, so she turns to the archives to find out what has been preserved. But she comes up against a brick wall: much of the material was never stored or has been only partially archived. It will be impossible to get the full picture.

But, as I describe below, we don’t need to wait until 2031 to have a reason to analyse Web content which may have been thought to be ephemeral.

Analysis of Twitter Usage at Recent ALT-C Conferences

The article in the Times Higher Education referred to an archiving initiative led by the Library of Congress which is archiving Twitter posts and which will allow researchers, at some time in the future, to analyse public tweets. The article could also have mentioned the TwapperKeeper archiving service, which benefitted from JISC funding to enhance its archiving capabilities in order to address requirements of the UK HE sector. The TwapperKeeper service was used to keep an archive of tweets posted about last week’s ALT-C 2011 conference. The JISC-funded developments to the service included the provision of enhanced API access, which led to the development of the Summarizr analysis service by Andy Powell at Eduserv.

In order to make valid comparisons across annual events I have previously suggested that the Twitter traffic for a week is analysed, so that discussions in advance of an event and shortly afterwards can be analysed. The Summarizr statistics for tweets at the ALT-C conferences for the past three years are given in the following table.

Note: Following the publication of this post Martin Hawksey pointed out in a comment on the post that the TwapperKeeper archive was not available at the start of the ALT-C 2011 conference, until he created the archive on the opening morning of the conference. An updated column has been published, but note that this does not include tweets from the opening morning of the conference.

| | ALT-C 2009 | ALT-C 2010 | ALT-C 2011 | ALT-C 2011 (updated) |
|---|---|---|---|---|
| Date of event | 8-10 Sept 2009 | 7-9 Sept 2010 | 6-8 Sept 2011 | 6-8 Sept 2011 |
| Dates for analysis | 6-12 Sept 2009 | 5-11 Sept 2010 | 4-10 Sept 2011 (partial archive) | 6-11 Sept 2011 |
| Nos. of tweets | 4,442 | 6,138 | 6,296 | 6,342 |
| Nos. of users | 726 | 658 | 802 | 809 |
| Nos. of URLs tweeted | 701 | 664 | 1,083 | 1,102 |
| Top five twitterers | jamesclay (168), sputuk (113), haydnblackey (112), emmadw (110), JackieCarter (97) | dajbconf (330), timbuckteeth (279), AJCann (174), jamesclay (153), jak82 (111) | digitalfprint (327), timbuckteeth (212), sarahhorrigan (187), FieryRed1 (165), kevupnorth (140) | digitalfprint (327), timbuckteeth (217), sarahhorrigan (187), FieryRed1 (165), amcunningham (141) |
| Top five tweeted hashtags | altc2009 (4,333), jisccdd (108), dubaimetro (84), wheniwaslittle (72), dupedb (64) | altc2010 (6,089), digilit (173), awesome (25), altc2011 (24), fail (23) | altc2011 (6,194), ds106radio (54), altc2012 (42), oer (39), opencountry (35) | altc2011 (6,240), ds106radio (54), altc2012 (42), oer (39), opencountry (35) |
| Nos. of geo-located tweets | 0 (0%) | 35 (0%) | 83 (1%) | 83 (1%) |

Archiving of the tweets allows us to provide such analyses in order to see the importance of Twitter at such events and identify the people who are particularly active Twitter users at the events. The figures also suggest that the amount of Twitter traffic has stabilised over the past two years and that geo-location of tweets, although growing, is not yet being used to any significant extent.
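The kind of summary shown in the table could be produced from an archive of tweets along the following lines. This is only a sketch and not how Summarizr itself works: the `(user, text)` input format, the function name and the sample tweets are all invented for illustration, and it assumes that the URL figure counts distinct URLs.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")


def summarise(tweets, top_n=5):
    """Summarise a list of (user, text) pairs from an archived hashtag stream."""
    users = Counter(user for user, _ in tweets)
    # Hashtags are folded to lower case so #ALTC2011 and #altc2011 count together.
    hashtags = Counter(
        tag.lower() for _, text in tweets for tag in HASHTAG.findall(text)
    )
    urls = {url for _, text in tweets for url in URL.findall(text)}
    return {
        "tweets": len(tweets),
        "users": len(users),
        "urls": len(urls),
        "top_twitterers": users.most_common(top_n),
        "top_hashtags": hashtags.most_common(top_n),
    }


# Invented sample data, not taken from any real archive.
sample = [
    ("alice", "Keynote starting now #altc2011"),
    ("bob", "Slides at http://example.org/slides #altc2011 #oer"),
    ("alice", "Great question from the floor #ALTC2011"),
]
print(summarise(sample))
```

Running the same function over a fixed seven-day window for each year would give figures that are comparable across events, which is the reason for analysing a week of traffic rather than just the conference days.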

The Coalition of the Willing – Should Include You

The article published in the Times Higher Education highlighted a number of examples of initiatives designed for archiving the broad range of resources available on the Web, including work being undertaken at the British Library, the Library of Congress and the Internet Archive as well as a number of national libraries in Europe.

The emphasis on national and international organisations may lead to the impression that archiving of Web resources is being addressed by others and so there is no need for individual universities to consider web preservation issues. This is, I feel, a mistaken view. Indeed, not only do those who have a responsibility for the management of institutional digital resources need to address preservation issues; so too do those who manage project resources as well as, as we have seen above, those who may wish to preserve content associated with events.

JISC has recognised the importance of Web archiving and will be hosting an event on “The Future of the Past of the Web” which will be held at the British Library Conference Centre on 7 October 2011. This free event is the third joint Web archiving workshop which has been organised by the JISC in conjunction with the British Library and the DPC. The event is aimed at:

  • Curators, librarians, archivists interested in the preservation of web resources
  • Organisations that are engaged in web archiving and digital preservation
  • Researchers who depend on access to stable web resources for their research
  • Web developers and content creators who value their content
  • Information managers with responsibility for legal compliance

If this event is of interest to you, note that bookings should be made before 12:00 on Friday 30th September 2011.

Posted in Events, preservation | 4 Comments »

Guest Post: Web archives: more useful than just a ‘historical snapshot’

Posted by Brian Kelly (UK Web Focus) on 7 September 2011

In this guest blog post Maureen Pennock, the Web Archive Engagement & Liaison Manager at the British Library, explores some possible approaches to exploiting the scholarly value of web archives.

Web archives: more useful than just a ‘historical snapshot’

The importance of the internet for research is well-known. As a constantly growing and evolving information source, the web contains vast amounts of information not available or published elsewhere. It is also a unique record of life and society in this technological age. Rarely these days do scholars carry out their research without going online, and the research value of the web is undeniable.

Web archives seek to capture this value and uniqueness by harvesting websites so that they may be re-used in the future even when they are no longer available on the live web. Over the past decade, numerous web archives have been established and grown, including the UK Web Archive. At almost 10 terabytes, over 9,300 web sites and 38,000 instances of archived sites, the UK Web Archive is a unique selective web archive that reflects the collection policies of the participating institutions.

Use of the web archive is steady. However, as recent reports have identified, there remains a gap between the potential community of researchers who could exploit the content, and those who actually do so. To address this, we are collaborating with researchers to explore different ways in which they may use the web archive and exploit the data contained within. We have developed and released a number of visualisation tools as an early first step:

  • the 3D Visualisation Wall, (shown below) which provides a high-level, more dynamic presentation of search results and special collections;
  • the N-Gram search, which encourages users to consider the web archives as data as well as websites, enabling visualisation and comparisons of term frequency;
  • the General Election 2005 Tag Cloud, which visualises the most frequently used (single and pairs of) words in the websites related to key political parties during the 2005 election campaign.

Analysis shows that our single most popular site is the One & Other site, otherwise known as the Fourth Plinth, the website of a 2009 public arts project by artist Antony Gormley. The site is no longer available on the live web. This type of usage, where users browse websites in order to access content that was available at a given point of time but is no longer accessible, is a widely accepted, original user scenario. It is based largely on original user experiences and early interactions with the live web. But there are other ways in which a web archive may be used, aside from visiting sites as they were captured at a given date and time. For example:

  1. Resource citation. Researchers typically use the live web for research and cite live web resources with the date last visited. Why? Because content changes over time and they want to indicate when the content was available on the website. But if the content changes – and web pages are frequently updated or refreshed without archiving old versions – then there is no proof that the content cited actually existed. The web archive provides a more reliable and persistent citation than the live web.
  2. Data exploitation. Web archives enable automatic identification of social trends over time (automated temporal trend research). The tools available will impact on the type of research that can be undertaken. This is a chicken & egg scenario: we rely to an extent on users to tell us what tools they want, but users need some direction on what might be possible with the data available. We need to work together to further develop the archive and support the emerging research needs of our users.
  3. Intelligent querying, of the Q&A sort. Given the amount of data available in the web archive, it’s not inconceivable that future users will expect a more intelligent query mechanism than simple search and result presentation. More complex questions, for example, ‘tell me about the competing interests of oil companies in the late twentieth century’ are the stuff of sci-fi but rely upon an extensive historical database – such as a web archive.
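To give a flavour of treating the archive “as data”, the temporal trend idea in the second scenario can be sketched as a relative term frequency per year. This is a simplified illustration only and is not how the UK Web Archive N-Gram search is implemented: the `(year, text)` input format and the function name are assumptions, only single-word terms are handled, and the sample documents are invented.

```python
import re
from collections import Counter, defaultdict

WORD = re.compile(r"[a-z']+")


def term_trend(documents, term):
    """Return {year: share of all words that are `term`} for (year, text) pairs."""
    hits = Counter()
    totals = defaultdict(int)
    for year, text in documents:
        words = WORD.findall(text.lower())
        totals[year] += len(words)
        hits[year] += words.count(term.lower())
    # Normalising by total words keeps years with more captures comparable.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Invented sample data standing in for text extracted from archived pages.
sample = [
    (2005, "election election manifesto"),
    (2005, "budget news"),
    (2010, "election coalition talks today"),
]
print(term_trend(sample, "election"))
```

Normalising by the number of words captured in each year matters because a growing archive would otherwise make every term appear to become more popular over time.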

Of course the characteristics of a web archive inevitably impact on how viable these different scenarios may be. For example, a selective web archive with limited scope but rich resource description will support research differently to a broad domain or international archive, with minimal accompanying metadata. The age of the web archive may be another factor. These factors must be recognised when developing tools and functionality.

Increasing usage and responding to researcher needs is an important element of our growth strategy for the UK Web Archive over the next five years. If you use the web archive for research and/or have ideas about tools or functionality to support specific types of research, we’d really like to hear from you. You can get in touch with us either by email, on Twitter, or by leaving a comment below.


Contact Details

Maureen Pennock
Web Archive Engagement & Liaison Manager
The British Library (Yorkshire)

Email: maureen.pennock@bl.uk
Twitter: @mopennock

Posted in Guest-post, preservation | 2 Comments »

Case Study: Opening Access to a Closed and Unused Mailing List

Posted by Brian Kelly (UK Web Focus) on 31 August 2011

The Value of List Archives

A recent post on Policies on Unused JISCMail Lists highlighted the potential value of JISCMail lists which are no longer active but which host content which may provide historical insights into digital library developments. As is suggested by the recent JISC ITT for an Analysis of the Value and Benefits of Text Mining and Text Analytics in UK FE and HE, data mining tools have developed in sophistication since the JISCMail service was launched in 2000. It may well now be timely to perform data mining work on our email archives – particularly as recent email messages have been sent to owners of unused lists inviting them to delete archives without mentioning the implications of such actions. The current importance of JISCMail as an archive rather than as a communications tool is also suggested by the JISCMail statistics, which show that the majority of lists (5,840) have no recent posts (and this number is steadily increasing), 1,583 have between 1 and 10 posts, 707 have between 11 and 100 and only 114 have over 100 posts.

However, as was discussed in the comments on the recent post, it is unclear whether closed lists which are no longer in use can be made open. Does the 30 year rule which, according to Wikipedia, states that “Public records ….other than those to which members of the public have had access before their transfer …., shall not be available for public inspection until they have been in existence for [thirty] years or such other period….as the Lord Chancellor may,…. for the time being prescribe as respects any particular class of public records” apply to JISCMail lists? But, as Chris Rusbridge has pointed out, Section 7 of the JISCMail Acceptable Use Policy states that:

Messages sent to a JISCMail list will normally be archived, and these archives can then be retrieved by any member of that same list. These archives may also (at the discretion of the listowner) be made publicly available on the web, and thus be available to anyone. … 

Archives or collections of the messages sent to a JISCMail list may not be made publicly available at another site unless the listowner has granted explicit permission, and the list members have been informed.

It would therefore appear that as listowner I can make the LIS-ELIB-MANAGERS archive available (and also make the archive available elsewhere) provided I inform the list members. However, although the FAQ suggests that the decision on opening access to a closed list resides with the list owner, the list owner will need to decide whether it is appropriate for a list to be made open. Clearly there may be lists which contain confidential, sensitive, embarrassing or even potentially illegal content which should not be made available. In addition, as described in a JISCMail page on Copyright:

 When you send a message to a JISCMail list, you retain your copyright in that message. You also retain your moral right to be identified as the author of the work, and your moral right against derogatory treatment.

The extent to which your message is made available across the internet will depend on the level of access that has been decided by the listowner.

What processes should be taken to decide whether or not to open up a closed list archive?  This post describes the processes which are being taken for the LIS-ELIB-MANAGERS archive.

Processes For Informing List Members

Auditing The List

The LIS-ELIB-MANAGERS list currently has 26 members. A message was sent to the list in order to see how many of the email addresses were still valid. There were 11 bounced messages but only four people replied to a request to respond to a message sent to the list. It does not seem to be possible to find out how many people in total have subscribed to a list. For data protection reasons when users leave JISCMail, their name and email address are removed from the JISCMail database. However the ownership of email messages relates to list members who have posted to a list and not to those who have only lurked on a list. It therefore would seem feasible to explore information about the numbers of people who have posted to the list and the number of messages they have posted.

Unfortunately there doesn’t seem to be an easy way of getting reports on the numbers of people who have posted to a JISCMail list or the number of messages they have posted. I therefore used the advanced search function to search for the numbers of messages posted for each year.

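| Year | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nos. of posts | 147 | 113 | 209 | 46 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 546 |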

It would be useful if information on the numbers of messages posted by list members could be obtained. However this does not seem to be provided. I therefore looked at the list archives in order to find the email addresses of people who had posted to the list and searched for each email address in order to see the total number of messages posted from that address. I did this for 30 users, including those whose names were familiar to me and who I felt were likely to have posted significant numbers of messages to the list. The details are given below.

In addition I skimmed through some of the messages in order to gain a feel for the issues being discussed and to see if there appeared to be any sensitive topics being discussed or flame wars breaking out. As can be seen from the list of subjects which are illustrated, there don’t appear to be any sensitive issues being routinely discussed. However subsequently I came across one post which contained personal information about a member of the community which I feel should be deleted if the list archives are to be made open.

| | Name | Nos. of posts |
|---|---|---|
| 1 | Chris Rusbridge | 119 [118+1] |
| 2 | Elizabeth Graham | 50 |
| 3 | John Kirriemuir | 29 [16 (UKOLN), 6 (ILRT) + 7 (OMNI)] |
| 4 | Kelly Russell | 24 [13 + 8 + 3] |
| 5 | Rosemary Russell | 14 |
| 6 | Janine Packard | 13 |
| 7 | Catherine Edwards | 13 |
| 8 | John Paschoud | 12 |
| 9 | Verity Brack | 11 |
| 10 | Brian Kelly | 11 [10+1] |
| 11 | Astrid Wissenburg | 8 |
| 12 | Tony Gill | 8 |
| 13 | Stephen Smith | 8 |
| 14 | Jill Foster | 7 |
| 15 | Philip Hunter | 6 |
| 16 | Dee Wood | 6 |
| 17 | Lorcan Dempsey | 5 [4+1] |
| 18 | Hugh Brailsford | 5 |
| 19 | Kay Flatten | 5 |
| 20 | Bruce Royan | 5 |
| 21 | Roddy Macleod | 4 |
| 22 | Ioanna Dandolos | 4 |
| 23 | Stephen Pinfield | 4 |
| 24 | John Kelleher | 4 |
| 25 | Liora Rolfe Stubbs | 3 |
| 26 | Tom Wilson | 3 |
| 27 | Nicky Ferguson | 3 |
| 28 | Isabel Stark | 3 |
| 29 | Hazel Gott | 2 |
| 30 | Anne Ramsden | 2 |
| | TOTAL | 391 |

In the above table it should be noted that one person (John Kirriemuir) posted from three different organisational addresses. In addition four others posted from different variants of the same email address (e.g. foo@ukoln.ac.uk and foo@ukoln.bath.ac.uk).

This table does not include everyone who has posted and also does not necessarily include information on those who have posted significant numbers of messages, since there are 155 messages not attributed to a sender. However we seem to have listed the most active participants, including those who worked for the eLib programme team and those who worked at UKOLN, which hosted the eLib programme Web site and was actively involved in the design of the eLib programme. Having skimmed through the list archives, especially for the most active period in 1996-1998, it seems that many of the remaining posts will have been from a long tail of people posting informational messages about their projects, events, publications, etc.
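Had a copy of the raw messages been available, an audit of this kind could have been scripted rather than carried out through repeated searches. The following is a minimal sketch under that assumption: it is not a JISCMail facility, the function name and the `aliases` mapping are hypothetical, and the sample messages are invented (the two addresses are the illustrative variants mentioned above). It counts posts per sender and per year, folding variant addresses together.

```python
from collections import Counter
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def audit(raw_messages, aliases=None):
    """Count posts per sender and per year for raw email message strings."""
    aliases = aliases or {}
    per_sender, per_year = Counter(), Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        _, address = parseaddr(msg.get("From", ""))
        address = address.lower()
        # Fold known address variants into one canonical address.
        per_sender[aliases.get(address, address) or "(unattributed)"] += 1
        try:
            per_year[parsedate_to_datetime(msg.get("Date", "")).year] += 1
        except (TypeError, ValueError):
            pass  # missing or unparseable date: leave out of the yearly count
    return per_sender, per_year


# Invented sample messages for illustration only.
sample = [
    "From: Foo <foo@ukoln.ac.uk>\nDate: 4 Mar 1996 10:00:00 +0000\n\nFirst post\n",
    "From: foo@ukoln.bath.ac.uk\nDate: 9 Jun 1997 09:30:00 +0000\n\nSecond post\n",
    "Date: 1 Feb 1996 08:00:00 +0000\n\nNo sender recorded\n",
]
senders, years = audit(sample, aliases={"foo@ukoln.bath.ac.uk": "foo@ukoln.ac.uk"})
print(senders)
print(years)
```

Comparing the sum of the per-sender counts with the yearly total gives the same kind of check as the figures above, where 546 posts in total less the 391 attributed to the 30 listed posters leaves 155 unaccounted for.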

Policy and Processes for Changing Access to the List Archives

Following this audit I have been in touch with Chris Rusbridge and Lorcan Dempsey in order to solicit feedback on the following proposed policy and implementation processes:

Information on the audit of the lis-elib-managers JISCMail list will be published and promoted to those who were active in the eLib community in order to solicit their views on opening access to the lis-elib-managers-archives.

Current and previous list members will be informed that the list owner and others involved in managing the list when it was being actively used feel that the list was made closed so that it could address a particular audience and minimise distractions.

Posts which are discovered which contain personal information which we feel may be inappropriate to be published openly will be deleted.

Individuals who have posted to the list who may have concerns regarding issues related to confidentiality, legality and related issues for their posts can request further information about their posts.

If there are no specific concerns raised after a period of a month the list archives will be made open. This policy on openness will allow the archives to be published elsewhere, such as on the Markmail.org service. If concerns are raised these will be discussed by Brian Kelly, Chris Rusbridge and Lorcan Dempsey.

Rachel Bruce and Neil Grindley from the JISC, who will have interests in preservation policies, will be informed of the proposed change in status of the list and the processes which have been used prior to this change.

In brief the process for opening up access to the mailing list archive which may be applicable for other lists consists of:

  • Auditing the archive in order to identify the numbers of people who have posted messages and the numbers of messages that have been posted.
  • Identifying the reasons why the list was set up as a closed list.
  • Gaining an understanding of possible risks in opening up access to list archives.
  • Formulating a policy decision with key stakeholders.
  • Communicating the policy and gathering feedback.
  • Analysing the feedback and reviewing any changes to the proposed policy.
  • Implementing the policy.

Conclusions

This post began by describing the value which text mining tools can potentially provide by exploring the contents of email archives. It is important to note that such text mining need not be carried out by the organisation hosting the archives; indeed there may be advantages in allowing an email distribution service to focus on the challenges in delivering large volumes of email for the higher education sector and allowing other organisations with expertise in data mining to provide this service.

The proposed changes to the policy will also allow the content to be reused elsewhere, such as the Markmail.org service.  As can be seen, this contains a large amount of content about JISC (1,235 messages from the 8,371 lists it currently indexes).  However this does not include lists which are hosted by JISCMail, due to the JISCMail policy which prohibits archives from being hosted elsewhere, without the permission of the list owner.  I hope that this post has outlined one way in which a closed list can be made open and such openness exploited by enabling a service which can demonstrably add value to be allowed to make use of the valuable archives provided by the JISCMail service.

Is this an appropriate approach?  I’d welcome your feedback.


Posted in preservation | 4 Comments »

Policies on Unused JISCMail Lists

Posted by Brian Kelly (UK Web Focus) on 17 August 2011

Last week I received an email from JISCMail which invited me to state whether an unused mailing list should be retained or deleted:

Lists: LIS-ELIB-MANAGERS

Your JISCMail list(s) have not been used for over 3 years. Please email to helpline@jiscmail.ac.uk to confirm whether the list should now be deleted or retained. If you choose deletion, let us know if you would like a zipped copy of the archives for your records.

Back in January 2010 I wrote a post on Decommissioning / Mothballing Mailing Lists in which I discussed policies and processes for decommissioning and mothballing lists:

How should a list owner go about deleting unused lists? And aren’t there dangers that deleting the contents of lists which may have been used to influence the research process or provide possibly valuable historical insights on the content area covered by the list would be regarded as a mistake by future generations?

Following the subsequent discussions I decided on the policy for unused lists which I owned: I disabled postings to the lists and updated the list description accordingly. For example the DNER-TECH list now states:

List to discuss technical issues relating to the establishment of the Distributed National Electronic Resource. These issues should particularly relate to inter-operability matters. Other topics may be introduced later. THIS LIST IS NOW CLOSED.

I have decided not to delete the unused lists as the lists I own tend to have been used to discuss various aspects of early developments of digital library initiatives in the sector and I feel that the issues which were discussed could provide information which may have some value from an historical perspective. For example ten years ago on the DNER-TECH list there were discussions of “issues related to deploying the Bath Profile, the emerging proposals for ‘Z39.50 Next Generation’ (ZNG), and presentations by a number of UK-based projects with significant experience of deploying Z39.50 applications in a number of domains”. This message can therefore provide evidence of the interest in Z39.50 at that time.

You could, of course, manage the content by requesting a zipped copy of the archive (although note that the Web page on deleting a list, somewhat confusingly called Deleting a Group, does not provide any further information, including details of the contents held in a zipped archive – will, for example, this include details of the members of the list?). But this would mean the resource being deleted from its original location, which will make it more difficult for other interested parties to find this information. To be honest I can’t see the point of requesting a zipped copy for most open lists, especially since the existing JISCMail archive provides a rich archive which may be of value and provides an interface (using JISCMail commands) which potentially could support data mining of these resources. However closed lists, such as the LIS-ELIB-MANAGERS list which I own, may need to be treated differently, since it would probably be inappropriate to retrospectively provide open access to such archives (will there be a 30 year limit, I wonder, before the general public can see what Chris Rusbridge, Lorcan Dempsey and eLib project managers were discussing on this list?!).

On further reflection it does seem to me that JISC-funded projects should probably have a policy on the management of legacy lists related to the project work.  There is, for example, a requirement for Web sites to be maintained for at least three years after the funding has ceased.   What should the policy be on mailing lists? And what practices should be implemented once a list archive is felt to be no longer of interest?  I would welcome comments from other list owners on how they are managing any unused lists they own.

Posted in preservation | 9 Comments »

Blog Preservation and Plugins

Posted by Brian Kelly (UK Web Focus) on 18 July 2011

Best Practices for Blog Preservation

A paper entitled “Moving From Personal to Organisational Use of the Social Web” described best practices for exploiting services such as blogs which were hosted in the Cloud. The paper further developed guidelines initially outlined in  a paper on “Approaches To Archiving Professional Blogs Hosted In The Cloud” including advice on managing the closure of a blog:

Monitoring of technologies used: Information on the technologies used to provide the blog including blog plugins, configuration options, themes, etc. can be useful if a blog environment has to be recreated.

It should be noted that advice on managing blogs hosted in the Cloud might also need to be applied to blogs hosted within the institution.  As an example we implemented the above recommendation for the IWMW 2010 blog. The final post on the blog was entitled Closing the 2010 blog. In this post we documented how the blog was used (numbers of posts; numbers of contributors; etc.) and the technologies used (the theme used and details of WordPress plugins which had been installed). A year later we discovered how useful it was to have provided documentation on the plugins used in the blog.

The blog environment we used to host the IWMW 2010 blog a year ago had to be upgraded. We used this as an opportunity to provide a more robust environment for additional blogs to support IWMW events, including the IWMW 2011 event.

However after the upgrade we discovered that the WordPress plugins had reverted to the defaults, with additional content which had been embedded in the blog, including the video interviews which had been published on the blog, missing from the posts. I now recall that this isn’t the first time this has happened – following a WordPress upgrade on the JISC Inform platform the plugins and the theme used on the JISC PoWR were lost and the environment had to be recreated from memory.

However the final post published last year provided the following record of the plugins which had been installed:

Details of plugins used: Akismet, Buddy Press, BP Disable Activation, Google Analyticator, Lifestream, Lux Vimeo

We subsequently re-installed the Lux Vimeo plugin – but found that the videos failed to re-appear. It seems that loss of the plugin also resulted in losing the embed code, which included the address of the videos.

Fortunately each of the posts also included a direct link to the resource on Vimeo (as illustrated in the screen shot which shows a blog post for which the video has been embedded and one for which it is still missing).  We were therefore able to re-establish the embedded video – although we decided to do this using the Embed Object plugin since this seems to provide richer functionality (and we updated the final post so that we have documented these changes).

The need to include links to remote content in addition to embedding such content was described in a post which advised Don’t Just Embed Objects; Add Links To Source Too! In this case the advice was provided in order to enhance access to content on mobile devices, in cases in which Flash-based embedding technologies were not supported.   We have now discovered another reason for providing such links – embedding addresses in plugins may result in the address being lost if the plugin becomes unavailable.

Best Practices for Live Blogs

The advice we had developed for those who make use of blogs stated that when archiving a blog:

Monitoring of technologies used: Information on the technologies used to provide the blog, including blog plugins, configuration options, themes, etc. can be useful if a blog environment has to be recreated.

It seems that such advice should also be followed for live blogs which are hosted on a blog platform which may be upgraded.  And since all blog platforms are liable to be upgraded the advice provided for blog preservation purposes would appear to be applicable more generally.  We are therefore applying this advice for the IWMW 2011 blog and the About page for this blog has also been updated accordingly.
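One way of keeping such a record is sketched below. This is only an illustration: the environment_record function and the JSON layout are my own assumptions rather than anything provided by WordPress, and the theme is left empty because its name is not given here. The plugin names are those listed in the final post on the IWMW 2010 blog.

```python
import json
from datetime import date


def environment_record(blog_name, theme, plugins, recorded_on=None):
    """Build a simple, human-readable record of a blog's technical environment.

    The field names are hypothetical; any consistent layout would do.
    """
    return {
        "blog": blog_name,
        "recorded_on": (recorded_on or date.today()).isoformat(),
        "theme": theme,
        "plugins": sorted(plugins),
    }


record = environment_record(
    "IWMW 2010 blog",
    theme=None,  # theme name not stated in this post
    plugins=[
        "Akismet",
        "Buddy Press",
        "BP Disable Activation",
        "Google Analyticator",
        "Lifestream",
        "Lux Vimeo",
    ],
    recorded_on=date(2011, 7, 18),
)

# The JSON text could be pasted into an About page or a closing post.
print(json.dumps(record, indent=2))
```

A record like this could be regenerated before each platform upgrade, so that the plugins can be re-installed from the list rather than from memory.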

Posted in preservation | Leave a Comment »

Archiving Blogs and Machine Readable Licence Conditions

Posted by Brian Kelly (UK Web Focus) on 21 April 2011

Clarifying Licence Conditions When Archiving Blogs

UKOLN’s Cultural Heritage blog has recently been frozen following the cessation of funding from the MLA (a government body which is due to be shut down shortly).

As part of the closure process for our blog we have provided a Status of the Blog page which summarises the reasons for the closure, provides a  history of the blog, outlines various statistics about the blog and provides some reflections of the effectiveness of the blog.

Another important aspect of the closure of a blog should be the clarification of the rights of the blog posts. This could be important if the blog contents were to be reused by others – which could, for example, include archiving by other agencies.

As shown a human readable summary was included in the sidebar of the blog which states that the content of the blog is provided under a Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England & Wales License.

The sidebar also defined the scope of this licence which covered the textual content of blog posts and comments which were submitted to the blog.  It was pointed out that other embedded objects, such as images, video clips, slideshows, etc, may have other licence conditions.

However automated tools will not be able to understand the licence conditions.  What is needed is a definition of the licence in a format suitable for automated reading. This has been implemented using a simple use of RDFa which is included in the sidebar description.  The HTML fragment used is shown below:

<img alt="Creative Commons License" src="http://i.creativecommons.org/l/by-nc-sa/2.0/uk/88x31.png" /> This blog is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/2.0/uk/" rel="license">Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England &amp; Wales License</a>.
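As a rough illustration of how an automated tool might read this statement, the sketch below uses Python's standard html.parser module to collect the address of any link marked with rel="license". The class and function names are my own, and this is a minimal sketch rather than a description of how any existing harvesting tool works.

```python
from html.parser import HTMLParser


class LicenceLinkParser(HTMLParser):
    """Collect the href of every <a> or <link> element carrying rel="license"."""

    def __init__(self):
        super().__init__()
        self.licences = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="license nofollow"
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "license" in rel_tokens and attributes.get("href"):
            self.licences.append(attributes["href"])


def find_licences(html):
    """Return the licence URLs declared in a fragment of HTML."""
    parser = LicenceLinkParser()
    parser.feed(html)
    return parser.licences


sidebar = (
    'This blog is licensed under a '
    '<a href="http://creativecommons.org/licenses/by-nc-sa/2.0/uk/" rel="license">'
    'Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: '
    'England &amp; Wales License</a>.'
)

print(find_licences(sidebar))
```

Note that this only finds the licence for the page as a whole. It says nothing about the scope of the licence or about embedded objects.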

How might software process such information? One example is the OpenAttribute plugin which is available for the Firefox, Chrome and Opera browsers. This is described as a “suite of tools that makes it ridiculously simple for anyone to copy and paste the correct attribution for any CC licensed work“. Use of the OpenAttribute plugin on the Cultural Heritage blog is illustrated below.

Assigning Multiple Licences To Embedded Objects in Blogs

The image above shows the licence for the blog in its entirety.  However the blog is a complex container of a variety of objects (blog posts from multiple authors;  comments from readers and embedded images and other objects from multiple sources)  and each of these embedded objects may have its own set of licence conditions.

How might one specify the licence conditions of such embedded objects?  In the case of the Cultural Heritage blog there was a statement that any comments added to the blog would be published under a Creative Commons licence so although anybody making a comment did not have to formally accept this licence condition, in practice we can demonstrate that we took reasonable measures to ensure that the licence conditions were made clear.

In order to specify the licence conditions for embedded images we initially looked at the Image Licenser WordPress plugin.   However this provides a mechanism for assigning licence conditions as images are embedded within a post, which are then made available as RDFa.  Since in our case we were looking at retrospectively assigning licence conditions to existing images (in total 151 items) it was not realistic to use this tool.

The Creative Commons Media Tagger provides the ability to “tag media in the media library as having a Creative Commons (CC) license“. But what licence should be assigned to images on the blog?  These include screen images and photographs which may have been included by guest bloggers but which have not been explicitly assigned a Creative Commons licence.  The question of  Who owns the copyright to a screen grab of a website? was asked recently on ecademy.com with a lack of consensus and a patent and trade mark attorney providing the less than helpful suggestion that “It is better to include a link to the original work if it is on the Web rather than to copy it.“ The uncertainties regarding ownership of screen shots are echoed in a Wikipedia article which states:

Some companies believe the use of screenshots is an infringement of copyright on their program, as it is a derivative work of the widgets and other art created for the software. Regardless of copyright, screenshots may still be legally used under the principle of fair use in the U.S. or fair dealing and similar laws in other countries.

In light of such confusions there is a question as to what licence, if any, should be assigned to images in the blog. As described in the Creative Commons Media Tagger FAQ it is possible to run the plugin in batch mode to “tag media that was already in your media library prior to installing and activating CC-Tagger“. It occurred to me that it would be best to assign a non-CC licence by default to all images and then to manually assign an appropriate CC licence to images such as those taken from Flickr Commons in a post entitled “Around the World in 80 Gigabytes“. However using the batch mode of the tool appeared not to change the content – and it is unclear to me whether there is a way of providing a machine-readable statement in RDFa stating that a resource is not available with a Creative Commons licence.

Using the Image Licenser tool on an individual image resulted in the following HTML fragment which illustrates how a machine readable statement of the licence conditions can be applied to an individual object:

<img class="size-medium wp-image-2206" title="Flickr Commons" src="http://blogs.ukoln.ac.uk/cultural-heritage/files/2011/02/flickr-commons-300x205.jpg" alt="image of flickr commons home page" width="300" height="205" />

Discussion

Whilst finalising this post I asked on Twitter “Is it possible to use RDFa to provide a machine-readable statement that an image *doesn’t* have a CC licence? …” and followed this by describing the context: “.. i.e. have a blog post with CC licence for content but want to clarify lience for embedded objects. #creativecommons“.  Subsequent comments from @patlockley and @jottevanger helped to identify areas for further work which I hadn’t considered – I have kept an archive of the discussion to ensure that I don’t forget the points which were made. A summary of my thoughts is given below:

Purpose: Why should one be interested in ways of specifying the licence conditions of objects embedded in blog posts? My interest relates to archiving policies and processes for blogs.  For example if an archiving service chooses to archive only blogs for which an explicit licence is available there will be a need to ensure that such licences are provided in a machine-readable format in order to allow for automated harvesting.  There will also be a need to understand the scope of such licences. In addition to my interests, those involved in the provision of or reuse of OER resources will have similar interests for reusing blog posts if these are treated as OER resources.  Finally, as  @jottevanger pointed out, this discussion is also relevant more widely, with Jeremy’s interests focussing on complex Web resources containing digitised museum objects.

Granularity: What level of granularity should be applied – or perhaps this might be better phrased as what level of granularity is it feasible to apply machine readable licence conditions for complex objects? Should this be at the collection level (the blog), the item level (the blog post) or for each component of the object (each individual embedded image)?

Risks: Should one take a risk averse approach, avoiding use of a Creative Commons licence at the collection level since it may be difficult to ensure that each individual item has an appropriate Creative Commons licence? Or should one state that by default items in the collection are normally available under a Creative Commons licence, but there may be exceptions?

Viewing tools: What tools are available for processing machine understandable licence conditions? What are the requirements for such tools?

Creation tools: What tools are available for assigning machine understandable licence conditions? What level of granularity should they provide? What default values can be applied?

I know that in the OER community there are interests in these issues.  I would be interested to hear how such issues are being addressed and details of tools which may already exist – especially tools which can be used with blogs.

Posted in openness, preservation | Leave a Comment »

A Few Days Left to Download a Structured Archive of Tweets

Posted by Brian Kelly (UK Web Focus) on 17 March 2011

On 21 February 2011 John O’Brien, developer of the Twapper Keeper Twitter archiving service, announced the “Removal of Export and Download / API Capabilities“. In a subsequent video interview John explained the reasons for the removal of this service, which arose following Twitter’s announcement that it was enforcing its policy that third party services are not allowed to syndicate or redistribute tweets. Following Twitter’s ‘cease and desist’ email the removal of Twapper Keeper’s export capabilities and APIs will take place on 20 March – in a few days’ time.

The popularity of the Twapper Keeper service (which has a total of 2,410,061,623 tweets across 21,475 archives) has demonstrated a clear need for Twitter archiving – and it seems that Twitter wishes to be able to commercially exploit such popularity. I would guess that other services, such as Martin Hawksey’s iTitle Twitter captioning service, are further examples of innovative approaches which Twitter will be seeking to exploit commercially.

Last year’s JISC-funded developments to the Twapper Keeper service included making the software available under a Creative Commons licence. If you visit the Your.TwapperKeeper.com site you will be able to download the software which can be run on your own server. Clearly you would not be able to simply replicate a public Twapper Keeper service, but if Twitter’s terms and conditions are aimed at stopping public redistribution of tweets it would appear possible to install the software on an institutional Intranet – although I should admit that IANAL.

It should be pointed out that the Twapper Keeper service will continue to archive tweets which can be accessed via the HTML interface – what is being lost is API access and the ability to download a structured archive of tweets in, for example, MS Excel format with columns of the tweets, Twitter userid, date and time information, geo-location information, etc. Such structured information is, as Twitter is very well aware, valuable for developers who wish to carry out richer data analysis or provide additional value-added services on top of the conventional Web-based display of tweets.
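To give a flavour of the kind of analysis which a structured archive makes possible, the sketch below counts tweets per user from a CSV export. It is only an illustration: the column names (text, from_user, created_at) and the three sample rows are invented for the example and are not taken from a real Twapper Keeper download.

```python
import csv
import io
from collections import Counter

# Made-up sample data standing in for a downloaded archive.
SAMPLE_EXPORT = """text,from_user,created_at
First test tweet #example,alice,2011-03-01 10:00:00
Second test tweet #example,bob,2011-03-01 10:05:00
Third test tweet #example,alice,2011-03-01 10:07:00
"""


def tweets_per_user(csv_file):
    """Count how many tweets each user contributed to an archive."""
    reader = csv.DictReader(csv_file)
    return Counter(row["from_user"] for row in reader)


counts = tweets_per_user(io.StringIO(SAMPLE_EXPORT))
print(counts.most_common())
```

This sort of question is awkward to answer from an HTML display of tweets but is straightforward once the data is held in columns.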

It is still possible for a few days to download such structured archives from Twapper Keeper. I have recently looked at the details of my TwapperKeeper archives. I have decided to keep a local archive of tweets associated with a number of talks I have given. However I don’t intend to keep structured archives which are primarily of interest to event organisers (such as those for the ALT-C, JISC and CETIS conferences). I have also decided to keep a record in the list below of the decisions I have made. Note that an example of a local archive can be seen for the seminar I gave last year at the University of Girona.

Archive Type | Name | Description | Policy | # of Tweets | Create Date
#Hashtag | #a11y | Accessibility (a11y) | Archive not kept as this subject based archive is not directly related to my key areas of work. | 42427 | 04-25-10
#Hashtag | #accbc | CETIS/BSI Accessibility SIG meeting. | Local archive not kept as I was a speaker at this recent event. | 154 | 02-28-11
#Hashtag | #altc2009 | The ALTC 2009 conference | Archive not kept as this event-based archive will primarily be relevant to the event organisers. | 4737 | 08-28-09
#Hashtag | #altmetrics | New approaches for developing metrics for scholarly research | Archive not kept as this subject-based archive will primarily be relevant to others with an interest in the subject area. | 158 | 01-15-11
#Hashtag | #Ariadne | The Ariadne hashtag – which may be used for UKOLN’s Ariadne ejournal. | Archive not kept as this subject-based archive will primarily be about topics other than UKOLN’s Ariadne ejournal. | 11897 | 09-21-10
Keyword | Ariadne | Archive of tweets containing the string ‘Ariadne’ | Archive not kept as this subject-based archive will primarily be about topics other than UKOLN’s Ariadne ejournal. | 25598 | 09-21-10
@Person | ariadne_ukoln | Tweets about the Ariadne web magazine. | Local archive kept. | 882 | 05-28-10
@Person | briankelly | Tweets about Brian Kelly | Personal archive kept. | 6471 | 03-19-10
#Hashtag | #CETIS | The CETIS service, based at the University of Bolton. | Archive not kept as this organisational archive will primarily be of relevance to the host institution. | 2836 | 09-24-10
#Hashtag | #CILIP | CILIP, the Chartered Institute of Library and Information Professionals. | Archive not kept as this organisational archive will primarily be of relevance to the host institution. | 4494 | 09-24-10
#Hashtag | #CILIP1 | Campaign on future of CILIP organisation based on CILIP’s 1-minute messages. | Archive not kept as this campaign-based archive will primarily be of relevance to the host institution. | 357 | 06-13-10
#Hashtag | #CSR | Comprehensive Spending Review | Archive not kept as this subject archive will primarily be of relevance to others. | 79799 | 10-15-10
#Hashtag | #falt09 | ALTC Fringe | Archive not kept as this event-based archive will primarily be of relevance to others. | 219 | 08-28-09
#Hashtag | #heweb10 | Tag for the HigherEdWeb 2010 conference | Archive not kept as this event-based archive will primarily be of relevance to others. | 8723 | 09-28-10
#Hashtag | #ipres10 | Tweets for the iPres10 conference, Vienna, 19-24 Sept 2010. | Archive not kept as this event-based archive will primarily be of relevance to others. | 2 | 08-27-10
#Hashtag | #ipres2010 | Archive for the IPres 2010 conference to be held in Vienna on 19-25 Sept 2010. | Archive not kept as this event-based archive will primarily be of relevance to others. | 1397 | 08-27-10
@Person | iwmwlive | IWMW live blogging account | Local archive kept. | 1373 | 04-30-10
#Hashtag | #jisc10 | JISC 2010 conference | Archive not kept as this event-based archive will primarily be of relevance to others. | 2059 | 04-02-10
#Hashtag | #jiscpowr | Archive of tweets related to the JISC PoWR project provided by UKOLN and ULCC | Archive not kept due to low numbers of tweets. | 6 | 07-09-10
#Hashtag | #jiscpowrguide | Archive of tweets about the Guide to Web Preservation published by the JISC-funded PoWR project and launched on 12 July 2010. | Archive not kept due to low numbers of tweets. | 2 | 07-09-10
#Hashtag | #ldow2010 | Linked Data on the Web 2010 conference | Archive not kept as this event-based archive will primarily be of relevance to others. | 524 | 04-25-10
#Hashtag | #loveHE | Times Higher Education campaign to support Higher Education in UK. | Archive not kept as this campaign-based archive will primarily be of relevance to others. | 12066 | 06-12-10
#Hashtag | #mdforum | UKOLN’s Metadata Forum | Local archive planned. | 119 | 12-10-10
#Hashtag | #morris | Tweets about Morris dancing | Archive not kept as this social archive will primarily be of relevance to others. | 17813 | 10-16-10
#Hashtag | #oxsmc09 | socialmediaconference | Archive not kept as this event-based archive will primarily be of relevance to others. | 1063 | 09-18-09
#Hashtag | #PhD | Tweets for researchers using the #PhD tag | Archive not kept as this subject-based archive will primarily be of relevance to others. | 28527 | 09-24-10
#Hashtag | #s113 | Workshop session at ALTC 2009. | Local archive kept (will be edited to remove irrelevant tweets posted after event had taken place). | 227 | 09-03-09
#Hashtag | #scl2010 | Scholarly Communication Landscape (SCL): Opportunities and challenges symposium, held at Manchester Conference Centre on 30 November 2010. | Archive not kept as this event-based archive will primarily be of relevance to others. | 39 | 12-02-10
#Hashtag | #ucassm | Social Media Marketing Conference organised by UCAS. | Archive not kept as this event-based archive will primarily be of relevance to others. | 223 | 10-18-10
#Hashtag | #udgamp10 | What Can We Learn From Amplified Events seminar, given by Brian Kelly, UKOLN at the University of Girona. (Local archive available) | Local archive kept. | 395 | 09-01-10
#Hashtag | #ukmw09 | UKMuseumsandtheWeb | Archive not kept as this event-based archive will primarily be of relevance to others. | 750 | 12-05-09
Keyword | ukoln | Tweets about UKOLN | Local archive kept. | 1948 | 03-19-10
#Hashtag | #ukolneim | UKOLN’s Evidence, Impact, Metric work | Archive not kept due to low numbers of tweets. | 45 | 11-05-10
#Hashtag | #w3ctrack | W3C Track at WWW 2010 conference | Archive not kept as this event-based archive will primarily be of relevance to others. | 179 | 04-30-10
#Hashtag | #ww2010 | Misspelling of WWW2010 hashtag | Archive not kept as this event-based archive will primarily be of relevance to others. | 833 | 04-29-10

It should be noted that this list is based on Twapper Keeper archives which I created. There will be a number of other archives which will be of interest to myself and colleagues at UKOLN which may also be archived locally.

Also note that a number of event-based Twitter archives (such as the #s113 archive of a workshop session at the ALT-C 2009 conference) will contain irrelevant tweets due to the hashtag being used for other purposes. Such irrelevant tweets may be deleted from the archive.
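One simple way of removing such tweets from a local archive is to keep only those posted during the event itself. The sketch below assumes that each tweet is held as a dictionary with a created_at timestamp; the sample tweets and the event dates used are illustrative assumptions rather than data from the real #s113 archive.

```python
from datetime import datetime


def tweets_during_event(tweets, start, end):
    """Keep only the tweets whose timestamp falls within the event window."""
    return [tweet for tweet in tweets if start <= tweet["created_at"] <= end]


# Invented examples: one tweet from the event, one later reuse of the hashtag.
sample_tweets = [
    {"text": "Workshop session starting #s113",
     "created_at": datetime(2009, 9, 8, 14, 0)},
    {"text": "Unrelated later use of #s113",
     "created_at": datetime(2010, 2, 1, 9, 30)},
]

# Assumed event window, chosen for illustration only.
event_start = datetime(2009, 9, 8)
event_end = datetime(2009, 9, 10, 23, 59, 59)

kept = tweets_during_event(sample_tweets, event_start, event_end)
print([tweet["text"] for tweet in kept])
```

A date filter will not catch unrelated tweets posted during the event, so some manual checking would still be needed.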

Posted in preservation, Twitter | 1 Comment »

Time to Move to GMail?

Posted by Brian Kelly (UK Web Focus) on 2 March 2011

The University of Bath email service is still down. The problems were first announced on Twitter at 06.02 on 24 February:

The University email is currently running at risk of failure we are working towards a fix – sorry for any disruption caused.

Later that day we heard:

University email will be unavailable for the rest of the day -for alternative use University Instant Messenger Jabber: http://bit.ly/fAshWi

The problems continued the following day and so BUCS (the Bath University Computing Service) announced an interim email service: I can now send and receive email but can’t access any email messages which I received prior to 25 February.  I must admit that this provides a strange feeling of bliss (my email folder is almost empty!), but I  know that the actions which I’m now running behind on will come back to haunt me when the full email service is restored.

Of course communications have continued, particularly on Twitter. I’m pleased, incidentally, that BUCS have been using Twitter as a communications channel to keep their users informed of developments.  It has also occurred to me how I am still able to continue working using Twitter to support my professional activities: how, I wonder, are others at the University of Bath who don’t use Twitter coping?

During this outage, whilst away in London, I suggested that use of Google’s GMail service might be appropriate.  In response I received the ironical reply:

Gmail never breaks. Oh. Wait. http://www.pocket-lint.com/news/38815/gmail-reset-deletes-correspondence-history :)

It seems that on the day Bath University email users were suffering as a consequence of hardware problems on its email servers Gmail was also having problems. As the PocketLint article rather dramatically announced:

Oh dear – looks like Google has dropped the bomb on hundreds of thousands of Gmail accounts, wiping out years of email and chat history.

You can’t trust GMail to provide a reliable email service seemed to be the sub-text of other Twitter followers who responded to my initial tweet.  But is that really the case? I have described the continuing problems with the BUCS email service (which are summarised in a BUCS FAQ). But what is the current status of GMail?

Whilst Computer Weekly has highlighted the problems of use of Web-based email services the CBC News has pointed out that “Gmail messages [are] being restored after bug“.  The article described how  emails “are being restored to Gmail accounts temporarily emptied out two days ago”. This problem was either small-scale (“About 0.02 per cent of Gmail users had their accounts completely emptied“) or significant (“media outlets estimate there are roughly 190 million Gmail users, so about 38,000 were affected”). The problem, caused by a bug which has now been fixed, did not affect me whereas the BUCS email outage clearly has.  Which, I wonder, is the more significant problem?
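For anyone who wishes to check how the two figures quoted in the CBC article relate to each other, the arithmetic can be sketched as follows. The 190 million figure is the media estimate quoted above, not an official Google number, and the variable names are my own.

```python
# Figures as quoted in the CBC News article.
estimated_gmail_users = 190_000_000
affected_per_ten_thousand = 2  # 0.02 per cent is 2 in every 10,000

# Integer arithmetic avoids floating point rounding.
affected_users = estimated_gmail_users * affected_per_ten_thousand // 10_000
print(affected_users)
```

So the small-scale and the significant descriptions are the same incident seen as a proportion and as a head count.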

I have to admit that I have been affected by outages in externally-hosted communications services previously. In September 2009  I wrote a post entitled “Skype, Two Years After Its Nightmare Weekend” which described how “Skype’s popular internet telephone service went down on August 16 [2007] and was unavailable for between two and three days“. This was also due to a software bug (related to MS Windows automated updates) which has been fixed – and I have continued to be a happy Skype user and agree with last year’s Guardian article which described “Why Skype has conquered the world”.

So yes there will be problems with externally-hosted systems, just as there will be problems with in-house systems (and ironically the day before the BUCS email system went down and two days before GMail suffered its problems my desktop PC died and I had to spend half a day setting up a new PC!). It may therefore be desirable to develop plans for coping with such problems – and note that a number of resources which provide advice on backing up GMail have been provided recently, including a Techspot article on “How to Backup your Gmail Account” and a Techland article on “How to backup GMail“.

But in addition to such technical problems there are also policy challenges which need to be considered. At the University of Bath email accounts are deleted when staff and students leave the institution (and for a colleague who retired recently the email account was deleted a day or so before she left). One’s GMail account, on the other hand, won’t be affected by changes in one’s place of study or employment.  In light of likely redundancies due to Government cutbacks isn’t it sensible to consider migration from an institutional email service?  And shouldn’t those who are working or studying for a short period avoid making use of an institutional email account which will have a limited life span?

Posted in General, preservation | 21 Comments »

Link Checking For Old Web Sites

Posted by Brian Kelly (UK Web Focus) on 4 January 2011

Web sites rot.  Over time they’ll start to break.  Not only will increasing numbers of links to external resources start to break but you may also find that the functionality provided within the Web site may start to break.  This may be a problem if Web sites are still being used but are no longer maintained. But what should be done?

From 1999-2000 UKOLN was a member of the EU-funded EXPLOIT project and provided the Exploit Interactive Web magazine. This was followed, from 2000-2003, by the Cultivate Interactive Web magazine.  Since the funding ceased a link check of the Web sites has been carried out annually with the findings published and summaries of any problems documented. Only internal links are checked and the surveys helped us to identify and  fix a number of problems which occurred when the Web site was migrated from a Windows NT service to an Apache server running on a Unix box.  We have also observed a small number of broken links to third party Web site usage services, as illustrated below.

Running the annual link check and documenting the findings takes about 10 minutes. The Exploit Interactive and Cultivate Interactive Web sites are technically quite simple, with little integration with third party services. However as Web sites increasingly make use of content and services provided by third parties there are dangers that such dependencies will cause problems. So perhaps auditing of such services, including project Web sites which are no longer being funded, will become increasingly important.
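The idea of checking only internal links can be sketched in a few lines of Python using the standard library. This is not the tool used for the annual surveys, simply a minimal illustration: the function names are my own, the www.example.org addresses are placeholders, and a real link checker would also need to crawl from page to page and respect robots.txt.

```python
from html.parser import HTMLParser
from urllib.error import URLError
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Gather the href value of every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def internal_links(page_url, html):
    """Return the distinct links on a page which point to the same host."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    # Resolve relative links and drop #fragment parts before comparing hosts.
    resolved = (urldefrag(urljoin(page_url, href)).url for href in collector.links)
    return sorted({url for url in resolved if urlparse(url).netloc == site})


def is_broken(url, timeout=10):
    """Report whether fetching a URL fails (needs network access)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status >= 400
    except (URLError, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 4xx and 5xx.
        return True


sample_page = "http://www.example.org/issue1/index.html"  # placeholder address
sample_html = (
    '<a href="article.html">Article</a> '
    '<a href="/about/">About</a> '
    '<a href="http://other.example.com/">External</a> '
    '<a href="mailto:editor@example.org">Email</a> '
    '<a href="#top">Top</a>'
)
print(internal_links(sample_page, sample_html))
```

Each address returned by internal_links could then be passed to is_broken and the failures written to a short report, which is essentially what the annual check documents.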

Alternatively you could argue that after a period of time such Web sites should be deleted.  We recommended to the EU that project Web sites should be expected to continue to be hosted for at least three years after the funding had expired. We also suggested that this should be a minimum and that organisations should try to continue to host such Web sites for ten years after the funding has finished.  Since the final issue of the Exploit Interactive ejournal was published in October 2000 we have achieved that goal. Should we now delete the Web site? Doing so might save ten minutes a year in checking that the Web site is still functioning, but would mean that articles on a number of EU-funded projects would be lost, including the following which were published in the final issue:

  • ELVIL 2000: Ingrid Cartwell and Magnus Enzell introduce the prototype for the ELVIL 2000 Project, an Academic Portal for European Law and Politics.
  • EQUINOX: Following on from an earlier article in Exploit Interactive, Monica Brinkley provides an update on the EQUINOX project, a Library Performance Measurement and Quality Management System.
  • ILSES: Meinhard Moschner and Repke de Vries describe the development of a specialised networked digital library which integrates publication retrieval and survey data extraction.
  • LIBECON 2000: David Fuegi, John Sumsion and Phillip Ramsdale discuss the LIBECON2000 Project and its Millennium Report.
  • TECUP: Paul Greenwood and Martina Lange-Rein on TECUP, a meta project which analyses practical mechanisms for rights acquisition for the distribution, archiving and use of electronic products.
  • VERITY: Alexandra Papazoglou gives a final report on Project Verity: Virtual and Electronic Resources for Information skills Training for Young people.

I can’t help but feel that the Web site should continue to be hosted. But what should the general policy be for project Web sites? What are others doing for project Web sites whose funding may have ceased ten years ago or five years ago or even more recently?

Note: Coincidentally after publishing this post I received an email containing details of the uptime for the Exploit Interactive and Cultivate Interactive Web sites. I receive an automated email if the Web sites are not available and also receive weekly reports on the server availability, as illustrated below.  Another approach to consider for legacy Web sites?

Posted in preservation | 12 Comments »

“5 Days Left to Choose a New Ning Plan”

Posted by Brian Kelly (UK Web Focus) on 23 August 2010

I received an email on 16 August announcing that I had “5 Days Left to Choose a New Ning Plan“.  The email related to the announcement Ning made a few months ago that the company was withdrawing its provision of free social networks.

We had made use of Ning to provide the IWMW 2008 social network.   The email informed me that “the network has grown up a bit since you started the ball rolling. You have grown to 90 members who have collectively helped you add unique photos, some interesting videos, and 24 spirited discussions“.

What action, if any, was needed in response to this email? The simple answer would be to suggest that nothing needed to be done as the social network was established simply to support an event which took place 2 years ago – so there’s no point in paying the $19.95 annual subscription for the social network to continue to be hosted. But what if the social network (or indeed any other Cloud Service) hosted useful content which I would not like to lose?  So I took the opportunity to evaluate copying the Web site prior to its demise – and I hope that documenting this process will be of interest to others.

The WinHTTrack software was used on Monday 16 August 2010 to create a copy of the IWMW 2008 social network. The mirror is currently hosted on the main IWMW 2008 Web site – although we are making no commitment to hosting the content on a long term basis.

The purpose of the provision of the Ning social network for the event was to provide a communications and collaboration environment for IWMW 2008 delegates and also to gain a better understanding of whether such a service was needed.  We discovered that the usage was low, with only 90 registered members out of 180+ registered delegates and, despite the “spirited discussions” rhetoric in the email from Ning, there was very little use made of the discussion fora on the service.

We kept a record of information provided by the WinHTTrack mirroring software.  Despite the low usage I was surprised to discover that the mirror took 1 hour 42 minutes to run. The mirror is 175 MB and contains 9,065 files and 282 folders.
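If you want to record similar figures for a mirror of your own, they can be recomputed from the mirrored directory itself. This is a rough sketch which assumes the mirror is an ordinary folder on disk; the function name summarise_mirror is my own, and WinHTTrack may count files and folders slightly differently (for example, this sketch does not count the top-level folder).

```python
import os
from typing import Tuple


def summarise_mirror(root: str) -> Tuple[int, int, int]:
    """Return (file_count, folder_count, total_bytes) for a directory tree."""
    files = 0
    folders = 0
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        folders += len(dirnames)  # sub-folders found at this level
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):  # skips broken symbolic links
                files += 1
                total_bytes += os.path.getsize(path)
    return files, folders, total_bytes
```

Keeping these three numbers alongside the date of the crawl gives a small but useful record of what was captured, even if the mirror itself is later deleted.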

Once the mirror had been created the navigational bars were updated to link to the local resource rather than the Ning social network, and a record of the process was documented. In addition a news item was created on the IWMW 2008 event news feed.

Our intention will be to delete this mirror shortly, as we do not feel it provides any useful content. We will, however, be keeping a record that the Ning social network was used and provide a summary of its usage,  so that, for example, we will have a record of the technologies used to support the various IWMW events.

We’ve also decided to publish this summary so that anyone who has an interest in the event’s social network, the tool used to mirror the content or the policy we intend to implement will have the opportunity to give their comments.

This is a summary of how we responded to the announcement of the closure. I wonder what will happen to the 33 Ning social networks I found using a search for ‘JISC’?  One, I noticed, which is described as a “personal portfolio to record and reflect on my work experience”, contains spam for free drugs! There are others, however, which have been used to support the work of the JISC Regional Support Centres (this one, for example), JISC-funded projects (such as this one) and  events (such as this example).

The use of such services to support events, in particular, raises some interesting issues. I have previously suggested that “The lesson I’ve learnt – there’s a need to change the settings for social networks set up to support events after the event is over. I still prefer to make it easy to subscribe to such services, however, in order to avoid any delays caused by the need to accept new subscriptions manually”. But as well as tightening up on access after an event is over in order to avoid spam, are further measures needed?  Should the content be replicated elsewhere? Should the social networking site be closed? Or should we be happy with the default option of simply doing nothing – after all, although the announcement stated that the free service would be withdrawn on 20 August, it is still available today.

Latest News: I have just received an email stating that “we’ve decided to extend the deadline until August 30, 2010.“.

Posted in preservation, Social Networking | Tagged: | 3 Comments »

Decommissioning / Mothballing Mailing Lists

Posted by Brian Kelly (UK Web Focus) on 1 February 2010

The Context

In response to my recent post about usage of JISCMail lists Nicole Harris pointed out some evidence of its popularity. It is clear that although in some sectors there may have been a migration to a diversity of communication and collaboration tools, other sectors are still well-served by email lists.  This is particularly true of museums and public libraries, as I know from experience, being a member of the well-used MCG and lis-pub-libs JISCMail lists.

The Evidence

But what should be done for the lists which are no longer being used to any significant extent?  And following Nicole’s links to statistics on the use of JISCMail I was very interested to see the statistics on the numbers of messages on lists.

As can be seen from the accompanying image (taken from the JISC’s Monitoring Unit Web site), the majority of lists appear to have had zero messages posted in the given time period and the number of such lists has been growing. The number of very active lists, with over 100 messages, is, in comparison, tiny. Of course these lists must be very active as the overall amount of traffic on the lists is still growing.

Although these figures are very surprising they do reflect my findings when I looked at the various lists that I was still subscribed to. For example here are two lists which I had forgotten about:

ADVSERV-CANDM (Advisory Services and Comms and Marketing mailing list)
A list for discussion and dissemination between Advisory Services and Communication and Marketing.
Only a handful of posts between July 2004 and November 2005.

DNER-TECH
List to discuss technical issues relating to the establishment of the Distributed National Electronic Resource. These issues should particularly relate to inter-operability matters.
Posts between August 1999 and October 2004.

In addition to these lists which I am still subscribed to I discovered there are a number of lists which I own which I had forgotten about.  Here are another two examples:

HELPS (Historic Environment List For Projects and Societies) – 180 Subscribers
This list is designed to promote liaison between those recording all aspects of the historic environment, whether as part of a national project, a specialist interest group or locally based society. The list is intended for members to share experiences for the benefit of others, exchange information and provide mutual support.
Discussions in 2004 and only occasional publicity postings since, with the last in June 2007 and July 2008.

INTEROP-CULTURE – 70 subscribers
The mailing list of the international group involved in shaping Interoperable Digital Cultural Content Creation Strategies.
One post in April 2006 but prior to that used from July 2001 to November 2004.

What is to be Done?

Does the existence of many moribund lists matter? This is a question which is very pertinent to UKOLN activities on behalf of the cultural heritage sector in providing advice on digital preservation issues.  The need to make plans for decommissioning services was highlighted by Chris Sexton, UCISA chair, at a recent UCISA meeting in which, as she described in her blog, “We are all going to be faced with spending less, doing more with less, and deciding what we can stop doing”.

Deciding which lists no longer have a useful purpose can be helpful to a number of groups. Users who find the mailing list archives a potentially valuable resource may find that the search interface becomes more usable if the number of lists is decreased (there is no global search of the mailing lists and, as Google is blocked from the archives, searching selected mailing lists is a very time-consuming process). Deleting such lists may also help new users who are seeking relevant lists to join – at present, statistically, they are likely to join a moribund list if they make their selection based on the list descriptions. The JISCMail team may well find the systems management easier if unwanted content is deleted, thus potentially freeing technical expertise which can be used to enhance other aspects of the service.

Policies and Processes For Decommissioning and Mothballing Lists

How should a list owner go about deleting unused lists? And aren’t there dangers that deleting the contents of lists which may have been used to influence the research process or provide possibly valuable historical insights on the content area covered by the list would be regarded as a mistake by future generations?

It would be a mistake, however, to regard digital preservation as simply meaning that digital resources should be kept forever. An important role for those involved in preservation activities is the selection of resources which are felt to be worthy of preservation and the deletion of the rest – and if such deletion activities are ignored there may be significant costs in the ongoing maintenance of the resources.

I’m not aware of guidance for list owners on how they should go about developing policies for mailing lists and associated procedures for implementing such policies. The only relevant information I could find on the JISCMail Web site was a page on renaming or deleting JISCMail lists. This page allows a list owner to give the name of the list to be deleted and request a ZIP file containing the archives, files and list header.

No advice is provided, however, to assist list owners who may be considering deleting lists. It would clearly be inappropriate for a list owner to delete a still-popular list. But at what stage might it be felt that a list should be considered for deletion?  Do posters of messages to the list have any say in the matter (they own the copyright of their messages)? And who should take responsibility for consideration of the long-term importance of messages posted to the list?

In a bottom-up approach to attempting to answer such questions I will describe my thoughts on the DNER-TECH and INTEROP-CULTURE lists.

A summary of these lists is given below.

List: DNER-TECH
Date created: August 1999
List owner: Brian Kelly, UKOLN (although I was initially unaware of this as it used a non-standard variant of my email address)
Status: Open access to archives
Summary of purpose of list, ownership, etc: To discuss technical issues related to the DNER (Distributed National Electronic Resource).
No. of subscribers: 50 (including 5 variants of my email address!)
Period of popularity: Small number of posts (2-3/month?) from 1999-2002.
Period of few and ‘non-essential’ posts (non-essential may include announcements, posts sent to multiple lists, etc.): Last discussion took place in July 2003.
Stakeholder communities and individuals: Software developers from JISC eLib and subsequent DNER (later renamed IE) programme; Chris Rusbridge? (eLib programme director); Rachel Bruce (JISC); UKOLN.
Likelihood of messages being cited in research papers: Unlikely.
Other issues: -
Risks: Closure of this list would have no adverse effect. Deletion of the contents of the list would be unlikely to have an adverse effect, especially in light of the (now-dated) technical content of the list.

List: INTEROP-CULTURE
Date created: July 2001
List owners: Brian Kelly and Rosemary Russell, UKOLN
Status: Login required to view archives
Summary of purpose of list, ownership, etc: Set up by staff in UKOLN
No. of subscribers: 70
Period of popularity: Last posts in November 2004 and April 2006.
Period of few and ‘non-essential’ posts (non-essential may include announcements, posts sent to multiple lists, etc.): List appears to have been announcements only.
Stakeholder communities and individuals: Appears to have been set up for policy makers in cultural heritage organisations.
Likelihood of messages being cited in research papers or contain ‘significant’ content: Very low.
Other issues: Significant number of overseas subscribers.
Risks:  Closure of this list would have an adverse effect. Deletion of the contents of the list would be unlikely to have an adverse effect. However in light of the international aspect of the list it would be prudent to ensure stakeholders have the opportunity to give their views.

Next Steps

Carrying out this research proved interesting in observing how these mailing lists failed to live up to their initial expectations.  But what to do next?  Some may feel that as the costs of the disk storage are trivial there is no need to do anything. However my view is that managed curation of such digital resources is needed.  So I feel that I should send an email to these two lists announcing my intention to delete these lists based on my review of the contents and my assessment of the risks of deleting the content. And since I no longer have an interest in the archives, if anyone wishes to maintain the content they will be welcome to take on ownership of the lists.

But before taking this step I thought I would seek others’ views on these proposals. What do you think should be done?


[Note this post has been updated with an updated chart of JISCMail usage statistics. You can view the original statistics published in the post which covered the period 2003-2007.]

Posted in preservation | 8 Comments »

Are You Able?

Posted by Brian Kelly (UK Web Focus) on 17 February 2009

There were two invited keynote speakers who travelled from Europe to speak at the OzeWAI 2009 conference. As well as my talk (which I described recently) Dr. Eva M. Méndez (an Associate Professor in the Library and Information Science Department at the Universidad Carlos III de Madrid and not the American actor!) gave a talk entitled “I say accessibility when I want to say availability: misunderstandings of the accessibility in the other part of the world (EU and Spain)”.

Eva’s research focuses on metadata and web standards, digital information systems and services, accessibility and Semantic Web. She has also served as an independent expert in the evaluation and review of European projects since 2006, both for the eContentPlus program and the ICT (Information and Communication Technologies) program and her talk was informed by her knowledge of the inner working of such development programmes funded by the EU.

Her talk explored the ways in which well-meaning policies may be agreed with the EU, although such policies may be misinterpreted or misunderstood and fail to be implemented, even by the EU itself.

I don’t have access to Eva’s slides, so I will give my own interpretation of Eva’s talk.

We might expect the EU to support the development of a networked environment across EU countries across a range of areas. These areas might include:

Available: Have resources been digitised? Are they available via the Web?

Reusable: Are the resources available for use by others?  Or are they trapped within a Web environment which makes reuse by others difficult?

Findable: Can the resources be easily found? Have SEO techniques been applied to allow the resource to be indexed by search engines such as Google?

Exploitable: Are the resources available for others to reuse through, for example, use of Creative Commons licences?

Usable: Are the resources available in a usable environment?

Accessible: Are the resources accessible to people with disabilities?

Preservable: Can the resources be preserved for use by future generations?

Since the acronym ARFEUAP isn’t particularly memorable (and ARE-U-API would be too contrived) we might describe this as the Able approach to digitisation. But there is one additional concept which I feel also needs to be included:

Feasible: Are the policies which are proposed (or perhaps mandated) feasible (or achievable)? We might ask are they actually possible (can we make all resources universally accessible to all?)  and can they be achieved with available budgets and with the standards and technologies which are currently available?

There is, of course, a question which tends to be forgotten: is the proposed service of interest to people and will it be used?

The worrying aspect of Eva’s talk was that the EU don’t appear to be asking such questions – or even using the same vocabulary.  We need to have the bigger picture in order to address tensions between these different areas and the question (and power struggles) of how we prioritise achieving best practices – for example, should we be digitizing resources, even if we can’t make them accessible; should we regard access by people with disabilities as being of greater importance than ensuring the resources can be preserved?  And let’s not fudge the issue by suggesting that each is equally important and all can be achieved by use of open standards. That simply isn’t the case – and if you doubt this, ask managers of institutional repositories. They will probably say that they are addressing the available, reusable, findable, preservable and, perhaps, exploitable issues, but I suspect that the repository managers would probably admit that many of the PDFs in the repositories will not be accessible.

Posted in Accessibility, preservation, standards | Tagged: | 3 Comments »

Disappearing Resources On Institutional Web Sites

Posted by Brian Kelly (UK Web Focus) on 16 December 2008

I recently received the publisher’s proofs of an accessibility paper which will be published in the new year. The reviewers spotted a number of broken links in the references. Some of them were links to previous papers I had published, and the errors were introduced by the publisher (which I confirmed by checking the details of the paper which I submitted). But for a couple of other references the pages did seem to have disappeared. I contacted Stuart Smith, one of the co-authors, and asked him if he knew anything about the references he had supplied which seemed to have disappeared.

Stuart told me that a new e-learning team in his institution had rebuilt the e-learning Web site, resulting, it seems, in the loss of existing resources. Stuart wrote a blog post about this incident entitled “Mummy I lost my MP3!”. Stuart felt that “My MP3 problem shows to me that the argument that the ‘cloud’ is too unstable doesn’t hold water … because institutional systems are open to the same criticisms”. Stuart concluded that “My solution to my MP3 problem will probably lie in the ‘cloud’ I’ll find a suitable archiving host that I like and also keep a backup offline (like I should have done in the first place) and if that host disappears at least I will know about it”.

I’m sure Stuart isn’t alone. How many resources do you think will have disappeared following the establishment of new Web teams or the release of new software?  Maybe institutional repositories will have a role to play, as they try to address the persistent identifier problem by at least decoupling the address of the resource from the technology used to access the resource.  But repositories won’t be used to manage all resources on an institutional Web site, will they?

Since our institutions don’t seem to have yet cracked the problem of management of resources across changes in policies, staff and technologies, is Stuart right, I wonder,  in regarding ‘the cloud’ (e.g. services such as the Internet Archive, perhaps) as the place (or one of the places) to deposit resources for safe-keeping?  Or perhaps the question is whether such services may be more reliable than the institutional Web site. After all, if your own institution misplaces your resources, you can’t sue them, can you?

Posted in preservation, Web2.0 | 2 Comments »

The Final JISC PoWR Workshop

Posted by Brian Kelly (UK Web Focus) on 29 August 2008

The final workshop organised by the JISC-funded Preservation of Web Resources (PoWR) project will take place at the University of Manchester on Friday 12th September 2008.

Now you may think that preservation is a pretty dull topic, compared with the exciting developments that are taking place in a Web 2.0 environment. And if that’s what you think, then you’re not alone. As Alison Wildish, head of Web Services at the University of Bath described on the Web Services team blog:

We were asked by our colleagues at UKOLN (who organised the event) to deliver a brief talk detailing our approach to preserving web resources at the University. Our initial reaction was that we had little to say. Lizzie’s remit lies with the paper records and I am responsible for managing our website – ensuring it meets the needs of our users. Neither of us felt web preservation was something we had expertise in nor the time (and for me the inclination) to fully explore this.

And you can even listen to Alison and Lizzie Richmond (University of Bath records manager, archivist and FOI coordinator) expand on this by viewing the Slidecast of the talk they gave at the first JISC PoWR workshop:

If you listen to the end of the Slidecast you’ll hear Alison and Lizzie describing how they discovered in the course of the discussions reasons why Web preservation is a topic which needs to be treated seriously.

But how should one go about Web preservation? What should you preserve? What should one discard? What are the implications of use of Web 2.0 on preservation policies? Whose responsibility is this? What are the costs associated with preservation? And what are the costs and associated risks of not developing and implementing a preservation policy for your Web resources? And how does one ensure that an institutional preservation policy is sustainable and embedded within the institution?

These are some of the topics which have been raised on the JISC PoWR blog and will be discussed at the workshop. But hurry up and book your place, as the deadline for bookings is Friday 5th September. And note that the workshop is free to attend for members of the higher and further education community.

And finally I should point out that the case study given by Alison Wildish and Lizzie Richmond has been saved from being trapped in the non-interoperable world of the past, accessible only to Doctor Who (and even then only on a good day) by recording the talk and synching the recording with the slides and hosting this on Slideshare. You see, preservation can be enhanced through use of Web 2.0 services. Digital preservation can be cool – even though, arguably, it may kill the odd polar bear :-)

Posted in preservation, Web2.0 | Leave a Comment »

Fahrenheit 451

Posted by Brian Kelly (UK Web Focus) on 15 August 2008

I recently attended the JISC’s Innovation Forum. One of the most interesting of the plenary talks was given by HEFCE’s John Selby. In his talk John praised the work of the JISC and the JISC Services, but went on to warn of troubled financial times ahead for the educational sector. The glory days of the past 10 years are over, he predicted.

This was probably not unexpected. What did surprise me, however, were the figures John quoted which put the carbon cost of IT to the environment on a par with the cost of flying – both at 2%.

This generated much debate at the forum and, later on, at the conference meal and in the bar. Although people questioned the accuracy of these figures, and wanted to know how these figures were obtained, there was an awareness that the carbon cost of IT is an issue which the IT sector needs to address. I should add that I subsequently came across details of a forthcoming Government Goes Green conference in which Malcolm Wicks, Energy Minister, BERR was quoted as saying that

ICT is now responsible for around 2% of global CO2 emissions. The public sector, with annual IT spending of £14bn, has an important role to play in reducing this two percent. An increased focus on sustainable procurement and efficient use of IT products are two key areas that it needs to work on and I am very pleased to see a conference dedicated on this.

At the JISC Innovation Forum dinner I found myself sitting next to colleagues from the Digital Curation Centre (DCC). I suggested, partly in jest, that although there was a clear need for continued development of networked services which are popular with the users, we had to ask ourselves whether the costs of preserving digital resources could be justified. If, as we learnt from Alison Wildish’s recent presentation at the first JISC PoWR workshop, those involved in Web development activities tend to focus on the pressing needs of their user communities and find it difficult to justify diverting scarce resources to preserving resources which are no longer of significant interest to the institution, why don’t we stop pushing the notion of digital preservation? Not only would this allow the development community to focus their efforts on responding to pressing user needs – but removing archived files from hard disk drives could result in significant savings in energy.

This approach would then both help the users and help save the planet :-)

As I’ve said this was intended as a joke, over our conference meal. But we realised that there may be benefits for the digital preservation community in making such suggestions. After all, preservation is widely considered as worthy but dull. If digital preservation was regarded as something radical, might it have a greater appeal to developers? Could those involved in digital preservation work – harvesting old Web sites and even implementing OAIS models – find themselves repositioned as members of an underground radical movement, secretly preserving digital artefacts for a society which regards such activities as unacceptable? Fahrenheit 451 for the 21st century, perhaps.

The following day when I suggested this, I was told that there have been discussions about strategies for digital preservation which acknowledge that there are environmental factors which need to be addressed. It seems that there have been proposals that such preservation activities should be based in places such as Greenland and Alaska where the low temperatures may reduce the need for consuming energy to keep the disk drives running at acceptable temperatures.

Now scientists may point out that running large scale server farms in locations near glaciers and the ice cap may increase the rate at which they melt. But the ideas which were bounced around at the event did make me wonder whether centralisation of networked services (e.g. running applications hosted by Google or Yahoo or running our applications on Amazon’s S3 and EC2 servers) would be more beneficial to the environment than all of our institutions running our own local servers.

And perhaps such discussion might be useful in a teaching context. Does data curation, for example, conflict with environmental protection? If so, should we forget it? Or could this approach result in deletion of the very data that could save the planet?

What do you think?

And if you’d like to take part in a viral marketing campaign which seeks to make digital preservation interesting by suggesting that it might be responsible for global warming, feel free to make use of the poster which has been produced. And note that a Creative Commons zero licence (currently in beta) has been assigned to this resource, so you don’t need to cite the original source. Let’s be part of an underground movement :-)

Posted in Finances, preservation | 18 Comments »

Places Still Available on “Preservation of Web Resources” Workshop

Posted by Brian Kelly (UK Web Focus) on 17 June 2008

I’ve previously mentioned the JISC Preservation of Web Resources (JISC-PoWR) project which is being provided by UKOLN and ULCC. The project has established a blog and will be running its first workshop, entitled Preservation of Web Resources: Making a Start, on Friday 27th June 2008 at Senate House, London.

The workshop is aimed at staff in the higher and further education sector with responsibilities for the preservation of institutional Web resources. The workshop will introduce the concept of Web preservation, and discuss the technological, institutional and legal challenges the preservation of Web resources presents. One aspect of Web site preservation might be keeping a history of changes to your institution’s home page. Do you have a digital record of the changes? And do you have a record of why significant changes were made and when? I have been working with colleagues in the University of Bath on ways in which we might address this particular issue. The following video clip, which is available on YouTube, illustrates some of the issues (although if the display is too small you might prefer to view the original resource):

There are still a number of places available on the workshop – which is free to attend for those in the higher and further education sector. But please sign up promptly if you are interested. The timetable is given below:

10:00 – 10:30 Registration and coffee

10:30 – 12:45 Morning Sessions:

  • Presentation: Preservation of Web Resources Part I
  • Breakout session: What are the Barriers to Web Resource Preservation?
  • Presentation: Challenges for Web Resource Preservation
  • Presentation: Legal issues

12:45 – 13:45 Lunch
13:45 – 16:00 Afternoon Sessions:

  • Presentation: Bath University Case Study
  • Breakout session: Preservation Scenarios
  • Presentation: Preservation of Web Resources Part II

16:00 End

Posted in preservation | Leave a Comment »

The SearchMe Visual Service

Posted by Brian Kelly (UK Web Focus) on 13 June 2008

A recent Tweet from Tony Hirst alerted me to the Searchme Visual Search service. An example of use of this service searching for “UKWebFocus” is illustrated below.

The Searchmevisual.com Service

As the name suggests this service provides a visually-oriented approach to searching and, rather than attempting to describe this service I suggest you try it.

I suspect that an initial response from some information professionals would be to highlight the limitations of such an interface, pointing out the difficulties of more advanced searching. However I feel that this would be to overlook the potential of this type of interface to provide browsing functionality. And this, indeed, was the use case made by Tony Hirst:

@briankelly would like a wayback machine browser for home pages over time. http://beta.searchme.com would look neat? Any libraries for it?

I met Tony at the recent CRIG DRY (Don’t Repeat Yourself) Metadata Barcamp held at the University of Bath. Over lunch I mentioned UKOLN’s JISC-PoWR (Preservation of Web Resources) project and described my interest in ways of exploiting content held in the Internet Archive’s WayBack Machine. I suggested that a generic screen-scraping interface to the service would be useful – and when I returned to the Barcamp later that afternoon Tony demonstrated the first version of the software :-) And the following day Tony had started to explore ways of providing a richer user interface to such data. A browse interface such as that used by Search Me Visual could potentially provide a very engaging way of visualising the changes to an organisation’s home page, I would think. And wouldn’t it be great if this could be demonstrated at the JISC-PoWR’s opening workshop on 27 June 2008? Has anyone come across any tools which could do this?
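As a starting point for a home page browser of this kind, here is a minimal sketch of how a list of archived views of a page might be built up, one per year. This is not Tony’s software. It assumes the Wayback Machine’s usual address pattern of a timestamp followed by the original address, with the service showing the capture nearest to the requested date; the function name yearly_snapshot_urls and the choice of 1 January are my own.

```python
from datetime import date
from typing import List

# Assumed address prefix for archived pages in the Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web"


def yearly_snapshot_urls(page_url: str, first_year: int, last_year: int) -> List[str]:
    """Build one Wayback Machine address per year, asking for 1 January."""
    urls = []
    for year in range(first_year, last_year + 1):
        timestamp = date(year, 1, 1).strftime("%Y%m%d")
        urls.append(f"{WAYBACK_PREFIX}/{timestamp}/{page_url}")
    return urls
```

The resulting addresses could be fed to a screenshot tool or a visual browse interface in order to flick through the changes to a home page over time.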

Posted in preservation, Web2.0 | Tagged: , | 4 Comments »

Preservation of Web Resources: Making a Start

Posted by Brian Kelly (UK Web Focus) on 4 June 2008

My colleague Marieke Guy together with the JISC-PoWR project partners at ULCC have announced details of a workshop on “Preservation of Web Resources: Making a Start” – this one-day workshop will take place on Friday 27th June 2008 at the Senate House Library, University of London.

The JISC-PoWR project runs until the end of September 2008 and will run three workshops which will aim to identify best practices for preserving Web sites. The key deliverable of the project will be a handbook which will document the challenges to be addressed in Web site preservation in a number of areas which will include key institutional Web services (e.g. the prospectus), project Web sites (which have clear termination dates) and, a particular challenge for the project, the preservation issues associated with use of Web 2.0 services.

The first workshop will be free to attend (although there will be a penalty for no-shows), with the second workshop being held as part of the IWMW 2008 event at the University of Aberdeen on 23rd July.

Please sign up now if you would like to attend. And if you can’t make it but have an interest in the preservation of Web resources, why not subscribe to the JISC-PoWR blog – and, rather than being a passive reader, join in the discussions. Topics we’d be interested in hearing about include (a) how institutions are currently addressing the preservation of key institutional Web-based services (such as the prospectus); (b) the approaches you may be taking to short-term project Web sites (whether JISC-funded or institutionally-funded) and (c) your views on the preservation of data and services provided by externally-hosted Web 2.0 services.

Posted in Events, preservation | Leave a Comment »

Preserving The Past Can Help The Future

Posted by Brian Kelly (UK Web Focus) on 21 May 2008

Many of the posts featured in this blog describe innovative tools and applications which aim to provide a more effective work or study environment for users. However, there is a danger that an emphasis on new and innovative services leads to a failure to manage legacy services, resulting in a loss of our experiences, history and culture.

This can be particularly true in the Web environment. I first became aware of the scale of the problem when I monitored the Web sites which had been set up for projects funded by the EU’s Telematics For Libraries programme. As I described in an article on WebWatching Telematics For Libraries Project Web Sites, published in the Exploit Interactive e-journal in October 2000, of the 65 projects which had Web sites, 23 of the sites had disappeared by the time I carried out the survey. And a recent check shows that at least 39 of the Web sites have now gone. Our digital history, the associated learning and the investment (from EU taxpayers) is being lost!

Or is it? Is this assertion just being alarmist? Might not the information have been migrated to a more manageable environment? And perhaps some of the projects are now available, possibly under new names, as sustainable services?

There’s a clear need for these issues to be addressed and for advice to be provided – both to organisations responsible for managing their own Web services and to funding bodies which commission work that involves the development of Web sites.

JISC have recognised the need to provide such advice. They recently issued an ITT on “The Preservation of Web Resources Workshops and Handbook” and I’m pleased to report that a joint bid by UKOLN and ULCC was successful. The project, which had its launch meeting on 1 May 2008, will run three workshops which will aim to gain a better understanding of the challenges to be faced in Web site preservation, identify examples of best practices and provide a set of recommendations to policy makers, content providers and developers. This will be documented in a handbook which should be available after September 2008.

Although the project is only funded for 5 months it will seek to provide advice not only on conventional institutional Web sites, but also on use of third party Web 2.0 services – the potential benefits of such services are well-understood, but there needs to be a better understanding of the risks associated with their use and how institutions should assess such risks and use such assessments to inform policy.

The project team members themselves are using a variety of Web 2.0 tools to support their work. As well as communications technologies (beyond email) to support the work of the distributed team members, a blog is also being used to disseminate information about the project and to solicit feedback and encourage discussion and debate. The JISC-PoWR (Preservation of Web Resources) blog (illustrated) is hosted on the JISC Involve blog service.

The team would like to welcome those with an interest in Web site preservation to join the blog and contribute to the discussions.

Posted in preservation | 1 Comment »

Disappearing Public Sector Web Sites

Posted by Brian Kelly (UK Web Focus) on 31 March 2008

I recently used the Intute service to see what records it held about UKOLN’s activities. I found a record about the ‘Crossroads West Midlands’ service, for which UKOLN provided technical advice on the design of the collection description database:

This is the website of ‘Crossroads West Midlands’, a Resource funded project that is working to develop online access to the collections of libraries, museums and archives in the West Midlands (including universities and local authorities as well as private institutions). The Crossroads website is currently a prototype, testing a database built upon the RSLP collection level description database, covering the collections relating to the potteries industry of North Staffordshire.

The record provides additional information about the service which reminded me about the meetings I attended several years ago about this project. I was interested to see what the Crossroads West Midlands service now looks like, so I followed the link to the http://www.crossroads-wm.org.uk/ address – and, rather than a service providing access to a database of cultural heritage resources in the West Midlands, I found a page full of links to services such as golf, gambling, estate agents, motor insurance, etc.

Clearly at some point the domain name for the original service had lapsed and was purchased by a company which used it to host advertisements and links to companies which would be willing to advertise in this way (or possibly companies wishing to enhance their search engine ranking may have procured the services of a Search Engine Optimisation service and might not be aware of the approaches taken).

I was interested in the history of the Web site. Using the Internet Archive I discovered that the Web site was first archived on 26 September 2002. At this point the information in the archive contained details about the project. The service itself was first launched around February 2003. And the service disappeared, to be replaced by an advertisement site, at some point between December 2005 and April 2006.
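For anyone wanting to retrace these steps, the Wayback Machine can be addressed directly by placing a date in front of the original address. The following is a minimal sketch rather than a description of how I carried out the check: the helper name `wayback_url` is hypothetical, and since only the first date above is exact, the first-of-the-month days used for the other checkpoints are assumptions. The Wayback Machine normally redirects such an address to the capture nearest the requested date.

```python
from datetime import date

# Public address prefix used by the Internet Archive's Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, day: date) -> str:
    """Build a Wayback Machine address asking for the capture nearest to `day`."""
    timestamp = day.strftime("%Y%m%d")  # e.g. 20020926
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


site = "http://www.crossroads-wm.org.uk/"

# Only the first date is exact; the others are assumed to be the
# first day of the months mentioned in the post.
checkpoints = {
    "first archived": date(2002, 9, 26),
    "around launch": date(2003, 2, 1),
    "still a project site": date(2005, 12, 1),
    "advertisement site": date(2006, 4, 1),
}

for label, day in checkpoints.items():
    print(f"{label}: {wayback_url(site, day)}")
```

Opening the four addresses in turn should show roughly when the project content gave way to advertising, although the captures returned depend on what the Internet Archive happens to hold.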

What happened? Did project funding run out? Did key staff leave? Or was there a blunder, with nobody receiving the email requesting renewal of the domain name?

Whatever the reason, this West Midlands Crossroads service has disappeared from sight. Is this inevitable? Well, back in 1999 I was the project manager for the Exploit Interactive e-journal, an EU-funded project which ran until 2000. Once the funding had finished we had to decide what would happen with the domain name. We agreed to continue paying for the domain for at least 3 years after the project funding had ceased and would try to keep the domain for a period of 10 years. This policy was informed by a survey I carried out of project Web sites funded by the EU’s Telematics for Libraries programme. As I described in an article published in Exploit Interactive in October 2000, the Web sites for 23 of the 103 funded projects had disappeared.

We are seeing a disappearance of cultural resources and EU-funded projects from the digital environment. And this may well get worse, if the UK Government’s policy of centralising its Web sites, which will result in 551 Web sites being closed down, is not managed properly. Will we, for example, find that the Drugdrive Web site at http://www.drugdrive.com/ suddenly becomes a site used for selling drugs?

What is to be done? The good news is that the Government does seem to be handling its redirects properly – the Drugdrive Web site, for example, is redirected to http://www.drugdrive.com/

Well done, the UK Government. But what about the rest of us? Are we managing the closure of Web sites? And are we assessing the risks of failing to do this? After all, if the domain for a government Web site on protecting children from dangers on the Internet became available and was bought by a pornography site, we could well see a government minister being forced to resign.

Posted in preservation | 3 Comments »