UK Web Focus (Brian Kelly)

Innovation and best practices for the Web

“Battling legal, logistical and technical obstacles to archiving the Web”

Posted by Brian Kelly on 12 Sep 2011

Recent Features on Web Archiving

The recent guest blog post entitled Web archives: more useful than just a ‘historical snapshot’ was quite timely, having been published a few days after a related article in the Times Higher Education (Memory Failure Detected) which described how:

A coalition of the willing is battling legal, logistical and technical obstacles to archive the riches of the mercurial World Wide Web for the benefit of future scholars

The article went on to illustrate a use case for the preservation of Web resources:

It is 2031 and a researcher wants to study what London’s bloggers were saying about the riots taking place in their city in 2011. Many of the relevant websites have long since disappeared, so she turns to the archives to find out what has been preserved. But she comes up against a brick wall: much of the material was never stored or has been only partially archived. It will be impossible to get the full picture.

But, as I describe below, we don’t need to wait until 2031 to have a reason to analyse Web content which may have been thought to be ephemeral.

Analysis of Twitter Usage at Recent ALT-C Conferences

The article in the Times Higher Education referred to an archiving initiative led by the Library of Congress, which is archiving public tweets so that researchers will be able to analyse them at some time in the future. The article could also have mentioned the TwapperKeeper archiving service, which benefited from JISC funding to enhance its archiving capabilities to address the requirements of the UK HE sector. The TwapperKeeper service was used to keep an archive of tweets posted about last week’s ALT-C 2011 conference. The JISC-funded developments to the service included the provision of enhanced API access, which led to the development of the Summarizr analysis service by Andy Powell at Eduserv.

In order to make valid comparisons across annual events I have previously suggested that the Twitter traffic for a full week is analysed, so that discussions in advance of an event and shortly afterwards are included. The Summarizr statistics for tweets at the ALT-C conferences for the past three years are given in the following table.

Note: Following the publication of this post Martin Hawksey pointed out in a comment that the Twapper Keeper archive was not available at the start of the ALT-C 2011 conference, until he created the archive on the opening morning of the conference. An updated column has been published, but note that this does not include tweets from the opening morning of the conference.

| | ALT-C 2009 | ALT-C 2010 | ALT-C 2011 | ALT-C 2011 (updated) |
|---|---|---|---|---|
| Date of event | 8-10 Sept 2009 | 7-9 Sept 2010 | 6-8 Sept 2011 | 6-8 Sept 2011 |
| Dates for analysis | 6-12 Sept 2009 | 5-11 Sept 2010 | 4-10 Sept 2011 (partial archive) | 6-11 Sept 2011 |
| Nos. of tweets | 4,442 | 6,138 | 6,296 | 6,342 |
| Nos. of users | 726 | 658 | 802 | 809 |
| Nos. of URLs tweeted | 701 | 664 | 1,083 | 1,102 |
| Top five twitterers | jamesclay (168), sputuk (113), haydnblackey (112), emmadw (110), JackieCarter (97) | dajbconf (330), timbuckteeth (279), AJCann (174), jamesclay (153), jak82 (111) | digitalfprint (327), timbuckteeth (212), sarahhorrigan (187), FieryRed1 (165), kevupnorth (140) | digitalfprint (327), timbuckteeth (217), sarahhorrigan (187), FieryRed1 (165), amcunningham (141) |
| Top five tweeted hashtags | altc2009 (4,333), jisccdd (108), dubaimetro (84), wheniwaslittle (72), dupedb (64) | altc2010 (6,089), digilit (173), awesome (25), altc2011 (24), fail (23) | altc2011 (6,194), ds106radio (54), altc2012 (42), oer (39), opencountry (35) | altc2011 (6,240), ds106radio (54), altc2012 (42), oer (39), opencountry (35) |
| Nos. of geo-located tweets | 0 (0%) | 35 (0%) | 83 (1%) | 83 (1%) |

Archiving the tweets allows us to provide such analyses, which show the importance of Twitter at such events and identify the people who are particularly active Twitter users at them. The figures also suggest that the amount of Twitter traffic has stabilised over the past two years and that geo-located tweets, although growing in number, are not yet being used to any significant extent.
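The claim about traffic stabilising can be checked with some simple arithmetic on the figures in the table. The following is a minimal sketch in Python; the helper names `percent_change` and `geo_share` are hypothetical and were chosen for illustration only, and the ALT-C 2011 figures are taken from the updated column, which still omits tweets from the opening morning of the conference.

```python
# Figures copied from the Summarizr table above.
# The 2011 values come from the "ALT-C 2011 (updated)" column, which is
# still incomplete (the opening morning of the conference is missing).
tweets = {
    "ALT-C 2009": 4442,
    "ALT-C 2010": 6138,
    "ALT-C 2011 (updated)": 6342,
}
geo_located = {
    "ALT-C 2009": 0,
    "ALT-C 2010": 35,
    "ALT-C 2011 (updated)": 83,
}


def percent_change(old, new):
    """Hypothetical helper: percentage change from old to new."""
    return (new - old) / old * 100


def geo_share(event):
    """Hypothetical helper: geo-located tweets as a percentage of all tweets."""
    return geo_located[event] / tweets[event] * 100


events = list(tweets)

# Year-on-year change in the number of tweets.
for previous, current in zip(events, events[1:]):
    change = percent_change(tweets[previous], tweets[current])
    print(f"{previous} -> {current}: {change:+.1f}% tweets")

# Share of tweets that carried location data.
for event in events:
    print(f"{event}: {geo_share(event):.1f}% geo-located")
```

On these figures the number of tweets grew by roughly 38% between 2009 and 2010 but only by roughly 3% between 2010 and 2011, while geo-located tweets remain at around 1% of the total.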

The Coalition of the Willing – Should Include You

The article published in the Times Higher Education highlighted a number of initiatives designed to archive the broad range of resources available on the Web, including work being undertaken at the British Library, the Library of Congress and the Internet Archive as well as a number of national libraries in Europe.

The emphasis on national and international organisations may give the impression that archiving of Web resources is being addressed by others and so there is no need for individual universities to consider web preservation issues. This is, I feel, a mistaken view. Indeed, not only do those who have a responsibility for the management of institutional digital resources need to address preservation issues, so too do those who manage project resources and, as we have seen above, those who may wish to preserve content associated with events.

JISC has recognised the importance of Web archiving and will be hosting an event on “The Future of the Past of the Web” which will be held at the British Library Conference Centre on 7 October 2011. This free event is the third joint Web archiving workshop which has been organised by the JISC in conjunction with the British Library and the DCC. The event is aimed at:

  • Curators, librarians, archivists interested in the preservation of web resources
  • Organisations that are engaged in web archiving and digital preservation
  • Researchers who depend on access to stable web resources for their research
  • Web developers and content creators who value their content
  • Information managers with responsibility for legal compliance

If this event is of interest to you, note that bookings should be made before 12:00 on Friday 30th September 2011.

4 Responses to ““Battling legal, logistical and technical obstacles to archiving the Web””

  1. mhawksey said

    Hi Brian – Just a note on alt-c 2011 stats. I wasn’t at ALT this year and was dipping into the stream. As I’ve done some work using twitter data (e.g. adding Twitter generated subtitles to keynotes) on the Tuesday morning (6th) I set up a Google Spreadsheet to capture #altc2011 tweets. To make sure I was not missing any data I wanted to cross check with TwapperKeeper only to find no one had created the archive. So in short the ALT-C 2011 stats are missing 4th-6th (my spreadsheet goes back to the 31st August ;).

    Martin

    • Many thanks for letting me know – and more importantly for creating the archive. I’ve updated my post so it is clear that the initial data published for the ALT-C 2011 Twitter usage is incomplete and have added an additional column which gives data for a week, although not including the first morning of the conference.

      Note that I have also created a Twapper Keeper archive for #ALTC2011 tweets – so nobody forgets next year. I hope they don’t change the hashtag, mind!

  2. […] thoughts.  Brian Kelly (UKOLN) and Martin Hawksey (MASHe)   helped to archive the tweets using Twapper Keeper. The associated Summarizr makes a neat job of aggregating and displaying those archived tweets. As […]

  3. […] can improve a bad lecture! 7… – Seb Schmoller – FriendFeed”; one on ““Battling legal, logistical and technical obstacles to archiving the Web” « UK Web Focus” which summarised one of my blog posts on Twitter archiving and one on “Martin […]
