“Battling legal, logistical and technical obstacles to archiving the Web”
Posted by Brian Kelly on 12 September 2011
Recent Features on Web Archiving
The recent guest blog post entitled Web archives: more useful than just a ‘historical snapshot’ was quite timely, having been published a few days after a related article in the Times Higher Education (Memory Failure Detected) which described how:
A coalition of the willing is battling legal, logistical and technical obstacles to archive the riches of the mercurial World Wide Web for the benefit of future scholars
The article went on to illustrate a use case from the preservation of Web resources:
It is 2031 and a researcher wants to study what London’s bloggers were saying about the riots taking place in their city in 2011. Many of the relevant websites have long since disappeared, so she turns to the archives to find out what has been preserved. But she comes up against a brick wall: much of the material was never stored or has been only partially archived. It will be impossible to get the full picture.
But, as I describe below, we don’t need to wait until 2031 to have a reason to analyse Web content which may have been thought to be ephemeral.
Analysis of Twitter Usage at Recent ALT-C Conferences
The article in the Times Higher Education referred to an archiving initiative led by the Library of Congress, which is archiving Twitter posts so that researchers will, at some time in the future, be able to analyse public tweets. The article could also have mentioned the TwapperKeeper archiving service, which benefited from JISC funding to enhance its archiving capabilities to address the requirements of the UK HE sector. The TwapperKeeper service was used to keep an archive of tweets posted about last week’s ALT-C 2011 conference. The JISC-funded developments to the service included the provision of enhanced API access, which led to the development of the Summarizr analysis service by Andy Powell at Eduserv.
In order to make valid comparisons across annual events I have previously suggested that the Twitter traffic for a full week is analysed, so that discussions in advance of an event and shortly afterwards are included. The Summarizr statistics for tweets at the ALT-C conferences for the past three years are given in the following table.
Note: Following the publication of this post Martin Hawksey pointed out in a comment on the post that the TwapperKeeper archive was not available at the start of the ALT-C 2011 conference, until he created the archive on the opening morning of the conference. An updated column has been published, but note that this does not include tweets from the opening morning of the conference.
|ALT-C 2009||ALT-C 2010||ALT-C 2011||ALT-C 2011 (updated)|
|Date of event||8-10 Sept 2009||7-9 Sept 2010||6-8 Sept 2011||6-8 Sept 2011|
|Dates for analysis||6-12 Sept 2009||5-11 Sept 2010||4-10 Sept 2011||6-11 Sept 2011|
|Nos. of tweets||4,442||6,138||6,296||6,342|
|Nos. of users||726||658||802||809|
|Nos. of URLs tweeted||701||664||1,083||1,102|
|Top five twitterers||jamesclay (168)
|Top five tweeted hashtags||altc2009 (4,333)
|Nos. of geo-located tweets||0 (0%)||35 (0%)||83 (1%)||83 (1%)|
Archiving of the tweets allows us to provide such analyses in order to see the importance of Twitter at such events and identify the people who are particularly active Twitter users at the events. The figures also suggest that the amount of Twitter traffic has stabilised over the past two years (6,138 tweets in 2010 and 6,296 in 2011) and that geo-located tweets, although growing in number, are not yet being used to any significant extent.
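For readers who would like to produce similar figures from their own tweet archive, the following is a minimal Python sketch of this kind of one-week summary. It is not how Summarizr or TwapperKeeper work internally. The record layout (`user`, `text`, `created_at`, `geo`), the `summarise` function and the sample tweets are all made up for illustration, and I have assumed that URLs are counted as distinct URLs.

```python
import re
from collections import Counter
from datetime import date, datetime

# Simple patterns for illustration only; real tweets need more careful parsing.
HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://\S+")


def summarise(tweets, start, end, top_n=5):
    """Summarise tweets posted between start and end (both dates inclusive)."""
    in_window = [t for t in tweets if start <= t["created_at"].date() <= end]
    users = Counter(t["user"] for t in in_window)
    # Hashtags are lower-cased so that #ALTC2011 and #altc2011 count together.
    hashtags = Counter(
        tag.lower() for t in in_window for tag in HASHTAG_RE.findall(t["text"])
    )
    # Assumption: each distinct URL is counted once.
    urls = {u for t in in_window for u in URL_RE.findall(t["text"])}
    geo = sum(1 for t in in_window if t.get("geo") is not None)
    total = len(in_window)
    return {
        "tweets": total,
        "users": len(users),
        "urls": len(urls),
        "top_users": users.most_common(top_n),
        "top_hashtags": hashtags.most_common(top_n),
        "geo_located": geo,
        "geo_percent": round(100 * geo / total) if total else 0,
    }


# Hypothetical sample records, not real tweets from the conference.
sample = [
    {"user": "user_a", "text": "Looking forward to #altc2011 http://example.org/programme",
     "created_at": datetime(2011, 9, 5, 9, 0), "geo": None},
    {"user": "user_b", "text": "Keynote starting now #altc2011",
     "created_at": datetime(2011, 9, 6, 10, 0), "geo": (53.8, -1.55)},
    {"user": "user_a", "text": "Slides: http://example.org/slides #ALTC2011 #elearning",
     "created_at": datetime(2011, 9, 7, 14, 30), "geo": None},
    {"user": "user_c", "text": "Posted well after the event #altc2011",
     "created_at": datetime(2011, 9, 20, 8, 0), "geo": None},
]

# Analyse the full week around the event, as suggested above.
stats = summarise(sample, date(2011, 9, 4), date(2011, 9, 10))
print(stats)
```

The fourth sample tweet falls outside the analysis week and so is ignored, which is the point of fixing the window in advance: the same seven-day rule can then be applied to each year's archive.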
The Coalition of the Willing – Should Include You
The article published in the Times Higher Education highlighted a number of examples of initiatives designed for archiving the broad range of resources available on the Web, including work being undertaken at the British Library, the Library of Congress and the Internet Archive as well as a number of national libraries in Europe.
The emphasis on national and international organisations may give the impression that archiving of Web resources is being addressed by others and so there is no need for individual universities to consider web preservation issues. This is, I feel, a mistaken view. Those who have a responsibility for the management of institutional digital resources need to address preservation issues, and so too do those who manage project resources as well as, as we have seen above, those who may wish to preserve content associated with events.
JISC has recognised the importance of Web archiving and will be hosting an event on “The Future of the Past of the Web” which will be held at the British Library Conference Centre on 7 October 2011. This free event is the third joint Web archiving workshop which has been organised by the JISC in conjunction with the British Library and the DCC. The event is aimed at:
- Curators, librarians, archivists interested in the preservation of web resources
- Organisations that are engaged in web archiving and digital preservation
- Researchers who depend on access to stable web resources for their research
- Web developers and content creators who value their content
- Information managers with responsibility for legal compliance
If this event is of interest to you, note that bookings should be made before 12:00 on Friday 30th September 2011.