The Components of Twitter to be Archived
Posted by Brian Kelly on 30 April 2010
In a recent post on “Developments to Twapper Keeper” I described JISC-funded developments to the Twapper Keeper Twitter archiving service. I mentioned how the Twapper Keeper blog was initially being used to gather user requirements for developments in User Enhancements to Twapper Keeper and API Developments to Twapper Keeper. I’m pleased that we have received a number of suggestions. One of these, a request to allow tweets to be deleted from the archive and users to opt out of Twapper Keeper archiving, has been identified as an important feature, particularly for UK users given the uncertainties regarding Twitter and copyright following the recent passing of the Digital Economy Act.
It has recently occurred to me, though, that we haven’t properly defined what it is that will be archived to allow subsequent reuse (e.g. by tools such as Martin Hawksey’s Twitter capturing service) or analysis (e.g. the sentiment analysis which failed to identify the irony of the tweets posted with the #NickCleggsFault tag).
We will be able to archive the contents of a tweet contained within the 140-character limit. These will include the textual content; hypertext links to Web resources and to Twitter pictures and videos; the Twitter ID of the recipient of a public message (or the subject of a message), as defined by the @ command; and the hashtag(s) used in the tweet. Are there any other structural elements of a tweet, I wonder?
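To make these structural elements concrete, here is a minimal Python sketch of how they might be pulled out of the raw text of a tweet. It is only an illustration: the function name `extract_elements` and the regular expressions are my own assumptions, not part of Twapper Keeper or of Twitter's own parsing rules, which will differ in the details.

```python
import re

# Hypothetical patterns for illustration; Twitter's real rules are more involved.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]+)")
HASHTAG_RE = re.compile(r"(?<![A-Za-z0-9_&])#([A-Za-z0-9_]+)")

TWEET_LIMIT = 140  # character limit for a tweet


def extract_elements(text):
    """Return the structural elements found in the text of a tweet."""
    urls = URL_RE.findall(text)
    # Remove links first so that a URL fragment such as "/#top"
    # is not mistaken for a hashtag.
    without_urls = URL_RE.sub(" ", text)
    return {
        "text": text,
        "urls": urls,
        "mentions": MENTION_RE.findall(without_urls),
        "hashtags": HASHTAG_RE.findall(without_urls),
        "within_limit": len(text) <= TWEET_LIMIT,
    }


if __name__ == "__main__":
    sample = "@example_user see http://example.org/page #NickCleggsFault"
    print(extract_elements(sample))
```

Stripping the links before looking for hashtags is a deliberate choice here, since a Web address can itself contain a `#` character.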
As well as the content of a tweet which is created by the author, there will be a number of metadata attributes which will also be available. These will include the Twitter ID of the poster, the date and time of posting, the name of the Twitter client used and, optionally, geo-location information (which I suspect will grow in importance). Again I wonder if there are additional metadata fields I may have missed.
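One way to picture an archived record that combines the content with these metadata attributes is sketched below in Python. The class `ArchivedTweet` and its field names are hypothetical, chosen only to mirror the attributes listed above; they do not reflect the actual Twapper Keeper schema or the Twitter API.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass
class ArchivedTweet:
    """Hypothetical archive record: tweet content plus its metadata."""
    text: str                 # content created by the author
    author: str               # Twitter ID of the poster
    posted_at: datetime       # date and time of posting
    client: str               # name of the Twitter client used
    geo: Optional[Tuple[float, float]] = None  # optional (latitude, longitude)

    def to_json(self) -> str:
        record = asdict(self)
        # datetime objects are not JSON-serialisable, so store ISO 8601 text.
        record["posted_at"] = self.posted_at.isoformat()
        return json.dumps(record)


if __name__ == "__main__":
    tweet = ArchivedTweet(
        text="An example tweet",
        author="example_user",
        posted_at=datetime(2010, 4, 30, 12, 0, tzinfo=timezone.utc),
        client="web",
    )
    print(tweet.to_json())
```

Keeping the geo-location field optional reflects the fact that most tweets will not carry it, while leaving room for it to become more widely used.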
In addition to this Twitter information there is also information related to the Twitter user’s community – the numbers of people they follow and who follow them. The ability to gather this (volatile) information could be useful for observing trends, identifying causes of viral Twitter posts, applying heuristics for spotting Twitter spammers (as Tony Hirst has described), etc.
The systematic archiving of information related to a Twitterer’s community is probably out of scope for the current Twapper Keeper development work. But will, I wonder, such information be harvested as part of the Library of Congress’s Twitter archiving work?