UK Web Focus

Innovation and best practices for the Web

Archive for February, 2010

WWW 2010: Connect On Facebook, Twitter, LinkedIn and Flickr

Posted by Brian Kelly on 26 February 2010

The World Wide Web (WWW) conference series was launched in 1994 and I have vivid memories of attending the first conference, hosted at CERN, the birthplace of the Web, which was described as ‘the Woodstock of the 1990s’.

The conference is an important event for the Web research community. This year’s event, WWW 2010, is the nineteenth in the series and is being held in Raleigh, North Carolina, USA on 26-30 April. If you want to find out more you can visit the conference Web site, which provides information on the conference program, location details and online bookings.

You can also find information about the conference on the WWW 2010 page on Facebook, or on the WWW 2010 page on LinkedIn. A WWW 2010 Twitter account (@www2010) has also been set up which has been used so far to provide information on various deadlines.  There is also a WWW 2010 Flickr group which currently has a small number of photographs of the conference venue.

A few years ago I suspect that some in the hardcore Web researcher and developer communities would have been rather dismissive of the use of social networking services such as Facebook and LinkedIn. I suspect the rationale for making use of such services is now a business decision, based on the need to ensure that sufficient numbers attend the conference. The benefits of using such popular services (there are currently over 1,100 members of the Facebook group) need to be balanced against the resources which may be needed to manage them (e.g. responding to wall messages). Which makes me wonder: who makes the decision on the use of such services to support this type of event, how large does an event need to be for it to benefit from exposure on Facebook or LinkedIn, and how do you judge whether such a decision will provide a satisfactory ROI? Perhaps the answers should be gained by observing the approaches taken by others – in this case Facebook (with 1,100 members) is ahead of LinkedIn (with 586 members).

Posted in Events, Facebook, Social Networking | Tagged: | 5 Comments »

Approaches To Debugging The DBpedia Query

Posted by Brian Kelly on 24 February 2010

My recent invitation to Linked Data developers to illustrate the potential benefits of Linked Data by providing an answer to a simple query using DBpedia as a data source generated a lot of subsequent discussion. A tweet by Frank van Harmelen (the Dutch Computer Scientist and Professor in Knowledge Representation & Reasoning, I assume) summarised his thoughts on the two posts and related behind-the-scenes activities: “insightful discussion on #linkeddata strengths, weaknesses, scope and limitations“.

But as described in the post, the answer to the question “Which town or city in the UK has the largest proportion of students?” was clearly wrong.  And if you view the output from the most recent version of the query, you’ll see that the answers are still clearly incorrect.

We might regard this ‘quick fail’ as being of more value than the ‘quick win’ which I had initially expected, as it provides an opportunity to reflect on the processes needed to debug a Linked Data query.

As a reminder here is the query:

#quick attempt at analyzing students as % of population in the United Kingdom by Town
#this query shows DBpedia extraction related quality issues which ultimately are a function of the
#wikipedia infoboxes.

prefix dbpedia-owl: <http://dbpedia.org/ontology/>
prefix dbpedia-owl-uni: <http://dbpedia.org/ontology/University/>
prefix dbpedia-owl-inst: <http://dbpedia.org/ontology/EducationalInstitution/>
#(prefix IRIs assumed from the prefix names; the dbpedia: prefix,
#<http://dbpedia.org/resource/>, is predefined on the DBpedia endpoint)

select distinct  ?town ?pgrad ?ugrad  ?population (((?pgrad + ?ugrad) / 1000.0 / ?population ) ) as ?per where {
?s dbpedia-owl-inst:country dbpedia:United_Kingdom;
   dbpedia-owl-uni:postgrad ?pgrad;
   dbpedia-owl-uni:undergrad ?ugrad;
   dbpedia-owl-inst:city ?town.
optional {?town dbpedia-owl:populationTotal ?population. filter (?population >0) }
 }
group by ?town having (((?pgrad + ?ugrad) / 1000.0 / ?population ) ) > 0
order by desc 5

As can be seen, the query is short and, for a database developer with SQL expertise, the program logic should be apparent. But the point about Linked Data is the emphasis on the data and the way in which the data is described (using RDF). So I suspect there will be a need to debug the data. We will probably need answers to questions such as “Is the data correct in the original source (Wikipedia)?“; “Is the data correct in DBpedia?“; “Is the data marked-up in a consistent fashion?“; “Does the query process the data correctly?” and “Does the data reflect the assumptions in the query?“.
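One of these questions – “Does the query process the data correctly?” – can at least be probed directly. As a sketch (reusing the prefixes declared above, whose IRIs are themselves assumptions, and assuming the endpoint supports aggregates), the following query would count how many universities sit in a town with no populationTotal value, and hence how many results the optional clause is silently leaving without a population:

select (count(*) as ?unisWithoutPopulation) where {
?s dbpedia-owl-inst:country dbpedia:United_Kingdom;
   dbpedia-owl-inst:city ?town.
optional {?town dbpedia-owl:populationTotal ?population}
filter (!bound(?population))
}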

Finding an answer to these questions might be best done by looking at the data for the results which were clearly in error and comparing the data with results which appear to be more realistic.

We see that Cambridge has a population of 12 and Oxford a population of 38. These are clearly wrong. My initial suspicion was that several zeros were missing (perhaps the data was described in Wikipedia as population in tens of thousands). But looking at the other end of the table, the towns and cities with the largest populations include Chatham (Kent) with a population of 70,540, Stirling (41,243) and Guildford (66,773) – the latter population count agrees with the data held in Wikipedia.

In addition to the strange population figures, there are also questions about the towns and cities which are described as hosting a UK university. As far as I know neither Folkestone nor Hastings has a university. London, however, has many universities but is missing from the list.

My supposition is that the population data is marked up in a variety of ways – looking at the Wikipedia entry for Cambridge, for example, I see that the infobox on the right of the page (which contains the information used in DBpedia) has three population counts: the district and city population (122,800), urban population (130,000) and county population (752,900). But querying the data held in DBpedia I find three values for population: 12, 73 and 752,900.
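A simple debugging step here is to list every population-related property which DBpedia holds for a suspect resource and compare the values with the Wikipedia infobox. A minimal sketch (the regex filter on property names is a blunt instrument, but adequate for eyeballing a single resource):

select ?property ?value where {
<http://dbpedia.org/resource/Cambridge> ?property ?value.
filter regex(str(?property), "population", "i")
}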

The confusions regarding towns and cities which may or may not host UK universities might reflect real world complexities – if a town hosts a campus but the  main campus is located elsewhere, should the town  be included? There’s not a clear-cut answer, especially when, as in this case, the data, from Wikipedia, is managed in a very devolved fashion.

I’ve suggested some possible reasons for the incorrect results of the SPARQL query and I am sure there may be additional reasons (and I welcome such suggestions). How one might go about fixing the bugs is another question. Should the data be made more consistent? If so, how might one do this when the data is owned by a distributed community? Or isn’t the point of Linked Data that the data should be self-describing – in which case perhaps a much more complex SPARQL query is needed in order to process the complexities hidden behind my apparently simple question.
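To give a flavour of what a more complex query might look like: rather than relying on a single populationTotal value, a query could try several population properties in a preferred order. A sketch, assuming the endpoint supports the SPARQL 1.1 coalesce function, assuming the prefix IRIs reconstructed earlier, and assuming properties named populationUrban and populationMetro exist alongside populationTotal (the names would need checking against the live data):

prefix dbpedia-owl: <http://dbpedia.org/ontology/>
prefix dbpedia-owl-inst: <http://dbpedia.org/ontology/EducationalInstitution/>

select distinct ?town (coalesce(?total, ?urban, ?metro) as ?population) where {
?uni dbpedia-owl-inst:country dbpedia:United_Kingdom;
     dbpedia-owl-inst:city ?town.
optional {?town dbpedia-owl:populationTotal ?total}
optional {?town dbpedia-owl:populationUrban ?urban}
optional {?town dbpedia-owl:populationMetro ?metro}
}

This does not resolve the underlying inconsistency – it merely chooses which of several competing values to trust – but it illustrates how quickly an ‘apparently simple’ question accumulates modelling decisions.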

Posted in Linked Data | 10 Comments »

Response To My Linked Data Challenge

Posted by Brian Kelly on 19 February 2010

The Linked Data Challenge

A week ago I issued a challenge to Linked Data developers – using a Linked Data service, such as DBpedia, tell me which town or city in the UK has the largest proportion of students. I’m pleased to say that a developer, Alejandra Garcia Rojas, has responded to my challenge and provided an answer. But this post isn’t about the answer but rather the development processes, the limitations of the approach and the issues which the challenge has raised. The post concludes with a revised challenge for Linked Data (and other) developers.

The Motivation For The Challenge

Before revealing the answer I should explain why I posed this challenge. I can recall Tim Berners-Lee introducing the Semantic Web at a WWW conference  many years ago – looking at my trip reports it was the WWW 7 conference held in Brisbane in 1998. My report links to the slides Tim Berners-Lee used in his presentation in which he encouraged the Web development community to engage with the Semantic Web. I was particularly interested in his slide in which he outlined some of the user problems which the Semantic Web would address:

  • Can Joe access the party photos?
  • Who are all the people who can?
  • Is there a green car for sale for around $15000 in Queensland?
  • Did someone driving a blue car send us an invoice for over $10000?
  • What was the average temperature in 1997 in Brisbane?
  • Please fill in my tax form!

I was interested in whether, 12 years on, such types of questions could be answered using what is now referred to as Linked Data. And as we have a large resource, DBpedia, which provides Linked Data for use by developers, I felt it would be useful to use an existing resource (which is based on the structured content held in Wikipedia) to experiment with. I was particularly interested in the following three questions:

  • How easy would it be for an experienced Linked Data developer to write code which would provide an answer? Would it be 10 lines of code which could be written in 10 minutes, a million lines of code which would take a large team years to write or somewhere in between?
  • Is the Linked Data content held in DBpedia of sufficient consistency and quality to allow an answer to be provided without the need for intensive data cleansing?
  • What additional issues might the experiences gained in this challenge raise?

A SPARQL Query To Answer My Challenge

In addition to issuing my challenge on this blog and using Twitter to engage with a wider community I also raised the challenge in the Linked Data Web group on LinkedIn (note you need to be a member of the group to view the discussion). It was in this group that the discussion started, with Kingsley Idehen (CEO at OpenLink Software) clarifying some of the issues I’d raised. Alejandra Garcia Rojas, a Semantic Web Specialist, was the developer who immediately responded to my query and, within a few hours, provided an initial summary; a few days later she gave me the final version of her SPARQL query (as described in Wikipedia, SPARQL is a query language for Linked Data) which was used to provide an answer to my question. Alejandra explained that it should be possible to use the following single SPARQL query to provide an answer from the data held in DBpedia:

prefix dbpedia-owl: <http://dbpedia.org/ontology/>
prefix dbpedia-owl-uni: <http://dbpedia.org/ontology/University/>
prefix dbpedia-owl-inst: <http://dbpedia.org/ontology/EducationalInstitution/>
#(prefix IRIs assumed from the prefix names; dbpedia: is predefined on the DBpedia endpoint)

select ?town count(?uni) ?pgrad ?ugrad max(?population) (( (?pgrad+?ugrad)/ max(?population))*100) as ?percentage where {
?uni dbpedia-owl-inst:country dbpedia:United_Kingdom ;
dbpedia-owl-uni:postgrad ?pgrad ;
dbpedia-owl-uni:undergrad ?ugrad ;
dbpedia-owl-inst:city ?town.
optional {?town dbpedia-owl:populationTotal ?population . FILTER (?population > 0 ) }
}
group by ?town ?pgrad ?ugrad having( (((?pgrad+?ugrad)/ max(?population) )*100) > 0)
order by desc 6

The Answer To My Challenge

What’s the answer, I hear you asking? The answer, to my slightly modified query in which I’ve asked for the number of universities and the total population for the towns with the highest proportion of students, is given in the table below:

City        Nos. of Universities   Student Nos.   Population   Student Proportion
Cambridge   2                      38,696         12           3224%
Leeds       5                      135,625        89           1523%
Preston     1                      34,863         30           1162%
Oxford      3                      40,248         38           1059%
Leicester   3                      54,262         62           875%

As can be seen, Cambridge, which has two universities, has the highest proportion of students, with a total student population of 38,696 and an overall population of 12 people. Clearly something is wrong :-) Alejandra has provided a live link to her SPARQL query, so you can examine the full responses for yourself. In addition another SPARQL query provides a list of cities and their universities.

The Quality Of The Data Is The Problem

I was very pleased to discover that it was possible to write a brief and simple SPARQL query (anyone with knowledge of SQL will be able to understand the code). The problem lies with the data. And this exercise has been useful in gaining a better understanding of the flaws in the data and of the need to understand why such problems have occurred.

Following discussions with Alejandra we identified the following problems with the underlying data:

  • The population of the towns and cities is defined in a variety of ways. We discovered many different variables describing the population: totalPopulation, urbanPopulation, populationEstimate, etc. – and on occasions there was more than one value in a variable. Moreover, these variables are not present in all cities’ descriptions, making it impossible to select the most appropriate value consistently.
  • A full list of all UK universities could not be analysed because the query only processes the universities that have student numbers defined. If a university does not provide the number of students, it is discarded.
  • Colleges are sometimes defined as universities.

What Have We Learnt?

What have we learnt from this exercise? We have learnt that although the necessary information to answer my query may be held in DBpedia, it is not available in a format which is suitable for automated processing.

I have also learnt that a SPARQL query need not be intimidating and it would appear that writing SPARQL queries need not necessarily be time-consuming, if you have the necessary expertise.

The bad news, though, is that although DBpedia appears to be fairly central to the current publicity surrounding Linked Data, it does not appear to be capable of providing end user services on the basis of this initial experiment.

I do not know, though, what the underlying problems with the data are. It could be due to the complexity of the data modelling, the inherent limitations of the distributed data collection approach used by Wikipedia, limitations of the workflow process in taking data from Wikipedia for use in DBpedia – or simply that the apparently simple query “which place in the UK has the highest per capita student population?” has many implicit assumptions which can’t be handled by DBpedia’s representation of the data stored in Wikipedia.

If the answer to such apparently simple queries will require much more complex data modelling, there will be a need to address the business models which will be needed to justify the additional expenditure needed to handle the complexity. And although there might be valid business reasons for doing this in areas such as biomedical data, it may be questionable whether this is the case for answering essentially trivial questions such as the one I posed. In which case the similarly trivial question which Tim Berners-Lee used back in 1998 – “Is there a green car for sale for around $15000 in Queensland?” – was perhaps responsible for misleading people into thinking the Semantic Web was for ordinary end users. I am now starting to wonder whether a better strategy for those involved in Linked Data activities would be to purposely distance it from typical end users and target, instead, identified niche areas.

A more general concern which this exercise has alerted me to is the dangers of assuming that the answer to a Linked Data query will necessarily be correct. In this case it was clear that the results were wrong. But what if the results had only been slightly wrong? And what if you weren’t in a position to make a judgement on the validity of the answers?

On the LinkedIn discussion Chris Rusbridge summarised his particular concerns: “My question is really about the application of tools without careful thought on their implications, which seems to me a risk for Linked Data in particular“. Chris went on to ask “what are the risks of running queries against datasets where there are data of unknown provenance and qualification?“

My simple query has resulted in me asking many questions which hadn’t occurred to me previously. I welcome comments from others with an interest in Linked Data.

An Updated Challenge For Linked Data (and Other) Developers

It would be a mistake to regard the failure to obtain an answer to my challenge as an indication of limitations of the Linked Data concept – the phrase ‘garbage in, garbage out’ is as valid in a Linked Data world as it was when it was coined in the days of IBM mainframe computers.

An updated challenge for Linked Data developers would be to answer the query “What are the top five places in the UK with the highest proportion of students?” The answer should list the town or city, student percentage, together with the numbers of universities, students and the overall population.

And rather than using DBpedia the official source of such data would be a better starting point. The UK government has published entry points to perform SPARQL queries for a variety of statistical queries – so Linked Data developers may wish to use the interface for government statistics and educational statistics.

Of course it might be possible to provide an answer to my query using approaches other than linked data. In a post entitled “Mulling Over =datagovukLookup() in Google Spreadsheets” Tony Hirst asked “do I win if I can work out a solution using stuff from the Guardian Datastore and Google Spreadsheets, or are you only accepting Proper Linked Data solutions“. Although I’m afraid there’s no prize, I would be interested in seeing if an answer to my query can be provided using other approaches.


Twitter conversation from Topsy: [View]

Posted in Linked Data | 33 Comments »

Moderated Comments? Closed Comments? No Thanks!

Posted by Brian Kelly on 15 February 2010

Copyright? There’s A Need For A Debate

On Friday I read a blog post about alleged copyright infringement on Blogger. The post on the JISC Digital Media blog described how “In a draconian move, Google has recently removed several music blogs from its Blogger and Blogspot services“. The story, which was also featured in a Guardian article “Google shuts down music blogs without warning”, concerned the deletion of entire blogs which were alleged to contain copyrighted content.

The post concluded “it also starkly demonstrates the importance of gaining permission to use copyrighted material, lest you spoil your ship for a ha’pworth of tar. As always, if you’re not sure, don’t use it!“.

I disagree – I feel that copyright in today’s digital environment is a very complex topic, and simply suggesting that copyright resources should never be used is avoiding the realities of how digital resources are being used. In addition to the question of how copyrighted resources are being used, there is also the question of the extent we should continue to support a legal framework around copyright whose relevance is being questioned by increasing numbers  – Professor Peter Murray-Rust, for example, at a keynote talk given at the ILI 2009 conference argued that “Copyright as we know it must be destroyed for the sake of academic publishing and in order to facilitate the sharing of knowledge (as distinct from the business of making money from restricting the sharing of knowledge)“. As described in a report on the talk published on the FromMelbin blog Peter claimed that “Copyright is currently preventing the sharing of knowledge that could help to save the planet and that we as librarians should be agitating, displaying our “raw anger” and protesting for legislative change“.

Comment Moderation is A Barrier To Debate

I responded to the original blog post on Friday night, mentioning a paper on “Empowering Users and Institutions: A Risks and Opportunities Framework for Exploiting the Social Web” by myself and Professor Charles Oppenheim which describes a risk management approach to copyright. However the blog contains a message that “Unfortunately due to high levels of spam all of our comments are moderated and only authentic comments posted” so as I write this my comment is not yet visible (it was approved on Monday morning).

It is true that blogs are subjected to automated spam attacks – back in June 2008, for example, I described how on this blog the Akismet spam filter had filtered A Quarter of a Million and Counting. But for me this demonstrates the effectiveness of the Akismet spam filter. Since this blog was launched in November 2006 comments can be made on any of the posts, with no moderation in place – the only requirement is that the comment author must provide a name and e-mail address. This policy, I feel, is important in avoiding delays in the publication of comments.

For me this ease of commenting is an important feature of blogs, especially for blogs in which feedback, comments and discussions are encouraged. The benefits of immediate publication of comments outweigh the risk that spam might get through the spam filter, and the approach saves me the effort of having to manually approve comments.

I also feel that comments can be useful even for posts which were published a long time ago – so I do not switch off comments after a set period of time.  This can also be useful in allowing notifications from other blogs (via pingbacks and trackbacks) to be displayed, so that viewers can easily follow links to posts which link to articles on this blog.

Comment moderation and closed comments? Not for me.  What about you?

Posted in Blog | 12 Comments »

A Challenge To Linked Data Developers

Posted by Brian Kelly on 12 February 2010

Back in November, following the interest in Linked Data which had been discussed at a CETIS 2009 Conference I wondered whether it was Time To Experiment With DBpedia?

The following month I attended the Online Information 2009 conference. As I described in a post on the Highlights of Online Information 2009: Semantic Web and Social Web it was clear to me that “ #semanticweb was the highlight & relevant for early mainstream“.  A blog post which provided the LIS Research Coalition “review” of Online 2009 was in agreement: “sessions on the semantic web gave the impression that those in library and information science related roles are now beginning to consider the exploitation of data to data links“.

However a concern I raised with Ian Davis, CTO of Talis, following his keynote talk on “The Reality of Linked Data” was the danger of over-hyping expectations; something I feel is very relevant in light of the perceived failure of the Semantic Web to live up to the expectations of evangelists in the early years of the last decade. Has, for example, the “new form of Web content that is meaningful to computers [which] will unleash a revolution of new possibilities” described in the Semantic Web article published in Scientific American (and also available from Ryerson.ca) in May 2001 arrived? I think not.

There is a danger, I fear, that the renewed enthusiasm felt by increasing numbers of developers will not be shared by managers and policy makers – leading to interesting pilots and prototypes which do not necessarily become deployed in a mainstream service environment.

A suggestion I made to a number of Linked Data experts at the Online Information 2009 conference was to demonstrate the value of Linked Data not by providing examples in niche subject areas (e.g. chemistry) but by taking an example which everyone can understand.

In my post Time To Experiment With DBpedia? I used the DBpedia Faceted Browser to search for information about UK Universities – in the example I searched for UK Universities which were founded in 1966. But this wasn’t demonstrating how Linked Data can be used to join information which have different underlying structures.

My challenge to Linked Data developers is to make use of the data stored in DBpedia (which is harvested from Wikipedia) to answer the query “Which town or city in the UK has the highest proportion of students?“.  This would involve processing the set of UK Universities, finding all Universities from the same town or city, recording the total number of students  and then, from the town/city entries in DBpedia, finding the total population in order to identify the town or city with the largest proportion of students.

I’m not too concerned about some of the edge cases (i.e. the differences between the City of London and Greater London or the Universities with campuses in several locations).  Rather I want to know:

  • Can Linked Data solve this problem (from a theoretical perspective)?
  • Is DBpedia able to solve this problem (from a theoretical perspective)?
  • How difficult is it to solve the problem (is it a trivial 1 line SPARQL query or would it require several months of work?)

Any takers? And note the answer must be provided using DBpedia – asking your friends on Twitter is cheating!

Posted in Linked Data | 11 Comments »

OMG! Is That Me On The Screen?

Posted by Brian Kelly on 10 February 2010

Yesterday a tweet from @josiefraser alerted me to the fact that “There’s a giant @briankelly on the screen!“. Josie went on to inform me (and her other followers) that my image was being used by “Kirsty McGill on remote audiences #transliteracy“. A few minutes later Josie tweeted “@briankelly now with added @briankelly http://u.nu/9dy25 #transliteracy“. There it was, amongst a set of Josie’s photographs taken at yesterday’s Transliteracy Conference held in Leicester: a photograph of me taken at last year’s IWMW 2009 conference, together with a photograph of that photograph being displayed during a talk by Kirsty McGill at the conference. Very meta!

After viewing the photo I wondered “what was the learning point for use of that image?” and went on to speculate that perhaps at a transliteracy conference such an image might be used to raise issues such as privacy and permissions. I asked “how many rights-holders need to sign waiver for public use of this photo http://bit.ly/aHkNUV :-)” – with the smiley face in the tweet indicating that I didn’t have a problem with such reuse of the photograph.

The photograph was used in a talk given by Kirsty McGill – and Kirsty herself took the photograph at UKOLN’s IWMW 2009 event last summer in her role as the official event blogger. The photograph was used in a blog post which summarised the various activities which took place at the workshop dinner – which included a caricaturist who, on hearing that one of my interests was rapper sword dancing, added a sword in his drawing of me.

Kirsty used the photograph in her talk on “Remote Audiences” in which she “provide[d] a brief introduction to creating a complete online experience of a conference for a remote audience by creating tools and providing content so they can actively engage and interact with the live event“. It was good to see how the amplification of IWMW 2009 was used in Kirsty’s talk.  As Kirsty’s abstract went on to describe “integrating [use of various technologies and resources] with a live event raises a number of challenges related to transliteracy: the remote audience may wish to access the event content from a variety of different platforms; representing the event appropriately within the literacies of each platform may require some adaptation of the content; and members of the remote audience may have different levels of ability to navigate and use the resources to full effect“.

In addition to the various technologies (e.g. Slideshare, the Twitter back channel, the video stream, etc.) there are also various softer issues to be considered.  For example there were several discussions on this blog (and elsewhere) last year related to archiving and citing tweets published at events.  In addition there is the issue regarding taking photographs (or videos or audio recordings) at such events and subsequent publication of such photographs.

A typical response to the potential concerns regarding privacy which may be raised would be to require that permission is obtained before reusing such photographs. However my view is that this is likely to be too time-consuming to do. Going back to my original question as to the various rights-holders associated with the photograph shown above, we might identify myself (the main person in the photo), David Harrison (also easily identified in the photograph – but who, unlike me, was probably not aware that the photograph was being taken), the photographer (Kirsty herself, I believe), UKOLN, who commissioned Kirsty to take photos on our behalf, the people in the background and the caricaturist who drew the picture. In addition the photograph included above is a photograph of the photograph taken at IWMW 2009 – the former photograph was taken by Josie Fraser, and it includes Kirsty McGill. There could also be additional rights associated with the venues the two photos were taken in.

In this particular example the main stakeholders (myself, Kirsty, Josie and David) know each other and are unlikely to be unduly concerned about reuse of such photographs, and the two events (IWMW 2009 and the Transliteracy Conference) are both supportive of use of such technologies to enhance the events and support community-building. But what may be appropriate for these events is not necessarily the case more widely.

I feel there is a need to take a risk management approach, which will assess the likelihood of concerns being raised and seek to take measures to minimise such risks (for example, we provided a ‘quiet area’ in the main auditorium at the IWMW 2009 event for those who did not wish to be photographed or distracted by participants using their laptops during the talks).

Such a risk management approach was described in a paper entitled “A Risks and Opportunities Framework for Exploiting the Social Web” which I presented before Christmas at the Cultural Heritage Online 2009 Conference. In the paper, which was co-authored by Charles Oppenheim, we described a risk assessment formula for legal infringements. As an aid to identifying the risk of copyright infringement  the following formula was proposed:

R = A × B × C × D

where R is the financial risk; A is the chance that what has been done is an infringement; B is the chance that the copyright owner becomes aware of such infringement; C is the chance that, having become aware, the owner sues; and D is the financial cost (damages, legal fees, opportunity costs in defending the action, plus loss of reputation) of such a legal action. Each of these, other than D, ranges from 0 (no risk at all) to 1 (100% certain). D is potentially a high number, and it is not easy to calculate the cost of loss of reputation.
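To make the formula concrete, here is a purely illustrative calculation using invented numbers: if the chance that a use is infringing is A = 0.5, the chance that the owner becomes aware of it is B = 0.2, the chance that the owner then sues is C = 0.1 and the potential cost of the action is D = £20,000, then R = 0.5 × 0.2 × 0.1 × £20,000 = £200. Expressing the risk in this way makes explicit why a low-probability infringement involving low-value content may be judged an acceptable risk.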

This example was provided  for gaining an understanding of the financial risks of copyright infringement. But the risks aren’t just financial. In the example provided in this post we might modify the formula so that:

R is the general risk; A is the chances that what has been done is infringement; B is the chances that the rights holder becomes aware of such infringement; C is the chances that having become aware, the owner takes some action and D is the social cost (e.g. loss of reputation).

Although such an approach is subject to misuse by, say, the paparazzi, it may be a useful mechanism for those who wish to reuse images whilst avoiding upsetting others.

Posted in General | 11 Comments »

Investigation into Challenges, Application and Benefits of Social Media in UK HEIs

Posted by Brian Kelly on 9 February 2010

A New Report

A new report on “An Investigation into the Challenges, Application and Benefits of Social Media in Higher Education Institutes” has just been published. This 28 page document was published by Jadu, a provider of Content Management Systems for public sector organisations.

The Process

Since I am aware there may be concerns in the sector related to a commercial company publishing such a report I should first declare my links with this report and with Jadu. Jadu were one of the commercial sponsors of UKOLN’s IWMW 2009 event. After last year’s event I was approached by Christine Fiddis, Jadu’s Marketing Director who wanted to carry out a survey of the higher education community to gain an understanding of how the sector was responding to the challenges and benefits of Social Media.   Christine is very aware of certain sensitivities within the sector to surveys from commercial companies.  My suggestion was to make the report freely available under a Creative Commons licence and to keep to a minimum any sales pitch in the report.  I also suggested providing an incentive to those involved in managing institutional Web services – and an iPod prize was provided to the lucky winner. I also provided some suggestions on the questions on the survey and gave some minor comments on the final report, as well as drawing the prize from an electronic hat (I used a random number generator to select a number which Christine used to identify the winner).

I’m pleased to say that the report is licensed under a Creative Commons Attribution licence – and there are only two brief paragraphs about the company.

The Content

Sixty responses were received from 44 HEIs across the UK (36 in England, 3 in Wales and 6 in Scotland). Responses were received from people working in Web management, marketing, media and communications, learning and development, business, libraries and IT management and services.

The top three challenges to date in implementing social media which were identified were (1) developing the business case for its usage; (2) overcoming cultural issues and (3) dealing with current software compatibility issues.

There are few restrictions on access to social Web services in the community, with unrestricted access to Facebook, Twitter, Blogging, MySpace, YouTube and Flickr reported by 90% of the institutions.

The most frequently used external social media tools are Twitter (68.3%) and YouTube (60.7%), followed by social networking tools such as Facebook and MySpace (57.49%). 47.3% of respondents intend to adopt Twitter over the next two years; 41.8% intend to adopt YouTube and 41.1% social networking tools such as Facebook and MySpace. A much smaller percentage of respondents intend to adopt ‘customised’ social networking tools such as Ning.com and Yammer.

67.9% of respondents believed that the main factor influencing the development of an HEI’s social media strategy is the user base, with the Web team’s role being to respond to user demand. This compares with 44.6% of respondents who believed that the Web team was driving strategy development.

There is currently little if any integration of content management systems with social media technologies within those HEIs responding. [This will be the marketing opportunity for the commercial sector, I feel].

The four major issues identified in the report are:

HEI Management need a stronger business case:  Building a business case for social media adoption is the number one challenge for HEIs both now and in the future. Whilst users are convinced of the benefits, HEI management need a stronger business case.

The strategic approach to managing social media is evolving: The future strategy for responding to social media development is unclear in many institutions. No firm conclusion appears to have been reached on ownership and management of this new development. It is also not clear whether social media technologies should be treated as a separate strategy or embedded in core operations.

Can unrestricted use continue? The low level of restrictions currently applied to social media usage has implications for a wide range of areas, ranging from privacy and intellectual property to data protection. As usage grows it is likely that these issues will increase in profile and impact.

Increased awareness is needed to address cultural issues: Awareness of the potential benefits of social media and its usage was identified as a ‘generational issue’ by a number of respondents. The development of short programmes around the ‘benefits of social media technologies’ for HEIs could address this situation and the cultural issues identified.

What Next?

This summary may be unsurprising, but it provides valuable evidence on the perceptions of the challenges and concerns of those practitioners involved in providing and managing institutional Web-based services. This report complements the “Higher Education in a Web 2.0 World” report produced by a team led by Sir David Melville which, as I described last year, provided a senior management perspective which acknowledged the importance of the Social Web.

I would encourage all those involved in the management of institutional Web services to read this report. My colleague Marieke Guy and I feel that the issues raised in the report could well form an important part of the programme for UKOLN’s IWMW 2010 event, which will be held at the University of Sheffield on 12-14 July 2010. As I mentioned recently the call for submissions is currently open – so if anyone is willing to give a talk, perhaps on the four key challenges listed above or, perhaps more appropriately, to facilitate a 90 minute workshop session on these issues, we’d love to hear from you.

And I’d like to leave the final thoughts to Kathryn Chedgzoy, Development & Alumni Relations Officer at Warwick Business School, University of Warwick. In a 5 minute video clip available on Vimeo Kathryn discusses how her institution is adopting a social media strategy and the challenges she feels need to be faced.

Posted in Web2.0 | 4 Comments »

H.264 Format Free To End Users Until (At Least) 2016

Posted by Brian Kelly on 4 February 2010

Shortly after I published my post on “iPad, Flash, HTML 5 and Standards” an announcement was made regarding the licence conditions for Web use of the H.264 video format. Philip Roy alerted me to a press release (PDF format) which announced that the licence deal for H.264 has just been extended until 2016. The press release states that:

MPEG LA announced today that its AVC Patent Portfolio License will continue not to charge royalties for Internet Video that is free to end users (known as Internet Broadcast AVC Video) during the next License term from January 1, 2011 to December 31, 2015. Products and services other than Internet Broadcast AVC Video continue to be royalty-bearing, and royalties to apply during the next term will be announced before the end of 2010.

So although  Christopher Blizzard was correct when he pointed out that the licence conditions mean “that something that’s free today might not be free tomorrow” it is also true that something that’s free today may continue to be free tomorrow.

A post by Christopher Blizzard entitled “HTML5 video and H.264 – what history tells us and why we’re standing with the web” encourages readers to learn from the lessons of GIF. I can remember what happened just after Christmas in 1999 – the owners of the GIF image format (which was being widely used on the Web) announced that they intended to charge for its use – and these charges would apply to developers who made tools which created or edited GIF images and also Web site owners who made use of GIF images on their Web site (who would have to pay at least $5,000 for use of GIF images!).

As a direct result of this threat to open use of the Web the W3C coordinated development of the PNG (Portable Network Graphics) file format, which provided a royalty-free alternative to GIF and also offered richer functionality.

Christopher Blizzard argues that this example illustrates why we must avoid use of formats which have such licensing conditions associated with them. But there is another view. Although the PNG format has its merits, support for the format has sadly been flawed. After it was released, viewing Web pages containing PNG images in the most widely used browser caused problems for the end user. And I was told by a colleague recently that even today Web pages containing PNG images viewed in Internet Explorer version 6 can still cause problems.

I also understand Unisys did not enforce the licence conditions on users of the GIF format and, as described in a Wikipedia article, “Unisys was completely unable to generate any good publicity and continued to be vilified by individuals and organizations“.

The patent for the compression algorithm used in GIF has now expired, so there are no barriers to use of GIF.  Might the lessons be that it is dangerous to adopt an open standard before tools which support it correctly are widely deployed (rather than just freely available) and that user pressure may result in owners of patented formats being unwilling to enforce payment for use of their formats?

Posted in standards | 7 Comments »

iPad, Flash, HTML 5 and Standards

Posted by Brian Kelly on 3 February 2010

Lack of Flash Support by the iPad – Bad News or Good?

A post I wrote in November 2008 entitled “Why Did SMIL and SVG Fail?” has been referenced by the Stevie 5 is Alive blog. The post on the lack of Flash support for the iPad device says “Apple: Thank You for Leaving Flash Out“.

As the author, a ‘geek and entrepreneur’, correctly points out, SMIL, the open XML-based multimedia standard developed by the W3C, “was virtually assassinated from the landscape“. He goes on to point out that:

Quicktime X no longer opens and runs SMIL files (Quicktime Player 7 does, and it’s still in the spec). Quicktime on the iPhone won’t handle SMIL. WYSIWYG SMIL editors now are nowhere to be found. Evolution of the SMIL specification slowed to a crawl. The once potentially vibrant ecosystem around open standards has withered to nearly nothing – with obscure projects like Ambulant remaining as last-chance efforts to keep an open format available to the world for interactive media.

In its place we have seen Flash dominating the market place. The problem is that Flash “is a vendor proprietary format, with a closed ecosystem. Adobe makes the flash player. Adobe makes the flash development tools. Sure some other companies provide streamlined development tools based on Adobe’s APIs (like SWiSH Max) but Adobe controls what they can and can’t do with those APIs.

Perhaps, then, the lack of Flash support in the iPad is to be welcomed, particularly in light of the recent announcement about YouTube’s HTML5 Video Player, which does not require Flash support, but instead supports native video streaming.

A desire to move away from Flash was expressed at a meeting I attended last week when I heard that Flash seems to be blocked by firewalls in certain public sector organisations. “HTML 5 will avoid the need for Flash” was a response made to this comment, although the lack of support for HTML 5 in current versions of Internet Explorer is likely to be a barrier to its deployment.

Complexities of Video and HTML 5

But a lack of support for standards is, once again, not just a problem for Microsoft: using the open source FireFox browser to view the HTML 5 pages used by services such as YouTube will not necessarily work either. Although HTML5 defines a standard way to embed video in a Web page, using the <video> element, FireFox currently supports the Ogg Theora, Ogg Vorbis and WAV formats – but not the widely used H.264 format (codec).

The H.264 family of standards was developed to “create a standard capable of providing good video quality at substantially lower bit rates than previous standards“. But despite its popularity, as described in Wikipedia, “vendors of products which make use of H.264/AVC are expected to pay patent licensing royalties for the patented technology that their products use“. The costs of use of the format are difficult to determine: Christopher Blizzard, in a post looking at the history of patented technologies, points out that although “H.264 is currently liberally licensed [it] also has a license that changes from year to year, depending on market conditions. This means that something that’s free today might not be free tomorrow.” As for what those costs may be, an article on “H.264 Royalties: what you need to know” states that:

… a one-time payment of $2,500 “per AVC transmission encoder” or an annual fee starting at “$2,500 per calendar year per Broadcast Markets of at least 100,000 but no more than 499,999 television households, $5,000 per calendar year per Broadcast Market which includes at least 500,000 but no more than 999,999 television households, and $10,000 per calendar year per Broadcast Market which includes at 1,000,000 or more television households.

A Dive Into HTML 5 post on Video on the Web also points out that “The fees are potentially somewhat steeper for internet broadcasts” and “starting in 2011, it’s going to cost a whole lot more“.

As well as the issue of licensing costs (likely to be difficult to pay for an open source project such as FireFox, which doesn’t have an income stream related to its core product), there is also a need to consider the principles involved: the success of the Web has been based on open standards whose use has not required payment of royalty fees.

Is There an Open Alternative to H.264?

Are there open alternatives to H.264 which aren’t encumbered with licensing restrictions? The answer is yes: Ogg is an open standard container format for video which is unencumbered by any known patents. Firefox 3.5, Chrome 4, and Opera 10 provide native support for the format, without the need for any platform-specific plugins, through use of the Ogg container format, Ogg video (“Theora”) and Ogg audio (“Vorbis”).

Surely the answer to the licensing complexities of H.264 is simple – make use of Ogg instead?  Robert Accettura has given his interpretations of the reasons why Apple and Google appear to be willing to support H.264:

Apple’s Argument: Hardware decoding for H.264 is available on various devices (including the iPhone). Hardware decoding means the device’s CPU does not have to carry out this function, resulting in better performance and battery life. As there does not appear to be a hardware Theora decoder available, the H.264 standard can be deployed using existing technologies.

Google’s Argument: In a message sent to the WhatWG mailing list last year Chris DiBona argued that “If [you] were to switch to theora and maintain even a semblance of the current youtube quality it would take up most available bandwidth across the internet“. Although others have queried this argument (and an Ars Technica post on “Decoding the HTML 5 video codec debate” explored this issue in more detail) the bandwidth costs of accessing streaming video will be a factor in choosing an appropriate format, particularly for companies such as Google which are significant providers of streaming video through services such as YouTube.

What is To be Done?

In a recent post on “Reflections on CETIS’s “Future of Interoperability Standards” Meeting” I described how there was a view that policy makers tended to have a naive view of open standards, perhaps feeling that an open standard would be guaranteed to provide simple, elegant solutions whilst bringing down costs by avoiding reliance on vendors of proprietary formats. In response Erik Duval pointed out that “I certainly strongly agree that policy makers sometimes have a somewhat naive view of the standards process – but then so did we when we started this?“.

Erik is certainly correct that developers and others working in IT will have a tendency to gloss over real world deployment issues – on reflection I was guilty of this in my article on “HTML Is Dead!” which argued that the future for HTML was based on XHTML. So here’s my brief summary related to the complexities of video and HTML 5.

The video element in the draft HTML 5 standard will allow Web pages to include embedded videos which do not require plugin technologies (such as the proprietary Flash format which is widely used today). The format of such videos is not defined in the HTML 5 standard – it is being left to the market place, the browser vendors, to provide such support. Google (with their Chrome browser) and Apple (with their Safari browser) currently support the H.264 video format, but since this format uses patented technologies its use requires the browser vendors to pay a licence fee. FireFox developers feel that the open Ogg/Theora format should be used, but Google and Apple argue that this format has limitations.
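One way for content providers to hedge their bets while the codec question remains unresolved is to publish the same video in both encodings and let each browser select the format it can play. A minimal sketch of the markup (the file names are purely illustrative):

<video controls width="480">
  <!-- H.264 version, playable in Safari and Chrome -->
  <source src="talk.mp4" type="video/mp4">
  <!-- Ogg Theora/Vorbis version, playable in FireFox and Opera -->
  <source src="talk.ogv" type="video/ogg">
  <!-- fallback for browsers with no HTML 5 video support -->
  <p>Your browser cannot play this video: <a href="talk.ogv">download it</a> instead.</p>
</video>

The cost of this approach, of course, is that every video must be encoded and stored twice.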

Since Google and Apple are both significant providers of video and multimedia content (with YouTube in the case of the former and iTunes the latter) the decisions they make regarding formats for the content they provide are likely to influence the user community’s preferences, since users will have no interest in the complexities of codecs, patents, etc.

There may be ways of circumventing these difficulties, by eventual agreements by the major software vendors or by the provision of alternative environment (e.g. the Google Chrome Frame plugin for Internet Explorer or, as described in Ryan Paul’s blog post as “The undesirable middle-ground” of “expos[ing] each platform’s underlying media playback engine through the HTML 5 video element” which appears to be technically possible but would “heighten the risk of fragmentation“).

What are your plans for streaming video?!

Posted in standards | 12 Comments »

Decommissioning / Mothballing Mailing Lists

Posted by Brian Kelly on 1 February 2010

The Context

In response to my recent post about usage of JISCMail lists Nicole Harris pointed out some evidence of its popularity. It is clear that although in some sectors there may have been a migration to a diversity of communication and collaboration tools, other sectors are still well-served by email lists.  This is particularly true of museums and public libraries, as I know from experience, being a member of the well-used MCG and lis-pub-libs JISCMail lists.

The Evidence

But what should be done for the lists which are no longer being used to any significant extent?  And following Nicole’s links to statistics on the use of JISCMail I was very interested to see the statistics on the numbers of messages on lists.

As can be seen from the accompanying image (taken from the JISC’s Monitoring Unit Web site), the majority of lists appear to have had zero messages posted in the given time period and the number of such lists has been growing. The number of very active lists, with over 100 messages, is, in comparison, tiny. Those lists must be very active, of course, as the overall amount of traffic on the lists is still growing.

Although these figures are very surprising they do reflect my findings when I looked at the various lists that I was still subscribed to. For example here are two lists which I had forgotten about:

ADVSERV-CANDM (Advisory Services and Comms and Marketing mailing list)
A list for discussion and dissemination between Advisory Services and Communications and Marketing.
Only a handful of posts between July 2004 and November 2005.

DNER-TECH
List to discuss technical issues relating to the establishment of the Distributed National Electronic Resource. These issues should particularly relate to inter-operability matters.
Posts between August 1999 and October 2004.

In addition to these lists which I am still subscribed to, I discovered there are a number of lists which I own which I had forgotten about. Here are another two examples:

HELPS (Historic Environment List For Projects and Societies) – 180 Subscribers
This list is designed to promote liaison between those recording all aspects of the historic environment, whether as part of a national project, a specialist interest group or locally based society. The list is intended for members to share experiences for the benefit of others, exchange information and provide mutual support.
Discussions in 2004 and only occasional publicity posting since, with last in June 2007 and July 2008.

INTEROP-CULTURE – 70 subscribers
The mailing list of the international group involved in shaping Interoperable Digital Cultural Content Creation Strategies.
One post in April 2006 but prior to that used from July 2001 to November 2004.

What is to be Done?

Does the existence of many moribund lists matter? This is a question which is very pertinent to UKOLN activities on behalf of the cultural heritage sector in providing advice on digital preservation issues. The need to make plans for decommissioning services was highlighted by Chris Sexton, UCISA chair, at a recent UCISA meeting in which, as she described in her blog, “We are all going to be faced with spending less, doing more with less, and deciding what we can stop doing“.

Deciding which lists no longer have a useful purpose can be helpful to a number of groups. Users who find the mailing list archives a potentially valuable resource may find that the search interface becomes more usable if the number of lists is decreased (there is no global search of the mailing lists and, as Google is blocked from the archives, searching selected mailing lists is a very time-consuming process). Deleting such lists may also help new users who are seeking relevant lists to join – at present they are statistically likely to join a moribund list if they make their selection based on the list descriptions. The JISCMail team may well find systems management easier if unwanted content is deleted, thus potentially freeing technical expertise which can be used to enhance other aspects of the service.

Policies and Processes For Decommissioning and Mothballing Lists

How should a list owner go about deleting unused lists? And aren’t there dangers that deleting the contents of lists which may have been used to influence the research process or provide possibly valuable historical insights on the content area covered by the list would be regarded as a mistake by future generations?

It would be a mistake, however, to regard digital preservation as simply meaning that digital resources should be kept forever. An important role for those involved in preservation activities is the selection of resources which are felt to be worthy of preservation and the deletion of the rest – and if such deletion activity is neglected there may be significant costs in ongoing maintenance.

I’m not aware of guidance for list owners on how they should go about developing policies for mailing lists and associated procedures for implementing such policies. The only relevant information I could find on the JISCMail Web site was a page on renaming or deleting JISCMail lists. This page allows a list owner to give the name of the list to be deleted and request a ZIP file containing the archives, files and list header.

No advice is provided, however, to assist list owners who may be considering deleting lists. It would clearly be inappropriate for a list owner to delete a still-popular list. But at what stage might it be felt that a list should be considered for deletion?  Do posters of messages to the list have any say in the matter (they own the copyright of their messages)? And who should take responsibility for consideration of the long-term importance of messages posted to the list?

In a bottom-up approach to attempting to answer such questions I will describe my thoughts on the DNER-TECH and INTEROP-CULTURE lists.

A summary of these lists is given below.

List: DNER-TECH
Date created: August 1999
List owner: Brian Kelly, UKOLN (although I was initially unaware of this as it used a non-standard variant of my email address)
Status: Open access to archives
Summary of purpose of list, ownership, etc: To discuss technical issues related to the DNER ( Distributed National Electronic Resource).
No. of subscribers: 50 (including 5 variants of my email address!)
Period of popularity: Small number of posts (2-3/month?) from 1999-2002.
Period of few and ‘non-essential’ posts (non-essential may include announcements, posts sent to multiple lists, etc.): Last discussion took place in July 2003.
Stakeholder communities and individuals: Software developers from JISC eLib and subsequent DNER (later renamed IE) programme; Chris Rusbridge? (eLib programme director); Rachel Bruce: (JISC); UKOLN.
Likelihood of messages being cited in research papers: Unlikely.
Other issues: -
Risks: Closure of this list would have no adverse effect. Deletion of the contents of the list would be unlikely to have an adverse effect, especially in light of the (now-dated) technical content of the list.

List: INTEROP-CULTURE
Date created: July 2001
List owners: Brian Kelly and Rosemary Russell, UKOLN
Status: Login required to view archives
Summary of purpose of list, ownership, etc: Set up by staff in UKOLN
No. of subscribers: 70
Period of popularity: Last posts in November 2004 and April 2006.
Period of few and ‘non-essential’ posts (non-essential may include announcements, posts sent to multiple lists, etc.): List appears to have been announcements only.
Stakeholder communities and individuals: Appears to have been set up for policy makers in cultural heritage organisations.
Likelihood of messages being cited in research papers or contain ‘significant’ content: Very low.
Other issues: Significant number of overseas subscribers.
Risks: Closure of this list would have no adverse effect. Deletion of the contents of the list would be unlikely to have an adverse effect. However in light of the international aspect of the list it would be prudent to ensure stakeholders have the opportunity to give their views.

Next Steps

Carrying out this research proved interesting in observing how these mailing lists failed to live up to their initial expectations. But what to do next? Some may feel that as the costs of the disk storage are trivial there is no need to do anything. However my view is that managed curation of such digital resources is needed. So I feel that I should send an email to these two lists announcing my intention to delete them, based on my review of the contents and my assessment of the risks of deleting the content. And since I no longer have an interest in the archives, anyone who wishes to maintain the content will be welcome to take on ownership of the lists.

But before taking this step I thought I would seek others views on these proposals. What do you think should be done?


[Note: this post has been updated with an updated chart of JISCMail usage statistics. You can view the original statistics published in the post, which covered the period 2003-2007.]

Posted in preservation | 8 Comments »