UK Web Focus

Innovation and best practices for the Web

“We Have the Highest Proportion of Students!”

Posted by Brian Kelly on 7 April 2010

Back in September 2001 I gave a talk at the JANET User Support Workshop, which was held at Loughborough University. I remember a Pro Vice Chancellor giving the welcome talk, during which he mentioned that “Loughborough has the highest proportion of students of any place in the UK” (or words to that effect). I remember him saying that because I worked at Loughborough University from 1984-90 and was interested in seeing how the increase in student numbers was changing the town centre – there were a number of superpubs which weren’t there when I lived in the town.

Last November I spent a few days at Aberystwyth University. While I was there, on my way to a CAMRA pub, I noticed large numbers of students (dressed as doctors and nurses) on a pub crawl around the town. This made me wonder if a small place like Aberystwyth might have overtaken Loughborough as the town or city in the UK with the largest proportion of students.

That was the background to my recent “Challenge To Linked Data Developers”, in which I asked “Which town or city in the UK has the largest proportion of students?”. To simplify the challenge and spare SPARQL developers the need to track down relevant official data sources, I asked that the challenge be addressed using data held in DBpedia, the RDF datastore of structured information extracted from Wikipedia. An additional aim was to gain an understanding of the quality of the data (and the data structures) held in DBpedia, which is frequently mentioned as having a central role to play in the Linked Data world.

A week after issuing my challenge I published the “Response To My Linked Data Challenge“. However the answers obtained from querying DBpedia were clearly incorrect – Cambridge, for example, doesn’t have a population of 12!
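The kind of sanity check that would have flagged such results can be sketched in a few lines of Python. This is only an illustration and not the query used in the challenge: the `Place` record, the `plausible` helper and the `min_population` threshold are all assumptions of mine, and the figures are made-up placeholders rather than DBpedia data.

```python
from typing import NamedTuple


class Place(NamedTuple):
    name: str
    population: int
    students: int


def plausible(place: Place, min_population: int = 1000) -> bool:
    """Reject records whose figures cannot both be right.

    min_population is an arbitrary illustrative threshold.
    """
    return (place.population >= min_population
            and 0 <= place.students <= place.population)


def rank_by_student_share(places: list[Place]) -> list[tuple[str, float]]:
    """Return (name, students / population) for plausible records, highest first."""
    shares = [(p.name, p.students / p.population) for p in places if plausible(p)]
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Placeholder figures, invented for illustration only.
sample = [
    Place("Town A", 60_000, 15_000),
    Place("Town B", 12, 18_000),  # population clearly mis-extracted
    Place("Town C", 200_000, 30_000),
]

ranking = rank_by_student_share(sample)
```

A record like “Town B”, with a population of 12, is dropped before any proportion is calculated instead of floating to the top of the ranking.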

On the DCC blog Chris Rusbridge has revisited my challenge in a post entitled “Linked Data and Reality”. Chris suggested that “If we care about our queries, we should care about our sources; we should use curated resources that we can trust. Resources from, say… the UK government?”. That may be true, but I wasn’t primarily after the correct answer when I formulated my challenge – I was more interested in whether DBpedia could provide a reasonable answer, how long it might take to write a SPARQL query and how complex such a query might be. This motivation was acknowledged by bitwacker in his comment that “I think Brian’s challenge should be seen as only a benchmark, a sampling of the effectiveness of linked data practices today.” That’s right – and I’m pleased to have noticed that the DBpedia community has recently issued an “Invitation to contribute to DBpedia by improving the infobox mappings”. In addition, Kingsley Idehen alerted me to the Yago, Opencyc, Umbel and Sumo ontologies, all of which have bindings to DBpedia. (I should also add that Kingsley has written a blog post on “DBpedia receives shot #1 of CLASSiness vaccine” which illustrates how new ontologies can be integrated with DBpedia.)

Perhaps DBpedia could have a role to play in answering the type of query I posed. After all, if you want to compare the proportions of students in towns and cities across several countries, mightn’t DBpedia be an easier place to seek an initial answer than having to find and query statistics from each of the individual countries (especially as the UK Government seems to be taking a leading role in expressing a commitment to Linked Data)?

In addition to suggesting that the query should use official Government sources of data (which Chris Wallace has used to provide an answer to my query), Chris Rusbridge also raised the need for clarity in the queries we pose. Using the Guardian Platform, Chris Wallace found that the place with the highest proportion of students is Milton Keynes, an answer Chris Rusbridge had suggested earlier in a LinkedIn Linked Data discussion. And yes, Milton Keynes, the home of the Open University, is likely to have a large number of registered students. But I don’t think the place will be full of students at the start of the academic year, since the Open University is a distance learning institution. The (implied) context of my query was a place where the proportion of students is large enough to affect the local environment, with large numbers of students in town during freshers’ pub crawls and, perhaps, little happening during vacations. So we should rule out the Open University. But what about other universities with a large number of students on distance learning courses? According to a tweet from lordllama, “About 41% of 23,000 students at Leicester University are on distance learning courses”.
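To show how much this interpretation matters, here is a rough sketch of the adjustment in Python. The function names are my own, and the only real figures are the ones quoted in the tweet (23,000 students, about 41% on distance learning). The town population is left as a parameter because the post does not give one.

```python
def resident_students(total_students: int, distance_share: float) -> int:
    """Estimate students physically present, excluding distance learners."""
    if not 0.0 <= distance_share <= 1.0:
        raise ValueError("distance_share must be between 0 and 1")
    return round(total_students * (1.0 - distance_share))


def student_share(students: int, population: int) -> float:
    """Proportion of a place's population made up of students."""
    if population <= 0:
        raise ValueError("population must be positive")
    return students / population


# Figures from the tweet quoted above: about 41% of 23,000 students.
leicester_resident = resident_students(23_000, 0.41)
```

On those figures roughly 13,570 of the 23,000 students would count towards the “students in town” reading of the question, so the same institution yields quite different proportions depending on which reading is used.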

There is also the question of how we should treat institutions such as the University of Brighton in Hastings, which “offers University of Brighton degrees”. As Margaret Wallis pointed out in response to my initial blog post, this institution has “grown in six years from 40 students to 600+”. But should those students be included in the totals for the University of Brighton or for Hastings? The general question is how we should treat institutions which have multiple campuses split across different towns or, as may be the case in this example, institutions which award degrees on behalf of other institutions.

You may also notice that my question about places with a large proportion of students is now talking about universities and university students. But what about students at FE colleges? And school children?

Chris Rusbridge highlighted such complexities: “The point is, these things are hard. Understanding your data structures and their semantics, understanding the actual data and their provenance, understanding your questions, expressing them really clearly: these are hard things.” Chris concluded “I’m beginning to worry that Linked Data may be slightly dangerous except for very well-designed systems and very smart people…” Chris probably had his tongue in his cheek with his ‘smart people’ remark, but he may be right with his warning that Linked Data might be dangerous. If a simple query such as “Which town or city in the UK has the largest proportion of students?” is open to a number of different interpretations, what are the implications for more complex queries?

In my “Response To My Linked Data Challenge” I described how Tim Berners-Lee introduced the Semantic Web by describing how it aimed to provide an answer to a query such as “Is there a green car for sale for around $15000 in Queensland?”. Tim described how, unlike the search engines of the day, a Semantic Web query would be able to find a result which was described as “Affordable maroon saloon for sale in Brisbane”. But that query is seeking to find additional results which would not be found by a traditional keyword search. The question “Which town or city in the UK has the largest proportion of students?”, however, is seeking to find a single answer. Might there be types of queries for which Linked Data might work and others for which it may be difficult or expensive to model the data? Or, to rephrase the question: what, specifically, is Linked Data for?

3 Responses to ““We Have the Highest Proportion of Students!””

  1. Alison McNab said

    [Ducking out of the challenge and settling for hearsay] When I was at Lboro, it was reckoned to be between Loughborough and St Andrews. I remember that JANET User Support Workshop – Rob K and I also gave a paper at it.

  2. [...] “We Have the Highest Proportion of Students!” « UK Web Focus [...]

  3. Ben Toth said

    Good question – “what, specifically, is linked data for?”, and it’s interesting to note that Richard MacManus is asking the same question – http://bit.ly/csa1hU. The problem for me is that the couple of examples I can think of (area health profiles and parametrised home location) are already implemented reasonably well without linked data or the semantic web. And the travel example in RWW doesn’t seem worth a huge amount of effort. It might be a lack of imagination on my part so I’ll follow the discussions with great interest.
    Come to think of it, there are a few examples from health care, for example, patterns of resource usage for particular groups or particular conditions. These analyses require a lot of manual effort currently. The main challenges in producing these are data availability, quality, and in some cases the need to link patient records. This last has some relationship to linked data, but raises some tough questions about confidentiality. It’s worth noting that experiments with computer based record linkage started in the NHS in the 1960s (Lester Gill http://bit.ly/bLIibP ). They were abandoned before the National Programme for IT was established, but it is not clear that NPfIT will produce a replacement. It’s maybe also worth noting that for health planners the real issue is not just the ability to link data but to model the impact of change (for example the potential impact of a new telemedicine service for Heart Failure on hospital admissions).
