UK Web Focus

Innovation and best practices for the Web

Response To My Linked Data Challenge

Posted by Brian Kelly on 19 February 2010

The Linked Data Challenge

A week ago I issued a challenge to Linked Data developers – using a Linked Data service, such as DBpedia, tell me which town or city in the UK has the largest proportion of students. I’m pleased to say that a developer, Alejandra Garcia Rojas, has responded to my challenge and provided an answer. This post, however, is less about the answer than about the development process, the limitations of the approach and the issues which the challenge has raised. The post concludes with a revised challenge for Linked Data (and other) developers.

The Motivation For The Challenge

Before revealing the answer I should explain why I posed this challenge. I can recall Tim Berners-Lee introducing the Semantic Web at a WWW conference  many years ago – looking at my trip reports it was the WWW 7 conference held in Brisbane in 1998. My report links to the slides Tim Berners-Lee used in his presentation in which he encouraged the Web development community to engage with the Semantic Web. I was particularly interested in his slide in which he outlined some of the user problems which the Semantic Web would address:

  • Can Joe access the party photos?
  • Who are all the people who can?
  • Is there a green car for sale for around $15000 in Queensland?
  • Did someone driving a blue car send us an invoice for over $10000?
  • What was the average temperature in 1997 in Brisbane?
  • Please fill in my tax form!

I was interested in whether, 12 years on, such questions can be answered using what is now referred to as Linked Data. And as we have a large resource, DBpedia, which provides Linked Data for use by developers, I felt it would be useful to experiment with an existing resource (one based on the structured content held in Wikipedia). I was particularly interested in the following three questions:

  • How easy would it be for an experienced Linked Data developer to write code which would provide an answer? Would it be 10 lines of code which could be written in 10 minutes, a million lines of code which would take a large team years to write or somewhere in between?
  • Is the Linked Data content held in DBpedia of sufficient consistency and quality to allow an answer to be provided without the need for intensive data cleansing?
  • What additional issues might the experiences gained in this challenge raise?

A SPARQL Query To Answer My Challenge

In addition to issuing my challenge on this blog and using Twitter to engage with a wider community, I also raised the challenge in the Linked Data Web group in LinkedIn (note you need to be a member of the group to view the discussion). It was in this group that the discussion started, with Kingsley Idehen (CEO at OpenLink Software) clarifying some of the issues I’d raised. Alejandra Garcia Rojas, a Semantic Web Specialist, was the developer who immediately responded to my query. Within a few hours she provided an initial summary, and a few days later she gave me the final version of her SPARQL query (as described in Wikipedia, SPARQL is a query language for Linked Data), which was used to provide an answer to my question. Alejandra explained that it should be possible to use the following single SPARQL query to provide an answer from the data held in DBpedia:

prefix dbpedia-owl:
prefix dbpedia-owl-uni:
prefix dbpedia-owl-inst:

select ?town count(?uni) ?pgrad ?ugrad max(?population) (( (?pgrad+?ugrad)/ max(?population))*100) as ?percentage where {
?uni dbpedia-owl-inst:country dbpedia:United_Kingdom ;
dbpedia-owl-uni:postgrad ?pgrad ;
dbpedia-owl-uni:undergrad ?ugrad ;
dbpedia-owl-inst:city ?town.
optional {?town dbpedia-owl:populationTotal ?population . FILTER (?population > 0 ) }
}
group by ?town ?pgrad ?ugrad having( (((?pgrad+?ugrad)/ max(?population) )*100) > 0)
order by desc 6
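
For readers more comfortable with a general-purpose language than with SPARQL, the calculation the query is aiming at can be sketched in a few lines of Python. This is only a minimal sketch, not Alejandra’s method: it assumes the university and population figures have already been fetched into plain Python structures, and the function name, field names and sample figures are all hypothetical. Note also that, unlike the query as written (which groups by ?town ?pgrad ?ugrad), the sketch adds up the students of every university in a town before dividing by the population.

```python
from collections import defaultdict


def student_percentages(universities, populations):
    """Return {town: students as a percentage of the town's population}.

    universities: list of dicts with hypothetical keys
                  'town', 'postgrad' and 'undergrad'.
    populations:  dict mapping town name to total population.
    """
    students = defaultdict(int)
    for uni in universities:
        # Sum postgraduates and undergraduates across all universities in a town.
        students[uni["town"]] += uni["postgrad"] + uni["undergrad"]

    result = {}
    for town, total in students.items():
        population = populations.get(town)
        # Skip towns with no usable population, in the spirit of
        # FILTER (?population > 0) in the query.
        if not population or population <= 0:
            continue
        result[town] = 100.0 * total / population
    return result


# Hypothetical figures, for illustration only.
example_universities = [
    {"town": "Exampleton", "postgrad": 1000, "undergrad": 4000},
    {"town": "Exampleton", "postgrad": 500, "undergrad": 4500},
    {"town": "Nowhere", "postgrad": 10, "undergrad": 90},
]
example_populations = {"Exampleton": 100000}

print(student_percentages(example_universities, example_populations))
```

The town with no population figure simply drops out of the result, which is one reason a full list of places cannot be expected from this kind of calculation.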

The Answer To My Challenge

What’s the answer, I hear you asking?  The answer, to my slightly modified query in which I’ve asked for the number of universities and the total population for the six towns with the highest proportion of students, is given in the table below:

City        Nos. of Universities   Student Nos.   Population   Student Proportion
Cambridge   2                      38,696         12           3224%
Leeds       5                      135,625        89           1523%
Preston     1                      34,863         30           1162%
Oxford      3                      40,248         38           1059%
Leicester   3                      54,262         62           875%

As can be seen, Cambridge, which has two universities, has the highest proportion of students, with a total student population of 38,696 students and an overall population of 12 people. Clearly something is wrong :-) Alejandra has provided a live link to her SPARQL query, so you can examine the full responses for yourself. In addition, another SPARQL query provides a list of cities and their universities.
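
A human reader spots the problem at a glance, but a program consuming the results would need an explicit plausibility check. The sketch below is one possible check, not part of the original query: it uses the figures from the table above and flags any place where the student count exceeds the population. The function name is hypothetical, and the check rests on the assumption that students are counted within a town’s population.

```python
# Figures copied from the results table: (city, students, population).
ROWS = [
    ("Cambridge", 38696, 12),
    ("Leeds", 135625, 89),
    ("Preston", 34863, 30),
    ("Oxford", 40248, 38),
    ("Leicester", 54262, 62),
]


def implausible(rows):
    """Return the cities whose student count exceeds their total population."""
    return [city for city, students, population in rows if students > population]


print(implausible(ROWS))
```

Every row in the table fails this check, which points to the population figures rather than to any one city’s entry. A check of this kind would not, of course, catch results which are only slightly wrong.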

The Quality Of The Data Is The Problem

I was very pleased to discover that it was possible to write a brief and simple SPARQL query (anyone with knowledge of SQL will be able to understand the code). The problem lies with the data. And this exercise has been useful in gaining a better understanding of the flaws in the data and of the need to understand why such problems have occurred.

Following discussions with Alejandra we identified the following problems with the underlying data:

  • The population of the towns and cities is defined in a variety of ways. We discovered many different variables describing the population: totalPopulation, urbanPopulation, populationEstimate, etc. – and on occasions there was more than one value in a variable. Moreover, these variables are not present in every city’s description, thus making it impossible to select the most appropriate value consistently.
  • A full list of all UK universities has not been analysed, because the query only processes universities that have student numbers defined. If a university’s entry does not give the number of students, it is discarded.
  • Colleges are sometimes defined as universities.
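
To make the first of these problems concrete, here is a minimal sketch of how a client might choose a population figure when a town has several candidate variables, some of them with more than one value. The variable names are those listed above; the preference order, the function name and the sample record are assumptions made purely for illustration. Taking the largest value where there are several mirrors the max(?population) used in the query.

```python
# Assumed preference order; the data itself gives no guidance on this.
PREFERRED = ("totalPopulation", "populationEstimate", "urbanPopulation")


def pick_population(town_properties, preferred=PREFERRED):
    """Return (variable name, value) for the first usable variable, or None.

    town_properties maps a variable name to a list of values, since a
    variable may hold more than one value for the same town.
    """
    for name in preferred:
        # Ignore missing, zero or negative values.
        values = [v for v in town_properties.get(name, []) if v and v > 0]
        if values:
            # Several values for one variable: take the largest.
            return name, max(values)
    return None


# Hypothetical record, for illustration only.
sample_town = {
    "urbanPopulation": [250000],
    "populationEstimate": [0, 120000, 118000],
}

print(pick_population(sample_town))
```

Any such rule is a judgement call: a different preference order would return a different population for the same town, which is exactly the difficulty described above.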

What Have We Learnt?

What have we learnt from this exercise? We have learnt that although the necessary information to answer my query may be held in DBpedia, it is not available in a format which is suitable for automated processing.

I have also learnt that a SPARQL query need not be intimidating and it would appear that writing SPARQL queries need not necessarily be time-consuming, if you have the necessary expertise.

The bad news, though, is that although DBpedia appears to be fairly central to the current publicity surrounding Linked Data, it does not appear to be capable of providing end user services on the basis of this initial experiment.

I do not know, though, what the underlying problems with the data are. It could be due to the complexity of the data modelling, the inherent limitations of the distributed data collection approach used by Wikipedia, limitations of the workflow process in taking data from Wikipedia for use in DBpedia – or simply that the apparently simple query “which place in the UK has the highest per capita student population” has many implicit assumptions which can’t be handled by DBpedia’s representation of the data stored in Wikipedia.

If the answer to such apparently simple queries will require much more complex data modelling, there will be a need to address the business models needed to justify the additional expenditure required to handle the complexity. And although there might be valid business reasons for doing this in areas such as biomedical data, it may be questionable whether this is the case for answering essentially trivial questions such as the one I posed. In which case the similarly trivial question which Tim Berners-Lee used back in 1998 – “Is there a green car for sale for around $15000 in Queensland?” – was perhaps responsible for misleading people into thinking the Semantic Web was for ordinary end users. I am now starting to wonder whether a better strategy for those involved in Linked Data activities would be to purposely distance it from typical end users and target, instead, identified niche areas.

A more general concern which this exercise has alerted me to is the dangers of assuming that the answer to a Linked Data query will necessarily be correct. In this case it was clear that the results were wrong. But what if the results had only been slightly wrong? And what if you weren’t in a position to make a judgement on the validity of the answers?

On the LinkedIn discussion Chris Rusbridge summarised his particular concerns: “My question is really about the application of tools without careful thought on their implications, which seems to me a risk for Linked Data in particular”. Chris went on to ask “what are the risks of running queries against datasets where there are data of unknown provenance and qualification?”

My simple query has resulted in me asking many questions which hadn’t occurred to me previously. I welcome comments from others with an interest in Linked Data.

An Updated Challenge For Linked Data (and Other) Developers

It would be a mistake to regard the failure to obtain an answer to my challenge as an indication of limitations of the Linked Data concept – the phrase ‘garbage in, garbage out’ is as valid in a Linked Data world as it was when it was coined in the days of IBM mainframe computers.

An updated challenge for Linked Data developers would be to answer the query “What are the top five places in the UK with the highest proportion of students?” The answer should list the town or city, student percentage, together with the numbers of universities, students and the overall population.

And rather than using DBpedia the official source of such data would be a better starting point. The UK government has published entry points to perform SPARQL queries for a variety of statistical queries – so Linked Data developers may wish to use the interface for government statistics and educational statistics.

Of course it might be possible to provide an answer to my query using approaches other than linked data. In a post entitled “Mulling Over =datagovukLookup() in Google Spreadsheets” Tony Hirst asked “do I win if I can work out a solution using stuff from the Guardian Datastore and Google Spreadsheets, or are you only accepting Proper Linked Data solutions”. Although I’m afraid there’s no prize, I would be interested in seeing if an answer to my query can be provided using other approaches.



34 Responses to “Response To My Linked Data Challenge”

  1. “select ?town count(?uni) ?pgrad ?ugrad max(?population) (( (?pgrad+?ugrad)/ max(?population))*100) as ?percentage where …” … is not W3C SPARQL. It might be related to some proposed SPARQL extensions, or to the implementation behind DBpedia.

  2. Alejandra Garcia-Rojas said

    You are right, those extensions (count, max) of SPARQL are provided by OpenLink Virtuoso, which hosts the DBpedia endpoint. They are definitely necessary to be able to answer the question in a single query. Thanks for pointing that out.

  3. Roger Hyam said

    Ah yes but the semantic web doesn’t solve the problem of deciding what data to include in an analysis – that would be crazy. The human should be choosing which graphs to include in an analysis and if the human makes bad choices he gets bad results. What the SW does is make it simple to combine/analyse the data from the chosen data sources and, possibly, make it easier to find those datasources – if they exist – which unfortunately they don’t! Oh and the tools to analyse the data (e.g. Protege) are still professor grade and not orientated to non-specialists. Other than that it is all just fab.

    • Thanks for the response. A concern I have is that when I first heard about DBpedia in any detail (at the Developers’ day session at WWW 2007 conference in Banff) it was positioned (by researchers, not marketing people) as providing a solution to generic queries. I can’t find slides from the presentations from that day but if you see the slides (PDF format) taken from a presentation by researchers at Freie Universität Berlin and Universität Leipzig given at the 16th IWWWC you’ll see that Linked Data researchers were suggesting that DBpedia can provide an answer to queries such as:

      Find 5 of each:

      o Tennis players born in Moscow
      o Bart’s chalkboard gags in Season 12
      o Soccer players wearing Nr. 11, playing for a club having a stadium with more than 40K seats and born in a country with more than 10M inhabitants

      We don’t, yet, know whether DBpedia can provide correct answers to a wide range of queries such as this or whether the wildly incorrect results for my query are more typical of DBpedia’s limitations.

  4. gemstest said

    Interesting and thought provoking, and quite a challenge for anyone who thinks it possible to establish a settled ontology for anything. BTW – there are, depending on the definition, 3 universities in Cambridge, and maybe 4 if you include people who live in Cambridge who study with the OU. Or more. Turns out that you have to define the terms in the query to fit the artificial ontologies, not the other way round, as in the definition of a population. For example, there is arguably, at anyone moment, an enumerable number of living individuals in the UK. But that number is never captured in the various statistics that describe the UK population. Census numbers maybe, but even in the census the definition of population has a specific meaning, which BTW changed between 1991 and 2001. http://www.statistics.gov.uk/downloads/census2001/definitions_chapters_1_5.pdf

  5. Social comments and analytics for this post…

    This post was mentioned on Twitter by ukwebfocus: Response To My Linked Data Challenge: The Linked Data Challenge A week ago I issued a challenge to Linked Data dev… http://bit.ly/9OHrK4

  6. Brian,

    More than anything else, this effort brings to surface an inherent contextual fluidity that pervades all forms of data analysis. What should become clearer to everyone is that
    “context lenses” have to be applied to Linked Data just as they apply to everyday data analysis reports from anywhere.

    DBpedia and Linked Data in general simply showcase realities that exist behind firewalls and closed world database applications.

    Remember, today’s financial crisis and economic downturn is a function of: opaque data and monopolistic data analysis.

    To conclude, there are no absolute truths, we have perspectives (derived from claims). Thus, your exercise has started the process of unveiling the real power of Linked Data; basically, perceived inaccuracies are triggers for perpetual improvements, media permitting :-)

    Kingsley

  7. Brian,

    Continuing from my earlier comment, there is a very important aspect of Linked Data that still remains quite mercurial. Basically, its about this sequence:

    1. Someone makes a claim or publishes a perspective (in a document that provides some context for the underlying data sources)
    2. Someone else thinks differently about the data (guided by their particular context lenses: function of many interleaving factors that are ultimately personal).

    Now without Linked Data, the items above lead to collisions (departmental politics even world wars). With Linked Data, you simply GET the data, inject your perspective into the burgeoning discourse, and publish.

    Look at the sequence of events re. the SPARQL query: AGR delivered an initial SPARQL (more than likely unaware of the SPARQL-BI extensions of Virtuoso), I responded with a SPARQL variant that then exposed those particular extensions, then AGR went back and refined.

    This LINK tries to pack what I am saying into a single LINK (i.e. one that others can modify and pass on via LINKs if they choose):

    1. http://bit.ly/buAL7K — SPARQL Results LINK
    2. http://bit.ly/bQDPjX — SPARQL Query LINK (ignore the warnings and go to the Advanced Tab which puts you in the Query Builder).

    Personally, LINKs are becoming canonical units of Open Discourse, and this (to me) is the intrinsic power of Linked Data :-)

  8. Sorry, lots of negative comment about this from me on Twitter this morning. To summarise… I think you said on Twitter that this was part of a ‘critical’ approach to Linked Data. I’m concerned that it is not, for two reasons:

    Firstly, it is (in some senses at least) the same as going back to 1993 and asking, “can the Web provide an answer question X?” (to which the answer would have nearly always been “no” or “not very well” or “yes it can but the answer is wrong”). I’m not sure how helpful that kind of question would have been then and I’m not sure how helpful it is now. It’s the kind of question tabloid papers used to regularly ask about Wikipedia, usually with superficial analysis of the answers (hence my ‘tabloid headline’ comments on Twitter earlier).

    Secondly, it is generally the case that bespoke solutions are quicker/better than generic solutions for any individual challenge. So if you only present a series of individual challenges, to which the Tony Hirsts of this world can say, “I can give you a better individual answer by creating a bespoke Google spreadsheet”… then the answer will always be a Google spreadsheet.

    Linked Data isn’t about solving individual questions using bespoke tools, it’s about providing data in a more widely re-usable form – so that more generic tools are likely to be able to give us the answers.

    Now, of course, generic stuff costs more, is intellectually more challenging (i.e. seems more academic – to note a concern raised by Mike Ellis on Twitter) and so on. So, there are definitely some very hard questions to answer about cost/benefit and so on. On that basis I’m absolutely not saying Linked Data is the answer, or that it is a better approach than the Google spreadsheet one… but I’m also not convinced that if we are serious about adopting a “critical” approach to Linked Data this question/challenge is a helpful one.

    • gemstest said

      I must be missing something, but I would have thought the 1993 question was and is helpful and interesting. And just on a point of accuracy the level of interest in Wikipedia wasn’t Tabloid, unless you mean wasn’t peer-reviewed. http://en.wikipedia.org/wiki/Reliability_of_Wikipedia.

      On the second point, how is linked-data going to prove its value unless it is useful in some specific real-world way?

      I want to see linked-data move forward if only to improve the quality of public sector datasets. But it will only do so if it meets some pragmatic challenges. If it doesn’t it will be like the world of hypertext pre HTML – the one which turned down a TBL paper (Weaving the Web p55)

    • Hi Andy – Ta for the comment. I think you misinterpreted a Twitter discussion with Mike Ellis which argued for healthy scepticism towards Linked Data. Twitter does necessitate truncated discussions, so it’s generally wrong to belittle such discussions, I feel, especially if, as in this case, there are more detailed observations published elsewhere.

      I posed the challenge as I was interested in whether DBpedia could answer the (seemingly) simple question I posed. I asked this as DBpedia has been positioned (by the research community – see e.g. the slides, in PDF format, I mentioned above) as providing a solution to such queries. My aim was to gather evidence as to whether such use cases can actually be implemented. I had expected that it would be possible to get some reasonable answers, with a debate then taking place about the validity (e.g. places such as the Open University, other HEIs with large numbers of remote learners, etc.) I was very surprised to discover that many of the answers to the query were clearly wrong.

      Hence my willingness to engage in an open discussion about the implications of this. I’m now interested in the reasons for the problems with the data used in the examples, how such problems might be addressed and whether they should (i.e. maybe we should forget about DBpedia and use Linked Data from official data sources).

      I would also argue that there is a need to address the business cases for developing Linked Data solutions – and if it is felt currently to be too expensive or difficult to provide consistent linked data, then maybe providing open data (e.g. data stored in Google Spreadsheets) might be a useful route to go down. Note that I asked Wendy Hall and Nigel Shadbolt a question on these lines when they gave the opening plenary talk at the Online Information 2009 conference – the responses were along the lines that providing open structured data might be a valuable step in moving towards the Semantic Web as can be seen by these tweets from @LBrad:
      #online09 Nigel Shadbolt talking about Linked data “the pragmatic semantic web” & uk govt initiatives
      and Wendy Hall (@DameWendyDBE) herself:
      online09 @Nigel_Shadbolt now talking about the pragmatic semantic web

      Brian

    • Paul Walk said

      Really unclear what these generic tools Andy Powell describes might be – or how they might be less ‘bespoke’ than a spreadsheet (AKA table of data)?

      • Gemstest,
        I agree with your points… note that I didn’t say the “level of discussion around Wikipedia was tabloid”. I said that some of the discussion around Wikipedia was tabloid, e.g. at the level of “the wikipedia entry for X contains obvious errors, therefore the whole of Wikipedia must be useless, and therefore our children should not be allowed to look at any of it”.

        Brian posed his challenge and subsequently noted that there could have been a “quick win” for Linked Data (from one of his tweets). What he got was a “quick fail” for either DBPedia or Linked Data depending on how you look at it (or choose to interpret it). My point is only, “so what?”… neither case really tells you anything that we didn’t know before??

        Paul,
        (not sure why you refer to me in third person? :-) ). Re: generic tools… anything that moves the developer a step (or more) away from writing software that has to ‘know’ the precise detail of a particular dataset is what I mean by a generic tool. With a Google spreadsheet I have to know what each column means (typically by reading the ‘human-readable’ label) and I have to code specifically to each column. One of the intentions of Linked Data is to move a little away from that – particularly thru the use of inferencing.

        Now… that said, I completely agree with you that it is absolutely unclear whether, in reality, that is a likelihood. I am not arguing from a pro-LD position here. Far from it. I’m actually arguing from a rather skeptical LD position. I think the costs of creating LD are likely to be very high, I think the data modelling skills required are rather rare, I’m not convinced there is an appetite to model stuff sufficiently consistently across data-sets to move away from the bespoke situation anyway, and I also suspect that the costs of application development are also rather steep.

        But to properly understand the relative merits of LD we’ve got to separate out the issues a little more clearly. If we want to challenge the data quality of DBPedia, let’s do so. If we want to discuss the costs/benefits of a LD approach, let’s do so. If we want to discuss how consistently people model stuff, let’s do so. But please don’t let’s mix the two things up in the same discussion based on a single proposition, or at least not in the sense that the headline refers to one thing and the discussion refers primarily to another. I agree that part of the problem here is that these things are highly intertwined.

        Finally (phew) Brian,
        yes, I take your point about DBPedia being cited too centrally in the LD cause (and agree with it) but I think my final para above still applies. I’ve re-read your conclusions above and I still don’t really know what we’ve learned other than people can use LD to expose poor/incorrect/misleading data – which is a bit like saying that people can use the Web to expose poor/incorrect/misleading data? Both statements are uncontroversially true but are hardly earth-shattering (unless you were really that naive about the Semantic Web, which I find hard to believe).

  9. Tony Hirst said

    @andypowe11 said: “Linked Data isn’t about solving individual questions using bespoke tools, it’s about providing data in a more widely re-usable form – so that more generic tools are likely to be able to give us the answers.”

    Well it is and it isn’t – it’s about providing data that can be related to particular things via unique and persistent identifiers; knowing the location of a datastore, an identifier, and one or more relations allows you to get hold of particular attributes or properties of the identified thing.

    Part of the power of reuse comes from the ability to write powerful queries. Whenever I do a training session on using Google, I’m no longer surprised that it’s only the minority, if any, who know how to use site:, filetype:, etc etc. God help us all if we’re going to expect people to write SPARQL queries…;-) (SPARQL interface is still a usability straw man, I know…)

    So as devil’s advocate….
    … with spreadsheets, I’d argue the 2D layout encourages you to decompose a problem into component parts, and use simple formulae applied in turn across several rows or columns to construct quite complex manipulations. Certain of those formulae might be used to pull in data from live data stores; “linked data” is achieved by using the results of one formula as arguments in the next.

    If you look at the structure of a lot of SPARQL queries, they return one or more rows of results – separate arguments in separate columns returned from a query based around. Often, one column will define a key element with the other columns representing properties or attributes of that element. The rows that are returned may also be selected based on the value of one or more of the “column” values (e.g. the name and population of all universities where the student satisfaction is less than 70% and the population is greater than 10000).

    But who constructs the query? And where? What makes you think that any particular query is NOT “a bespoke tool”?

    In my current dabblings with spreadsheets’n’linked data, (e.g. http://ouseful.wordpress.com/2010/02/17/using-data-from-linked-data-datastores-the-easy-way/ ) I’ve taken the approach of viewing SPARQL queries as particular tools, and then wrapping them in an abstraction layer that allows the queries to be called in parameterised form from within a spreadsheet using an everyday looking formula. By providing building blocks in this way, I would argue that we are providing a way of “providing data in a more widely re-usable form” in terms that a wide percentage of Office (sic;-) workers can understand?

    • Tony Hirst said

      PS I do know why Linked Data is better, of course – in Linked Data, the audit trail that describes where any particular piece of data came from is discoverable via the unique IDs; when you’re pulling data from spreadsheets, there’s potentially (!;-) a bit more uncertainty involved in proving that this piece of data about this Thing in this spreadsheet is referring to the same Thing that appears in another spreadsheet

    • Tony,
      as I said in response to Paul above, I’m not arguing from a pro-LD position here. I’m arguing from a “pro-clarity” position.

      And yes, I take your point about SPARQL queries and bespoke tools.

  10. ..I’m also rather acutely aware that the level of discussion around this post rather waters down my argument against it.

  11. Also note (to refer back to a Twitter comment by Mike Ellis) that a Google search for:

    Is there a green car for sale for around $15000 in Queensland

    (one of the Semantic Web’s original use-cases) very rapidly finds the following:

    2000 Ford Falcon $11,990 Gympie
    Green XLS SUPER CAB AU II Utility

    Very sporty Ford Falcon Supercab. 2000 model, automatic, 6 cylinders. Features include air conditioning, cd player, power windows, tow bar and tonneau cover. Excellent price for a vey tidy vehicle! Come in today and check it out!

    from http://www.sunshinecoastcars.com.au/

    :-)

  12. I think the example does point up the issues of trust and provenance when dealing with linked data.

    The example reminds me a bit of this story from the THE, where they rang Libraries and asked what the boiling point of Ethanol was (http://www.timeshighereducation.co.uk/story.asp?storycode=401533). They found that the majority of wrong answers to this question used the figure given in Wikipedia at the time.

    Of course, the thing is that the boiling point of any liquid is going to be a function of pressure, and so there isn’t a ‘correct’ answer to the question as such – the question is too vague for that.

    I suspect we could spend quite a while deconstructing the initial question – what is a student? what is a place? do you mean resident students? what time of year?

    This I think points to another issue – which is if ‘the semantic web’ is going to answer questions like “Is there a green car for sale for around $15000 in Queensland?” this is going to take more than just linked data and SPARQL, but the ability to make sense of the question given the context, and the appreciation that it may be appropriate to prompt for more information if the context is ambiguous:

    Does green mean the colour, or environmentally friendly
    Is that US or AUS dollars?

    I do take issue with the statement “The bad news, though, is that although DBpedia appears to be fairly central to the current publicity surrounding Linked Data, it does not appear to be capable of providing end user services on the basis of this initial experiment.”

    I think the best we can really conclude at this point is that the population data in Dbpedia is not being correctly translated from Wikipedia – since a quick check shows that the population figures you quote here are not the same as the ones in Wikipedia. It may point to overall weaknesses in the approach taken in creating Dbpedia – this needs some more work I think. The student figures given above (aside from the Leeds one) look more reasonable to me, although that’s just gut instinct.

    Perhaps a slightly bigger issue (for me) for dbpedia, is that even if the data had been accurately extracted from Wikipedia, it still doesn’t have any provenance attached to it. If we go back to the population of Cambridge it’s not clear (to me) where the figures quoted in Wikipedia come from. Now, looking at the Wikipedia figures I can say ‘looks OK to me’, but this isn’t what happens when you do a linked data query – these figures could be buried deep, and in the final result I may not see these figures, or be able to tell the impact they have had on the final query – so I’d argue it is more important here that we can give credence to all the sources used throughout the query without checking them all individually.

    The point about GIGO is clearly well made. I’d also take this slightly further – that making it possible to combine different bits of data does not mean it makes sense to do so. There is a risk that you could take two good, trusted sources, and combine them together to come up with nonsensical answers.

    Feel this is a little bit all over the place as a comment – hope it makes some kind of sense

  13. [...] Response To My Linked Data Challenge [...]

  14. Brian

    I started to comment on your question but got a bit carried away and ended up writing a blog entry.

    I think if we really want to answer this question with Linked data, dbpedia is rather beside the point. Excellent though it is, and I use it a lot, it is an automated parse of a hand-written source of second or third hand data (at least for the domain of the question). If Linked Data means anything, it means linking disparate data sets published authoritatively as close to the data source as possible. Whilst some parts of the data chase are available as RDF (the EduBase2 data for example), the bulk is not yet there. The data.gov.uk initiative is certainly driving change, but we are still stuck I think on traditional data integration problems as simple as the need to adopt common identifiers for entities such as Educational Institutions (with or without enclosure in a URI) and definitions of terms such as “student” and “town”.

    Chris

    • I think this is key. “Web 1.0” developed because anyone could publish their own content and could link to other content. The SW requires the same freedom, together with the freedom to publish links between content (such as sameas.org). Without that, and the discovery resources to go with it, the SW is another walled-garden hyperlink system, destined for niche markets at best and oblivion at worst.

      How do we ease the development of the tools and services to allow simple publishing of LD, so that any fool can do it and any other fool can discover and use the result?

  15. [...] a lot of inconsistent data problems. This is salient to a recent discussion on twitter around Brian Kelly’s Linked Data challenge. One conclusion was that it was difficult, because the data was ‘bad’. IMHO, this is [...]

  16. All,

    Here is a modified query [1], still not there yet due to source data issues, but at least we have percentages that make sense, etc.

    Again, when I posted the initial example, the real goal was to demonstrate how sharing queries addresses the following issues:

    1. Subjectivity inherent in all reports that creates perpetual context fluidity (our context-halo vs the context-halo of the information producer)
    2. Loose Coupling of Data and Information (these items are distinct but often conflated) en route to producing alternative views or debugging (if you will) existing ones (i.e. how did we arrive at this presentation of facts).

    Linked Data is about the ability to perform open-drill-downs from a piece of information exposed via a URL courtesy of the Generic Data Object URIs that constitute the Entity-Attribute-Value graph in the container document.

    Links:

    1. http://bit.ly/ae3h0w — modified query with actual percentages

  17. [...] it?  Not very, if we follow the example set by Brian Kelly of the University of Bath.  He set a challenge to some students to use Linked Data to find out which UK city has the highest proportion of students.  Within a few [...]

  18. My attempt last night (after the Wales/France match of course) to answer the question with a mixture of data sources and XSLT/XQuery/Sparql is described in my blog.

    The final table puts Milton Keynes at the top, Oxford second.

    The main problems encountered were: the lack of identifiers in the HESA data to match the RDFed eduBase data; differences in institution names between the two datasets; universities in Scotland, Wales and Northern Ireland not being in the eduBase data, along with other scope differences; problems in choosing a suitable definition of ‘town’ to match with ONS areas; and the definitional problem of student location, which leads to the Milton Keynes answer. I used some SPARQL, but mostly XML manipulated with XQuery and XSLT.
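
    As a rough illustration of the name-matching problem described above, here is a minimal Python sketch of joining two tables that share no identifier, using a crudely normalised institution name as the key. This is not the XQuery/XSLT approach actually used; the function names and every record below are invented placeholders, and a real match against HESA and eduBase names would need far more careful rules.

```python
import re


def normalise_name(name: str) -> str:
    """Reduce an institution name to a crude matching key (an assumed rule set)."""
    key = name.lower().replace("&", " and ")
    key = re.sub(r"[^a-z0-9 ]+", " ", key)  # replace punctuation with spaces
    words = [w for w in key.split() if w != "the"]  # drop a leading/inner "the"
    return " ".join(words)


def match_by_name(left: dict, right: dict):
    """Join two name-keyed tables on the normalised name.

    Returns (matched, unmatched): matched maps each left-hand name to
    (left value, right value); unmatched lists left-hand names with no partner.
    """
    index = {normalise_name(n): v for n, v in right.items()}
    matched, unmatched = {}, []
    for name, value in left.items():
        key = normalise_name(name)
        if key in index:
            matched[name] = (value, index[key])
        else:
            unmatched.append(name)
    return matched, unmatched


# Placeholder records, invented for illustration only (not real data).
student_counts = {
    "The University of Exampleton": 1000,
    "Sampleford College, Arts & Design": 200,
    "Northtown Institute": 50,
}
locations = {
    "University of Exampleton": "Exampleton",
    "Sampleford College Arts and Design": "Sampleford",
}

matched, unmatched = match_by_name(student_counts, locations)
print(matched)
print(unmatched)
```

    The unmatched list is the useful part: it shows which institutions silently drop out of the final table when the two sources disagree on names or scope.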

    I plan to try the same integration with triples generated from the source data and integration in a triple store as a demo for my students.

    Then I can set the task of answering the same question in the scope of, say, Europe or the world as an exercise :-)

    Chris

  19. Hi,

    I appreciate that an answer was obtained, and I’ve not read into the details of which datasets or definitions were used, but what is for sure is that there could be many different answers to this question depending on what was used.

    One of the big differences might be term time and out of term time calculations. Can it be done for out of term calculations so easily? This makes me wonder about students who commute to university. It also reminds me that the 2001 and 1991 Population Census data were collected at different times. I believe that one was in term time and the other out of term time. OK, not much mention of census data in this page, but it’s still interesting.

    It is possible to look at each term in the question and wonder about how to define it. Something is used for the answer given. The point I am making is that this should be clear and I’m sure that under different definitions the rankings would be different in the top 10.
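
    To make the point about definitions concrete, here is a small Python sketch that ranks the same towns under two assumed definitions of “student”: counted where the institution is based, or counted where the student lives. The town names and all figures are invented placeholders, not real statistics.

```python
# All figures below are invented placeholders, not real statistics.
# town: (resident population,
#        students registered at institutions based in the town,
#        students actually living in the town)
towns = {
    "Town A": (200_000, 150_000, 2_000),  # e.g. home of a distance-learning institution
    "Town B": (100_000, 25_000, 20_000),
    "Town C": (50_000, 5_000, 6_000),
}


def rank(towns: dict, use_residence: bool) -> list:
    """Rank towns by student proportion, highest first."""

    def proportion(item):
        population, registered, resident = item[1]
        students = resident if use_residence else registered
        return students / population

    ordered = sorted(towns.items(), key=proportion, reverse=True)
    return [name for name, _ in ordered]


by_registration = rank(towns, use_residence=False)
by_residence = rank(towns, use_residence=True)
print(by_registration)
print(by_residence)
```

    With these made-up numbers the town at the top under one definition falls to the bottom under the other, which is why the definition used needs to be stated alongside any answer.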

    Best wishes,

    Andy

    • OK, I’m not sure, but that is my guess ;-)

      One further consideration is that of temporal aggregation. I suggested differences in measurement at different times of year, but how about aggregated measurements over a period… Whether that would give a better answer really all depends… My guess is that both the proportions and the rankings change over time…

      Geographical analysis is knocking at the door of linked data…

      Toodle pip!

  20. [...] recent invitation to Linked Data developers to illustrate the potential benefits of Linked Data by providing an answer to a simple query using DBpedia as a data source generated a lot of subsequent discussion. A tweet by Frank van Harmelen (the Dutch Computer [...]

  22. [...] the utility of the Linked Data approach (e.g. A Challenge To Linked Data Developers (followed up in Response To My Linked Data Challenge) and Linked Data: my challenge, with some other possibilities here: 10 Ideas For Web of Data [...]

  23. [...] the question: “Which town or city in the UK has the highest proportion of students?“. One answer puts Cambridge first (you’ll notice the quite obvious mistakes in the data), while another [...]

  24. […] graphical representation of DBpedia’s linking […]
