UK Web Focus

Innovation and best practices for the Web

Approaches To Debugging The DBpedia Query

Posted by Brian Kelly on 24 February 2010

My recent invitation to Linked Data developers to illustrate the potential benefits of Linked Data by providing an answer to a simple query using DBpedia as a data source generated a lot of subsequent discussion. A tweet by Frank van Harmelen (the Dutch computer scientist and Professor in Knowledge Representation & Reasoning, I assume) summarised his thoughts on the two posts and the related behind-the-scenes activities: “insightful discussion on #linkeddata strengths, weaknesses, scope and limitations”.

But as described in the post, the answer to the question “Which town or city in the UK has the largest proportion of students?” was clearly wrong.  And if you view the output from the most recent version of the query, you’ll see that the answers are still clearly incorrect.

We might regard this ‘quick fail’ as being of more value than the ‘quick win’ which I had initially expected, as it provides an opportunity to reflect on the processes needed to debug a Linked Data query.

As a reminder here is the query:

#quick attempt at analyzing students as % of population in the United Kingdom by Town
#this query shows DBpedia extraction related quality issues which ultimately are a function of the
#wikipedia infoboxes.

# NB the prefix URIs were stripped from the page as displayed; the standard
# DBpedia namespaces are assumed below (dbpedia: is predeclared on the
# DBpedia endpoint but is spelled out here for completeness).
prefix dbpedia: <http://dbpedia.org/resource/>
prefix dbpedia-owl: <http://dbpedia.org/ontology/>
prefix dbpedia-owl-uni: <http://dbpedia.org/ontology/University/>
prefix dbpedia-owl-inst: <http://dbpedia.org/ontology/EducationalInstitution/>

select distinct  ?town ?pgrad ?ugrad  ?population (((?pgrad + ?ugrad) / 1000.0 / ?population ) ) as ?per where {
?s dbpedia-owl-inst:country dbpedia:United_Kingdom;
   dbpedia-owl-uni:postgrad ?pgrad;
   dbpedia-owl-uni:undergrad ?ugrad;
   dbpedia-owl-inst:city ?town.
optional {?town dbpedia-owl:populationTotal ?population. filter (?population >0) }
 }
group by ?town having (((?pgrad + ?ugrad) / 1000.0 / ?population ) ) > 0
order by desc 5

As can be seen, the query is short and, for a database developer with SQL expertise, the program logic should be apparent. But the point about Linked Data is the emphasis on the data and the way in which the data is described (using RDF). So I suspect there will be a need to debug the data. We will probably need answers to questions such as “Is the data correct in the original source (Wikipedia)?”; “Is the data correct in DBpedia?”; “Is the data marked up in a consistent fashion?”; “Does the query process the data correctly?” and “Does the data reflect the assumptions in the query?”.

Finding an answer to these questions might be best done by looking at the data for the results which were clearly in error and comparing the data with results which appear to be more realistic.
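As a first sanity check (a sketch only, run against the public DBpedia SPARQL endpoint at http://dbpedia.org/sparql) we might pull every populationTotal value recorded for one suspect row, Cambridge, and one apparently plausible row, Guildford, and compare them side by side:

prefix dbpedia-owl: <http://dbpedia.org/ontology/>

# list all populationTotal values for a clearly-wrong row and a
# plausible-looking row, for comparison
SELECT ?town ?population WHERE {
  VALUES ?town { <http://dbpedia.org/resource/Cambridge>
                 <http://dbpedia.org/resource/Guildford> }
  ?town dbpedia-owl:populationTotal ?population .
}

(VALUES is a SPARQL 1.1 construct; on an older endpoint a UNION of two triple patterns would do the same job.)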

We see that Cambridge has a population of 12 and Oxford a population of 38. These are clearly wrong. My initial suspicion was that several zeros were missing (perhaps the data was described in Wikipedia as population in tens of thousands). But looking at the other end of the table, the towns and cities with the largest populations include Chatham (Kent) with a population of 70,540, Stirling (41,243) and Guildford (66,773) – the latter population count agrees with the data held in Wikipedia.

In addition to the strange population figures, there are also questions about the towns and cities which are described as hosting a UK university. As far as I know neither Folkestone nor Hastings has a university. London, however, has many universities but is missing from the list.

My supposition is that the population data is marked up in a variety of ways – looking at the Wikipedia entry for Cambridge, for example, I see that the infobox on the right of the page (which contains the information used in DBpedia) has three population counts: the district and city population (122,800), the urban population (130,000) and the county population (752,900). But when I query DBpedia I find three values for population: 12, 73 and 752,900.
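One way to see how those figures have been marked up is to list every population-related property which DBpedia records for Cambridge (again only a sketch: the regular expression simply catches any property whose name mentions “population”):

# list every property of the Cambridge resource whose name
# mentions "population", together with its value
SELECT ?property ?value WHERE {
  <http://dbpedia.org/resource/Cambridge> ?property ?value .
  FILTER regex(str(?property), "population", "i")
}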

The confusion regarding towns and cities which may or may not host UK universities might reflect real-world complexities – if a town hosts a campus but the main campus is located elsewhere, should the town be included? There’s no clear-cut answer, especially when, as in this case, the data, from Wikipedia, is managed in a very devolved fashion.
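To see why a town such as Hastings appears at all, we can ask DBpedia which institutions it records as being located there (a sketch, using the same prefixes as the query above and assuming the resource name http://dbpedia.org/resource/Hastings):

prefix dbpedia-owl-inst: <http://dbpedia.org/ontology/EducationalInstitution/>

# which educational institutions does DBpedia place in Hastings?
SELECT ?institution WHERE {
  ?institution dbpedia-owl-inst:city <http://dbpedia.org/resource/Hastings> .
}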

I’ve suggested some possible reasons for the incorrect results of the SPARQL query and I am sure there may be additional reasons (and I welcome such suggestions). How one might go about fixing the bugs is another question. Should the data be made more consistent? If so, how might one do this when the data is owned by a distributed community? Or isn’t the point of Linked Data that the data should be self-describing – in which case perhaps a much more complex SPARQL query is needed in order to process the complexities hidden behind my apparently simple question.
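By way of illustration, here is a sketch (assuming the same DBpedia namespaces as in the prefixes above, and SPARQL 1.1 aggregate syntax) of what a more defensive version of the query might look like: it takes the largest of the several recorded population figures for each town, and discards any row in which the students outnumber the residents, which can only be bad data:

prefix dbpedia: <http://dbpedia.org/resource/>
prefix dbpedia-owl: <http://dbpedia.org/ontology/>
prefix dbpedia-owl-uni: <http://dbpedia.org/ontology/University/>
prefix dbpedia-owl-inst: <http://dbpedia.org/ontology/EducationalInstitution/>

# sum students over all universities in a town, take the largest
# recorded population, and keep only rows where students <= residents
select ?town (max(?population) as ?pop)
       ((sum(?pgrad + ?ugrad) / max(?population)) as ?ratio)
where {
  ?s dbpedia-owl-inst:country dbpedia:United_Kingdom ;
     dbpedia-owl-uni:postgrad ?pgrad ;
     dbpedia-owl-uni:undergrad ?ugrad ;
     dbpedia-owl-inst:city ?town .
  ?town dbpedia-owl:populationTotal ?population .
}
group by ?town
having (sum(?pgrad + ?ugrad) <= max(?population))
order by desc(?ratio)

Even this, of course, only papers over the underlying data problems rather than fixing them.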


10 Responses to “Approaches To Debugging The DBpedia Query”

  1. Brian,

    One approach is to do the natural thing, i.e., drill down from the record of concern, via its Generic HTTP URI, to its underlying details [1].

    The ultimate power of Linked Data is that it decouples “Data” from the initial presentation context (“Information”), thereby enabling you to build alternative presentations based on your specific context. This decoupling via Generic HTTP URIs that de-reference to RDF-model-based Entity-Attribute-Value graphs remains somewhat mercurial :-)

    Links:

    1. http://twitpic.com/152cw2
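    For example, the same drill-down can be approximated at the SPARQL endpoint itself: a DESCRIBE query returns every triple DBpedia holds about a suspect resource (Cambridge is used here purely as an illustration):

    # return all triples DBpedia holds about the resource
    DESCRIBE <http://dbpedia.org/resource/Cambridge>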

  2. Chris Rusbridge said

    @kingsley But surely this example shows precisely that decoupling “data” from context conceals errors that would be much clearer in context. I’m a supporter of Linked Data, and I suspect the nature of the data source in this particular query is part of the problem. But such problems could turn up unheralded in almost any query on a dataset of unknown provenance and quality. GIGO, as they used to say, and presumably unless you know a great deal about the input dataset you’d be wise to assume the output is indeed garbage.

  3. I share the concerns emerging in this discussion – I wrote a note about it a few weeks ago, following some work I was doing on the ASHE data. The thrust is that simple methods of converting spreadsheets to XML / RDF were in danger of losing the small print such as footnotes (which explain a change in meaning of a code or a boundary change etc) and confidence levels (which explain which values can be relied on). Actually the ASHE data has supplementary confidence level tables which could be converted but are still likely to be ignored.

    In the data.gov.uk initiative, the SCOVO vocabulary is being used to express multi-dimensional data but in this standard there is no explicit way to express confidence levels nor to express the caveats about data interpretation even in human-readable terms, let alone coding these semantically. Ultimately, government departments must be persuaded to make the underlying databases available in a semantic format but there is a tricky trade-off between accuracy and simplicity of use here. The XML statistics vocabulary SDMX can express more of the subtlety of the data but complex professional standards are not the simple tabular data or simple triples which users want.

    However we clearly need to be able to access this raw data as Linked Data, because the DBpedia exercise clearly shows the added dangers of working with second- or third-hand data which has been retyped into Wikipedia before being scraped and tagged into DBpedia. Anyone for Chinese whispers?

    Chris

    • Thanks for the comment (and your related blog post).

      There might be a need to ask whether the solution to providing official statistics in a reusable, Linked Data format is likely to be more easily achieved by the processing of the Excel spreadsheets or by the community involvement around Wikipedia/DBpedia. We might feel that the former should provide the best approach but, as you’ve pointed out, it’s not as easy as it seems.

      • Alejandra Garcia-Rojas said

        I’d like to comment that my attempt at providing an answer to Brian’s challenge was intended to prove the feasibility, for a developer, of creating a query that could answer his question… unfortunately, it was a wrong answer. And my argument for this wrong answer was that DBpedia should not be the source of this kind of information, but UK.gov, just as Chris Wallace said.

        As a developer, and a fan of Linked Data, anticipating the future, I imagine that:
        WHEN gov.uk finishes publishing their data through the SPARQL endpoints for education and statistics,
        AND SPARQL 1.1 becomes a W3C Recommendation, THEN I would like to test a query like this one:

        PREFIX uk-owl: <http://services.data.gov.uk/ontology/>
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX opencyc: <http://sw.opencyc.org/2008/06/10/concept/en/>

        SELECT ?city ((?students/?population)*100 AS ?percentage) {
          { SERVICE <http://services.data.gov.uk/education/sparql> {
            SELECT ?city (SUM(?s) AS ?students) WHERE {
              ?uni uk-owl:students ?s ;
                rdf:type opencyc:University ;
                uk-owl:city ?city .
            } group by ?city
           }
          }
          { SERVICE <http://services.data.gov.uk/statistics/sparql> {
            SELECT ?city ?population WHERE {
              ?city rdf:type opencyc:City ;
                uk-owl:totalPopulation ?population .
            }
           }
          }
        } order by desc(?percentage)

        The query goes to two different data sources, the education and statistics SERVICEs, asks each for the information concerning it (the number of students and the cities’ populations), and finally performs the arithmetic to get the percentage.
        We may not need to state that we want cities of the UK, because it may be obvious that the endpoints we query hold only local information. We could imagine adding another condition to the first part of the query saying that the students should reside in the same city.
        Maybe my example is too simplistic or even wrong, because it is my speculation. I do not know if this is too far from reality, but this is how I imagine a query giving the right answer. Is my Chinese whisper too daring? :)

  4. jane said

    Hastings has University Centre Hastings, an affiliate of Brighton University. http://www.uch.ac.uk/

  5. AGRM et al.,

    A better solution (result) arises when you get data from the right place. The UK Govt. Linked Data Spaces should be the specialist Data Spaces from which you JOIN against DBpedia :-)

    Here are example links showing a JOIN across DBpedia and the New York Times:

    1. http://bit.ly/bJokvr — query results
    2. http://bit.ly/9Dp062 — query definition.

    Kingsley

  6. Margaret Wallis said

    Just to confirm that we are extremely proud of the University of Brighton in Hastings. The link to our website is: http://www.uch.ac.uk/. UCH is a new university centre which seeks to encourage students from non-traditional backgrounds to study at Higher Education level. We have grown in six years from 40 students to 600+, with further growth planned over the next three years, offering a range of subjects including computing, social sciences, education, business and English Literature. We are a centre of excellence in media production, with courses in broadcast media, radio and TV production and Broadcast Journalism.

  7. [...] of Brighton in Hastings which “offers University of Brighton degrees”. As Margaret Wallis pointed out in response to my initial blog post this institution has “grown in six years from 40 [...]
