UK Web Focus (Brian Kelly)

Innovation and best practices for the Web

Google Scholar Citations and Metadata Quality

Posted by Brian Kelly on 28 Nov 2011

Back in 2005 Debra Hiom, Amanda Closier and I wrote a paper entitled “Gateway Standardization: A Quality Assurance Framework For Metadata” which was published in the Library Trends journal. The paper (which is available in MS Word and PDF format from the University of Bath repository) described the systematic approaches to ‘spring-cleaning’ metadata taken by the SOSIG subject gateway which, at the time, was part of the Resource Discovery Network. The approaches taken at SOSIG reflected a quality assurance framework which was being developed by the JISC-funded QA Focus project and which was described in a paper on “Developing a quality culture for digital library programmes“.

The quality assurance approaches for metadata which we described in those papers were focussed primarily on service providers. However, six years later, the quality of metadata for resource discovery is no longer of relevance only to service providers. In a Web 2.0 environment, in which content providers can make their teaching and learning materials and research outputs available on a wide range of services without the mediation of information professionals, a much wider range of content providers needs to be aware of the risk that poor quality metadata can leave valuable content difficult to find.

I became aware of such risks while Surveying Russell Group University Use of Google Scholar Citations, which I described in a recent blog post. As mentioned in that post, there is a danger of over-counting the number of researchers who have claimed a profile by aggregating researchers from the University of Birmingham with those from the University of Alabama at Birmingham, or those from Newcastle University with the University of Newcastle, New South Wales.
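
To illustrate the over-counting problem (this sketch is mine, not something taken from the survey), a naive substring match over profile affiliations conflates the two Birminghams, whereas matching against a canonical institution name does not:

```python
# Hypothetical profile affiliations, as they might appear in
# Google Scholar Citations.
profiles = [
    "University of Birmingham",
    "University of Alabama at Birmingham",
    "Newcastle University",
    "The University of Newcastle, New South Wales",
]

# A naive substring match over-counts "Birmingham" researchers.
naive = [p for p in profiles if "birmingham" in p.lower()]
print(len(naive))   # 2 - but only one is the UK institution

# Matching against a canonical name avoids the conflation.
exact = [p for p in profiles if p == "University of Birmingham"]
print(len(exact))   # 1
```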

On further investigation I discovered entries from researchers who had misspelt the name of their university as “univeristy” – a common typo which I myself have made. Currently it seems there are only 33 such misspellings.
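
As a rough illustration of how such typos might be caught automatically, here is a minimal sketch using only the Python standard library; the word list and threshold are assumptions of mine, not part of any service:

```python
import difflib

# Words we expect to see spelt correctly in affiliation strings.
KNOWN_WORDS = ["university", "college", "institute"]

def flag_likely_typos(affiliation, cutoff=0.8):
    """Return words that are close to, but not equal to, a known word."""
    suspects = []
    for word in affiliation.lower().split():
        word = word.strip(",.()")
        if word in KNOWN_WORDS:
            continue
        close = difflib.get_close_matches(word, KNOWN_WORDS, n=1, cutoff=cutoff)
        if close:
            suspects.append((word, close[0]))
    return suspects

print(flag_likely_typos("Univeristy of Bath"))
# [('univeristy', 'university')]
```
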
In our paper we described how:

We have recommended to the JISC that those JISC-funded projects making significant use of metadata should address these issues as part of the project’s reporting procedures.

Whilst those issues are still valid for projects which have significant metadata requirements, there is now the question of what approaches individual researchers can take when uploading information about their papers which may be harvested by a range of services: unlike full-time information management staff, researchers are not in a position to implement metadata quality checking tools in the services they use.

So what can individual researchers do to ensure that their papers don’t become difficult to find in tools such as Google Scholar Citations?

I have experimented with tools such as Collabgraph, a finalist in the Mendeley/PLoS API Binary Battle. This helped me to spot that a number of the papers in my Mendeley library had two sets of co-authors listed in a single string. This brought home to me the potential benefits of visualisations for spotting errors in textual data.
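
The check itself need not involve a visualisation. A crude heuristic along the following lines (a sketch of mine, assuming author strings of the kind found in Mendeley exports) would flag the same records by spotting surnames which appear twice in one author field:

```python
import re

def looks_like_merged_authors(author_field):
    """Flag author strings in which the same surname appears twice,
    suggesting two author lists were concatenated into one string."""
    surnames = re.findall(r"[A-Z][a-z]+(?=,)", author_field)
    return len(surnames) != len(set(surnames))

records = [
    "Kelly, B., Hiom, D., Closier, A.",
    "Kelly, B., Hiom, D. Kelly, B., Hiom, D., Closier, A.",
]
for author_field in records:
    if looks_like_merged_authors(author_field):
        print("Check author field:", author_field)
```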

In addition to the use of such tools, a recommendation I am making to colleagues is to create a profile and check your pages while the service is still new and has only a small number of users. This means, for example, that I can search for authors called “Kelly”, discover that there are currently only 26 entries and confirm that there are no duplicate entries for me.

I can also search for my department, UKOLN, and check that the entries are correct. In this case we are fortunate in having a unique name for our department. However, in many other cases there may be legitimate variants: for example, I currently find seven entries for Computer Science, Southampton and 43 entries for ECS, Southampton, with the discrepancy due, in part, to many researchers having a foo@ecs.southampton.ac.uk email address.
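
Where verified email domains are visible, one way of reconciling such variants is to group profiles by the institutional suffix of the domain. A minimal sketch follows; Google Scholar offers no public API, so the record structure and field names here are illustrative assumptions of mine:

```python
from collections import defaultdict

# Illustrative records; the field names are assumptions, not a real API.
profiles = [
    {"name": "A. Researcher", "affiliation": "ECS, Southampton",
     "domain": "ecs.southampton.ac.uk"},
    {"name": "B. Researcher", "affiliation": "Computer Science, Southampton",
     "domain": "southampton.ac.uk"},
]

by_institution = defaultdict(list)
for profile in profiles:
    # Keep only the last three labels (e.g. "southampton.ac.uk") so that
    # departmental sub-domains such as "ecs." collapse to one institution.
    suffix = ".".join(profile["domain"].split(".")[-3:])
    by_institution[suffix].append(profile["name"])

print(dict(by_institution))
# {'southampton.ac.uk': ['A. Researcher', 'B. Researcher']}
```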

As I started to reflect on the ways in which errors could be introduced into such services, and the ways in which end users might search for resources, I realised that although early adopters can gain benefits from creating profiles in such services (by gaining additional exposure for one’s research and being able to spot errors more easily while there are only small numbers of profiles available), at some point the bottom-up approach will suffer from limitations. What we will really need is centralised provision of quality assured metadata about research publications. But services such as Google Scholar Citations won’t disappear in the short term (although, as with a range of other Google services, they could disappear in the future if they turn out not to be aligned with Google’s business interests). My conclusion: be an early adopter in order to provide another mechanism for making one’s research papers more visible, but be prepared to accept the risk that the benefits may not last forever.

One Response to “Google Scholar Citations and Metadata Quality”

  1. Brian, you are making a big leap in the last paragraph, from the existence of errors in crowd-sourced data to the possible death of crowd-sourced services and the necessary success of authoritative ones. Surely the answer really is to build architectures that allow errors in crowd-sourced data to be identified and corrected by the crowd, in this case particularly by the “owners” of the data. Your Collabgraph seems like one good way to do just that…
