Will the Real Scott Wilson Please Stand Up, Please Stand Up

The Microsoft Academic Search Service

At the recent Science Online London (SOLO) 2011 conference I attended a session on the Microsoft Academic Search service.  There seem to have been a lot of developments to this service since I first signed up for it shortly after it had been announced.  I couldn’t get a decent feel for the service on my Android phone at the session since it uses Silverlight, which isn’t supported on my phone.  However the tweets for the session were curated using Storify and these resources have been embedded in a post on the Nature blog which includes Twitter summaries of the third breakout sessions. From these useful notes I find that:

  • Microsoft Academic Search has details of over 27 million publications (see tweet).
  • There is an expectation that there will be up to 200 million publications by next year, but the biggest flaw is the content (see tweet).
  • Users can edit the content in the database, i.e. crowd-sourcing is used to clean up the data (see tweet).
  • Co-author and citation graphs show relations which connect people (see tweet).
  • There is an open API for Microsoft Academic Search (see tweet).
  • Currently there’s no real system for claiming authorship on Microsoft Academic Search (see tweet).

I was impressed by the functionality and user interface. I was also interested in the issues raised in the tweets listed above regarding claiming authorship of papers and using crowd-sourcing to enhance the quality of the content.  I will discuss these issues in this post.

Personal Experiences

Shortly after returning to my office I reviewed the information the service held about my research papers. A summary for my papers is illustrated below.

Although I am aware of the papers I have published I hadn’t really looked at the statistical analysis of the papers.  In addition to the numbers of citations and the G-index and H-index scores, of particular interest was the information about the 37 co-authors of papers I have published during my time at UKOLN.
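For readers unfamiliar with the two scores, here is a minimal sketch of how they are conventionally defined. It is not how Microsoft Academic Search computes them (I have not seen that documented); the function names and the citation counts are made up for illustration.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations.

    This variant caps g at the number of papers in the list.
    """
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


# Hypothetical citation counts, not taken from any real profile.
example_counts = [10, 8, 5, 4, 3, 0]
print(h_index(example_counts), g_index(example_counts))
```

The G-index rewards a small number of highly cited papers more than the H-index does, which is why the two scores can differ for the same list of papers.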

As illustrated, the Microsoft Academic Search service allows you to view links between the co-authors. You can also produce a similar citation graph showing researchers who have cited your work.

You can also view the degrees of separation between two researchers, and I discovered that I have co-authored a paper with Sebastian Rahtz who has co-authored a paper with Dame Wendy Hall who has co-authored a paper with Sir Tim Berners-Lee.
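A chain like this can be found with a breadth-first search over a co-authorship graph. The sketch below is only an illustration of that general technique, not a description of how the service works; the function name is hypothetical and "Author" stands in for me.

```python
from collections import deque


def degrees_of_separation(coauthors, start, goal):
    """Return the shortest co-authorship chain from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the set of visited people
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for neighbour in coauthors.get(person, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = person
            if neighbour == goal:
                # Walk back from the goal to the start to rebuild the chain.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Co-authorship pairs taken from the chain described in this post.
pairs = [
    ("Author", "Sebastian Rahtz"),
    ("Sebastian Rahtz", "Wendy Hall"),
    ("Wendy Hall", "Tim Berners-Lee"),
]
coauthors = {}
for a, b in pairs:
    coauthors.setdefault(a, set()).add(b)
    coauthors.setdefault(b, set()).add(a)

chain = degrees_of_separation(coauthors, "Author", "Tim Berners-Lee")
print(" -> ".join(chain), "| degrees:", len(chain) - 1)
```

Note that a search of this kind is only as good as the graph: if two different people share one author record, it will happily report chains which do not exist.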

Interesting stuff – and if you are primarily a researcher the information on links and relationships may be particularly significant.  But can we trust the information which is depicted in the diagram?

Can We Trust The Information?

When checking details of my co-authors I noticed a number of errors. In the bottom right hand corner of the screen shot I have placed four of my co-authors:

Scott Wilson: Is based at CETIS, University of Bolton but on the Microsoft Academic Search service is listed as working at the University of British Columbia and apparently has published 74 papers.

Stephen Dean Brown: Is based at De Montfort University but on the Microsoft Academic Search service is listed as working at the University of Toronto and apparently has published 88 papers.

Richard Davies: Is based at University College London  but on the Microsoft Academic Search service is listed as working at the University of Sheffield and apparently has published 35 papers.

Lawrie Phippsa: Is based at JISC but on the Microsoft Academic Search service is listed with an incorrectly spelt surname (although he does have a correct identifier which lists him as being based at JISC and having published 10 papers).

Back in April 2011 I wrote a post in which I described What I Like and Don’t Like About the IamResearcher service.  I can recall how, having signed up for the service, I had to assert the papers which I had written, merge papers which had been assigned to different variants of my name and delete those which were incorrectly assigned to me. However since access to the service is restricted to signed-in users I wasn’t too concerned about the service.  The information held on the Microsoft Academic Search service, in contrast, is openly available and widgets are available which enable the information to be embedded elsewhere – and I have used this feature to include information about my papers on the UKOLN Web site.  But how should we address the problems caused by incorrect information which I have illustrated?

Maintenance Issues

A Researcher’s List of Publications

I have corrected a number of the errors I found in my details, but there is one paper, One world, one web … but great diversity, which is also listed again as One World, One Web … But Great Diversity. Despite my having tried to merge these two papers a week ago, the item is still listed twice (and is locked from further editing).  It would appear that there is a bottleneck in approving changes.  So although researchers should have a vested interest in ensuring that information is accurate and complete (after all the content of open services such as the Microsoft Academic Search should be harvested by Google and will thus enhance the visibility of the source content) this may not be easy to do, even if the authors are aware of the service and feel sufficiently motivated to correct any errors.

And since there are pressures from funding bodies to maximise awareness and impact of one’s research papers it would seem to be self-evident that researchers will be motivated to manage their content. But is this really true?  And even if researchers can be made aware of the potential benefits, will they feel the effort is worthwhile?
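The duplicated title mentioned earlier in this section differs only in capitalisation, which suggests that at least some duplicates could be spotted automatically before a human is asked to approve a merge. Here is a minimal sketch of one way to do that by comparing normalised titles. It is my own illustration and the function names are hypothetical; I do not know how the service itself matches records.

```python
import re
import unicodedata
from collections import defaultdict


def title_key(title):
    """Reduce a title to a case- and punctuation-insensitive comparison key."""
    # NFKC turns characters such as the ellipsis into plain equivalents.
    text = unicodedata.normalize("NFKC", title).casefold()
    # Replace anything that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


def find_duplicates(titles):
    """Group titles which share a key; return only groups with more than one entry."""
    groups = defaultdict(list)
    for title in titles:
        groups[title_key(title)].append(title)
    return [group for group in groups.values() if len(group) > 1]


titles = [
    "One world, one web … but great diversity",
    "One World, One Web … But Great Diversity",
    "Becoming an Information Provider on the World Wide Web",
]
for group in find_duplicates(titles):
    print("Possible duplicates:", group)
```

A check like this only produces candidates: two genuinely different papers can share a title, so a person would still need to confirm each merge.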

Additional Content

Services such as Microsoft Academic Search may automatically find the title and authors for a paper. However managing this information might include:

  • Providing links to details for the correct author.
  • Providing full citation details for the papers.
  • Providing links to PDF versions of papers, if available.
  • Providing conference details for papers published at conferences.

It may be felt to be the responsibility of the lead author to support the dissemination of a paper in this way (as well as having responsibility for the content and ensuring the paper is submitted in time). But in addition to maintaining details of the papers and co-authors there is also the need to consider other information which may not be as easy to determine.  For example recently while looking to summarise details of UKOLN’s peer-reviewed papers I noticed that authors’ institutional details had been split across UKOLN/Bath and the University of Bath. There appears to be a need to  aggregate this information in order to provide an organisational view of our research outputs.

Such a departmental view may help to provide an insight into changing areas of research interests. The accompanying image, for example, shows the subject of UKOLN’s research publications over time. From this we can see a long-standing involvement in the areas of information retrieval and human-computer interfaces. However this picture is skewed by not having all authors included under the same department (and by the information not being updated despite changes made over a week ago).
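The aggregation described above could be approached by mapping each variant of an affiliation to one canonical name before counting papers. The sketch below assumes a hand-maintained alias table; the table, the choice of "University of Bath" as the canonical name, the function names and the paper identifiers are all hypothetical.

```python
from collections import Counter

# Hypothetical alias table: keys are lower-cased variants, values the canonical name.
ALIASES = {
    "ukoln/bath": "University of Bath",
    "university of bath": "University of Bath",
}


def canonical_affiliation(name):
    """Map an affiliation string to its canonical form, if an alias is known."""
    cleaned = " ".join(name.split())  # trim and collapse whitespace
    return ALIASES.get(cleaned.casefold(), cleaned)


def papers_per_institution(records):
    """Count (paper, affiliation) records under canonical institution names."""
    return Counter(canonical_affiliation(affiliation) for _, affiliation in records)


# Made-up records: (paper identifier, affiliation as harvested).
records = [
    ("paper-1", "UKOLN/Bath"),
    ("paper-2", "University of Bath"),
    ("paper-3", "university of  bath"),
    ("paper-4", "University of Leeds"),
]
print(papers_per_institution(records))
```

The hard part is not the code but deciding who maintains the alias table, which is the same question of motivation and trust that applies to the rest of the data.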

Getting It Right

The cynic would blame Microsoft for the problems which I have identified, but I think this would be unfair.  I feel that the service does provide a very appealing interface which has advantages over Google Scholar, for example.

But what improvements are needed in order to enhance the quality of such services?

It seems to me that there are three main sources of information, each of which will have corresponding issues which will have to be addressed:

  • Information about the author:  This is information which we might expect the author to maintain  (name, contact details, host institution, previous employment, etc.)  However there will be a need for the author to be sufficiently motivated to claim their identity and maintain the information.  There will also be a question of trust.
  • Information about the author’s papers:  This is information which could be harvested from content provided by publishers, institutional repositories, etc. However, as has been illustrated, there will be a need to validate information which is harvested.
  • Information about the author’s institution: The host institution will have an interest in ensuring that the research outputs from its staff and research students are included.

It should be noted that there may be tensions between an individual’s and an institution’s view on such data. For example the outlier in the diagram shown above (a paper on “Becoming an Information Provider on the World Wide Web” published in 1994) should be included in my list of publications (it was the first peer-reviewed paper I wrote). However at the time I was working at the University of Leeds so it should not be included as a UKOLN/University of Bath publication.

We could regard the process of ‘getting it right’ as being primarily focussed on data modelling. But the Microsoft Academic Search service involves automated harvesting of large volumes of data from a range of sources, with an expectation that data cleansing will be carried out by ‘crowd-sourcing’, including by the authors themselves. There is therefore a need to consider what will motivate people to register for a system, check the information and be willing to update it.

For me important drivers for doing this include:

  • Updating data which is openly available, as I would have a vested interest in ensuring that information about my professional activities is correct and up-to-date. (I have no interest in updating information held in the IamResearcher service as this is closed.)
  • Having a richly functional, easy-to-use and visually appealing system which differentiates itself from other providers.
  • Being able to update the information easily and quickly.  Note that finding that information which I updated on the Microsoft Academic Search service has still not been approved after a week is a barrier to making any more updates to this system.

And although I may be willing to update the information about myself and my institution I am reluctant to correct errors about my co-authors. For example, although I know about the paper which Scott Wilson and I have co-authored and know that he is based at CETIS, I don’t know if he was based at Bolton or Bangor University when we wrote the paper. I also don’t know which papers written by Scott Wilson were written by the Scott I know and which ones were written by the Scott Wilson who is based at the University of British Columbia.  Will the real Scott Wilson please stand up!

