Methods – GeoScimo

Bibliographic database and spatial analysis

Our research is based primarily on data from the Web of Science Core Collection (WoS). This database includes publications (articles, reviews, and letters) dating back to the 1900s in the “major international scientific and technological journals” – those most cited by researchers themselves.

Since the early 2000s, the content of over 10,000 scientific journals per year has been indexed in this database. An important part of our results relates to the content of the Science Citation Index Expanded (SCI Exp), which focuses on the natural sciences and technology. Among the three major indexes compiled in the WoS, SCI Exp has the most extensive and reliable coverage for a dynamic analysis of publication activity worldwide.

The geocoding process

The quality of automatic geocoding tools (Google, Yahoo!, Bing, etc.) varies widely when they are applied to datasets such as ours, which are worldwide in scope and span several decades.

In the WoS, the authors’ addresses are broken down into several fields, of which we selected three: city, province, and country. Our target scale for analysis was the city level.

Error control and correction were lengthy, and were helped by the development of a user-friendly online visualization tool shared among all project participants. This interactive cartographic application helped to evaluate the quality of the ongoing geocoding process and to begin interpreting the data geographically.

Several test zones were constructed to ascertain the quality of the geocoding process. This was particularly helpful for visually detecting data that had been geocoded in the wrong place or had gone missing. A data-quality index was constructed by country, indicating the zones where expert verification was needed. With the help of the online tool, our team reached out to colleagues and specialists of the regions or countries needing verification.

The quality of the geocoding improved step by step. Finally, the geocoding was refined manually for several particularly complex cases. Examples include neighbouring universities that had to be separated (for example Leuven / Louvain-la-Neuve in Belgium) and homonymous cities (the French cities named after patron saints, or Taoyuan in China and Taiwan).

After more than a year of work, with the help of geospatial analysts and cartographers working in fields such as sociology and the geography of sciences, we obtained a fine-tuned, high-resolution spatial database of scientific production over the last several decades.

The building of scientific agglomerations

This granularity is itself a source of problems when we attempt a comparative approach at the global level. The characteristics of postal addresses, the geographical variability of postal reference systems, and the great diversity of administrative geographical segmentation prevent any direct comparison between distinct “scientific localities”.

Our team addressed this problem by building spatially comparable geographical entities at the global level. Once all the articles, reviews, and letters have been extracted and geocoded, urban perimeters are delineated and used as elementary analysis units to measure scientific activity.

The method we used to build those entities has two steps:

First, the aggregation perimeters defined around the 500 top publishing localities were obtained using a semi-automatic procedure based on population density (highly fine-tuned raster data). Population density is one of the few global indicators with homogeneous resolution quality in every part of the planet.

Second, smaller publishing localities, which were not included within one of the dense urban areas, were grouped together if they were geographically close enough, with 40 kilometers as the criterion.
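The second step can be illustrated with a minimal sketch. The text does not specify how proximity was measured or how groups were formed, so this example assumes great-circle (haversine) distance and single-linkage grouping: two localities end up in the same group if a chain of neighbours, each within 40 kilometers of the next, connects them. The function names and the coordinates in the usage example are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def group_localities(coords, threshold_km=40.0):
    """Group localities by single linkage (an assumption, not the documented rule)."""
    names = list(coords)
    parent = {name: name for name in names}

    def find(x):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if haversine_km(coords[a], coords[b]) <= threshold_km:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical localities: A and B are about 13 km apart, C is far from both.
example = {"A": (48.0, 2.0), "B": (48.1, 2.1), "C": (43.0, 5.0)}
print(group_localities(example))
```

Single linkage can chain distant localities together through intermediates; a rule based on distance to a group centre would behave differently, and the source does not say which was used.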

As a result, the Parisian agglomeration includes, for instance, suburban localities such as Gif-sur-Yvette and Villejuif, as well as the Université de Versailles-Saint-Quentin-en-Yvelines. Once the scientific agglomerations had been delineated according to the location of scientific activity, a final step was required: selecting a counting method in order to study co-authorship data at the agglomeration level.

The counting method

Publication counts by urban area

Counting methods are an issue because most scientific publications have multiple authors, and the authors of an article may belong to different cities and to different countries. To compute the number of publications by city, we opted for a fractional count using the “Whole Normalized Counting” technique (Gauffriau et al., 2008).

The technique is “Whole” because it takes into account not the number of addresses but the number of distinct urban areas that contributed to the publication (the basic unit being the metropolitan area). It is “Normalized” because each urban area that contributed receives a fraction of the publication as credit, equal to one divided by the number of urban areas involved.

Fractional counting allows us to compute a range of data simultaneously while maintaining their relationship with the actual number of publications worldwide (since the sum of the fractions is the total number of articles published in the world). We consider this technique the most rigorous and the most respectful of the reality of science, since it allows us to reconstruct as accurately as possible the geographical form of scientific activity (spatial groupings by urban area, region, country, etc.).
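The counting rule described above can be sketched in a few lines. This is an illustrative example only: the function name and the input format (one list of urban-area names per publication, one name per author address) are assumptions, and the sample data are invented. Exact fractions are used so that the credits visibly sum to the number of publications.

```python
from collections import defaultdict
from fractions import Fraction

def whole_normalized_counts(publications):
    """Credit each distinct urban area with 1/k of a publication,
    where k is the number of distinct urban areas on that publication."""
    credit = defaultdict(Fraction)
    for areas in publications:
        distinct = set(areas)  # "Whole": repeated addresses in one area count once
        if not distinct:
            continue
        share = Fraction(1, len(distinct))  # "Normalized": shares sum to 1
        for area in distinct:
            credit[area] += share
    return dict(credit)

# Invented sample: three publications.
sample = [
    ["Paris", "Paris", "Lyon"],   # two distinct areas, 1/2 each
    ["Paris"],                    # one area, full credit
    ["Lyon", "Tokyo", "Boston"],  # three distinct areas, 1/3 each
]
counts = whole_normalized_counts(sample)
print(counts["Paris"])       # 3/2
print(sum(counts.values()))  # 3, the number of publications
```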

Links between urban areas

To quantify co-publications, that is, the number of links between two spatial units, we chose to apply a whole normalized count by urban area. If a publication is co-signed by n cities, each pair of cities is therefore assigned a value l equal to:

l = 1 / [n(n-1)/2] = 2 / [n(n-1)]

Thus, for each publication the values assigned to all pairs of cities sum to one, and the sum of all collaborative links equals the total number of co-authored articles. Specifying the method used to count publications and collaborative links is important because results differ depending on the method chosen, especially for cities (or links between cities) located in the middle of the hierarchy.
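As an illustration, the link weighting can be written as follows. A publication co-signed by n cities has n(n-1)/2 pairs of cities, so giving each pair 2/[n(n-1)] distributes exactly one unit per publication. The function name, input format, and sample data are assumptions made for this sketch.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

def link_weights(publications):
    """Assign each pair of distinct cities on a publication the weight 2/(n(n-1))."""
    links = defaultdict(Fraction)
    for areas in publications:
        distinct = sorted(set(areas))  # sorted so each pair has one canonical key
        n = len(distinct)
        if n < 2:
            continue  # a single-city publication creates no link
        weight = Fraction(2, n * (n - 1))
        for pair in combinations(distinct, 2):
            links[pair] += weight
    return dict(links)

# Invented sample: two co-authored publications and one single-city publication.
sample = [["Paris", "Lyon"], ["Paris", "Lyon", "Tokyo"], ["Paris"]]
links = link_weights(sample)
print(links[("Lyon", "Paris")])  # 4/3  (1 from the first paper, 1/3 from the second)
print(sum(links.values()))       # 2, the number of co-authored publications
```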

Time normalization

Before analyzing the data, a last methodological operation is required: time normalization. To smooth out small annual fluctuations in scientific activity, a moving average is computed over a span of three years. For example, the moving average of co-authored publications for the year 2007 is:

X₂₀₀₇ = (x₂₀₀₆+x₂₀₀₇+x₂₀₀₈)/3
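The same three-year centered average can be applied to a whole series. In the sketch below, the function name and the sample values are invented, and the treatment of the first and last years (skipped, because they lack a neighbour on one side) is an assumption: the text does not say how the ends of the series were handled.

```python
def centered_moving_average(yearly, window=3):
    """Centered moving average over `window` years (window is assumed to be odd).

    Years that lack a full window of neighbours are omitted (an assumption).
    """
    half = window // 2
    smoothed = {}
    for year in sorted(yearly):
        span = range(year - half, year + half + 1)
        if all(y in yearly for y in span):
            smoothed[year] = sum(yearly[y] for y in span) / window
    return smoothed

# Invented yearly counts for one city or link.
counts = {2006: 9, 2007: 12, 2008: 15, 2009: 18}
print(centered_moving_average(counts))  # {2007: 12.0, 2008: 15.0}
```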

Disciplines

Ten disciplinary fields are distinguished based on the categories designed by the French Observatoire des Sciences et Techniques:

Basic Biology
Medical Sciences
Applied Biology
Chemistry
Physical Sciences
Earth Sciences, Astronomy and Astrophysics
Engineering
Mathematics
Arts and Humanities
Social Sciences