Science is all in the eye of the beholder: keyword maps in Google Scholar Citations
Jose Luis Ortega, VICYT-CSIC, Madrid (Spain)
Isidro Aguillo, Cybermetrics Lab, CCHS-CSIC, Madrid (Spain)
Accepted in: Journal of the American Society for Information Science and Technology
Abstract
This paper introduces a keyword map of the labels used by the scientists registered in the Google Scholar Citations (GSC) database as of December 2011. 15,000 random queries were submitted to GSC to obtain a list of 26,682 registered users. From this list, a network graph of 6,660 labels was built and classified according to the Scopus Subject Area classes. The results display a detailed map of the most used (>15 times) labels. The structural analysis shows that the core of the network is occupied by computer science-related disciplines, which account for the most used and shared labels. This core is surrounded by clusters of disciplines related or close to computing, such as Information Science, Mathematics or Bioinformatics. Classical areas such as Chemistry and Physics sit at the margins of the graph. It is suggested that GSC could in the future become an accurate source for mapping science, because it is based on the labels that scientists themselves use to describe their own research activity.
Introduction
On July 20th, 2011, Google launched Google Scholar Citations (GSC) as a response to Microsoft Academic Search (MAS) (Google Scholar blog, 2011a). The search engine created by Microsoft Research Asia in 2009 included the possibility of suggesting changes to a pre-defined personal profile for each researcher, providing bibliometric indicators as well as lists of publications and citations. However, MAS covered only computing-related fields until March 2011. Google's product, built on the Google Scholar database, allows users to create and edit a personal profile directly, correctly assigning their own publications and deleting duplicates and other mistakes. In this way, it solved one of the most important limitations since the launch of GS in November 2004: the duplication and poor authority control of its bibliographic records.
The major novelty of these scientific search engines is that they are focused on the author instead of the journal, as in the traditional citation databases (Scopus and WoS), giving way to Science 2.0 functionalities and Web 2.0 tools such as online social networking (Waldrop, 2008). The main advantage of GSC with respect to the other databases is that the profiles can be edited directly by the users, who define their research activity, institutional affiliation, professional position and contact information in natural, uncontrolled language. This editing freedom presents problems of ambiguity (e.g. web or technology), synonymy (e.g. computing and computer sciences) and spelling mistakes (e.g. filosophy for philosophy). However, its main advantage is that these labels express the actual preferences of the scientists in their own language and are not tied to a predefined classification scheme. In this sense, to map the GSC label network is to visualize science not according to the categories of one database, but according to the view that researchers themselves hold of their scientific activity. A direct consequence of this freedom is the spontaneous emergence of self-organised inter- and multidisciplinary connections. These considerations justify the study of these relationships through a keyword map showing the distribution of research interests and the disciplinary links among the tags.
Related Research
Several attempts have been made to map large scientific databases in order to visually display science and describe the main interdisciplinary relationships between research areas. Garfield's historiography (Garfield et al., 1964) was perhaps the first proposal to map scientific trends, although those maps were built at the article level and over time. Small et al. (1985) were the first to map the Science Citation Index (SCI) through paper co-citation networks. They observed that Biomedicine constituted the core of the scientific literature in 1974, surrounded by Chemistry, Physics and Mathematics clusters. More recently, Boyack, Klavans and Börner (2005) also mapped the SCI and the Social Science Citation Index (SSCI) at the journal level. Using several co-citation measures and the VxOrd layout algorithm, they presented a spatial view of journal clusters in which related disciplines could be observed. Moya-Anegón et al. (2007) visualized the scientific articles included in the Web of Science (WoS) according to the ISI categories. They built a more schematic view of scientific disciplines using Pathfinder networks, with branches linking all the scientific categories in a hierarchical view. This representation also placed Biomedicine at the core of science. In a comparative approach, Klavans and Boyack (2007) did not find significant differences between mapping the ISI and Scopus databases. They obtained a circular network graph in which the disciplines are linked to one another in a continuum and in which all the categories are represented at the same level. Similar results were obtained by Rosvall and Bergstrom (2008) through a random-walk model of journal citations. Leydesdorff and Rafols (2009) also used the ISI categories to group journal citations between disciplines. Unlike the other proposals, they used factor analysis to detect the disciplinary structure of science. The ISI category citation network showed two facing clusters: Biomedicine and Biosciences on one side, and Physics and the Applied Sciences on the other. However, these representations are based only on the relationships of scientific papers, journals or subject categories through citations or occurrences. Moreover, the majority of these articles obtained their data from a single source: Thomson's WoS/ISI. Surprisingly, the number of global maps of science obtained from other scientific data is very low; the few examples include the use of patents to visualize technological maps (Engelsman & Van Raan, 1994; Boyack and Klavans, 2008), web log files to graph electronic journal clickstreams (Bollen & Van de Sompel, 2006; Bollen et al., 2009), web links to group academic departments (Ortega and Aguillo, 2007), and semantic relationships between Wikipedia entries (Holloway et al., 2007).
Several papers have used Google Scholar as a data source because of its larger coverage not only of the main scientific peer-reviewed journals and preprints from open access repositories, but also of obscure or less formally published scientific documents such as popularization papers, conference presentations or academic materials. The majority of these studies analyzed its citation coverage in relation to the main scientific citation databases, finding differences across disciplines and years (Bakkalbasi et al., 2006) and ascertaining that the impact of some areas or publications goes beyond the classical citation databases (Meho and Yang, 2007). Other studies, from a web bibliometrics view, have analysed its citations in relation to web citations (Kousha & Thelwall, 2007) and have discussed its suitability for scientific assessment (Jacsó, 2008; Aguillo, 2012).
Objectives
This paper intends to build a keyword map of the labels used by the scientists registered in Google Scholar Citations in December 2011. The main objective of this work is to show and discuss the resulting map, describing its structural characteristics, to identify the main groups, and to compare it to the previous maps. Several research questions were formulated:
- What are the characteristics of the early population registered in GSC, and which biases could it introduce into the resulting graph?
- Are the GSC labels an accurate data source for building maps of science?
- How is science represented in this map, and how does it relate to the previous studies?
Methods
Data extraction
Several sequential processes were developed to extract the keyword information from GSC. In the first stage, 15,000 random queries were submitted to GSC to obtain a list of registered users. The terms of these queries were built by combining 25 letters of the Latin alphabet in groups of three (i.e. aaa, aab, aac, and so on), with a depth of 50 result pages. Since each response page shows 10 users, this procedure would have allowed the extraction of 500 users per query and therefore 7.5 million users in total. However, many of the queries returned no results or very few. Thus, the total list of retrieved profiles was 26,682 in December 2011.
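As an illustration of this procedure, the following minimal Python sketch generates the three-letter query terms and the paginated result URLs. The URL pattern and the excluded letter are assumptions made for the example; they are not the exact configuration used in the study.

from itertools import product
from string import ascii_lowercase

letters = ascii_lowercase.replace("w", "")   # 25 letters; which letter was excluded is not stated in the paper
query_terms = ["".join(t) for t in product(letters, repeat=3)]   # "aaa", "aab", ... (the study used 15,000 random queries from this space)

PAGES_PER_QUERY = 50   # depth of 50 result pages
USERS_PER_PAGE = 10    # 10 profiles shown per response page

def profile_list_urls(term):
    # Hypothetical author-search URL pattern; the real extraction was done with the crawler described below
    for page in range(PAGES_PER_QUERY):
        yield ("https://scholar.google.com/citations?view_op=search_authors"
               f"&mauthors={term}&astart={page * USERS_PER_PAGE}")

# Theoretical upper bound of retrievable users (~7.5 million); in practice many queries returned few or no results
print(len(query_terms) * PAGES_PER_QUERY * USERS_PER_PAGE)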
To test the reliability of the sample and to estimate the total population of GSC, we applied the Lincoln-Petersen formula (Seber, 2002). This equation is widely used in wildlife management and is based on the mark-and-recapture method. This counting method assumes that a high proportion of duplicated items is an indicator of the completeness of the sample; the more samples are tested, the more consistent the population estimate becomes. As we have only one sample, our estimates have to be considered with caution.
N = (M × C) / R

where N is the total population to be estimated, M is the size of the sample (100,508), C is the number of unique profiles (26,682) and R is the number of duplicated profiles (73,826). By this reckoning, the total population of GSC in December 2011 would roughly be 36,325 users.
Although it may seem a small population, it must be remembered that this service started in July 2011 and was opened to the public in November 2011 (Google Scholar blog, 2011b), so it is reasonable to think that the service had barely started by December 2011. According to this estimate, our sample contains 73% of the population, and we can therefore consider the sample highly representative. Moreover, we can consider that the retrieved authors are the most productive and cited ones, because the query results are ranked by citations, so their probability of being retrieved is higher. In view of the absence of an API or another way to extract a representative sample, we believe that the formulation of a large number of random queries is a good method for obtaining a consistent sample.
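For reference, the estimate reported above can be reproduced with the following minimal Python sketch; the function name is ours, and the values of M, C and R are taken directly from the text.

def lincoln_petersen(M, C, R):
    # Lincoln-Petersen estimator as applied in the text: N = (M x C) / R
    return M * C / R

M = 100_508   # size of the raw sample of retrieved profiles (with duplicates)
C = 26_682    # unique profiles
R = 73_826    # duplicated profiles
print(round(lincoln_petersen(M, C, R)))   # ~36,325 estimated GSC users in December 2011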
Web Data Extraction 8.2 was used to implement the random queries and to obtain the list of authors. It is a crawler or bot that extracts URLs matching a pattern from a list of addresses. However, this software has limitations when extracting and tabulating the textual information embedded in the web pages, such as the multiple labels used by each researcher to define his or her activity. Therefore, a SQL routine was implemented to extract data from each profile, such as labels, e-mail domains, affiliations, etc.
Labels map
To build the labels map, a 2-mode network of researchers with labels in common (23,179) and shared labels (29,839) was first prepared using the Pajek 2.05 network software. This network links the users to the labels that describe their research activity. Then, from Pajek, this network was transformed into a one-mode network in which the keywords are related only among themselves. In this new undirected network, one keyword is connected with another if a user selected both together in his or her profile. This criterion relates terms not according to a classification method of GSC, but according to the personal and free criterion of each scientist. In this way, the map does not show science according to GSC, but science according to the researchers with a profile in GSC. Because these keywords are free, written in natural language and not selected from any controlled list, many of them are used by only one author and hence are not connected to any other keyword. This reduces the labels map to 6,660 terms.
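The projection from the 2-mode researcher-label network to the one-mode label network can be sketched as follows. The study used Pajek; this illustrative example uses networkx and invented toy profiles.

import networkx as nx
from networkx.algorithms import bipartite

profiles = {                                   # invented toy data: researcher -> labels
    "user_a": ["machine_learning", "computer_vision"],
    "user_b": ["machine_learning", "bioinformatics"],
    "user_c": ["bioinformatics", "ecology"],
}

B = nx.Graph()                                 # 2-mode (bipartite) network of users and labels
B.add_nodes_from(profiles, bipartite=0)
labels = {lab for labs in profiles.values() for lab in labs}
B.add_nodes_from(labels, bipartite=1)
B.add_edges_from((user, lab) for user, labs in profiles.items() for lab in labs)

# One-mode projection: two labels are linked when at least one user chose both;
# the edge weight counts how many users did so (the final map keeps arcs with weight > 2).
G = bipartite.weighted_projected_graph(B, labels)
print(list(G.edges(data=True)))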
To improve the visualization of the map, node size represents the frequency with which each label is used by the scientists. This makes it possible to identify which research topics are more common in GSC. Each label was also classified into a category to improve the visualization and to detect well-defined clusters, a step needed to group the research activity in this platform. Because GSC does not provide a classification scheme, we applied the Subject Area categories of Scopus (2008). The aim of this label classification is simply to colour the terms according to a scheme and to observe how they group together. This classification was chosen because it consists of only 30 categories grouped into four areas, with little overlap and great simplicity. The classification was done manually by the Cybermetrics Lab team. The arc weight shows the number of times that two terms are used together, although the map only displays arcs with a weight greater than 2.
To visualize the graph, a network analysis software package was used: Gephi 0.8 beta, released under a GNU licence and developed by the Gephi Consortium (consortium.gephi.org). Finally, the layout algorithm used to energize the network was a force-directed algorithm (Force Atlas) implemented in this software (Bastian, Heymann and Jacomy, 2009). This layout algorithm was used because it is specialized in large scale-free networks, converging more efficiently and speeding up the network configuration.
Network indicators
Several social network analysis indicators were calculated to extract the structural properties of the network. The network indicators were computed with Gephi, except for the k-cores algorithm, which was calculated with Pajek 1.28.
- Degree centrality (k): it measures the number of lines incident to a node (Freeman, 1979). A variation is the weighted degree centrality, which sums the weight of each line, indicating the strength of each relationship. In this study, the degree centrality describes the labels most frequently used by the GSC scientists and shows which concepts are central in an academic database, which allows the thematic characterization of a research community.
- Freeman's betweenness centrality (CB): it is defined as the capacity of one node to help connect nodes that are not directly connected to each other (Freeman, 1980). This measurement enables us to detect "gateway" terms that connect remote groups because they are used in different knowledge domains. These labels may be considered indicators of interdisciplinarity.
- K-cores: a k-core is a sub-network in which each node has degree k within that sub-network (Seidman, 1983). The k-cores allow us to detect groups with a strong link density. In scale-free networks, the core with the highest degree is the central nucleus of the network, revealing the set of labels on which the network rests.
- Modularity: it is a clustering measurement that detects modules or clusters in a network. A module is a group of nodes with more edges among themselves than towards the rest of the network. Modularity finds cohesive groups iteratively, comparing the fraction of links inside a cluster with the value expected in a random network (Newman, 2006). A minimal sketch of how these indicators can be computed is given after this list.
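The sketch below illustrates how these indicators can be obtained with networkx and the python-louvain package, used here as stand-ins for Gephi and Pajek, on a small invented label graph; it is not the authors' actual pipeline.

import networkx as nx
import community as community_louvain   # python-louvain package, a stand-in for Gephi's modularity routine

G = nx.Graph()                          # invented toy label graph; in the study, the 6,660-label network
G.add_weighted_edges_from([
    ("machine_learning", "computer_vision", 3),
    ("machine_learning", "bioinformatics", 2),
    ("bioinformatics", "computational_biology", 4),
    ("ecology", "evolution", 5),
    ("evolution", "bioinformatics", 1),
])

degree = dict(G.degree())                              # degree centrality (k)
weighted_degree = dict(G.degree(weight="weight"))      # weighted degree (strength of relationships)
betweenness = nx.betweenness_centrality(G)             # Freeman's betweenness centrality (CB)
core_number = nx.core_number(G)                        # k-core index of each node
partition = community_louvain.best_partition(G)        # modules (Louvain optimisation of Newman's Q)
Q = community_louvain.modularity(partition, G)         # modularity of that partition

top_degree = sorted(degree, key=degree.get, reverse=True)[:10]   # cf. the Table 4 ranking
print(top_degree, round(Q, 2))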
Results
Descriptive analysis
A preliminary descriptive analysis of the population was made to detail the main demographic characteristics of the profiles. A fast way to identify the workplace of each author is through the web domain of their e-mail addresses, although there are some restrictions on the use of this data. Four percent of the profiles do not include an e-mail address, and some domains do not contain a country top-level domain (i.e., .com, .edu). We have observed some cases in which the e-mail domain corresponds to a professional association (IEEE, ACM, etc.), to a commercial web mail service (hotmail, gmail, etc.) or to a personal domain (johnsmith.net). We have also detected cases in which the e-mail domain does not match the institutional address. In these cases, a manual classification of the profiles was made from the institutional address in each profile. In general, these cases represent less than 1% of the profiles. We therefore consider the use of e-mail domains a good approach to the institutional classification of personal profiles, although it has to be used with caution and with a deep manual inspection. This is especially advisable when the affiliations are written by the authors in natural, uncontrolled language and the addresses present many variations.
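The country lookup from e-mail domains described above can be sketched as follows; the list of generic top-level domains is illustrative, and in the study the ambiguous cases were resolved by manual inspection.

GENERIC_TLDS = {"com", "org", "net", "edu", "info"}   # domains without country information

def country_from_email(email):
    # Return the country-code top-level domain of an e-mail address, or None if it is generic or missing
    if not email or "@" not in email:
        return None
    tld = email.rsplit(".", 1)[-1].lower()
    return None if tld in GENERIC_TLDS else tld

print(country_from_email("researcher@ucl.ac.uk"))   # 'uk'
print(country_from_email("someone@gmail.com"))      # None -> needs manual inspection
print(country_from_email(None))                     # None -> profile without an e-mail address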
Of the 26,682 profiles, the country and research institution were identified for 25,485 (95%) of them. According to the country where the users are working (Table 1), 39.1% came from the United States, 9.5% from the United Kingdom and 4.8% from Canada. It is interesting to highlight the number of Spanish users (4%), the first non-Anglo-Saxon country, and the presence of emerging countries such as Brazil (2.3%) and India (2.1%).
Country | Profiles | %
The United States | 9,974 | 39.1
The United Kingdom | 2,424 | 9.5
Canada | 1,219 | 4.8
Spain | 1,013 | 4.0
Australia | 998 | 3.9
Germany | 830 | 3.3
The Netherlands | 751 | 2.9
Brazil | 584 | 2.3
India | 541 | 2.1
Total | 25,485 | 100

Table 1. Country distribution of the profiles registered in GSC in December 2011
Country | Institution | Profiles | %
The United States | Massachusetts Institute of Technology | 250 | 0.99
The United States | Harvard University | 238 | 0.94
The United States | Carnegie Mellon University | 205 | 0.81
The United States | University of Michigan | 181 | 0.71
The United Kingdom | University College London | 169 | 0.67
The United States | University of Minnesota | 157 | 0.62
The United States | Stanford University | 155 | 0.61
The United States | University of California, San Diego | 153 | 0.60
The United States | University of California, Berkeley | 152 | 0.60
Canada | University of British Columbia | 142 | 0.56
Total | | 25,324 | 100

Table 2. Institutional distribution of the profiles registered in GSC in December 2011
Table 2 shows the research institutions with the highest number of profiles in GSC. Unsurprisingly, the great majority came from universities (83%), followed by governmental research centers (5.8%). Among the first ten institutions, only two are not located in the United States: University College London (UK) and the University of British Columbia (Canada). The results are consistent with the university rankings, as the Massachusetts Institute of Technology and Harvard University take up the first positions. However, smaller institutions such as Carnegie Mellon University and the University of Minnesota emerge above prestigious universities such as Stanford University and Berkeley. This suggests that, in these first moments, the presence of members depended more on the way in which this new service was disseminated through the social networks and on the willingness to participate in the platform than on other factors such as the number of scholars in each institution or their research productivity.
Unfortunately, descending to the department or faculty level is rather hard because many of the affiliations do not include that additional information and, when they do, they present multiple denominations in free language, which makes it almost impossible to group them into an institutional chart. We have selected the most frequent labels and Scopus subject classes in Table 3 as a way to present the thematic distribution of the profiles.
Labels | Count | % | Scopus subject classes | Labels | %
machine_learning | 1,243 | 18.7 | Computer Sciences | 359 | 17.2
artificial_intelligence | 998 | 15.0 | Mathematics | 138 | 6.6
computer_vision | 989 | 14.8 | Medicine | 136 | 6.5
bioinformatics | 757 | 11.4 | Agricultural and Biological Sciences | 133 | 6.4
software_engineering | 464 | 7.0 | Biochemistry, Genetics and Molecular Biology | 131 | 6.3
data_mining | 440 | 6.6 | Environmental Science | 127 | 6.1
human_computer_interaction | 420 | 6.3 | Earth and Planetary Sciences | 87 | 4.2
computational_biology | 393 | 5.9 | Psychology | 83 | 4.0
robotics | 362 | 5.4 | Neuroscience | 83 | 4.0
computer_science | 347 | 5.2 | Multidisciplinary | 83 | 4.0
Total | 6,660 | 100 | Total | 2,089 | 100

Table 3. The most frequent labels and Scopus subject classes in GSC in December 2011
Table 3 shows that the ten most frequent labels are Computer Science-related terms such as machine learning (18.7%), artificial intelligence (15%) and computer vision (14.8%), which evidences the strong presence of users interested in information science and technology. This is confirmed by the Scopus subject classes, which group the labels with a frequency higher than 15: 17.2% of the labels belong to the Computer Science class, 6.6% to Mathematics and 6.5% to Medicine.
Structural analysis
Figure 1. Labels map. Simplified view of the terms with a frequency equal to or higher than 15 (N=772; arcs=4,574)
Figure 1 shows a reduced map of the labels with a frequency equal to or higher than 15. This network shows scale-free properties because its degree centrality is distributed following a power law (γ=-6.835); its clustering coefficient is much higher (C=.433) than expected for a random network (C=.007), and its average path length is relatively short (l=3.07). These results confirm that this labels network also holds small-world properties (Watts & Strogatz, 1998). This type of network is characterized by compact groups connected through transversal links that cut across the network and reduce the distances. Thus, in the GSC labels network it is possible to identify several groups from the Scopus subject area categories. A large group of computing-related labels in red can clearly be observed; a second, compact set on Biology and Biomedicine in green; a less dense group of Social Sciences and Economics terms in blue and purple; and other, much smaller clusters such as Physics in pink or Information Science in grey.
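The small-world diagnostics reported here (clustering coefficient against a random baseline, and average path length) can be computed as in the following minimal sketch; the placeholder graph merely stands in for the reduced label network.

import networkx as nx

G = nx.barabasi_albert_graph(772, 3, seed=1)     # placeholder scale-free graph of comparable size (N=772)

C = nx.average_clustering(G)                     # clustering coefficient
L = nx.average_shortest_path_length(G)           # average path length (the graph must be connected)

n, m = G.number_of_nodes(), G.number_of_edges()
R = nx.gnm_random_graph(n, m, seed=1)            # random baseline with the same number of nodes and edges
C_rand = nx.average_clustering(R)

print(round(C, 3), round(C_rand, 3), round(L, 2))   # small-world signature: C >> C_rand and short L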
It is also interesting to note that the central position is occupied by computing-related labels. The k-core algorithm detected a dense core of 28 labels (3.6%) with a degree centrality of 13 among them. 69% of the nodes included in that core were classified as Computer Science. Thus, the centre of gravity of this network rests on the central 'red group' of computer science labels. This is also confirmed by the list of the ten labels with the highest degree centrality (Table 4). Except for the labels ecology and evolution, every term is related to computer science. These are the ten labels most used and shared by the users of GSC, so we can claim that a large proportion of the current GSC users are specialists in computer science and related disciplines.
The small-world properties of this network imply the existence of transversal links that connect these groups, reducing the distance between nodes. Betweenness centrality makes it possible to identify the nodes that mediate the most between disconnected parts of the network (Table 4). These may be considered conceptual "gateways" and/or interdisciplinary labels that relate distant concepts from remote disciplines. For example, computer science education connects Computer Science with learning, speech processing with Linguistics, and motor control with Engineering. It is interesting that these terms connect the central group of Computer Science with the rest of the peripheral disciplines, which corroborates the weight of computing in this network. However, several labels relate other disciplines, such as economics of education, health disparities and vertebrate palaeontology.
Label | Degree | Label | Betweenness Centrality
artificial_intelligence | 266 | computer_science_education | 0.87
machine_learning | 250 | speech_processing | 0.74
bioinformatics | 193 | process_mining | 0.67
computer_vision | 144 | data_structures | 0.62
software_engineering | 135 | image_retrieval | 0.53
data_mining | 126 | economics_of_education | 0.52
human_computer_interaction | 118 | motor_control | 0.5
ecology | 116 | health_disparities | 0.45
computational_biology | 111 | vertebrate_paleontology | 0.44
evolution | 109 | clustering | 0.43

Table 4. Ten labels with the highest Degree centrality and Betweenness centrality
Figure 2. Labels map. Simplified view with the main modules (N=772; arcs=4,574)
One of our objectives is to check the fit of a classification scheme (the Scopus subject area categories) with the structural distribution of the labels. The modularity of the network was calculated (Q=.53), showing that there are clearly differentiated modules or clusters. Table 5 presents the modules with more than 1.5% of the nodes of the network. These nine clusters represent 98% of the total network. In general, the modules fit the prior classification: we can appreciate the Biology and Biomedicine cluster in green, the Humanities and Social Sciences group in blue and the Information Science set in orange. However, modularity also revealed a few differences. The Computer Science labels are split into two groups: in red, applied computing labels such as artificial intelligence, machine learning and computer vision; and in yellow, a software group with labels such as software engineering, computer science or distributed systems. Conversely, some disciplines are merged into the same group, such as Neuroscience with Psychology or Physics and Astronomy with Chemistry.
Cluster | Class | Colour | N | %
1 | Biology and Biomedicine | Green | 142 | 18.39
2 | Humanities, Social Sciences and Economics | Blue | 121 | 15.67
3 | Computer Sciences, Software | Yellow | 120 | 15.54
4 | Computer Sciences, Applied computing | Red | 103 | 13.34
5 | Physics and Astronomy, and Chemistry | Purple | 101 | 13.08
6 | Information Science and Education | Orange | 77 | 9.97
7 | Neuroscience and Psychology | Light blue | 63 | 8.16
8 | Mathematics | Brown | 17 | 2.20
9 | Earth and Planetary Sciences | Cyan | 14 | 1.81
TOTAL | | | 772 | 100
Table 5. Principal clusters obtained from the modularity of the labels network
Discussion
These results provide an early view of the new GSC, shortly after it opened to the public in December 2011. The most interesting result is the strong presence of the Computer Sciences in this database. This discipline is the ground that supports the network, with the most used and shared labels and the largest number of terms. Moreover, some of the clusters correspond to disciplines related or close to computing, such as Information Science, Mathematics or Bioinformatics. In fact, according to the modularity, Applied Computing is located in the centre of the map and constitutes the densest core. The interdisciplinary nature of this field favours the establishment of ties with the rest of the disciplinary groups in the network. Thus, Applied Computing connects with Biology and Biomedicine through the bioinformatics and computational biology labels; with the Social Sciences through social networks and social media; and with Neuroscience and Psychology through computational neuroscience. In contrast, the marginal presence of classical and important disciplines such as Physics or Chemistry is also remarkable. This network graph contrasts with the previous maps of science, in which Biology and Biomedicine form the nucleus of science (Small et al., 1985; Moya-Anegón et al., 2007; Rosvall & Bergstrom, 2008) or in which the presence of computing-related disciplines is less prominent (Klavans & Boyack, 2007; Leydesdorff & Rafols, 2009).
These differences could be related to the data sources employed, which can considerably affect the resulting map. The previous studies are based on bibliographic data from citation indexes, where the Biology and Biomedicine disciplines are predominant, along with classical disciplines such as Physics and Chemistry. Those maps represent formal science as communicated through traditional formats such as articles and proceedings papers. However, the map obtained in our study is not expressed through papers but through the labels that the scientists freely use to define their activity. This difference between sources is also observed in the patent maps (Engelsman & Van Raan, 1994; Boyack and Klavans, 2008), where Chemistry and Electronics dominate the network, and in the journal clickstream graphs (Bollen et al., 2009), in which the Humanities and Social Sciences gain prominence.
Nevertheless, the observed differences in our map may also be due to the free enrolment of scientist profiles, which introduces a bias in favor of scientists immersed in the Web and the new information technologies. This explains the strong presence of the computing and information science disciplines. Thus, GSC may be seen as a database colonized by scientists interested in new information technologies, science 2.0 and scientific networking. Another important bias of this map is the moment at which the data were harvested. In December 2011, GSC had just started, and the number of profiles, approximately 36,000, was rather low. At this first stage the database could be populated mainly by scientists close to computing and the information technologies. We think that, as time goes by and the popularity of this service increases, the number of researchers from other, more distant disciplines will grow, and a more evenly distributed map with a stronger presence of the classical disciplines will emerge. In the future, new mappings of GSC are recommended to check the evolution of this network. If this is confirmed and GSC becomes a resource in which every researcher shares a profile, the view of science it offers will be more accurate, since the relationships between fields are established directly by the users, which will produce a more immediate view of the emergence of new specialities and of new interdisciplinary relationships.
The structural visualization of this network of labels or concepts also makes it possible to observe the conceptual environment of a label and to better understand the meaning of each one. For example, ambiguous terms such as technology were classified as Engineering, but the graph locates this label surrounded by Social Sciences labels. Other labels such as social psychology and medical imaging, classified as Psychology and Medicine respectively, are reassigned to the Social Sciences and Applied Computing environments. Hence, this structural presentation of concepts favours the correct classification of ambiguous or difficult terms, because the relationships are established by the researchers themselves, which we consider a very reliable criterion.
Conclusions
The analyzed population shows a strong presence of profiles from English-speaking countries, mainly from the United States and the United Kingdom. They are primarily scholars from universities (83%), with the American universities contributing most of the researchers. These results suggest that GSC first spread through the US academic environment, affecting some universities more than others, and then to the rest of the world. We think that the diffusion of this new platform was influenced by the way in which it was publicized through social networks and web 2.0 platforms. This favored the incorporation of researchers close to these information technologies, and specific institutions (i.e. Carnegie Mellon University) and countries (i.e. Spain) emerged noticeably.
The resulting map has shown that building maps of science from GSC data is possible and that this data source offers a fresher and more immediate view of research activity, because it is based directly on the labels that scientists use to describe their research. Although these early data may show a view skewed towards researchers active in online social networks, many of them related to the information technologies, we think that GSC has great potential and suggests a new way to analyze and measure research activity from a wider point of view, including teaching and popularization activities.
We also conclude that the visual presentation of labels is a reliable way to present keyword maps that help to better understand the real meaning of each concept and describe the conceptual framework in which the different research disciplines operate. This helps the revision and updating of classification schemes in line with the new research dynamics observed in these GSC label maps.
Acknowledgements
We would like to thank the anonymous referees for their important suggestions and Jennifer Carranza for her helpful recommendations on the English version.
References
Aguillo, I. F. (2012). Is Google Scholar useful for bibliometrics? A webometric analysis. Scientometrics, 91(2): 343-351.
Bakkalbasi, N., Bauer, K., Glover, J. & Wang, L. (2006). Three options for citation tracking: Google Scholar, Scopus and Web of Science. Biomedical Digital Libraries, 3(7). http://www.bio-diglib.com/content/3/1/7
Bastian, M., Heymann, S. & Jacomy, M. (2009). Gephi: An Open Source Software for Exploring and Manipulating Networks. In Proceedings of the Third International ICWSM Conference, San José, USA, 2009. AAAI Press.
Bollen, J. & Van de Sompel, H. (2006). Mapping the structure of science through usage. Scientometrics, 69(2): 227-258.
Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L., Chute, R., Rodriguez, M. A. & Balakireva, L. (2009). Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE, 4(3): e4803.
Boyack, K. W. & Klavans, R. (2008). Measuring science-technology interaction using rare inventor-author names. Journal of Informetrics, 2(3): 173-182.
Boyack, K. W., Klavans, R. & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3): 351-374.
Engelsman, E. C. & van Raan, A. F. J. (1994). A patent-based cartography of technology. Research Policy, 23(1): 1-26.
Freeman, L. C. (1979). Centrality in networks: I. Conceptual clarification. Social Networks, 1: 215-239.
Freeman, L. C. (1980). The gatekeeper, pair-dependency, and structural centrality. Quality and Quantity, 14: 585-592.
Garfield, E., Sher, I. H. & Torpie, R. J. (1964). The Use of Citation Data in Writing the History of Science. Philadelphia: Institute for Scientific Information.
Google Scholar blog (2011a). Google Scholar Citations. http://googlescholar.blogspot.com/2011/07/google-scholar-citations.html
Google Scholar blog (2011b). Google Scholar Citations open to all. http://googlescholar.blogspot.com/2011/11/google-scholar-citations-open-to-all.html
Holloway, T., Božičević, M. & Börner, K. (2007). Analyzing and Visualizing the Semantic Coverage of Wikipedia and Its Authors. Complexity, 12(3): 30-40.
Jacsó, P. (2008). Google Scholar revisited. Online Information Review, 32(1): 102-114.
Klavans, R. & Boyack, K. W. (2007). Is there a Convergent Structure of Science? A Comparison of Maps using the ISI and Scopus Databases. 11th International Conference of the International Society for Scientometrics and Informetrics, Madrid, Spain, June 2007.
Kousha, K. & Thelwall, M. (2007). Google Scholar citations and Google Web-URL citations: A multi-discipline exploratory analysis. Journal of the American Society for Information Science and Technology, 58(7): 1055-1065.
Leydesdorff, L. & Rafols, I. (2009). A global map of science based on the ISI subject categories. Journal of the American Society for Information Science and Technology, 60(2): 348-362.
Meho, L. I. & Yang, K. (2007). Impact of data sources on citation counts and rankings of LIS faculty: Web of Science versus Scopus and Google Scholar. Journal of the American Society for Information Science and Technology, 58(13): 2105-2125.
Moya-Anegón, F., Vargas-Quesada, B., Chinchilla-Rodríguez, Z., Corera-Alvarez, E. & Herrero-Solana, V. (2007). Visualizing the Marrow of Science. Journal of the American Society for Information Science and Technology, 58(14): 2167-2179.
Newman, M. E. J. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23): 8577-8582.
Ortega, J. L. & Aguillo, I. (2007). Interdisciplinary relationships in the Spanish academic web space: A Webometric study through networks visualization. Cybermetrics, 11(1): paper 4. http://www.cybermetrics.info/articles/v11i1p4.html
Rosvall, M. & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4): 1118-1123.
Scopus (2008). Subject Area Categories. http://help.scopus.com/robo/projects/schelp/h_subject_categories.htm
Seber, G. A. F. (2002). The Estimation of Animal Abundance and Related Parameters. Caldwell, New Jersey: Blackburn Press.
Seidman, S. B. (1983). Network structure and minimum degree. Social Networks, 5(3): 269-287.
Small, H., Sweeney, E. & Greenlee, E. (1985). Clustering the Science Citation Index using co-citations. II. Mapping science. Scientometrics, 8(5-6): 321-340.
Waldrop, M. M. (2008). Science 2.0. Scientific American, 298: 68-73.
Watts, D. J. & Strogatz, S. H. (1998). Collective dynamics of 'small-world' networks. Nature, 393: 440-442.