Science is all in the eye of the beholder: keyword maps in Google Scholar Citations

Jose Luis Ortega

VICYT-CSIC, Madrid (Spain)

Isidro Aguillo

Cybermetrics Lab, CCHS-CSIC, Madrid (Spain)

Accepted in: Journal of the American Society for Information Science and Technology

Abstract

This paper introduces a keyword map of the labels used by the scientists registered in the Google Scholar Citations (GSC) database in December 2011. A total of 15,000 random queries were submitted to GSC to obtain a list of 26,682 registered users. From this list, a network graph of 6,660 labels was built and classified according to the Scopus Subject Area classes. Results display a detailed label map of the most used (>15 times) tags. The structural analysis shows that the core of the network is occupied by computer science-related disciplines, which account for the most used and shared labels. This core is surrounded by clusters of disciplines related or close to computing, such as Information Sciences, Mathematics or Bioinformatics. Classical areas such as Chemistry and Physics are marginalized in the graph. It is suggested that GSC could in the future be an accurate source for mapping Science because it is based on the labels that scientists themselves use to describe their own research activity.

Introduction

On July 20th 2011, Google launched Google Scholar Citations (GSC) as a response to Microsoft Academic Search (MAS) (Google Scholar blog, 2011a). The search engine created by Microsoft Research Asia in 2009 made it possible to suggest changes to a pre-defined personal profile for each researcher, which provided bibliometric indicators as well as the researcher’s list of publications and citations. However, MAS only covered computing-related fields until March 2011. Google’s product, built on the Google Scholar database, allows users to create and edit a personal profile directly, correctly assigning their own publications and deleting duplicates and other mistakes. In this way, it solved one of the most important limitations of Google Scholar (GS) since its launch in November 2004: the duplication and poor authority control of its bibliographic records.
The major novelty of these scientific search engines is that they are focused on the author, instead of the journal as in the traditional citation databases (Scopus and WoS), giving way to Science 2.0 functionalities and Web 2.0 tools such as online social networking (Waldrop, 2008). But the main advantage of GSC with respect to the other databases is that the profiles can be directly edited by the users, who define their research activity, institutional affiliation, professional position and contact information using natural, uncontrolled language. This editing freedom brings problems of ambiguity (e.g. web or technology), synonymy (e.g. computing and computer sciences) and spelling mistakes (e.g. filosophy for philosophy). In return, these labels express the actual preferences of the scientists in their own language and are not linked to a predefined classification scheme. In this sense, to map the GSC labels network is to visualize Science not according to the categories of one database, but following the view that researchers themselves hold of their scientific activity. A direct consequence of this freedom is the spontaneous emergence of self-organised inter- and multidisciplinary connections. These considerations justify the study of these relationships through a keyword map which shows the distribution of research interests and disciplinary links among the tags.

Related Research

Several attempts have been made to map large scientific databases in order to visually display Science and describe the main interdisciplinary relationships between research areas. Garfield’s historiography (Garfield et al., 1964) was perhaps the first proposal to map scientific trends, although these maps were built at the level of articles and over time. Small et al. (1985) were the first to map the Science Citation Index (SCI) through paper co-citation networks. They observed that Biomedicine constituted the core of the scientific literature in 1974, surrounded by the Chemistry, Physics and Mathematics clusters. More recently, Boyack, Klavans and Börner (2005) mapped the SCI and the Social Science Citation Index (SSCI) at the journal level. Using several co-citation measures and the VxOrd layout algorithm, they presented a spatial view of journal clusters where related disciplines could be observed. Moya-Anegón et al. (2007) visualized the scientific articles included in the Web of Science (WoS) according to the ISI categories. They managed to build a more schematic view of scientific disciplines using Pathfinder networks, with branches linking all the scientific categories in a hierarchical view. This representation also placed Biomedicine at the core of Science. In a comparative approach, Klavans and Boyack (2007) did not find significant differences between mapping the ISI and Scopus databases. They obtained a circular network graph in which the disciplines are linked to one another in a continuum and where all the categories are represented at the same level. Similar results were obtained by Rosvall and Bergstrom (2008) through a random-walk model of journal citations. Leydesdorff and Rafols (2009) also used the ISI categories to group journal citations between disciplines. Unlike the other proposals, they used factor analysis to detect the disciplinary structure of science. Their ISI categories citation network showed two opposed clusters: Biomedicine and Biosciences versus Physics and Applied Sciences.
However, these representations are based only on the relationships between scientific papers, journals or subject categories through citations or co-occurrences. Moreover, the majority of the articles obtained their data from a single data source: WoS/ISI Thomson. Surprisingly, the number of global maps of science obtained from other scientific data is very low. The few examples include patents used to visualize technological maps (Engelsman & Van Raan, 1994; Boyack and Klavans, 2008); web log files used to graph electronic journal clickstreams (Bollen & Van de Sompel, 2006; Bollen et al., 2009); web links used to group academic departments (Ortega and Aguillo, 2007); and semantic relationships between Wikipedia entries (Holloway et al., 2007).
Several papers have used Google Scholar as a data source because of its larger coverage, which includes not only the main peer-reviewed scientific journals and preprints from open access repositories, but also obscure or less formally published scientific documents such as popular science articles, conference presentations or academic materials. The majority of these studies analysed its citation coverage in relation to the main scientific citation databases, finding differences across disciplines and years (Bakkalbasi et al., 2006) and ascertaining the impact of some areas or publications beyond the classical citation databases (Meho and Yang, 2007). Other studies, from a web bibliometrics view, have analysed its citations in relation to web citations (Kousha & Thelwall, 2007) and have discussed its suitability for scientific assessment (Jacsó, 2008; Aguillo, 2012).

Objectives

This paper intends to build a keyword map of the labels used by the scientists registered in Google Scholar Citations in December 2011. The main objective of this work is to show and discuss the resulting map, describing its structural characteristics, identifying the main groups and comparing it to previous maps. Several research questions were formulated:

What are the characteristics of the early population registered in GSC, and which biases could they introduce in the resulting graph?
Are the GSC labels an accurate data source for building maps of Science?
How is Science represented in this map, and how does this compare with previous studies?

Methods

Data extraction

Several sequential processes were developed to extract the keyword information from GSC. In the first stage, 15,000 random queries were submitted to GSC to obtain a list of registered users. The terms of these queries were built by combining the 25 letters of the Latin alphabet in groups of three (i.e. aaa, aab, aac, and so on), with a depth of 50 pages. If each response page shows 10 users, this procedure would have allowed the extraction of up to 500 users per query and hence up to 7.5 million users. However, many of the queries returned no results or very few. Thus, the final list contained 26,682 retrieved profiles in December 2011.
To test the reliability of the sample and to estimate the total population of GSC we applied the Lincoln-Petersen formula (Seber, 2002). This equation is widely used in wildlife management and is based on the mark and recapture method. This counting method assumes that a high proportion of duplicated items is an indicator of the completeness of the sample. The more samples are tested, the more consistent the population estimate becomes. As we have only one sample, our estimations have to be considered with caution.
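The construction of the query terms and the theoretical ceiling of the crawl can be illustrated with a minimal sketch. It is not the routine actually used for the crawl: the function names are hypothetical, and the full 26-letter ASCII alphabet is an assumption, since the text reports 25 letters without naming the excluded one.

```python
from itertools import product
from string import ascii_lowercase


def build_queries(alphabet: str, length: int = 3) -> list[str]:
    """Return every string of `length` letters over `alphabet`, in lexicographic order."""
    return ["".join(letters) for letters in product(alphabet, repeat=length)]


def max_profiles(n_queries: int, pages_per_query: int, profiles_per_page: int) -> int:
    """Upper bound on retrievable profiles if every result page were full."""
    return n_queries * pages_per_query * profiles_per_page


# Assumption: 26 ASCII letters; the text mentions 25 without naming the missing one.
queries = build_queries(ascii_lowercase)
print(queries[:3])                   # ['aaa', 'aab', 'aac']
print(max_profiles(15_000, 50, 10))  # 7500000, the ceiling quoted in the text
```

The ceiling is only reached if every query fills all 50 pages, which explains why the real yield was far lower.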

$$N = \frac{M \times C}{R}$$

where N is the total population to estimate, M is the size of the sample (100,508), C is the number of unique profiles (26,682) and R is the number of duplicated profiles (73,826). By this reckoning, the total population of GSC in December 2011 would be roughly 36,325 users. Although this may seem a small population, it should be remembered that this service started in July 2011 and was opened to the public in November 2011 (Google Scholar blog, 2011b), so the service had only just started in December 2011. According to this estimation, our sample contains 73% of the population, so we can consider the sample highly representative. Moreover, we can assume that the retrieved authors are the most productive and cited ones, because the query results are ranked by citations and their probability of being retrieved is therefore higher. In the absence of an API or another way to extract a representative sample, we consider that the formulation of a large number of random queries is a good method to obtain a consistent sample.
Web Data Extraction 8.2 was used to implement the random queries and to obtain the list of authors. It is a crawler or bot that extracts URLs according to a pattern from a list of addresses. However, this software has limitations when extracting and tabulating the textual information embedded in a web page, such as the multiple labels used by each researcher to define his/her activity. A SQL routine was therefore implemented to extract data from each profile, such as labels, e-mail domains, affiliations, etc.
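As a check on the arithmetic, the estimate can be reproduced with a short sketch. It assumes the Lincoln-Petersen estimator in the form N = M × C / R, which is the form consistent with the figures reported above; the function name is hypothetical.

```python
def lincoln_petersen(sample_size: int, unique: int, duplicated: int) -> float:
    """Estimate the total population as N = M * C / R."""
    if duplicated <= 0:
        raise ValueError("at least one duplicated item is needed")
    return sample_size * unique / duplicated


M, C, R = 100_508, 26_682, 73_826  # values reported in the text
N = lincoln_petersen(M, C, R)
coverage = C / N                   # share of the estimated population in the sample

print(f"estimated population: {N:,.0f}")   # estimated population: 36,325
print(f"sample coverage: {coverage:.0%}")  # sample coverage: 73%
```

Note that M = C + R here, so the coverage C / N reduces to R / M: the proportion of duplicated retrievals is itself the estimate of how complete the sample is.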

Labels map

To build the labels map, a 2-mode network of researchers with labels in common (23,179) and shared labels (29,839) was first prepared using the Pajek 2.05 network software. This network shows the users linked to the labels which describe their research activity. Then, in Pajek, this network was transformed into a one-mode network, in which the keywords are related only to each other. In this new undirected network, one keyword is connected with another if a user selected both together in his/her profile. This criterion relates terms not according to a classification method of GSC, but following the personal and free criterion of each scientist. In this way, the map does not show Science according to GSC, but Science according to the researchers with a profile in GSC. Because these keywords are free and written in natural language, and are not selected from any controlled list, many of them are used by only one author and hence are not connected with any other keyword. This reduces the labels map to 6,660 terms.
To improve the visualization of the map, the node size represents the frequency with which these labels are used by the scientists. This makes it possible to identify which research topics are more common in GSC. Each label was also classified in a category to improve the visualization and to detect defined clusters, a step needed to group the research activity in this platform. Because GSC does not provide a classification scheme, we applied the Subject Area categories of Scopus (2008). The aim of this label classification is just to colour the terms according to a scheme and to observe how they are grouped together. This classification was chosen for its simplicity and low overlap, as it has only 30 categories grouped in four areas. The classification process was done manually by the Cybermetrics Lab team. The arc weight shows the number of times that both terms are used together, although this map only displays arcs with a weight greater than 2.
To visualize the graph we used Gephi 0.8 beta, a network analysis software package released under a GNU licence and developed by the Gephi Consortium (consortium.gephi.org). Finally, the layout algorithm used to energize the network was Force Atlas, a force-directed algorithm implemented in this software (Bastian, Heymann and Jacomy, 2009). This layout algorithm was chosen because it is specialized in large scale-free networks, converging more efficiently and speeding up the network configuration.
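The transformation from the 2-mode (user-label) network to the one-mode (label-label) network can be sketched as follows. This is an illustrative reimplementation in plain Python, not the Pajek procedure itself; the function names and the toy profiles are hypothetical.

```python
from collections import Counter
from itertools import combinations


def project_labels(profiles: dict[str, set[str]]) -> Counter:
    """Collapse a user-label (2-mode) network into label-label edge weights.

    Each undirected edge (a, b), stored with a < b, counts the users
    who list both labels in their profile.
    """
    weights: Counter = Counter()
    for labels in profiles.values():
        for a, b in combinations(sorted(labels), 2):
            weights[(a, b)] += 1
    return weights


def connected_labels(weights: Counter) -> set[str]:
    """Labels that co-occur with at least one other label."""
    return {label for edge in weights for label in edge}


# Hypothetical toy profiles, not taken from the GSC data.
profiles = {
    "user_1": {"machine_learning", "computer_vision"},
    "user_2": {"machine_learning", "computer_vision", "robotics"},
    "user_3": {"ecology"},  # a lone label produces no co-occurrence edge
}
weights = project_labels(profiles)
print(weights[("computer_vision", "machine_learning")])  # 2
print(sorted(connected_labels(weights)))
```

In this toy example the label ecology drops out of the projected network because it never co-occurs with another label, which is the mechanism that reduces the map to its connected terms.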

Network indicators

Several social network analysis indicators were calculated to extract the structural properties of the network. The network indicators were calculated with Gephi, except for the k-cores algorithm, which was calculated with Pajek 1.28.

Degree centrality (k): It measures the number of lines incident to a node (Freeman, 1979). A variation is the weighted degree centrality, which sums the weights of those lines, indicating the strength of each relationship. In this study, degree centrality identifies the labels most frequently used by the GSC scientists and the concepts that are central in an academic database, which allows a research community to be characterized thematically.
Freeman’s Betweenness centrality (CB): It is defined as the capacity of one node to help connect nodes that are not directly connected to each other (Freeman, 1980). This measurement enables us to detect “gateway” terms that connect remote groups because they are used in different knowledge domains. These labels may be considered as interdisciplinary indicators.
K-Cores: A k-core is a sub-network in which each node has a degree of at least k within that sub-network. The k-cores allow us to detect groups with a strong link density. In scale-free networks, the core with the highest degree is the central nucleus of the network and identifies the set of labels on which the network rests (Seidman, 1983).
Modularity: It is a clustering measurement that allows modules or clusters to be detected in a network. A module is a group of nodes with more edges among them than to nodes outside the group. Modularity finds close groups iteratively, comparing the fraction of links inside a cluster with the value expected in a random network (Newman, 2006).
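Three of these indicators can be illustrated on a toy graph with a minimal sketch in plain Python. It is not the Gephi or Pajek implementation: the function names are hypothetical, the graph is assumed to be unweighted with no duplicate edges or self-loops, and the modularity function only evaluates Q for a partition supplied by hand rather than searching for the best one. Betweenness centrality is omitted for brevity.

```python
from collections import defaultdict


def adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def k_core(edges, k):
    """Nodes of the maximal sub-network in which every node keeps degree >= k."""
    adj = {node: set(nbrs) for node, nbrs in adjacency(edges).items()}
    pruned = True
    while pruned:
        pruned = False
        # Repeatedly strip nodes whose remaining degree is below k.
        for node in [n for n, nbrs in adj.items() if len(nbrs) < k]:
            for nbr in adj.pop(node):
                if nbr in adj:
                    adj[nbr].discard(node)
            pruned = True
    return set(adj)


def modularity(edges, communities):
    """Q = sum over communities of (inside edges / m) - (degree sum / 2m)^2."""
    adj = adjacency(edges)
    m = len(edges)
    q = 0.0
    for community in communities:
        inside = sum(1 for u, v in edges if u in community and v in community)
        degree_sum = sum(len(adj[node]) for node in community)
        q += inside / m - (degree_sum / (2 * m)) ** 2
    return q


# Hypothetical toy graph: two triangles joined by the bridge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
degree = {node: len(nbrs) for node, nbrs in adjacency(edges).items()}

print(degree["c"])               # 3: two triangle edges plus the bridge
print(sorted(k_core(edges, 2)))  # every node survives in the 2-core
print(round(modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}]), 3))  # 0.357
```

In this toy graph the bridge endpoints c and d play the "gateway" role described for betweenness centrality, while the positive Q reflects that each triangle has more internal edges than a random network would give it.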

Results

Descriptive analysis

A preliminary descriptive analysis of the population was made to detail the main demographic characteristics of the profiles. A fast way to identify the workplace of each author is through the web domain of their e-mail address, although there are some restrictions on the use of this data. Four percent of the profiles do not include an e-mail address and some domains do not contain a country top level domain (e.g. .com, .edu). We have observed some cases in which the e-mail domain corresponds to a professional association (IEEE, ACM, etc.), to a commercial web mail service (hotmail, gmail, etc.) or to a personal domain (johnsmith.net). We have also detected cases in which the e-mail domain does not match the institutional address. In these cases, the profiles were classified manually from the institutional address in each profile. In general, these cases are less than 1% of the profiles. For this reason, we think that the use of e-mail domains is a good approach to the institutional classification of personal profiles, although it has to be used with caution and with a thorough manual inspection. Moreover, this practice is recommended when the affiliations are written by the authors in natural, uncontrolled language and the addresses present many variations.
Of the 26,682 profiles, the country and research institution were identified for 25,485 (95%). According to the country where they are working (Table 1), 39.1% came from the United States, 9.5% from the United Kingdom and 4.8% from Canada. It is interesting to highlight the number of Spanish users (4%), Spain being the first non-Anglo-Saxon country, and the presence of emerging countries such as Brazil (2.3%) and India (2.1%).
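The screening of e-mail domains described above could be approximated by a rule such as the following sketch. It is only an illustration, not the procedure actually applied: the function name, the short lookup tables and the "manual" fallback are assumptions.

```python
# Assumed, deliberately short lookup tables for illustration only.
GENERIC_TLDS = {"com", "edu", "net", "org"}     # carry no country information
WEBMAIL_DOMAINS = {"gmail.com", "hotmail.com"}  # commercial web mail services
COUNTRY_TLDS = {
    "es": "Spain", "uk": "United Kingdom", "ca": "Canada", "de": "Germany",
    "nl": "The Netherlands", "br": "Brazil", "in": "India", "au": "Australia",
}


def classify_domain(domain):
    """Return a country name, or 'manual' when the domain needs human inspection."""
    domain = (domain or "").strip().lower()
    if not domain or domain in WEBMAIL_DOMAINS:
        return "manual"
    tld = domain.rsplit(".", 1)[-1]
    if tld in GENERIC_TLDS:
        return "manual"
    return COUNTRY_TLDS.get(tld, "manual")


for example in ["csic.es", "ucl.ac.uk", "mit.edu", "gmail.com", ""]:
    print(repr(example), "->", classify_domain(example))
```

Anything the rule cannot resolve is routed to manual inspection, which mirrors the caution recommended in the text.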

Country	Profiles	%
The United States	9,974	39.1
The United Kingdom	2,424	9.5
Canada	1,219	4.8
Spain	1,013	4.0
Australia	998	3.9
Germany	830	3.3
The Netherlands	751	2.9
Brazil	584	2.3
India	541	2.1
Total	25,485	100

Table 1. Country distribution of the profiles registered in GSC in December 2011

Country	Institution	Profiles	%
The United States	Massachusetts Institute of Technology	250	0.99
The United States	Harvard University	238	0.94
The United States	Carnegie Mellon University	205	0.81
The United States	University of Michigan	181	0.71
The United Kingdom	University College London	169	0.67
The United States	University of Minnesota	157	0.62
The United States	Stanford University	155	0.61
The United States	University California, San Diego	153	0.60
The United States	University of California, Berkeley	152	0.60
Canada	University of British Columbia	142	0.56
Total		25,324	100

Table 2. Institutional distribution of the profiles registered in GSC in December 2011

Table 2 shows the research institutions with the highest number of profiles in GSC. As expected, the great majority come from universities (83%), followed by governmental research centers (5.8%). Among the first ten institutions, only two are not located in the United States: University College London (UK) and the University of British Columbia (Canada). The results are consistent with the university rankings, as the Massachusetts Institute of Technology and Harvard University take the first positions. However, small institutions such as Carnegie Mellon University and the University of Minnesota rank above prestigious universities such as Stanford University and Berkeley. This suggests that, in this early stage, the presence of members depends more on the way in which this new service was disseminated through social networks and on the willingness to participate in this platform than on other factors such as the number of scholars in each institution and their research productivity.
Unfortunately, descending to the department or faculty level is rather hard, because many of the affiliations do not include that additional information and, when they do, they present multiple denominations in free language that make it almost impossible to group them into an institutional chart. We have selected the most frequent labels and Scopus subject classes in Table 3 as a way to present the thematic distribution of the profiles.

Labels	Count	%	Scopus subject classes	labels	%
machine_learning	1,243	18.7	Computer Sciences	359	17.2
artificial_intelligence	998	15.0	Mathematics	138	6.6
computer_vision	989	14.8	Medicine	136	6.5
bioinformatics	757	11.4	Agricultural and Biological Sciences	133	6.4
software_engineering	464	7.0	Biochemistry, Genetics and Molecular Biology	131	6.3
data_mining	440	6.6	Environmental Science	127	6.1
human_computer_interaction	420	6.3	Earth and Planetary Sciences	87	4.2
computational_biology	393	5.9	Psychology	83	4.0
robotics	362	5.4	Neuroscience	83	4.0
computer_science	347	5.2	Multidisciplinary	83	4.0
Total	6,660	100	Total	2,089	100

Table 3. The most frequent labels and Scopus subject classes in GSC in December 2011

Table 3 shows that the ten most frequent labels are Computer Science-related terms such as machine learning (18.7%), artificial intelligence (15%) and computer vision (14.8%), which shows the strong presence of users interested in information science technologies. This is confirmed by the Scopus subject classes, which group the labels with a frequency higher than 15 times. They show that 17.2% of the labels belong to the Computer Science class, 6.6% to the Mathematics class and 6.5% to the Medicine class.

Structural analysis

Figure 1. Labels map. Simplified view of the terms with a frequency equal or higher than 15 (N=772; arcs=4,574)

Figure 1 shows a reduced map of the labels with a frequency equal to or higher than 15 times. This network shows scale-free properties because the degree centrality is distributed following a power law (γ=-6.835). Its clustering coefficient (C=.433) is considerably higher than that expected for a random network (C=.007), with a relatively short average path length (l=3.07). These results confirm that this labels network holds small-world properties (Watts & Strogatz, 1998) as well. This type of network is characterized by compact groups that are connected to each other through transversal links that cut across the network, reducing the distance. Thus, in the GSC labels network it is possible to identify several groups from the Scopus subject area categories. Clearly, a large group of computing-related labels in red can be observed; a second compact set on Biology and Biomedicine in green; a less dense group of Social Sciences and Economics terms in blue and purple; and other much smaller clusters such as Physics in pink or Information Sciences in grey.
It is also interesting to note that the central position is occupied by computing-related labels. The k-core algorithm detected a dense core of 28 labels (3.6%) with a degree centrality of 13 among them. Of the nodes included in that core, 69% were classified as Computer Science. So, the centre of gravity of this network rests on the central ‘red group’ of computer science labels. This is confirmed as well by the list of the ten labels with the highest degree centrality (Table 4). Except for the labels ecology and evolution, every term is related to computer sciences. These are the ten labels most used and shared by the users of GSC, so we can claim that a large proportion of the current GSC users are specialists in computer science and related disciplines.
The small-world properties of this network involve the existence of transversal links that connect these groups, reducing the distance between the nodes. Betweenness centrality makes it possible to identify the nodes that mediate the most between disconnected parts of the network (Table 4). These may be considered as conceptual “gateways” and/or as interdisciplinary labels that relate distant concepts from remote disciplines. For example, computer science education connects Computer Science with learning, speech processing with Linguistics and motor control with Engineering. It is interesting that these terms connect the central group of Computer Science with the rest of the peripheral disciplines, which corroborates the weight of computing in this network. However, several labels relate other disciplines, such as economics of education, health disparities and vertebrate palaeontology.

Label	Degree	Label	Betweenness Centrality
artificial_intelligence	266	computer_science_education	0.87
machine_learning	250	speech_processing	0.74
bioinformatics	193	process_mining	0.67
computer_vision	144	data_structures	0.62
software_engineering	135	image_retrieval	0.53
data_mining	126	economics_of_education	0.52
human_computer_interaction	118	motor_control	0.5
ecology	116	health_disparities	0.45
computational_biology	111	vertebrate_paleontology	0.44
evolution	109	clustering	0.43

Table 4. Ten labels with the highest Degree centrality and Betweenness centrality

Figure 2. Labels map. Simplified view with the main modules (N=772; arcs=4,574)

One of our objectives is to check the fit of a classification scheme (Scopus subject area categories) with the structural distribution of the labels. The modularity of the network was calculated (Q=.53) and it shows that there are clearly differentiated modules or clusters. Table 5 presents the modules with more than 1.5% of the nodes of the network. The nine clusters represent 98% of the total network. In general, the modules fit the prior classification: we can see the Biology and Biomedicine cluster in green, the Humanities and Social Science group in blue and the Information Science set in orange. However, modularity also revealed a few differences. The Computer Science labels are split into two groups: in red, Applied computing, with labels such as artificial intelligence, machine learning and computer vision; and in yellow, the software group, with labels such as software engineering, computer science or distributed systems. Conversely, some disciplines are merged into the same group, such as Neurosciences with Psychology, or Physics and Astronomy with Chemistry.

Cluster	Class	Colour	N	%
1	Biology and Biomedicine	Green	142	18.39
2	Humanities, Social Sciences and Economics	Blue	121	15.67
3	Computer Sciences, Software	Yellow	120	15.54
4	Computer Sciences, Applied computing	Red	103	13.34
5	Physics and Astronomy, and Chemistry	Purple	101	13.08
6	Information Science and Education	Orange	77	9.97
7	Neuroscience and Psychology	Light blue	63	8.16
8	Mathematics	Brown	17	2.20
9	Earth and Planetary Sciences	Cyan	14	1.81
TOTAL			772	100

Table 5. Principal clusters obtained from the modularity of the labels network

Discussion

These results provide an early view of the new GSC, when it had only just opened to the public in December 2011. The most interesting result is the strong presence of the Computer Sciences in this database. This discipline is the ground that supports the network, with the most used and shared labels and with the largest number of terms. Moreover, some of the clusters correspond to disciplines related or close to Computing, such as Information Sciences, Mathematics or Bioinformatics. In fact, according to the modularity, Applied Computing is located in the centre of the map and constitutes the densest core. The interdisciplinary nature of this field favours the establishment of ties with the rest of the disciplinary groups in the network. Thus, Applied Computing connects with Biology and Biomedicine through the bioinformatics and computational biology labels; with Social Sciences through social networks and social media; and with Neurosciences and Psychology through computational neuroscience. In contrast, the marginal presence of classical and important disciplines such as Physics or Chemistry is also remarkable. This network graph contrasts with the previous maps of science, in which Biology and Biomedicine form the nucleus of science (Small et al., 1985; Moya-Anegón et al., 2007; Rosvall & Bergstrom, 2008) or in which the presence of computing-related disciplines is less prominent (Klavans & Boyack, 2007; Leydesdorff & Rafols, 2009).
These differences could be related to the data sources employed, which can considerably affect the resulting map. The previous studies are based on bibliographic data from citation indexes, where the presence of the Biology and Biomedicine disciplines is predominant, along with classical disciplines such as Physics and Chemistry. These maps represent the formal Science that is communicated through traditional formats such as articles and proceedings papers. However, the map obtained in our study is not expressed through papers but through the labels that the scientists freely use to define their activity. This difference between sources is also observed in the patent maps (Engelsman & Van Raan, 1994; Boyack and Klavans, 2008), where Chemistry and Electronics dominate the network, and in the journal clickstream graphs (Bollen et al., 2009), in which the Humanities and Social Sciences gain prominence.
Nevertheless, the differences observed in our map may also be due to the free incorporation of scientist profiles, which introduces a bias in favor of scientists immersed in the Web and new information technologies. This would explain the strong presence of the computing and information sciences disciplines. Thus, GSC may be seen as a database colonized by scientists interested in new information technologies, science 2.0 and scientific networking. Another important bias of this map is the moment at which the data were harvested. In December 2011, GSC had just started, and the number of profiles, approximately 36,000, was rather low. In this first stage the database may have been settled by scientists close to computing and information technologies. We think that, as time goes by and the popularity of this service increases, the number of researchers from more distant disciplines will grow and a more evenly distributed map, with a stronger presence of the classical disciplines, will emerge. In the future, a new mapping of GSC is recommended to check the evolution of this network. If this is confirmed and GSC becomes a resource in which every researcher shares a profile, the view of Science it offers will be more accurate. Moreover, since the relationships between fields are established directly by the users, the map will provide a more immediate view of the emergence of new specialities and the rise of new interdisciplinary relationships.
The structural visualization of this network of labels or concepts also makes it possible to observe the conceptual environment of a label and to better understand the meaning of each one. For example, ambiguous terms such as technology were classified as Engineering, but the graph locates this label surrounded by Social Sciences labels. Other labels such as social psychology and medical imaging, classified as Psychology and Medicine respectively, are reassigned to the Social Sciences and Applied Computing environments. Hence, this structural presentation of concepts favours the correct classification of ambiguous or difficult terms, because their relationships are established by the researchers themselves, which we consider a very reliable criterion.

Conclusions

The analysed population shows a strong presence of profiles from English-speaking countries, mainly from the United States and the United Kingdom. They are primarily scholars from universities (83%), with American universities contributing most of the researchers. These results suggest that GSC first spread through the US academic environment, reaching certain universities more than others, and then to the rest of the world. We think that the diffusion of this new platform was influenced by the way in which it was publicized through social networks and Web 2.0 platforms. This favored the incorporation of researchers close to these information technologies, so that specific institutions (e.g. Carnegie Mellon University) and countries (e.g. Spain) emerged noticeably.
The resulting map has shown that building maps of science from GSC data is possible and that this data source offers a fresher and more immediate view of research activity, because it is directly based on the labels that scientists use to describe their research activity. Although this early data may show a distorted view, dominated by researchers who are active on online social networks and many of whom are related to information technologies, we think that GSC has great potential and suggests a new way to analyse and measure research activity from a wider point of view, including teaching and outreach activities.
We also conclude that the visual presentation of labels is a reliable way to present keyword maps that help to better understand the real meaning of each concept and describe the conceptual framework in which the different research disciplines operate. This helps in revising and updating classification schemes to reflect the new research dynamics observed in these GSC labels maps.

Acknowledgements

We would like to thank the anonymous referees for their important suggestions and Jennifer Carranza for her helpful recommendations on the English version.

References

Aguillo, I.F. (2012). Is Google Scholar useful for bibliometrics? A webometric analysis. Scientometrics, 91(2): 343-351

Bakkalbasi, N., Bauer, K., Glover, J. & Wang, L. (2006). Three options for citation tracking: Google Scholar, Scopus and Web of Science. Biomedical Digital Libraries, 3(7) http://www.bio-diglib.com/content/3/1/7

Bastian, M., Heymann, S. & Jacomy, M. (2009). Gephi: An Open Source Software for Exploring and Manipulating Networks. In Proceedings of the Third International ICWSM Conference, San José, USA. AAAI Press

Bollen, J. & Van de Sompel, H. (2006). Mapping the structure of science through usage. Scientometrics, 69(2): 227-258

Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L., Chute, R., Rodriguez, M. A. & Balakireva, L. (2009). Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE, 4(3): e4803

Boyack, K. W. & Klavans, R. (2008). Measuring science-technology interaction using rare inventor-author names. Journal of Informetrics, 2(3): 173-182

Boyack, K. W., Klavans, R. & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3): 351-374

Engelsman, E. C. & van Raan, A. F. J. (1994). A patent-based cartography of technology. Research Policy, 23(1): 1-26

Freeman, L. C. (1979). Centrality in networks: I. conceptual clarification. Social Networks, 1: 215-239.

Freeman, L. C. (1980). The gatekeeper, pair-dependency, and structural centrality. Quality and Quantity, 14: 585-592

Garfield E., Sher I. H. & Torpie R. J. (1964). The Use of Citation Data in Writing the History of Science. Philadelphia: Institute for Scientific Information.

Google Scholar blog (2011a). Google Scholar Citations http://googlescholar.blogspot.com/2011/07/google-scholar-citations.html

Google Scholar blog (2011b). Google Scholar Citations open to all http://googlescholar.blogspot.com/2011/11/google-scholar-citations-open-to-all.html

Holloway, T., Božičević, M. & Börner, K. (2007). Analyzing and Visualizing the Semantic Coverage of Wikipedia and Its Authors. Complexity, 12(3): 30-40

Klavans, R. & Boyack, K. W. (2007). Is there a Convergent Structure of Science? A Comparison of Maps using the ISI and Scopus Databases. 11th International Conference of the International Society for Scientometrics and Informetrics meeting in Madrid, Spain, June 2007

Kousha, K. & Thelwall, M. (2007). Google Scholar citations and Google Web-URL citations: A multi-discipline exploratory analysis. Journal of the American Society for Information Science and Technology, 58(7): 1055-1065

Jacsó, P. (2008). Google Scholar revisited. Online Information Review, 32(1): 102-114

Leydesdorff, L. & Rafols, I. (2009). A global map of science based on the ISI subject categories. Journal of the American Society for Information Science and Technology, 60(2): 348-362

Meho, L. I. & Yang, K. (2007). Impact of data sources on citation counts and rankings of LIS faculty: Web of Science versus Scopus and Google Scholar. Journal of the American Society for Information Science and Technology, 58(13): 2105-2125

Moya-Anegón, F., Vargas-Quesada, B., Chinchilla-Rodríguez, Z., Corera-Alvarez, E. & Herrero-Solana, V. (2007). Visualizing the Marrow of Science. Journal of the American Society for Information Science and Technology, 58(14): 2167-2179.

Newman, M. E. J. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23): 8577-8582

Ortega, J. L. & Aguillo, I. (2007). Interdisciplinary relationships in the Spanish academic web space: A Webometric study through networks visualization. Cybermetrics, 11(1): paper 4 http://www.cybermetrics.info/articles/v11i1p4.html

Rosvall, M. & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4): 1118-1123

Scopus (2008). Subject Area Categories. http://help.scopus.com/robo/projects/schelp/h_subject_categories.htm

Seber, G. A. F. (2002). The Estimation of Animal Abundance and Related Parameters. Caldwell, New Jersey: Blackburn Press.

Seidman, S. B. (1983). Network structure and minimum degree. Social Networks, 5(3): 269–287

Small, H., Sweeney, E. & Greenlee, E. (1985). Clustering the science citation index using co-citations. II. Mapping science. Scientometrics, 8(5-6): 321-340

Waldrop, M. M. (2008). Science 2.0. Scientific American, 298: 68-73

Watts, D. J. & Strogatz, S. H. (1998). Collective dynamics of 'small-world' networks. Nature, 393: 440-442