
Studies in Variation, Contacts and Change in English

Volume 22 – Data Visualization in Corpus Linguistics: Critical Reflections and Future Directions


Network graphs to the rescue, or how to visualise distributions and networks in corpora and language

Jukka Tyrkkö
Linnaeus University

Please cite this article as:

Tyrkkö, Jukka. 2023. “Network graphs to the rescue, or how to visualise distributions and networks in corpora and language”. Data visualization in corpus linguistics: Critical reflections and future directions (Studies in Variation, Contacts and Change in English 22), ed. by Ole Schützler and Lukas Sönning. Helsinki: VARIENG. https://urn.fi/URN:NBN:fi:varieng:series-22-5

BibTeX format

@incollection{Tyrkkö2023,
  author = "Jukka Tyrkkö",
  title = "Network graphs to the rescue, or how to visualise distributions and networks in corpora and language",
  series = "Studies in Variation, Contacts and Change in English",
  year = 2023,
  booktitle = "Data visualization in corpus linguistics: Critical reflections and future directions",
  number = "22",
  editor = "Sönning, Lukas and Schützler, Ole",
  publisher = "VARIENG",
  address = "Helsinki",
  url = "https://urn.fi/URN:NBN:fi:varieng:series-22-5",
  issn = "1797-4453"
}

Abstract

Whether we are talking about the structural properties of corpora or the dispersion of linguistic phenomena within corpora or the language system, corpus-based analyses almost invariably deal with complex and relational data. However, due in part to the design of online and standalone corpus tools, corpora are often treated exclusively from the so-called bag-of-words perspective. As corpora have increased in size, it has become increasingly difficult to understand their structures and metadata, and associations between linguistic features are almost impossible to grasp from tabular data and test statistics alone. In recent years, data visualisation methods developed in the natural sciences have become a part of the digital humanist’s toolkit for gaining insights into complex data, understanding their structure, for identifying outliers and noteworthy categories, and for communicating findings in a way that readers and audiences will remember. In this paper, I will focus on network visualisations, which are highly suited for both exploring and presenting complex linked data. The main tool discussed is Cytoscape, an open-access network visualisation tool widely used in bioinformatics and supported by a large user-base. I will present a series of case studies of how network visualisations can assist in both exploratory analysis and descriptive visualisation of corpora and linguistic data. First, I will demonstrate their utility for exploring the structures of corpora and their metadata. Second, I will show how visualisation methods can clarify collocate relationships and how such visualisations can be designed to represent association strengths in a way that does not mislead the reader. And third, I use network graphing to explore the distribution of multilingual elements across millions of tweets, combining linguistic data and metadata to produce an overview that could not be represented otherwise.

 [1]

As a field of scientific inquiry that relies increasingly on large and complex datasets, quantitative evidence, and statistical analyses, corpus linguistics has been surprisingly resistant to exploring new data visualisation methods. With notable exceptions, the majority of linguistic studies continue to rely on well-established visualisations that are mostly available in spreadsheet applications, while at the same time elsewhere, data scientists, digital humanities scholars, and business analysts embrace new methods of displaying and exploring complex datasets.

In this paper, I will discuss the usefulness of one particular type of data visualisation: network graphs. The concept of a network is perhaps most familiar to linguists from sociolinguistics, where one of the core questions, studied since at least the 1960s, concerns the impact of social networks, or so-called “webs of ties”, on language variation and change (see, e.g., Barnes 1972, Milroy 1987, Sairio 2009, Laitinen & Lundberg 2021). However, when we stop to consider the data we regularly use and the results we get from commonly used analytical methods, it soon becomes clear that quite a lot of corpus linguistic research is concerned with examining associations between linguistic features as well as between linguistic features and extralinguistic metadata. Of particular interest in this article is exploratory data analysis (see, e.g., Tukey 1977, Kucher & Kerren 2015, Schneider et al. 2017), which refers to the broad range of practices that use statistical methods and visualisation techniques to summarise complex and (increasingly) very large datasets for insights into their structure, distributional properties and trends across time (see also Winters 2017).

The objective of this paper is to provide a basic introduction to the concepts of network analysis and a sufficient number of examples to allow the reader to explore the usefulness of network visualisations in their own work. Accordingly, the emphasis will be on the use of network graphs for descriptive and exploratory data visualisation, and all the examples given will be drawn from linguistic datasets. In each case, only rudimentary information will be provided on the details of the underlying linguistic analyses.

All the practical examples in this paper are written primarily with the tools-based corpus linguist in mind. The network analyses and visualisations are carried out in the freely available visualisation tool Cytoscape (see Shannon et al. 2003), and any data preparation and corpus queries that preceded the network analyses could be done in standalone corpus tools and spreadsheet applications. The decision to rely on Cytoscape rather than using coding-based environments like R, Python or Javascript stems from the desire to make the learning curve as gentle as possible. For those interested in exploring other tools and network analytical coding, a short overview of available tools is given in Section 3.3.

 

A network is a structure that represents entities and the links and interactions between them. The general concept of network can be applied to a wide variety of different fields from the hard sciences to the humanities. Any set of discrete actors or entities that can be envisioned to have internal links can be approached as a network: a family tree is a network of genealogical relations, all the devices connected to the internet make up a network, and information systems such as library indexes can be conceptualised as networks of keywords drawn from controlled vocabularies (see Golub et al. 2020). In linguistics, network models can be used for studying the words or phrases found within corpora; the distribution of features within an entire language system, such as word nets, semantic nets and hierarchical thesauri; and interactions between metadata and language, such as sociolinguistic networks of influence. The networks inherent to collocations and shared keywords can also be visualised as conceptual maps (see, e.g., Kopaczyk & Tyrkkö 2017 and Taavitsainen et al. 2019). In short, network modelling allows us to build mathematical representations of the interrelations between entities, which can be visualised as graphs and which can be analysed using specialised mathematical concepts developed specifically for network analysis. Although we will discuss some of these mathematical models briefly, the main objective of this article is to introduce simple but powerful methods of visualising networks using specialised network graphing software.

Network visualisations, then, are data graphing techniques that allow us to represent visually associations between discrete items. Network graphs are widely used and often considered fundamentally important in various scientific fields, because they allow researchers to gain an understanding of sometimes very complex structures and to explore the data in ways that are intuitive, easy to grasp and likely to be remembered. To take an example, let us consider the student body of a large school. Each student has a number of other students that they like and others that they dislike, information which we could collect by a questionnaire or by following the students’ social media behaviour. While it would be possible to compile information about the students’ networks in a list or a spreadsheet, the relationships of even a relatively small number of students would soon become impossible to fully comprehend, and the addition of just a couple of additional variables, such as gender or ethnicity, would complicate matters exponentially. By contrast, network-analytical methods can be used to quickly identify clusters of individuals (cliques), the central and peripheral individuals in each cluster, those that link different clusters together, whether students of a particular gender or ethnicity tend to cluster together, which groups are the most diverse or homogenous, and so on. Furthermore, network visualisation methods can make this type of complex data instantly comprehensible by relying on the considerable extent to which the human brain is primed to process visual information. Thanks to their intuitive nature, network graphs are also an effective way of communicating complex structural information to non-expert audiences, though proper care needs to be taken not to oversimplify or, conversely, not to overcomplicate matters.

 

Generally speaking, all network models rely on two basic concepts: nodes (or vertices) that represent entities, actors or discrete units of some kind, and edges (or links) that represent the paths or relationships between the nodes. [2] Importantly, neither the nodes nor the edges need to be of only one type. For example, a network model representing social relations could include nodes that represent individual humans and nodes that represent some other type of entity, such as a company or a school. The edges between nodes can be undirectional (sometimes: undirected) or directional (directed), with the latter being further broken down into uni-directional and bi-directional. When an edge is undirectional, it means that the dynamic relationship between the two nodes is symmetrical: A is linked to B as B is linked to A. Depending on the context, this might mean that money, information or linguistic influences can travel from A to B and from B to A. By contrast, a directional edge has an orientation: A has an effect or interaction toward B but B does not have the same effect or interaction toward A. For example, if we wanted to represent kinship between two individuals and used membership of a family as the criterion for an interactional tie, a mother and son might be represented as having an undirectional edge relation: A is kin to B, and B is kin to A. If, on the other hand, we wanted the network to represent generational relations, we might decide that the edges are always drawn uni-directionally from parent to child: A is the parent of B, but B is not the parent of A.
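The difference between undirectional and directional edges can be illustrated with a minimal sketch in Python. The node labels and the helper name `build_adjacency` are assumptions made purely for illustration; they do not correspond to any particular network tool.

```python
from collections import defaultdict

def build_adjacency(edges, directed):
    """Return a dict mapping each node to the set of nodes it links to."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        # Touch the target so that it is registered as a node even if it
        # has no outgoing edges of its own.
        adjacency[target]
        if not directed:
            # An undirectional edge is symmetrical: B is linked to A as well.
            adjacency[target].add(source)
    return dict(adjacency)

# Kinship: a mother (A) and a son (B) are kin to each other.
kinship = build_adjacency([("A", "B")], directed=False)

# Generational relation: the edge runs from parent (A) to child (B) only.
parenthood = build_adjacency([("A", "B")], directed=True)

print(kinship)
print(parenthood)
```

In the undirectional case both nodes list each other as neighbours, whereas in the directional case only A has an outgoing link.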

The starting point of building a network model is that two lists are compiled. [3] A node table includes all the nodes in the network along with relevant descriptive information about the nodes, which is given in the form of variables. These variables can be categorical, such as gender or ethnicity, or continuous, such as age, height, price or frequency. These variables can be represented visually in a network graph by assigning them to specific visual features: for example, the nodes could be coloured according to gender and their size could reflect the person’s age. Not all of the descriptive variables need to be used in the final visualisation, and one of the essential benefits of network graphs is that it is usually possible to try different variables in different roles, which can provide informative and sometimes surprising views for exploring complex datasets. Similarly to the node table, an edge table is used to represent the ties or interactions between individual nodes. Like nodes, edges can include additional descriptive data. Quantitative variables can be used in network visualisations in the role of edge weight, which is a conceptual representation of the strength or intensity of the link between the two nodes, while qualitative variables are more typically represented by different colours or types of lines (solid, dashed, etc.). [4] In sociological and sociolinguistic network analyses, for example, the intensity of the relationship between two individuals is often calculated based on some type of accumulative points system, which might take into account factors such as kinship, co-habiting, frequency of interaction, and so on, between two individuals (see, e.g., Sairio 2009). The intensity of the two individuals’ relationship may be used as an edge weight which, depending on the mathematical model used, can result in the distance between the two nodes being shorter or longer.

Figure 1. Sample network, node table and edge table.


Once the node table and the edge table have been prepared (see Figure 1), a network analysis and visualisation tool such as Cytoscape or Gephi can be used to calculate network statistics and to produce network graphs. A network visualisation or network graph is a visual representation of the structure of a network as well as of the properties of the nodes and edges in it. It is important to note that the same network of nodes and edges can typically be organised in an almost infinite number of different shapes. There is no single correct way of graphing a given network, and for that reason network graphs are best considered heuristically derived visual aids for data representation and exploration, rather than definitive research results in their own right. The key concept in network graphing is readability, which refers to the extent to which a given graph is comprehensible to human readers. Overlapping nodes and node labels, crossing edges, and an overly complex variety of different markers and marker styles (shapes, sizes, colours, fonts, etc.) can all inhibit readability even if the information conveyed is factually correct. Similarly, the shape and arrangement of the layout (see below) can either improve or inhibit readability. Over the decades, scholars within the explicit field of data visualisation have developed a keen understanding of what makes visualisations effective (see, e.g., Shneiderman et al. 2012).

Depending on the needs of the researcher and the nature of the data, the nodes and/or edges can be labelled or not, and the colours, thicknesses, shapes and sizes of nodes and edges can be assigned to represent variables in the underlying data. Directional edges are often denoted by arrows. Additionally, statistical parameters specific to network analysis can be used in the visualisation. Most network-analytical tools can calculate network parameters such as centrality and betweenness, which serve as quantitative representations of how central a given node is in the network, and various types of clustering coefficients for identifying closely interconnected groups of nodes.

A typical network visualisation tool includes dozens of different layout algorithms, or mathematical models designed for calculating the positions of the nodes, as well as various visual tools that allow the researcher to manually and semi-manually manipulate the graph. Although most visualisation tools provide perhaps a few dozen common layout algorithms, there are literally hundreds of data visualisation methods to choose from (see, e.g., di Battista et al. 1999, Kerren et al. 2017).

Examples of the visualisation tools include commands to avoid overlapping labels, to expand or shrink the space occupied by the visualisation (or a part of it) in order to improve readability, and so on. The tools can also perform other functions such as deleting duplicate edges and self-loops, though it is important to be careful not to perform such operations without first knowing how and why they occurred in the first place, and whether or not deleting them makes sense.

 

As mentioned in the previous section (2.1), network analysis involves the use of network parameters while layout algorithms are used for drawing graphs. Proceeding from the premise that any structure of nodes connected by edges can be understood as a network, these networks can be graphed and examined visually, but when we want to quantify the properties of entire networks, subnetworks, or individual nodes, we often need to calculate network parameters. [5] The simplest descriptive statistics about a network would be the number of nodes and the number of edges, and given two or more networks, or parts of the same network, these simple counts could be used to determine the sizes of the networks in question; in a diachronic research design, the size differences between networks at different points on a timeline could be compared. From there, we can proceed to calculating additional statistics such as the mean number of edges per node, the number of edges leading in and out of individual nodes (if the network is directional), the distance between two specific nodes or types of nodes, the mean distance between any two nodes, and so on. These metrics help define the properties of networks in terms that allow comparison, and they can also be used as additional information when the network is visualised.
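As a rough illustration of these simple descriptive statistics, the following Python sketch counts nodes and edges and tallies the edges leading in and out of each node. The edge list is invented for the example, and the mean is computed here simply as the number of edges divided by the number of nodes, which is one possible reading of "edges per node".

```python
from collections import Counter

# Hypothetical directed edge list: each pair is (source, target).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]

# Every unique value that occurs as a source or a target is a node.
nodes = sorted({node for edge in edges for node in edge})
n_nodes = len(nodes)
n_edges = len(edges)

# In a directed network each edge leaves one node and enters another.
out_degree = Counter(source for source, _ in edges)
in_degree = Counter(target for _, target in edges)

# Assumption: "mean number of edges per node" taken as edges / nodes.
mean_edges_per_node = n_edges / n_nodes

print(n_nodes, n_edges, mean_edges_per_node)
for node in nodes:
    print(node, "in:", in_degree[node], "out:", out_degree[node])
```

Counts such as these can be compared across networks or across time slices of the same network, and the per-node values can later be used as visual variables.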

So how might such statistics be used in practice? For example, it might be useful to be able to identify individuals that are the most well-connected when it comes to followers on social media, not only in terms of the number of direct personal followers but in terms of having followers who themselves have the most followers that are not shared by the persons they follow. Although it would be theoretically possible to run analyses like this manually, even a slightly larger network would make this almost impossible in practice. With network analysis tools, not only are the calculations straightforward, but their results can easily be integrated into the network visualisations which are, as discussed earlier, much easier for humans to comprehend than giant data tables. If, for example, we have a directed network and we let the size of each node reflect the number of edges leading into it — the edges, let us say, representing social media followers — it will become apparent at a glance which out of possibly thousands of individuals are the most well-followed.

The drawing of network graphs involves the use of layout algorithms, which are mathematical models that determine where to place the nodes and how to draw the edges between them (see, e.g., di Battista et al. 1999). Conceptually speaking, the simplest layouts are probably grid layouts, hierarchical layouts and circular layouts. These are usually minimally informative in terms of interactions between the nodes but allow us to see each node and, if the visual appearance of the nodes is used for representing metadata, may provide a useful impression of the proportional distribution of the data. Some variations of these layouts, such as the Attribute Circle layout and Group Attributes layout in Cytoscape, take into account specific attributes and thus improve the usefulness of the models.

More complex and thus interesting are layouts that take into account edge weights when arranging the nodes in the two-dimensional space. These algorithms, which in Cytoscape include the Edge-weighted Spring-Embedded layout, the Prefuse Force Directed layout and the Compound Spring Embedded layout, are all similar in principle in that they all make use of the so-called force-directed paradigm. In simple terms, the algorithms determine the distances between the nodes on the basis of forces, which are sometimes conceptualised as gravity, sometimes as springs or coils that are under tension, and so on. If the strength or intensity of interaction between the nodes can be quantified, a force-directed layout is usually the most informative. Thus, for example, the calculated strengths of association between collocate words can be used to determine the spatial distance between them. A variety of different algorithms are available for making these calculations, with the different algorithms putting emphasis on different aspects of the network structure. Depending on the complexity of the algorithm, there are also differences when it comes to the speed at which the layout can be calculated, especially when the number of nodes and edges increases. [6]
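To make the force-directed paradigm more concrete, the following is a deliberately simplified spring model written in Python. It is not the algorithm used by any of the Cytoscape layouts named above. The function name `spring_layout`, the parameter values, and the choice to make the rest length of each spring the inverse of the edge weight are all assumptions made for the sake of the illustration.

```python
import math

def spring_layout(nodes, weighted_edges, iterations=500, step=0.1, repulsion=0.01):
    """Toy force-directed layout: all nodes repel, edges act as springs."""
    # Deterministic starting positions: nodes evenly spaced on a unit circle.
    pos = {
        node: [math.cos(2 * math.pi * i / len(nodes)),
               math.sin(2 * math.pi * i / len(nodes))]
        for i, node in enumerate(nodes)
    }
    for _ in range(iterations):
        force = {node: [0.0, 0.0] for node in nodes}
        # Repulsion: every pair of nodes pushes each other apart.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                push = repulsion / dist ** 2
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Attraction: each edge is a spring whose rest length (1 / weight)
        # becomes shorter as the edge weight grows (illustrative assumption).
        for a, b, weight in weighted_edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            pull = dist - 1.0 / weight  # positive when the spring is stretched
            force[a][0] += pull * dx / dist
            force[a][1] += pull * dy / dist
            force[b][0] -= pull * dx / dist
            force[b][1] -= pull * dy / dist
        # Move every node a small step in the direction of its net force.
        for node in nodes:
            pos[node][0] += step * force[node][0]
            pos[node][1] += step * force[node][1]
    return pos

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical collocation network: A-B is a much stronger association than A-C.
layout = spring_layout(["A", "B", "C"], [("A", "B", 5.0), ("A", "C", 1.0)])
d_ab = distance(layout["A"], layout["B"])
d_ac = distance(layout["A"], layout["C"])
print(d_ab < d_ac)
```

In this toy model, the more strongly associated pair settles closer together than the weakly associated pair, which is the basic intuition behind using association strengths as edge weights.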

On the basis of the previous, more general discussion, let us consider the conceptual usefulness of networks specifically for linguistic research. In linguistics, our datasets, such as the texts in stratified corpora, can be understood and visualised as networks. The informants of sociolinguistic studies, such as the members of a group of bloggers or the correspondents in a seventeenth-century chain of letter-writers, can be approached as networks. And finally, the fact that linguistic phenomena have measurable tendencies of co-occurrence, sometimes called attraction and repulsion, means that their relationships can be conceived of as networks. For example, as will be demonstrated in Sections 3.2 and 3.3, collocations and keyword chains can be represented effectively as networks of associated words.

 

Cytoscape is a freely available open-source network analysis and visualisation tool continually developed under the auspices of the Cytoscape Consortium. [7] The tool is available for PCs, Macs and for computers running Linux. Despite the rich set of features, Cytoscape has an easy-to-use graphical user interface that puts network graphics within the reach of anyone regardless of previous experience. Using Cytoscape involves no programming. Originally developed for network analysis in the bio-medical fields, Cytoscape has numerous functions and uses that are not directly useful for linguists and which will not be discussed in this introductory article.

 

A network graph in Cytoscape is based on three tables: a node table, an edge table, and a network table (see Section 2.1 above). The first two are supplied by the user, the third one is created by Cytoscape when a new network is built. The node table includes each and every node that will be a part of the visualisation. Each node must carry two identifiers, a shared name and a name, in addition to which additional node attributes can be provided. The shared name is the “true name” of the node, that is, a nominal value that identifies that node in the network and that is used in the edge table; the name attribute is the nominal value used as the label of that node. While the shared name and name are by default the same, it is possible to use alternative node names for labelling purposes. In addition to the identifiers, various node attributes (or variables) can be assigned by the user. For example, in a linguistic visualisation each node might be a unique word and a variable “frequency” could be added to represent the frequency of that word in the corpus, another variable called “pos” could be used to record the appropriate part-of-speech, and so on. The edge table lists all the interactions between nodes, each as a single row. Similarly to the node table, additional variables can be included in the edge table; for example, collocation strength could be added as a variable that can subsequently be used as an edge weight.

However, although the node and edge tables are two separate tables, the simplest way to build a network in Cytoscape is to import a network from a single file and to let Cytoscape split the data into the node table and the edge table. Cytoscape can read common file formats such as comma- and tab-separated files (csv and tsv), as well as spreadsheets such as MS Excel files. The input file should include labelled columns, with each row representing a pair of nodes. The columns can be given whatever names the user wants, but it may be best for clarity to name one column ‘source’ and another column ‘target’. In addition to these two mandatory columns, additional columns can be used to provide additional information about the source node, the target node, or the edge.

To import the table into Cytoscape, go to File->Import->Network from file (Figure 2). Select the file, and Cytoscape will present you with an import screen where the role of each column needs to be specified; Cytoscape will guess a role for each variable, but this is by no means reliable.

You must assign one column as the source node and one column as the target node. These two columns will be used to create both the edge table and the node table. During the importation process, additional columns can be assigned as source node attributes, target node attributes, or edge variables (Figure 3).
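The logic of splitting a single input file into a node table and an edge table can be sketched in Python as follows. This only mimics the principle and is not how Cytoscape itself is implemented. The words, the column name `mi_score` and its values are invented for the illustration.

```python
import csv
import io

# Hypothetical input file: each row is one pair of nodes plus one edge variable.
raw = """source,target,mi_score
strong,tea,7.2
strong,coffee,6.1
powerful,computer,5.4
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Edge table: one row per interaction, with the edge variable kept alongside.
edge_table = [
    {"source": r["source"], "target": r["target"], "mi_score": float(r["mi_score"])}
    for r in rows
]

# Node table: every unique value found in either the source or the target column.
node_names = sorted({r["source"] for r in rows} | {r["target"] for r in rows})
# By default the shared name and the name (label) are identical.
node_table = [{"shared name": name, "name": name} for name in node_names]

print(len(node_table), "nodes and", len(edge_table), "edges")
```

Note that a word appearing in several rows, such as "strong" above, yields only a single node but one edge per row.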

Figure 2. Importing a network file into Cytoscape.


Figure 3. Assign variable roles.


Once all the columns have been assigned a role, Cytoscape will read the data and present a preliminary network visualisation based on the nodes and edges (Figure 4). At this point, the visualisation only shows the default structure of the network without making any use of the variables assigned to node or edge attributes.

The user may also want to add variables to the node table or the edge table that could not be included in the original input file. Cytoscape gives the option of adding information to both the node and edge tables by importing data from additional files. Here again, the file must have a tabular structure and there must be a key variable that maps the new data to the nodes or edges. For example, when adding data to the node table, the logical key variable is “shared name”.
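The role of the key variable can be illustrated with a short Python sketch in which additional node attributes are attached to an existing node table by matching on the shared name. The table contents and the attribute values are hypothetical.

```python
# Existing node table, as it might look after the initial import.
node_table = [
    {"shared name": "strong", "name": "strong"},
    {"shared name": "tea", "name": "tea"},
    {"shared name": "coffee", "name": "coffee"},
]

# Hypothetical additional data, keyed by the shared name of each node.
extra_attributes = {
    "strong": {"frequency": 120, "pos": "ADJ"},
    "tea": {"frequency": 45, "pos": "NOUN"},
}

for row in node_table:
    # Nodes without a match in the new data are simply left unchanged.
    row.update(extra_attributes.get(row["shared name"], {}))

print(node_table)
```

If the key values in the additional file do not match the shared names exactly, the new data will not be attached to any node, so it is worth checking the spelling and capitalisation of the keys before importing.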

The next step is to run a network analysis, which means that Cytoscape will run a variety of analytical algorithms and determine network statistics, such as In-degree, Out-degree, Betweenness, etc. These can be used in many of the layout algorithms as input variables. Go to Tools->Network Analyzer->Network Analysis->Analyze network. Cytoscape adds the newly calculated variables to the node and edge tables, as appropriate. Before the analysis is run, we must assign the network either as a directed network or undirected network. In a directed network there is a direction between the connected nodes, one being the source and the other being a target. If the network is considered undirected, there is simply a link between the two nodes. This information is used for calculating many of the network statistics. For example, if we want to calculate the distance from one node to another, the distance may be much longer in a directed network because the distance can only be calculated in the direction of the edges.
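The effect of directionality on distance can be demonstrated with a small Python sketch that uses a breadth-first search to count the number of edges on the shortest path between two nodes. The function name `shortest_distance` and the four-node example are assumptions made for the illustration and are not part of Cytoscape.

```python
from collections import deque

def shortest_distance(edges, start, goal, directed):
    """Number of edges on the shortest path from start to goal, or None."""
    neighbours = {}
    for source, target in edges:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set())
        if not directed:
            # In an undirected network the link can be followed both ways.
            neighbours[target].add(source)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the goal cannot be reached from the start node

# Hypothetical network in which the edges form a one-way circle.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]

directed_distance = shortest_distance(edges, "A", "D", directed=True)
undirected_distance = shortest_distance(edges, "A", "D", directed=False)
print(directed_distance, undirected_distance)
```

When the edges are treated as directed, the path from A to D has to follow the circle through B and C, whereas in the undirected reading A and D are direct neighbours.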

Figure 4. Initial network layout.


Once the network analysis has been carried out, a layout can be assigned for the network (Figure 5). Some fifteen layouts are included in a fresh install of Cytoscape, and additional layout libraries can be downloaded using the in-built App Manager. As discussed previously, the different layouts put emphasis on different aspects of the network: some attempt to present the network structure in a hierarchical fashion, others as a circular structure, yet others in a variety of ways that simulate gravitational forces acting on the nodes or as springs pulling the nodes together. These so-called force-directed algorithms make use of edge properties: for example, in a linguistic network we might use collocation statistics or standardised frequencies as edge weights.

Figure 5. Selecting a layout.


Once the graph has been created, it is possible to zoom in and out of the visualisation using the mouse. By default, Cytoscape presents the image with reduced graphic details to save processing power. The details will become visible once we are zoomed in close enough, or when the visualisation is exported as an image. It is also possible to turn the full graphic details on, if your computer can handle it: View->Show Graphics Details.

Both nodes and edges can be customised using a simple user interface (Figure 6). This is the part of the process where visual data exploration really comes into its own. The size, shape, colour, transparency, and border of the node, as well as of the node label, can be independently manipulated, giving the user a variety of options for visualising variables associated with the node or the edge.

Figure 6. Setting style attributes for nodes and edges.


The properties of all of the attributes can be set at three different levels. Firstly, a default value will apply across the dataset, which means that, for example, the default colour of the nodes could be set to blue: if no other rules apply, the nodes are all painted blue. Secondly, a mapping value can be set for the nodes, which means that the value of the attribute is derived from the node table. There are three options for mapping. Continuous mapping takes a continuous variable (like word frequency) and maps it to the attribute in whatever way the user decides. For example, the size or colour of the node or the label could reflect frequency, year, or word count. Discrete mapping is typically used for nominal variables, such as gender or lemma. With this option, the user can decide on the attribute value for each unique value of the variable, e.g., a distinct colour for each gender. Passthrough mapping uses the variable value directly from the node table, provided of course that the values in the data table are appropriate for that type of attribute. Thirdly, it is possible to manually bypass any of the attribute values on a node-by-node or edge-by-edge basis. Thus, for example, the user can highlight a particularly important node or edge by manually changing its appearance to something eye-catching, such as increasing label size or making it the only yellow node in the visualisation. The same options are available for each edge attribute.
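The order of precedence between the three levels, and the difference between the three mapping types, can be summarised in a short Python sketch. This is a conceptual illustration only and does not reproduce Cytoscape's internal workings. All function names, colours and value ranges are assumptions.

```python
def resolve_colour(node, default, mapping=None, bypass=None):
    """A bypass overrides a mapping, which in turn overrides the default."""
    if bypass and node["shared name"] in bypass:
        return bypass[node["shared name"]]
    if mapping is not None:
        mapped = mapping(node)
        if mapped is not None:
            return mapped
    return default

# Discrete mapping: one colour for each unique value of a nominal variable.
POS_COLOURS = {"NOUN": "green", "VERB": "red"}

def discrete_colour(node):
    return POS_COLOURS.get(node.get("pos"))

# Continuous mapping: frequency is scaled linearly onto a range of node sizes.
def continuous_size(node, low=0, high=100, min_size=10.0, max_size=50.0):
    share = (node["frequency"] - low) / (high - low)
    share = min(max(share, 0.0), 1.0)  # keep values outside the range in bounds
    return min_size + share * (max_size - min_size)

# Passthrough mapping: the value in the node table is used as it is.
def passthrough_label(node):
    return node["name"]

tea = {"shared name": "tea", "name": "tea", "pos": "NOUN", "frequency": 50}
strong = {"shared name": "strong", "name": "strong", "pos": "ADJ", "frequency": 80}

print(resolve_colour(tea, "blue", discrete_colour))      # mapping applies
print(resolve_colour(strong, "blue", discrete_colour))   # no rule, so default
print(resolve_colour(tea, "blue", discrete_colour, bypass={"tea": "yellow"}))
print(continuous_size(tea), passthrough_label(tea))
```

Here the node for "strong" falls back on the default colour because its part of speech has no entry in the discrete mapping, while the bypass makes "tea" yellow regardless of any mapping.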

One of the potential benefits of network visualisations is that by manipulating the visual properties of the nodes and edges we can identify associations and clusterings in our data which might otherwise go unnoticed. Since almost all aspects of the nodes, edges and layout can be customised, the main question to ask is how much data we can realistically take in. For example, if the colour and size of the node, the colour of the node border and its thickness, the shape of the node marker, and the colour and size of the node label, as well as the text of the label, were all used for presenting information about different variables, the reader would have to take in eight different visual cues about a single marker. This would almost certainly be too much.

Cytoscape projects are saved as Cytoscape session files (.cys), which include all details of the network and the graph in a single file. The networks and individual tables can also be exported in other formats. Cytoscape network graphs can be saved in a variety of graphics formats, including png- and pdf-files. It is important to note that the saved image includes only the portions of the network that are visible on screen at the moment the image is saved. The best quality images for most purposes are pdf-files, because the vector pdf-files are scalable and retain their quality regardless of the size at which they are printed or viewed.

Video 1 shows the basic workflow from importing a network file to running a network analysis on it and selecting a layout, and finally to setting the style attributes.

Video 1.

 

A few common questions and challenges associated with network visualisations should be discussed briefly. Perhaps the most often raised question concerns the reliability of network graphs. The fact that the same network of nodes and edges can be visualised in a variety of different layouts can be confusing, and the natural question is how we know which layout is the ‘correct’ one to use. While it is certainly possible to say that some layouts are better suited for some types of networks than others, it is important to keep in mind that the primary function of the visualisation is to allow us to observe a complex network structure more clearly and to discern relationships between entities within such structures. In short, the best layout is the one that provides useful information and does not mislead. Perhaps the main challenge with network graphs comes from the reader (and scholar) misinterpreting the spatial layout by, for example, believing that nodes in close proximity to one another are always associated. This is not necessarily the case, and it is for that reason always important to plot the edges and to understand how the layout algorithm works. For example, in most network layouts direction has no explicit meaning per se, which means that when a source node has multiple target nodes that have no other edges, the direction of the target node from the source node is meaningless. The layout algorithm may simply maximise the distance between the target nodes; the examples in Figures 7, 8 and 9 use the Edge-weighted Spring Embedded layout. In Figure 7, although nodes D and F are closer to node B than nodes E and G, their association with B is not stronger. Node C, on the other hand, has edges with nodes A and B, and its position is meaningful in relation to them.

Figure 7. Simple network showing one shared node.


Should node B have two more unique target nodes of its own (H and I), the layout algorithm will find them a location that maximises the distances between all the target nodes (C, H and I) with only the location of C locked in (Figure 8).

Figure 8. Positions of unshared nodes maximise distances.


If node H has two source nodes like node C, the two would be located in-between A and B, and the other single-edge nodes spaced out accordingly (Figure 9).

Figure 9. All shared target nodes are affected by their source nodes only.
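Before interpreting positions, it can help to check which target nodes are actually shared. The following sketch assumes a hypothetical edge list loosely modelled on the situation in Figure 9 (the exact edges of the figures are not given in the text, and nodes J and K are invented for illustration). It separates targets with more than one source node, whose position relative to those sources is meaningful, from single-edge targets, whose direction from their source is not.

```python
from collections import defaultdict

# Hypothetical directed edge list (source, target); not the actual data behind Figure 9.
edges = [
    ("A", "C"), ("A", "H"), ("A", "J"), ("A", "K"),
    ("B", "C"), ("B", "H"), ("B", "D"), ("B", "E"),
    ("B", "F"), ("B", "G"), ("B", "I"),
]

# Collect the set of source nodes pointing at each target node.
sources_of = defaultdict(set)
for source, target in edges:
    sources_of[target].add(source)

# Targets with several sources are pulled between them by the layout.
shared_targets = sorted(t for t, s in sources_of.items() if len(s) > 1)
# Targets with a single source are merely spread out around that source.
single_edge_targets = sorted(t for t, s in sources_of.items() if len(s) == 1)
```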


Moving on to another common issue, journal editors and copy editors often raise concerns over the readability of network graphs, especially when it comes to overlapping nodes and small labels. The unavoidable fact is that when the size of a network increases, either the canvas size or the density of nodes must increase. Depending on the number of nodes, a tightly packed visualisation may become difficult or even impossible to read. Cytoscape offers some remedies. Selecting View->Show Tool Panel opens a tool panel which provides controls for stretching or contracting the whole visualisation or a part of it, as well as for rotating it as desired (Figure 10). The individual nodes, or groups of nodes, can also be manually dragged to new positions, though this is best left for situations where overlaps make the graph difficult to read. It is also possible to bypass the style settings for any of the individual nodes or edges, which means that those of particular interest can be made larger or given a prominent colour or a larger label.

Figure 10. Node layout tool for scaling and rotating the graph.


However, given the limitations of a conventional printed page, it does not take much to reach the point where the details of the graph become unreadable. Depending on the data and the layout, important structural tendencies may still be visible, but most of the time the individual nodes and/or edges become too small to read. The problem can be solved in digital publishing by making the images scalable (as in this volume), or by providing a companion website, hosted either by the publisher or by the researcher, from which readers can download higher quality images.

A third issue, which is common to many types of data visualisations, is that some additional annotations might be useful to include in the image, but the visualisation tool does not seem to provide the functionalities needed. The simplest solution is to export the image, open it in a graphics editor (Photoshop, Pixelmator, whatever is available) and add the annotations manually. Cytoscape also includes some manual annotation functions, which allow the user to add text and various shapes over the network graph. There are also auto-annotation functions, which can identify clusters of related nodes, but space does not permit going into detail about their use here.

 

While the examples that follow in section 4.1 focus on Cytoscape, it is worth briefly mentioning that there are a variety of similar tools available. Gephi is a freeware tool that is popular among humanities scholars. It functions in a similar way to Cytoscape and offers a comparable number of layout options and plugins produced by the user community. Some corpus tools, such as CasualConc and LancsBox, also have built-in network graphing functionalities, though the options are usually more limited than in specialised network analysis tools. In library and information science – particularly bibliometrics, where network analysis plays an important role – commonly used tools include VOSviewer and CiteSpace. More recently, various online visualisation tools for the production of network graphs have become available. Flourish, a relatively new online visualisation service, includes directed and undirected network graphs, though the options are currently quite limited when it comes to the layouts available. Finally, for users who are comfortable using approaches that require programming, network analysis and graphing can be accomplished using the many specialised network analytical libraries and packages that are available for languages such as R, Python, and JavaScript. In R, network graphs can be built using packages such as igraph, visNetwork and networkD3; in Python with NetworkX; and in JavaScript, almost any type of graphing can be done with D3. For an excellent introduction, see Desagulier (2020).

 

Each of the subsections will present an example of how network graphs can be used to illustrate a specific corpus linguistic topic. The corpora used are all publicly available.

 

Every corpus-linguistic research project relies on one or more corpora. Although the structure of a corpus can be expressed in terms of summary statistics and hierarchical diagrams showing stratifications, it can be useful to get the proverbial “bird’s eye view” of the entire corpus. The potential advantage of a network graph is that the number of individual texts within the corpus and its various subsections become more visible, which encourages awareness of distributional characteristics of the corpus and, by extension, of any results obtained from it (see, e.g., Baroni 2008 or Gries 2008).

The example in Figure 11 shows a network view of the recently released corpus of Late Modern English Medical Texts (LMEMT; see Taavitsainen and Hiltunen 2019). The corpus comprises seven major categories, of which the numerically largest, Periodicals, is further divided into subcategories of extracts from three different periodicals. The visualisation shows the relative sizes of the categories in terms of text counts. The sizes of the node markers are calculated based on a network statistic called Betweenness Centrality, which is a measure of how often the node lies on the shortest paths between other nodes; the node for the corpus itself, LMEMT, was manually enlarged for better comprehensibility of the graph.

Figure 11. Network visualisation of corpus structure.
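To see why Betweenness Centrality tracks the number of texts in a category, consider a minimal sketch on a hypothetical miniature corpus tree (one corpus node, two categories and four texts; none of these names come from LMEMT). The function below is a plain-Python rendering of Brandes' algorithm for unweighted, undirected graphs and returns unnormalised values; Cytoscape's own output may be scaled differently, so only the relative ordering should be compared.

```python
from collections import deque


def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes' algorithm, unweighted, undirected)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was visited from both ends, so halve the totals.
    return {v: b / 2 for v, b in bc.items()}


# Hypothetical corpus structure: cat1 holds three texts, cat2 holds one.
edges = [("corpus", "cat1"), ("corpus", "cat2"),
         ("cat1", "t1"), ("cat1", "t2"), ("cat1", "t3"), ("cat2", "t4")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

bc = betweenness(adj)
```

In this toy tree the category with three texts receives a higher score than the category with one, and the individual texts, which lie on no paths between other nodes, receive a score of zero.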


Another visualisation option might be to make the node sizes reflect the word counts of each individual text or, in the case of categories, the combined word counts of all texts within that category. It would also be possible to use other features of the visualisation to indicate additional attributes: for example, the node shapes could indicate whether or not the text was a translation, or perhaps a coloured node border could indicate the gender of the author. Ultimately the decision comes down to striking the right balance between the depth of data and complexity, on the one hand, and comprehensibility, on the other.

 

Collocation analysis is one of the core methodologies in corpus linguistics. Based on the statistical concept of association strength, collocations are calculated by identifying words that have a higher-than-random chance of co-occurring either within a specified window or at a specific distance and direction from each other. Collocations are typically calculated using measures such as Chi-squared, Log-likelihood, LogDice, etc., and the traditional way of representing collocations in corpus linguistics has been a table showing the most strongly associated words to the left and right of a pre-determined node word, or at specific distances to the left and right of the node word. With network graphs, collocations can be examined visually, which is especially useful for studying the mutual collocates of multiple query words; see also the GraphColl function in LancsBox.
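As a concrete instance of an association measure, the LogDice score of Rychlý (2008) can be written in a few lines. The sketch below follows the published formula, 14 + log2(2·f(x,y) / (f(x) + f(y))); the function name and the frequencies in the usage note are hypothetical.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """LogDice association score (Rychlý 2008).

    f_xy: co-occurrence frequency of the two words
    f_x, f_y: individual frequencies of the two words
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))
```

With invented counts, `log_dice(50, 100, 100)` gives 13.0: the theoretical maximum of 14 is reached only when the two words always occur together, and each halving of the Dice coefficient lowers the score by one point.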

Figure 12 illustrates how network graphs can be used to visualise collocate networks. The data for this example comes from the COVID-19 Open Research Dataset, which comprised 224 million words from open access academic publications on the topic at the time the collocation analysis was carried out. The analysis focuses on the verb collocates of the three closely associated terms pandemic, epidemic and outbreak when the terms occur as objects of verbs, as in “declare a pandemic”. The 100 most significant collocates of each term were retrieved using the LogDice statistic and the shared collocates were then visualised using a directed network graph, in which the respective keywords functioned as source nodes and the collocate words functioned as target nodes. [8] The Compound Spring Embedded layout allows us to assign a continuous variable as edge weight and the LogDice statistic is ideal for the purpose. Finally, the count of each collocate word was represented by the size of the target node marker.

Figure 12. Three keywords and their collocates.


The graph makes it easy to see that the nouns epidemic and outbreak share a much higher number of verb collocates with each other than either one does with pandemic. Only 12 out of the 201 unique verbs are shared by all three keywords, and epidemic and outbreak each share only a few verbs exclusively with pandemic. In contrast, epidemic and outbreak share 54 verb collocates that are not found with pandemic. It goes without saying that this data could be presented in tabular form, but the network visualisation arguably makes the overall structure of the association data immediately clear.
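Counts of this kind can be checked directly from the collocate lists. The following sketch uses small invented collocate sets (not the actual top-100 lists) to show how the number of keywords sharing each verb, and the verbs shared by epidemic and outbreak but not by pandemic, might be derived.

```python
from collections import Counter

# Hypothetical collocate sets; the real lists contain 100 verbs per keyword.
collocates = {
    "pandemic": {"declare", "cause", "fight", "spread"},
    "epidemic": {"cause", "spread", "control", "contain"},
    "outbreak": {"cause", "spread", "control", "investigate"},
}

# For every verb, count how many keywords it collocates with
# (this equals the number of incoming edges of the target node).
keywords_per_verb = Counter(v for verbs in collocates.values() for v in verbs)

shared_by_all = sorted(
    v for v, n in keywords_per_verb.items() if n == len(collocates)
)
epidemic_outbreak_only = sorted(
    (collocates["epidemic"] & collocates["outbreak"]) - collocates["pandemic"]
)
unique_verbs = len(keywords_per_verb)
```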

Adding more data to the visualisation is easy. For example, if we wanted to compare the collocates of the three terms in the COVID-19 dataset with their collocates in more generic language use, we could add data from a large web-derived corpus such as the 15-billion-word English EnTenTen corpus. We add three new source nodes to the datafile and the collocates as target nodes. Importantly, in order to maintain the difference between similarly named source nodes (e.g., pandemic in the COVID corpus and pandemic in EnTenTen), we need to give the new source nodes names that differ from those of the previous source nodes. These nominal variables are known in Cytoscape as ‘shared name’, as discussed in Section 3.1 above. The names displayed in the visualisation as node labels are by default set to be the same as shared names, but they can be easily changed. In Figure 13, marker shape is used to indicate the dataset (diamond for the COVID corpus and square for EnTenTen) and marker colour is used to indicate the word (red for pandemic, green for epidemic and purple for outbreak).
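One way to prepare such a combined datafile is sketched below. The naming pattern (word plus corpus suffix) follows the labels used in Table 1, but the helper function, the column names and the collocates are hypothetical; the point is only that the unique identifier and the displayed label are kept as separate columns.

```python
def shared_name(word, corpus):
    """Build a unique node identifier such as 'Pandemic_COVID'."""
    return f"{word.capitalize()}_{corpus}"


# Hypothetical collocate lists keyed by (word, corpus).
collocates = {
    ("pandemic", "COVID"): ["declare", "cause"],
    ("pandemic", "EnTenTen"): ["cause", "survive"],
}

edge_rows = []   # (source shared name, target shared name)
node_rows = {}   # shared name -> attributes usable for style mappings

for (word, corpus), verbs in collocates.items():
    name = shared_name(word, corpus)
    # The label repeats across corpora; the shared name does not.
    node_rows[name] = {"label": word, "corpus": corpus, "word": word}
    for verb in verbs:
        edge_rows.append((name, verb))
        # Collocates keep their plain form so that both corpora share the node.
        node_rows.setdefault(verb, {"label": verb, "corpus": "", "word": ""})
```

The corpus and word columns can then drive discrete mappings for marker shape and colour, while the label column is passed through as the displayed name.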

Figure 13. Three keywords from two corpora and their collocates.


The network graph shows that the collocates of pandemic in the COVID dataset seem to differ from those of the other words. All the other source nodes cluster closer together, with outbreak and epidemic in the COVID dataset, and pandemic and epidemic in EnTenTen, appearing relatively similar. There are only four verbs that are shared by all six source nodes (cause, emerge, occur and spread), nineteen that are shared by five source nodes, and so on; see Table 1 for the number of shared collocates between individual source nodes. As can be seen, the relative positions of the source nodes in Figure 13 appear to reflect visually the overlaps in the collocate counts quite well.

| | Pandemic_COVID | Outbreak_COVID | Epidemic_EnTenTen | Pandemic_EnTenTen | Outbreak_EnTenTen |
| Epidemic_COVID | 20 | 67 | 47 | 41 | 51 |
| Pandemic_COVID | | 20 | 15 | 15 | 10 |
| Outbreak_COVID | | | 38 | 31 | 58 |
| Epidemic_EnTenTen | | | | 50 | 53 |
| Pandemic_EnTenTen | | | | | 37 |

Table 1. Shared collocates of the three terms in the two corpora.
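A pairwise overlap table of this kind can be produced from the collocate sets of the source nodes. The sketch below uses invented sets for three of the six source nodes; it does not reproduce the figures in Table 1.

```python
from itertools import combinations

# Hypothetical collocate sets keyed by source node name.
collocates = {
    "Epidemic_COVID": {"cause", "spread", "control"},
    "Outbreak_COVID": {"cause", "spread", "control", "investigate"},
    "Pandemic_EnTenTen": {"cause", "survive"},
}

# Upper triangle of the overlap table: one count per unordered pair of source nodes.
overlap = {
    (a, b): len(collocates[a] & collocates[b])
    for a, b in combinations(collocates, 2)
}
```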

It goes without saying that more detailed qualitative analysis could be carried out next in order to examine the characteristics both of the shared collocates and of the items that are associated with only a few source nodes, or with only one. For example, the shared collocates of pandemic in the two corpora show that pandemics are characterised by the way they strike, emerge and unfold, and the way they escalate, spread and grow.

 

Although the basic approach to graphing keywords is very similar to what we saw with collocations, the particular usefulness of network graphs comes into play when it comes to so-called key-keywords, or keywords that are shared by multiple texts in a corpus (see, e.g., Baker 2004). Given that keywords are defined as words that occur with a significantly higher frequency in a given text (or corpus) than the same words occur in a reference corpus, it has become conventional to interpret keywords as indications of topic and/or notably common grammatical features, which may be associated with formulaic expressions or register features. Taking this one step further, key-keywords can be used for clustering together similar texts.

While it is possible to compare keyword lists and to identify shared items either manually or by using semi-automatic means, network graphs make this type of analysis particularly attractive because of the way they allow us both to visualise the overall relations between texts and to see patterns emerge as the number of texts increases. As an example, we will build a network of the keywords of all Shakespeare plays. Using the Shakespeare corpus made freely available by Mike Scott, the keywords of each of the 37 plays were calculated using the keyword statistic TF-IDF (Term Frequency – Inverse Document Frequency), which is a well-known and widely used basic keyness statistic especially in text mining applications. [9]

The TF-IDF statistic is initially calculated for each unique word type in the corpus and for every text. Importantly in terms of the network graph, if the word does not occur in the text (TF=0), it gets a value of zero, but a datapoint is still created for that word in that text. From the perspective of the network graph, the value of the variable has no functional meaning until it is assigned a role, and thus when the table was imported into Cytoscape, an edge was created between each source and target node regardless of any variable values. Thus, to focus only on relevant words, all words with a TF-IDF score smaller than 0.1 in a text were excluded from the network. At this stage, each play was also assigned a genre category, with 17 plays designated as comedies, 10 as tragedies and 10 as historical plays. Because we want to make this information visible in our graph, a categorical attribute called ‘genre’ is added with the possible values of ‘comedy’, ‘tragedy’ and ‘historical’.
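The pruning step can be sketched as follows. The article does not state which variant of TF-IDF was used, so the formula here (relative term frequency multiplied by the natural logarithm of the number of texts divided by the document frequency) is an assumption, and the three miniature ‘plays’ are invented; scores from this sketch are therefore not comparable with the thresholds reported above. What the sketch does reproduce is the logic: a datapoint exists for every text-word pair, including zeros, and only pairs at or above the threshold become edges.

```python
import math
from collections import Counter

# Hypothetical toy texts standing in for the plays.
plays = {
    "PlayA": "king crown king war".split(),
    "PlayB": "love fool love jest".split(),
    "PlayC": "king love ghost ghost".split(),
}


def tf_idf(texts):
    """Return {(text, word): score} for every text and every word type in the corpus."""
    n_texts = len(texts)
    counts = {name: Counter(tokens) for name, tokens in texts.items()}
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        for word in doc_freq:
            tf = c[word] / total  # 0.0 when the word is absent from the text
            scores[(name, word)] = tf * math.log(n_texts / doc_freq[word])
    return scores


def prune(scores, threshold):
    """Keep only text-word pairs whose score reaches the threshold."""
    return [(t, w, s) for (t, w), s in scores.items() if s >= threshold]


scores = tf_idf(plays)
edges = prune(scores, 0.1)
```

Raising the threshold shrinks the network quickly: in this toy example `prune(scores, 0.3)` leaves a single edge.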

Next, we build the network graph. The source nodes in our network are the plays and the target nodes are the keywords. The label of the node is either the name of the play or the keyword. The edges in our graph are going to be between plays and keywords. We make them directed edges, because we want to use the TF-IDF statistics as edge weights, that is, we want the relative word placements to reflect the importance of each word to each play: the stronger the keyness, the shorter the distance between the play node and the keyword node. The genres are marked with colour, green indicating historical plays, blue comedies and red tragedies.

Figure 14 shows the result of the visualisation. The plays are arranged on the basis of their shared keywords, in addition to which each play has keywords that are unique to that play, seen in the visualisation as fan-shaped ‘bursts’ projecting away from the centre of the image. A visual inspection of the relative arrangement of the nodes would appear to confirm that the network graph represents the textual reality accurately. Zooming in on the pdf-file that can be exported from Cytoscape will show the individual keywords. It is important to note that as the number of nodes increases, it becomes increasingly difficult to avoid overlapping nodes and labels. Furthermore, and more critically in terms of analytical usefulness, it becomes necessary for the algorithm to place target nodes close to other source nodes with which that specific target node is not associated. While smaller networks (such as Figure 13) can be used for studying both the source and target nodes’ relative positions, in large networks the main object of inquiry should be the relative positions of the source nodes. Here, we see among other things that the three historical plays on Henry VI are clustered close together, suggesting that they share a fair number of keywords; several other historical plays and three tragedies are also found relatively nearby.

Figure 14. Network graph of keywords in Shakespeare's plays.


In Figure 15, the data has been pruned further by setting a threshold of 0.3 for the TF-IDF keyness. This removes nearly 90% of the words from the data table, leaving only words that are strongly associated with specific plays; zooming into the graph will allow the inspection of the individual keywords. Although the ‘gravity’ of any of those individual word nodes is relatively minor, the combined effect of the pull that they exert on the text nodes is quite notable. It is easy to see that the historical plays all occupy a central position in the graph, with comedies and tragedies further out. The historical tragedies appear together toward the right-hand side of the graph. The sizes of the source nodes reflect EdgeCount, one of the network statistics calculated by Cytoscape. As the term implies, EdgeCount is the total number of edges a node has regardless of direction. In directed networks, the direction-specific node parameters InDegree and OutDegree are also calculated.
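These three statistics are simple to derive from a directed edge list, as the following sketch shows. The play names and keywords are hypothetical, and the code imitates rather than calls Cytoscape's network analysis.

```python
from collections import Counter

# Hypothetical directed edges from play (source) to keyword (target).
edges = [
    ("PlayA", "duke"), ("PlayA", "crown"),
    ("PlayB", "duke"), ("PlayB", "forest"), ("PlayB", "fairy"),
]

out_degree = Counter(source for source, _ in edges)   # edges leaving a node
in_degree = Counter(target for _, target in edges)    # edges arriving at a node
edge_count = out_degree + in_degree                   # all edges, either direction

# Keywords attached to exactly one play form the fan-shaped 'bursts' in the graph.
unique_keywords = Counter(
    source for source, target in edges if in_degree[target] == 1
)
```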

We see that three of the comedies (Love’s Labour’s Lost, The Merry Wives of Windsor, and A Midsummer Night’s Dream) appear to have a relatively high number of unique edges, or in other words keywords that are not shared by any other text. The graph also highlights some potentially interesting details. The comedy Cymbeline appears quite far removed from the other comedies and quite close to the historical tragedies which, considering that Cymbeline dramatises the story of the Celtic British king Cunobeline and is set in Ancient Britain, is perhaps to be expected.

Figure 15. Pruned network graph of keywords in Shakespeare's plays.


It is also worth noting that once the network analysis function has been carried out, the network parameters are added to the node table, which can be exported from Cytoscape in .csv format and analysed further in any statistics tool. In the case of our Shakespeare example, a closer look at the network statistics shows that there are 268 keywords that are shared by at least two texts. The two most shared keywords are duke and England, shared by 11 plays, and the next most common are Richard, John and Gods, each shared by 8 plays. The centrality of the historical plays, which was previously observed through visual inspection, can be assessed statistically using the network parameter Neighbourhood Connectivity, which is defined as the average connectivity of all neighbours of a given node, or in this case, of all the keywords that a particular play has. If we calculate the average Neighbourhood Connectivity for the three different types of plays (Figure 16), we see that the historical plays are more likely to have keywords that are shared with at least one other play (historical plays μ=2.31, n=11; comedies μ=1.37, n=16; tragedies μ=1.60, n=10). [10] This is of course largely explained by topical similarity between the historical plays, nearly all of which concern royal courts and largely draw on Holinshed's Chronicles (1577 and 1587).

Figure 16. Comparison of mean Neighborhood Connectivity between types of plays.
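The definition of Neighbourhood Connectivity can be made concrete with a small sketch. The plays, keywords and genre labels below are hypothetical, and the calculation follows the definition given in the text (the mean number of connections of a node's neighbours) rather than calling Cytoscape itself.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical play-to-keyword edges and genre labels.
edges = [
    ("H1", "king"), ("H1", "crown"), ("H1", "war"),
    ("H2", "king"), ("H2", "crown"), ("H2", "duke"),
    ("C1", "love"), ("C1", "fool"),
    ("T1", "ghost"), ("T1", "king"),
]
genre = {"H1": "historical", "H2": "historical", "C1": "comedy", "T1": "tragedy"}

# Treat the network as undirected for the purpose of counting neighbours.
neighbours = defaultdict(set)
for play, word in edges:
    neighbours[play].add(word)
    neighbours[word].add(play)


def neighbourhood_connectivity(node):
    """Mean number of connections over all neighbours of the node."""
    return mean(len(neighbours[n]) for n in neighbours[node])


# Average the play-level values within each genre.
by_genre = defaultdict(list)
for play, g in genre.items():
    by_genre[g].append(neighbourhood_connectivity(play))
genre_means = {g: mean(values) for g, values in by_genre.items()}
```

A play whose keywords are all unique to it scores exactly 1, as the hypothetical comedy does here; values above 1 indicate that at least some of its keywords are shared with other plays.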


 

The final example illustrates the use of network graphs for analysing text similarities and author profiles. The primary data comes from the Quaker Historical Corpus, compiled and made available by Judith Roads. The corpus contains 173 texts and 722,370 words from the latter half of the seventeenth century, written by 156 members of the Society of Friends. The research design concerns two issues: similarities between the texts, with particular reference to biblical verses, and repetition within texts, which was a known feature of Quaker writing (see, e.g., Peters 2018). To investigate which members of the group cited the same texts, we build a network graph with the individual texts as source nodes and unique 5-grams derived from the texts as target nodes. We might expect individuals who used the same source texts to cluster closer together than those who did not, particularly because repeated sequences longer than 5 lexical items will create several shared target nodes between the source nodes.

The 5-grams were collected and the most frequent 2,000 5-grams were saved for the analysis. A binary datatable was then assembled of the texts and 5-grams, with 1 indicating that the 5-gram occurs in the text and 0 indicating that it does not. The datatable was stacked and all the zero rows were deleted, leaving a long-form datatable with one row each for every existing text-to-5gram pair. Before this network file was imported into Cytoscape, the further step was taken to collect the 10 most common words occurring in the 5-grams, namely lord, god, spirit, power, light, day, christ, world, life and people. Adding these words to the visualisation allows us to see whether specific types of authors cluster around specific topics.
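The data preparation described here can be sketched in a few lines. The three miniature texts and the cut-off of one 5-gram are invented for illustration (the study kept the 2,000 most frequent 5-grams), and the tokenisation is a plain whitespace split, which is an assumption.

```python
from collections import Counter


def ngrams(tokens, n=5):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical toy texts.
texts = {
    "t1": "the light of the lord is within the light of the lord".split(),
    "t2": "walk in the light of the lord".split(),
    "t3": "the power of god is over all".split(),
}
TOP_N = 1  # 2,000 in the study

grams = {name: ngrams(tokens) for name, tokens in texts.items()}
corpus_counts = Counter(g for gs in grams.values() for g in gs)
top = [g for g, _ in corpus_counts.most_common(TOP_N)]

# Wide binary table: 1 if the 5-gram occurs in the text, 0 if it does not.
gram_sets = {name: set(gs) for name, gs in grams.items()}
binary = {name: {g: int(g in gram_sets[name]) for g in top} for name in texts}

# Stack the table and drop the zero rows: one row per existing text-to-5gram pair.
pairs = [(name, g) for name, row in binary.items() for g, flag in row.items() if flag == 1]
```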

The list of 5-grams was then paired with these 10 manually identified key-keywords, with the 5-grams as source nodes and the 10 words as target nodes. As before, only those pairs were retained where one of the words occurred within the 5-gram. The list of 5gram-to-keyword pairs was appended to the end of the text-to-5gram file. It is important to note here that Cytoscape treats all nodes as nodes regardless of what they represent, and thus the same network file can include links between items that belong to different conceptual realms, such as word sequences, individual words, and texts.
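A minimal sketch of this pairing step is given below. The ten words are those listed above; the text identifier and two of the three 5-grams are hypothetical, and matching is done on whole lowercase tokens, which is an assumption about how the original pairing was carried out.

```python
key_words = ["lord", "god", "spirit", "power", "light",
             "day", "christ", "world", "life", "people"]

# Hypothetical text-to-5gram pairs from the previous step.
text_to_gram = [
    ("text_001", "the light of the lord"),
    ("text_001", "in the power of god"),
    ("text_001", "other manner than is allowed"),
]

five_grams = sorted({gram for _, gram in text_to_gram})

# Keep a 5gram-to-keyword pair only if the word occurs as a token of the 5-gram.
gram_to_word = [
    (gram, word)
    for gram in five_grams
    for word in key_words
    if word in gram.split()
]

# One network file with two kinds of links: text -> 5-gram and 5-gram -> keyword.
network_rows = text_to_gram + gram_to_word
```

A 5-gram that contains none of the ten words, such as the third one here, simply receives no outgoing edges.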

Finally, an additional node file was prepared with the filename as the “shared name”, and three additional variables about each text: the author’s full name, the author’s last name only, and the author’s gender. The authors of the Quaker Historical Corpus include several married couples who both wrote religious literature, and including this information may allow us to see sociolinguistic patterns of writerly behaviour. Because we want to colour the nodes according to one variable, the choice was made at this time to include the key-keywords in the node list as well, and to assign them a ‘gender’ called ‘keyword’. This allows us to colour the nodes using the gender variable and at the same time also colour the keywords a different colour. In the final layout, male authors are marked with blue nodes, female authors by red, texts with multiple authors by purple, and key-keywords by yellow. The lexical 5-grams do not have a value for the ‘gender’ variable, and they are thus assigned the default colour of light blue by Cytoscape.

The network file and the node file were imported into Cytoscape, the network was analysed and the layout chosen; see section 4 for a detailed breakdown of the process. The layout algorithm used was the yFiles Organic layout, which is a part of the free yFiles library for Cytoscape. [11] The layout is a variation of the force-directed layout style suitable for undirected networks, in other words the layout does not make use of edge weights, which we did not have in this network. The final graph can be seen in Figure 17.

Figure 17. Keywords and key-keywords in the Quaker Historical Corpus.


What does the figure tell us? We can immediately see something interesting. With a couple of exceptions, the female authors tend to cluster toward the centre of the graph, suggesting that they probably use many of the same 5-grams. This may mean that they quote the same passages, or that they use the same independent religious phrases quite a lot. By contrast, we see that all the text nodes that lie toward the periphery of the graph are blue, indicating that they were written by male authors. It is immediately obvious that there are several male authors who appear to be using multiple 5-grams that are not shared by the others. Given that the selection criterion was that the 5-grams in the graph are among the 2,000 most common 5-grams in the corpus, this means that these male authors must have repeated the same 5-grams several times. For example, the node toward the upper right-hand corner of the graph represents Thomas Gibson’s Something Offered To The Consideration Of All Those Who Have A Hand (1665), a short polemic text of 3,670 words about right of assembly and worship. Gibson repeats the phrase “other manner than is allowed” 20 times, 7 of which occurrences are found as part of the longer sequence “in other manner than is allowed by the Lyturgie or practice of the Church of England”. Gibson’s node stands apart from the majority of other nodes, because his repetitions do not concern (to the same extent) biblical quotes but instead focus on interpretation of legal writing.

 

Network visualisations are an exciting and underused type of data visualisation in linguistic research. While network graphs are certainly not the ideal visualisation method for every situation, they offer one more angle of approach to analysing the increasingly complex and interconnected datasets that we work with, and as this introductory article has hopefully demonstrated, the concepts and methodologies one needs in order to get started are not difficult. This article has only scratched the surface of these powerful techniques.

 

[1] The ideas and methods discussed in this paper were developed in collaborative projects with Ursula Lutzky; Milla Luodonpää-Manni; and Koraljka Golub, Joacim Hansson and Ida Ahlström.

[2] Trudeau (1993) is a popular introduction to graph theory.

[3] As will be discussed later, depending on the tools used, it may also be possible to use network lists which include all the information that goes into a node list and an edge list as a single file.

[4] The term weight is conventionally used because many of the mathematical network models borrow the concept of gravity from physics when calculating the positions of the nodes in the visual space.

[5] The term statistic is used here in reference to specific calculated values, not to the field of statistics on the whole.

[6] The actual speeds depend on the hardware used as well as the size and complexity of the network's structure, so it is impossible to give exact performance times. In general, more complex force-directed graphs can take anywhere from a few minutes to several hours to graph.

[7] The development of the Cytoscape family of software tools is currently funded by a grant from the U.S. National Institute of General Medical Sciences (NIGMS) and the National Resource for Network Biology (NRNB).

[8] See Rychlý (2008).

[9] See the Shakespeare Corpus at Wordsmith Tools. The corpus contains all Shakespeare’s plays, as well as each character’s dialogue as a separate file.

[10] Assuming we treat the corpus as the totality of Shakespeare’s plays, the plays constitute the population, not a sample. Still, running a non-parametric Mann-Whitney U test for each pair, we get comedies vs. historical plays z=3.92, p=***, comedies vs. tragedies z=2.23, p=*, historical plays vs. tragedies z=-3.06, p=**.

[11] yFiles diagramming libraries are a product of yWorks. See yworks and yworks Organic Layout Style documentation.

 

Cytoscape. 2003-. Open-access tool available at https://cytoscape.org

Quaker Historical Corpus. 2015. Compiled by Judith Roads. https://www.woodbrooke.org.uk/resource-category/quaker-historical-corpus/

WordSmith Tools: Shakespeare Corpus: https://lexically.net/wordsmith/support/shakespeare.html

yworks: https://www.yworks.com

yworks Organic Layout Style: https://docs.yworks.com/yfiles/doc/developers-guide/smart_organic_layouter.html

 

Alissandrakis, Aris, Nico Reski, Mikko Laitinen, Jukka Tyrkkö & Magnus Levin. 2019. “Visualizing rich corpus data using virtual reality”. Corpus Approaches into World Englishes and Language Contrasts (Studies in Variation, Contacts and Change in English 20), ed. by Hanna Parviainen, Mark Kaunisto & Päivi Pahta. Helsinki: VARIENG. https://urn.fi/URN:NBN:fi:varieng:series-20-9.

Anthony, Laurence. 2013. “A critical look at software tools in corpus linguistics”. Linguistic Research 30(2): 141–161.

Araújo, T. & S. Banisch. 2016. “Multidimensional analysis of linguistic networks”. Towards a Theoretical Framework for Analyzing Complex Linguistic Networks, ed. by A. Mehler, A. Lücking, S. Banisch, P. Blanchard & B. Job, 107–131. Berlin: Springer.

Baker, Paul. 2004. “Querying keywords: Questions of difference, frequency and sense in keywords analysis”. Journal of English Linguistics 32(4): 346–359.

Baroni, Marco. 2008. “Distributions in text”. Corpus Linguistics: An International Handbook, Vol 2, ed. by Anke Lüdeling & Merja Kytö, 803–821. Berlin: Mouton de Gruyter.

COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-05-02. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed 2020-05-02. doi:10.5281/zenodo.3715505

Desagulier, Guillaume. 2020. “Plotting collocation networks with R: ‘hoard’ vs. ‘stockpile’ in the Coronavirus Corpus”. https://corpling.hypotheses.org/3300

di Battista, G., P. Eades, R. Tamassia & I.G. Tollis. 1999. Graph Drawing: Algorithms for the Visualization of Graphs. Prentice Hall.

Golub, Koraljka, Jukka Tyrkkö, Joacim Hansson & Ida Ahlström. 2020. “Subject indexing in humanities: A comparison between a local university repository and an international bibliographic service”. Journal of Documentation 76(6): 1193–1214. doi:10.1108/JD-12-2019-0231

Gries, Stefan Th. 2008. “Dispersions and adjusted frequencies in corpora”. International Journal of Corpus Linguistics 13(4): 403–437.

Kerren, Andreas, Kostiantyn Kucher, Yuan-Fang Li & Falk Schreiber. 2017. “BioVis Explorer: A visual guide for biological data visualization techniques”. PLoS ONE 12(11): e0187341. doi:10.1371/journal.pone.0187341

Kopaczyk, Joanna & Jukka Tyrkkö. 2017. “Blogging around the world: Universal and localised patterns in Online Englishes”. Applications of Pattern-Driven Methods in Corpus Linguistics (Studies in Corpus Linguistics 82), ed. by Joanna Kopaczyk & Jukka Tyrkkö, 277–310. Amsterdam and New York: John Benjamins.

Kucher, Kostiantyn & Andreas Kerren. 2015. “Text visualization techniques: Taxonomy, visual survey, and community insights”. Proceedings of the 8th IEEE Pacific Visualization Symposium (PacificVis '15), 117–121. Hangzhou, China: IEEE.

Laitinen, Mikko & Jonas Lundberg. 2021. “ELF, language change and social networks: Evidence from real-time social media data”. Language Change: The Impact of English as a Lingua Franca, ed. by Anna Mauranen & Svetlana Vetchinnikova, 179–204. Cambridge: Cambridge University Press. doi:10.1017/9781108675000.011

Milroy, Lesley. 1987. Language and Social Networks. New York: Blackwell.

Rychlý, Pavel. 2008. “A lexicographer-friendly association score”. RASLAN 2008, 6–9. Brno: Masarykova Univerzita.

Sairio, Anni. 2009. Language and Letters of the Bluestocking Network: Sociolinguistic Issues in Eighteenth-century Epistolary English. Helsinki: Société Néophilologique de Helsinki.

Schneider, Gerold, Mennatallah El-Assady & Hans Martin Lehmann. 2017. “Tools and methods for processing and visualizing large corpora”. Big and Rich Data in English Corpus Linguistics: Methods and Explorations (Studies in Variation, Contacts and Change in English 19), ed. by Turo Hiltunen, Joe McVeigh & Tanja Säily. Helsinki: VARIENG. https://urn.fi/URN:NBN:fi:varieng:series-19-9

Shannon, Paul, Andrew Markiel, Owen Ozier, Nitin S. Baliga, Jonathan T. Wang, Daniel Ramage, Nada Amin, Benno Schwikowski & Trey Ideker. 2003. “Cytoscape: A software environment for integrated models of biomolecular interaction networks”. Genome Research 13(11): 2498–2504.

Shneiderman, B., C. Dunne, P. Sharma & P. Wang. 2012. “Innovation trajectories for information visualizations: Comparing treemaps, cone trees, and hyperbolic trees”. Information Visualization 11(2): 87–105.

Siirtola, Harri, Poika Isokoski, Tanja Säily & Terttu Nevalainen. 2016. “Interactive text visualization with Text Variation Explorer”. Proceedings of the 20th International Conference on Information Visualisation (IV 2016), ed. by Ebad Banissi, 330–335. Los Alamitos, California: IEEE Computer Society.

Taavitsainen, Irma, Gerold Schneider & Peter Murray Jones. 2019. “Topics of eighteenth-century medical writing with a triangulation of methods: LMEMT and the underlying reality”. Late Modern English Medical Texts. Writing Medicine in the Eighteenth Century, ed. by Irma Taavitsainen & Turo Hiltunen, 31–74. Amsterdam and Philadelphia: John Benjamins.

Taavitsainen, Irma & Turo Hiltunen, eds. 2019. Late Modern English Medical Texts. Writing Medicine in the Eighteenth Century. Amsterdam and Philadelphia: John Benjamins. doi:10.1075/z.221

Trudeau, Richard J. 1993. Introduction to Graph Theory. New York: Dover.

Tukey, J.W. 1977. Exploratory Data Analysis. Reading, Massachusetts: Addison-Wesley.

Winters, Jane. 2017. “Tackling complexity in humanities big data: From parliamentary proceedings to the archived web”. Big and Rich Data in English Corpus Linguistics: Methods and Explorations (Studies in Variation, Contacts and Change in English 19), ed. by Turo Hiltunen, Joe McVeigh & Tanja Säily. Helsinki: VARIENG. https://urn.fi/URN:NBN:fi:varieng:series-19-1
