---
title: "The evolutionary algorithm"
author: "A Bradley Duthie"
date: "`r Sys.Date()`"
bibliography: '`r system.file("refs.bib", package = "resevol")`'
output: rmarkdown::html_vignette
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{The evolutionary algorithm}
%\usepackage[UTF-8]{inputenc}
link-citations: yes
linkcolor: blue
biblio-style: apalike
---
Introduction
================================================================================
```{r, echo = FALSE}
oldpar <- par();
```
The resevol package models individuals with complex genomes that can include any number of haploid or diploid loci and any number of traits with an arbitrary pre-specified covariance structure. It does this using a network that maps allele values at each locus to the covarying trait values of individuals.
```{r, echo = FALSE, fig.width = 7, fig.height = 6, fig.cap = "**Figure 1:** *Network mapping loci to traits through an intermediate set of hidden layers in the mine_gmatrix function*"}
par(mar = c(0.2, 0.2, 0, 0.2));
plot(x = 0, y = 0, type = "n", xlim = c(0, 1000), ylim = c(0, 1000),
xaxt = "n", yaxt = "n", bty = "n", xlab = "", ylab = "");
mtext(side = 1, text = "Hidden layers", cex = 2, col = "blue", line = -1);
mtext(side = 2, text = "Loci", cex = 2, col = "darkgreen", line = -2);
mtext(side = 4, text = "Traits", cex = 2, col = "red", line = -1);
Lx <- rep(x = 100, times = 12);
Ly <- seq(from = 100, to = 900, length = 12);
points(x = Lx, y = Ly, pch = 20, cex = 6, col = "darkgreen");
H1x <- rep(x = 300, times = 4);
H1y <- seq(from = 100, to = 900, length = 4);
points(x = H1x, y = H1y, pch = 15, cex = 6, col = "blue");
H2x <- rep(x = 500, times = 4);
H2y <- seq(from = 100, to = 900, length = 4);
points(x = H2x, y = H2y, pch = 15, cex = 6, col = "blue");
H3x <- rep(x = 700, times = 4);
H3y <- seq(from = 100, to = 900, length = 4);
points(x = H3x, y = H3y, pch = 15, cex = 6, col = "blue");
Tx <- rep(x = 900, times = 4);
Ty <- seq(from = 100, to = 900, length = 4);
points(x = Tx, y = Ty, pch = 18, cex = 8, col = "red");
for(i in 1:4){
e1 <- seq(from = -36, to = 36, length = 4);
arrows(x0 = Lx + 26, x1 = H1x[i] - 40, y0 = Ly, y1 = H1y[i] + e1,
length = 0.08);
arrows(x0 = H1x + 36, x1 = H2x[i] - 38, y0 = H1y, y1 = H2y[i] + e1,
length = 0.08);
arrows(x0 = H2x + 36, x1 = H3x[i] - 38, y0 = H2y, y1 = H3y[i] + e1,
length = 0.08);
e2 <- seq(from = -12, to = 12, length = 4)
arrows(x0 = H3x + 36, x1 = Tx[i] - 38, y0 = H3y, y1 = Ty[i] + e2,
length = 0.08);
}
```
The objective of the evolutionary algorithm in resevol is to find a set of network values that produce traits (red diamonds above) with the pre-specified covariance structure, given allele values (green circles above) that are sampled from a standard normal distribution (i.e., mean of zero and standard deviation of one). Conceptually, the problem is simple: we need to find values for the black arrows above that produce traits covarying in the way that we want (within some acceptable margin of error). The `mine_gmatrix` function uses an evolutionary algorithm to identify sets of values that work. Evolutionary algorithms work especially well for *I know it when I see it* problems [@Hamblin2013]. @Luke2013 explains the idea behind these algorithms in more detail:
> They're algorithms used to find answers to problems when you have very little to help you: you don’t know beforehand what the optimal solution looks like, you don’t know how to go about finding it in a principled way, you have very little heuristic information to go on, and brute-force search is out of the question because the space is too large. But if you’re given a candidate solution to your problem, you can test it and assess how good it is. That is, you know a good one when you see it.
In the `mine_gmatrix` function of the resevol package, an evolving population of networks like that in Figure 1 above is initialised. Parent networks produce offspring networks with some recombination and mutation, and offspring networks are selected based on how close their trait covariance matrix is to the pre-specified matrix input into the function. The algorithm is inspired by a similar algorithm within the [GMSE R package](https://confoobio.github.io/gmse/index.html) [@Duthie2018].
Key data structures used {#structures}
================================================================================
In the code, the arrows in Figure 1 above are represented by a set of matrices that map loci values to trait values. There are 12 loci in Figure 1, and 4 nodes in each hidden layer (blue squares). Arrow values between loci and the first hidden layer can then be represented by a matrix with 12 rows and 4 columns (i.e., each row holds a value for each of the 4 arrows pointing from one locus to the 4 nodes of the first hidden layer). Note that the values below are initialised randomly, just as they are in the evolutionary algorithm.
```{r}
arrows_1_dat <- rnorm(n = 4 * 12, mean = 0, sd = 0.1);
arrows_1_mat <- matrix(data = arrows_1_dat, nrow = 12, ncol = 4);
print(arrows_1_mat);
```
We can initialise 12 allele values, one for each locus.
```{r}
loci <- rnorm(n = 12, mean = 0, sd = 1);
```
To then get the value of the first column of four hidden layer nodes (i.e., the first column of blue squares in Figure 1), we can use matrix multiplication.
```{r}
print(loci %*% arrows_1_mat);
```
We can likewise use a 4 by 4 square matrix to represent the values of the arrows from the first column of four hidden layer nodes to the second column of hidden layer nodes.
```{r}
arrows_2_dat <- rnorm(n = 4 * 4, mean = 0, sd = 0.1);
arrows_2_mat <- matrix(data = arrows_2_dat, nrow = 4, ncol = 4);
print(arrows_2_mat);
```
We can then use matrix multiplication to map the 12 allele values to the values of the second column of hidden layer nodes.
```{r}
print(loci %*% arrows_1_mat %*% arrows_2_mat);
```
This pattern can continue, with 4 by 4 square matrices representing the values of arrows between columns of hidden layer nodes, and between the last hidden layer column and traits (note that the number of hidden layer columns can be any natural number, but the number of nodes within a column always equals the number of traits). In the [actual evolutionary algorithm code](https://github.com/bradduthie/resevol/blob/master/src/mine_gmatrix.c), all of these square matrices are themselves held in a large 3D array. But the idea is the same: a vector of allele values is multiplied by multiple matrices until a set of trait values is produced. If multiple vectors of random standard normal allele values are generated, then the traits that they produce from all of this matrix multiplication can be made to covary in some pre-specified way using the evolutionary algorithm.
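To make this concrete, the sketch below (an illustration only, not the package code) continues the example above by stacking the remaining 4 by 4 square matrices into a single three dimensional array, mapping many random standard normal loci vectors to traits, and calculating the resulting trait covariance matrix. The objects `arrows_3_mat`, `arrows_4_mat`, `deep_arrows`, and `many_loci` are names invented for this example; because the network values here are random rather than evolved, the resulting covariances will not yet match any pre-specified structure.

```{r}
# Illustration only (not the package code): stack the 4 x 4 matrices in a 3D
# array and map many random loci vectors through the whole network to traits.
arrows_3_mat <- matrix(data = rnorm(n = 4 * 4, mean = 0, sd = 0.1), nrow = 4);
arrows_4_mat <- matrix(data = rnorm(n = 4 * 4, mean = 0, sd = 0.1), nrow = 4);
deep_arrows  <- array(data = c(arrows_2_mat, arrows_3_mat, arrows_4_mat),
                      dim  = c(4, 4, 3));
many_loci  <- matrix(data = rnorm(n = 1000 * 12), nrow = 1000, ncol = 12);
trait_vals <- many_loci %*% arrows_1_mat;      # Loci to first hidden layer
for(k in 1:dim(deep_arrows)[3]){               # Through the remaining layers
    trait_vals <- trait_vals %*% deep_arrows[, , k];
}
print(cov(trait_vals)); # Trait covariances produced by this random network
```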
General overview of the evolutionary algorithm
================================================================================
The evolutionary algorithm first initialises a population of networks, with each network having a unique set of values (i.e., black arrows in Figure 1, represented in the code by the matrices explained in the [previous section](#structures)). In each iteration of the evolutionary algorithm, networks cross over their values with another randomly selected network with some probability. Individual values in each network then mutate with some probability. The fitness of each network is then calculated by comparing its trait covariances with those of a pre-specified covariance matrix. A tournament is then used to select the highest fitness networks, and the selected networks replace the old ones to form the new population. Iterations continue until either a maximum number of iterations is reached or a network is found that produces trait covariances sufficiently close to the pre-specified covariance matrix. Figure 2 below provides a general overview of the evolutionary algorithm.
```{r, echo=FALSE, fig.height=1.75, fig.width=6, fig.cap = "**Figure 2:** *Conceptual overview of the evolutionary algorithm used in the resevol package.*"}
mbox <- function(x0, x1, y0, y1){
xx <- seq(from=x0, to=x1, length.out = 100);
yy <- seq(from=y0, to=y1, length.out = 100);
xd <- c(rep(x0, 100), xx, rep(x1,100), rev(xx));
yd <- c(yy, rep(y1,100), rev(yy), rep(y0, 100));
return(list(x=xd, y=yd));
}
par(mar=c(0,0,0,0));
# ===============================================================
plot(x=0, y=0, type="n", xlim=c(0,85), ylim=c(50,95), xaxt="n", yaxt="n",
xlab="",ylab="");
ibox <- mbox(x0 = 0, x1 = 10, y0 = 90, y1 = 70);
polygon(x = ibox$x, y = ibox$y, lwd = 3, border = "black", col = "white");
cbox <- mbox(x0 = 15, x1 = 25, y0 = 90, y1 = 70);
polygon(x = cbox$x, y = cbox$y, lwd = 3, border = "black", col = "white");
ubox <- mbox(x0 = 30, x1 = 40, y0 = 90, y1 = 70);
polygon(x = ubox$x, y = ubox$y, lwd = 3, border = "black", col = "white");
nbox <- mbox(x0 = 45, x1 = 55, y0 = 90, y1 = 70);
polygon(x = nbox$x, y = nbox$y, lwd = 3, border = "black", col = "white");
fbox <- mbox(x0 = 60, x1 = 70, y0 = 90, y1 = 70);
polygon(x = fbox$x, y = fbox$y, lwd = 3, border = "black", col = "white");
tbox <- mbox(x0 = 75, x1 = 85, y0 = 90, y1 = 70);
polygon(x = tbox$x, y = tbox$y, lwd = 3, border = "black", col = "white");
text(x=5, y=80, labels="Initialisation", col="black", cex = 0.5);
text(x=20, y=80, labels="Crossover", col="black", cex = 0.5);
text(x=35, y=80, labels="Mutation", col="black", cex = 0.5);
text(x=50, y=82, labels="Fitness", col="black", cex = 0.5);
text(x=50, y=78, labels="evaluation", col="black", cex = 0.5);
text(x=65, y=82, labels="Tournament", col="black", cex = 0.5);
text(x=65, y=78, labels="selection", col="black", cex = 0.5);
text(x=80, y=80, labels="Replacement", col="black", cex = 0.5);
arrows(x0=10, x1=14, y0=80, y1=80, lwd=2, length=0.10);
arrows(x0=25, x1=29, y0=80, y1=80, lwd=2, length=0.10);
arrows(x0=40, x1=44, y0=80, y1=80, lwd=2, length=0.10);
arrows(x0=55, x1=59, y0=80, y1=80, lwd=2, length=0.10);
arrows(x0=70, x1=74, y0=80, y1=80, lwd=2, length=0.10);
arrows(x0 = 80, x1 = 80, y0 = 70, y1 = 68, lwd = 2, length = 0);
arrows(x0 = 80, x1 = 20, y0 = 65, y1 = 65, lwd = 2, length = 0);
arrows(x0 = 20, x1 = 20, y0 = 65, y1 = 70, lwd = 2, length = 0.05);
text(x=73, y=63, labels="No", col="black", cex = 0.5);
rbox <- mbox(x0 = 75, x1 = 85, y0 = 62, y1 = 68);
polygon(x = rbox$x, y = rbox$y, lwd = 3, border = "black", col = "black");
text(x=80, y=65, labels="Termination?", col="white", cex = 0.5);
fbox <- mbox(x0 = 60, x1 = 70, y0 = 60, y1 = 50);
polygon(x = fbox$x, y = fbox$y, lwd = 3, border = "black", col = "white");
text(x=65, y=57, labels="Individual", col="black", cex = 0.5);
text(x=65, y=53, labels="network", col="black", cex = 0.5);
arrows(x0 = 80, x1 = 80, y0 = 62, y1 = 55, lwd = 2, length = 0);
arrows(x0 = 80, x1 = 70, y0 = 55, y1 = 55, lwd = 2, length = 0.05);
text(x=73, y=53, labels="Yes", col="black", cex = 0.5);
```
The steps shown in Figure 2 above are explained in more detail below, with reference to the arguments of the `mine_gmatrix` function that runs the evolutionary algorithm.
Initialisation
--------------------------------------------------------------------------------
At the start of the evolutionary algorithm, a population of `npsize` networks is initialised. Each individual network in this population is represented by one matrix and one three dimensional array (see an explanation of [the data structures](#structures) above). All elements of each network are initialised with random samples from a normal distribution with a mean of 0 and a standard deviation of `sd_ini`.
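As a rough sketch of this step (an illustration only, with example values standing in for the `npsize`, `loci`, `layers`, and `sd_ini` arguments, and invented names for the two data structures), each network could be initialised as below.

```{r}
# Illustration only: each network is one loci x traits matrix plus one
# traits x traits x layers 3D array, with values sampled using sd_ini.
npsize   <- 100;   # Number of networks in the evolving population
n_loci   <- 12;    # Value of the loci argument
n_traits <- 4;     # Number of traits, and of nodes per hidden layer
layers   <- 3;     # Number of hidden layers
sd_ini   <- 0.1;   # Standard deviation used to initialise network values
net_pop  <- vector(mode = "list", length = npsize);
for(i in 1:npsize){
    net_pop[[i]] <- list(
      loci_to_hidden   = matrix(data = rnorm(n = n_loci * n_traits, sd = sd_ini),
                                nrow = n_loci, ncol = n_traits),
      hidden_to_traits = array(data = rnorm(n = n_traits * n_traits * layers,
                                            sd = sd_ini),
                               dim  = c(n_traits, n_traits, layers))
    );
}
```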
Crossover
--------------------------------------------------------------------------------
An iteration of the evolutionary algorithm begins with crossover between the values of networks in the population. For each network in the population, a crossover event occurs in the values linking loci to the first hidden layer (see Figure 1) with a probability of `pr_cross`, and a second, independent crossover event occurs in the values linking the first hidden layer to traits, also with a probability of `pr_cross`. These two crossover events are independent because of the different dimensions of the underlying arrays (see [key data structures used](#structures) above). A matrix with `loci` rows and a number of columns matching the number of traits holds the values linking loci to the first hidden layer. A three dimensional array with row and column numbers matching the number of traits, and a depth matching the number of hidden `layers` (e.g., 3 in Figure 1), holds the remaining values linking the first hidden layer to trait values.
If a crossover event occurs for a focal network, then a contiguous block of values is defined and swapped with another randomly selected network in the population. Dimensions of the contiguous block are selected from a random uniform distribution. For example, the network in Figure 1 would be represented by a three dimensional array with 4 rows, 4 columns, and 3 layers, so three random integers from 1-4, 1-4, and 1-3 would each be sampled twice with replacement. If the values selected were 1, 3, and 2 in the first sample, then 3, 3, and 1 in the second sample, then all values from rows 1-3, column 3, and layers 1-2 would be swapped between networks. Conceptually, this is the equivalent of drawing a box around a set of arrows in Figure 1 and swapping those arrow values with the values of another network.
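The sketch below illustrates a single crossover event of this kind between the three dimensional arrays of two networks, here simply called `A` and `B` (a simplified illustration rather than the package code).

```{r}
# Illustration only: swap a contiguous block of values between the 3D arrays
# of two networks, A and B, as described above.
A  <- array(data = rnorm(n = 4 * 4 * 3, sd = 0.1), dim = c(4, 4, 3));
B  <- array(data = rnorm(n = 4 * 4 * 3, sd = 0.1), dim = c(4, 4, 3));
rr <- sort(sample(x = 1:4, size = 2, replace = TRUE));  # Row range
cc <- sort(sample(x = 1:4, size = 2, replace = TRUE));  # Column range
ll <- sort(sample(x = 1:3, size = 2, replace = TRUE));  # Layer range
swap_block <- A[rr[1]:rr[2], cc[1]:cc[2], ll[1]:ll[2]];
A[rr[1]:rr[2], cc[1]:cc[2], ll[1]:ll[2]] <-
    B[rr[1]:rr[2], cc[1]:cc[2], ll[1]:ll[2]];
B[rr[1]:rr[2], cc[1]:cc[2], ll[1]:ll[2]] <- swap_block;
```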
Mutation
--------------------------------------------------------------------------------
After crossover occurs, all network values mutate independently at a fixed probability of `mu_pr`. If a mutation event occurs, then a new value is randomly sampled from a normal distribution with a mean of 0 and a standard deviation of `mu_sd`. This value is then added to the existing value in the network.
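A minimal sketch of this mutation step is shown below, applied to a single three dimensional array of network values; the `mu_pr` and `mu_sd` values are chosen arbitrarily for the example.

```{r}
# Illustration only: each value mutates with probability mu_pr, and a mutation
# adds a normal deviate with mean 0 and standard deviation mu_sd.
mu_pr    <- 0.05;
mu_sd    <- 0.01;
net_vals <- array(data = rnorm(n = 4 * 4 * 3, sd = 0.1), dim = c(4, 4, 3));
mutated  <- runif(n = length(net_vals)) < mu_pr;   # Which values mutate
net_vals[mutated] <- net_vals[mutated] + rnorm(n = sum(mutated), sd = mu_sd);
```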
Fitness evaluation
--------------------------------------------------------------------------------
After mutation, the fitness of each network in the population is evaluated. For each network, a set of `indivs` loci vectors is created, which represents the allele values of `indivs` potential individuals in a simulated population. Elements of each loci vector are randomly sampled from a standard normal distribution. For example, in the network of Figure 1 where `loci = 12`, `loci * indivs` standard normal values would be generated in total. After these `indivs` loci vectors are initialised, values in each vector are mapped to traits using the focal network, thereby producing `indivs` sets of traits from the focal network. These `indivs` sets of traits are then used to calculate the covariances among traits and build a trait covariance matrix for the network. This trait covariance matrix is then compared to the pre-specified `gmatrix` by calculating stress as the mean squared deviation between matrix elements. Lower stress values correspond to higher network fitness.
The stress of each network in the population of `npsize` networks is calculated using the above algorithm. Selection of the next generation of `npsize` networks is then done using a tournament.
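The sketch below illustrates this stress calculation for a single randomly initialised network with two traits (an illustration only; the `gmatrix`, `indivs`, and network values are example inputs rather than package defaults).

```{r}
# Illustration only: map indivs random loci vectors through one small network,
# then compare the resulting trait covariances with a pre-specified gmatrix.
indivs   <- 1000;
gmatrix  <- matrix(data = c(1.0, 0.5, 0.5, 1.0), nrow = 2); # Example target
L_to_H   <- matrix(data = rnorm(n = 12 * 2, sd = 0.1), nrow = 12, ncol = 2);
H_to_T   <- array(data = rnorm(n = 2 * 2 * 3, sd = 0.1), dim = c(2, 2, 3));
loci_vec <- matrix(data = rnorm(n = indivs * 12), nrow = indivs, ncol = 12);
trait_vals <- loci_vec %*% L_to_H;             # Loci to first hidden layer
for(k in 1:dim(H_to_T)[3]){                    # Through the remaining layers
    trait_vals <- trait_vals %*% H_to_T[, , k];
}
net_cov <- cov(trait_vals);                    # Network trait covariances
stress  <- mean((net_cov - gmatrix)^2);        # Lower stress, higher fitness
print(stress);
```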
Tournament selection
--------------------------------------------------------------------------------
After fitness evaluation, networks in the population compete in a series of tournaments to determine the composition of the next generation of `npsize` networks. Tournament selection is a flexible way to choose the fittest subset of the population [@Hamblin2013]. It starts by randomly selecting `sampleK` networks with replacement to form the competitors in a tournament (note that `sampleK` is constrained to be less than or equal to `npsize`). Of those `sampleK` networks, the `chooseK` networks with the highest fitness (i.e., lowest stress) are set aside to be placed in the new population (note that `chooseK` must be less than or equal to `sampleK`). Tournaments continue until a total of `npsize` new networks have been set aside to form the new generation of networks.
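A simplified sketch of this tournament procedure is shown below, operating on a vector of stress values (one per network); the `npsize`, `sampleK`, and `chooseK` values are arbitrary example inputs.

```{r}
# Illustration only: sample sampleK competitors with replacement, keep the
# chooseK with the lowest stress, and repeat until npsize winners are chosen.
npsize   <- 100;
sampleK  <- 20;
chooseK  <- 4;
stresses <- runif(n = npsize);   # Stand-in stress values for npsize networks
winners  <- integer(0);          # Indices of networks for the new population
while(length(winners) < npsize){
    competitors <- sample(x = 1:npsize, size = sampleK, replace = TRUE);
    ranked      <- competitors[order(stresses[competitors])];
    winners     <- c(winners, ranked[1:chooseK]);
}
winners <- winners[1:npsize];    # Trim any excess from the final tournament
```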
Termination
--------------------------------------------------------------------------------
Throughout the evolutionary algorithm, the network with the lowest overall stress (from any generation) is retained. The evolutionary algorithm terminates when either the logged mean stress across networks in the population is less than or equal to `term_cri`, or `max_gen` generations of the evolutionary algorithm have occurred. The mean network stress is used instead of the lowest overall stress because sampling error in the randomly generated loci can produce unusually low stress values by chance, which might not be replicated with a new random sample of loci. When the evolutionary algorithm terminates, only the network with the lowest overall stress is returned.
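The sketch below shows one reading of this termination check (an illustration only, with arbitrary example values for `term_cri`, `max_gen`, the generation counter, and the network stresses).

```{r}
# Illustration only: terminate when the logged mean stress is at or below
# term_cri, or when max_gen generations have passed.
term_cri  <- -5.3;
max_gen   <- 1000;
gen       <- 250;                                     # Example generation counter
stresses  <- runif(n = 100, min = 0.001, max = 0.01); # Stand-in network stresses
terminate <- log(mean(stresses)) <= term_cri | gen >= max_gen;
print(terminate);
```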
Haploid and diploid individuals
================================================================================
The evolutionary algorithm does not distinguish between haploid and diploid genomes. Instead, haploid and diploid individuals in resevol simulations are built differently from the mined network described above. For haploid individuals, network values are placed in individual genomes exactly as they are returned by `mine_gmatrix`. Hence, standard normal allele values at loci in haploid genomes map to predictably covarying traits. For diploid individuals, all network values returned by `mine_gmatrix` are divided by 2, and two copies of each value are then placed in each individual to model diploid genomes. Allele values are then randomly sampled from a normal distribution with a mean of 0 and a standard deviation of `1 / sqrt(2)`, so that summed allele values at homologous loci will have a standard normal distribution. As such, the effect of each locus is determined by the sum of homologous alleles. Similarly, homologous network values mapping allele values to traits are also summed, thereby producing the expected trait covariance structure.
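The short sketch below illustrates why this works for diploid genomes (an illustration only, with invented object names): halved network values and alleles with a standard deviation of `1 / sqrt(2)` sum across homologous copies to recover the haploid mapping and a standard normal allele distribution.

```{r}
# Illustration only: two homologous copies of halved network values, and two
# homologous allele sets with sd = 1 / sqrt(2), recover the haploid case.
hap_net    <- matrix(data = rnorm(n = 12 * 4, sd = 0.1), nrow = 12, ncol = 4);
dip_copy_1 <- hap_net / 2;
dip_copy_2 <- hap_net / 2;
allele_1   <- rnorm(n = 12, mean = 0, sd = 1 / sqrt(2));
allele_2   <- rnorm(n = 12, mean = 0, sd = 1 / sqrt(2));
summed_net     <- dip_copy_1 + dip_copy_2;   # Identical to hap_net
summed_alleles <- allele_1 + allele_2;       # Distributed as standard normal
print(summed_alleles %*% summed_net);        # First hidden layer values
```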
Literature Cited
================================================================================
```{r, echo = FALSE}
suppressWarnings(par(oldpar));
```