Recently I was working on a project which needed an optimisation algorithm. In other words, we had a function \(f(x)\), where \(x\) is a vector, and I needed to write a computer program to try and find a value of \(x\) which made \(f(x)\) as small as possible. Since we wanted the flexibility to change \(f\) later, we were limited to a black box optimisation algorithm, i.e. one that does not use any special features of the function \(f\), including the derivatives of \(f\). Luckily there is a large body of literature on different black box optimisation algorithms. This article is about an algorithm called Particle Swarm Optimisation, which is actually two different algorithms.
Particle Swarm Optimisation was inspired by ‘swarm intelligence’ found in nature. Imagine that the space of allowable values of \(x\) is a large field and the function we are trying to minimise, \(f(x)\), is the height above sea level of each point in the field. We want to find the lowest point in the field, so we try to replicate how a swarm of animals would do it — i.e. by having many entities moving independently and sharing information. For this analogy to work you have to imagine the animals can’t see what the height is around them; let’s pretend it’s very foggy!
To start, we choose the number of particles to use and distribute them randomly in the field. We then let the particles move around the space, with two randomly fluctuating forces pulling on them and affecting the directions they move in. One of the forces acting on the particles is an attraction towards the best overall point found so far, which is referred to as the global best. This means that over time the particles will end up clustered around the global best. But if this happens too quickly then the result tends not to be very good because not much of the landscape has been explored. To avoid this, each particle is also attracted to the best point it has found so far, called the particle best. Finally, each particle has some inertia, meaning that it will tend to carry on in the direction it was travelling before. For example, it might be pulled towards the global best, overshoot it and end up finding a better point just beyond it.
The animation below shows a run of the particle swarm optimisation algorithm. The dots are the particles and the lines shows the paths they have taken, which fade as time goes on. The red circle is the position of the global best, which is updated after each move.
As well as choosing how many particles to use and how long to run the optimisation for, you also have to choose the parameters which control how strong the inertia is (\(w\)) and how strongly each particle is pulled towards the global best and the particle best (\(\phi_g\) and \(\phi_p\) respectively). There is also an element of randomness: each time you calculate the new velocity for a particle you multiply the attraction to the global best by a random number \(r_g\) and you multiply the attraction to the particle best by another random number \(r_p\), both uniformly distributed between 0 and 1. If \(x\) and \(v\) are the particle’s location and velocity from the previous time step, \(u_g\) is the vector from the particle’s current location to the global best and \(u_p\) is the vector from the particle’s current location to the particle best, then the particle’s new position \(x’\) and new velocity \(v’\) are given by:
\(v’ = w v + r_g \phi_g u_g + r_p \phi_p u_p\)
\(x’ = x + v’\)
Or are they? These were the formulas in Wikipedia, the original paper on Particle Swarm Optimisation and in many other places. There is a slightly different equation for \(v’\), however, which is widely used as well, for example by the Numerical Algorithms Group. Instead of multiplying every entry of the vector \(\phi_g\) \(u_g\) by the same random number \(r_g\), you instead multiply each component of it by a different random number. So \(r_g\) is a vector of independent random numbers uniformly distributed between 0 and 1, and you combine it with \(\phi_g u_g\) using component-wise multiplication, which we will write as \(\odot\). The same thing is done with \(r_p\), so the equation is now:
\(v’ = w v + \phi_g r_g \odot u_g + \phi_p r_p \odot u_p\)
What I find most confusing is that no one ever mentions there are two different equations for \(v’\) in use. It’s almost as if nobody is aware that there are two different algorithms, and since the difference boils down to a small amount of notation in one equation this actually seems plausible. The masters thesis of Daniel A Wilke from the University of Pretoria confirms that I am not the only one aware of this. He found that the first algorithm, which he calls PSOAF1, is prone to stagnation, which means the particles end up being limited to a smaller part of the landscape and hence gets worse results. The second algorithm (PSOAF2) is much more robust to this. This matched my findings. Perhaps most interestingly, much of the literature on Particle Swarm Optimisation involves adjusting the algorithm in various ways to stop this stagnation, which Daniel found was unnecessary with PSOAF2.